Hi JB,

Thanks for bringing this to the mailing list. I also think that this is useful in general (and that use cases for Beam are more than just classic big data), and that there are interesting questions here at different levels about how to do it right.
I suggest starting with the highest-level question [and discussing the particular API only after agreeing on this, possibly in a separate thread]: how to deal with the fact that Beam gives no guarantees about the environment on workers, e.g. which commands are available, which shell or even OS is being used, etc. Particularly:

- Obviously, different runners will have different environments, e.g. Dataflow workers are not going to have Hadoop commands available because they are not running on a Hadoop cluster. So pipelines and transforms developed using this connector will necessarily be non-portable between runners. Maybe this is OK? But we need to give users a clear expectation about this. How do we phrase this expectation, and where do we put it in the docs?

- I'm concerned that this puts additional compatibility requirements on runners: it becomes necessary for a runner to document the environment of its workers (OS, shell, privileges, guaranteed-installed packages, access to other things on the host machine, e.g. whether or not the worker runs in its own container, etc.) and to keep it stable; otherwise, transforms and pipelines using this connector will not be portable across runner versions either.

Another way to deal with this is to give up and say "the environment on the workers is outside the scope of Beam; consult your runner's documentation or use your best judgment as to what the environment will be, and use this at your own risk."

What do others think?

On Mon, Dec 5, 2016 at 5:09 AM Jean-Baptiste Onofré <[email protected]> wrote:

Hi beamers,

Today, Beam is mainly focused on data processing. Since the beginning of the project, we have been discussing extending the use-case coverage via DSLs and extensions (like for machine learning), or via IO. Especially on the IO side, we can see Beam being used for data integration and data ingestion.

In this area, I'm proposing a first IO: ExecIO:

https://issues.apache.org/jira/browse/BEAM-1059
https://github.com/apache/incubator-beam/pull/1451

Actually, this IO is mainly an ExecFn that executes system commands (again, keep in mind we are discussing data integration/ingestion, not data processing). For convenience, this ExecFn is wrapped in Read and Write (as a regular IO).

Clearly, this IO/Fn depends on the worker where it runs, but that is the user's responsibility.

During the review, Eugene and I discussed:
- is it an IO or just a Fn?
- is it OK to have a worker-specific IO?

IMHO, an IO makes a lot of sense and is very convenient for end users. They can do something like:

PCollection<String> output = pipeline.apply(ExecIO.read().withCommand("/path/to/myscript.sh"));

The pipeline will execute myscript.sh, and the output PCollection will contain the command's stdout/stderr.

On the other hand, they can do:

pcollection.apply(ExecIO.write());

where the PCollection contains the commands to execute.

Generally speaking, end users can call ExecFn wherever they want in the pipeline steps:

PCollection<String> output = pcollection.apply(ParDo.of(new ExecIO.ExecFn()));

The input collection contains the commands to execute, and the output collection contains the commands' execution stdout/stderr.

Generally speaking, I'm preparing several more IOs in the data integration/ingestion area rather than in "pure" classic big data processing. I think it would give a new "dimension" to Beam.

Thoughts?

Regards
JB
--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
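For illustration, here is a minimal sketch of what a command-executing DoFn along these lines could look like, using java.lang.ProcessBuilder. The class name ShellCommandFn and its details are assumptions for the sake of illustration, not the actual ExecIO implementation from the pull request:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical sketch (not the ExecIO code from the PR): runs each input element
// as a shell command on the worker and emits its combined stdout/stderr line by line.
public class ShellCommandFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // Which shell (if any) exists on the worker depends entirely on the runner's
    // environment - exactly the portability question discussed above.
    ProcessBuilder pb = new ProcessBuilder("/bin/sh", "-c", c.element());
    pb.redirectErrorStream(true); // merge stderr into stdout
    Process process = pb.start();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        c.output(line); // one output element per line of command output
      }
    }
    int exitCode = process.waitFor();
    if (exitCode != 0) {
      throw new RuntimeException("Command failed with exit code " + exitCode);
    }
  }
}

A PCollection of command lines could then be run with something like
PCollection<String> output = commands.apply(ParDo.of(new ShellCommandFn()));
with the same caveat that the result depends entirely on what is installed on the worker.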
