Hi beamers,

Today, Beam is mainly focused on data processing.
Since the beginning of the project, we have been discussing how to extend the use-case coverage via DSLs and extensions (for machine learning, for instance), or via IOs.

Especially with IOs, we can see Beam being used for data integration and data ingestion.

In this area, I'm proposing a first IO: ExecIO:

https://issues.apache.org/jira/browse/BEAM-1059
https://github.com/apache/incubator-beam/pull/1451

Actually, this IO is mainly an ExecFn that executes system commands (again, keep in mind we are discussing data integration/ingestion, not data processing).
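To make the behavior concrete, here is a minimal plain-Java sketch of what the core of such an ExecFn could look like. This is purely illustrative (the run() helper and the /bin/sh invocation are my assumptions, not the code from the PR):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ExecSketch {

    // Hypothetical core of an ExecFn: run a shell command and
    // capture its combined stdout/stderr as a single String.
    public static String run(String command) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("/bin/sh", "-c", command);
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process process = pb.start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        process.waitFor();
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        // The element of the input PCollection would play the role
        // of the command string here.
        System.out.print(run("echo hello"));
    }
}
```

In a DoFn, run() would be called in @ProcessElement with the input element as the command, and the returned output emitted to the output PCollection.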

For convenience, this ExecFn is wrapped in Read and Write transforms (like a regular IO).

Clearly, this IO/Fn depends on the worker where it runs, but that's the user's responsibility.

During the review, Eugene and I discussed:
- is it an IO or just a fn?
- is it OK to have a worker-specific IO?

IMHO, an IO makes a lot of sense and is very convenient for end users. They can do something like:

PCollection<String> output = pipeline.apply(ExecIO.read().withCommand("/path/to/myscript.sh"));

The pipeline will execute myscript.sh, and the output PCollection will contain the command execution std out/err.

On the other hand, they can do:

pcollection.apply(ExecIO.write());

where the PCollection contains the commands to execute.

Generally speaking, end users can call ExecFn wherever they want in the pipeline:

PCollection<String> output = pipeline.apply(ParDo.of(new ExecIO.ExecFn()));

The input collection contains the commands to execute, and the output collection contains the commands' execution results (std out/err).

More generally, I'm preparing several IOs oriented more toward data integration/ingestion than toward "pure" classic big data processing. I think it would give a new "dimension" to Beam.

Thoughts?

Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
