Hi beamers,
Today, Beam is mainly focused on data processing.
Since the beginning of the project, we have been discussing extending
the use-case coverage via DSLs and extensions (for instance for machine
learning), or via IOs.
Especially on the IO side, we can see Beam being used for data
integration and data ingestion.
In this area, I'm proposing a first IO: ExecIO:
https://issues.apache.org/jira/browse/BEAM-1059
https://github.com/apache/incubator-beam/pull/1451
Actually, this IO is mainly an ExecFn that executes system commands
(again, keep in mind we are talking about data integration/ingestion,
not data processing).
For convenience, this ExecFn is wrapped in Read and Write (as a regular IO).
Clearly, this IO/Fn depends on the worker where it runs, but that is
the user's responsibility.
During the review, Eugene and I discussed:
- is it an IO or just a fn?
- is it OK to have worker-specific IOs?
IMHO, an IO makes a lot of sense and is very convenient for end users.
They can do something like:
PCollection<String> output =
    pipeline.apply(ExecIO.read().withCommand("/path/to/myscript.sh"));
The pipeline will execute myscript.sh, and the output PCollection will
contain the command execution's stdout/stderr.
On the other hand, they can do:
pcollection.apply(ExecIO.write());
where the input PCollection contains the commands to execute.
Generally speaking, end users can call ExecFn wherever they want in the
pipeline steps:
PCollection<String> output = pipeline.apply(ParDo.of(new ExecIO.ExecFn()));
The input collection contains the commands to execute, and the output
collection contains each command's execution result (stdout/stderr).
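For illustration, the core of such an ExecFn boils down to running a
command on the worker and capturing its output. Here is a minimal
sketch in plain Java, with no Beam dependency; the class and method
names are hypothetical (not from the actual PR), and it assumes a
Unix-like worker providing /bin/sh:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Hypothetical sketch of the command-execution logic an ExecFn could
// wrap. In a real DoFn, runCommand() would be called per input element
// and its result emitted to the output PCollection.
public class ExecSketch {

    public static String runCommand(String command)
            throws IOException, InterruptedException {
        // Run through the shell so scripts and pipes work as expected.
        // Assumption: the worker is Unix-like and has /bin/sh.
        ProcessBuilder pb = new ProcessBuilder("/bin/sh", "-c", command);
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process process = pb.start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        process.waitFor();
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(runCommand("echo hello"));
    }
}
```

This also makes the worker dependency concrete: whether the command
exists, and what it prints, depends entirely on the machine the fn
runs on.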
More broadly, I'm preparing several IOs oriented more toward the data
integration/ingestion area than toward "pure" classic big data
processing. I think it would give a new "dimension" to Beam.
Thoughts?
Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com