I would like the runner-independent, language-independent graph to have a
way to specify requirements on the environment that a DoFn runs in. This
would provide a natural way to talk about installed libraries, containers,
external services that are accessed, etc, and I think the requirement of a
particular OS with tools installed fits right in. At the crudest level,
this could be limited to a container URL.

Then the Java SDK needs a way to express these requirements. They will
most likely be properties of a DoFn instance rather than of a DoFn
class, since they may vary with instantiation parameters.
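
To make that concrete, here is a purely hypothetical sketch of what such an
instance-level declaration could look like in the Java SDK. None of these
names exist today (EnvironmentRequirements, getEnvironmentRequirements, the
image name, the command list are all made up); it only illustrates the shape
of the idea:

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;

public class ShellCommandFn extends DoFn<String, String> {

  // Hypothetical value object describing what this instance needs from the worker.
  public static class EnvironmentRequirements implements Serializable {
    public final String containerImage;         // crudest level: just a container URL
    public final List<String> requiredCommands; // tools expected on the PATH
    public EnvironmentRequirements(String containerImage, List<String> requiredCommands) {
      this.containerImage = containerImage;
      this.requiredCommands = requiredCommands;
    }
  }

  private final String script;

  public ShellCommandFn(String script) {
    this.script = script;
  }

  // Instance-level rather than class-level, since the requirements vary with
  // the instantiation parameters. A runner could inspect this while translating
  // the graph and refuse to run (or provision a matching worker) if it cannot
  // satisfy them.
  public EnvironmentRequirements getEnvironmentRequirements() {
    return new EnvironmentRequirements(
        "gcr.io/example/worker-with-shell:latest",   // made-up image name
        Arrays.asList("sh", script));
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // ... the actual work that relies on the declared environment ...
  }
}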

On Mon, Dec 5, 2016 at 11:51 AM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Hi JB,
>
> Thanks for bringing this to the mailing list. I also think that this is
> useful in general (and that use cases for Beam are more than just classic
> big data), and that there are interesting questions here at different levels
> about how to do it right.
>
> I suggest starting with the highest-level question [and discussing the
> particular API only after agreeing on this, possibly in a separate thread]:
> how to deal with the fact that Beam gives no guarantees about the
> environment on workers, e.g. which commands are available, which shell or
> even OS is being used, etc. Particularly:
>
> - Obviously different runners will have a different environment, e.g.
> Dataflow workers are not going to have Hadoop commands available because
> they are not running on a Hadoop cluster. So, pipelines and transforms
> developed using this connector will necessarily be non-portable between
> different runners. Maybe this is ok? But we need to give users a clear
> expectation about this. How do we phrase this expectation and where do we
> put it in the docs?
>
> - I'm concerned that this puts additional compatibility requirements on
> runners - it becomes necessary for a runner to document the environment of
> its workers (OS, shell, privileges, guaranteed-installed packages, access
> to other things on the host machine e.g. whether or not the worker runs in
> its own container, etc.) and to keep it stable - otherwise transforms and
> pipelines with this connector will not be portable between runner versions
> either.
>
> Another way to deal with this is to give up and say "the environment on the
> workers is outside the scope of Beam; consult your runner's documentation
> or use your best judgment as to what the environment will be, and use this
> at your own risk".
>
> What do others think?
>
> On Mon, Dec 5, 2016 at 5:09 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> Hi beamers,
>
> Today, Beam is mainly focused on data processing.
> Since the beginning of the project, we have been discussing extending
> the use-case coverage via DSLs and extensions (like for machine
> learning), or via IOs.
>
> Especially on the IO side, we can see Beam being used for data integration
> and data ingestion.
>
> In this area, I'm proposing a first IO: ExecIO:
>
> https://issues.apache.org/jira/browse/BEAM-1059
> https://github.com/apache/incubator-beam/pull/1451
>
> Actually, this IO is mainly an ExecFn that executes system commands
> (again, keep in mind we are discussing data integration/ingestion
> and not data processing).
>
> For convenience, this ExecFn is wrapped in Read and Write (as a regular
> IO).
>
> Clearly, this IO/Fn depends on the worker where it runs, but that is the
> user's responsibility.
>
> During the review, Eugene and I discussed:
> - is it an IO or just a Fn?
> - is it OK to have a worker-specific IO?
>
> IMHO, an IO makes a lot of sense and is very convenient for end
> users. They can do something like:
>
> PCollection<String> output =
> pipeline.apply(ExecIO.read().withCommand("/path/to/myscript.sh"));
>
> The pipeline will execute myscript.sh and the output PCollection will
> contain the command's stdout/stderr.
>
> On the other hand, they can do:
>
> pcollection.apply(ExecIO.write());
>
> where PCollection contains the commands to execute.
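>
> For example (Create is the standard Beam transform; the commands themselves
> are just made up for the illustration):
>
> pipeline
>     .apply(Create.of("mkdir -p /tmp/ingest", "cp /tmp/in.csv /tmp/ingest/"))
>     .apply(ExecIO.write());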
>
> Generally speaking, end users can call the ExecFn wherever they want in
> their pipeline:
>
> PCollection<String> output = commands.apply(ParDo.of(new ExecIO.ExecFn()));
>
> The input collection (commands here) contains the commands to execute, and
> the output collection contains each command's stdout/stderr.
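>
> Roughly, the ExecFn boils down to something like the following (a simplified
> sketch, not the exact code from the PR; it assumes a POSIX "sh" is available
> on the worker and uses Guava's ByteStreams to read the process output):
>
> import java.nio.charset.StandardCharsets;
> import com.google.common.io.ByteStreams;
> import org.apache.beam.sdk.transforms.DoFn;
>
> public class ExecFn extends DoFn<String, String> {
>   @ProcessElement
>   public void processElement(ProcessContext c) throws Exception {
>     // Run the element as a shell command, merging stderr into stdout.
>     Process process = new ProcessBuilder("sh", "-c", c.element())
>         .redirectErrorStream(true)
>         .start();
>     String output = new String(
>         ByteStreams.toByteArray(process.getInputStream()),
>         StandardCharsets.UTF_8);
>     process.waitFor();
>     c.output(output);
>   }
> }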
>
> Generally speaking, I'm preparing several IOs oriented more toward the data
> integration/ingestion area than toward "pure" classic big data processing. I
> think it would give a new "dimension" to Beam.
>
> Thoughts?
>
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
