FYI the first part of support for (direct) process based job bundle factory was merged (thanks Max!)
https://github.com/apache/beam/pull/6287 On top of that I have built a customization that runs the Python SDK worker directly: https://github.com/lyft/beam/blob/release-2.8.0-lyft/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/control/LyftProcessJobBundleFactory.java While doing that, I thought it would be nice to support a custom environment (along with the other options we already discussed) that allows users to hook in their own extensions without having to do stuff like this: https://github.com/lyft/beam/pull/6/files#diff-b74ff692340bcae0032d119a7192624cR61 Thanks, Thomas On Mon, Aug 27, 2018 at 2:43 AM Robert Bradshaw <rober...@google.com> wrote: > On Mon, Aug 27, 2018 at 11:23 AM Maximilian Michels <m...@apache.org> > wrote: > > > > Thanks for your proposal Henning. +1 for explicit environment messages. > > I'm not sure how important it is to support cross-platform pipelines. I > > can foresee future use but I wouldn't consider it essential. > > One may want to execute Tensforflow pipeline segments from the middle > of a Java pipeline, or leverage the SQL code (currently implemented in > Java) from Python or Go. In my experience, a common pipeline shape is > to do a significant amount of (often fairly trivial) filtering at the > front of a pipeline, and sophisticated analysis at the end, and the > tradeoffs of execution efficiency vs. expressiveness and prototype > friendliness are different in these two halves of the same pipeline. > > > However, it > > basically comes for free if we extend the existing environment > > information for ExecutableStage. The overhead, as you said, is > negligible. > > +1. Also it should be noted that the runner is ideally not restricted > to this set of environments; if it understands the URN it can use > whatever environment it finds appropriate. This could be especially > useful for optimal choice of environment to avoid unneeded fusion > barriers (e.g. the trivial Count transform will has a pair-with-one > before the GBK, and a sum-values after the GBK, and assuming those > operations are available in nearly every environment it would be > preferable to choose the variant according to what precedes/follows it > to allow fusion (or, possibly, even embed the operation). > > > Also agree that artifact staging is important even with process-based > > execution. The execution environment might be managed externally but we > > still want to be able to execute new pipelines without copying over > > required artifact. That said, a first version could come without > > artifact staging. > > One of the parameters passed to the script is the staging endpoint, > which it can use to do all the staging itself, so this could also be > internal to the script. I can however see wanting the ability to stage > artifacts more cheaply (e.g. symlinks), and this is actually also the > case with the docker environment (e.g. mount points). > > > On 23.08.18 18:14, Henning Rohde wrote: > > > A process-based SDK harness does not IMO imply that the host is fully > > > provisioned by the SDK/user and invoking the user command line in the > > > context of the staged files is a critical aspect for it to work. So I > > > consider staged artifact support needed. Also, I would like to suggest > > > that we move to a concrete environment proto to crystalize what is > > > actually being proposed. I'm not sure what activating a virtualenv > would > > > look like, for example. To start things off: > > > > > > message Environment { > > > string urn = 1; > > > bytes payload = 2; > > > } > > > > > > // urn == "beam:env:docker:v1" > > > message DockerPayload { > > > string container_image = 1; // implicitly linux_amd64. > > > } > > > > > > // urn == "beam:env:process:v1" > > > message ProcessPayload { > > > string os = 1; // "linux", "darwin", .. > > > string arch = 2; // "amd64", .. > > > string command_line = 3; > > > } > > > > > > // urn == "beam:env:external:v1" > > > // (no payload) > > > > > > A runner may support any subset and reject any unsupported > > > configuration. There are 3 kinds of environments that I think are > useful: > > > (1) docker: works as currently. Offers the most flexibility for SDKs > > > and users, especially when the runner is outside the control (such > > > as hosted runners). The runner starts the SDK harnesses. > > > (2) process: as discussed here. The runner starts the SDK harnesses. > > > The semantics is that the shell commandline is invoked in a directory > > > rooted in the staged artifacts with the container contract arguments. > It > > > is up to the user and runner deployment to ensure that it makes sense, > > > i.e., on windows a linux binary or bash script is not specified. > > > Executing the user command in a shell env (bash, zsh, cmd, ..) ensures > > > that paths and so on are set up:, i.e., specifying "java -jar foo" > would > > > actually work. Useful for cases where the user controls both the SDK > and > > > runner (such as locally) or when docker is not an option. Intended to > be > > > minimal and SDK/language agnostic. > > > (3) external: this is what I think Robert was alluding to. The runner > > > does not start any SDK harnesses. Instead it waits for user-controlled > > > SDK harnesses to connect. Useful for manually debugging SDK code > > > (connect from code running in a debugger) or when the user code must > run > > > in a special or privileged environment. It's runner-specific how the > SDK > > > will need to connect. > > > > > > Part of the idea of placing this information in the environment is that > > > pipelines can potentially use multiple, such as cross-windows/linux. > > > > > > Henning > > > > > > On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <t...@apache.org > > > <mailto:t...@apache.org>> wrote: > > > > > > I would see support for staging libraries as optional / nice to > have > > > since that can also be done as part of host provisioning (i.e. in > > > the Python case a virtual environment was already setup and just > > > needs to be activated). > > > > > > Depending on how the command that launches the harness is > > > configured, additional steps such as virtualenv activate or setting > > > of other environment variables can be included as well. > > > > > > > > > On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <m...@apache.org > > > <mailto:m...@apache.org>> wrote: > > > > > > Just to recap: > > > > > > From this and the other thread ("Bootstraping Beam's Job > > > Server") we > > > got sufficient evidence that process-based execution is a > > > desired feature. > > > > > > Process-based execution as an alternative to dockerized > execution > > > https://issues.apache.org/jira/browse/BEAM-5187 > > > > > > Which parts are executed as a process? > > > => The SDK harness for user code > > > > > > What configuration options are supported? > > > => Provide information about the target architecture (OS/CPU) > > > => Staging libraries, as also supported by Docker > > > => Activating a pre-existing environment (e.g. virutalenv) > > > > > > > > > On 23.08.18 14:13, Maximilian Michels wrote: > > > >> One thing to consider that we've talked about in the past. > > > It might > > > >> make sense to extend the environment proto and have the > SDK be > > > >> explicit about which kinds of environment it support > > > > > > > > +1 Encoding environment information there is a good idea. > > > > > > > >> Seems it will create a default docker url even if the > > > >> hardness_docker_image is set to None in pipeline options. > > > Shall we add > > > >> another option or honor the None in this option to support > > > the process > > > >> job? > > > > > > > > Yes, if no Docker image is set the default one will be used. > > > Currently > > > > Docker is the only way to execute pipelines with the > > > PortableRunner. If > > > > the docker_image is not set, execution won't succeed. > > > > > > > > On 22.08.18 22:59, Xinyu Liu wrote: > > > >> We are also interested in this Process JobBundleFactory as > > > we are > > > >> planning to fork a process to run python sdk in Samza > > > runner, instead > > > >> of using docker container. So this change will be helpful > to > > > us too. > > > >> On the same note, we are trying out portable_runner.py to > > > submit a > > > >> python job. Seems it will create a default docker url even > > > if the > > > >> hardness_docker_image is set to None in pipeline options. > > > Shall we add > > > >> another option or honor the None in this option to support > > > the process > > > >> job? I made some local changes right now to walk around > this. > > > >> > > > >> Thanks, > > > >> Xinyu > > > >> > > > >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde > > > <hero...@google.com <mailto:hero...@google.com> > > > >> <mailto:hero...@google.com <mailto:hero...@google.com>>> > wrote: > > > >> > > > >> By "enum" in quotes, I meant the usual open URN style > > > pattern not an > > > >> actual enum. Sorry if that wasn't clear. > > > >> > > > >> On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik > > > <lc...@google.com <mailto:lc...@google.com> > > > >> <mailto:lc...@google.com <mailto:lc...@google.com>>> > wrote: > > > >> > > > >> I would model the environment to be more free form > > > then enums > > > >> such that we have forward looking extensibility and > > > would > > > >> suggest to follow the same pattern we use on > > > PTransforms (using > > > >> an URN and a URN specific payload). Note that in > > > this case we > > > >> may want to support a list of supported > environments > > > (e.g. java, > > > >> docker, python, ...). > > > >> > > > >> On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde > > > >> <hero...@google.com <mailto:hero...@google.com> > > > <mailto:hero...@google.com <mailto:hero...@google.com>>> > wrote: > > > >> > > > >> One thing to consider that we've talked about > in > > > the past. > > > >> It might make sense to extend the environment > > > proto and have > > > >> the SDK be explicit about which kinds of > > > environment it > > > >> supports: > > > >> > > > >> > > > >> > > > > https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969 > > > > > > >> > > > >> > > > >> > > > < > https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969 > > > > > > > > >> > > > >> > > > >> This choice might impact what files are staged > > > or what not. > > > >> Some SDKs, such as Go, provide a compiled > binary > > > and _need_ > > > >> to know what the target architecture is. > Running > > > on a mac > > > >> with and without docker, say, requires a > > > different worker in > > > >> each case. If we add an "enum", we can also > > > easily add the > > > >> external idea where the SDK/user starts the SDK > > > harnesses > > > >> instead of the runner. Each runner may not > > > support all types > > > >> of environments. > > > >> > > > >> Henning > > > >> > > > >> On Tue, Aug 21, 2018 at 2:52 AM Maximilian > Michels > > > >> <m...@apache.org <mailto:m...@apache.org> > > > <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote: > > > >> > > > >> For reference, here is corresponding JIRA > > > issue for this > > > >> thread: > > > >> https://issues.apache.org/jira/browse/BEAM-5187 > > > >> > > > <https://issues.apache.org/jira/browse/BEAM-5187> > > > >> > > > >> On 16.08.18 11:15, Maximilian Michels > wrote: > > > >> > Makes sense to have an option to run the > > > SDK harness > > > >> in a non-dockerized > > > >> > environment. > > > >> > > > > >> > I'm in the process of creating a Docker > > > entry point > > > >> for Flink's > > > >> > JobServer[1]. I suppose you would also > > > prefer to > > > >> execute that one > > > >> > standalone. We should make sure this is > > > also an > > > >> option. > > > >> > > > > >> > [1] > > > https://issues.apache.org/jira/browse/BEAM-4130 > > > >> > > > <https://issues.apache.org/jira/browse/BEAM-4130> > > > >> > > > > >> > On 16.08.18 07:42, Thomas Weise wrote: > > > >> >> Yes, that's the proposal. Everything > > > that would > > > >> otherwise be packaged > > > >> >> into the Docker container would need > to be > > > >> pre-installed in the host > > > >> >> environment. In the case of Python SDK, > > > that could > > > >> simply mean a > > > >> >> (frozen) virtual environment that was > > > setup when the > > > >> host was > > > >> >> provisioned - the SDK harness > > > process(es) will then > > > >> just utilize that. > > > >> >> Of course this flavor of SDK harness > > > execution could > > > >> also be useful in > > > >> >> the local development environment, > where > > > right now > > > >> someone who already > > > >> >> has the Python environment needs to > also > > > install > > > >> Docker and update a > > > >> >> container to launch a Python SDK > > > pipeline on the > > > >> Flink runner. > > > >> >> > > > >> >> On Wed, Aug 15, 2018 at 12:40 PM Daniel > > > Oliveira > > > >> <danolive...@google.com > > > <mailto:danolive...@google.com> <mailto:danolive...@google.com > > > <mailto:danolive...@google.com>> > > > >> >> <mailto:danolive...@google.com > > > <mailto:danolive...@google.com> > > > >> <mailto:danolive...@google.com > > > <mailto:danolive...@google.com>>>> wrote: > > > >> >> > > > >> >> I just want to clarify that I > > > understand this > > > >> correctly since I'm > > > >> >> not that familiar with the details > > > behind all > > > >> these execution > > > >> >> environments yet. Is the proposal > > > to create a > > > >> new JobBundleFactory > > > >> >> that instead of using Docker to > > > create the > > > >> environment that the new > > > >> >> processes will execute in, this > > > >> JobBundleFactory would execute the > > > >> >> new processes directly in the host > > > environment? > > > >> So in practice if I > > > >> >> ran a pipeline with this > > > JobBundleFactory the > > > >> SDK Harness and Runner > > > >> >> Harness would both be executing > > > directly on my > > > >> machine and would > > > >> >> depend on me having the > > > dependencies already > > > >> present on my machine? > > > >> >> > > > >> >> On Mon, Aug 13, 2018 at 5:50 PM > > > Ankur Goenka > > > >> <goe...@google.com > > > <mailto:goe...@google.com> <mailto:goe...@google.com > > > <mailto:goe...@google.com>> > > > >> >> <mailto:goe...@google.com > > > <mailto:goe...@google.com> > > > >> <mailto:goe...@google.com > > > <mailto:goe...@google.com>>>> wrote: > > > >> >> > > > >> >> Thanks for starting the > > > discussion. I will > > > >> be happy to help. > > > >> >> I agree, we should have > pluggable > > > >> SDKHarness environment Factory. > > > >> >> We can register multiple > > > Environment > > > >> factory using service > > > >> >> registry and use the > > > PipelineOption to pick > > > >> the right one on per > > > >> >> job basis. > > > >> >> > > > >> >> There are a couple of things > > > which are > > > >> require to setup before > > > >> >> launching the process. > > > >> >> > > > >> >> * Setting up the environment > > > as done in > > > >> boot.go [4] > > > >> >> * Retrieving and putting the > > > artifacts in > > > >> the right location. > > > >> >> > > > >> >> You can probably leverage > > > boot.go code to > > > >> setup the environment. > > > >> >> > > > >> >> Also, it will be useful to > > > enumerate pros > > > >> and cons of different > > > >> >> Environments to help users > > > choose the right > > > >> one. > > > >> >> > > > >> >> > > > >> >> On Mon, Aug 6, 2018 at 4:50 PM > > > Thomas Weise > > > >> <t...@apache.org <mailto:t...@apache.org> > > > <mailto:t...@apache.org <mailto:t...@apache.org>> > > > >> >> <mailto:t...@apache.org > > > <mailto:t...@apache.org> > > > >> <mailto:t...@apache.org > > > <mailto:t...@apache.org>>>> wrote: > > > >> >> > > > >> >> Hi, > > > >> >> > > > >> >> Currently the portable > > > Flink runner > > > >> only works with SDK > > > >> >> Docker containers for > execution > > > >> (DockerJobBundleFactory, > > > >> >> besides an in-process > > > (embedded) > > > >> factory option for testing > > > >> >> [1]). I'm considering > > > adding another > > > >> out of process > > > >> >> JobBundleFactory > > > implementation that > > > >> directly forks the > > > >> >> processes on the task > > > manager host, > > > >> eliminating the need for > > > >> >> Docker. This would work > > > reasonably well > > > >> in environments > > > >> >> where the dependencies (in > > > this case > > > >> Python) can easily be > > > >> >> tied into the host > > > deployment (also > > > >> within an application > > > >> >> specific Kubernetes pod). > > > >> >> > > > >> >> There was already some > > > discussion about > > > >> alternative > > > >> >> JobBundleFactory > > > implementation in [2]. > > > >> There is also a JIRA > > > >> >> to make the bundle factory > > > pluggable > > > >> [3], pending > > > >> >> availability of runner > > > level options. > > > >> >> > > > >> >> For a > > > "ProcessBundleFactory", in > > > >> addition to the Python > > > >> >> dependencies the > > > environment would also > > > >> need to have the Go > > > >> >> boot executable [4] (or a > > > substitute > > > >> thereof) to perform the > > > >> >> harness initialization. > > > >> >> > > > >> >> Is anyone else interested > > > in this SDK > > > >> execution option or > > > >> >> has already investigated > an > > > alternative > > > >> implementation? > > > >> >> > > > >> >> Thanks, > > > >> >> Thomas > > > >> >> > > > >> >> [1] > > > >> >> > > > >> > > > >> > > > > https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83 > > > > > > >> > > > >> > > > >> > > > < > https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83 > > > > > > > > >> > > > >> >> > > > >> >> [2] > > > >> >> > > > >> > > > >> > > > > https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E > > > > > > >> > > > >> > > > >> > > > < > https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E > > > > > > > > >> > > > >> >> > > > >> >> [3] > > > >> https://issues.apache.org/jira/browse/BEAM-4819 > > > >> > > > <https://issues.apache.org/jira/browse/BEAM-4819> > > > >> >> > > > >> >> [4] > > > >> > > > >> > > > > https://github.com/apache/beam/blob/master/sdks/python/container/boot.go > > > >> > > > >> > > > < > https://github.com/apache/beam/blob/master/sdks/python/container/boot.go> > > > > > > >> > > > >> >> > > > >> > > > >> -- Max > > > >> > > > >> > > > > > > > > > > -- > > > Max > > > > > > > -- > > Max >