Sorry for posting on a separate thread. Lets continue the discussion here. +1 for having URN to identify environment type. I think URN is better than 'oneof' structure as its more flexible and forward compatible.
On Fri, Aug 31, 2018 at 9:14 AM Thomas Weise <t...@apache.org> wrote: > FYI the first part of support for (direct) process based job bundle > factory was merged (thanks Max!) > > https://github.com/apache/beam/pull/6287 > > On top of that I have built a customization that runs the Python SDK > worker directly: > > > https://github.com/lyft/beam/blob/release-2.8.0-lyft/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/control/LyftProcessJobBundleFactory.java > > While doing that, I thought it would be nice to support a custom > environment (along with the other options we already discussed) that allows > users to hook in their own extensions without having to do stuff like this: > https://github.com/lyft/beam/pull/6/files#diff-b74ff692340bcae0032d119a7192624cR61 > > Thanks, > Thomas > > > > > > On Mon, Aug 27, 2018 at 2:43 AM Robert Bradshaw <rober...@google.com> > wrote: > >> On Mon, Aug 27, 2018 at 11:23 AM Maximilian Michels <m...@apache.org> >> wrote: >> > >> > Thanks for your proposal Henning. +1 for explicit environment messages. >> > I'm not sure how important it is to support cross-platform pipelines. I >> > can foresee future use but I wouldn't consider it essential. >> >> One may want to execute Tensforflow pipeline segments from the middle >> of a Java pipeline, or leverage the SQL code (currently implemented in >> Java) from Python or Go. In my experience, a common pipeline shape is >> to do a significant amount of (often fairly trivial) filtering at the >> front of a pipeline, and sophisticated analysis at the end, and the >> tradeoffs of execution efficiency vs. expressiveness and prototype >> friendliness are different in these two halves of the same pipeline. >> >> > However, it >> > basically comes for free if we extend the existing environment >> > information for ExecutableStage. The overhead, as you said, is >> negligible. >> >> +1. Also it should be noted that the runner is ideally not restricted >> to this set of environments; if it understands the URN it can use >> whatever environment it finds appropriate. This could be especially >> useful for optimal choice of environment to avoid unneeded fusion >> barriers (e.g. the trivial Count transform will has a pair-with-one >> before the GBK, and a sum-values after the GBK, and assuming those >> operations are available in nearly every environment it would be >> preferable to choose the variant according to what precedes/follows it >> to allow fusion (or, possibly, even embed the operation). >> >> > Also agree that artifact staging is important even with process-based >> > execution. The execution environment might be managed externally but we >> > still want to be able to execute new pipelines without copying over >> > required artifact. That said, a first version could come without >> > artifact staging. >> >> One of the parameters passed to the script is the staging endpoint, >> which it can use to do all the staging itself, so this could also be >> internal to the script. I can however see wanting the ability to stage >> artifacts more cheaply (e.g. symlinks), and this is actually also the >> case with the docker environment (e.g. mount points). >> >> > On 23.08.18 18:14, Henning Rohde wrote: >> > > A process-based SDK harness does not IMO imply that the host is fully >> > > provisioned by the SDK/user and invoking the user command line in the >> > > context of the staged files is a critical aspect for it to work. So I >> > > consider staged artifact support needed. Also, I would like to suggest >> > > that we move to a concrete environment proto to crystalize what is >> > > actually being proposed. I'm not sure what activating a virtualenv >> would >> > > look like, for example. To start things off: >> > > >> > > message Environment { >> > > string urn = 1; >> > > bytes payload = 2; >> > > } >> > > >> > > // urn == "beam:env:docker:v1" >> > > message DockerPayload { >> > > string container_image = 1; // implicitly linux_amd64. >> > > } >> > > >> > > // urn == "beam:env:process:v1" >> > > message ProcessPayload { >> > > string os = 1; // "linux", "darwin", .. >> > > string arch = 2; // "amd64", .. >> > > string command_line = 3; >> > > } >> > > >> > > // urn == "beam:env:external:v1" >> > > // (no payload) >> > > >> > > A runner may support any subset and reject any unsupported >> > > configuration. There are 3 kinds of environments that I think are >> useful: >> > > (1) docker: works as currently. Offers the most flexibility for SDKs >> > > and users, especially when the runner is outside the control (such >> > > as hosted runners). The runner starts the SDK harnesses. >> > > (2) process: as discussed here. The runner starts the SDK harnesses. >> > > The semantics is that the shell commandline is invoked in a directory >> > > rooted in the staged artifacts with the container contract arguments. >> It >> > > is up to the user and runner deployment to ensure that it makes sense, >> > > i.e., on windows a linux binary or bash script is not specified. >> > > Executing the user command in a shell env (bash, zsh, cmd, ..) ensures >> > > that paths and so on are set up:, i.e., specifying "java -jar foo" >> would >> > > actually work. Useful for cases where the user controls both the SDK >> and >> > > runner (such as locally) or when docker is not an option. Intended to >> be >> > > minimal and SDK/language agnostic. >> > > (3) external: this is what I think Robert was alluding to. The >> runner >> > > does not start any SDK harnesses. Instead it waits for user-controlled >> > > SDK harnesses to connect. Useful for manually debugging SDK code >> > > (connect from code running in a debugger) or when the user code must >> run >> > > in a special or privileged environment. It's runner-specific how the >> SDK >> > > will need to connect. >> > > >> > > Part of the idea of placing this information in the environment is >> that >> > > pipelines can potentially use multiple, such as cross-windows/linux. >> > > >> > > Henning >> > > >> > > On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <t...@apache.org >> > > <mailto:t...@apache.org>> wrote: >> > > >> > > I would see support for staging libraries as optional / nice to >> have >> > > since that can also be done as part of host provisioning (i.e. in >> > > the Python case a virtual environment was already setup and just >> > > needs to be activated). >> > > >> > > Depending on how the command that launches the harness is >> > > configured, additional steps such as virtualenv activate or >> setting >> > > of other environment variables can be included as well. >> > > >> > > >> > > On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels < >> m...@apache.org >> > > <mailto:m...@apache.org>> wrote: >> > > >> > > Just to recap: >> > > >> > > From this and the other thread ("Bootstraping Beam's Job >> > > Server") we >> > > got sufficient evidence that process-based execution is a >> > > desired feature. >> > > >> > > Process-based execution as an alternative to dockerized >> execution >> > > https://issues.apache.org/jira/browse/BEAM-5187 >> > > >> > > Which parts are executed as a process? >> > > => The SDK harness for user code >> > > >> > > What configuration options are supported? >> > > => Provide information about the target architecture (OS/CPU) >> > > => Staging libraries, as also supported by Docker >> > > => Activating a pre-existing environment (e.g. virutalenv) >> > > >> > > >> > > On 23.08.18 14:13, Maximilian Michels wrote: >> > > >> One thing to consider that we've talked about in the past. >> > > It might >> > > >> make sense to extend the environment proto and have the >> SDK be >> > > >> explicit about which kinds of environment it support >> > > > >> > > > +1 Encoding environment information there is a good idea. >> > > > >> > > >> Seems it will create a default docker url even if the >> > > >> hardness_docker_image is set to None in pipeline options. >> > > Shall we add >> > > >> another option or honor the None in this option to support >> > > the process >> > > >> job? >> > > > >> > > > Yes, if no Docker image is set the default one will be >> used. >> > > Currently >> > > > Docker is the only way to execute pipelines with the >> > > PortableRunner. If >> > > > the docker_image is not set, execution won't succeed. >> > > > >> > > > On 22.08.18 22:59, Xinyu Liu wrote: >> > > >> We are also interested in this Process JobBundleFactory as >> > > we are >> > > >> planning to fork a process to run python sdk in Samza >> > > runner, instead >> > > >> of using docker container. So this change will be helpful >> to >> > > us too. >> > > >> On the same note, we are trying out portable_runner.py to >> > > submit a >> > > >> python job. Seems it will create a default docker url even >> > > if the >> > > >> hardness_docker_image is set to None in pipeline options. >> > > Shall we add >> > > >> another option or honor the None in this option to support >> > > the process >> > > >> job? I made some local changes right now to walk around >> this. >> > > >> >> > > >> Thanks, >> > > >> Xinyu >> > > >> >> > > >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde >> > > <hero...@google.com <mailto:hero...@google.com> >> > > >> <mailto:hero...@google.com <mailto:hero...@google.com>>> >> wrote: >> > > >> >> > > >> By "enum" in quotes, I meant the usual open URN style >> > > pattern not an >> > > >> actual enum. Sorry if that wasn't clear. >> > > >> >> > > >> On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik >> > > <lc...@google.com <mailto:lc...@google.com> >> > > >> <mailto:lc...@google.com <mailto:lc...@google.com>>> >> wrote: >> > > >> >> > > >> I would model the environment to be more free form >> > > then enums >> > > >> such that we have forward looking extensibility >> and >> > > would >> > > >> suggest to follow the same pattern we use on >> > > PTransforms (using >> > > >> an URN and a URN specific payload). Note that in >> > > this case we >> > > >> may want to support a list of supported >> environments >> > > (e.g. java, >> > > >> docker, python, ...). >> > > >> >> > > >> On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde >> > > >> <hero...@google.com <mailto:hero...@google.com> >> > > <mailto:hero...@google.com <mailto:hero...@google.com>>> >> wrote: >> > > >> >> > > >> One thing to consider that we've talked about >> in >> > > the past. >> > > >> It might make sense to extend the environment >> > > proto and have >> > > >> the SDK be explicit about which kinds of >> > > environment it >> > > >> supports: >> > > >> >> > > >> >> > > >> >> > > >> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969 >> > > >> > > >> >> > > >> >> > > >> >> > > < >> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969 >> > >> > > >> > > >> >> > > >> >> > > >> This choice might impact what files are staged >> > > or what not. >> > > >> Some SDKs, such as Go, provide a compiled >> binary >> > > and _need_ >> > > >> to know what the target architecture is. >> Running >> > > on a mac >> > > >> with and without docker, say, requires a >> > > different worker in >> > > >> each case. If we add an "enum", we can also >> > > easily add the >> > > >> external idea where the SDK/user starts the >> SDK >> > > harnesses >> > > >> instead of the runner. Each runner may not >> > > support all types >> > > >> of environments. >> > > >> >> > > >> Henning >> > > >> >> > > >> On Tue, Aug 21, 2018 at 2:52 AM Maximilian >> Michels >> > > >> <m...@apache.org <mailto:m...@apache.org> >> > > <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote: >> > > >> >> > > >> For reference, here is corresponding JIRA >> > > issue for this >> > > >> thread: >> > > >> https://issues.apache.org/jira/browse/BEAM-5187 >> > > >> >> > > <https://issues.apache.org/jira/browse/BEAM-5187> >> > > >> >> > > >> On 16.08.18 11:15, Maximilian Michels >> wrote: >> > > >> > Makes sense to have an option to run >> the >> > > SDK harness >> > > >> in a non-dockerized >> > > >> > environment. >> > > >> > >> > > >> > I'm in the process of creating a Docker >> > > entry point >> > > >> for Flink's >> > > >> > JobServer[1]. I suppose you would also >> > > prefer to >> > > >> execute that one >> > > >> > standalone. We should make sure this is >> > > also an >> > > >> option. >> > > >> > >> > > >> > [1] >> > > https://issues.apache.org/jira/browse/BEAM-4130 >> > > >> >> > > <https://issues.apache.org/jira/browse/BEAM-4130> >> > > >> > >> > > >> > On 16.08.18 07:42, Thomas Weise wrote: >> > > >> >> Yes, that's the proposal. Everything >> > > that would >> > > >> otherwise be packaged >> > > >> >> into the Docker container would need >> to be >> > > >> pre-installed in the host >> > > >> >> environment. In the case of Python >> SDK, >> > > that could >> > > >> simply mean a >> > > >> >> (frozen) virtual environment that was >> > > setup when the >> > > >> host was >> > > >> >> provisioned - the SDK harness >> > > process(es) will then >> > > >> just utilize that. >> > > >> >> Of course this flavor of SDK harness >> > > execution could >> > > >> also be useful in >> > > >> >> the local development environment, >> where >> > > right now >> > > >> someone who already >> > > >> >> has the Python environment needs to >> also >> > > install >> > > >> Docker and update a >> > > >> >> container to launch a Python SDK >> > > pipeline on the >> > > >> Flink runner. >> > > >> >> >> > > >> >> On Wed, Aug 15, 2018 at 12:40 PM >> Daniel >> > > Oliveira >> > > >> <danolive...@google.com >> > > <mailto:danolive...@google.com> <mailto: >> danolive...@google.com >> > > <mailto:danolive...@google.com>> >> > > >> >> <mailto:danolive...@google.com >> > > <mailto:danolive...@google.com> >> > > >> <mailto:danolive...@google.com >> > > <mailto:danolive...@google.com>>>> wrote: >> > > >> >> >> > > >> >> I just want to clarify that I >> > > understand this >> > > >> correctly since I'm >> > > >> >> not that familiar with the >> details >> > > behind all >> > > >> these execution >> > > >> >> environments yet. Is the proposal >> > > to create a >> > > >> new JobBundleFactory >> > > >> >> that instead of using Docker to >> > > create the >> > > >> environment that the new >> > > >> >> processes will execute in, this >> > > >> JobBundleFactory would execute the >> > > >> >> new processes directly in the >> host >> > > environment? >> > > >> So in practice if I >> > > >> >> ran a pipeline with this >> > > JobBundleFactory the >> > > >> SDK Harness and Runner >> > > >> >> Harness would both be executing >> > > directly on my >> > > >> machine and would >> > > >> >> depend on me having the >> > > dependencies already >> > > >> present on my machine? >> > > >> >> >> > > >> >> On Mon, Aug 13, 2018 at 5:50 PM >> > > Ankur Goenka >> > > >> <goe...@google.com >> > > <mailto:goe...@google.com> <mailto:goe...@google.com >> > > <mailto:goe...@google.com>> >> > > >> >> <mailto:goe...@google.com >> > > <mailto:goe...@google.com> >> > > >> <mailto:goe...@google.com >> > > <mailto:goe...@google.com>>>> wrote: >> > > >> >> >> > > >> >> Thanks for starting the >> > > discussion. I will >> > > >> be happy to help. >> > > >> >> I agree, we should have >> pluggable >> > > >> SDKHarness environment Factory. >> > > >> >> We can register multiple >> > > Environment >> > > >> factory using service >> > > >> >> registry and use the >> > > PipelineOption to pick >> > > >> the right one on per >> > > >> >> job basis. >> > > >> >> >> > > >> >> There are a couple of things >> > > which are >> > > >> require to setup before >> > > >> >> launching the process. >> > > >> >> >> > > >> >> * Setting up the >> environment >> > > as done in >> > > >> boot.go [4] >> > > >> >> * Retrieving and putting >> the >> > > artifacts in >> > > >> the right location. >> > > >> >> >> > > >> >> You can probably leverage >> > > boot.go code to >> > > >> setup the environment. >> > > >> >> >> > > >> >> Also, it will be useful to >> > > enumerate pros >> > > >> and cons of different >> > > >> >> Environments to help users >> > > choose the right >> > > >> one. >> > > >> >> >> > > >> >> >> > > >> >> On Mon, Aug 6, 2018 at 4:50 >> PM >> > > Thomas Weise >> > > >> <t...@apache.org <mailto:t...@apache.org> >> > > <mailto:t...@apache.org <mailto:t...@apache.org>> >> > > >> >> <mailto:t...@apache.org >> > > <mailto:t...@apache.org> >> > > >> <mailto:t...@apache.org >> > > <mailto:t...@apache.org>>>> wrote: >> > > >> >> >> > > >> >> Hi, >> > > >> >> >> > > >> >> Currently the portable >> > > Flink runner >> > > >> only works with SDK >> > > >> >> Docker containers for >> execution >> > > >> (DockerJobBundleFactory, >> > > >> >> besides an in-process >> > > (embedded) >> > > >> factory option for testing >> > > >> >> [1]). I'm considering >> > > adding another >> > > >> out of process >> > > >> >> JobBundleFactory >> > > implementation that >> > > >> directly forks the >> > > >> >> processes on the task >> > > manager host, >> > > >> eliminating the need for >> > > >> >> Docker. This would work >> > > reasonably well >> > > >> in environments >> > > >> >> where the dependencies >> (in >> > > this case >> > > >> Python) can easily be >> > > >> >> tied into the host >> > > deployment (also >> > > >> within an application >> > > >> >> specific Kubernetes pod). >> > > >> >> >> > > >> >> There was already some >> > > discussion about >> > > >> alternative >> > > >> >> JobBundleFactory >> > > implementation in [2]. >> > > >> There is also a JIRA >> > > >> >> to make the bundle >> factory >> > > pluggable >> > > >> [3], pending >> > > >> >> availability of runner >> > > level options. >> > > >> >> >> > > >> >> For a >> > > "ProcessBundleFactory", in >> > > >> addition to the Python >> > > >> >> dependencies the >> > > environment would also >> > > >> need to have the Go >> > > >> >> boot executable [4] (or a >> > > substitute >> > > >> thereof) to perform the >> > > >> >> harness initialization. >> > > >> >> >> > > >> >> Is anyone else interested >> > > in this SDK >> > > >> execution option or >> > > >> >> has already investigated >> an >> > > alternative >> > > >> implementation? >> > > >> >> >> > > >> >> Thanks, >> > > >> >> Thomas >> > > >> >> >> > > >> >> [1] >> > > >> >> >> > > >> >> > > >> >> > > >> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83 >> > > >> > > >> >> > > >> >> > > >> >> > > < >> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83 >> > >> > > >> > > >> >> > > >> >> >> > > >> >> [2] >> > > >> >> >> > > >> >> > > >> >> > > >> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E >> > > >> > > >> >> > > >> >> > > >> >> > > < >> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E >> > >> > > >> > > >> >> > > >> >> >> > > >> >> [3] >> > > >> https://issues.apache.org/jira/browse/BEAM-4819 >> > > >> >> > > <https://issues.apache.org/jira/browse/BEAM-4819> >> > > >> >> >> > > >> >> [4] >> > > >> >> > > >> >> > > >> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go >> > > >> >> > > >> >> > > < >> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go> >> > > >> > > >> >> > > >> >> >> > > >> >> > > >> -- Max >> > > >> >> > > >> >> > > > >> > > >> > > -- >> > > Max >> > > >> > >> > -- >> > Max >> >