Re: Process JobBundleFactory for portable runner

Ankur Goenka Fri, 31 Aug 2018 18:40:08 -0700

Sorry for posting on a separate thread. Lets continue the discussion here.
+1 for having URN to identify environment type. I think URN is better than
'oneof' structure as its more flexible and forward compatible.


On Fri, Aug 31, 2018 at 9:14 AM Thomas Weise <t...@apache.org> wrote:

> FYI the first part of support for (direct) process based job bundle
> factory was merged (thanks Max!)
>
> https://github.com/apache/beam/pull/6287
>
> On top of that I have built a customization that runs the Python SDK
> worker directly:
>
>
> https://github.com/lyft/beam/blob/release-2.8.0-lyft/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/control/LyftProcessJobBundleFactory.java
>
> While doing that, I thought it would be nice to support a custom
> environment (along with the other options we already discussed) that allows
> users to hook in their own extensions without having to do stuff like this:
> https://github.com/lyft/beam/pull/6/files#diff-b74ff692340bcae0032d119a7192624cR61
>
> Thanks,
> Thomas
>
>
>
>
>
> On Mon, Aug 27, 2018 at 2:43 AM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Mon, Aug 27, 2018 at 11:23 AM Maximilian Michels <m...@apache.org>
>> wrote:
>> >
>> > Thanks for your proposal Henning. +1 for explicit environment messages.
>> > I'm not sure how important it is to support cross-platform pipelines. I
>> > can foresee future use but I wouldn't consider it essential.
>>
>> One may want to execute Tensforflow pipeline segments from the middle
>> of a Java pipeline, or leverage the SQL code (currently implemented in
>> Java) from Python or Go. In my experience, a common pipeline shape is
>> to do a significant amount of (often fairly trivial) filtering at the
>> front of a pipeline, and sophisticated analysis at the end, and the
>> tradeoffs of execution efficiency vs. expressiveness and prototype
>> friendliness are different in these two halves of the same pipeline.
>>
>> > However, it
>> > basically comes for free if we extend the existing environment
>> > information for ExecutableStage. The overhead, as you said, is
>> negligible.
>>
>> +1. Also it should be noted that the runner is ideally not restricted
>> to this set of environments; if it understands the URN it can use
>> whatever environment it finds appropriate. This could be especially
>> useful for optimal choice of environment to avoid unneeded fusion
>> barriers (e.g. the trivial Count transform will has a pair-with-one
>> before the GBK, and a sum-values after the GBK, and assuming those
>> operations are available in nearly every environment it would be
>> preferable to choose the variant according to what precedes/follows it
>> to allow fusion (or, possibly, even embed the operation).
>>
>> > Also agree that artifact staging is important even with process-based
>> > execution. The execution environment might be managed externally but we
>> > still want to be able to execute new pipelines without copying over
>> > required artifact. That said, a first version could come without
>> > artifact staging.
>>
>> One of the parameters passed to the script is the staging endpoint,
>> which it can use to do all the staging itself, so this could also be
>> internal to the script. I can however see wanting the ability to stage
>> artifacts more cheaply (e.g. symlinks), and this is actually also the
>> case with the docker environment (e.g. mount points).
>>
>> > On 23.08.18 18:14, Henning Rohde wrote:
>> > > A process-based SDK harness does not IMO imply that the host is fully
>> > > provisioned by the SDK/user and invoking the user command line in the
>> > > context of the staged files is a critical aspect for it to work. So I
>> > > consider staged artifact support needed. Also, I would like to suggest
>> > > that we move to a concrete environment proto to crystalize what is
>> > > actually being proposed. I'm not sure what activating a virtualenv
>> would
>> > > look like, for example. To start things off:
>> > >
>> > > message Environment {
>> > >    string urn = 1;
>> > >    bytes payload = 2;
>> > > }
>> > >
>> > > // urn == "beam:env:docker:v1"
>> > > message DockerPayload {
>> > >    string container_image = 1;  // implicitly linux_amd64.
>> > > }
>> > >
>> > > // urn == "beam:env:process:v1"
>> > > message ProcessPayload {
>> > >    string os = 1;  // "linux", "darwin", ..
>> > >    string arch = 2;  // "amd64", ..
>> > >    string command_line = 3;
>> > > }
>> > >
>> > > // urn == "beam:env:external:v1"
>> > > // (no payload)
>> > >
>> > > A runner may support any subset and reject any unsupported
>> > > configuration. There are 3 kinds of environments that I think are
>> useful:
>> > >   (1) docker: works as currently. Offers the most flexibility for SDKs
>> > > and users, especially when the runner is outside the control (such
>> > > as hosted runners). The runner starts the SDK harnesses.
>> > >   (2) process: as discussed here. The runner starts the SDK harnesses.
>> > > The semantics is that the shell commandline is invoked in a directory
>> > > rooted in the staged artifacts with the container contract arguments.
>> It
>> > > is up to the user and runner deployment to ensure that it makes sense,
>> > > i.e., on windows a linux binary or bash script is not specified.
>> > > Executing the user command in a shell env (bash, zsh, cmd, ..) ensures
>> > > that paths and so on are set up:, i.e., specifying "java -jar foo"
>> would
>> > > actually work. Useful for cases where the user controls both the SDK
>> and
>> > > runner (such as locally) or when docker is not an option. Intended to
>> be
>> > > minimal and SDK/language agnostic.
>> > >   (3) external: this is what I think Robert was alluding to. The
>> runner
>> > > does not start any SDK harnesses. Instead it waits for user-controlled
>> > > SDK harnesses to connect. Useful for manually debugging SDK code
>> > > (connect from code running in a debugger) or when the user code must
>> run
>> > > in a special or privileged environment. It's runner-specific how the
>> SDK
>> > > will need to connect.
>> > >
>> > > Part of the idea of placing this information in the environment is
>> that
>> > > pipelines can potentially use multiple, such as cross-windows/linux.
>> > >
>> > > Henning
>> > >
>> > > On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <t...@apache.org
>> > > <mailto:t...@apache.org>> wrote:
>> > >
>> > >     I would see support for staging libraries as optional / nice to
>> have
>> > >     since that can also be done as part of host provisioning (i.e. in
>> > >     the Python case a virtual environment was already setup and just
>> > >     needs to be activated).
>> > >
>> > >     Depending on how the command that launches the harness is
>> > >     configured, additional steps such as virtualenv activate or
>> setting
>> > >     of other environment variables can be included as well.
>> > >
>> > >
>> > >     On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <
>> m...@apache.org
>> > >     <mailto:m...@apache.org>> wrote:
>> > >
>> > >         Just to recap:
>> > >
>> > >           From this and the other thread ("Bootstraping Beam's Job
>> > >         Server") we
>> > >         got sufficient evidence that process-based execution is a
>> > >         desired feature.
>> > >
>> > >         Process-based execution as an alternative to dockerized
>> execution
>> > >         https://issues.apache.org/jira/browse/BEAM-5187
>> > >
>> > >         Which parts are executed as a process?
>> > >         => The SDK harness for user code
>> > >
>> > >         What configuration options are supported?
>> > >         => Provide information about the target architecture (OS/CPU)
>> > >         => Staging libraries, as also supported by Docker
>> > >         => Activating a pre-existing environment (e.g. virutalenv)
>> > >
>> > >
>> > >         On 23.08.18 14:13, Maximilian Michels wrote:
>> > >          >> One thing to consider that we've talked about in the past.
>> > >         It might
>> > >          >> make sense to extend the environment proto and have the
>> SDK be
>> > >          >> explicit about which kinds of environment it support
>> > >          >
>> > >          > +1 Encoding environment information there is a good idea.
>> > >          >
>> > >          >> Seems it will create a default docker url even if the
>> > >          >> hardness_docker_image is set to None in pipeline options.
>> > >         Shall we add
>> > >          >> another option or honor the None in this option to support
>> > >         the process
>> > >          >> job?
>> > >          >
>> > >          > Yes, if no Docker image is set the default one will be
>> used.
>> > >         Currently
>> > >          > Docker is the only way to execute pipelines with the
>> > >         PortableRunner. If
>> > >          > the docker_image is not set, execution won't succeed.
>> > >          >
>> > >          > On 22.08.18 22:59, Xinyu Liu wrote:
>> > >          >> We are also interested in this Process JobBundleFactory as
>> > >         we are
>> > >          >> planning to fork a process to run python sdk in Samza
>> > >         runner, instead
>> > >          >> of using docker container. So this change will be helpful
>> to
>> > >         us too.
>> > >          >> On the same note, we are trying out portable_runner.py to
>> > >         submit a
>> > >          >> python job. Seems it will create a default docker url even
>> > >         if the
>> > >          >> hardness_docker_image is set to None in pipeline options.
>> > >         Shall we add
>> > >          >> another option or honor the None in this option to support
>> > >         the process
>> > >          >> job? I made some local changes right now to walk around
>> this.
>> > >          >>
>> > >          >> Thanks,
>> > >          >> Xinyu
>> > >          >>
>> > >          >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde
>> > >         <hero...@google.com <mailto:hero...@google.com>
>> > >          >> <mailto:hero...@google.com <mailto:hero...@google.com>>>
>> wrote:
>> > >          >>
>> > >          >>     By "enum" in quotes, I meant the usual open URN style
>> > >         pattern not an
>> > >          >>     actual enum. Sorry if that wasn't clear.
>> > >          >>
>> > >          >>     On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik
>> > >         <lc...@google.com <mailto:lc...@google.com>
>> > >          >>     <mailto:lc...@google.com <mailto:lc...@google.com>>>
>> wrote:
>> > >          >>
>> > >          >>         I would model the environment to be more free form
>> > >         then enums
>> > >          >>         such that we have forward looking extensibility
>> and
>> > >         would
>> > >          >>         suggest to follow the same pattern we use on
>> > >         PTransforms (using
>> > >          >>         an URN and a URN specific payload). Note that in
>> > >         this case we
>> > >          >>         may want to support a list of supported
>> environments
>> > >         (e.g. java,
>> > >          >>         docker, python, ...).
>> > >          >>
>> > >          >>         On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde
>> > >          >>         <hero...@google.com <mailto:hero...@google.com>
>> > >         <mailto:hero...@google.com <mailto:hero...@google.com>>>
>> wrote:
>> > >          >>
>> > >          >>             One thing to consider that we've talked about
>> in
>> > >         the past.
>> > >          >>             It might make sense to extend the environment
>> > >         proto and have
>> > >          >>             the SDK be explicit about which kinds of
>> > >         environment it
>> > >          >>             supports:
>> > >          >>
>> > >          >>
>> > >          >>
>> > >
>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>> > >
>> > >          >>
>> > >          >>
>> > >          >>
>> > >         <
>> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>> >
>> > >
>> > >          >>
>> > >          >>
>> > >          >>             This choice might impact what files are staged
>> > >         or what not.
>> > >          >>             Some SDKs, such as Go, provide a compiled
>> binary
>> > >         and _need_
>> > >          >>             to know what the target architecture is.
>> Running
>> > >         on a mac
>> > >          >>             with and without docker, say, requires a
>> > >         different worker in
>> > >          >>             each case. If we add an "enum", we can also
>> > >         easily add the
>> > >          >>             external idea where the SDK/user starts the
>> SDK
>> > >         harnesses
>> > >          >>             instead of the runner. Each runner may not
>> > >         support all types
>> > >          >>             of environments.
>> > >          >>
>> > >          >>             Henning
>> > >          >>
>> > >          >>             On Tue, Aug 21, 2018 at 2:52 AM Maximilian
>> Michels
>> > >          >>             <m...@apache.org <mailto:m...@apache.org>
>> > >         <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
>> > >          >>
>> > >          >>                 For reference, here is corresponding JIRA
>> > >         issue for this
>> > >          >>                 thread:
>> > >          >> https://issues.apache.org/jira/browse/BEAM-5187
>> > >          >>
>> > >         <https://issues.apache.org/jira/browse/BEAM-5187>
>> > >          >>
>> > >          >>                 On 16.08.18 11:15, Maximilian Michels
>> wrote:
>> > >          >>                  > Makes sense to have an option to run
>> the
>> > >         SDK harness
>> > >          >>                 in a non-dockerized
>> > >          >>                  > environment.
>> > >          >>                  >
>> > >          >>                  > I'm in the process of creating a Docker
>> > >         entry point
>> > >          >>                 for Flink's
>> > >          >>                  > JobServer[1]. I suppose you would also
>> > >         prefer to
>> > >          >>                 execute that one
>> > >          >>                  > standalone. We should make sure this is
>> > >         also an
>> > >          >> option.
>> > >          >>                  >
>> > >          >>                  > [1]
>> > >         https://issues.apache.org/jira/browse/BEAM-4130
>> > >          >>
>> > >         <https://issues.apache.org/jira/browse/BEAM-4130>
>> > >          >>                  >
>> > >          >>                  > On 16.08.18 07:42, Thomas Weise wrote:
>> > >          >>                  >> Yes, that's the proposal. Everything
>> > >         that would
>> > >          >>                 otherwise be packaged
>> > >          >>                  >> into the Docker container would need
>> to be
>> > >          >>                 pre-installed in the host
>> > >          >>                  >> environment. In the case of Python
>> SDK,
>> > >         that could
>> > >          >>                 simply mean a
>> > >          >>                  >> (frozen) virtual environment that was
>> > >         setup when the
>> > >          >>                 host was
>> > >          >>                  >> provisioned - the SDK harness
>> > >         process(es) will then
>> > >          >>                 just utilize that.
>> > >          >>                  >> Of course this flavor of SDK harness
>> > >         execution could
>> > >          >>                 also be useful in
>> > >          >>                  >> the local development environment,
>> where
>> > >         right now
>> > >          >>                 someone who already
>> > >          >>                  >> has the Python environment needs to
>> also
>> > >         install
>> > >          >>                 Docker and update a
>> > >          >>                  >> container to launch a Python SDK
>> > >         pipeline on the
>> > >          >>                 Flink runner.
>> > >          >>                  >>
>> > >          >>                  >> On Wed, Aug 15, 2018 at 12:40 PM
>> Daniel
>> > >         Oliveira
>> > >          >>                 <danolive...@google.com
>> > >         <mailto:danolive...@google.com> <mailto:
>> danolive...@google.com
>> > >         <mailto:danolive...@google.com>>
>> > >          >>                  >> <mailto:danolive...@google.com
>> > >         <mailto:danolive...@google.com>
>> > >          >>                 <mailto:danolive...@google.com
>> > >         <mailto:danolive...@google.com>>>> wrote:
>> > >          >>                  >>
>> > >          >>                  >>      I just want to clarify that I
>> > >         understand this
>> > >          >>                 correctly since I'm
>> > >          >>                  >>      not that familiar with the
>> details
>> > >         behind all
>> > >          >>                 these execution
>> > >          >>                  >>      environments yet. Is the proposal
>> > >         to create a
>> > >          >>                 new JobBundleFactory
>> > >          >>                  >>      that instead of using Docker to
>> > >         create the
>> > >          >>                 environment that the new
>> > >          >>                  >>      processes will execute in, this
>> > >          >>                 JobBundleFactory would execute the
>> > >          >>                  >>      new processes directly in the
>> host
>> > >         environment?
>> > >          >>                 So in practice if I
>> > >          >>                  >>      ran a pipeline with this
>> > >         JobBundleFactory the
>> > >          >>                 SDK Harness and Runner
>> > >          >>                  >>      Harness would both be executing
>> > >         directly on my
>> > >          >>                 machine and would
>> > >          >>                  >>      depend on me having the
>> > >         dependencies already
>> > >          >>                 present on my machine?
>> > >          >>                  >>
>> > >          >>                  >>      On Mon, Aug 13, 2018 at 5:50 PM
>> > >         Ankur Goenka
>> > >          >>                 <goe...@google.com
>> > >         <mailto:goe...@google.com> <mailto:goe...@google.com
>> > >         <mailto:goe...@google.com>>
>> > >          >>                  >>      <mailto:goe...@google.com
>> > >         <mailto:goe...@google.com>
>> > >          >>                 <mailto:goe...@google.com
>> > >         <mailto:goe...@google.com>>>> wrote:
>> > >          >>                  >>
>> > >          >>                  >>          Thanks for starting the
>> > >         discussion. I will
>> > >          >>                 be happy to help.
>> > >          >>                  >>          I agree, we should have
>> pluggable
>> > >          >>                 SDKHarness environment Factory.
>> > >          >>                  >>          We can register multiple
>> > >         Environment
>> > >          >>                 factory using service
>> > >          >>                  >>          registry and use the
>> > >         PipelineOption to pick
>> > >          >>                 the right one on per
>> > >          >>                  >>          job basis.
>> > >          >>                  >>
>> > >          >>                  >>          There are a couple of things
>> > >         which are
>> > >          >>                 require to setup before
>> > >          >>                  >>          launching the process.
>> > >          >>                  >>
>> > >          >>                  >>            * Setting up the
>> environment
>> > >         as done in
>> > >          >>                 boot.go [4]
>> > >          >>                  >>            * Retrieving and putting
>> the
>> > >         artifacts in
>> > >          >>                 the right location.
>> > >          >>                  >>
>> > >          >>                  >>          You can probably leverage
>> > >         boot.go code to
>> > >          >>                 setup the environment.
>> > >          >>                  >>
>> > >          >>                  >>          Also, it will be useful to
>> > >         enumerate pros
>> > >          >>                 and cons of different
>> > >          >>                  >>          Environments to help users
>> > >         choose the right
>> > >          >>                 one.
>> > >          >>                  >>
>> > >          >>                  >>
>> > >          >>                  >>          On Mon, Aug 6, 2018 at 4:50
>> PM
>> > >         Thomas Weise
>> > >          >>                 <t...@apache.org <mailto:t...@apache.org>
>> > >         <mailto:t...@apache.org <mailto:t...@apache.org>>
>> > >          >>                  >>          <mailto:t...@apache.org
>> > >         <mailto:t...@apache.org>
>> > >          >>                 <mailto:t...@apache.org
>> > >         <mailto:t...@apache.org>>>> wrote:
>> > >          >>                  >>
>> > >          >>                  >>              Hi,
>> > >          >>                  >>
>> > >          >>                  >>              Currently the portable
>> > >         Flink runner
>> > >          >>                 only works with SDK
>> > >          >>                  >>              Docker containers for
>> execution
>> > >          >>                 (DockerJobBundleFactory,
>> > >          >>                  >>              besides an in-process
>> > >         (embedded)
>> > >          >>                 factory option for testing
>> > >          >>                  >>              [1]). I'm considering
>> > >         adding another
>> > >          >>                 out of process
>> > >          >>                  >>              JobBundleFactory
>> > >         implementation that
>> > >          >>                 directly forks the
>> > >          >>                  >>              processes on the task
>> > >         manager host,
>> > >          >>                 eliminating the need for
>> > >          >>                  >>              Docker. This would work
>> > >         reasonably well
>> > >          >>                 in environments
>> > >          >>                  >>              where the dependencies
>> (in
>> > >         this case
>> > >          >>                 Python) can easily be
>> > >          >>                  >>              tied into the host
>> > >         deployment (also
>> > >          >>                 within an application
>> > >          >>                  >>              specific Kubernetes pod).
>> > >          >>                  >>
>> > >          >>                  >>              There was already some
>> > >         discussion about
>> > >          >>                 alternative
>> > >          >>                  >>              JobBundleFactory
>> > >         implementation in [2].
>> > >          >>                 There is also a JIRA
>> > >          >>                  >>              to make the bundle
>> factory
>> > >         pluggable
>> > >          >>                 [3], pending
>> > >          >>                  >>              availability of runner
>> > >         level options.
>> > >          >>                  >>
>> > >          >>                  >>              For a
>> > >         "ProcessBundleFactory", in
>> > >          >>                 addition to the Python
>> > >          >>                  >>              dependencies the
>> > >         environment would also
>> > >          >>                 need to have the Go
>> > >          >>                  >>              boot executable [4] (or a
>> > >         substitute
>> > >          >>                 thereof) to perform the
>> > >          >>                  >>              harness initialization.
>> > >          >>                  >>
>> > >          >>                  >>              Is anyone else interested
>> > >         in this SDK
>> > >          >>                 execution option or
>> > >          >>                  >>              has already investigated
>> an
>> > >         alternative
>> > >          >>                 implementation?
>> > >          >>                  >>
>> > >          >>                  >>              Thanks,
>> > >          >>                  >>              Thomas
>> > >          >>                  >>
>> > >          >>                  >>              [1]
>> > >          >>                  >>
>> > >          >>
>> > >          >>
>> > >
>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>> > >
>> > >          >>
>> > >          >>
>> > >          >>
>> > >         <
>> https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>> >
>> > >
>> > >          >>
>> > >          >>                  >>
>> > >          >>                  >>              [2]
>> > >          >>                  >>
>> > >          >>
>> > >          >>
>> > >
>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>> > >
>> > >          >>
>> > >          >>
>> > >          >>
>> > >         <
>> https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>> >
>> > >
>> > >          >>
>> > >          >>                  >>
>> > >          >>                  >>              [3]
>> > >          >> https://issues.apache.org/jira/browse/BEAM-4819
>> > >          >>
>> > >         <https://issues.apache.org/jira/browse/BEAM-4819>
>> > >          >>                  >>
>> > >          >>                  >>              [4]
>> > >          >>
>> > >          >>
>> > >
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>> > >          >>
>> > >          >>
>> > >         <
>> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go>
>> > >
>> > >          >>
>> > >          >>                  >>
>> > >          >>
>> > >          >>                 --                 Max
>> > >          >>
>> > >          >>
>> > >          >
>> > >
>> > >         --
>> > >         Max
>> > >
>> >
>> > --
>> > Max
>>
>

Re: Process JobBundleFactory for portable runner

Reply via email to