Sorry for posting on a separate thread. Let's continue the discussion here.
+1 for having a URN to identify the environment type. I think a URN is better
than a 'oneof' structure as it's more flexible and forward compatible.
On Fri, Aug 31, 2018 at 9:14 AM Thomas Weise <t...@apache.org> wrote:
FYI the first part of support for a (direct) process-based job bundle
factory was merged (thanks Max!):
https://github.com/apache/beam/pull/6287
On top of that I have built a customization that runs the Python SDK
worker directly:
https://github.com/lyft/beam/blob/release-2.8.0-lyft/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/control/LyftProcessJobBundleFactory.java
While doing that, I thought it would be nice to support a custom
environment (along with the other options we already discussed) that
allows users to hook in their own extensions without having to do
stuff like this:
https://github.com/lyft/beam/pull/6/files#diff-b74ff692340bcae0032d119a7192624cR61
Thanks,
Thomas
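
Riffing on Thomas's point about a custom hook: one possible shape (purely a sketch, not an existing Beam API) would be a small provider interface keyed by the environment URN and discovered via ServiceLoader, so a runner never has to hard-code a particular JobBundleFactory. The interface name and method signatures below are made up for illustration, and it assumes the Environment proto gains the urn/payload fields proposed further down in this thread:

import java.util.ServiceLoader;
import org.apache.beam.model.pipeline.v1.RunnerApi;
import org.apache.beam.runners.fnexecution.control.JobBundleFactory;

/** Hypothetical SPI: one provider per environment urn (docker, process, custom, ...). */
public interface EnvironmentFactoryProvider {
  /** The environment urn this provider handles, e.g. "beam:env:process:v1". */
  String getUrn();

  /** Creates the bundle factory that will launch SDK harnesses for this environment. */
  JobBundleFactory createBundleFactory(RunnerApi.Environment environment) throws Exception;
}

/** Sketch of runner-side resolution via the service registry. */
class EnvironmentFactories {
  static JobBundleFactory forEnvironment(RunnerApi.Environment environment) throws Exception {
    for (EnvironmentFactoryProvider provider :
        ServiceLoader.load(EnvironmentFactoryProvider.class)) {
      if (provider.getUrn().equals(environment.getUrn())) {
        return provider.createBundleFactory(environment);
      }
    }
    throw new IllegalArgumentException("No provider for " + environment.getUrn());
  }
}

Something like that would let a factory such as the Lyft one above plug in without patching the runner.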
On Mon, Aug 27, 2018 at 2:43 AM Robert Bradshaw <rober...@google.com> wrote:
On Mon, Aug 27, 2018 at 11:23 AM Maximilian Michels <m...@apache.org> wrote:
>
> Thanks for your proposal Henning. +1 for explicit environment messages.
> I'm not sure how important it is to support cross-platform pipelines. I
> can foresee future use but I wouldn't consider it essential.
One may want to execute TensorFlow pipeline segments from the middle
of a Java pipeline, or leverage the SQL code (currently implemented in
Java) from Python or Go. In my experience, a common pipeline shape is
to do a significant amount of (often fairly trivial) filtering at the
front of a pipeline, and sophisticated analysis at the end, and the
tradeoffs of execution efficiency vs. expressiveness and prototype
friendliness are different in these two halves of the same pipeline.
> However, it basically comes for free if we extend the existing environment
> information for ExecutableStage. The overhead, as you said, is negligible.
+1. Also it should be noted that the runner is ideally not restricted
to this set of environments; if it understands the URN it can use
whatever environment it finds appropriate. This could be especially
useful for an optimal choice of environment to avoid unneeded fusion
barriers (e.g. the trivial Count transform has a pair-with-one before
the GBK and a sum-values after the GBK, and assuming those operations
are available in nearly every environment it would be preferable to
choose the variant according to what precedes/follows it to allow
fusion, or possibly even embed the operation).
> Also agree that artifact staging is important even with process-based
> execution. The execution environment might be managed externally but we
> still want to be able to execute new pipelines without copying over
> required artifacts. That said, a first version could come without
> artifact staging.
One of the parameters passed to the script is the staging endpoint,
which it can use to do all the staging itself, so this could also be
internal to the script. I can however see wanting the ability to stage
artifacts more cheaply (e.g. symlinks), and this is actually also the
case with the docker environment (e.g. mount points).
> On 23.08.18 18:14, Henning Rohde wrote:
> > A process-based SDK harness does not IMO imply that the host is fully
> > provisioned by the SDK/user, and invoking the user command line in the
> > context of the staged files is a critical aspect for it to work. So I
> > consider staged artifact support needed. Also, I would like to suggest
> > that we move to a concrete environment proto to crystallize what is
> > actually being proposed. I'm not sure what activating a virtualenv would
> > look like, for example. To start things off:
> >
> > message Environment {
> >   string urn = 1;
> >   bytes payload = 2;
> > }
> >
> > // urn == "beam:env:docker:v1"
> > message DockerPayload {
> >   string container_image = 1;  // implicitly linux_amd64.
> > }
> >
> > // urn == "beam:env:process:v1"
> > message ProcessPayload {
> >   string os = 1;    // "linux", "darwin", ..
> >   string arch = 2;  // "amd64", ..
> >   string command_line = 3;
> > }
> >
> > // urn == "beam:env:external:v1"
> > // (no payload)
> >
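
(To make this concrete on the runner side: a minimal dispatch sketch, assuming the messages above are generated into Java classes. DockerPayload/ProcessPayload and the urn/payload fields don't exist in the model yet, so treat the names as illustrative.)

import com.google.protobuf.InvalidProtocolBufferException;
import org.apache.beam.model.pipeline.v1.RunnerApi;

class EnvironmentDispatch {
  /** Illustrative dispatch on the proposed environment URNs. */
  static void startSdkHarness(RunnerApi.Environment env) throws InvalidProtocolBufferException {
    switch (env.getUrn()) {  // urn/payload fields as proposed above
      case "beam:env:docker:v1": {
        DockerPayload docker = DockerPayload.parseFrom(env.getPayload());
        // launch the container image docker.getContainerImage(), as the runner does today
        break;
      }
      case "beam:env:process:v1": {
        ProcessPayload process = ProcessPayload.parseFrom(env.getPayload());
        // fork process.getCommandLine() in a directory rooted at the staged artifacts
        break;
      }
      case "beam:env:external:v1":
        // start nothing; wait for a user-controlled SDK harness to connect
        break;
      default:
        throw new IllegalArgumentException("Unsupported environment urn: " + env.getUrn());
    }
  }
}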
> > A runner may support any subset and reject any unsupported
> > configuration. There are 3 kinds of environments that I think are useful:
> > (1) docker: works as currently. Offers the most flexibility for SDKs
> > and users, especially when the runner is outside the user's control
> > (such as hosted runners). The runner starts the SDK harnesses.
> > (2) process: as discussed here. The runner starts the SDK harnesses.
> > The semantics are that the shell command line is invoked in a directory
> > rooted in the staged artifacts, with the container contract arguments. It
> > is up to the user and runner deployment to ensure that it makes sense,
> > i.e., on Windows a Linux binary or bash script is not specified.
> > Executing the user command in a shell env (bash, zsh, cmd, ..) ensures
> > that paths and so on are set up, i.e., specifying "java -jar foo" would
> > actually work. Useful for cases where the user controls both the SDK and
> > runner (such as locally) or when docker is not an option. Intended to be
> > minimal and SDK/language agnostic.
> > (3) external: this is what I think Robert was alluding to. The runner
> > does not start any SDK harnesses. Instead it waits for user-controlled
> > SDK harnesses to connect. Useful for manually debugging SDK code
> > (connect from code running in a debugger) or when the user code must run
> > in a special or privileged environment. It's runner-specific how the SDK
> > will need to connect.
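
(For option (2) above, the invocation could look roughly like the sketch below: a ProcessBuilder that runs the configured command line in the staged-artifact directory and appends the provisioning/control endpoints. The flag names mirror what the Python container's boot.go accepts today, but the exact contract here is still up for discussion, so this is only a sketch.)

import java.io.File;
import java.io.IOException;

class ProcessEnvironmentLauncher {
  /** Rough sketch of forking the user command per the process environment proposal. */
  static Process launch(
      String commandLine,      // ProcessPayload.command_line
      File stagedArtifactDir,  // "rooted in the staged artifacts"
      String workerId,
      String loggingEndpoint,
      String provisionEndpoint,
      String artifactEndpoint,
      String controlEndpoint)
      throws IOException {
    String fullCommand =
        commandLine
            + " --id=" + workerId
            + " --logging_endpoint=" + loggingEndpoint
            + " --provision_endpoint=" + provisionEndpoint
            + " --artifact_endpoint=" + artifactEndpoint
            + " --control_endpoint=" + controlEndpoint;
    return new ProcessBuilder("sh", "-c", fullCommand)  // shell so PATH etc. are set up
        .directory(stagedArtifactDir)
        .inheritIO()  // surface harness logs in the runner's output
        .start();
  }
}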
> >
> > Part of the idea of placing this information in the environment is that
> > pipelines can potentially use multiple, such as cross-windows/linux.
> >
> > Henning
> >
> > On Thu, Aug 23, 2018 at 6:44 AM Thomas Weise <t...@apache.org> wrote:
> >
> > I would see support for staging libraries as optional / nice to have
> > since that can also be done as part of host provisioning (i.e. in
> > the Python case a virtual environment was already set up and just
> > needs to be activated).
> >
> > Depending on how the command that launches the harness is
> > configured, additional steps such as virtualenv activation or setting
> > of other environment variables can be included as well.
> >
> >
> > On Thu, Aug 23, 2018 at 5:15 AM Maximilian Michels <m...@apache.org> wrote:
> >
> > Just to recap:
> >
> > From this and the other thread ("Bootstraping Beam's Job Server") we
> > got sufficient evidence that process-based execution is a desired feature.
> >
> > Process-based execution as an alternative to dockerized execution:
> > https://issues.apache.org/jira/browse/BEAM-5187
> >
> > Which parts are executed as a process?
> > => The SDK harness for user code
> >
> > What configuration options are supported?
> > => Provide information about the target architecture (OS/CPU)
> > => Staging libraries, as also supported by Docker
> > => Activating a pre-existing environment (e.g. virtualenv)
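
(As an illustration of the last recap point, "activating a pre-existing environment" could be expressed entirely through the proposed process payload's command_line, e.g. wrapping the Python SDK worker in a shell that sources a virtualenv. A hedged sketch, assuming the proposed proto messages exist; the venv path is just an example, not a convention.)

import org.apache.beam.model.pipeline.v1.RunnerApi;

class VirtualenvEnvironmentExample {
  /** Builds a process environment that activates a pre-provisioned virtualenv. */
  static RunnerApi.Environment virtualenvProcessEnvironment() {
    // ProcessPayload as sketched earlier in the thread; not yet part of the model.
    ProcessPayload payload =
        ProcessPayload.newBuilder()
            .setOs("linux")
            .setArch("amd64")
            .setCommandLine(
                "bash -c 'source /opt/venvs/beam/bin/activate"  // example path
                    + " && python -m apache_beam.runners.worker.sdk_worker_main'")
            .build();
    return RunnerApi.Environment.newBuilder()
        .setUrn("beam:env:process:v1")  // proposed urn
        .setPayload(payload.toByteString())
        .build();
  }
}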
> >
> >
> > On 23.08.18 14:13, Maximilian Michels wrote:
> > >> One thing to consider that we've talked about in the past. It might
> > >> make sense to extend the environment proto and have the SDK be
> > >> explicit about which kinds of environment it supports.
> > >
> > > +1 Encoding environment information there is a good idea.
> > >
> > >> Seems it will create a default docker url even if the
> > >> harness_docker_image is set to None in pipeline options. Shall we add
> > >> another option or honor the None in this option to support the process
> > >> job?
> > >
> > > Yes, if no Docker image is set the default one will be used. Currently
> > > Docker is the only way to execute pipelines with the PortableRunner. If
> > > the docker_image is not set, execution won't succeed.
> > >
> > > On 22.08.18 22:59, Xinyu Liu wrote:
> > >> We are also interested in this Process JobBundleFactory as we are
> > >> planning to fork a process to run the Python SDK in the Samza runner,
> > >> instead of using a docker container. So this change will be helpful
> > >> to us too.
> > >> On the same note, we are trying out portable_runner.py to submit a
> > >> python job. Seems it will create a default docker url even if the
> > >> harness_docker_image is set to None in pipeline options. Shall we add
> > >> another option or honor the None in this option to support the process
> > >> job? I made some local changes right now to work around this.
> > >>
> > >> Thanks,
> > >> Xinyu
> > >>
> > >> On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <hero...@google.com> wrote:
> > >>
> > >> By "enum" in quotes, I meant the usual open
URN style
> > pattern not an
> > >> actual enum. Sorry if that wasn't clear.
> > >>
> > >> On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lc...@google.com> wrote:
> > >>
> > >> I would model the environment to be more free-form than enums
> > >> such that we have forward-looking extensibility, and would
> > >> suggest following the same pattern we use on PTransforms (using
> > >> a URN and a URN-specific payload). Note that in this case we
> > >> may want to support a list of supported environments (e.g. java,
> > >> docker, python, ...).
> > >>
> > >> On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde <hero...@google.com> wrote:
> > >>
> > >> One thing to consider that we've talked about in the past.
> > >> It might make sense to extend the environment proto and have
> > >> the SDK be explicit about which kinds of environment it
> > >> supports:
> > >>
> > >> https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
> > >>
> > >> This choice might impact what files are staged or not.
> > >> Some SDKs, such as Go, provide a compiled binary and _need_
> > >> to know what the target architecture is. Running on a Mac
> > >> with and without docker, say, requires a different worker in
> > >> each case. If we add an "enum", we can also easily add the
> > >> external idea where the SDK/user starts the SDK harnesses
> > >> instead of the runner. Each runner may not support all types
> > >> of environments.
> > >>
> > >> Henning
> > >>
> > >> On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels <m...@apache.org> wrote:
> > >>
> > >> For reference, here is the corresponding JIRA issue for this thread:
> > >> https://issues.apache.org/jira/browse/BEAM-5187
> > >>
> > >> On 16.08.18 11:15, Maximilian Michels wrote:
> > >> > Makes sense to have an option to run the SDK harness
> > >> > in a non-dockerized environment.
> > >> >
> > >> > I'm in the process of creating a Docker entry point for Flink's
> > >> > JobServer [1]. I suppose you would also prefer to execute that one
> > >> > standalone. We should make sure this is also an option.
> > >> >
> > >> > [1] https://issues.apache.org/jira/browse/BEAM-4130
> > >> >
> > >> > On 16.08.18 07:42, Thomas Weise wrote:
> > >> >> Yes, that's the proposal. Everything that would otherwise be packaged
> > >> >> into the Docker container would need to be pre-installed in the host
> > >> >> environment. In the case of the Python SDK, that could simply mean a
> > >> >> (frozen) virtual environment that was set up when the host was
> > >> >> provisioned - the SDK harness process(es) will then just utilize that.
> > >> >> Of course this flavor of SDK harness execution could also be useful in
> > >> >> the local development environment, where right now someone who already
> > >> >> has the Python environment needs to also install Docker and update a
> > >> >> container to launch a Python SDK pipeline on the Flink runner.
> > >> >>
> > >> >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira <danolive...@google.com> wrote:
> > >> >>
> > >> >> I just want to clarify that I understand this correctly since I'm
> > >> >> not that familiar with the details behind all these execution
> > >> >> environments yet. Is the proposal to create a new JobBundleFactory
> > >> >> that, instead of using Docker to create the environment that the new
> > >> >> processes will execute in, would execute the new processes directly
> > >> >> in the host environment? So in practice if I ran a pipeline with this
> > >> >> JobBundleFactory, the SDK Harness and Runner Harness would both be
> > >> >> executing directly on my machine and would depend on me having the
> > >> >> dependencies already present on my machine?
> > >> >>
> > >> >> On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka <goe...@google.com> wrote:
> > >> >>
> > >> >> Thanks for starting the discussion. I will be happy to help.
> > >> >> I agree, we should have a pluggable SDKHarness environment factory.
> > >> >> We can register multiple environment factories using the service
> > >> >> registry and use the PipelineOption to pick the right one on a
> > >> >> per-job basis.
> > >> >>
> > >> >> There are a couple of things which are required to set up before
> > >> >> launching the process:
> > >> >>
> > >> >> * Setting up the environment as done in boot.go [4]
> > >> >> * Retrieving and putting the artifacts in the right location.
> > >> >>
> > >> >> You can probably leverage the boot.go code to set up the environment.
> > >> >>
> > >> >> Also, it will be useful to enumerate the pros and cons of the
> > >> >> different environments to help users choose the right one.
> > >> >>
> > >> >>
> > >> >> On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise <t...@apache.org> wrote:
> > >> >>
> > >> >> Hi,
> > >> >>
> > >> >> Currently the portable Flink runner only works with SDK
> > >> >> Docker containers for execution (DockerJobBundleFactory,
> > >> >> besides an in-process (embedded) factory option for testing
> > >> >> [1]). I'm considering adding another out-of-process
> > >> >> JobBundleFactory implementation that directly forks the
> > >> >> processes on the task manager host, eliminating the need for
> > >> >> Docker. This would work reasonably well in environments
> > >> >> where the dependencies (in this case Python) can easily be
> > >> >> tied into the host deployment (also within an application-
> > >> >> specific Kubernetes pod).
> > >> >>
> > >> >> There was already some discussion about alternative
> > >> >> JobBundleFactory implementations in [2]. There is also a JIRA
> > >> >> to make the bundle factory pluggable [3], pending the
> > >> >> availability of runner-level options.
> > >> >>
> > >> >> For a "ProcessBundleFactory", in addition to the Python
> > >> >> dependencies the environment would also need to have the Go
> > >> >> boot executable [4] (or a substitute thereof) to perform the
> > >> >> harness initialization.
> > >> >>
> > >> >> Is anyone else interested in this SDK execution option or
> > >> >> has already investigated an alternative implementation?
> > >> >>
> > >> >> Thanks,
> > >> >> Thomas
> > >> >>
> > >> >> [1] https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
> > >> >>
> > >> >> [2] https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
> > >> >>
> > >> >> [3] https://issues.apache.org/jira/browse/BEAM-4819
> > >> >>
> > >> >> [4] https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
> > >> >>
> > >>
> > >> -- Max
> > >>
> > >>
> > >
> >
> > --
> > Max
> >
>
> --
> Max