We are also interested in this ProcessJobBundleFactory, as we are planning
to fork a process to run the Python SDK harness in the Samza runner, instead
of using a Docker container, so this change will be helpful to us too. On
the same note, we are trying out portable_runner.py to submit a Python job.
It seems it will create a default Docker URL even if harness_docker_image is
set to None in the pipeline options. Shall we add another option, or honor
the None in this option, to support the process job? I made some local
changes right now to work around this.
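As a rough sketch, the workaround amounts to honoring an explicit None for the harness image by falling back to a process-based environment. The option key and URN strings below are illustrative, not the actual Beam API:

```python
# Sketch: honor None for the harness docker image option by selecting a
# process-based environment instead of defaulting to docker.
# Option and URN names here are assumptions, not the real Beam API.

def select_environment(options):
    """Pick an environment descriptor from pipeline options."""
    image = options.get("harness_docker_image")  # may be explicitly None
    if image is None:
        # No image given: run the SDK harness as a forked local process.
        return {"urn": "beam:env:process:v1",
                "payload": {"command": "sdk_harness_boot"}}
    return {"urn": "beam:env:docker:v1", "payload": {"image": image}}
```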

Thanks,
Xinyu

On Tue, Aug 21, 2018 at 12:25 PM, Henning Rohde <hero...@google.com> wrote:

> By "enum" in quotes, I meant the usual open URN-style pattern, not an
> actual enum. Sorry if that wasn't clear.
>
> On Tue, Aug 21, 2018 at 11:51 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> I would model the environment to be more free-form than enums, so that
>> we have forward-looking extensibility, and would suggest following the
>> same pattern we use for PTransforms (a URN plus a URN-specific payload).
>> Note that in this case we may want to support a list of supported
>> environments (e.g. Java, Docker, Python, ...).
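The URN-plus-payload pattern described here can be sketched as an open registry keyed by URN string rather than a closed enum. The URN spellings and factory names below are assumptions for illustration:

```python
# Illustrative sketch of the URN + URN-specific-payload pattern, mirroring
# how PTransforms are modeled: environments are identified by an open-ended
# URN string, so new kinds can be added without changing an enum.
# URN spellings here are assumed, not the real Beam constants.

DOCKER_ENV_URN = "beam:env:docker:v1"
PROCESS_ENV_URN = "beam:env:process:v1"

FACTORIES = {}

def register(urn):
    """Decorator registering a factory for an environment URN."""
    def wrapper(fn):
        FACTORIES[urn] = fn
        return fn
    return wrapper

@register(DOCKER_ENV_URN)
def docker_factory(payload):
    return "docker:" + payload["image"]

@register(PROCESS_ENV_URN)
def process_factory(payload):
    return "process:" + payload["command"]

def create_environment(env):
    # Dispatch on the URN; unknown URNs fail loudly, leaving room for
    # runners to reject environment kinds they do not support.
    return FACTORIES[env["urn"]](env["payload"])
```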
>>
>> On Tue, Aug 21, 2018 at 10:37 AM Henning Rohde <hero...@google.com>
>> wrote:
>>
>>> One thing to consider that we've talked about in the past. It might make
>>> sense to extend the environment proto and have the SDK be explicit about
>>> which kinds of environment it supports:
>>>
>>>         https://github.com/apache/beam/blob/8c4f4babc0b0d55e7bddefa3f9f9ba65d21ef139/model/pipeline/src/main/proto/beam_runner_api.proto#L969
>>>
>>> This choice might impact which files are staged or not. Some SDKs,
>>> such as Go, provide a compiled binary and _need_ to know what the target
>>> architecture is. Running on a Mac with and without Docker, say, requires
>>> a different worker in each case. If we add an "enum", we can also easily
>>> add the external idea, where the SDK/user starts the SDK harnesses
>>> instead of the runner. Each runner may not support all types of
>>> environments.
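The capability matching suggested here could, as a minimal sketch, be an intersection of what the SDK declares and what the runner provides, with an ordered preference. All URNs below are placeholders:

```python
# Minimal sketch of environment negotiation: the SDK declares which
# environment kinds it supports, the runner declares which it can provide,
# and they agree on the first mutually supported kind. URNs are placeholders.

def negotiate(sdk_supported, runner_supported, preference):
    """Return the first mutually supported environment URN, or None."""
    common = set(sdk_supported) & set(runner_supported)
    for urn in preference:
        if urn in common:
            return urn
    return None
```

A runner that does not support, say, external environments would simply omit them from its list, and negotiation falls through to the next preference.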
>>>
>>> Henning
>>>
>>> On Tue, Aug 21, 2018 at 2:52 AM Maximilian Michels <m...@apache.org>
>>> wrote:
>>>
>>>> For reference, here is the corresponding JIRA issue for this thread:
>>>> https://issues.apache.org/jira/browse/BEAM-5187
>>>>
>>>> On 16.08.18 11:15, Maximilian Michels wrote:
>>>> > Makes sense to have an option to run the SDK harness in a
>>>> > non-dockerized environment.
>>>> >
>>>> > I'm in the process of creating a Docker entry point for Flink's
>>>> > JobServer [1]. I suppose you would also prefer to execute that one
>>>> > standalone. We should make sure this is also an option.
>>>> >
>>>> > [1] https://issues.apache.org/jira/browse/BEAM-4130
>>>> >
>>>> > On 16.08.18 07:42, Thomas Weise wrote:
>>>> >> Yes, that's the proposal. Everything that would otherwise be packaged
>>>> >> into the Docker container would need to be pre-installed in the host
>>>> >> environment. In the case of the Python SDK, that could simply mean a
>>>> >> (frozen) virtual environment that was set up when the host was
>>>> >> provisioned - the SDK harness process(es) will then just utilize that.
>>>> >> Of course, this flavor of SDK harness execution could also be useful
>>>> >> in the local development environment, where right now someone who
>>>> >> already has the Python environment needs to also install Docker and
>>>> >> update a container to launch a Python SDK pipeline on the Flink
>>>> >> runner.
>>>> >>
>>>> >> On Wed, Aug 15, 2018 at 12:40 PM Daniel Oliveira
>>>> >> <danolive...@google.com> wrote:
>>>> >>
>>>> >>      I just want to clarify that I understand this correctly, since
>>>> >>      I'm not that familiar with the details behind all these execution
>>>> >>      environments yet. Is the proposal to create a new
>>>> >>      JobBundleFactory that, instead of using Docker to create the
>>>> >>      environment that the new processes will execute in, would execute
>>>> >>      the new processes directly in the host environment? So in
>>>> >>      practice, if I ran a pipeline with this JobBundleFactory, the SDK
>>>> >>      harness and runner harness would both be executing directly on my
>>>> >>      machine and would depend on me having the dependencies already
>>>> >>      present on my machine?
>>>> >>
>>>> >>      On Mon, Aug 13, 2018 at 5:50 PM Ankur Goenka
>>>> >>      <goe...@google.com> wrote:
>>>> >>
>>>> >>          Thanks for starting the discussion. I will be happy to help.
>>>> >>          I agree, we should have a pluggable SDK harness environment
>>>> >>          factory. We can register multiple environment factories using
>>>> >>          the service registry and use a PipelineOption to pick the
>>>> >>          right one on a per-job basis.
>>>> >>
>>>> >>          There are a couple of things which are required to be set up
>>>> >>          before launching the process:
>>>> >>
>>>> >>            * Setting up the environment, as done in boot.go [4]
>>>> >>            * Retrieving the artifacts and putting them in the right
>>>> >>              location
>>>> >>
>>>> >>          You can probably leverage the boot.go code to set up the
>>>> >>          environment.
>>>> >>
>>>> >>          Also, it will be useful to enumerate the pros and cons of the
>>>> >>          different environments to help users choose the right one.
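The two setup steps above could be sketched as a launcher that builds the command and environment a forked harness would need. The variable names and entry point below are hypothetical; the real contract lives in boot.go:

```python
import os

def harness_launch_spec(worker_id, control_endpoint, artifact_dir):
    """Build the command and environment for a forked SDK harness process.

    Mirrors the setup steps listed above: environment variables as the
    container boot code would set them, with artifacts assumed to already
    be staged under artifact_dir. All names here are illustrative, not the
    actual boot.go contract.
    """
    env = dict(os.environ)
    env["WORKER_ID"] = worker_id                              # assumed name
    env["CONTROL_API_SERVICE_DESCRIPTOR"] = control_endpoint  # assumed name
    cmd = ["python", "-m", "sdk_harness_main"]   # hypothetical entry point
    return {"cmd": cmd, "env": env, "cwd": artifact_dir}
```

A process-based bundle factory would then hand this spec to something like subprocess.Popen instead of launching a container.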
>>>> >>
>>>> >>
>>>> >>          On Mon, Aug 6, 2018 at 4:50 PM Thomas Weise
>>>> >>          <t...@apache.org> wrote:
>>>> >>
>>>> >>              Hi,
>>>> >>
>>>> >>              Currently, the portable Flink runner only works with SDK
>>>> >>              Docker containers for execution (DockerJobBundleFactory,
>>>> >>              besides an in-process (embedded) factory option for
>>>> >>              testing [1]). I'm considering adding another
>>>> >>              out-of-process JobBundleFactory implementation that
>>>> >>              directly forks the processes on the task manager host,
>>>> >>              eliminating the need for Docker. This would work
>>>> >>              reasonably well in environments where the dependencies
>>>> >>              (in this case Python) can easily be tied into the host
>>>> >>              deployment (also within an application-specific
>>>> >>              Kubernetes pod).
>>>> >>
>>>> >>              There was already some discussion about alternative
>>>> >>              JobBundleFactory implementations in [2]. There is also a
>>>> >>              JIRA to make the bundle factory pluggable [3], pending
>>>> >>              the availability of runner-level options.
>>>> >>
>>>> >>              For a "ProcessBundleFactory", in addition to the Python
>>>> >>              dependencies, the environment would also need the Go boot
>>>> >>              executable [4] (or a substitute thereof) to perform the
>>>> >>              harness initialization.
>>>> >>
>>>> >>              Is anyone else interested in this SDK execution option,
>>>> >>              or has anyone already investigated an alternative
>>>> >>              implementation?
>>>> >>
>>>> >>              Thanks,
>>>> >>              Thomas
>>>> >>
>>>> >>              [1]
>>>> >>              https://github.com/apache/beam/blob/7958a379b0a37a89edc3a6ae4b5bc82fda41fcd6/runners/flink/src/test/java/org/apache/beam/runners/flink/PortableExecutionTest.java#L83
>>>> >>
>>>> >>              [2]
>>>> >>              https://lists.apache.org/thread.html/d6b6fde764796de31996db9bb5f9de3e7aaf0ab29b99d0adb52ac508@%3Cdev.beam.apache.org%3E
>>>> >>
>>>> >>              [3] https://issues.apache.org/jira/browse/BEAM-4819
>>>> >>
>>>> >>              [4] https://github.com/apache/beam/blob/master/sdks/python/container/boot.go
>>>> >>
>>>>
>>>> --
>>>> Max
>>>>
>>>
