There are indeed lots of possibilities for interesting docker alternatives
with different tradeoffs and capabilities, but in general both the runner
and the SDK must support them for it to work. As mentioned, docker
(as used in the container contract) is meant as a flexible main option but
not necessarily the only option. I see no problem with certain
pipeline-SDK-runner combinations additionally supporting a specialized
setup. The pipeline can be a factor, because some transforms might depend
on aspects of the runtime environment -- such as system libraries or
shelling out to a /bin/foo.
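As a hypothetical illustration (not actual Beam code; the binary path and
function name are made up), a transform that shells out only works if the
external binary exists in the worker's runtime environment -- which is
exactly the kind of dependency the container contract is meant to capture:

```python
import shutil
import subprocess

def process_element(element, binary="/bin/echo"):
    # A transform like this implicitly depends on the worker's runtime
    # environment: the external binary must be present on the worker host
    # or inside the worker container, or the pipeline fails at runtime.
    if shutil.which(binary) is None:
        raise RuntimeError(
            "required binary %r not found in worker environment" % binary)
    result = subprocess.run(
        [binary, str(element)], capture_output=True, text=True, check=True)
    return result.stdout.strip()
```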

The worker boot code is tied to the current container contract, so
pre-launched workers would presumably not use that code path and would not
be bound by its assumptions. In particular, such a setup might want to
invert who initiates the connection between the SDK worker and the runner.
Pipeline options and global state in the SDK and user-function process
might make it difficult to safely reuse worker processes across pipelines,
but it is doable in certain scenarios.
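A minimal sketch of that inversion (plain sockets standing in for the gRPC
control channel; the function names are illustrative, not Beam APIs): the
pre-launched worker binds first and advertises its endpoint, and the runner
dials it once the runner's side is up -- rather than the worker phoning home
to an endpoint fixed at launch time:

```python
import socket

def start_worker_listener(host="127.0.0.1", port=0):
    # Pre-launched worker: bind and listen before any runner exists.
    # port=0 lets the OS pick a free port; the chosen endpoint is what
    # the worker would advertise (e.g. via a sidecar registration).
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, port))
    server.listen(1)
    return server, server.getsockname()

def runner_connects(endpoint):
    # Runner side: dial the already-running worker once the runner's own
    # dynamically established endpoints are known.
    conn = socket.create_connection(endpoint)
    conn.sendall(b"control-plane hello")
    return conn
```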

Henning

On Tue, May 8, 2018 at 3:51 PM Thomas Weise <t...@apache.org> wrote:

>
>
> On Sat, May 5, 2018 at 3:58 PM, Robert Bradshaw <rober...@google.com>
> wrote:
>
>>
>> I would welcome changes to
>>
>> https://github.com/apache/beam/blob/v2.4.0/model/pipeline/src/main/proto/beam_runner_api.proto#L730
>> that would provide alternatives to docker (one of which comes to mind is
>> "I already brought up a worker(s) for you (which could be the same process
>> that handled pipeline construction in testing scenarios), here's how to
>> connect to it/them.") Another option, which would seem to appeal to you in
>> particular, would be "the worker code is linked into the runner's binary,
>> use this process as the worker" (though note even for java-on-java, it can
>> be advantageous to shield the worker and runner code from each other's
>> environments, dependencies, and version requirements.) This latter should
>> still likely use the FnApi to talk to itself (either over GRPC on local
>> ports, or possibly better via direct function calls eliminating the RPC
>> overhead altogether--this is how the fast local runner in Python works).
>> There may be runner environments well controlled enough that "start up the
>> workers" could be specified as "run this command line." We should make this
>> environment message extensible to other alternatives than "docker container
>> url," though of course we don't want the set of options to grow too large
>> or we lose the promise of portability unless every runner supports every
>> protocol.
>>
>>
> The pre-launched worker would be an interesting option, which might work
> well for a sidecar deployment.
>
> The current worker boot code though makes the assumption that the runner
> endpoint to phone home to is known when the process is launched. That
> doesn't work so well with a runner that establishes its endpoint
> dynamically. Also, the assumption is baked in that a worker will only serve
> a single pipeline (provisioning API etc.).
>
> Thanks,
> Thomas
>
>
