Re: Environments for External Transforms

Lukasz Cwik Fri, 24 May 2019 14:49:34 -0700

Dataflow has been heading in a similar direction: it is trying to get rid of
the driver program running on the user's machine. If you can get the
expansion service to launch and run an environment to perform the expansion,
you could also get it to create and submit a job, returning data about the
running job.


On Thu, May 23, 2019 at 7:47 AM Thomas Weise <[email protected]> wrote:

>
>
> On Thu, May 23, 2019 at 3:46 AM Maximilian Michels <[email protected]> wrote:
>
>> > Writing a new transform involves updating the expansion service to
>> > include their new transform.
>>
>> Would it be conceivable that the expansion is performed via the
>> environment? That would solve the problem of updating the expansion
>> service, although it adds additional complexity for bringing up the
>> environment.
>>
>>
> Which environment would be used to perform the expansion? I think this is
> an interesting option, as long as it does not introduce a hard dependency
> on docker.
>
>
>> On 23.05.19 11:31, Robert Bradshaw wrote:
>> > On Wed, May 22, 2019 at 6:17 PM Maximilian Michels <[email protected]> wrote:
>> >
>> >     Hi,
>> >
>> >     Robert and I were discussing user-specified environments for
>> >     external transforms [1]. We couldn't decide whether users should
>> >     have direct control over the environment when they use an external
>> >     transform in their pipeline.
>> >
>> >     In my mind, it is quite natural that the Expansion Service is a
>> >     long-running service that gets started with a list of available
>> >     environments.
>> >
>> >
>> > +1.
>> >
>> > IMHO, the expansion service should be expected to provide valid
>> > environments for the transforms it vendors. Removing this expectation
>> > seems wrong. Making it cheap to specify non-default dependencies
>> > without building (publishing, etc.) a docker image is probably key to
>> > making this work well (and also allowing more powerful environment
>> > introspection).
>> >
>> >     Such a list can be outdated and users may write transforms
>> >     for a new environment they want to use in their pipeline.
>> >
>> >
>> > This is the part that I'm having trouble following. Writing a new
>> > transform involves updating the expansion service to include their new
>> > transform. The author of a transform (in other words, the one who
>> > defines its expansion and implementation) is in the position to name
>> > its dependencies, etc., and the user of the transform (the one invoking
>> > it) is generally not in a good position to know what environments
>> > would be valid.
>> >
>> >     The easiest
>> >     way would be to allow passing the environment with the transform.
>> >
>> >
>> > What this allows is using existing transforms in new environments.
>> > There are possibly some use cases for this, e.g. expansion of a given
>> > transform may be compatible with either version X or version Y of a
>> > library, left up to the discretion of the caller, but I think that
>> > this is really just a deficiency in our environment specifications
>> > (e.g. one should be able to express this flexibility in the returned
>> > environment).
>> >
>> >     Note
>> >     that we already give users control over the "main" environment via
>> >     the PortablePipelineOptions, so this wouldn't be an entirely new
>> >     concept.
>> >
>> >
>> > Yes, the author of a pipeline/transform chooses the environment in
>> > which those transforms execute.
>> >
>> >     The contrary position is that the Expansion Service should have full
>> >     control over which environment is chosen. Going back to the
>> >     discussion about artifact staging [2], this could enable more
>> >     optimizations, such as merging environments or detecting conflicts.
>> >     However, this only works if this information has been provided
>> >     upfront to the Expansion Service. It wouldn't be impossible to
>> >     provide these hints alongside the environment, as suggested in the
>> >     previous paragraph.
>> >
>> >     Any opinions? Should we allow users to optionally specify an
>> >     environment
>> >     for external transforms?
>> >
>> >     Thanks,
>> >     Max
>> >
>> >     [1] https://github.com/apache/beam/pull/8639
>> >     [2] https://lists.apache.org/thread.html/6fcee7047f53cf1c0636fb65367ef70842016d57effe2e5795c4137d@%3Cdev.beam.apache.org%3E
>> >
>>
>
