On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath <chamik...@google.com> wrote:
> Thanks for raising the concern about credentials, Ankur. I agree that this
> is a significant issue.
>
> On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik <lc...@google.com> wrote:
>
>> I can understand the concern about credentials; the same access concern
>> will exist for several cross-language transforms (mostly IOs), since some
>> will need access to credentials to read/write to an external service.
>>
>> Are there any ideas on how credential propagation could work for these
>> IOs?
>>
>
> There are some cases where existing IO transforms need credentials to
> access remote resources, for example, for size estimation, validation,
> etc., but usually these are optional (or the transform can be configured
> to not perform these functions).
> To clarify, I'm only talking about transform expansion here. Many IO
> transforms need read/write access to remote services at run time, so we
> probably need to figure out a way to propagate these credentials anyway.
>
>
>> Can we use these mechanisms for staging?
>>
>
> I think we'll have to find a way to do one of: (1) propagate credentials
> to other SDKs, (2) allow users to configure SDK containers to have the
> necessary credentials, or (3) do the artifact staging from the pipeline
> SDK environment, which already has the credentials. I prefer (1) or (2),
> since this will give a transform the same feature set whether used
> directly (in the same SDK language as the transform) or remotely, but it
> might be hard to do this for an arbitrary service that a transform might
> connect to, considering the number of ways users can configure
> credentials (after an offline discussion with Ankur).
>
>
>> On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka <goe...@google.com> wrote:
>>
>>> I agree that the expansion service knows about the artifacts required
>>> for a cross-language transform, and having a prepackaged folder/zip for
>>> transforms based on language makes sense.
>>>
>>> One thing to note here is that the expansion service might not have the
>>> same access privileges as the pipeline author and hence might not be
>>> able to stage artifacts by itself.
>>> Keeping this in mind, I am leaning towards making the expansion service
>>> provide all the required artifacts to the user and letting the user
>>> stage the artifacts as regular artifacts.
>>> At this time, we only have Beam File System based artifact staging,
>>> which uses local credentials to access different file systems. Even a
>>> docker-based expansion service running on a local machine might not
>>> have the same access privileges.
>>>
>>> In brief, this is what I am leaning toward:
>>> user calls for pipeline submission -> expansion service provides
>>> cross-language transforms and relevant artifacts to the SDK -> SDK
>>> submits the pipeline to the job server and stages user and
>>> cross-language artifacts to the artifact staging service.
>>>
>>>
>>> On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath
>>> <chamik...@google.com> wrote:
>>>
>>>> On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>>> Note that Max did ask whether making the expansion service do the
>>>>> staging made sense, and my first line was agreeing with that
>>>>> direction and expanding on how it could be done (so this is really
>>>>> Max's idea, or from whomever he got the idea from).
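As a concrete reading of the flow Ankur sketches above (user submits ->
expansion service reports artifacts -> SDK stages them with its own
credentials), here is a minimal Python sketch; every name in it is
hypothetical:

    # Sketch only: "expansion_service" and "job_server" stand in for client
    # stubs of the expansion and job/artifact-staging services, and
    # "response.artifacts" is a hypothetical field listing the jars the
    # external transform needs at runtime.
    def submit_with_external_artifacts(pipeline_proto, user_artifacts,
                                       expansion_service, job_server):
        response = expansion_service.expand(pipeline_proto)
        # The SDK runs with the pipeline author's credentials, so it can
        # stage both its own artifacts and the ones the expansion service
        # reported, sidestepping the credential problem on the service side.
        staging_token = job_server.stage_artifacts(
            list(user_artifacts) + list(response.artifacts))
        return job_server.run(response.pipeline, staging_token)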
>>>>
>>>> +1 to what Max said then :)
>>>>
>>>>
>>>>> I believe a lot of the value of the expansion service is in not
>>>>> having users need to be aware of all the SDK-specific dependencies
>>>>> when they are trying to create a pipeline; only the "user" who is
>>>>> launching the expansion service may need to be. And in that case we
>>>>> can have a prepackaged expansion service application that does what
>>>>> most users would want (e.g. the expansion service as a docker
>>>>> container, a single bundled jar, ...). We (the Apache Beam
>>>>> community) could choose to host a default implementation of the
>>>>> expansion service as well.
>>>>
>>>> I'm not against this, but I think this is a secondary, more advanced
>>>> use case. For a Beam user that needs to use a Java transform that
>>>> they already have in a Python pipeline, we should provide a way to
>>>> allow starting up an expansion service (with the dependencies needed
>>>> for that) and running a pipeline that uses this external Java
>>>> transform (with the dependencies that are needed at runtime).
>>>> Probably it'll be enough to allow providing all dependencies when
>>>> starting up the expansion service and to allow the expansion service
>>>> to do the staging of jars as well. I don't see a need to include the
>>>> list of jars in the ExpansionResponse sent to the Python SDK.
>>>>
>>>>
>>>>> On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath
>>>>> <chamik...@google.com> wrote:
>>>>>
>>>>>> I think there are two kinds of dependencies we have to consider:
>>>>>>
>>>>>> (1) Dependencies that are needed to expand the transform.
>>>>>>
>>>>>> These have to be provided when we start the expansion service so
>>>>>> that the available external transforms are correctly registered
>>>>>> with the expansion service.
>>>>>>
>>>>>> (2) Dependencies that are not needed at expansion but may be needed
>>>>>> at runtime.
>>>>>>
>>>>>> I think in both cases users have to provide these dependencies,
>>>>>> either when the expansion service is started or when a pipeline is
>>>>>> being executed.
>>>>>>
>>>>>> Max, I'm not sure why the expansion service would need to provide
>>>>>> dependencies to the user, since the user will already be aware of
>>>>>> these. Are you talking about an expansion service that is readily
>>>>>> available and will be used by many Beam users? I think such a
>>>>>> (possibly long-running) service will have to maintain a repository
>>>>>> of transforms and should have mechanisms for registering new
>>>>>> transforms, discovering already registered transforms, etc. I think
>>>>>> there's more design work needed to make the transform expansion
>>>>>> service support such use cases. Currently, I think allowing the
>>>>>> pipeline author to provide the jars when starting the expansion
>>>>>> service and when executing the pipeline will be adequate.
>>>>>>
>>>>>> Regarding the entity that will perform the staging, I like Luke's
>>>>>> idea of allowing the expansion service to do the staging (of jars
>>>>>> provided by the user). The notion of artifacts and how they are
>>>>>> extracted/represented is SDK-dependent, so if the pipeline SDK
>>>>>> tries to do this we have to add n x (n - 1) configurations (for n
>>>>>> SDKs).
>>>>>>
>>>>>> - Cham
>>>>>>
>>>>>> On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik <lc...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> We can expose the artifact staging endpoint and artifact token to
>>>>>>> allow the expansion service to upload any resources its
>>>>>>> environment may need. For example, the expansion service for the
>>>>>>> Beam Java SDK would be able to upload jars.
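A minimal sketch of this alternative, where the expansion service is
handed the staging endpoint and token and uploads the jars itself; all
names here are hypothetical stand-ins for the real expansion,
dependency-lookup, and upload calls:

    # Hypothetical sketch: the expansion service knows which jars back the
    # transform, so it stages them directly instead of reporting them to
    # the SDK. "expand_fn", "jars_for_fn" and "upload_fn" are placeholders.
    def expand_and_stage(transform_proto, staging_endpoint, staging_token,
                         expand_fn, jars_for_fn, upload_fn):
        expanded = expand_fn(transform_proto)
        for jar in jars_for_fn(transform_proto):
            upload_fn(staging_endpoint, staging_token, jar)
        # The SDK receiving this response never needs to know about the jars.
        return expanded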
>>>>>>>
>>>>>>> In the "docker" environment, the Apache Beam Java SDK harness
>>>>>>> container would fetch the relevant artifacts for itself and be
>>>>>>> able to execute the pipeline. (Note that a docker environment
>>>>>>> could skip all this artifact staging if the docker environment
>>>>>>> contained all the necessary artifacts.)
>>>>>>>
>>>>>>> The existing "external" environment should already come with all
>>>>>>> the resources prepackaged wherever "external" points to. The
>>>>>>> "process"-based environment could choose to use the artifact
>>>>>>> staging service to fetch the resources associated with its
>>>>>>> process, or it could follow the same pattern as "external" and
>>>>>>> already contain all the prepackaged resources. Note that both
>>>>>>> "external" and "process" will require the instance of the
>>>>>>> expansion service to be specialized for those environments, which
>>>>>>> is why the default for the expansion service should be the
>>>>>>> "docker" environment.
>>>>>>>
>>>>>>> Note that a major reason for going with docker containers as the
>>>>>>> environment that all runners should support is that containers
>>>>>>> provide a solution for this exact issue. Both the "process" and
>>>>>>> "external" environments are explicitly limiting, and expanding
>>>>>>> their capabilities will quickly have us building something like a
>>>>>>> docker container, because we'll quickly find ourselves solving the
>>>>>>> same problems that docker containers already solve (resources,
>>>>>>> file layout, permissions, ...).
>>>>>>>
>>>>>>> On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels
>>>>>>> <m...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> We have previously merged support for configuring transforms
>>>>>>>> across languages. Please see Cham's summary of the discussion
>>>>>>>> [1]. There is also a design document [2].
>>>>>>>>
>>>>>>>> Subsequently, we've added wrappers for cross-language transforms
>>>>>>>> to the Python SDK, i.e. GenerateSequence, ReadFromKafka, and
>>>>>>>> there is a pending PR [3] for WriteToKafka. All of them utilize
>>>>>>>> Java transforms via cross-language configuration.
>>>>>>>>
>>>>>>>> That is all pretty exciting :)
>>>>>>>>
>>>>>>>> We still have some issues to solve, one being how to stage
>>>>>>>> artifacts from a foreign environment. When we run external
>>>>>>>> transforms which are part of Beam's core (e.g. GenerateSequence),
>>>>>>>> we have them available in the SDK Harness. However, when they are
>>>>>>>> not (e.g. KafkaIO), we need to stage the necessary files.
>>>>>>>>
>>>>>>>> For my PR [3] I've naively added ":beam-sdks-java-io-kafka" to
>>>>>>>> the SDK Harness, which caused dependency problems [4]. Those
>>>>>>>> could be resolved, but the bigger question is how to stage
>>>>>>>> artifacts for external transforms programmatically.
>>>>>>>>
>>>>>>>> Heejong has solved this by adding a "--jar_package" option to the
>>>>>>>> Python SDK to stage Java files [5]. I think that is a better
>>>>>>>> solution than adding the required jars to the SDK Harness
>>>>>>>> directly, but it is not very convenient for users.
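For illustration, using the option from [5] might look like the
following; the flag name comes from that PR, while the jar path and
runner are placeholders:

    from apache_beam.options.pipeline_options import PipelineOptions

    # "--jar_package" is the option proposed in [5]; the jar path below is
    # only an example of a user manually listing the Java dependencies
    # that their cross-language transform needs at runtime.
    options = PipelineOptions([
        '--runner=PortableRunner',
        '--jar_package=/path/to/beam-sdks-java-io-kafka.jar',
    ])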
>>>>>>>>
>>>>>>>> I've discussed this today with Thomas, and we both figured that
>>>>>>>> the expansion service needs to provide a list of required jars
>>>>>>>> with the ExpansionResponse it returns. It's not entirely clear
>>>>>>>> how we determine which artifacts are necessary for an external
>>>>>>>> transform. We could just dump the entire classpath, like we do in
>>>>>>>> PipelineResources for Java pipelines. This provides many unneeded
>>>>>>>> classes but would work.
>>>>>>>>
>>>>>>>> Do you think it makes sense for the expansion service to provide
>>>>>>>> the artifacts? Perhaps you have a better idea of how to resolve
>>>>>>>> the staging problem in cross-language pipelines?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Max
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E
>>>>>>>>
>>>>>>>> [2] https://s.apache.org/beam-cross-language-io
>>>>>>>>
>>>>>>>> [3] https://github.com/apache/beam/pull/8322#discussion_r276336748
>>>>>>>>
>>>>>>>> [4] Dependency graph for beam-runners-direct-java:
>>>>>>>>
>>>>>>>> beam-runners-direct-java -> sdks-java-harness ->
>>>>>>>> beam-sdks-java-io-kafka -> beam-runners-direct-java ... the cycle
>>>>>>>> continues.
>>>>>>>>
>>>>>>>> beam-runners-direct-java depends on sdks-java-harness due to the
>>>>>>>> infamous Universal Local Runner. beam-sdks-java-io-kafka depends
>>>>>>>> on beam-runners-direct-java for running tests.
>>>>>>>>
>>>>>>>> [5] https://github.com/apache/beam/pull/8340
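For context, the user-facing goal behind all of this, as a sketch: using
the Python ReadFromKafka wrapper against a locally started expansion
service. The addresses and topic are placeholders, and the keyword
arguments follow the wrapper as of the PRs above, so treat them as
illustrative rather than definitive:

    import apache_beam as beam
    from apache_beam.io.external.kafka import ReadFromKafka

    # Assumes a Kafka broker and a Java expansion service (with
    # beam-sdks-java-io-kafka on its classpath) are already running at
    # the given addresses; all values below are placeholders.
    with beam.Pipeline() as p:
        _ = (p
             | ReadFromKafka(
                 consumer_config={'bootstrap.servers': 'localhost:9092'},
                 topics=['my_topic'],
                 expansion_service='localhost:8097')
             | beam.Map(print))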