On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath <chamik...@google.com> wrote:
> Thanks for raising the concern about credentials, Ankur. I agree that this
> is a significant issue.
>
> On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik <lc...@google.com> wrote:
>
>> I can understand the concern about credentials; the same access concern
>> will exist for several cross-language transforms (mostly IOs), since some
>> will need access to credentials to read/write to an external service.
>>
>> Are there any ideas on how credential propagation could work for these
>> IOs?
>>
>
> There are some cases where existing IO transforms need credentials to
> access remote resources, for example, for size estimation, validation,
> etc., but usually these are optional (or the transform can be configured
> to not perform these functions).
> To clarify, I'm only talking about transform expansion here. Many IO
> transforms need read/write access to remote services at run time, so we
> probably need to figure out a way to propagate these credentials anyway.
>
>
>> Can we use these mechanisms for staging?
>>
>
> I think we'll have to find a way to do one of: (1) propagate credentials
> to other SDKs, (2) allow users to configure SDK containers to have the
> necessary credentials, or (3) do the artifact staging from the pipeline
> SDK environment, which already has the credentials. I prefer (1) or (2),
> since this will give a transform the same feature set whether used
> directly (in the same SDK language as the transform) or remotely, but it
> might be hard to do this for an arbitrary service that a transform might
> connect to, considering the number of ways users can configure
> credentials (after an offline discussion with Ankur).
>
>
>> On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka <goe...@google.com> wrote:
>>
>>> I agree that the expansion service knows about the artifacts required
>>> for a cross-language transform, and having a prepackaged folder/zip for
>>> transforms based on language makes sense.
>>>
>>> One thing to note here is that the expansion service might not have the
>>> same access privileges as the pipeline author and hence might not be
>>> able to stage artifacts by itself.
>>> Keeping this in mind, I am leaning towards making the expansion service
>>> provide all the required artifacts to the user and letting the user
>>> stage the artifacts as regular artifacts.
>>> At this time, we only have Beam File System based artifact staging,
>>> which uses local credentials to access different file systems. Even a
>>> docker-based expansion service running on a local machine might not
>>> have the same access privileges.
>>>
>>> In brief, this is what I am leaning toward:
>>> user calls for pipeline submission -> expansion service provides
>>> cross-language transforms and relevant artifacts to the SDK -> SDK
>>> submits the pipeline to the job server and stages user and
>>> cross-language artifacts to the artifact staging service.
>>>
>>>
>>> On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath
>>> <chamik...@google.com> wrote:
>>>
>>>> On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>>> Note that Max did ask whether making the expansion service do the
>>>>> staging made sense, and my first line was agreeing with that
>>>>> direction and expanding on how it could be done (so this is really
>>>>> Max's idea, or from whomever he got the idea from).
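As a concrete reading of the flow Ankur sketches above (user submits ->
expansion service reports artifacts -> SDK stages them with its own
credentials), here is a minimal Python sketch; every name in it is
hypothetical:

    # Sketch only: "expansion_service" and "job_server" stand in for client
    # stubs of the expansion and job/artifact-staging services, and
    # "response.artifacts" is a hypothetical field listing the jars the
    # external transform needs at runtime.
    def submit_with_external_artifacts(pipeline_proto, user_artifacts,
                                       expansion_service, job_server):
        response = expansion_service.expand(pipeline_proto)
        # The SDK runs with the pipeline author's credentials, so it can
        # stage both its own artifacts and the ones the expansion service
        # reported, sidestepping the credential problem on the service side.
        staging_token = job_server.stage_artifacts(
            list(user_artifacts) + list(response.artifacts))
        return job_server.run(response.pipeline, staging_token)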
>>>>
>>>> +1 to what Max said then :)
>>>>
>>>>
>>>>> I believe a lot of the value of the expansion service is in not
>>>>> having users need to be aware of all the SDK-specific dependencies
>>>>> when they are trying to create a pipeline; only the "user" who is
>>>>> launching the expansion service may need to be. And in that case we
>>>>> can have a prepackaged expansion service application that does what
>>>>> most users would want (e.g. the expansion service as a docker
>>>>> container, a single bundled jar, ...). We (the Apache Beam
>>>>> community) could choose to host a default implementation of the
>>>>> expansion service as well.
>>>>
>>>> I'm not against this, but I think this is a secondary, more advanced
>>>> use case. For a Beam user that needs to use a Java transform that
>>>> they already have in a Python pipeline, we should provide a way to
>>>> allow starting up an expansion service (with the dependencies needed
>>>> for that) and running a pipeline that uses this external Java
>>>> transform (with the dependencies that are needed at runtime).
>>>> Probably it'll be enough to allow providing all dependencies when
>>>> starting up the expansion service and to allow the expansion service
>>>> to do the staging of jars as well. I don't see a need to include the
>>>> list of jars in the ExpansionResponse sent to the Python SDK.
>>>>
>>>>
>>>>> On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath
>>>>> <chamik...@google.com> wrote:
>>>>>
>>>>>> I think there are two kinds of dependencies we have to consider:
>>>>>>
>>>>>> (1) Dependencies that are needed to expand the transform.
>>>>>>
>>>>>> These have to be provided when we start the expansion service so
>>>>>> that the available external transforms are correctly registered
>>>>>> with the expansion service.
>>>>>>
>>>>>> (2) Dependencies that are not needed at expansion but may be needed
>>>>>> at runtime.
>>>>>>
>>>>>> I think in both cases users have to provide these dependencies,
>>>>>> either when the expansion service is started or when a pipeline is
>>>>>> being executed.
>>>>>>
>>>>>> Max, I'm not sure why the expansion service would need to provide
>>>>>> dependencies to the user, since the user will already be aware of
>>>>>> these. Are you talking about an expansion service that is readily
>>>>>> available and will be used by many Beam users? I think such a
>>>>>> (possibly long-running) service will have to maintain a repository
>>>>>> of transforms and should have mechanisms for registering new
>>>>>> transforms, discovering already registered transforms, etc. I think
>>>>>> there's more design work needed to make the transform expansion
>>>>>> service support such use cases. Currently, I think allowing the
>>>>>> pipeline author to provide the jars when starting the expansion
>>>>>> service and when executing the pipeline will be adequate.
>>>>>>
>>>>>> Regarding the entity that will perform the staging, I like Luke's
>>>>>> idea of allowing the expansion service to do the staging (of jars
>>>>>> provided by the user). The notion of artifacts and how they are
>>>>>> extracted/represented is SDK-dependent, so if the pipeline SDK
>>>>>> tries to do this we have to add n x (n - 1) configurations (for n
>>>>>> SDKs).
>>>>>>
>>>>>> - Cham
>>>>>>
>>>>>> On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik <lc...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> We can expose the artifact staging endpoint and artifact token to
>>>>>>> allow the expansion service to upload any resources its
>>>>>>> environment may need. For example, the expansion service for the
>>>>>>> Beam Java SDK would be able to upload jars.
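A minimal sketch of this alternative, where the expansion service is
handed the staging endpoint and token and uploads the jars itself; all
names here are hypothetical stand-ins for the real expansion,
dependency-lookup, and upload calls:

    # Hypothetical sketch: the expansion service knows which jars back the
    # transform, so it stages them directly instead of reporting them to
    # the SDK. "expand_fn", "jars_for_fn" and "upload_fn" are placeholders.
    def expand_and_stage(transform_proto, staging_endpoint, staging_token,
                         expand_fn, jars_for_fn, upload_fn):
        expanded = expand_fn(transform_proto)
        for jar in jars_for_fn(transform_proto):
            upload_fn(staging_endpoint, staging_token, jar)
        # The SDK receiving this response never needs to know about the jars.
        return expanded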
>>>>>>>
>>>>>>> In the "docker" environment, the Apache Beam Java SDK harness
>>>>>>> container would fetch the relevant artifacts for itself and be
>>>>>>> able to execute the pipeline. (Note that a docker environment
>>>>>>> could skip all this artifact staging if the docker environment
>>>>>>> contained all the necessary artifacts.)
>>>>>>>
>>>>>>> The existing "external" environment should already come with all
>>>>>>> the resources prepackaged wherever "external" points to. The
>>>>>>> "process"-based environment could choose to use the artifact
>>>>>>> staging service to fetch the resources associated with its
>>>>>>> process, or it could follow the same pattern as "external" and
>>>>>>> already contain all the prepackaged resources. Note that both
>>>>>>> "external" and "process" will require the instance of the
>>>>>>> expansion service to be specialized for those environments, which
>>>>>>> is why the default for the expansion service should be the
>>>>>>> "docker" environment.
>>>>>>>
>>>>>>> Note that a major reason for going with docker containers as the
>>>>>>> environment that all runners should support is that containers
>>>>>>> provide a solution for this exact issue. Both the "process" and
>>>>>>> "external" environments are explicitly limiting, and expanding
>>>>>>> their capabilities will quickly have us building something like a
>>>>>>> docker container, because we'll quickly find ourselves solving the
>>>>>>> same problems that docker containers already solve (resources,
>>>>>>> file layout, permissions, ...).
>>>>>>>
>>>>>>> On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels
>>>>>>> <m...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> We have previously merged support for configuring transforms
>>>>>>>> across languages. Please see Cham's summary of the discussion
>>>>>>>> [1]. There is also a design document [2].
>>>>>>>>
>>>>>>>> Subsequently, we've added wrappers for cross-language transforms
>>>>>>>> to the Python SDK, i.e. GenerateSequence, ReadFromKafka, and
>>>>>>>> there is a pending PR [3] for WriteToKafka. All of them utilize
>>>>>>>> Java transforms via cross-language configuration.
>>>>>>>>
>>>>>>>> That is all pretty exciting :)
>>>>>>>>
>>>>>>>> We still have some issues to solve, one being how to stage
>>>>>>>> artifacts from a foreign environment. When we run external
>>>>>>>> transforms which are part of Beam's core (e.g. GenerateSequence),
>>>>>>>> we have them available in the SDK Harness. However, when they are
>>>>>>>> not (e.g. KafkaIO), we need to stage the necessary files.
>>>>>>>>
>>>>>>>> For my PR [3] I've naively added ":beam-sdks-java-io-kafka" to
>>>>>>>> the SDK Harness, which caused dependency problems [4]. Those
>>>>>>>> could be resolved, but the bigger question is how to stage
>>>>>>>> artifacts for external transforms programmatically.
>>>>>>>>
>>>>>>>> Heejong has solved this by adding a "--jar_package" option to the
>>>>>>>> Python SDK to stage Java files [5]. I think that is a better
>>>>>>>> solution than adding the required jars to the SDK Harness
>>>>>>>> directly, but it is not very convenient for users.
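For illustration, using the option from [5] might look like the
following; the flag name comes from that PR, while the jar path and
runner are placeholders:

    from apache_beam.options.pipeline_options import PipelineOptions

    # "--jar_package" is the option proposed in [5]; the jar path below is
    # only an example of a user manually listing the Java dependencies
    # that their cross-language transform needs at runtime.
    options = PipelineOptions([
        '--runner=PortableRunner',
        '--jar_package=/path/to/beam-sdks-java-io-kafka.jar',
    ])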
>>>>>>>>
>>>>>>>> I've discussed this today with Thomas, and we both figured that
>>>>>>>> the expansion service needs to provide a list of required jars
>>>>>>>> with the ExpansionResponse it returns. It's not entirely clear
>>>>>>>> how we determine which artifacts are necessary for an external
>>>>>>>> transform. We could just dump the entire classpath, like we do in
>>>>>>>> PipelineResources for Java pipelines. This provides many unneeded
>>>>>>>> classes but would work.
>>>>>>>>
>>>>>>>> Do you think it makes sense for the expansion service to provide
>>>>>>>> the artifacts? Perhaps you have a better idea of how to resolve
>>>>>>>> the staging problem in cross-language pipelines?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Max
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E
>>>>>>>>
>>>>>>>> [2] https://s.apache.org/beam-cross-language-io
>>>>>>>>
>>>>>>>> [3] https://github.com/apache/beam/pull/8322#discussion_r276336748
>>>>>>>>
>>>>>>>> [4] Dependency graph for beam-runners-direct-java:
>>>>>>>>
>>>>>>>> beam-runners-direct-java -> sdks-java-harness ->
>>>>>>>> beam-sdks-java-io-kafka -> beam-runners-direct-java ... the cycle
>>>>>>>> continues.
>>>>>>>>
>>>>>>>> beam-runners-direct-java depends on sdks-java-harness due to the
>>>>>>>> infamous Universal Local Runner. beam-sdks-java-io-kafka depends
>>>>>>>> on beam-runners-direct-java for running tests.
>>>>>>>>
>>>>>>>> [5] https://github.com/apache/beam/pull/8340
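For context, the user-facing goal behind all of this, as a sketch: using
the Python ReadFromKafka wrapper against a locally started expansion
service. The addresses and topic are placeholders, and the keyword
arguments follow the wrapper as of the PRs above, so treat them as
illustrative rather than definitive:

    import apache_beam as beam
    from apache_beam.io.external.kafka import ReadFromKafka

    # Assumes a Kafka broker and a Java expansion service (with
    # beam-sdks-java-io-kafka on its classpath) are already running at
    # the given addresses; all values below are placeholders.
    with beam.Pipeline() as p:
        _ = (p
             | ReadFromKafka(
                 consumer_config={'bootstrap.servers': 'localhost:9092'},
                 topics=['my_topic'],
                 expansion_service='localhost:8097')
             | beam.Map(print))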