We can expose the artifact staging endpoint and artifact token to allow the expansion service to upload any resources its environment may need. For example, the expansion service for the Beam Java SDK would be able to upload jars.
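
To make that concrete, here is a rough sketch of what the expansion service could do once the expansion request carries the staging endpoint and token. Everything below (ArtifactStagingClient, stageTransformJars, the jar locations) is hypothetical and only meant to illustrate the flow, not an existing Beam API:

  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;
  import java.util.Arrays;
  import java.util.List;

  public class ExpansionArtifactStagingSketch {

    // Imaginary client for the job's artifact staging endpoint; in a real
    // implementation this would wrap the artifact staging gRPC service.
    interface ArtifactStagingClient {
      void putArtifact(String stagingToken, String name, Path file) throws Exception;
    }

    // Called by the expansion service after it has expanded e.g. KafkaIO:
    // push the jars the expanded transform needs so the runner can hand them
    // to the Java SDK harness later. The jar locations are made up.
    static void stageTransformJars(ArtifactStagingClient client, String stagingToken)
        throws Exception {
      List<Path> jars =
          Arrays.asList(
              Paths.get("/opt/expansion-service/lib/beam-sdks-java-io-kafka.jar"),
              Paths.get("/opt/expansion-service/lib/kafka-clients.jar"));
      for (Path jar : jars) {
        if (Files.exists(jar)) {
          client.putArtifact(stagingToken, jar.getFileName().toString(), jar);
        }
      }
    }
  }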

In the "docker" environment, the Apache Beam Java SDK harness container would fetch the relevant artifacts for itself and would then be able to execute the pipeline. (Note that a docker environment could skip this artifact staging entirely if the container image already contained all the necessary artifacts.)

For the existing "external" environment, whatever "external" points to should already come with all of its resources prepackaged. The "process"-based environment could either use the artifact staging service to fetch the resources its process needs, or follow the same pattern as "external" and ship with everything prepackaged. Note that both "external" and "process" require the expansion service instance to be specialized for those environments, which is why the default for the expansion service should be the "docker" environment.

A major reason for going with docker containers as the environment that all runners should support is that containers provide a solution for exactly this issue. Both the "process" and "external" environments are inherently limiting, and expanding their capabilities would quickly have us building something like a docker container, because we would find ourselves solving the same problems that docker containers already solve (resources, file layout, permissions, ...).
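
For reference, this is roughly how those three environments are expressed in the portability protos. I'm writing the URNs and payload field names from memory, so treat this as an illustrative sketch rather than the authoritative definitions; the image name, command, and endpoint below are placeholders:

  import org.apache.beam.model.pipeline.v1.Endpoints.ApiServiceDescriptor;
  import org.apache.beam.model.pipeline.v1.RunnerApi.DockerPayload;
  import org.apache.beam.model.pipeline.v1.RunnerApi.Environment;
  import org.apache.beam.model.pipeline.v1.RunnerApi.ExternalPayload;
  import org.apache.beam.model.pipeline.v1.RunnerApi.ProcessPayload;

  public class EnvironmentSketch {

    // "docker": the runner starts the container itself, so staged artifacts
    // (or a fully self-contained image) are all it needs.
    static Environment docker() {
      return Environment.newBuilder()
          .setUrn("beam:env:docker:v1")
          .setPayload(
              DockerPayload.newBuilder()
                  .setContainerImage("example/beam_java_sdk_harness:latest")
                  .build()
                  .toByteString())
          .build();
    }

    // "process": the runner launches a local command, which must already have
    // (or be able to fetch) everything the transform needs.
    static Environment process() {
      return Environment.newBuilder()
          .setUrn("beam:env:process:v1")
          .setPayload(
              ProcessPayload.newBuilder()
                  .setCommand("/opt/java-harness/boot")
                  .build()
                  .toByteString())
          .build();
    }

    // "external": the harness is already running somewhere and must come
    // prepackaged with its resources.
    static Environment external() {
      return Environment.newBuilder()
          .setUrn("beam:env:external:v1")
          .setPayload(
              ExternalPayload.newBuilder()
                  .setEndpoint(
                      ApiServiceDescriptor.newBuilder().setUrl("host:port").build())
                  .build()
                  .toByteString())
          .build();
    }
  }

Only the docker case gives the runner a complete, self-describing environment; the other two push the packaging problem onto whoever set up the process or the external endpoint.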

On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels <[email protected]> wrote:
> Hi everyone,
>
> We have previously merged support for configuring transforms across
> languages. Please see Cham's summary on the discussion [1]. There is
> also a design document [2].
>
> Subsequently, we've added wrappers for cross-language transforms to the
> Python SDK, i.e. GenerateSequence, ReadFromKafka, and there is a pending
> PR [1] for WriteToKafka. All of them utilize Java transforms via
> cross-language configuration.
>
> That is all pretty exciting :)
>
> We still have some issues to solve, one being how to stage artifacts from
> a foreign environment. When we run external transforms which are part of
> Beam's core (e.g. GenerateSequence), we have them available in the SDK
> Harness. However, when they are not (e.g. KafkaIO) we need to stage the
> necessary files.
>
> For my PR [3] I've naively added ":beam-sdks-java-io-kafka" to the SDK
> Harness which caused dependency problems [4]. Those could be resolved,
> but the bigger question is how to stage artifacts for external
> transforms programmatically?
>
> Heejong has solved this by adding a "--jar_package" option to the Python
> SDK to stage Java files [5]. I think that is a better solution than
> adding required Jars to the SDK Harness directly, but it is not very
> convenient for users.
>
> I've discussed this today with Thomas and we both figured that the
> expansion service needs to provide a list of required Jars with the
> ExpansionResponse it provides. It's not entirely clear how we determine
> which artifacts are necessary for an external transform. We could just
> dump the entire classpath like we do in PipelineResources for Java
> pipelines. This provides many unneeded classes but would work.
>
> Do you think it makes sense for the expansion service to provide the
> artifacts? Perhaps you have a better idea how to resolve the staging
> problem in cross-language pipelines?
>
> Thanks,
> Max
>
> [1]
> https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E
>
> [2] https://s.apache.org/beam-cross-language-io
>
> [3] https://github.com/apache/beam/pull/8322#discussion_r276336748
>
> [4] Dependency graph for beam-runners-direct-java:
>
> beam-runners-direct-java -> sdks-java-harness -> beam-sdks-java-io-kafka
> -> beam-runners-direct-java ... the cycle continues
>
> Beam-runners-direct-java depends on sdks-java-harness due
> to the infamous Universal Local Runner. Beam-sdks-java-io-kafka depends
> on beam-runners-direct-java for running tests.
>
> [5] https://github.com/apache/beam/pull/8340
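
As a rough illustration of the classpath dump Max mentions above (the same spirit as PipelineResources), the expansion service could enumerate its own jars along these lines; the class below is a made-up sketch, and where that list would travel (e.g. alongside the ExpansionResponse) is exactly the open question:

  import java.io.File;
  import java.util.ArrayList;
  import java.util.List;

  public class ClasspathArtifactsSketch {

    // Collect every jar on the expansion service's own classpath. This will
    // include plenty of jars the transform does not strictly need, which is
    // the trade-off mentioned above.
    static List<File> classpathJars() {
      List<File> jars = new ArrayList<>();
      String classpath = System.getProperty("java.class.path", "");
      for (String entry : classpath.split(File.pathSeparator)) {
        File file = new File(entry);
        // Directories (e.g. build/classes) would have to be zipped up before
        // they could be staged as artifacts.
        if (file.isFile() && file.getName().endsWith(".jar")) {
          jars.add(file);
        }
      }
      return jars;
    }
  }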
