Agreed on adding the 5.5 step; the resolution of conflicts/duplicates could be done by either the runner or the artifact staging service.
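To make that concrete, below is a minimal sketch of what the 5.5 deduplication could look like inside an artifact staging service. This is an illustration only, not Beam's actual artifact API: the Artifact class, its fields, and the merge method are all hypothetical, keyed on artifact name plus type URN and content hash so that identical artifacts contributed by different transforms are staged once and genuine conflicts are surfaced.

import java.util.*;

/** Hypothetical sketch of step 5.5: dedupe/conflict resolution across transforms. */
public class ArtifactDeduper {

  /** Illustrative stand-in for a URN + payload artifact with a content hash. */
  public static final class Artifact {
    final String name;      // e.g. "guava-20.0.jar" or "requirements.txt"
    final String typeUrn;   // e.g. "beam:artifact:type:bytes:v1" (illustrative)
    final String sha256;    // content hash of the payload
    Artifact(String name, String typeUrn, String sha256) {
      this.name = name;
      this.typeUrn = typeUrn;
      this.sha256 = sha256;
    }
  }

  /**
   * Merges the artifact lists of all transforms in a pipeline. Identical
   * (name, type URN, hash) entries are staged once; the same name with a
   * different type or hash is reported as a conflict that needs resolution
   * (e.g. renaming, or failing the submission).
   */
  public static List<Artifact> merge(List<List<Artifact>> perTransform) {
    Map<String, Artifact> byName = new LinkedHashMap<>();
    for (List<Artifact> artifacts : perTransform) {
      for (Artifact a : artifacts) {
        Artifact existing = byName.putIfAbsent(a.name, a);
        if (existing != null
            && !(existing.typeUrn.equals(a.typeUrn) && existing.sha256.equals(a.sha256))) {
          throw new IllegalStateException(
              "Conflicting artifacts named " + a.name + ": "
                  + existing.sha256 + " vs " + a.sha256);
        }
      }
    }
    return new ArrayList<>(byName.values());
  }
}

With something along these lines, two transforms that both bring in the same jar would stage it once, while two different jars sharing a name would fail fast instead of silently overwriting each other.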
On Tue, Apr 30, 2019 at 10:03 AM Chamikara Jayalath <chamik...@google.com> wrote:
>
> On Fri, Apr 26, 2019 at 4:14 PM Lukasz Cwik <lc...@google.com> wrote:
>
>> We should stick with URN + payload + artifact metadata[1] where the only mandatory one that all SDKs and expansion services understand is the "bytes" artifact type. This allows us to add optional URNs for file://, http://, Maven, PyPI, ... in the future. I would make the artifact staging service use the same URN + payload mechanism to get compatibility of artifacts across the different services, and also have the artifact staging service be able to be queried for the list of artifact types it supports. Finally, we would need to have environments enumerate the artifact types that they support.
>>
>> Having everyone use the same "artifact" representation would be beneficial since:
>> a) Python environments could install dependencies from a requirements.txt file (something that the Google Cloud Dataflow Python docker container allows for today)
>> b) It provides an extensible and versioned mechanism for SDKs, environments, and artifact staging/retrieval services to support additional artifact types
>> c) It allows for expressing a canonical representation of an artifact, like a Maven package, so a runner could merge environments that the runner deems compatible.
>>
>> The flow I could see is:
>> 1) (optional) query the artifact staging service for supported artifact types
>> 2) SDK requests the expansion service to expand the transform, passing in a list of artifact types the SDK and artifact staging service support; the expansion service returns a list of artifact types limited to those supported types + any supported by the environment
>> 3) SDK converts any artifact types that the artifact staging service or environment doesn't understand, e.g. pulls down Maven dependencies and converts them to "bytes" artifacts
>> 4) SDK sends artifacts to the artifact staging service
>> 5) Artifact staging service converts any artifacts to types that the environment understands
>> 6) Environment is started and gets artifacts from the artifact retrieval service.
>
> This is a very interesting proposal. I would add:
> (5.5) artifact staging service resolves conflicts/duplicates for artifacts needed by different transforms of the same pipeline
>
> BTW, what are the next steps here? Heejong or Max, will one of you be able to come up with a detailed proposal around this?
>
> In the meantime I suggest we add temporary pipeline options for staging Java dependencies from Python (and vice versa) to unblock development and testing of the rest of the cross-language transforms stack. For example, https://github.com/apache/beam/pull/8340
>
> Thanks,
> Cham
>
>> On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw <rober...@google.com> wrote:
>>
>>> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels <m...@apache.org> wrote:
>>> >
>>> > Good idea to let the client expose an artifact staging service that the ExpansionService could use to stage artifacts. This solves two problems:
>>> >
>>> > (1) The Expansion Service not being able to access the Job Server artifact staging service
>>> > (2) The client not having access to the dependencies returned by the Expansion Server
>>> >
>>> > The downside is that it adds an additional indirection. The alternative of letting the client handle staging the artifacts returned by the Expansion Server is more transparent and easier to implement.
>>>
>>> The other downside is that it may not always be possible for the expansion service to connect to the artifact staging service (e.g. when constructing a pipeline locally against a remote expansion service).
>>
>> Just to make sure, you're saying the expansion service would return all the artifacts (bytes, URLs, ...) as part of the response, since the expansion service wouldn't be able to connect to the SDK that is running locally either.
>>
>>> > Ideally, the Expansion Service won't return any dependencies because the environment already contains the required dependencies. We could make it a requirement for the expansion to be performed inside an environment. Then we would already ensure during expansion time that the runtime dependencies are available.
>>>
>>> Yes, it's cleanest if the expansion service provides an environment with all the dependencies provided. Interesting idea to make this a property of the expansion service itself.
>>
>> I had thought this too, but an opaque docker container that was built on top of a base Beam docker container would be very difficult for a runner to introspect and check for compatibility to allow for fusion across PTransforms. I think artifacts need to be communicated in their canonical representation.
>>
>>> > > In this case, the runner would (as requested by its configuration) be free to merge environments it deemed compatible, including swapping out beam-java-X for beam-java-embedded if it considers itself compatible with the dependency list.
>>> >
>>> > Could you explain how that would work in practice?
>>>
>>> Say one has a pipeline with environments
>>>
>>> A: beam-java-sdk-2.12-docker
>>> B: beam-java-sdk-2.12-docker + dep1
>>> C: beam-java-sdk-2.12-docker + dep2
>>> D: beam-java-sdk-2.12-docker + dep3
>>>
>>> A runner could (conceivably) be intelligent enough to know that dep1 and dep2 are indeed compatible, and run A, B, and C in a single beam-java-sdk-2.12-docker + dep1 + dep2 environment (with the corresponding fusion and lower-overhead benefits). If a certain pipeline option is set, it might further note that dep1 and dep2 are compatible with its own workers, which are built against sdk-2.12, and choose to run these in an embedded + dep1 + dep2 environment.
>>
>> We have been talking about the expansion service and cross-language transforms a lot lately, but I believe it will initially come at the cost of poor fusion of transforms, since "merging" environments that are compatible is a difficult problem that brings up many of the dependency management issues (e.g. diamond dependency issues).
>>
>> 1: https://github.com/apache/beam/blob/516cdb6401d9fb7adb004de472771fb1fb3a92af/model/job-management/src/main/proto/beam_artifact_api.proto#L56
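As a rough illustration of the environment-merging idea in Robert's example above, the sketch below shows one way a runner might decide that two docker environments are fusible: same base image and no conflicting versions of the same dependency. All class and method names here are hypothetical, not a real runner API, and actual compatibility checking would need to be far more conservative.

import java.util.*;

/**
 * Hypothetical sketch of environment merging: two docker environments built
 * from the same base image are considered mergeable when the union of their
 * dependencies contains no conflicting versions of the same artifact.
 */
public class EnvironmentMerger {

  /** Illustrative environment: a base image plus Maven-style dependencies. */
  public static final class Env {
    final String baseImage;          // e.g. "beam-java-sdk-2.12-docker"
    final Map<String, String> deps;  // artifactId -> version
    Env(String baseImage, Map<String, String> deps) {
      this.baseImage = baseImage;
      this.deps = deps;
    }
  }

  /** Returns the merged environment, or empty if the two cannot be fused. */
  public static Optional<Env> tryMerge(Env a, Env b) {
    if (!a.baseImage.equals(b.baseImage)) {
      return Optional.empty(); // different base images: don't attempt fusion
    }
    Map<String, String> merged = new TreeMap<>(a.deps);
    for (Map.Entry<String, String> dep : b.deps.entrySet()) {
      String existing = merged.putIfAbsent(dep.getKey(), dep.getValue());
      if (existing != null && !existing.equals(dep.getValue())) {
        return Optional.empty(); // e.g. a diamond-dependency version conflict
      }
    }
    return Optional.of(new Env(a.baseImage, merged));
  }
}

Under a rule like this, environments B and C from the example would merge into a single beam-java-sdk-2.12-docker + dep1 + dep2 environment, while a version clash between dep1 and dep2 would keep them in separate environments and forgo fusion.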