Agreed on adding the 5.5 step; the resolution of conflicts/duplicates could be done by either the runner or the artifact staging service.
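To make that concrete, below is a minimal sketch of what the 5.5 deduplication could look like inside an artifact staging service. This is an illustration only, not Beam's actual artifact API: the Artifact class, its fields, and the merge method are all hypothetical, keyed on artifact name plus type URN and content hash so that identical artifacts contributed by different transforms are staged once and genuine conflicts are surfaced.

import java.util.*;

/** Hypothetical sketch of step 5.5: dedupe/conflict resolution across transforms. */
public class ArtifactDeduper {

  /** Illustrative stand-in for a URN + payload artifact with a content hash. */
  public static final class Artifact {
    final String name;      // e.g. "guava-20.0.jar" or "requirements.txt"
    final String typeUrn;   // e.g. "beam:artifact:type:bytes:v1" (illustrative)
    final String sha256;    // content hash of the payload
    Artifact(String name, String typeUrn, String sha256) {
      this.name = name;
      this.typeUrn = typeUrn;
      this.sha256 = sha256;
    }
  }

  /**
   * Merges the artifact lists of all transforms in a pipeline. Identical
   * (name, type URN, hash) entries are staged once; the same name with a
   * different type or hash is reported as a conflict that needs resolution
   * (e.g. renaming, or failing the submission).
   */
  public static List<Artifact> merge(List<List<Artifact>> perTransform) {
    Map<String, Artifact> byName = new LinkedHashMap<>();
    for (List<Artifact> artifacts : perTransform) {
      for (Artifact a : artifacts) {
        Artifact existing = byName.putIfAbsent(a.name, a);
        if (existing != null
            && !(existing.typeUrn.equals(a.typeUrn) && existing.sha256.equals(a.sha256))) {
          throw new IllegalStateException(
              "Conflicting artifacts named " + a.name + ": "
                  + existing.sha256 + " vs " + a.sha256);
        }
      }
    }
    return new ArrayList<>(byName.values());
  }
}

With something along these lines, two transforms that both bring in the same jar would stage it once, while two different jars sharing a name would fail fast instead of silently overwriting each other.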
On Tue, Apr 30, 2019 at 10:03 AM Chamikara Jayalath <chamik...@google.com> wrote:
>
> On Fri, Apr 26, 2019 at 4:14 PM Lukasz Cwik <lc...@google.com> wrote:
>
>> We should stick with URN + payload + artifact metadata[1] where the only mandatory one that all SDKs and expansion services understand is the "bytes" artifact type. This allows us to add optional URNs for file://, http://, Maven, PyPI, ... in the future. I would make the artifact staging service use the same URN + payload mechanism to get compatibility of artifacts across the different services, and also have the artifact staging service be able to be queried for the list of artifact types it supports. Finally, we would need to have environments enumerate the artifact types that they support.
>>
>> Having everyone use the same "artifact" representation would be beneficial since:
>> a) Python environments could install dependencies from a requirements.txt file (something that the Google Cloud Dataflow Python docker container allows for today)
>> b) It provides an extensible and versioned mechanism for SDKs, environments, and artifact staging/retrieval services to support additional artifact types
>> c) It allows for expressing a canonical representation of an artifact, like a Maven package, so a runner could merge environments that the runner deems compatible.
>>
>> The flow I could see is:
>> 1) (optional) query the artifact staging service for supported artifact types
>> 2) SDK requests the expansion service to expand the transform, passing in a list of artifact types the SDK and artifact staging service support; the expansion service returns a list of artifact types limited to those supported types + any supported by the environment
>> 3) SDK converts any artifact types that the artifact staging service or environment doesn't understand, e.g. pulls down Maven dependencies and converts them to "bytes" artifacts
>> 4) SDK sends artifacts to the artifact staging service
>> 5) Artifact staging service converts any artifacts to types that the environment understands
>> 6) Environment is started and gets artifacts from the artifact retrieval service.
>
> This is a very interesting proposal. I would add:
> (5.5) artifact staging service resolves conflicts/duplicates for artifacts needed by different transforms of the same pipeline
>
> BTW, what are the next steps here? Heejong or Max, will one of you be able to come up with a detailed proposal around this?
>
> In the meantime I suggest we add temporary pipeline options for staging Java dependencies from Python (and vice versa) to unblock development and testing of the rest of the cross-language transforms stack. For example, https://github.com/apache/beam/pull/8340
>
> Thanks,
> Cham
>
>> On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw <rober...@google.com> wrote:
>>
>>> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels <m...@apache.org> wrote:
>>> >
>>> > Good idea to let the client expose an artifact staging service that the ExpansionService could use to stage artifacts. This solves two problems:
>>> >
>>> > (1) The Expansion Service not being able to access the Job Server artifact staging service
>>> > (2) The client not having access to the dependencies returned by the Expansion Server
>>> >
>>> > The downside is that it adds an additional indirection. The alternative of letting the client handle staging the artifacts returned by the Expansion Server is more transparent and easier to implement.
>>>
>>> The other downside is that it may not always be possible for the expansion service to connect to the artifact staging service (e.g. when constructing a pipeline locally against a remote expansion service).
>>
>> Just to make sure, you're saying the expansion service would return all the artifacts (bytes, URLs, ...) as part of the response, since the expansion service wouldn't be able to connect to the SDK that is running locally either.
>>
>>> > Ideally, the Expansion Service won't return any dependencies because the environment already contains the required dependencies. We could make it a requirement for the expansion to be performed inside an environment. Then we would already ensure during expansion time that the runtime dependencies are available.
>>>
>>> Yes, it's cleanest if the expansion service provides an environment with all the dependencies provided. Interesting idea to make this a property of the expansion service itself.
>>
>> I had thought this too, but an opaque docker container that was built on top of a base Beam docker container would be very difficult for a runner to introspect and check for compatibility to allow for fusion across PTransforms. I think artifacts need to be communicated in their canonical representation.
>>
>>> > > In this case, the runner would (as requested by its configuration) be free to merge environments it deemed compatible, including swapping out beam-java-X for beam-java-embedded if it considers itself compatible with the dependency list.
>>> >
>>> > Could you explain how that would work in practice?
>>>
>>> Say one has a pipeline with environments
>>>
>>> A: beam-java-sdk-2.12-docker
>>> B: beam-java-sdk-2.12-docker + dep1
>>> C: beam-java-sdk-2.12-docker + dep2
>>> D: beam-java-sdk-2.12-docker + dep3
>>>
>>> A runner could (conceivably) be intelligent enough to know that dep1 and dep2 are indeed compatible, and run A, B, and C in a single beam-java-sdk-2.12-docker + dep1 + dep2 environment (with the corresponding fusion and lower-overhead benefits). If a certain pipeline option is set, it might further note that dep1 and dep2 are compatible with its own workers, which are built against sdk-2.12, and choose to run these in an embedded + dep1 + dep2 environment.
>>
>> We have been talking about the expansion service and cross-language transforms a lot lately, but I believe it will initially come at the cost of poor fusion of transforms, since "merging" environments that are compatible is a difficult problem that brings up many of the dependency management issues (e.g. diamond dependency issues).
>>
>> 1: https://github.com/apache/beam/blob/516cdb6401d9fb7adb004de472771fb1fb3a92af/model/job-management/src/main/proto/beam_artifact_api.proto#L56
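As a rough illustration of the environment-merging idea in Robert's example above, the sketch below shows one way a runner might decide that two docker environments are fusible: same base image and no conflicting versions of the same dependency. All class and method names here are hypothetical, not a real runner API, and actual compatibility checking would need to be far more conservative.

import java.util.*;

/**
 * Hypothetical sketch of environment merging: two docker environments built
 * from the same base image are considered mergeable when the union of their
 * dependencies contains no conflicting versions of the same artifact.
 */
public class EnvironmentMerger {

  /** Illustrative environment: a base image plus Maven-style dependencies. */
  public static final class Env {
    final String baseImage;          // e.g. "beam-java-sdk-2.12-docker"
    final Map<String, String> deps;  // artifactId -> version
    Env(String baseImage, Map<String, String> deps) {
      this.baseImage = baseImage;
      this.deps = deps;
    }
  }

  /** Returns the merged environment, or empty if the two cannot be fused. */
  public static Optional<Env> tryMerge(Env a, Env b) {
    if (!a.baseImage.equals(b.baseImage)) {
      return Optional.empty(); // different base images: don't attempt fusion
    }
    Map<String, String> merged = new TreeMap<>(a.deps);
    for (Map.Entry<String, String> dep : b.deps.entrySet()) {
      String existing = merged.putIfAbsent(dep.getKey(), dep.getValue());
      if (existing != null && !existing.equals(dep.getValue())) {
        return Optional.empty(); // e.g. a diamond-dependency version conflict
      }
    }
    return Optional.of(new Env(a.baseImage, merged));
  }
}

Under a rule like this, environments B and C from the example would merge into a single beam-java-sdk-2.12-docker + dep1 + dep2 environment, while a version clash between dep1 and dep2 would keep them in separate environments and forgo fusion.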