Hi, I wrote a draft of the implementation plan[1]. The summary is on the first page. Any help would be appreciated!
[1]: https://docs.google.com/document/d/1L7MJcfyy9mg2Ahfw5XPhUeBe-dyvAPMOYOiFA1-kAog/edit?usp=sharing

On Thu, Dec 12, 2019 at 5:02 PM Heejong Lee <[email protected]> wrote:

> I'm brushing up memory by revisiting the doc[1] and it seems like we've already reached the consensus on the bigger picture. I would start drafting the implementation plan.
>
> [1]: https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing
>
> On Tue, Nov 26, 2019 at 3:54 AM Maximilian Michels <[email protected]> wrote:
>
>> Hey Heejong,
>>
>> I don't think so. It would be great to push this forward.
>>
>> Thanks,
>> Max
>>
>> On 26.11.19 02:49, Heejong Lee wrote:
>> > Hi,
>> >
>> > Is anyone actively working on the artifact staging extension for cross-language pipelines? I'm thinking I can contribute to it this coming December. If anyone has any progress on this and needs help, please let me know.
>> >
>> > Thanks,
>> >
>> > On Wed, Jun 12, 2019 at 2:42 AM Ismaël Mejía <[email protected]> wrote:
>> >
>> > Can you please add this to the design documents webpage:
>> > https://beam.apache.org/contribute/design-documents/
>> >
>> > On Wed, May 8, 2019 at 7:29 PM Chamikara Jayalath <[email protected]> wrote:
>> > >
>> > > On Tue, May 7, 2019 at 10:21 AM Maximilian Michels <[email protected]> wrote:
>> > >>
>> > >> Here's the first draft:
>> > >> https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing
>> > >>
>> > >> It's rather high-level. We may want to add more details once we have finalized the design. Feel free to make comments and edits.
>> > >
>> > > Thanks Max. Added some comments.
>> > >
>> > >> > All of this goes back to the idea that I think the listing of artifacts (or more general dependencies) should be a property of the environments themselves.
>> > >>
>> > >> +1 I came to the same conclusion while thinking about how to store artifact information for deferred execution of the pipeline.
>> > >>
>> > >> -Max
>> > >>
>> > >> On 07.05.19 18:10, Robert Bradshaw wrote:
>> > >> > Looking forward to your writeup, Max. In the meantime, some comments below.
>> > >> >
>> > >> > From: Lukasz Cwik <[email protected]>
>> > >> > Date: Thu, May 2, 2019 at 6:45 PM
>> > >> > To: dev
>> > >> >
>> > >> >> On Thu, May 2, 2019 at 7:20 AM Robert Bradshaw <[email protected]> wrote:
>> > >> >>>
>> > >> >>> On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik <[email protected]> wrote:
>> > >> >>>>
>> > >> >>>> We should stick with URN + payload + artifact metadata[1] where the only mandatory one that all SDKs and expansion services understand is the "bytes" artifact type. This allows us to add optional URNs for file://, http://, Maven, PyPi, ... in the future. I would make the artifact staging service use the same URN + payload mechanism to get compatibility of artifacts across the different services, and also have the artifact staging service be able to be queried for the list of artifact types it supports.
>> > >> >>>
>> > >> >>> +1
>> > >> >>>
>> > >> >>>> Finally, we would need to have environments enumerate the artifact types that they support.
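For illustration, here is a rough sketch of what the URN + payload + metadata representation described above might look like. This is only a sketch: the class, field, and URN names are hypothetical (not the actual Beam protos), and the Maven coordinates are just an example.

    # Hypothetical sketch of the proposed artifact representation.
    from dataclasses import dataclass, field
    from typing import Dict

    # The only mandatory type that every SDK / staging service must support.
    ARTIFACT_TYPE_BYTES = "beam:artifact:type:bytes:v1"
    # Optional, extensible types that services can advertise.
    ARTIFACT_TYPE_FILE = "beam:artifact:type:file:v1"
    ARTIFACT_TYPE_MAVEN = "beam:artifact:type:maven:v1"
    ARTIFACT_TYPE_PYPI = "beam:artifact:type:pypi:v1"

    @dataclass
    class Artifact:
        type_urn: str                     # which kind of artifact this is
        type_payload: bytes               # URN-specific data
        metadata: Dict[str, str] = field(default_factory=dict)

    # A Maven dependency in its canonical representation ...
    maven_dep = Artifact(
        type_urn=ARTIFACT_TYPE_MAVEN,
        type_payload=b"org.apache.beam:beam-sdks-java-io-kafka:2.16.0",
        metadata={"name": "beam-sdks-java-io-kafka"},
    )

    # ... and the same dependency downgraded to the mandatory "bytes" type
    # for a staging service that supports nothing else.
    bytes_dep = Artifact(
        type_urn=ARTIFACT_TYPE_BYTES,
        type_payload=b"<contents of beam-sdks-java-io-kafka-2.16.0.jar>",
        metadata={"name": "beam-sdks-java-io-kafka-2.16.0.jar"},
    )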
>> > >> >>> Meaning at runtime, or as another field statically set in the proto?
>> > >> >>
>> > >> >> I don't believe runners/SDKs should have to know what artifacts each environment supports at runtime and instead have environments enumerate them explicitly in the proto. I have been thinking about a more general "capabilities" block on environments which allows them to enumerate URNs that the environment understands. This would include artifact type URNs, PTransform URNs, coder URNs, ... I haven't proposed anything specific down this line yet because I was wondering how environment resources (CPU, min memory, hardware like GPU, AWS/GCP/Azure/... machine types) should/could tie into this.
>> > >> >>
>> > >> >>>> Having everyone have the same "artifact" representation would be beneficial since:
>> > >> >>>> a) Python environments could install dependencies from a requirements.txt file (something that the Google Cloud Dataflow Python docker container allows for today)
>> > >> >>>> b) It provides an extensible and versioned mechanism for SDKs, environments, and artifact staging/retrieval services to support additional artifact types
>> > >> >>>> c) It allows for expressing a canonical representation of an artifact, like a Maven package, so a runner could merge environments that the runner deems compatible.
>> > >> >>>>
>> > >> >>>> The flow I could see is:
>> > >> >>>> 1) (optional) query artifact staging service for supported artifact types
>> > >> >>>> 2) SDK requests expansion service to expand transform, passing in a list of artifact types the SDK and artifact staging service support; the expansion service returns a list of artifact types limited to those supported types + any supported by the environment
>> > >> >>>
>> > >> >>> The crux of the issue seems to be how the expansion service returns the artifacts themselves. Is this going with the approach that the caller of the expansion service must host an artifact staging service?
>> > >> >>
>> > >> >> The caller would not need to host an artifact staging service (but would become effectively a proxy service, see my comment below for more details) as I would have expected this to be part of the expansion service response.
>> > >> >>
>> > >> >>> There is also the question of how the returned artifacts get attached to the various environments, or whether they get implicitly applied to all returned stages (which need not have a consistent environment)?
>> > >> >>
>> > >> >> I would suggest returning additional information that says which artifact is for which environment. Applying all artifacts to all environments is likely to cause issues since some environments may not understand certain artifact types or may get conflicting versions of artifacts. I would see this happening since an expansion service that aggregates other expansion services seems likely, for example:
>> > >> >>
>> > >> >>                              /-> ExpansionService(Python)
>> > >> >> ExpansionService(Aggregator) --> ExpansionService(Java)
>> > >> >>                              \-> ExpansionService(Go)
>> > >> >
>> > >> > All of this goes back to the idea that I think the listing of artifacts (or more general dependencies) should be a property of the environments themselves.
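Building on the artifact sketch above, the "dependencies as a property of the environment" idea, together with the "capabilities" block mentioned here, might look roughly like the following. Again, all field names are hypothetical and not an actual Beam definition.

    # Hypothetical sketch; reuses the Artifact class and URN constants from
    # the sketch above.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Environment:
        urn: str                                   # e.g. "beam:env:docker:v1"
        payload: bytes                             # e.g. the docker image name
        # Artifacts this environment needs, in canonical form, listed
        # statically in the proto rather than discovered at runtime.
        dependencies: List[Artifact] = field(default_factory=list)
        # URNs the environment understands: artifact types, coders,
        # PTransforms, ... (the "capabilities" block idea).
        capabilities: List[str] = field(default_factory=list)

    java_env = Environment(
        urn="beam:env:docker:v1",
        payload=b"apachebeam/java_sdk:2.16.0",
        dependencies=[maven_dep],
        capabilities=[ARTIFACT_TYPE_BYTES, ARTIFACT_TYPE_MAVEN],
    )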
>> > >> >>>> 3) SDK converts any artifact types that the artifact staging service or environment doesn't understand, e.g. pulls down Maven dependencies and converts them to "bytes" artifacts
>> > >> >>>
>> > >> >>> Here I think we're conflating two things. The "type" of an artifact is both (1) how to fetch the bytes and (2) how to interpret them (e.g. is this a jar file, or a pip tarball, or just some data needed by a DoFn, or ...). Only (1) can be freely transmuted.
>> > >> >>
>> > >> >> You're right. Thinking about this some more, general artifact conversion is unlikely to be practical because how to interpret an artifact is environment dependent. For example, a requirements.txt used to install pip packages for a Python docker container depends on the filesystem layout of that specific docker container. One could simulate doing a pip install on the same filesystem, see the diff, and then stage all of the packages in requirements.txt, but this quickly becomes impractical.
>> > >> >>
>> > >> >>>> 4) SDK sends artifacts to artifact staging service
>> > >> >>>> 5) Artifact staging service converts any artifacts to types that the environment understands
>> > >> >>>> 6) Environment is started and gets artifacts from the artifact retrieval service.
>> > >> >>>>
>> > >> >>>> On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw <[email protected]> wrote:
>> > >> >>>>>
>> > >> >>>>> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels <[email protected]> wrote:
>> > >> >>>>>>
>> > >> >>>>>> Good idea to let the client expose an artifact staging service that the ExpansionService could use to stage artifacts. This solves two problems:
>> > >> >>>>>>
>> > >> >>>>>> (1) The Expansion Service not being able to access the Job Server artifact staging service
>> > >> >>>>>> (2) The client not having access to the dependencies returned by the Expansion Service
>> > >> >>>>>>
>> > >> >>>>>> The downside is that it adds an additional indirection. The alternative, letting the client handle staging the artifacts returned by the Expansion Service, is more transparent and easier to implement.
>> > >> >>>>>
>> > >> >>>>> The other downside is that it may not always be possible for the expansion service to connect to the artifact staging service (e.g. when constructing a pipeline locally against a remote expansion service).
>> > >> >>>>
>> > >> >>>> Just to make sure, you're saying the expansion service would return all the artifacts (bytes, urls, ...) as part of the response since the expansion service wouldn't be able to connect to the SDK that is running locally either.
>> > >> >>>
>> > >> >>> Yes. Well, I'm more asking how the expansion service would return any artifacts.
>> > >> >>>
>> > >> >>> What we have is
>> > >> >>>
>> > >> >>>     Runner <--- SDK ---> Expansion service
>> > >> >>>
>> > >> >>> where the unidirectional arrow means "instantiates a connection with" and the other direction (and missing arrows) may not be possible.
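A rough illustration of step 3 and the fetch-vs-interpret distinction discussed above: the SDK changes only how the bytes are fetched (Maven coordinates become raw bytes), while how the environment should interpret them is preserved. The helper, role name, and classes reuse the hypothetical sketches above and are not real Beam APIs.

    def fetch_from_maven(coordinates):
        """Hypothetical helper that downloads the jar for given Maven coordinates."""
        raise NotImplementedError

    def transmute_for_staging(artifact, staging_supported_types):
        # If the staging service already understands this type, pass it through.
        if artifact.type_urn in staging_supported_types:
            return artifact
        if artifact.type_urn == ARTIFACT_TYPE_MAVEN:
            jar_bytes = fetch_from_maven(artifact.type_payload)
            return Artifact(
                type_urn=ARTIFACT_TYPE_BYTES,     # fetch aspect becomes raw bytes
                type_payload=jar_bytes,
                # interpretation (a jar destined for the Java classpath) is kept
                metadata={**artifact.metadata, "role": "java-classpath-jar"},
            )
        raise ValueError("Cannot stage artifact of type %s" % artifact.type_urn)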
>> > >> >> I believe the ExpansionService Expand request should become a unidirectional stream back to the caller so that artifacts could be sent back to the SDK (effectively mirroring the artifact staging service API). So the expansion service would stream back a bunch of artifact data messages and also the expansion response containing PTransform information.
>> > >> >
>> > >> > +1.
>> > >> >
>> > >> >>>>>> Ideally, the Expansion Service won't return any dependencies because the environment already contains the required dependencies. We could make it a requirement for the expansion to be performed inside an environment. Then we would already ensure during expansion time that the runtime dependencies are available.
>> > >> >>>>>
>> > >> >>>>> Yes, it's cleanest if the expansion service provides an environment with all the dependencies provided. Interesting idea to make this a property of the expansion service itself.
>> > >> >>>>
>> > >> >>>> I had thought this too, but an opaque docker container that was built on top of a base Beam docker container would be very difficult for a runner to introspect and check to see if it's compatible to allow for fusion across PTransforms. I think artifacts need to be communicated in their canonical representation.
>> > >> >>>
>> > >> >>> It's clean (from the specification point of view), but doesn't allow for good introspection/fusion (aside from one being a base of another, perhaps).
>> > >> >>>
>> > >> >>>>>>> In this case, the runner would (as requested by its configuration) be free to merge environments it deemed compatible, including swapping out beam-java-X for beam-java-embedded if it considers itself compatible with the dependency list.
>> > >> >>>>>>
>> > >> >>>>>> Could you explain how that would work in practice?
>> > >> >>>>>
>> > >> >>>>> Say one has a pipeline with environments
>> > >> >>>>>
>> > >> >>>>> A: beam-java-sdk-2.12-docker
>> > >> >>>>> B: beam-java-sdk-2.12-docker + dep1
>> > >> >>>>> C: beam-java-sdk-2.12-docker + dep2
>> > >> >>>>> D: beam-java-sdk-2.12-docker + dep3
>> > >> >>>>>
>> > >> >>>>> A runner could (conceivably) be intelligent enough to know that dep1 and dep2 are indeed compatible, and run A, B, and C in a single beam-java-sdk-2.12-docker + dep1 + dep2 environment (with the corresponding fusion and lower overhead benefits). If a certain pipeline option is set, it might further note that dep1 and dep2 are compatible with its own workers, which are built against sdk-2.12, and choose to run these in an embedded + dep1 + dep2 environment.
>> > >> >>>>
>> > >> >>>> We have been talking about the expansion service and cross-language transforms a lot lately, but I believe it will initially come at the cost of poor fusion of transforms since "merging" environments that are compatible is a difficult problem since it brings up many of the dependency management issues (e.g. diamond dependency issues).
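Under very optimistic assumptions, the A/B/C/D merging example above could be sketched as follows, reusing the hypothetical Artifact and Environment classes from the earlier sketches. Real compatibility checking would have to confront the diamond-dependency issues mentioned just above; this shows only the happy path.

    def try_merge(env_a, env_b):
        """Return a merged environment if a and b look compatible, else None."""
        if (env_a.urn, env_a.payload) != (env_b.urn, env_b.payload):
            return None                                  # different base images
        merged = {d.metadata.get("name"): d for d in env_a.dependencies}
        for dep in env_b.dependencies:
            existing = merged.get(dep.metadata.get("name"))
            if existing is not None and existing.type_payload != dep.type_payload:
                return None                              # conflicting versions
            merged[dep.metadata.get("name")] = dep
        return Environment(
            urn=env_a.urn,
            payload=env_a.payload,
            dependencies=list(merged.values()),
            capabilities=sorted(set(env_a.capabilities) | set(env_b.capabilities)),
        )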
>> > >> >>>
>> > >> >>> I agree. I think expansion services offering "kitchen-sink" containers, when possible, can go far here. If we could at least recognize when one environment/set of deps is a superset of another, that could be an easy case that would yield a lot of benefit as well.
>> > >> >>
>> > >> >> +1
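The simpler superset case mentioned above might be checked like this, again reusing the hypothetical classes from the earlier sketches; illustrative only.

    def is_superset(env_big, env_small):
        """True if stages using env_small could run in env_big."""
        if (env_big.urn, env_big.payload) != (env_small.urn, env_small.payload):
            return False                                 # different base images
        big = {(d.type_urn, d.type_payload) for d in env_big.dependencies}
        small = {(d.type_urn, d.type_payload) for d in env_small.dependencies}
        return small <= big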
