Hey Heejong,
I don't think so. It would be great to push this forward.
Thanks,
Max
On 26.11.19 02:49, Heejong Lee wrote:
Hi,
Is anyone actively working on the artifact staging extension for
cross-language pipelines? I'm thinking I can contribute to it this coming
December. If anyone has made progress on this and needs help, please let me know.
Thanks,
On Wed, Jun 12, 2019 at 2:42 AM Ismaël Mejía <ieme...@gmail.com> wrote:
Can you please add this to the design documents webpage?
https://beam.apache.org/contribute/design-documents/
On Wed, May 8, 2019 at 7:29 PM Chamikara Jayalath <chamik...@google.com> wrote:
>
>
>
> On Tue, May 7, 2019 at 10:21 AM Maximilian Michels <m...@apache.org> wrote:
>>
>> Here's the first draft:
>>
>> https://docs.google.com/document/d/1XaiNekAY2sptuQRIXpjGAyaYdSc-wlJ-VKjl04c8N48/edit?usp=sharing
>>
>> It's rather high-level. We may want to add more details once we have
>> finalized the design. Feel free to make comments and edits.
>
>
> Thanks Max. Added some comments.
>
>>
>>
>> > All of this goes back to the idea that I think the listing of
>> > artifacts (or more general dependencies) should be a property of the
>> > environments themselves.
>>
>> +1 I came to the same conclusion while thinking about how to store
>> artifact information for deferred execution of the pipeline.
>>
>> -Max
>>
>> On 07.05.19 18:10, Robert Bradshaw wrote:
>> > Looking forward to your writeup, Max. In the meantime, some comments below.
>> >
>> >
>> > From: Lukasz Cwik <lc...@google.com>
>> > Date: Thu, May 2, 2019 at 6:45 PM
>> > To: dev
>> >
>> >>
>> >>
>> >> On Thu, May 2, 2019 at 7:20 AM Robert Bradshaw <rober...@google.com> wrote:
>> >>>
>> >>> On Sat, Apr 27, 2019 at 1:14 AM Lukasz Cwik <lc...@google.com> wrote:
>> >>>>
>> >>>> We should stick with URN + payload + artifact metadata[1], where the
>> >>>> only mandatory type that all SDKs and expansion services understand
>> >>>> is the "bytes" artifact type. This allows us to add optional URNs for
>> >>>> file://, http://, Maven, PyPI, ... in the future. I would make the
>> >>>> artifact staging service use the same URN + payload mechanism to get
>> >>>> compatibility of artifacts across the different services, and also
>> >>>> make the artifact staging service queryable for the list of artifact
>> >>>> types it supports.
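>> >>>>
>> >>>> As a rough illustration (the names below are hypothetical sketches,
>> >>>> not the actual Beam protos), such an artifact could look like:
>> >>>>
>> >>>>   from dataclasses import dataclass
>> >>>>
>> >>>>   @dataclass
>> >>>>   class ArtifactMetadata:
>> >>>>       # URN naming the artifact type; only the "bytes" type would
>> >>>>       # be mandatory for all SDKs and expansion services.
>> >>>>       type_urn: str
>> >>>>       # Payload interpreted according to type_urn.
>> >>>>       type_payload: bytes
>> >>>>
>> >>>>   # The universally understood baseline: raw bytes.
>> >>>>   blob = ArtifactMetadata(
>> >>>>       type_urn="beam:artifact:type:bytes:v1",  # illustrative URN
>> >>>>       type_payload=b"...contents of a staged file...")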
>> >>>
>> >>> +1
>> >>>
>> >>>> Finally, we would need to have environments enumerate the artifact
>> >>>> types that they support.
>> >>>
>> >>> Meaning at runtime, or as another field statically set in the proto?
>> >>
>> >>
>> >> I don't believe runners/SDKs should have to discover at runtime which
>> >> artifact types each environment supports; instead, environments should
>> >> enumerate them explicitly in the proto. I have been thinking about a
>> >> more general "capabilities" block on environments, which would let them
>> >> enumerate the URNs that the environment understands. This would include
>> >> artifact type URNs, PTransform URNs, coder URNs, ... I haven't proposed
>> >> anything specific along this line yet because I was wondering how
>> >> environment resources (CPU, minimum memory, hardware like GPUs,
>> >> AWS/GCP/Azure/... machine types) should/could tie into this.
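>> >>
>> >> To sketch the idea (the field names are hypothetical; nothing here is
>> >> specified yet), such a capabilities block might look like:
>> >>
>> >>   from dataclasses import dataclass, field
>> >>   from typing import List
>> >>
>> >>   @dataclass
>> >>   class Environment:
>> >>       urn: str        # e.g. "beam:env:docker:v1"
>> >>       payload: bytes  # e.g. a docker image name
>> >>       # Hypothetical capabilities block: URNs this environment
>> >>       # understands (artifact types, PTransforms, coders, ...).
>> >>       capabilities: List[str] = field(default_factory=list)
>> >>
>> >>   env = Environment(
>> >>       urn="beam:env:docker:v1",
>> >>       payload=b"apachebeam/java_sdk:2.12.0",  # illustrative image
>> >>       capabilities=["beam:artifact:type:bytes:v1"])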
>> >>
>> >>>
>> >>>> Having everyone share the same "artifact" representation would be
>> >>>> beneficial since:
>> >>>> a) Python environments could install dependencies from a
>> >>>> requirements.txt file (something that the Google Cloud Dataflow
>> >>>> Python docker container allows for today)
>> >>>> b) It provides an extensible and versioned mechanism for SDKs,
>> >>>> environments, and artifact staging/retrieval services to support
>> >>>> additional artifact types
>> >>>> c) It allows for expressing a canonical representation of an
>> >>>> artifact, like a Maven package, so a runner could merge environments
>> >>>> that it deems compatible (see the sketch below for an example)
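>> >>>>
>> >>>> For (c), reusing the ArtifactMetadata sketch above, a Maven package
>> >>>> could be expressed canonically rather than as opaque bytes (the URN
>> >>>> and payload encoding here are purely illustrative):
>> >>>>
>> >>>>   import json
>> >>>>
>> >>>>   maven_dep = ArtifactMetadata(
>> >>>>       type_urn="beam:artifact:type:maven:v1",  # hypothetical URN
>> >>>>       type_payload=json.dumps({
>> >>>>           "group": "org.apache.commons",
>> >>>>           "artifact": "commons-lang3",
>> >>>>           "version": "3.9",
>> >>>>       }).encode("utf-8"))
>> >>>>
>> >>>> Two environments pulling in the same coordinates would then be
>> >>>> trivially recognizable as sharing that dependency.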
>> >>>>
>> >>>> The flow I could see is:
>> >>>> 1) (optional) query the artifact staging service for supported
>> >>>> artifact types
>> >>>> 2) SDK requests the expansion service to expand the transform,
>> >>>> passing in a list of artifact types the SDK and artifact staging
>> >>>> service support; the expansion service returns a list of artifact
>> >>>> types limited to those supported types + any supported by the
>> >>>> environment
>> >>>
>> >>> The crux of the issue seems to be how the expansion service returns
>> >>> the artifacts themselves. Is this going with the approach that the
>> >>> caller of the expansion service must host an artifact staging service?
>> >>
>> >>
>> >> The caller would not need to host an artifact staging service (but
>> >> would effectively become a proxy service; see my comment below for
>> >> more details), as I would have expected this to be part of the
>> >> expansion service response.
>> >>
>> >>>
>> >>> There is also the question of how the returned artifacts get
>> >>> attached to the various environments, or whether they get implicitly
>> >>> applied to all returned stages (which need not have a consistent
>> >>> environment)?
>> >>
>> >>
>> >> I would suggest returning additional information that says which
>> >> artifact is for which environment. Applying all artifacts to all
>> >> environments is likely to cause issues, since some environments may
>> >> not understand certain artifact types or may get conflicting versions
>> >> of artifacts. I would expect this to happen since an expansion service
>> >> that aggregates other expansion services seems likely, for example:
>> >>
>> >>                               /-> ExpansionService(Python)
>> >> ExpansionService(Aggregator) ---> ExpansionService(Java)
>> >>                               \-> ExpansionService(Go)
>> >
>> > All of this goes back to the idea that I think the listing of
>> > artifacts (or more general dependencies) should be a property of the
>> > environments themselves.
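>> >
>> > Concretely, reusing the ArtifactMetadata and Environment sketches from
>> > above (field names still hypothetical), that would mean the artifact
>> > list travels with the environment rather than alongside it:
>> >
>> >   from dataclasses import dataclass, field
>> >   from typing import List
>> >
>> >   @dataclass
>> >   class Environment:
>> >       urn: str
>> >       payload: bytes
>> >       # Dependencies as a property of the environment itself: an
>> >       # expansion response then needs no separate
>> >       # artifact-to-environment mapping, aggregated or not.
>> >       dependencies: List["ArtifactMetadata"] = field(default_factory=list)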
>> >
>> >>>> 3) SDK converts any artifact types that the artifact staging
>> >>>> service or environment doesn't understand, e.g. pulls down Maven
>> >>>> dependencies and converts them to "bytes" artifacts
>> >>>
>> >>> Here I think we're conflating two things. The "type" of an artifact
>> >>> is both (1) how to fetch the bytes and (2) how to interpret them
>> >>> (e.g. is this a jar file, or a pip tarball, or just some data needed
>> >>> by a DoFn, or ...). Only (1) can be freely transmuted.
>> >>
>> >>
>> >> You're right. Thinking about this some more, general artifact
>> >> conversion is unlikely to be practical because how to interpret an
>> >> artifact is environment dependent. For example, a requirements.txt
>> >> used to install pip packages for a Python docker container depends on
>> >> the filesystem layout of that specific docker container. One could
>> >> simulate doing a pip install on the same filesystem, take the diff,
>> >> and repeat this for all of the packages in requirements.txt, but this
>> >> quickly becomes impractical.
>> >>
>> >>>
>> >>>> 4) SDK sends artifacts to the artifact staging service
>> >>>> 5) Artifact staging service converts any artifacts to types that
>> >>>> the environment understands
>> >>>> 6) Environment is started and gets artifacts from the artifact
>> >>>> retrieval service. (A rough sketch of the whole flow follows.)
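>> >>>>
>> >>>> In rough pseudocode from the SDK's side (all service handles and
>> >>>> helpers here are hypothetical stand-ins, not an existing Beam API):
>> >>>>
>> >>>>   def expand_and_stage(staging, expansion, transform):
>> >>>>       # 1) Ask the staging service which artifact types it supports.
>> >>>>       supported = staging.supported_artifact_types()
>> >>>>       # 2) Expand, advertising the types the SDK + staging service
>> >>>>       #    can handle; the response lists artifacts per environment.
>> >>>>       response = expansion.expand(transform, supported)
>> >>>>       # 3) Convert anything unsupported to the mandatory "bytes"
>> >>>>       #    type (e.g. pull down Maven jars locally).
>> >>>>       artifacts = [a if a.type_urn in supported
>> >>>>                    else convert_to_bytes(a)  # hypothetical helper
>> >>>>                    for a in response.artifacts]
>> >>>>       # 4) Stage them; the staging service may convert further (5),
>> >>>>       #    and the environment fetches them at startup via the
>> >>>>       #    retrieval service (6).
>> >>>>       staging.stage(artifacts)
>> >>>>       return response.transform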
>> >>>>
>> >>>> On Wed, Apr 24, 2019 at 4:44 AM Robert Bradshaw <rober...@google.com> wrote:
>> >>>>>
>> >>>>> On Wed, Apr 24, 2019 at 12:21 PM Maximilian Michels <m...@apache.org> wrote:
>> >>>>>>
>> >>>>>> Good idea to let the client expose an artifact staging service
>> >>>>>> that the ExpansionService could use to stage artifacts. This
>> >>>>>> solves two problems:
>> >>>>>>
>> >>>>>> (1) The Expansion Service not being able to access the Job
>> >>>>>> Server's artifact staging service
>> >>>>>> (2) The client not having access to the dependencies returned by
>> >>>>>> the Expansion Service
>> >>>>>>
>> >>>>>> The downside is that it adds an additional indirection. The
>> >>>>>> alternative, letting the client handle staging of the artifacts
>> >>>>>> returned by the Expansion Service, is more transparent and easier
>> >>>>>> to implement.
>> >>>>>
>> >>>>> The other downside is that it may not always be possible for the
>> >>>>> expansion service to connect to the artifact staging service (e.g.
>> >>>>> when constructing a pipeline locally against a remote expansion
>> >>>>> service).
>> >>>>
>> >>>> Just to make sure: you're saying the expansion service would return
>> >>>> all the artifacts (bytes, URLs, ...) as part of the response, since
>> >>>> the expansion service wouldn't be able to connect to the SDK that is
>> >>>> running locally either.
>> >>>
>> >>> Yes. Well, it's more that I'm asking how the expansion service would
>> >>> return any artifacts.
>> >>>
>> >>> What we have is
>> >>>
>> >>> Runner <--- SDK ---> Expansion service.
>> >>>
>> >>> Where the unidirectional arrow means "instantiates a connection
>> >>> with" and the other direction (and missing arrows) may not be
>> >>> possible.
>> >>
>> >>
>> >> I believe the ExpansionService Expand request should become a
>> >> unidirectional stream back to the caller so that artifacts can be
>> >> sent back to the SDK (effectively mirroring the artifact staging
>> >> service API). The response would then stream back a series of
>> >> artifact data messages, followed by the expansion result containing
>> >> the PTransform information.
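>> >>
>> >> Shape-wise (purely illustrative, using a Python generator to stand in
>> >> for a server-streaming gRPC method; none of these names exist today),
>> >> the server side could be:
>> >>
>> >>   def expand(request):
>> >>       # Stream artifact chunks first, mirroring the artifact staging
>> >>       # API, so the caller never has to dial back to us.
>> >>       for artifact in artifacts_for(request.transform):  # hypothetical
>> >>           for chunk in chunks_of(artifact):               # hypothetical
>> >>               yield ExpansionResponse(artifact_chunk=chunk)
>> >>       # Finish with the expanded PTransform itself.
>> >>       yield ExpansionResponse(transform=do_expand(request.transform))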
>> >
>> > +1.
>> >
>> >>>>>> Ideally, the Expansion Service won't return any dependencies
>> >>>>>> because the environment already contains the required
>> >>>>>> dependencies. We could make it a requirement for the expansion
>> >>>>>> to be performed inside an environment. Then we would already
>> >>>>>> ensure during expansion time that the runtime dependencies are
>> >>>>>> available.
>> >>>>>
>> >>>>> Yes, it's cleanest if the expansion service provides an
>> >>>>> environment with all the dependencies already provided.
>> >>>>> Interesting idea to make this a property of the expansion service
>> >>>>> itself.
>> >>>>
>> >>>> I had thought this too, but an opaque docker container that was
>> >>>> built on top of a base Beam docker container would be very
>> >>>> difficult for a runner to introspect and check for compatibility to
>> >>>> allow fusion across PTransforms. I think artifacts need to be
>> >>>> communicated in their canonical representation.
>> >>>
>> >>> It's clean (from the specification point of view), but doesn't
>> >>> allow for good introspection/fusion (aside from one being a base of
>> >>> another, perhaps).
>> >>>
>> >>>>>>> In this case, the runner would (as requested by its
>> >>>>>>> configuration) be free to merge environments it deemed
>> >>>>>>> compatible, including swapping out beam-java-X for
>> >>>>>>> beam-java-embedded if it considers itself compatible with the
>> >>>>>>> dependency list.
>> >>>>>>
>> >>>>>> Could you explain how that would work in practice?
>> >>>>>
>> >>>>> Say one has a pipeline with environments
>> >>>>>
>> >>>>> A: beam-java-sdk-2.12-docker
>> >>>>> B: beam-java-sdk-2.12-docker + dep1
>> >>>>> C: beam-java-sdk-2.12-docker + dep2
>> >>>>> D: beam-java-sdk-2.12-docker + dep3
>> >>>>>
>> >>>>> A runner could (conceivably) be intelligent enough to know that
>> >>>>> dep1 and dep2 are indeed compatible, and run A, B, and C in a
>> >>>>> single beam-java-sdk-2.12-docker + dep1 + dep2 environment (with
>> >>>>> the corresponding fusion and lower-overhead benefits). If a
>> >>>>> certain pipeline option is set, it might further note that dep1
>> >>>>> and dep2 are compatible with its own workers, which are built
>> >>>>> against sdk-2.12, and choose to run these in an embedded + dep1 +
>> >>>>> dep2 environment.
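>> >>>>>
>> >>>>> A toy version of that merge decision (real compatibility checking
>> >>>>> is much harder, as noted below; these types are hypothetical):
>> >>>>>
>> >>>>>   from dataclasses import dataclass
>> >>>>>   from typing import Dict, Optional
>> >>>>>
>> >>>>>   @dataclass
>> >>>>>   class Env:
>> >>>>>       base: str             # e.g. "beam-java-sdk-2.12-docker"
>> >>>>>       deps: Dict[str, str]  # dependency name -> version
>> >>>>>
>> >>>>>   def try_merge(a: Env, b: Env) -> Optional[Env]:
>> >>>>>       # Naively treat "same base image, no version conflicts" as
>> >>>>>       # compatible; real dependency resolution cannot assume this.
>> >>>>>       if a.base != b.base:
>> >>>>>           return None
>> >>>>>       merged = dict(a.deps)
>> >>>>>       for name, version in b.deps.items():
>> >>>>>           if merged.get(name, version) != version:
>> >>>>>               return None  # conflicting versions: keep separate
>> >>>>>           merged[name] = version
>> >>>>>       return Env(a.base, merged)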
>> >>>>
>> >>>> We have been talking about the expansion service and cross-language
>> >>>> transforms a lot lately, but I believe it will initially come at
>> >>>> the cost of poor fusion of transforms: "merging" compatible
>> >>>> environments is a difficult problem, since it brings up many of the
>> >>>> classic dependency management issues (e.g. diamond dependencies).
>> >>>
>> >>> I agree. I think expansion services offering "kitchen-sink"
>> >>> containers, when possible, can go far here. If we could at least
>> >>> recognize when one environment/set of deps is a superset of another,
>> >>> that could be an easy case that would yield a lot of benefit as well.
>> >>
>> >>
>> >> +1