Good idea to let the client expose an artifact staging service that the Expansion Service could use to stage artifacts. This solves two problems:

(1) The Expansion Service not being able to access the Job Server's artifact staging service
(2) The client not having access to the dependencies returned by the Expansion Service

The downside is that it adds another indirection. The alternative, letting the client handle staging of the artifacts returned by the Expansion Service, is more transparent and easier to implement.
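
A rough sketch of that client-side flow (hypothetical names, not actual
Beam APIs):

    # Hypothetical sketch: the client expands the transform, then stages
    # the returned artifacts itself, using its own credentials.
    def expand_and_stage(expansion_stub, staging_stub, request, fetch):
        response = expansion_stub.Expand(request)  # hypothetical RPC stub
        for artifact in response.artifacts:        # e.g. a list of jar URLs
            local_path = fetch(artifact)           # resolve URL/blob to a file
            staging_stub.stage(local_path)         # client-side staging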

Ideally, the Expansion Service wouldn't have to return any dependencies because the environment already contains them. We could make it a requirement that the expansion be performed inside such an environment; then we would already ensure at expansion time that the runtime dependencies are available.

> In this case, the runner would (as requested by its configuration) be
> free to merge environments it deemed compatible, including swapping out
> beam-java-X for beam-java-embedded if it considers itself compatible
> with the dependency list.

Could you explain how that would work in practice?

-Max

On 24.04.19 04:11, Heejong Lee wrote:


On Tue, Apr 23, 2019 at 2:07 AM Robert Bradshaw <rober...@google.com <mailto:rober...@google.com>> wrote:

    I've been out, so coming a bit late to the discussion, but here's my
    thoughts.

    The expansion service absolutely needs to be able to provide the
    dependencies for the transform(s) it expands. It seems the default,
    foolproof way of doing this is via the environment, which can be a
    docker image with all the required dependencies. Anything more than
    this is an (arguably important, but possibly messy) optimization.

    The standard way to provide artifacts outside of the environment is
    via the artifact staging service. Of course, the expansion service may
    not have access to the (final) artifact staging service (due to
    permissions, locality, or it may not even be started up yet) but the
    SDK invoking the expansion service could offer an artifact staging
    endpoint for the expansion service to publish artifacts to. However,
    there are some difficulties here, in particular avoiding name
    collisions with staged artifacts and assigning semantic meaning to
    the artifacts (e.g. should jar files automatically be placed on the
    classpath, and Python packages be recognized and installed at
    startup?). The alternative is
    going with a (type, pointer) scheme for naming dependencies; if we go
    this route I think we should consider migrating all artifact staging
    to this style. I am concerned that the "file" version will be less
    than useful for what will become the most convenient expansion
    services (namely, hosted and docker image). I am still at a loss,
    however, as to how to solve the diamond dependency problem among
    dependencies--perhaps the information is there if one walks
    maven/pypi/go modules/... but do we expect every runner to know about
    every packaging platform? This also wouldn't solve the issue if fat
    jars are used as dependencies. The only safe thing to do here is to
    force distinct dependency sets to live in different environments,
    which could be too conservative.

    This all leads me to think that perhaps the environment itself should
    be docker image (often one of "vanilla" beam-java-x.y ones) +
    dependency list, rather than have the dependency/artifact list as some
    kind of data off to the side. In this case, the runner would (as
    requested by its configuration) be free to merge environments it
    deemed compatible, including swapping out beam-java-X for
    beam-java-embedded if it considers itself compatible with the
    dependency list.


I like this idea of building multiple docker environments on top of a bare-minimum SDK harness container and allowing runners to pick a suitable one based on a dependency list.
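
As a sketch (illustrative only, not an actual Beam proto), such an
environment description might look like:

    # Hypothetical: an environment expressed as a base docker image plus
    # an explicit dependency list the runner can reason about.
    environment = {
        'docker_image': '<beam-java-sdk-image>',  # "vanilla" beam-java-x.y
        'dependencies': [
            {'type': 'MAVEN',
             'payload': {'groupId': 'org.apache.beam',
                         'artifactId': 'beam-sdks-java-io-kafka',
                         'version': '<version>'}},
        ],
    }
    # A runner could merge two such environments if the images match and
    # the dependency lists are compatible, or swap the image for an
    # embedded environment it deems compatible with the dependency list.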



    I agree with Thomas that we'll want to make expansion services, and
    the transforms they offer, more discoverable. The whole life cycle
    of expansion services is something that has yet to be fully fleshed
    out, and may influence some of these decisions.

    As for adding --jar_package to the Python SDK, this seems really
    specific to calling java-from-python (would we have O(n^2) such
    options?) as well as out-of-place for a Python user to specify. I
    would really hope we can figure out a more generic solution. If we
    need this option in the meantime, let's at least make it clear
    (probably in the name) that it's temporary.


Good points. I second that we need a more generic solution than a Python-to-Java-specific option. Instead of naming it differently, we could make --jar_package a secondary option under --experiments in the meantime. WDYT?
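
For example (hypothetical spelling, assuming experiments take key=value
strings):

    --experiments=jar_packages=/path/to/kafka-io.jar,/path/to/other.jar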


    On Tue, Apr 23, 2019 at 1:08 AM Thomas Weise <t...@apache.org
    <mailto:t...@apache.org>> wrote:
     >
     > One more suggestion:
     >
     > It would be nice to be able to select the environment for the
    external transforms. For example, I would like to be able to use
    EMBEDDED for Flink. That's implicit for sources that are translated
    to runner-native unbounded reads, but it should also be possible
    for writes. That would then be similar to how pipelines are packaged
    and run with the "legacy" runner.
     >
     > Thomas
     >
     >
     > On Mon, Apr 22, 2019 at 1:18 PM Ankur Goenka <goe...@google.com
    <mailto:goe...@google.com>> wrote:
     >>
     >> Great discussion!
     >> I have a few points around the structure of the proto, but that
    is less important as it can evolve.
     >> However, I think that artifact compatibility is another
    important aspect to look at.
     >> Example: TransformA uses Guava >=1.6,<1.7; TransformB uses
    >=1.8,<1.9; and TransformC uses >=1.6,<1.8. As the SDK provides the
    environment for each transform, it cannot simply say
    EnvironmentJava for both TransformA and TransformB, as their
    dependencies are not compatible.
     >> We should have separate environments associated with TransformA
    and TransformB in this case.
     >>
     >> To support this case, we need two things:
     >> 1: Granular metadata about each dependency, including its type.
     >> 2: The complete list of the transforms to be expanded.
     >>
     >> Elaboration:
     >> The compatibility check can be done in a crude way if we provide
    all the metadata about the dependencies to the expansion service.
     >> Also, the expansion service should expand all the applicable
    transforms in a single call so that it knows about incompatibilities
    and can create separate environments for those transforms. So in the
    above example, the expansion service would associate EnvA with
    TransformA, EnvB with TransformB, and EnvA with TransformC. This will
    of course require changes to the Expansion Service proto, but giving
    all the information to the expansion service will let it support more
    cases and make it a bit more future-proof.
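     >>
     >> As a crude sketch (illustrative Python; the version-range encoding
    is hypothetical), the overlap check could look like:
     >>
     >>     # Each transform declares half-open version ranges, e.g.
     >>     # {'guava': (1.6, 1.7)} meaning >=1.6 and <1.7.
     >>     def compatible(deps_a, deps_b):
     >>         for name in deps_a.keys() & deps_b.keys():
     >>             lo = max(deps_a[name][0], deps_b[name][0])
     >>             hi = min(deps_a[name][1], deps_b[name][1])
     >>             if lo >= hi:  # the required ranges do not overlap
     >>                 return False
     >>         return True
     >>
     >>     # TransformA {'guava': (1.6, 1.7)} and TransformC
     >>     # {'guava': (1.6, 1.8)} overlap, so they can share EnvA;
     >>     # TransformB {'guava': (1.8, 1.9)} does not and gets EnvB.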
     >>
     >>
     >> On Mon, Apr 22, 2019 at 10:16 AM Maximilian Michels
    <m...@apache.org <mailto:m...@apache.org>> wrote:
     >>>
     >>> Thanks for the summary, Cham. All makes sense. I agree that we
    want to
     >>> keep the option to manually specify artifacts.
     >>>
     >>> > There are a few unanswered questions though.
     >>> > (1) In what form will a transform author specify dependencies?
     >>> > For example, a URL to a Maven repo, a URL to a local file, a blob?
     >>>
     >>> Going forward, we probably want to support multiple ways. For
    now, we
     >>> could stick with a URL-based approach with support for
    different file
    systems. In the future, a list of packages to retrieve from
    Maven/PyPI
     >>> would be useful.
     >>>
     >> We can ask the user for (type, metadata). For Maven it can be
    something like (MAVEN, {groupId: com.google.guava, artifactId: guava,
    version: 19}) or (FILE, file://myfile).
     >> To begin with, we can support only a few types like FILE and can
    add more types in the future.
     >>>
     >>> > (2) How will dependencies be included in the expansion
    response proto? String (URL), bytes (blob)?
     >>>
     >>> I'd go for a list of Protobuf strings first, but the format
    would have to
     >>> evolve for other dependency types.
     >>>
     >> Here also, (type, payload) should suffice. We can have an
    interpreter for each type to translate the payload.
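     >>
     >> As a sketch (illustrative Python; all names are hypothetical), such
    per-type interpreters could look like:
     >>
     >>     # Hypothetical registry mapping a dependency type to a handler
     >>     # that turns the payload into locally staged files.
     >>     def resolve_maven_artifact(payload):
     >>         # e.g. download groupId/artifactId/version from a Maven
     >>         # repository and return the local jar paths.
     >>         ...
     >>
     >>     INTERPRETERS = {
     >>         'FILE': lambda payload: [payload],  # payload is a path/URL
     >>         'MAVEN': resolve_maven_artifact,
     >>     }
     >>
     >>     def materialize(dep_type, payload):
     >>         return INTERPRETERS[dep_type](payload)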
     >>>
     >>> > (3) How will we manage/share transitive dependencies required
    at runtime?
     >>>
     >>> I'd say transitive dependencies have to be included in the
    list. In the case
     >>> of fat jars, they are reduced to a single jar.
     >>
     >> Makes sense.
     >>>
     >>>
     >>> > (4) How will dependencies be staged for various runner/SDK
    combinations? (for example, portable runner/Flink, Dataflow runner)
     >>>
     >>> Staging should be no different than it is now, i.e. go through
    Beam's
     >>> artifact staging service. As long as the protocol is stable,
    there could
     >>> also be different implementations.
     >>
     >> Makes sense.
     >>>
     >>>
     >>> -Max
     >>>
     >>> On 20.04.19 03:08, Chamikara Jayalath wrote:
     >>> > OK, sounds like this is a good path forward then.
     >>> >
     >>> > * When starting up the expansion service, the user (who starts
     >>> > up the service) provides the dependencies necessary to expand
     >>> > transforms. We will later add support for adding new transforms
     >>> > to an already running expansion service.
     >>> > * As a part of the transform configuration, the transform author
     >>> > has the option of providing a list of dependencies that will be
     >>> > needed to run the transform.
     >>> > * These dependencies will be sent back to the pipeline SDK as a
     >>> > part of the expansion response, and the pipeline SDK will stage
     >>> > these resources.
     >>> > * The pipeline author has the option of specifying the
     >>> > dependencies using a pipeline option (for example,
     >>> > https://github.com/apache/beam/pull/8340).
     >>> >
     >>> > I think the last option is important to (1) make existing
     >>> > transforms easily available for cross-language usage without
     >>> > additional configuration, and (2) allow pipeline authors to
     >>> > override dependency versions specified in the transform
     >>> > configuration (for example, to apply security patches) without
     >>> > updating the expansion service.
     >>> >
     >>> > There are a few unanswered questions though.
     >>> > (1) In what form will a transform author specify dependencies?
     >>> > For example, a URL to a Maven repo, a URL to a local file, a blob?
     >>> > (2) How will dependencies be included in the expansion response
     >>> > proto? String (URL), bytes (blob)?
     >>> > (3) How will we manage/share transitive dependencies required at
     >>> > runtime?
     >>> > (4) How will dependencies be staged for various runner/SDK
     >>> > combinations? (for example, portable runner/Flink, Dataflow runner)
     >>> >
     >>> > Thanks,
     >>> > Cham
     >>> >
     >>> > On Fri, Apr 19, 2019 at 4:49 AM Maximilian Michels
    <m...@apache.org <mailto:m...@apache.org>
     >>> > <mailto:m...@apache.org <mailto:m...@apache.org>>> wrote:
     >>> >
     >>> >     Thank you for your replies.
     >>> >
     >>> >     I did not suggest that the Expansion Service does the
    staging, but it
     >>> >     would return the required resources (e.g. jars) for the
    external
     >>> >     transform's runtime environment. The client then has to
    take care of
     >>> >     staging the resources.
     >>> >
     >>> >     The Expansion Service itself also needs resources to do the
     >>> >     expansion. I
     >>> >     assumed those to be provided when starting the expansion
    service. I
     >>> >     consider it less important but we could also provide a
    way to add new
     >>> >     transforms to the Expansion Service after startup.
     >>> >
     >>> >     Good point on Docker vs externally provided environments.
    For the PR
     >>> >     [1]
     >>> >     it will suffice then to add Kafka to the container
    dependencies. The
     >>> >     "--jar_package" pipeline option is ok for now but I'd
    like to see work
     >>> >     towards staging resources for external transforms via
    information
     >>> >     returned by the Expansion Service. That avoids users
    having to take
     >>> >     care
     >>> >     of including the correct jars in their pipeline options.
     >>> >
     >>> >     These issues are related and we could discuss them in
    separate threads:
     >>> >
     >>> >     * Auto-discovery of Expansion Service and its external
    transforms
     >>> >     * Credentials required during expansion / runtime
     >>> >
     >>> >     Thanks,
     >>> >     Max
     >>> >
     >>> >     [1] https://github.com/apache/beam/pull/8322
     >>> >
     >>> >     On 19.04.19 07:35, Thomas Weise wrote:
     >>> >      > Good discussion :)
     >>> >      >
     >>> >      > Initially the expansion service was considered a user
     >>> >     responsibility,
     >>> >      > but I think that isn't necessarily the case. I can
    also see the
     >>> >      > expansion service provided as part of the
    infrastructure and the
     >>> >     user
     >>> >      > not wanting to deal with it at all. For example, users
    may want
     >>> >     to write
     >>> >      > Python transforms and use external IOs, without being
    concerned how
     >>> >      > these IOs are provided. Under such a scenario it would
    be good if:
     >>> >      >
     >>> >      > * Expansion service(s) can be auto-discovered via the
    job service
     >>> >     endpoint
     >>> >      > * Available external transforms can be discovered via
    the expansion
     >>> >      > service(s)
     >>> >      > * Dependencies for external transforms are part of the
    metadata
     >>> >     returned
     >>> >      > by expansion service
     >>> >      >
     >>> >      > Dependencies could then be staged either by the SDK
    client or the
     >>> >      > expansion service. The expansion service could provide the
     >>> >     locations to
     >>> >      > stage to the SDK, it would still be transparent to the
    user.
     >>> >      >
     >>> >      > I also agree with Luke regarding the environments.
    Docker is the
     >>> >     choice
     >>> >      > for generic deployment. Other environments are used
    when the
     >>> >     flexibility
     >>> >      > offered by Docker isn't needed (or gets into the way).
    Then the
     >>> >      > dependencies are provided in different ways. Whether
    these are
     >>> >     Python
     >>> >      > packages or jar files, by opting out of Docker the
    decision is
     >>> >     made to
     >>> >      > manage dependencies externally.
     >>> >      >
     >>> >      > Thomas
     >>> >      >
     >>> >      >
     >>> >      > On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath
     >>> >     <chamik...@google.com <mailto:chamik...@google.com>
    <mailto:chamik...@google.com <mailto:chamik...@google.com>>
     >>> >      > <mailto:chamik...@google.com
    <mailto:chamik...@google.com> <mailto:chamik...@google.com
    <mailto:chamik...@google.com>>>> wrote:
     >>> >      >
     >>> >      >
     >>> >      >
     >>> >      >     On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath
     >>> >      >     <chamik...@google.com
    <mailto:chamik...@google.com> <mailto:chamik...@google.com
    <mailto:chamik...@google.com>>
     >>> >     <mailto:chamik...@google.com
    <mailto:chamik...@google.com> <mailto:chamik...@google.com
    <mailto:chamik...@google.com>>>> wrote:
     >>> >      >
     >>> >      >         Thanks for raising the concern about
    credentials Ankur, I
     >>> >     agree
     >>> >      >         that this is a significant issue.
     >>> >      >
     >>> >      >         On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik
     >>> >     <lc...@google.com <mailto:lc...@google.com>
    <mailto:lc...@google.com <mailto:lc...@google.com>>
     >>> >      >         <mailto:lc...@google.com
    <mailto:lc...@google.com> <mailto:lc...@google.com
    <mailto:lc...@google.com>>>> wrote:
     >>> >      >
     >>> >      >             I can understand the concern about
    credentials, the same
     >>> >      >             access concern will exist for several
    cross language
     >>> >      >             transforms (mostly IOs) since some will
    need access to
     >>> >      >             credentials to read/write to an external
    service.
     >>> >      >
     >>> >      >             Are there any ideas on how credential
    propagation
     >>> >     could work
     >>> >      >             to these IOs?
     >>> >      >
     >>> >      >
     >>> >      >         There are some cases where existing IO
    transforms need
     >>> >      >         credentials to access remote resources, for
    example, size
     >>> >      >         estimation, validation, etc. But usually these are
     >>> >     optional (or
     >>> >      >         the transform can be configured not to perform
    these functions).
     >>> >      >
     >>> >      >
     >>> >      >     To clarify, I'm only talking about transform
    expansion here.
     >>> >     Many IO
     >>> >      >     transforms need read/write access to remote
    services at run
     >>> >     time. So
     >>> >      >     probably we need to figure out a way to propagate
    these
     >>> >     credentials
     >>> >      >     anyways.
     >>> >      >
     >>> >      >             Can we use these mechanisms for staging?
     >>> >      >
     >>> >      >
     >>> >      >         I think we'll have to find a way to do one of
    (1) propagate
     >>> >      >         credentials to other SDKs (2) allow users to
    configure SDK
     >>> >      >         containers to have necessary credentials (3)
    do the artifact
     >>> >      >         staging from the pipeline SDK environment
    which already have
     >>> >      >         credentials. I prefer (1) or (2) since this
    will give a
     >>> >      >         transform the same feature set whether used
    directly (in the same
     >>> >      >         SDK language as the transform) or remotely but
    it might
     >>> >     be hard
     >>> >      >         to do this for an arbitrary service that a
    transform might
     >>> >      >         connect to considering the number of ways
    users can configure
     >>> >      >         credentials (after an offline discussion with
    Ankur).
     >>> >      >
     >>> >      >
     >>> >      >             On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka
     >>> >      >             <goe...@google.com
    <mailto:goe...@google.com> <mailto:goe...@google.com
    <mailto:goe...@google.com>>
     >>> >     <mailto:goe...@google.com <mailto:goe...@google.com>
    <mailto:goe...@google.com <mailto:goe...@google.com>>>> wrote:
     >>> >      >
     >>> >      >                 I agree that the Expansion service
    knows about the
     >>> >      >                 artifacts required for a cross
    language transform and
     >>> >      >                 having a prepackaged folder/zip for
    transforms
     >>> >     based on
     >>> >      >                 language makes sense.
     >>> >      >
     >>> >      >                 One thing to note here is that
    expansion service
     >>> >     might
     >>> >      >                 not have the same access privilege as
    the pipeline
     >>> >      >                 author and hence might not be able to
    stage
     >>> >     artifacts by
     >>> >      >                 itself.
     >>> >      >                 Keeping this in mind I am leaning
    towards making
     >>> >      >                 the expansion service provide all the required
     >>> >     artifacts to
     >>> >      >                 the user and letting the user stage the
    artifacts as
     >>> >     regular
     >>> >      >                 artifacts.
     >>> >      >                 At this time, we only have Beam File
    System based
     >>> >      >                 artifact staging, which uses local
    credentials to
     >>> >     access
     >>> >      >                 different file systems. Even a docker
    based expansion
     >>> >      >                 service running on local machine might
    not have
     >>> >     the same
     >>> >      >                 access privileges.
     >>> >      >
     >>> >      >                 In brief, this is what I am leaning toward:
     >>> >      >                 the user calls for pipeline submission ->
     >>> >      >                 the expansion service provides cross-language
     >>> >      >                 transforms and relevant artifacts to the SDK
     >>> >      >                 -> the SDK submits the pipeline to the Job
     >>> >      >                 Server and stages user and cross-language
     >>> >      >                 artifacts to the artifact staging service.
     >>> >      >
     >>> >      >
     >>> >      >                 On Thu, Apr 18, 2019 at 2:33 PM
    Chamikara Jayalath
     >>> >      >                 <chamik...@google.com
    <mailto:chamik...@google.com>
     >>> >     <mailto:chamik...@google.com
    <mailto:chamik...@google.com>> <mailto:chamik...@google.com
    <mailto:chamik...@google.com>
     >>> >     <mailto:chamik...@google.com
    <mailto:chamik...@google.com>>>> wrote:
     >>> >      >
     >>> >      >
     >>> >      >
     >>> >      >                     On Thu, Apr 18, 2019 at 2:12 PM
    Lukasz Cwik
     >>> >      >                     <lc...@google.com
    <mailto:lc...@google.com> <mailto:lc...@google.com
    <mailto:lc...@google.com>>
     >>> >     <mailto:lc...@google.com <mailto:lc...@google.com>
    <mailto:lc...@google.com <mailto:lc...@google.com>>>> wrote:
     >>> >      >
     >>> >      >                         Note that Max did ask whether
    making the
     >>> >      >                         expansion service do the
    staging made
     >>> >     sense, and
     >>> >      >                         my first line was agreeing
    with that
     >>> >     direction
     >>> >      >                         and expanding on how it could
    be done (so
     >>> >     this
     >>> >      >                         is really Max's idea or from
    whomever he
     >>> >     got the
     >>> >      >                         idea from).
     >>> >      >
     >>> >      >
     >>> >      >                     +1 to what Max said then :)
     >>> >      >
     >>> >      >
     >>> >      >                         I believe a lot of the value
    of the expansion
     >>> >      >                         service is not having users
    need to be
     >>> >     aware of
     >>> >      >                         all the SDK specific
    dependencies when
     >>> >     they are
     >>> >      >                         trying to create a pipeline;
    only the
     >>> >     "user" who
     >>> >      >                         is launching the expansion
    service may
     >>> >     need to.
     >>> >      >                         And in that case we can have a
    prepackaged
     >>> >      >                         expansion service application
    that does what
     >>> >      >                         most users would want (e.g.
    expansion
     >>> >     service as
     >>> >      >                         a docker container, a single
    bundled jar,
     >>> >     ...).
     >>> >      >                         We (the Apache Beam community)
    could
     >>> >     choose to
     >>> >      >                         host a default implementation
    of the
     >>> >     expansion
     >>> >      >                         service as well.
     >>> >      >
     >>> >      >
     >>> >      >                     I'm not against this. But I think
    this is a
     >>> >      >                     secondary, more advanced use-case.
    For a Beam
     >>> >     user
     >>> >      >                     who needs to use a Java transform
    that they
     >>> >     already
     >>> >      >                     have in a Python pipeline, we
    should provide
     >>> >     a way
     >>> >      >                     to allow starting up a expansion
    service (with
     >>> >      >                     dependencies needed for that) and
    running a
     >>> >     pipeline
     >>> >      >                     that uses this external Java
    transform (with
     >>> >      >                     dependencies that are needed at
    runtime).
     >>> >     Probably,
     >>> >      >                     it'll be enough to allow providing all
     >>> >     dependencies
     >>> >      >                     when starting up the expansion
    service and allow the
     >>> >      >                     expansion service to do the
     >>> >     staging of jars as
     >>> >      >                     well. I don't see a need to
    include the list
     >>> >     of jars
     >>> >      >                     in the ExpansionResponse sent to
    the Python SDK.
     >>> >      >
     >>> >      >
     >>> >      >                         On Thu, Apr 18, 2019 at 2:02
    PM Chamikara
     >>> >      >                         Jayalath <chamik...@google.com
    <mailto:chamik...@google.com>
     >>> >     <mailto:chamik...@google.com <mailto:chamik...@google.com>>
     >>> >      >                         <mailto:chamik...@google.com
    <mailto:chamik...@google.com>
     >>> >     <mailto:chamik...@google.com
    <mailto:chamik...@google.com>>>> wrote:
     >>> >      >
     >>> >      >                             I think there are two kinds of
     >>> >      >                             dependencies we have to consider.
     >>> >      >
     >>> >      >                             (1) Dependencies that are
    needed to
     >>> >     expand
     >>> >      >                             the transform.
     >>> >      >
     >>> >      >                             These have to be provided
    when we
     >>> >     start the
     >>> >      >                             expansion service so that
    available
     >>> >     external
     >>> >      >                             transforms are correctly
    registered
     >>> >     with the
     >>> >      >                             expansion service.
     >>> >      >
     >>> >      >                             (2) Dependencies that are
    not needed at
     >>> >      >                             expansion but may be
    needed at runtime.
     >>> >      >
     >>> >      >                             I think in both cases,
    users have to
     >>> >     provide
     >>> >      >                             these dependencies either
    when expansion
     >>> >      >                             service is started or when
    a pipeline is
     >>> >      >                             being executed.
     >>> >      >
     >>> >      >                             Max, I'm not sure why the
     >>> >      >                             expansion service will need to
     >>> >      >                             provide dependencies to the user
     >>> >      >                             since the user will already be
     >>> >      >                             aware of these.
     >>> >      >                             Are you talking about an
     >>> >      >                             expansion service
     >>> >      >                             that is readily available
    that will
     >>> >     be used
     >>> >      >                             by many Beam users ? I
    think such a
     >>> >      >                             (possibly long running)
    service will
     >>> >     have to
     >>> >      >                             maintain a repository of
    transforms and
     >>> >      >                             should have a mechanism for
    registering new
     >>> >      >                             transforms and discovering
    already
     >>> >      >                             registered transforms etc.
    I think
     >>> >     there's
     >>> >      >                             more design work needed to
    make the transform
     >>> >      >                             expansion service support
    such use-cases.
     >>> >      >                             Currently, I think
    allowing pipeline
     >>> >     author
     >>> >      >                             to provide the jars when
    starting the
     >>> >      >                             expansion service and when
    executing the
     >>> >      >                             pipeline will be adequate.
     >>> >      >
     >>> >      >                             Regarding the entity that will
     >>> >     perform the
     >>> >      >                             staging, I like Luke's
    idea of allowing
     >>> >      >                             expansion service to do
    the staging
     >>> >     (of jars
     >>> >      >                             provided by the user). The
     >>> >      >                             notion of artifacts and how they
     >>> >      >                             are extracted/represented is
     >>> >      >                             SDK-dependent. So if the pipeline
     >>> >      >                             SDK tries to do this, we have to
     >>> >      >                             add n x (n-1) configurations
     >>> >      >                             (for n SDKs).
     >>> >      >
     >>> >      >                             - Cham
     >>> >      >
     >>> >      >                             On Thu, Apr 18, 2019 at
    11:45 AM
     >>> >     Lukasz Cwik
     >>> >      >                             <lc...@google.com
    <mailto:lc...@google.com>
     >>> >     <mailto:lc...@google.com <mailto:lc...@google.com>>
    <mailto:lc...@google.com <mailto:lc...@google.com>
     >>> >     <mailto:lc...@google.com <mailto:lc...@google.com>>>>
     >>> >      >                             wrote:
     >>> >      >
     >>> >      >                                 We can expose the
    artifact staging
     >>> >      >                                 endpoint and artifact
    token to
     >>> >     allow the
     >>> >      >                                 expansion service to
    upload any
     >>> >      >                                 resources its
    environment may
     >>> >     need. For
     >>> >      >                                 example, the expansion
    service
     >>> >     for the
     >>> >      >                                 Beam Java SDK would be
    able to
     >>> >     upload jars.
     >>> >      >
     >>> >      >                                 In the "docker"
    environment, the
     >>> >     Apache
     >>> >      >                                 Beam Java SDK harness
    container would
     >>> >      >                                 fetch the relevant
    artifacts for
     >>> >     itself
     >>> >      >                                 and be able to execute
    the pipeline.
     >>> >      >                                 (Note that a docker
    environment could
     >>> >      >                                 skip all this artifact
    staging if the
     >>> >      >                                 docker environment
    contained all
     >>> >      >                                 necessary artifacts).
     >>> >      >
     >>> >      >                                 For the existing
    "external"
     >>> >     environment,
     >>> >      >                                 it should already come
    with all the
     >>> >      >                                 resources prepackaged
    wherever
     >>> >      >                                 "external" points to.
    The "process"
     >>> >      >                                 based environment
    could choose to use
     >>> >      >                                 the artifact staging
    service to fetch
     >>> >      >                                 those resources
    associated with its
     >>> >      >                                 process or it could
    follow the same
     >>> >      >                                 pattern that
    "external" would do and
     >>> >      >                                 already contain all
    the prepackaged
     >>> >      >                                 resources. Note that both
     >>> >     "external" and
     >>> >      >                                 "process" will require the
     >>> >     instance of
     >>> >      >                                 the expansion service
    to be
     >>> >     specialized
     >>> >      >                                 for those environments
    which is
     >>> >     why the
     >>> >      >                                 default should be for the
     >>> >      >                                 expansion service to be the
     >>> >      >                                 "docker" environment.
     >>> >      >
     >>> >      >                                 Note that a major
    reason for
     >>> >     going with
     >>> >      >                                 docker containers as
    the environment
     >>> >      >                                 that all runners
    should support
     >>> >     is that
     >>> >      >                                 containers provide a
     >>> >      >                                 solution for this
     >>> >      >                                 exact issue. Both the
    "process" and
     >>> >      >                                 "external"
    environments are
     >>> >     explicitly
     >>> >      >                                 limiting and expanding
    their
     >>> >      >                                 capabilities will
    quickly have us
     >>> >      >                                 building something
    like a docker
     >>> >      >                                 container because
    we'll quickly find
     >>> >      >                                 ourselves solving the same
     >>> >     problems that
     >>> >      >                                 docker containers
    solve (resources,
     >>> >      >                                 file layout,
    permissions, ...)
     >>> >      >
     >>> >      >
     >>> >      >
     >>> >      >
     >>> >      >                                 On Thu, Apr 18, 2019
    at 11:21 AM
     >>> >      >                                 Maximilian Michels
     >>> >     <m...@apache.org <mailto:m...@apache.org>
    <mailto:m...@apache.org <mailto:m...@apache.org>>
     >>> >      >                                 <mailto:m...@apache.org
    <mailto:m...@apache.org>
     >>> >     <mailto:m...@apache.org <mailto:m...@apache.org>>>> wrote:
     >>> >      >
     >>> >      >                                     Hi everyone,
     >>> >      >
     >>> >      >                                     We have previously
    merged support
     >>> >      >                                     for configuring
    transforms across
     >>> >      >                                     languages. Please
    see Cham's
     >>> >     summary
     >>> >      >                                     on the discussion
    [1]. There is
     >>> >      >                                     also a design
    document [2].
     >>> >      >
     >>> >      >                                     Subsequently,
    we've added
     >>> >     wrappers
     >>> >      >                                     for cross-language
    transforms
     >>> >     to the
     >>> >      >                                     Python SDK, i.e.
     >>> >     GenerateSequence,
     >>> >      >                                     ReadFromKafka, and
    there is a
     >>> >     pending
     >>> >      >                                     PR [1] for
    WriteToKafka. All
     >>> >     of them
     >>> >      >                                     utilize Java
    transforms via
     >>> >      >                                     cross-language
    configuration.
     >>> >      >
     >>> >      >                                     That is all pretty
    exciting :)
     >>> >      >
     >>> >      >                                     We still have some
    issues to
     >>> >     solve,
     >>> >      >                                     one being how to stage
     >>> >     artifacts from
     >>> >      >                                     a foreign
    environment. When
     >>> >     we run
     >>> >      >                                     external
    transforms which are
     >>> >     part of
     >>> >      >                                     Beam's core (e.g.
     >>> >     GenerateSequence),
     >>> >      >                                     we have them
    available in the SDK
     >>> >      >                                     Harness. However,
    when they
     >>> >     are not
     >>> >      >                                     (e.g. KafkaIO) we
    need to
     >>> >     stage the
     >>> >      >                                     necessary files.
     >>> >      >
     >>> >      >                                     For my PR [3] I've
    naively added
>>> >      >  ":beam-sdks-java-io-kafka" to
     >>> >     the SDK
     >>> >      >                                     Harness which
    caused dependency
     >>> >      >                                     problems [4].
    Those could be
     >>> >     resolved
     >>> >      >                                     but the bigger
    question is how to
     >>> >      >                                     stage artifacts
    for external
     >>> >      >                                     transforms
    programmatically?
     >>> >      >
     >>> >      >                                     Heejong has solved
    this by
     >>> >     adding a
     >>> >      >                                     "--jar_package"
    option to the
     >>> >     Python
     >>> >      >                                     SDK to stage Java
    files [5].
     >>> >     I think
     >>> >      >                                     that is a better
    solution than
     >>> >      >                                     adding required
    Jars to the SDK
     >>> >      >                                     Harness directly,
    but it is
     >>> >     not very
     >>> >      >                                     convenient for users.
     >>> >      >
     >>> >      >                                     I've discussed
    this today with
     >>> >      >                                     Thomas and we both
    figured
     >>> >     that the
     >>> >      >                                     expansion service
    needs to
     >>> >     provide a
     >>> >      >                                     list of required
    Jars with the
     >>> >      >                                     ExpansionResponse it
     >>> >     provides. It's
     >>> >      >                                     not entirely
    clear how we
     >>> >     determine
     >>> >      >                                     which artifacts
    are necessary
     >>> >     for an
     >>> >      >                                     external
    transform. We could just
     >>> >      >                                     dump the entire
    classpath
     >>> >     like we do
     >>> >      >                                     in
    PipelineResources for Java
     >>> >      >                                     pipelines. This
    provides many
     >>> >      >                                     unneeded classes
    but would work.
     >>> >      >
     >>> >      >                                     Do you think it
    makes sense
     >>> >     for the
     >>> >      >                                     expansion service
    to provide the
     >>> >      >                                     artifacts? Perhaps
    you have a
     >>> >     better
     >>> >      >                                     idea how to
    resolve the staging
     >>> >      >                                     problem in
    cross-language
     >>> >     pipelines?
     >>> >      >
     >>> >      >                                     Thanks,
     >>> >      >                                     Max
     >>> >      >
     >>> >      >                                     [1]
     >>> >      >
     >>> >
    
https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E
     >>> >      >
     >>> >      >                                     [2]
     >>> >      > https://s.apache.org/beam-cross-language-io
     >>> >      >
     >>> >      >                                     [3]
     >>> >      >
    https://github.com/apache/beam/pull/8322#discussion_r276336748
     >>> >      >
     >>> >      >                                     [4] Dependency graph for
     >>> >      >                                     beam-runners-direct-java:
     >>> >      >
     >>> >      >                                     beam-runners-direct-java ->
     >>> >      >                                     sdks-java-harness ->
     >>> >      >                                     beam-sdks-java-io-kafka ->
     >>> >      >                                     beam-runners-direct-java
     >>> >      >                                     ... the cycle continues
     >>> >      >
     >>> >      >                                     beam-runners-direct-java
     >>> >      >                                     depends on sdks-java-harness
     >>> >      >                                     due to the infamous Universal
     >>> >      >                                     Local Runner.
     >>> >      >                                     beam-sdks-java-io-kafka
     >>> >      >                                     depends on
     >>> >      >                                     beam-runners-direct-java for
     >>> >      >                                     running tests.
     >>> >      >
     >>> >      >                                     [5]
     >>> >      > https://github.com/apache/beam/pull/8340
     >>> >      >
     >>> >
