Thank you for your replies.
I did not suggest that the Expansion Service do the staging, but that it
return the required resources (e.g. jars) for the external transform's
runtime environment. The client then has to take care of staging the
resources.
The Expansion Service itself also needs resources to do the expansion. I
assumed those to be provided when starting the expansion service. I
consider it less important but we could also provide a way to add new
transforms to the Expansion Service after startup.
Good point on Docker vs externally provided environments. For the PR [1]
it will suffice then to add Kafka to the container dependencies. The
"--jar_package" pipeline option is ok for now but I'd like to see work
towards staging resources for external transforms via information
returned by the Expansion Service. That avoids users having to take care
of including the correct jars in their pipeline options.
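To illustrate what I have in mind, here is a rough Python-side sketch. The
"dependencies" field on the expansion response and the stage_artifact()
helper are hypothetical names for illustration, not existing Beam APIs:

   # Hedged sketch: the client expands the transform, then stages whatever
   # the Expansion Service reports as required runtime resources.
   def expand_external_transform(expansion_stub, request, staging_location):
       response = expansion_stub.Expand(request)
       for dep in response.dependencies:          # hypothetical field, e.g. jar paths
           stage_artifact(dep, staging_location)  # hypothetical helper; the client stages
       return response.transform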
These issues are related and we could discuss them in separate threads:
* Auto-discovery of Expansion Service and its external transforms
* Credentials required during expansion / runtime
Thanks,
Max
[1] https://github.com/apache/beam/pull/8322
On 19.04.19 07:35, Thomas Weise wrote:
Good discussion :)
Initially the expansion service was considered a user responsibility,
but I think that isn't necessarily the case. I can also see the
expansion service provided as part of the infrastructure and the user
not wanting to deal with it at all. For example, users may want to write
Python transforms and use external IOs, without being concerned how
these IOs are provided. In such a scenario it would be good if:
* Expansion service(s) can be auto-discovered via the job service endpoint
* Available external transforms can be discovered via the expansion
service(s)
* Dependencies for external transforms are part of the metadata returned
by the expansion service
Dependencies could then be staged either by the SDK client or the
expansion service. The expansion service could provide the locations to
stage to the SDK; it would still be transparent to the user.
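To sketch what that could look like from the SDK client side (all of these
calls are hypothetical; nothing like them exists today):

   # Purely hypothetical discovery sketch following the points above.
   endpoints = job_service.list_expansion_services()     # via the job service endpoint
   for endpoint in endpoints:
       for t in list_external_transforms(endpoint):      # discover available transforms
           print(t.urn, t.dependencies)                  # dependencies as metadata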
I also agree with Luke regarding the environments. Docker is the choice
for generic deployment. Other environments are used when the flexibility
offered by Docker isn't needed (or gets in the way). Then the
dependencies are provided in different ways. Whether these are Python
packages or jar files, by opting out of Docker the decision is made to
manage dependencies externally.
Thomas
On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath <chamik...@google.com> wrote:
On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath <chamik...@google.com> wrote:
Thanks for raising the concern about credentials, Ankur. I agree
that this is a significant issue.
On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik <lc...@google.com> wrote:
I can understand the concern about credentials; the same
access concern will exist for several cross-language
transforms (mostly IOs) since some will need access to
credentials to read/write to an external service.
Are there any ideas on how credential propagation could work
for these IOs?
There are some cases where existing IO transforms need
credentials to access remote resources, for example for size
estimation, validation, etc. But usually these are optional (or
the transform can be configured to not perform these functions).
To clarify, I'm only talking about transform expansion here. Many IO
transforms need read/write access to remote services at run time. So
probably we need to figure out a way to propagate these credentials
anyway.
Can we use these mechanisms for staging?
I think we'll have to find a way to do one of (1) propagate
credentials to other SDKs (2) allow users to configure SDK
containers to have necessary credentials (3) do the artifact
staging from the pipeline SDK environment which already have
credentials. I prefer (1) or (2) since this will give a
transform the same feature set whether used directly (in the same
SDK language as the transform) or remotely, but it might be hard
to do this for an arbitrary service that a transform might
connect to considering the number of ways users can configure
credentials (after an offline discussion with Ankur).
On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka <goe...@google.com> wrote:
I agree that the Expansion Service knows about the
artifacts required for a cross-language transform and
that having a prepackaged folder/zip for transforms based on
language makes sense.
One thing to note here is that the expansion service might
not have the same access privileges as the pipeline
author and hence might not be able to stage artifacts by
itself.
Keeping this in mind, I am leaning towards making the
Expansion Service provide all the required artifacts to
the user and letting the user stage them as regular
artifacts.
At this time, we only have Beam FileSystem-based
artifact staging, which uses local credentials to access
different file systems. Even a docker-based expansion
service running on the local machine might not have the same
access privileges.
In brief, this is what I am leaning toward:
User calls for pipeline submission -> Expansion Service
provides cross-language transforms and relevant artifacts
to the SDK -> SDK submits the pipeline to the Job Server and
stages user and cross-language artifacts to the artifact
staging service.
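In rough pseudocode (the artifact fields and helper names are made up for
illustration):

   # Illustrative only: expand, collect the reported artifacts, then submit
   # and stage everything through the regular artifact staging service.
   expanded = expansion_service.Expand(external_transform_request)
   cross_language_artifacts = expanded.artifacts          # hypothetical field
   preparation = job_server.Prepare(pipeline_proto)
   stage_artifacts(preparation.artifact_staging_endpoint,
                   user_artifacts + cross_language_artifacts)
   job_server.Run(preparation)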
On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath <chamik...@google.com> wrote:
On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik <lc...@google.com> wrote:
Note that Max did ask whether making the
expansion service do the staging made sense, and
my first line was agreeing with that direction
and expanding on how it could be done (so this
is really Max's idea, or whomever he got the
idea from).
+1 to what Max said then :)
I believe a lot of the value of the expansion
service is that users don't need to be aware of
all the SDK-specific dependencies when they are
trying to create a pipeline; only the "user" who
is launching the expansion service may need to.
And in that case we can have a prepackaged
expansion service application that does what
most users would want (e.g. expansion service as
a docker container, a single bundled jar, ...).
We (the Apache Beam community) could choose to
host a default implementation of the expansion
service as well.
I'm not against this. But I think this is a
secondary, more advanced use-case. For a Beam user
that needs to use a Java transform that they already
have in a Python pipeline, we should provide a way
to allow starting up an expansion service (with the
dependencies needed for that) and running a pipeline
that uses this external Java transform (with
dependencies that are needed at runtime). Probably,
it'll be enough to allow providing all dependencies
when starting up the expansion service and to allow the
expansion service to do the staging of jars as
well. I don't see a need to include the list of jars
in the ExpansionResponse sent to the Python SDK.
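As a rough sketch, starting such an expansion service with the Kafka
dependencies available could look like the following (the main class name
and the port argument are assumptions, not a documented interface):

   # Hedged sketch: launch a Java expansion service with the Kafka IO jar
   # (and its dependencies) on the classpath, listening on a local port
   # that the Python SDK connects to for expansion.
   import subprocess
   subprocess.Popen([
       "java",
       "-cp", "expansion-service.jar:beam-sdks-java-io-kafka.jar:kafka-clients.jar",
       "org.apache.beam.sdk.expansion.ExpansionService",  # assumed main class
       "8097",                                            # assumed port argument
   ])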
On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath <chamik...@google.com> wrote:
I think there are two kinds of dependencies
we have to consider.
(1) Dependencies that are needed to expand
the transform.
These have to be provided when we start the
expansion service so that available external
transforms are correctly registered with the
expansion service.
(2) Dependencies that are not needed at
expansion but may be needed at runtime.
I think in both cases, users have to provide
these dependencies either when the expansion
service is started or when a pipeline is
being executed.
Max, I'm not sure why the expansion service will
need to provide dependencies to the user,
since the user will already be aware of these.
Are you talking about an expansion service
that is readily available and that will be used
by many Beam users? I think such a
(possibly long-running) service will have to
maintain a repository of transforms and
should have a mechanism for registering new
transforms, discovering already
registered transforms, etc. I think there's
more design work needed to make the transform
expansion service support such use-cases.
Currently, I think allowing the pipeline author
to provide the jars when starting the
expansion service and when executing the
pipeline will be adequate.
Regarding the entity that will perform the
staging, I like Luke's idea of allowing the
expansion service to do the staging (of jars
provided by the user). The notion of artifacts
and how they are extracted/represented is
SDK-dependent. So if the pipeline SDK tries
to do this we have to add n x (n-1)
configurations (for n SDKs).
- Cham
On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik <lc...@google.com> wrote:
We can expose the artifact staging
endpoint and artifact token to allow the
expansion service to upload any
resources its environment may need. For
example, the expansion service for the
Beam Java SDK would be able to upload jars.
In the "docker" environment, the Apache
Beam Java SDK harness container would
fetch the relevant artifacts for itself
and be able to execute the pipeline.
(Note that a docker environment could
skip all this artifact staging if the
docker environment contained all
necessary artifacts).
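Conceptually, the expansion side could then
look roughly like this (list_required_jars()
and upload_artifact() are hypothetical
helpers, not existing APIs):

   # Hedged sketch: after expanding, the expansion service pushes the jars
   # its environment needs to the artifact staging endpoint, using the
   # staging token it was given.
   def expand_and_upload(request, artifact_endpoint, staging_token):
       response = expand(request)                   # normal expansion
       for jar in list_required_jars(response):     # e.g. the Kafka IO jars
           upload_artifact(artifact_endpoint, staging_token, jar)
       return response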
For the existing "external" environment,
it should already come with all the
resources prepackaged wherever
"external" points to. The "process"
based environment could choose to use
the artifact staging service to fetch
those resources associated with its
process or it could follow the same
pattern that "external" would do and
already contain all the prepackaged
resources. Note that both "external" and
"process" will require the instance of
the expansion service to be specialized
for those environments, which is why the
default for the expansion service should
be the "docker" environment.
Note that a major reason for going with
docker containers as the environment
that all runners should support is that
containers provide a solution for this
exact issue. Both the "process" and
"external" environments are explicitly
limiting, and expanding their
capabilities will quickly have us
building something like a docker
container, because we'll quickly find
ourselves solving the same problems that
docker containers already solve (resources,
file layout, permissions, ...).
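For reference, the three flavors roughly
look like this in the pipeline proto (the
URNs match Beam's runner API; the payload
shapes shown are only approximate):

   # Rough illustration of the environment types discussed above.
   environments = {
       "docker":   {"urn": "beam:env:docker:v1",
                    "payload": {"container_image": "<java-sdk-harness-image>"}},
       "process":  {"urn": "beam:env:process:v1",
                    "payload": {"command": "/path/to/boot", "env": {"KEY": "VALUE"}}},
       "external": {"urn": "beam:env:external:v1",
                    "payload": {"endpoint": "host:port of a running worker"}},
   }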
On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels <m...@apache.org> wrote:
Hi everyone,
We have previously merged support
for configuring transforms across
languages. Please see Cham's summary
on the discussion [1]. There is
also a design document [2].
Subsequently, we've added wrappers
for cross-language transforms to the
Python SDK, i.e. GenerateSequence,
ReadFromKafka, and there is a pending
PR [3] for WriteToKafka. All of them
utilize Java transforms via
cross-language configuration.
That is all pretty exciting :)
We still have some issues to solve,
one being how to stage artifacts from
a foreign environment. When we run
external transforms which are part of
Beam's core (e.g. GenerateSequence),
we have them available in the SDK
Harness. However, when they are not
(e.g. KafkaIO) we need to stage the
necessary files.
For my PR [3] I've naively added
":beam-sdks-java-io-kafka" to the SDK
Harness which caused dependency
problems [4]. Those could be resolved
but the bigger question is how to
stage artifacts for external
transforms programmatically?
Heejong has solved this by adding a
"--jar_package" option to the Python
SDK to stage Java files [5]. I think
that is a better solution than
adding required Jars to the SDK
Harness directly, but it is not very
convenient for users.
I've discussed this today with
Thomas and we both figured that the
expansion service needs to provide a
list of required Jars with the
ExpansionResponse it provides. It's
not entirely clear how we determine
which artifacts are necessary for an
external transform. We could just
dump the entire classpath like we do
in PipelineResources for Java
pipelines. This provides many
unneeded classes but would work.
Do you think it makes sense for the
expansion service to provide the
artifacts? Perhaps you have a better
idea how to resolve the staging
problem in cross-language pipelines?
Thanks,
Max
[1]
https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E
[2]
https://s.apache.org/beam-cross-language-io
[3]
https://github.com/apache/beam/pull/8322#discussion_r276336748
[4] Dependency graph for beam-runners-direct-java:
beam-runners-direct-java -> sdks-java-harness -> beam-sdks-java-io-kafka
-> beam-runners-direct-java ... the cycle continues.
beam-runners-direct-java depends on sdks-java-harness due
to the infamous Universal Local Runner; beam-sdks-java-io-kafka depends
on beam-runners-direct-java for running tests.
[5]
https://github.com/apache/beam/pull/8340