Thank you for your replies.
I did not suggest that the Expansion Service do the staging, but that it
return the required resources (e.g. jars) for the external transform's
runtime environment. The client then has to take care of staging the
resources.
The Expansion Service itself also needs resources to do the expansion. I
assumed those to be provided when starting the expansion service. I
consider it less important but we could also provide a way to add new
transforms to the Expansion Service after startup.
Good point on Docker vs externally provided environments. For the PR [1]
it will suffice then to add Kafka to the container dependencies. The
"--jar_package" pipeline option is ok for now but I'd like to see work
towards staging resources for external transforms via information
returned by the Expansion Service. That avoids users having to take care
of including the correct jars in their pipeline options.
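To illustrate what I have in mind, here is a rough Python-side sketch. The
"dependencies" field on the expansion response and the stage_artifact()
helper are hypothetical names for illustration, not existing Beam APIs:

   # Hedged sketch: the client expands the transform, then stages whatever
   # the Expansion Service reports as required runtime resources.
   def expand_external_transform(expansion_stub, request, staging_location):
       response = expansion_stub.Expand(request)
       for dep in response.dependencies:          # hypothetical field, e.g. jar paths
           stage_artifact(dep, staging_location)  # hypothetical helper; the client stages
       return response.transform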
These issues are related and we could discuss them in separate threads:
* Auto-discovery of Expansion Service and its external transforms
* Credentials required during expansion / runtime
Thanks,
Max
[1] https://github.com/apache/beam/pull/8322
On 19.04.19 07:35, Thomas Weise wrote:
Good discussion :)
Initially the expansion service was considered a user responsibility,
but I think that isn't necessarily the case. I can also see the
expansion service provided as part of the infrastructure and the user
not wanting to deal with it at all. For example, users may want to write
Python transforms and use external IOs, without being concerned how
these IOs are provided. In such a scenario it would be good if:
* Expansion service(s) can be auto-discovered via the job service endpoint
* Available external transforms can be discovered via the expansion
service(s)
* Dependencies for external transforms are part of the metadata returned
by the expansion service
Dependencies could then be staged either by the SDK client or the
expansion service. The expansion service could provide the locations to
stage to the SDK; it would still be transparent to the user.
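To sketch what that could look like from the SDK client side (all of these
calls are hypothetical; nothing like them exists today):

   # Purely hypothetical discovery sketch following the points above.
   endpoints = job_service.list_expansion_services()     # via the job service endpoint
   for endpoint in endpoints:
       for t in list_external_transforms(endpoint):      # discover available transforms
           print(t.urn, t.dependencies)                  # dependencies as metadata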
I also agree with Luke regarding the environments. Docker is the choice
for generic deployment. Other environments are used when the flexibility
offered by Docker isn't needed (or gets in the way). Then the
dependencies are provided in different ways. Whether these are Python
packages or jar files, by opting out of Docker the decision is made to
manage dependencies externally.
Thomas
On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath <chamik...@google.com> wrote:
On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath <chamik...@google.com> wrote:
Thanks for raising the concern about credentials, Ankur. I agree
that this is a significant issue.
On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik <lc...@google.com> wrote:
I can understand the concern about credentials; the same
access concern will exist for several cross-language
transforms (mostly IOs) since some will need access to
credentials to read/write to an external service.
Are there any ideas on how credential propagation could work
for these IOs?
There are some cases where existing IO transforms need
credentials to access remote resources, for example for size
estimation, validation, etc. But usually these are optional (or
the transform can be configured to not perform these functions).
To clarify, I'm only talking about transform expansion here. Many IO
transforms need read/write access to remote services at run time. So
probably we need to figure out a way to propagate these credentials
anyway.
Can we use these mechanisms for staging?
I think we'll have to find a way to do one of (1) propagate
credentials to other SDKs (2) allow users to configure SDK
containers to have necessary credentials (3) do the artifact
staging from the pipeline SDK environment which already have
credentials. I prefer (1) or (2) since this will give a
transform the same feature set whether used directly (in the same
SDK language as the transform) or remotely, but it might be hard
to do this for an arbitrary service that a transform might
connect to considering the number of ways users can configure
credentials (after an offline discussion with Ankur).
On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka <goe...@google.com> wrote:
I agree that the Expansion Service knows about the
artifacts required for a cross-language transform and
that having a prepackaged folder/zip for transforms based on
language makes sense.
One thing to note here is that the expansion service might
not have the same access privileges as the pipeline
author and hence might not be able to stage artifacts by
itself.
Keeping this in mind, I am leaning towards making the
Expansion Service provide all the required artifacts to
the user and letting the user stage them as regular
artifacts.
At this time, we only have Beam FileSystem-based
artifact staging, which uses local credentials to access
different file systems. Even a docker-based expansion
service running on the local machine might not have the same
access privileges.
In brief, this is what I am leaning toward:
User calls for pipeline submission -> Expansion Service
provides cross-language transforms and relevant artifacts
to the SDK -> SDK submits the pipeline to the Job Server and
stages user and cross-language artifacts to the artifact
staging service.
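In rough pseudocode (the artifact fields and helper names are made up for
illustration):

   # Illustrative only: expand, collect the reported artifacts, then submit
   # and stage everything through the regular artifact staging service.
   expanded = expansion_service.Expand(external_transform_request)
   cross_language_artifacts = expanded.artifacts          # hypothetical field
   preparation = job_server.Prepare(pipeline_proto)
   stage_artifacts(preparation.artifact_staging_endpoint,
                   user_artifacts + cross_language_artifacts)
   job_server.Run(preparation)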
On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath <chamik...@google.com> wrote:
On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik <lc...@google.com> wrote:
Note that Max did ask whether making the
expansion service do the staging made sense, and
my first line was agreeing with that direction
and expanding on how it could be done (so this
is really Max's idea, or whomever he got the
idea from).
+1 to what Max said then :)
I believe a lot of the value of the expansion
service is that users don't need to be aware of
all the SDK-specific dependencies when they are
trying to create a pipeline; only the "user" who
is launching the expansion service may need to.
And in that case we can have a prepackaged
expansion service application that does what
most users would want (e.g. expansion service as
a docker container, a single bundled jar, ...).
We (the Apache Beam community) could choose to
host a default implementation of the expansion
service as well.
I'm not against this. But I think this is a
secondary, more advanced use-case. For a Beam user
that needs to use a Java transform that they already
have in a Python pipeline, we should provide a way
to allow starting up an expansion service (with the
dependencies needed for that) and running a pipeline
that uses this external Java transform (with
dependencies that are needed at runtime). Probably,
it'll be enough to allow providing all dependencies
when starting up the expansion service and to allow the
expansion service to do the staging of jars as
well. I don't see a need to include the list of jars
in the ExpansionResponse sent to the Python SDK.
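As a rough sketch, starting such an expansion service with the Kafka
dependencies available could look like the following (the main class name
and the port argument are assumptions, not a documented interface):

   # Hedged sketch: launch a Java expansion service with the Kafka IO jar
   # (and its dependencies) on the classpath, listening on a local port
   # that the Python SDK connects to for expansion.
   import subprocess
   subprocess.Popen([
       "java",
       "-cp", "expansion-service.jar:beam-sdks-java-io-kafka.jar:kafka-clients.jar",
       "org.apache.beam.sdk.expansion.ExpansionService",  # assumed main class
       "8097",                                            # assumed port argument
   ])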
On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath <chamik...@google.com> wrote:
I think there are two kinds of dependencies
we have to consider.
(1) Dependencies that are needed to expand
the transform.
These have to be provided when we start the
expansion service so that available external
transforms are correctly registered with the
expansion service.
(2) Dependencies that are not needed at
expansion but may be needed at runtime.
I think in both cases, users have to provide
these dependencies either when the expansion
service is started or when a pipeline is
being executed.
Max, I'm not sure why the expansion service will
need to provide dependencies to the user,
since the user will already be aware of these.
Are you talking about an expansion service
that is readily available and that will be used
by many Beam users? I think such a
(possibly long-running) service will have to
maintain a repository of transforms and
should have a mechanism for registering new
transforms, discovering already
registered transforms, etc. I think there's
more design work needed to make the transform
expansion service support such use-cases.
Currently, I think allowing the pipeline author
to provide the jars when starting the
expansion service and when executing the
pipeline will be adequate.
Regarding the entity that will perform the
staging, I like Luke's idea of allowing the
expansion service to do the staging (of jars
provided by the user). The notion of artifacts
and how they are extracted/represented is
SDK-dependent. So if the pipeline SDK tries
to do this we have to add n x (n-1)
configurations (for n SDKs).
- Cham
On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik <lc...@google.com> wrote:
We can expose the artifact staging
endpoint and artifact token to allow the
expansion service to upload any
resources its environment may need. For
example, the expansion service for the
Beam Java SDK would be able to upload jars.
In the "docker" environment, the Apache
Beam Java SDK harness container would
fetch the relevant artifacts for itself
and be able to execute the pipeline.
(Note that a docker environment could
skip all this artifact staging if the
docker environment contained all
necessary artifacts).
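Conceptually, the expansion side could then
look roughly like this (list_required_jars()
and upload_artifact() are hypothetical
helpers, not existing APIs):

   # Hedged sketch: after expanding, the expansion service pushes the jars
   # its environment needs to the artifact staging endpoint, using the
   # staging token it was given.
   def expand_and_upload(request, artifact_endpoint, staging_token):
       response = expand(request)                   # normal expansion
       for jar in list_required_jars(response):     # e.g. the Kafka IO jars
           upload_artifact(artifact_endpoint, staging_token, jar)
       return response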
For the existing "external" environment,
it should already come with all the
resources prepackaged wherever
"external" points to. The "process"
based environment could choose to use
the artifact staging service to fetch
those resources associated with its
process or it could follow the same
pattern that "external" would do and
already contain all the prepackaged
resources. Note that both "external" and
"process" will require the instance of
the expansion service to be specialized
for those environments, which is why the
default for the expansion service should
be the "docker" environment.
Note that a major reason for going with
docker containers as the environment
that all runners should support is that
containers provide a solution for this
exact issue. Both the "process" and
"external" environments are explicitly
limiting, and expanding their
capabilities will quickly have us
building something like a docker
container, because we'll quickly find
ourselves solving the same problems that
docker containers already solve (resources,
file layout, permissions, ...).
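For reference, the three flavors roughly
look like this in the pipeline proto (the
URNs match Beam's runner API; the payload
shapes shown are only approximate):

   # Rough illustration of the environment types discussed above.
   environments = {
       "docker":   {"urn": "beam:env:docker:v1",
                    "payload": {"container_image": "<java-sdk-harness-image>"}},
       "process":  {"urn": "beam:env:process:v1",
                    "payload": {"command": "/path/to/boot", "env": {"KEY": "VALUE"}}},
       "external": {"urn": "beam:env:external:v1",
                    "payload": {"endpoint": "host:port of a running worker"}},
   }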
On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels <m...@apache.org> wrote:
Hi everyone,
We have previously merged support
for configuring transforms across
languages. Please see Cham's summary
on the discussion [1]. There is
also a design document [2].
Subsequently, we've added wrappers
for cross-language transforms to the
Python SDK, i.e. GenerateSequence,
ReadFromKafka, and there is a pending
PR [3] for WriteToKafka. All of them
utilize Java transforms via
cross-language configuration.
That is all pretty exciting :)
We still have some issues to solve,
one being how to stage artifacts from
a foreign environment. When we run
external transforms which are part of
Beam's core (e.g. GenerateSequence),
we have them available in the SDK
Harness. However, when they are not
(e.g. KafkaIO) we need to stage the
necessary files.
For my PR [3] I've naively added
":beam-sdks-java-io-kafka" to the SDK
Harness which caused dependency
problems [4]. Those could be resolved
but the bigger question is how to
stage artifacts for external
transforms programmatically?
Heejong has solved this by adding a
"--jar_package" option to the Python
SDK to stage Java files [5]. I think
that is a better solution than
adding required Jars to the SDK
Harness directly, but it is not very
convenient for users.
I've discussed this today with
Thomas and we both figured that the
expansion service needs to provide a
list of required Jars with the
ExpansionResponse it provides. It's
not entirely clear how we determine
which artifacts are necessary for an
external transform. We could just
dump the entire classpath like we do
in PipelineResources for Java
pipelines. This provides many
unneeded classes but would work.
Do you think it makes sense for the
expansion service to provide the
artifacts? Perhaps you have a better
idea how to resolve the staging
problem in cross-language pipelines?
Thanks,
Max
[1]
https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E
[2]
https://s.apache.org/beam-cross-language-io
[3]
https://github.com/apache/beam/pull/8322#discussion_r276336748
[4] Dependency graph for beam-runners-direct-java:
beam-runners-direct-java -> sdks-java-harness -> beam-sdks-java-io-kafka
-> beam-runners-direct-java ... the cycle continues.
beam-runners-direct-java depends on sdks-java-harness due
to the infamous Universal Local Runner; beam-sdks-java-io-kafka depends
on beam-runners-direct-java for running tests.
[5]
https://github.com/apache/beam/pull/8340