On Tue, Apr 23, 2019 at 2:07 AM Robert Bradshaw <rober...@google.com> wrote:
I've been out, so coming a bit late to the discussion, but here are my
thoughts.
The expansion service absolutely needs to be able to provide the
dependencies for the transform(s) it expands. It seems the default,
foolproof way of doing this is via the environment, which can be a
docker image with all the required dependencies. Anything more than
this is an (arguably important, but possibly messy) optimization.
The standard way to provide artifacts outside of the environment is
via the artifact staging service. Of course, the expansion service may
not have access to the (final) artifact staging service (due to
permissions, locality, or it may not even be started up yet) but the
SDK invoking the expansion service could offer an artifact staging
environment for the expansion service to publish artifacts to. However, there are
some difficulties here, in particular avoiding name collisions with
staged artifacts, assigning semantic meaning to the artifacts (e.g.
should jar files get automatically placed in the classpath, or Python
packages recognized and installed at startup). The alternative is
going with a (type, pointer) scheme for naming dependencies; if we go
this route I think we should consider migrating all artifact staging
to this style. I am concerned that the "file" version will be less
than useful for what will become the most convenient expansion
services (namely, hosted and docker image). I am still at a loss,
however, as to how to solve the diamond dependency problem among
dependencies--perhaps the information is there if one walks
maven/pypi/go modules/... but do we expect every runner to know about
every packaging platform? This also wouldn't solve the issue if fat
jars are used as dependencies. The only safe thing to do here is to
force distinct dependency sets to live in different environments,
which could be too conservative.
This all leads me to think that perhaps the environment itself should
be docker image (often one of "vanilla" beam-java-x.y ones) +
dependency list, rather than have the dependency/artifact list as some
kind of data off to the side. In this case, the runner would (as
requested by its configuration) be free to merge environments it
deemed compatible, including swapping out beam-java-X for
beam-java-embedded if it considers itself compatible with the
dependency list.
I like this idea of building multiple docker environments on top of a bare
minimum SDK harness container and allowing runners to pick a suitable one
based on a dependency list.
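To make that concrete, here is a rough sketch (illustrative only; these dataclasses are not the actual Beam portability protos, and the image/coordinate names are just examples) of an environment expressed as a base docker image plus a typed dependency list:

    # Hypothetical sketch, not actual Beam code.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Dependency:
        # A (type, pointer) style descriptor, e.g.
        # ("MAVEN", "org.apache.beam:beam-sdks-java-io-kafka:2.12.0")
        # or ("FILE", "gs://my-bucket/kafka-io-fat.jar").
        type: str
        pointer: str

    @dataclass
    class Environment:
        # A "vanilla" SDK harness image plus the extra dependencies layered on top.
        base_image: str
        dependencies: List[Dependency] = field(default_factory=list)

    # A runner could merge two environments only when the base images match and
    # the combined dependency lists are compatible, or swap the docker image for
    # an embedded environment if it considers itself compatible with the list.
    env = Environment(
        base_image="apachebeam/java_sdk:2.12.0",
        dependencies=[Dependency("MAVEN", "org.apache.beam:beam-sdks-java-io-kafka:2.12.0")],
    )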
I agree with Thomas that we'll want to make expansion services, and
the transforms they offer, more discoverable. The whole life cycle
of expansion services is something that has yet to be fully fleshed
out, and may influence some of these decisions.
As for adding --jar_package to the Python SDK, this seems really
specific to calling java-from-python (would we have O(n^2) such
options?) as well as out-of-place for a Python user to specify. I
would really hope we can figure out a more generic solution. If we
need this option in the meantime, let's at least make it clear
(probably in the name) that it's temporary.
Good points. I second that we need a more generic solution than a
Python-to-Java-specific option. I think that instead of naming it differently
we can make --jar_package a secondary option under --experiment in the
meantime. WDYT?
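For illustration, this is roughly how the stop-gap could look from the pipeline author's side (the experiment key "jar_packages" is an assumption, not an agreed-upon name):

    # Hypothetical usage sketch only.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",
        # Instead of a first-class --jar_package flag, hide the temporary
        # knob behind the generic experiments option.
        "--experiments=jar_packages=/path/to/beam-sdks-java-io-kafka-fat.jar",
    ])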
On Tue, Apr 23, 2019 at 1:08 AM Thomas Weise <t...@apache.org> wrote:
>
> One more suggestion:
>
> It would be nice to be able to select the environment for the external transforms. For example, I would like to be able to use EMBEDDED for Flink. That's implicit for sources which are runner native unbounded read translations, but it should also be possible for writes. That would then be similar to how pipelines are packaged and run with the "legacy" runner.
>
> Thomas
>
>
> On Mon, Apr 22, 2019 at 1:18 PM Ankur Goenka <goe...@google.com> wrote:
>>
>> Great discussion!
>> I have a few points around the structure of the proto but that is less important as it can evolve.
>> However, I think that artifact compatibility is another important aspect to look at.
>> Example: TransformA uses Guava >1.6 and <1.7, TransformB uses >1.8 and <1.9, and TransformC uses >1.6 and <1.8. As the SDK provides the environment for each transform, it cannot simply say EnvironmentJava for both TransformA and TransformB as the dependencies are not compatible.
>> We should have a separate environment associated with TransformA and TransformB in this case.
>>
>> To support this case, we need 2 things.
>> 1: Granular metadata about the dependency including type.
>> 2: Complete list of the transforms to be expanded.
>>
>> Elaboration:
>> The compatibility check can be done in a crude way if we provide all the metadata about the dependency to the expansion service.
>> Also, the expansion service should expand all the applicable transforms in a single call so that it knows about the incompatibility and creates separate environments for these transforms. So in the above example, the expansion service will associate EnvA to TransformA and EnvB to TransformB and EnvA to TransformC. This will of course require changes to the Expansion service proto, but giving all the information to the expansion service will make it support more cases and make it a bit more future proof.
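For concreteness, a rough sketch (purely illustrative, not an agreed-upon API; version ranges are simplified to plain floats) of how an expansion service that sees all transforms in a single call could group them into environments by dependency compatibility:

    # Hypothetical sketch: group transforms into environments by checking whether
    # their version ranges for shared dependencies overlap. Not actual Beam code.
    transforms = {
        "TransformA": {"guava": (1.6, 1.7)},
        "TransformB": {"guava": (1.8, 1.9)},
        "TransformC": {"guava": (1.6, 1.8)},
    }

    def compatible(a, b):
        # Two dependency sets are compatible if every shared dependency's ranges overlap.
        for dep in set(a) & set(b):
            lo = max(a[dep][0], b[dep][0])
            hi = min(a[dep][1], b[dep][1])
            if lo >= hi:
                return False
        return True

    def assign_environments(transforms):
        envs = []        # merged dependency constraints per environment
        assignment = {}  # transform name -> environment index
        for name, deps in transforms.items():
            for i, env in enumerate(envs):
                if compatible(env, deps):
                    # Narrow the environment's ranges to the intersection.
                    for dep, (lo, hi) in deps.items():
                        elo, ehi = env.get(dep, (float("-inf"), float("inf")))
                        env[dep] = (max(elo, lo), min(ehi, hi))
                    assignment[name] = i
                    break
            else:
                envs.append(dict(deps))
                assignment[name] = len(envs) - 1
        return assignment

    print(assign_environments(transforms))
    # {'TransformA': 0, 'TransformB': 1, 'TransformC': 0} -> EnvA for A and C, EnvB for B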
>>
>>
>> On Mon, Apr 22, 2019 at 10:16 AM Maximilian Michels <m...@apache.org> wrote:
>>>
>>> Thanks for the summary Cham. All makes sense. I agree that we want to keep the option to manually specify artifacts.
>>>
>>> > There are a few unanswered questions though.
>>> > (1) In what form will a transform author specify dependencies ? For example, URL to a Maven repo, URL to a local file, blob ?
>>>
>>> Going forward, we probably want to support multiple ways. For now, we could stick with a URL-based approach with support for different file systems. In the future a list of packages to retrieve from Maven/PyPi would be useful.
>>>
>> We can ask the user for (type, metadata). For Maven it can be something like (MAVEN, {groupId: com.google.guava, artifactId: guava, version: 19}) or (FILE, file://myfile).
>> To begin with, we can support only a few types like File and can add more types in the future.
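A small illustrative example of what such (type, metadata) descriptors could look like (the field names are assumptions, not a settled schema):

    # Hypothetical examples of typed dependency descriptors; shapes are illustrative only.
    maven_dep = {
        "type": "MAVEN",
        "metadata": {"groupId": "com.google.guava", "artifactId": "guava", "version": "19"},
    }
    file_dep = {
        "type": "FILE",
        "metadata": {"url": "file:///path/to/beam-sdks-java-io-kafka.jar"},
    }
    # Starting with only FILE support keeps SDK-side handling simple; new types
    # (MAVEN, PYPI, ...) can be added later without changing the envelope.
    dependencies = [maven_dep, file_dep]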
>>>
>>> > (2) How will dependencies be included in the expansion response proto ? String (URL), bytes (blob) ?
>>>
>>> I'd go for a list of Protobuf strings first but the format would have to evolve for other dependency types.
>>>
>> Here also (type, payload) should suffice. We can have an interpreter for each type to translate the payload.
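As a rough sketch of that idea (handler names and payload shapes are hypothetical), the SDK could dispatch on the type field and keep the payload opaque to everything else:

    # Hypothetical per-type payload interpreters on the SDK side.
    def stage_file(payload):
        # e.g. copy the referenced file to the artifact staging service
        print("staging local file", payload["url"])

    def stage_maven(payload):
        # e.g. resolve the coordinates and stage the resulting jar(s)
        print("resolving and staging", payload["groupId"], payload["artifactId"], payload["version"])

    INTERPRETERS = {"FILE": stage_file, "MAVEN": stage_maven}

    def stage_dependencies(dependencies):
        for dep in dependencies:
            handler = INTERPRETERS.get(dep["type"])
            if handler is None:
                raise ValueError("Unsupported dependency type: %s" % dep["type"])
            handler(dep["metadata"])

    stage_dependencies([{"type": "FILE", "metadata": {"url": "file:///path/to/kafka-io.jar"}}])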
>>>
>>> > (3) How will we manage/share transitive dependencies required at runtime ?
>>>
>>> I'd say transitive dependencies have to be included in the list. In case of fat jars, they are reduced to a single jar.
>>
>> Makes sense.
>>>
>>>
>>> > (4) How will dependencies be staged for various runner/SDK combinations ? (for example, portable runner/Flink, Dataflow runner)
>>>
>>> Staging should be no different than it is now, i.e. go through Beam's artifact staging service. As long as the protocol is stable, there could also be different implementations.
>>
>> Makes sense.
>>>
>>>
>>> -Max
>>>
>>> On 20.04.19 03:08, Chamikara Jayalath wrote:
>>> > OK, sounds like this is a good path forward then.
>>> >
>>> > * When starting up the expansion service, the user (that starts up the service) provides the dependencies necessary to expand transforms. We will later add support for adding new transforms to an already running expansion service.
>>> > * As a part of the transform configuration, the transform author has the option of providing a list of dependencies that will be needed to run the transform.
>>> > * These dependencies will be sent back to the pipeline SDK as a part of the expansion response and the pipeline SDK will stage these resources.
>>> > * The pipeline author has the option of specifying the dependencies using a pipeline option. (for example, https://github.com/apache/beam/pull/8340)
>>> >
>>> > I think the last option is important to (1) make existing transforms easily available for cross-language usage without additional configurations (2) allow pipeline authors to override dependency versions specified in the transform configuration (for example, to apply security patches) without updating the expansion service.
>>> >
>>> > There are a few unanswered questions though.
>>> > (1) In what form will a transform author specify dependencies ? For example, URL to a Maven repo, URL to a local file, blob ?
>>> > (2) How will dependencies be included in the expansion response proto ? String (URL), bytes (blob) ?
>>> > (3) How will we manage/share transitive dependencies required at runtime ?
>>> > (4) How will dependencies be staged for various runner/SDK combinations ? (for example, portable runner/Flink, Dataflow runner)
>>> >
>>> > Thanks,
>>> > Cham
>>> >
>>> > On Fri, Apr 19, 2019 at 4:49 AM Maximilian Michels <m...@apache.org> wrote:
>>> >
>>> > Thank you for your replies.
>>> >
>>> > I did not suggest that the Expansion Service does the staging, but it would return the required resources (e.g. jars) for the external transform's runtime environment. The client then has to take care of staging the resources.
>>> >
>>> > The Expansion Service itself also needs resources to do the expansion. I assumed those to be provided when starting the expansion service. I consider it less important but we could also provide a way to add new transforms to the Expansion Service after startup.
>>> >
>>> > Good point on Docker vs externally provided environments. For the PR [1] it will suffice then to add Kafka to the container dependencies. The "--jar_package" pipeline option is ok for now but I'd like to see work towards staging resources for external transforms via information returned by the Expansion Service. That avoids users having to take care of including the correct jars in their pipeline options.
>>> >
>>> > These issues are related and we could discuss them in separate threads:
>>> >
>>> > * Auto-discovery of Expansion Service and its external transforms
>>> > * Credentials required during expansion / runtime
>>> >
>>> > Thanks,
>>> > Max
>>> >
>>> > [1] https://github.com/apache/beam/pull/8322
>>> >
>>> > On 19.04.19 07:35, Thomas Weise wrote:
>>> > > Good discussion :)
>>> > >
>>> > > Initially the expansion service was considered a user responsibility, but I think that isn't necessarily the case. I can also see the expansion service provided as part of the infrastructure and the user not wanting to deal with it at all. For example, users may want to write Python transforms and use external IOs, without being concerned how these IOs are provided. Under such a scenario it would be good if:
>>> > >
>>> > > * Expansion service(s) can be auto-discovered via the job service endpoint
>>> > > * Available external transforms can be discovered via the expansion service(s)
>>> > > * Dependencies for external transforms are part of the metadata returned by the expansion service
>>> > >
>>> > > Dependencies could then be staged either by the SDK client or the expansion service. The expansion service could provide the locations to stage to the SDK, it would still be transparent to the user.
>>> > >
>>> > > I also agree with Luke regarding the environments. Docker is the choice for generic deployment. Other environments are used when the flexibility offered by Docker isn't needed (or gets in the way). Then the dependencies are provided in different ways. Whether these are Python packages or jar files, by opting out of Docker the decision is made to manage dependencies externally.
>>> > >
>>> > > Thomas
>>> > >
>>> > >
>>> > > On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>> > >
>>> > > On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>> > >
>>> > > Thanks for raising the concern about credentials Ankur, I agree that this is a significant issue.
>>> > >
>>> > > On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik <lc...@google.com> wrote:
>>> > >
>>> > > I can understand the concern about credentials, the same access concern will exist for several cross language transforms (mostly IOs) since some will need access to credentials to read/write to an external service.
>>> > >
>>> > > Are there any ideas on how credential propagation could work to these IOs?
>>> > >
>>> > >
>>> > > There are some cases where existing IO transforms need credentials to access remote resources, for example, size estimation, validation, etc. But usually these are optional (or the transform can be configured to not perform these functions).
>>> > >
>>> > >
>>> > > To clarify, I'm only talking about transform expansion here. Many IO transforms need read/write access to remote services at run time. So probably we need to figure out a way to propagate these credentials anyways.
>>> > >
>>> > > Can we use these mechanisms for staging?
>>> > >
>>> > >
>>> > > I think we'll have to find a way to do one of (1) propagate credentials to other SDKs (2) allow users to configure SDK containers to have necessary credentials (3) do the artifact staging from the pipeline SDK environment which already has credentials. I prefer (1) or (2) since this will give a transform the same feature set whether used directly (in the same SDK language as the transform) or remotely, but it might be hard to do this for an arbitrary service that a transform might connect to considering the number of ways users can configure credentials (after an offline discussion with Ankur).
>>> > >
>>> > >
>>> > > On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka <goe...@google.com> wrote:
>>> > >
>>> > > I agree that the Expansion service knows about the artifacts required for a cross language transform and having a prepackaged folder/Zip for transforms based on language makes sense.
>>> > >
>>> > > One thing to note here is that the expansion service might not have the same access privilege as the pipeline author and hence might not be able to stage artifacts by itself.
>>> > > Keeping this in mind I am leaning towards making the Expansion service provide all the required artifacts to the user and let the user stage the artifacts as regular artifacts.
>>> > > At this time, we only have Beam File System based artifact staging, which uses local credentials to access different file systems. Even a docker based expansion service running on a local machine might not have the same access privileges.
>>> > >
>>> > > In brief this is what I am leaning toward:
>>> > > User call for pipeline submission -> Expansion service provides cross language transforms and relevant artifacts to the SDK -> SDK submits the pipeline to the Jobserver and stages user and cross language artifacts to the artifact staging service
>>> > >
>>> > >
>>> > > On Thu, Apr 18, 2019 at 2:33 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>> > >
>>> > >
>>> > >
>>> > > On Thu, Apr 18, 2019 at 2:12 PM Lukasz Cwik <lc...@google.com> wrote:
>>> > >
>>> > > Note that Max did ask whether making the expansion service do the staging made sense, and my first line was agreeing with that direction and expanding on how it could be done (so this is really Max's idea or from whomever he got the idea from).
>>> > >
>>> > >
>>> > > +1 to what Max said then :)
>>> > >
>>> > > I believe a lot of the value of the expansion service is not having users need to be aware of all the SDK specific dependencies when they are trying to create a pipeline, only the "user" who is launching the expansion service may need to. And in that case we can have a prepackaged expansion service application that does what most users would want (e.g. expansion service as a docker container, a single bundled jar, ...). We (the Apache Beam community) could choose to host a default implementation of the expansion service as well.
>>> > >
>>> > >
>>> > > I'm not against this. But I think this is a secondary, more advanced use-case. For a Beam user that needs to use a Java transform that they already have in a Python pipeline, we should provide a way to allow starting up an expansion service (with dependencies needed for that) and running a pipeline that uses this external Java transform (with dependencies that are needed at runtime). Probably, it'll be enough to allow providing all dependencies when starting up the expansion service and allow the expansion service to do the staging of jars as well. I don't see a need to include the list of jars in the ExpansionResponse sent to the Python SDK.
>>> > >
>>> > >
>>> > > On Thu, Apr 18, 2019 at 2:02 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>> > >
>>> > > I think there are two kinds of dependencies we have to consider.
>>> > >
>>> > > (1) Dependencies that are needed to expand the transform.
>>> > >
>>> > > These have to be provided when we start the expansion service so that available external transforms are correctly registered with the expansion service.
>>> > >
>>> > > (2) Dependencies that are not needed at expansion but may be needed at runtime.
>>> > >
>>> > > I think in both cases, users have to provide these dependencies either when the expansion service is started or when a pipeline is being executed.
>>> > > being executed.
>>> > >
>>> > > Max, I'm not sure why the expansion service will need to provide dependencies to the user since the user will already be aware of these. Are you talking about an expansion service that is readily available that will be used by many Beam users ? I think such a (possibly long running) service will have to maintain a repository of transforms and should have mechanisms for registering new transforms and discovering already registered transforms etc. I think there's more design work needed to make the transform expansion service support such use-cases. Currently, I think allowing the pipeline author to provide the jars when starting the expansion service and when executing the pipeline will be adequate.
>>> > >
>>> > > Regarding the entity that will perform the staging, I like Luke's idea of allowing the expansion service to do the staging (of jars provided by the user). The notion of artifacts and how they are extracted/represented is SDK dependent. So if the pipeline SDK tries to do this we have to add n x (n - 1) configurations (for n SDKs).
>>> > >
>>> > > - Cham
>>> > >
>>> > > On Thu, Apr 18, 2019 at 11:45 AM Lukasz Cwik <lc...@google.com> wrote:
>>> > >
>>> > > We can expose the artifact staging endpoint and artifact token to allow the expansion service to upload any resources its environment may need. For example, the expansion service for the Beam Java SDK would be able to upload jars.
>>> > >
>>> > > In the "docker" environment, the Apache Beam Java SDK harness container would fetch the relevant artifacts for itself and be able to execute the pipeline. (Note that a docker environment could skip all this artifact staging if the docker environment contained all necessary artifacts).
>>> > >
>>> > > For the existing "external" environment, it should already come with all the resources prepackaged wherever "external" points to. The "process" based environment could choose to use the artifact staging service to fetch those resources associated with its process or it could follow the same pattern that "external" would do and already contain all the prepackaged resources. Note that both "external" and "process" will require the instance of the expansion service to be specialized for those environments, which is why the default for the expansion service should be the "docker" environment.
>>> > >
>>> > > Note that a major reason for going with docker containers as the environment that all runners should support is that containers provide a solution for this exact issue. Both the "process" and "external" environments are explicitly limiting and expanding their capabilities will quickly have us building something like a docker container because we'll quickly find ourselves solving the same problems that docker containers solve (resources, file layout, permissions, ...)
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Thu, Apr 18, 2019 at 11:21 AM Maximilian Michels <m...@apache.org> wrote:
>>> > >
>>> > > Hi everyone,
>>> > >
>>> > > We have previously merged support for configuring transforms across languages. Please see Cham's summary on the discussion [1]. There is also a design document [2].
>>> > >
>>> > > Subsequently, we've added wrappers for cross-language transforms to the Python SDK, i.e. GenerateSequence, ReadFromKafka, and there is a pending PR [1] for WriteToKafka. All of them utilize Java transforms via cross-language configuration.
>>> > >
>>> > > That is all pretty exciting :)
>>> > >
>>> > > We still have some issues to solve, one being how to stage artifacts from a foreign environment. When we run external transforms which are part of Beam's core (e.g. GenerateSequence), we have them available in the SDK Harness. However, when they are not (e.g. KafkaIO) we need to stage the necessary files.
>>> > >
>>> > > For my PR [3] I've naively added ":beam-sdks-java-io-kafka" to the SDK Harness which caused dependency problems [4]. Those could be resolved but the bigger question is how to stage artifacts for external transforms programmatically?
>>> > >
>>> > > Heejong has solved this by adding a "--jar_package" option to the Python SDK to stage Java files [5]. I think that is a better solution than adding required Jars to the SDK Harness directly, but it is not very convenient for users.
>>> > >
>>> > > I've discussed this today with Thomas and we both figured that the expansion service needs to provide a list of required Jars with the ExpansionResponse it provides. It's not entirely clear how we determine which artifacts are necessary for an external transform. We could just dump the entire classpath like we do in PipelineResources for Java pipelines. This provides many unneeded classes but would work.
>>> > >
>>> > > Do you think it makes sense for the expansion service to provide the artifacts? Perhaps you have a better idea how to resolve the staging problem in cross-language pipelines?
>>> > >
>>> > > Thanks,
>>> > > Max
>>> > >
>>> > > [1] https://lists.apache.org/thread.html/b99ba8527422e31ec7bb7ad9dc3a6583551ea392ebdc5527b5fb4a67@%3Cdev.beam.apache.org%3E
>>> > >
>>> > > [2] https://s.apache.org/beam-cross-language-io
>>> > >
>>> > > [3] https://github.com/apache/beam/pull/8322#discussion_r276336748
>>> > >
>>> > > [4] Dependency graph for beam-runners-direct-java:
>>> > >
>>> > > beam-runners-direct-java -> sdks-java-harness -> beam-sdks-java-io-kafka -> beam-runners-direct-java ... the cycle continues
>>> > >
>>> > > Beam-runners-direct-java depends on sdks-java-harness due to the infamous Universal Local Runner. Beam-sdks-java-io-kafka depends on beam-runners-direct-java for running tests.
>>> > >
>>> > > [5] https://github.com/apache/beam/pull/8340
>>> > >
>>> >