Hi all,

Currently, when running a pipeline that has the options
runner=PortableRunner and job_endpoint unset, the Python SDK spins up a
Dockerized Flink job server [1]. This is problematic because the
PortableRunner can be used by any portable runner. So for example, a Spark
runner user was recently baffled when their job ran successfully but
printed a bunch of Flink log messages.

There are not too many uses of this default behavior to my knowledge, at
least within Beam itself. The only example I could find was in the
portableWordCount tests, which is mostly the same as
portableWordCountFlinkRunner tests [2]. The default behavior is entirely
superseded by the FlinkRunner class, which provides better encapsulation.

I also noticed that DockerizedJobServer is only used by [3]. In
FlinkRunner, we pull the job server from Maven if necessary and call Java
directly. In general, I think there are already quite enough knobs in the
portability framework, so we should remove it unless there is reason to
prefer running the job server with Docker instead of calling Java directly.

There are a couple options:

A) Remove the default behavior and require job_endpoint to always be set
when using PortableRunner. This would be a breaking change.
B) Keep the current behavior, but warn when the user sets
runner=PortableRunner without job_endpoint. This is easy to miss, but it's
better than nothing.

What do you think?

[1]
https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L184
[2]
https://github.com/apache/beam/blob/b3596b89dbc002c686bdaa7853074e757a81b6fb/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L1983-L2048
[3]
https://github.com/apache/beam/blob/33c73739cec8bc6a7c8319efa41eda7a2540bce1/sdks/python/apache_beam/runners/portability/job_server.py#L163

Reply via email to