Hi everyone,
In Dataflow we've had some Java SDK user issues that could be easily solved
if users were able to define some one-time initialization code (e.g. set a
system property) that their workers execute before they begin to process
data. One solution we've found that works with currently released SDKs is
for users to put that code in static {} blocks on any DoFn implementations
that require that initialization. Then the code is executed one time
when/if that DoFn class is loaded. This works well in a lot of cases but
also has some limitations:
- Each DoFn that depends on that initialization needs to include the same
initialization
- There is no way for users to know which workers executed a particular
DoFn, so workers could end up with different configurations depending on
which DoFn classes they happened to load

You could perhaps argue that these are actually good things - we only run
the initialization when it's needed - but it could also lead to confusing
behavior.
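For concreteness, the static-block workaround looks roughly like this (a minimal sketch; MyDoFn is a hypothetical stand-in for a real DoFn implementation, with no Beam dependency here):

```java
// Sketch of the static-block workaround: the static initializer runs
// exactly once, when (and only if) the class is first loaded on a worker.
public class StaticInitSketch {
    static class MyDoFn {
        static {
            // One-time worker initialization, e.g. setting a system property.
            System.setProperty("my.one.time.flag", "set");
        }

        void processElement(String element) {
            // Element processing would go here.
        }
    }

    public static void main(String[] args) {
        // The static block only fires once MyDoFn is actually used.
        new MyDoFn().processElement("x");
        System.out.println(System.getProperty("my.one.time.flag"));
    }
}
```

Every DoFn that needs the initialization has to repeat a block like this, which is the duplication problem described above.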

So I'd like to propose an addition to the Java SDK that provides hooks
for JVM initialization that are guaranteed to execute once on each
worker. I've written up a PR [1] that implements this. It adds a service
interface, BeamWorkerInitializer, that users can implement to define some
initialization, and modifies workers (currently just the portable worker
and the dataflow worker) to find and execute these implementations using
ServiceLoader. BeamWorkerInitializer has two methods that can be overridden:
onStartup, which workers run immediately after starting, and
beforeProcessing, which workers run after initializing things like logging,
but before beginning to process data.
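In code, the proposal might look roughly like this (a minimal sketch reconstructed from the description above; the actual interface is in the PR and may differ, and SetPropertyInitializer is a hypothetical user implementation):

```java
import java.util.ServiceLoader;

public class WorkerInitSketch {
    // Sketch of the proposed SPI: users implement this and register it
    // for ServiceLoader discovery.
    public interface BeamWorkerInitializer {
        // Run by the worker immediately after the JVM starts.
        default void onStartup() {}
        // Run after worker setup (e.g. logging), before processing data.
        default void beforeProcessing() {}
    }

    // Hypothetical user implementation: one-time setup per worker.
    public static class SetPropertyInitializer implements BeamWorkerInitializer {
        @Override
        public void onStartup() {
            System.setProperty("my.custom.flag", "true");
        }
    }

    public static void main(String[] args) {
        // A worker would discover registered implementations like this
        // (registration needs a META-INF/services entry, omitted here):
        for (BeamWorkerInitializer init : ServiceLoader.load(BeamWorkerInitializer.class)) {
            init.onStartup();
        }
        // For illustration, invoke one implementation directly:
        new SetPropertyInitializer().onStartup();
        System.out.println(System.getProperty("my.custom.flag"));
    }
}
```

Because discovery goes through ServiceLoader, the initialization is defined once and runs on every worker, regardless of which DoFn classes that worker loads.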

Since this is a pretty fundamental change I wanted to have a quick
discussion here before merging, in case there are any comments or concerns.

Thanks!
Brian

[1] https://github.com/apache/beam/pull/8104
