Hi everyone,

In Dataflow we've had some Java SDK user issues that could be easily solved if users were able to define some one-time initialization code (e.g. setting a system property) that their workers execute before they begin to process data.

One solution we've found that works with currently released SDKs is for users to put that code in static {} blocks on any DoFn implementations that require that initialization. The code is then executed once, when/if that DoFn class is loaded. This works well in a lot of cases but also has some limitations:
- Each DoFn that depends on that initialization needs to include the same initialization
- There is no way for users to know which workers executed a particular DoFn
- Users could end up with workers that have different configurations
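For readers who haven't seen the workaround, here's a minimal sketch of the static-block approach. The class and property names are made up for illustration, and the Beam DoFn machinery is omitted so the sketch stays self-contained:

```java
// Hypothetical sketch of the static {} block workaround. In real code this
// would be a DoFn subclass; PropertySettingFn and my.custom.flag are assumed
// names, not part of the Beam SDK.
class PropertySettingFn {
    static {
        // Runs exactly once per JVM, when/if this class is first loaded on a
        // worker - i.e. only on workers that actually execute this DoFn.
        System.setProperty("my.custom.flag", "true");
    }

    // Stand-in for a DoFn's @ProcessElement method.
    String process(String element) {
        return element + ":" + System.getProperty("my.custom.flag");
    }
}
```

Note the limitation described above: the property is only set on JVMs that happen to load this class, so a second DoFn relying on the same property must repeat the static block.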
You could perhaps argue that these are actually good things - the initialization only runs when it's needed - but they can also lead to confusing behavior.

So I'd like to propose an addition to the Java SDK that provides hooks for JVM initialization that are guaranteed to execute once across all workers. I've written up a PR [1] that implements this. It adds a service interface, BeamWorkerInitializer, that users can implement to define some initialization, and modifies workers (currently just the portable worker and the Dataflow worker) to find and execute these implementations using ServiceLoader. BeamWorkerInitializer has two methods that can be overridden: onStartup, which workers run immediately after starting, and beforeProcessing, which workers run after initializing things like logging, but before beginning to process data.

Since this is a pretty fundamental change, I wanted to have a quick discussion here before merging, in case there are any comments or concerns.

Thanks!
Brian

[1] https://github.com/apache/beam/pull/8104
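P.S. For concreteness, here is a rough sketch of the shape of the proposal. The interface and method names (BeamWorkerInitializer, onStartup, beforeProcessing) come from the description above, but the exact signatures are assumptions - see the PR for the real definitions. In a real worker the implementations would be discovered via ServiceLoader.load(BeamWorkerInitializer.class) plus a META-INF/services registration; the sketch hard-codes the list so it runs standalone:

```java
import java.util.List;

// Assumed shape of the proposed hook interface; the actual signatures live in
// the PR. Default methods let users override only the hook they need.
interface BeamWorkerInitializer {
    default void onStartup() {}          // run immediately after the worker starts
    default void beforeProcessing() {}   // run after logging etc. is set up,
                                         // but before any data is processed
}

// Example user implementation: set a system property once per worker JVM.
// SetFlagInitializer and my.custom.flag are made-up names.
class SetFlagInitializer implements BeamWorkerInitializer {
    @Override public void onStartup() {
        System.setProperty("my.custom.flag", "true");
    }
}

class WorkerBootSketch {
    // A real worker would populate this via
    // ServiceLoader.load(BeamWorkerInitializer.class); hard-coded here.
    static final List<BeamWorkerInitializer> INITIALIZERS =
            List.of(new SetFlagInitializer());

    public static void main(String[] args) {
        INITIALIZERS.forEach(BeamWorkerInitializer::onStartup);
        // ... worker sets up logging and other infrastructure here ...
        INITIALIZERS.forEach(BeamWorkerInitializer::beforeProcessing);
        // ... worker begins processing data, with the property guaranteed set ...
    }
}
```

Because the worker itself drives these hooks, they run on every worker JVM regardless of which DoFns that worker happens to load.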