On Mon, Apr 15, 2019 at 9:35 AM Udi Meiri <eh...@google.com> wrote:

> Is this like the way Python SDK allows for a custom setup.py?
> example:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py
>

A custom setup.py is slightly different. It executes a custom piece of code
before the Python interpreter starts and before any worker code runs.
Brian's proposal executes immediately after the worker starts.


>
> On Fri, Apr 12, 2019 at 10:51 AM Lukasz Cwik <lc...@google.com> wrote:
>
>> +1 on the use cases that Ahmet pointed out and the solution that Brian
>> put forth. I like how the change is being applied to the Beam Java SDK
>> harness and not just Dataflow so all portable runner users get this as well.
>>
>> On Wed, Apr 10, 2019 at 9:03 PM Kenneth Knowles <k...@apache.org> wrote:
>>
>>>
>>>
>>> On Wed, Apr 10, 2019 at 8:18 PM Ahmet Altay <al...@google.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Apr 10, 2019 at 7:59 PM Kenneth Knowles <k...@apache.org>
>>>> wrote:
>>>>
>>>>> TL;DR I like the simple approach better than the ServiceLoader
>>>>> solution when a particular DoFn depends on the result. The ServiceLoader
>>>>> solution fits when it is somewhat independent of a particular DoFn (I'm
>>>>> not sure of the use case(s)).
>>>>>
>>>>> On Wed, Apr 10, 2019 at 4:10 PM Brian Hulette <bhule...@google.com>
>>>>> wrote:
>>>>>
>>>>>> - Each DoFn that depends on that initialization needs to include the
>>>>>> same initialization
>>>>>>
>>>>>
>>>>> What if a DoFn that depends on the initialization is used in a new
>>>>> context? Then it is relying on initialization done elsewhere, and it will
>>>>> break or, worse, give wrong results. So I think this bullet point is a
>>>>> feature, not a bug. And if the initialization is built as a static method
>>>>> of some third class, referenced by all the DoFns that need it, it is a
>>>>> one-liner to declare the dependency explicitly.
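
A minimal sketch of the "static method of some third class" idea described
above, written without a Beam dependency so it stands alone. JvmSetup, ParseFn,
FormatFn, and the "example.setup.done" property are hypothetical names used
only for illustration; in a real pipeline the two Fn classes would extend
Beam's DoFn.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical holder for JVM-wide setup shared by several DoFns. */
final class JvmSetup {
  // Counts how often the setup body ran (kept only to make the sketch checkable).
  static final AtomicInteger RUNS = new AtomicInteger();

  // A static initializer runs at most once per class loader, the first time
  // the class is initialized.
  static {
    RUNS.incrementAndGet();
    System.setProperty("example.setup.done", "true"); // placeholder for real setup
  }

  private JvmSetup() {}

  /** Intentionally empty: calling it forces class initialization. */
  static void ensureInitialized() {}
}

/** Stand-in for a DoFn that depends on the setup. */
final class ParseFn {
  static { JvmSetup.ensureInitialized(); } // the one-liner declaring the dependency

  String process(String element) {
    return element.trim();
  }
}

/** A second stand-in DoFn declaring the same dependency. */
final class FormatFn {
  static { JvmSetup.ensureInitialized(); }

  String process(String element) {
    return "[" + element + "]";
  }
}
```

Whichever of the two classes a JVM loads first triggers the setup, and the
second reference is a no-op, so each class carries its own dependency into any
new context where it is used.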
>>>>>
>>>>>
>>>>>> - There is no way for users to know which workers executed a
>>>>>> particular DoFn - users could have workers with different configurations
>>>>>>
>>>>>
>>>>> What is a worker? j/k. Each runner has different notions of what a
>>>>> worker is, including the Java SDK Harness. But they all do require one or
>>>>> more JVMs. It is true that you can't easily predict which DoFn classes are
>>>>> loaded on a particular JVM. This bullet is a strong case against
>>>>> initialization at a distance. I think your proposed solution and also the
>>>>> simple static block approach avoid this pitfall, so all is good.
>>>>>
>>>>>> You could perhaps argue that these are actually good things - we only
>>>>>> run the initialization when it's needed - but it could also lead to
>>>>>> confusing behavior.
>>>>>>
>>>>>
>>>>> FWIW my argument above is not about only running when needed. The
>>>>> opposite - it is about being certain it is run when needed.
>>>>>
>>>>>
>>>>>> So I'd like to propose an addition to the Java SDK that provides
>>>>>> hooks for JVM initialization that is guaranteed to execute exactly once
>>>>>> on each worker. I've written up a PR [1] that implements this. It adds a
>>>>>> service interface, BeamWorkerInitializer, that users can implement to
>>>>>> define some initialization, and modifies workers (currently just the
>>>>>> portable worker and the dataflow worker) to find and execute these
>>>>>> implementations using ServiceLoader. BeamWorkerInitializer has two
>>>>>> methods that can be overridden: onStartup, which workers run immediately
>>>>>> after starting, and beforeProcessing, which workers run after
>>>>>> initializing things like logging, but before beginning to process data.
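
A rough sketch of how such a hook mechanism could look, assuming no-argument
hook methods. WorkerInitializerSketch, TimeZoneInitializer, and WorkerSketch
are hypothetical stand-ins, not the interface or worker code from the PR; only
java.util.ServiceLoader is a real API here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical stand-in for the proposed service interface; signatures are assumed. */
interface WorkerInitializerSketch {
  default void onStartup() {}

  default void beforeProcessing() {}
}

/** Example user implementation that records when each hook fires. */
final class TimeZoneInitializer implements WorkerInitializerSketch {
  static final List<String> EVENTS = new ArrayList<>();

  @Override
  public void onStartup() {
    EVENTS.add("onStartup");
  }

  @Override
  public void beforeProcessing() {
    EVENTS.add("beforeProcessing");
  }
}

/** Simplified worker entry point showing where the hooks would be called. */
final class WorkerSketch {
  static void run(Iterable<WorkerInitializerSketch> initializers) {
    for (WorkerInitializerSketch init : initializers) {
      init.onStartup();
    }
    // ... a real worker would initialize logging and similar services here ...
    for (WorkerInitializerSketch init : initializers) {
      init.beforeProcessing();
    }
    // ... and only then begin processing data ...
  }

  static void runWithServiceLoader() {
    // ServiceLoader finds implementations listed in a META-INF/services file
    // named after the interface's binary name.
    run(ServiceLoader.load(WorkerInitializerSketch.class));
  }
}
```

With this shape, a user would only ship an implementation class plus the
META-INF/services registration file on the worker classpath; no DoFn has to
reference the initializer.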
>>>>>>
>>>>>> Since this is a pretty fundamental change I wanted to have a quick
>>>>>> discussion here before merging, in case there are any comments or
>>>>>> concerns.
>>>>>>
>>>>>
>>>>> FWIW (again) I have no objection to the general idea and don't have
>>>>> any problem with making such a fundamental change. I actually think your
>>>>> change is probably useful. But if a particular DoFn depends on the JVM
>>>>> being configured a certain way, a static block in that DoFn class seems
>>>>> more readable and reliable.
>>>>>
>>>>> Are there use cases for more generic JVM initialization that,
>>>>> presumably, a user would want to affect all their DoFns?
>>>>>
>>>>
>>>> A few things I can recall from recent user interactions are a need to
>>>> set custom SSL providers and time zone rules providers. Users would want
>>>> such settings to apply to all the DoFns in a pipeline.
>>>>
>>>
>>> This makes sense. Another perspective is whether the
>>> initialization/configuration might be orthogonal to the DoFns in the
>>> pipeline. These seem to fit that description.
>>>
>>> Kenn
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> Kenn
>>>>>
>>>>>
>>>>>> Thanks!
>>>>>> Brian
>>>>>>
>>>>>> [1] https://github.com/apache/beam/pull/8104
>>>>>>
>>>>>
