Have you considered drafting in detail what you think this API might look like?
If it's a radically different API, it might be more appropriate as an alternative, parallel Beam API rather than a replacement for the current API (there is also one such fluent API in the works).

On Sat, Mar 10, 2018 at 7:23 AM Romain Manni-Bucau <[email protected]> wrote:

> On 2018-03-10 at 16:19 GMT+01:00, Reuven Lax <[email protected]> wrote:
>
>> This is another version (maybe a better, Java 8 idiomatic one?) of what
>> Kenn suggested.
>>
>> Note that with NewDoFn this need not be incompatible (so it might not
>> require waiting until Beam 3.0). We can recognize new parameters to
>> processElement and populate them as needed.
>
> This is right, however in my head it was a one-way movement, to force the
> design to be reactive rather than faking a reactive API with a synchronous,
> non-reactive implementation, which I fear is what would happen today if
> both were supported.
>
>> On Sat, Mar 10, 2018, 12:13 PM Romain Manni-Bucau <[email protected]> wrote:
>>
>>> Yes. For the DoFn, for instance, instead of ProcessContext.element()
>>> returning T you would get a CompletionStage<T>, and output would accept
>>> one as well.
>>>
>>> This way you register an execution chain. Mixed with streams, you get a
>>> big data Java 8/9/10 API which enables any connectivity in a
>>> well-performing manner ;).
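A minimal sketch of the kind of signature being proposed here, purely for illustration - the CompletionStage-valued element parameter, AsyncOutput, and client are invented names, and no such API exists in Beam today:

    // Hypothetical API, for illustration only: the element arrives as a
    // CompletionStage and the output accepts one, so processElement merely
    // registers an async chain instead of doing blocking work on the element.
    new DoFn<Request, Response>() {
      @ProcessElement
      public void process(CompletionStage<Request> element, AsyncOutput<Response> output) {
        output.emit(element.thenCompose(request -> client.callAsync(request)));
      }
    }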
>>> On Mar 10, 2018 at 13:56, Reuven Lax <[email protected]> wrote:
>>>
>>>> So you mean the user should have a way of registering asynchronous
>>>> activity with a callback (the callback must be registered with Beam,
>>>> because Beam needs to know not to mark the element as done until all
>>>> associated callbacks have completed). I think that's basically what Kenn
>>>> was suggesting, unless I'm missing something.
>>>>
>>>> On Fri, Mar 9, 2018 at 11:07 PM Romain Manni-Bucau <[email protected]> wrote:
>>>>
>>>>> Yes, callback based. Beam today is synchronous, and until bundles and
>>>>> combines are reactive-friendly, Beam will stay synchronous whatever the
>>>>> other parts do. Becoming reactive would make it possible to manage the
>>>>> threading issues properly and to get better scalability in the overall
>>>>> execution when remote IO is involved.
>>>>>
>>>>> However, it requires breaking the source and SDF designs to use
>>>>> CompletionStage - or an equivalent - to chain the processing properly
>>>>> and in a unified fashion.
>>>>>
>>>>> On Mar 9, 2018 at 23:48, Reuven Lax <[email protected]> wrote:
>>>>>
>>>>> If you're talking about reactive programming, at a certain level Beam
>>>>> is already reactive. Are you referring to a specific way of writing
>>>>> the code?
>>>>>
>>>>> On Fri, Mar 9, 2018 at 1:59 PM Reuven Lax <[email protected]> wrote:
>>>>>
>>>>>> What do you mean by reactive?
>>>>>>
>>>>>> On Fri, Mar 9, 2018, 6:58 PM Romain Manni-Bucau <[email protected]> wrote:
>>>>>>
>>>>>>> @Kenn: why not prefer making Beam reactive? It would allow scaling
>>>>>>> much further without having to synchronize multithreading by hand.
>>>>>>> Elegant and efficient :). Beam 3?
>>>>>>>
>>>>>>> On Mar 9, 2018 at 22:49, Kenneth Knowles <[email protected]> wrote:
>>>>>>>
>>>>>>>> I will start with the "exciting futuristic" answer, which is that
>>>>>>>> we envision the new DoFn to be able to provide an automatic
>>>>>>>> ExecutorService parameter that you can use as you wish.
>>>>>>>>
>>>>>>>> new DoFn<>() {
>>>>>>>>   @ProcessElement
>>>>>>>>   public void process(ProcessContext ctx, ExecutorService executorService) {
>>>>>>>>     ... launch some futures, put them in instance vars ...
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   @FinishBundle
>>>>>>>>   public void finish(...) {
>>>>>>>>     ... block on futures, output results if appropriate ...
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> This way, the Java SDK harness can use its overarching knowledge of
>>>>>>>> what is going on in a computation to, for example, share a thread
>>>>>>>> pool between different bits. This was one reason to delete
>>>>>>>> IntraBundleParallelization - it didn't allow the runner and user
>>>>>>>> code to properly manage how many things were going on concurrently.
>>>>>>>> And mostly the runner should own parallelizing to max out cores;
>>>>>>>> what user code needs is asynchrony hooks that can interact with
>>>>>>>> that. However, this feature is not thoroughly considered. TBD how
>>>>>>>> much the harness itself manages blocking on outstanding requests
>>>>>>>> versus it being your responsibility in FinishBundle, etc.
>>>>>>>>
>>>>>>>> I haven't explored rolling your own here, if you are willing to do
>>>>>>>> the knob tuning to get the threading acceptable for your particular
>>>>>>>> use case. Perhaps someone else can weigh in.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Fri, Mar 9, 2018 at 1:38 PM Josh Ferge <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello all:
>>>>>>>>>
>>>>>>>>> Our team has a pipeline that makes external network calls. These
>>>>>>>>> pipelines are currently super slow, and the hypothesis is that
>>>>>>>>> they are slow because we are not threading our network calls. The
>>>>>>>>> GitHub pull request below provides some discussion around this:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/beam/pull/957
>>>>>>>>>
>>>>>>>>> In Beam 1.0 there was IntraBundleParallelization, which helped
>>>>>>>>> with this. However, it was removed because it didn't comply with a
>>>>>>>>> few Beam paradigms.
>>>>>>>>>
>>>>>>>>> Questions going forward:
>>>>>>>>>
>>>>>>>>> What is advised for jobs that make blocking network calls? It
>>>>>>>>> seems bundling the elements into groups of size X prior to passing
>>>>>>>>> them to the DoFn, and managing the threading within the function,
>>>>>>>>> might work. Thoughts?
>>>>>>>>> Are these types of jobs even suitable for Beam?
>>>>>>>>> Are there any plans to develop features that help with this?
>>>>>>>>>
>>>>>>>>> Thanks
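A concrete, hedged reading of the "groups of size X" idea above, using only the current Java SDK: batch elements upstream, then fan the blocking calls out on a private pool inside the DoFn and join before emitting, so every output happens on the processElement thread and the bundle cannot be marked done early. In this sketch, callService, the pool size, and the String types are placeholders rather than a recommended pattern:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    /** Fans blocking calls out over a batch of elements, then outputs synchronously. */
    class ParallelBatchCallFn extends DoFn<KV<String, Iterable<String>>, String> {

      private transient ExecutorService executor;

      @Setup
      public void setup() {
        // Pool size is a knob to tune against the external service's limits.
        executor = Executors.newFixedThreadPool(32);
      }

      @ProcessElement
      public void process(ProcessContext ctx) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (String element : ctx.element().getValue()) {
          futures.add(CompletableFuture.supplyAsync(() -> callService(element), executor));
        }
        // Join here so all outputs happen on the processElement thread and the
        // bundle cannot complete before the calls do.
        for (CompletableFuture<String> future : futures) {
          ctx.output(future.join());
        }
      }

      @Teardown
      public void teardown() {
        executor.shutdown();
      }

      // Placeholder for the real blocking network call.
      private String callService(String element) {
        return element;
      }
    }

The batches themselves could come from something like GroupIntoBatches.<String, String>ofSize(50) applied upstream, at the cost of a shuffle.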

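Similarly, Kenn's futures-in-instance-vars sketch can be approximated today with window injection and FinishBundleContext, both of which already exist in the Java SDK. Again a sketch under assumptions - callService and the pool size are invented, and how gracefully a given runner handles the per-bundle blocking is untested here:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.joda.time.Instant;

    /** Launches async work per element, blocks once per bundle in @FinishBundle. */
    class AsyncPerBundleFn extends DoFn<String, String> {

      // One in-flight result plus the metadata needed to emit it later.
      private static class Pending {
        final CompletableFuture<String> result;
        final Instant timestamp;
        final BoundedWindow window;

        Pending(CompletableFuture<String> result, Instant timestamp, BoundedWindow window) {
          this.result = result;
          this.timestamp = timestamp;
          this.window = window;
        }
      }

      private transient ExecutorService executor;
      private transient List<Pending> pending;

      @Setup
      public void setup() {
        executor = Executors.newFixedThreadPool(16);
      }

      @StartBundle
      public void startBundle() {
        pending = new ArrayList<>();
      }

      @ProcessElement
      public void process(ProcessContext ctx, BoundedWindow window) {
        String element = ctx.element();
        pending.add(new Pending(
            CompletableFuture.supplyAsync(() -> callService(element), executor),
            ctx.timestamp(),
            window));
      }

      @FinishBundle
      public void finishBundle(FinishBundleContext ctx) {
        // Block on every outstanding call so the bundle is not committed early.
        for (Pending p : pending) {
          ctx.output(p.result.join(), p.timestamp, p.window);
        }
      }

      @Teardown
      public void teardown() {
        executor.shutdown();
      }

      // Placeholder for the real blocking network call.
      private String callService(String element) {
        return element;
      }
    }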