So you mean the user should have a way of registering asynchronous activity with a callback (the callback must be registered with Beam, because Beam needs to know not to mark the element as done until all associated callbacks have completed). I think that's basically what Kenn was suggesting, unless I'm missing something.
On Fri, Mar 9, 2018 at 11:07 PM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> Yes, callback based. Beam today is synchronous, and until bundles and
> combines are reactive-friendly, Beam will stay synchronous whatever the
> other parts do. Becoming reactive would make it possible to manage the
> threading issues properly and to get better scalability on the overall
> execution when remote IO is involved.
>
> However, it requires breaking the source and SDF designs to use
> CompletionStage - or an equivalent - to chain the processing properly
> and in a unified fashion.
>
> On Mar 9, 2018 at 23:48, "Reuven Lax" <re...@google.com> wrote:
>
> If you're talking about reactive programming, at a certain level Beam is
> already reactive. Are you referring to a specific way of writing the code?
>
> On Fri, Mar 9, 2018 at 1:59 PM Reuven Lax <re...@google.com> wrote:
>
>> What do you mean by reactive?
>>
>> On Fri, Mar 9, 2018, 6:58 PM Romain Manni-Bucau <rmannibu...@gmail.com>
>> wrote:
>>
>>> @Kenn: why not make Beam reactive instead? It would allow scaling much
>>> further without heavy multithreading synchronization. Elegant and
>>> efficient :). Beam 3?
>>>
>>> On Mar 9, 2018 at 22:49, "Kenneth Knowles" <k...@google.com> wrote:
>>>
>>>> I will start with the "exciting futuristic" answer, which is that we
>>>> envision the new DoFn to be able to provide an automatic
>>>> ExecutorService parameter that you can use as you wish:
>>>>
>>>> new DoFn<>() {
>>>>   @ProcessElement
>>>>   public void process(ProcessContext ctx, ExecutorService executorService) {
>>>>     ... launch some futures, put them in instance vars ...
>>>>   }
>>>>
>>>>   @FinishBundle
>>>>   public void finish(...) {
>>>>     ... block on futures, output results if appropriate ...
>>>>   }
>>>> }
>>>>
>>>> This way, the Java SDK harness can use its overarching knowledge of
>>>> what is going on in a computation to, for example, share a thread pool
>>>> between different bits.
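The "launch futures in process, block in finish" lifecycle above can be mimicked outside Beam with plain java.util.concurrent. The sketch below is illustrative only: the class and method names are invented stand-ins for DoFn lifecycle methods, not Beam APIs, and toUpperCase stands in for a remote call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the pattern: submit async work per element,
// then block on all outstanding futures at the bundle boundary.
public class AsyncBundleSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final List<Future<String>> pending = new ArrayList<>();

    // Analogous to @ProcessElement: kick off async work, do not block here.
    public void processElement(String element) {
        // toUpperCase stands in for a slow network call.
        pending.add(pool.submit(() -> element.toUpperCase()));
    }

    // Analogous to @FinishBundle: block until every outstanding future
    // completes, so no element is "done" before its async work is.
    public List<String> finishBundle() throws Exception {
        List<String> out = new ArrayList<>();
        for (Future<String> f : pending) {
            out.add(f.get()); // preserves submission order
        }
        pending.clear();
        return out;
    }

    // Analogous to @Teardown: release the pool.
    public void teardown() {
        pool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        AsyncBundleSketch fn = new AsyncBundleSketch();
        fn.processElement("a");
        fn.processElement("b");
        System.out.println(fn.finishBundle()); // [A, B]
        fn.teardown();
    }
}
```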
>>>> This was one reason to delete IntraBundleParallelization - it didn't
>>>> allow the runner and user code to properly manage how many things were
>>>> going on concurrently. Mostly, the runner should own parallelizing to
>>>> max out the cores; what user code needs is asynchrony hooks that can
>>>> interact with that. However, this feature is not thoroughly considered
>>>> yet. TBD how much the harness itself manages blocking on outstanding
>>>> requests versus that being your responsibility in FinishBundle, etc.
>>>>
>>>> I haven't explored rolling your own here, if you are willing to do the
>>>> knob tuning to get the threading acceptable for your particular use
>>>> case. Perhaps someone else can weigh in.
>>>>
>>>> Kenn
>>>>
>>>> On Fri, Mar 9, 2018 at 1:38 PM Josh Ferge <josh.fe...@bounceexchange.com>
>>>> wrote:
>>>>
>>>>> Hello all:
>>>>>
>>>>> Our team has a pipeline that makes external network calls. These
>>>>> pipelines are currently very slow, and the hypothesis is that they are
>>>>> slow because we are not threading our network calls. The GitHub issue
>>>>> below provides some discussion around this:
>>>>>
>>>>> https://github.com/apache/beam/pull/957
>>>>>
>>>>> In Beam 1.0, there was IntraBundleParallelization, which helped with
>>>>> this. However, it was removed because it didn't comply with a few Beam
>>>>> paradigms.
>>>>>
>>>>> Questions going forward:
>>>>>
>>>>> What is advised for jobs that make blocking network calls? It seems
>>>>> that bundling the elements into groups of size X before passing them
>>>>> to the DoFn, and managing the threading within the function, might
>>>>> work. Thoughts?
>>>>> Are these types of jobs even suitable for Beam?
>>>>> Are there any plans to develop features that help with this?
>>>>>
>>>>> Thanks
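The "roll your own" approach Josh describes - group elements into batches of size X, then fan the blocking calls out across a pool inside the DoFn body - can be sketched with ExecutorService.invokeAll. This is a hedged sketch using only java.util.concurrent; processBatch and fakeNetworkCall are invented names, and a real pipeline would need to size the pool and batches for its own workload.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: issue all blocking calls for one batch in parallel,
// as one might do inside a DoFn that receives pre-batched elements.
public class BatchedCallsSketch {

    // Stand-in for a blocking network call.
    static String fakeNetworkCall(String element) {
        return "resp:" + element;
    }

    // Process one batch of elements with up to `threads` concurrent calls.
    static List<String> processBatch(List<String> batch, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> calls = new ArrayList<>();
            for (String element : batch) {
                calls.add(() -> fakeNetworkCall(element));
            }
            // invokeAll blocks until every task is done and returns the
            // futures in the same order the tasks were submitted.
            List<String> out = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(calls)) {
                out.add(f.get());
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(processBatch(List.of("a", "b", "c"), 2)); // [resp:a, resp:b, resp:c]
    }
}
```

Note that creating a fresh pool per batch is only for self-containment here; a long-lived DoFn would typically reuse a pool created in setup and shut it down in teardown.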