Re: Advice on parallelizing network calls in DoFn

Romain Manni-Bucau Sun, 11 Mar 2018 07:19:23 -0700

Which is still a key feature for SDF, but I agree it can be dropped in favor of
an OutputEmitter pattern, with the DoFn moved to a plain parameter-injection
based pattern. Both (with CompletionStage) stay compatible :).
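
The CompletionStage-plus-output idea discussed in this thread can be sketched
with plain java.util.concurrent types. This is only an illustration under
assumptions: `process`, `runBundle`, and the `Consumer` standing in for an
output emitter are hypothetical names, not Beam APIs. The sketch shows user
code chaining on a stage, and a stand-in "runner" that treats each element as
done only once its chain has completed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these types or methods are Beam APIs. */
public class AsyncElementSketch {

  /** User code in the proposed style: receives a stage, chains work, emits output. */
  static CompletionStage<Void> process(CompletionStage<String> element, Consumer<Integer> output) {
    return element.thenApply(String::length).thenAccept(output);
  }

  /** Stand-in for the runner: waits for every registered chain before returning. */
  public static List<Integer> runBundle(List<String> inputs) {
    // Callbacks may run on different threads, so the output sink must be thread-safe.
    List<Integer> emitted = Collections.synchronizedList(new ArrayList<>());
    List<CompletableFuture<Void>> pending = new ArrayList<>();
    for (String input : inputs) {
      CompletableFuture<String> element = CompletableFuture.supplyAsync(() -> input);
      pending.add(process(element, emitted::add).toCompletableFuture());
    }
    // The bundle is finished only when all chains have completed.
    CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    return emitted;
  }

  public static void main(String[] args) {
    List<Integer> lengths = new ArrayList<>(runBundle(Arrays.asList("a", "bb", "ccc")));
    Collections.sort(lengths); // completion order is not deterministic
    System.out.println(lengths);
  }
}
```

Note that the output order depends on which chain completes first, which is
one reason shared mutable state in callbacks needs care.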


On 11 March 2018 at 13:12, "Reuven Lax" <[email protected]> wrote:

> I think process context should go away completely. At that point it has
> little use except for a way to send output downstream.
>
> On Sun, Mar 11, 2018, 6:07 AM Romain Manni-Bucau <[email protected]>
> wrote:
>
>> Hmm, thinking out loud, but CompletionStage should/could be extended to
>> replace ProcessContext, since it represents element and output at the same
>> time, no?
>>
>> On 11 March 2018 at 00:57, "Kenneth Knowles" <[email protected]> wrote:
>>
>>> Yea, I think it could. But it is probably more readable to not overload
>>> the term, plus certainly a bit simpler in implementation. So perhaps
>>> @AsyncElement to make it very clear.
>>>
>>> Kenn
>>>
>>> On Sat, Mar 10, 2018 at 1:32 PM Reuven Lax <[email protected]> wrote:
>>>
>>>> Kenn, can NewDoFn distinguish at generation time between:
>>>>
>>>>     public void process(@Element CompletionStage<InputT> element, ...) {
>>>>
>>>> and
>>>>
>>>>     public void process(@Element InputT element, ...) {
>>>>
>>>> If not, then we would probably need separate annotations....
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Mar 10, 2018 at 11:09 AM Kenneth Knowles <[email protected]>
>>>> wrote:
>>>>
>>>>> Nice! I agree that providing a CompletionStage for chaining is much
>>>>> better than an ExecutorService, and very clear.
>>>>>
>>>>> It is very feasible to add support that looks like
>>>>>
>>>>>   new DoFn<InputT, OutputT>() {
>>>>>     @ProcessElement
>>>>>     public void process(@Element CompletionStage<InputT> element, ...) {
>>>>>       element.thenApply(...)
>>>>>     }
>>>>>   }
>>>>>
>>>>> If we had this available, I think users could even experiment with
>>>>> this often as it might help even where it isn't obvious.
>>>>>
>>>>> My main hesitation is that a big part of Beam is giving a
>>>>> basic/imperative style of programming a DoFn that executes in a very smart
>>>>> functional/parallel way. Full future-oriented programming is not
>>>>> explored much outside of JavaScript (and maybe Haskell) and requires
>>>>> greater discipline in programming in a functional manner: if you are
>>>>> mutating stuff in your callback you are going to have bugs, and then when
>>>>> you add concurrency control you are going to have bad performance and
>>>>> deadlocks. So I definitely wouldn't make it the default or want to spend
>>>>> all our support effort on teaching advanced programming techniques.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Sat, Mar 10, 2018 at 9:31 AM Romain Manni-Bucau <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> 2018-03-10 17:30 GMT+01:00 Reuven Lax <[email protected]>:
>>>>>>
>>>>>>> Have you considered drafting in detail what you think this API might
>>>>>>> look like?
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Yes, but it comes after the "enhancements" (for my use cases) and
>>>>>> "bugs" lists, so I haven't started working on it much.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> If it's a radically different API, it might be more appropriate as
>>>>>>> an alternative parallel Beam API rather than a replacement for the
>>>>>>> current API (there is also one such fluent API in the works).
>>>>>>>
>>>>>>
>>>>>> What I plan is to draft it on top of Beam (so the "useless" case I
>>>>>> spoke about before), then propose to implement it ~natively and make it
>>>>>> the main API in another major version.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Mar 10, 2018 at 7:23 AM Romain Manni-Bucau <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2018-03-10 16:19 GMT+01:00 Reuven Lax <[email protected]>:
>>>>>>>>
>>>>>>>>> This is another version (maybe a better, Java 8 idiomatic one?) of
>>>>>>>>> what Kenn suggested.
>>>>>>>>>
>>>>>>>>> Note that with NewDoFn this need not be incompatible (so might not
>>>>>>>>> require waiting till Beam 3.0). We can recognize new parameters to
>>>>>>>>> processElement and populate as needed.
>>>>>>>>>
>>>>>>>>
>>>>>>>> This is right; however, in my head it was a one-way move to enforce
>>>>>>>> a reactive design, rather than faking a reactive API on top of a
>>>>>>>> synchronous, non-reactive implementation, which I fear is what would
>>>>>>>> happen today if both were supported.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Mar 10, 2018, 12:13 PM Romain Manni-Bucau <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Yes. For the DoFn, for instance, instead of having
>>>>>>>>>> processcontext.element()=<T> you get a CompletionStage<T>, and output
>>>>>>>>>> gets it as well.
>>>>>>>>>>
>>>>>>>>>> This way you register an execution chain. Mixed with streams you
>>>>>>>>>> get a big data Java 8/9/10 API which enables any connectivity in a
>>>>>>>>>> well-performing manner ;).
>>>>>>>>>>
>>>>>>>>>> On 10 March 2018 at 13:56, "Reuven Lax" <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> So you mean the user should have a way of registering
>>>>>>>>>>> asynchronous activity with a callback (the callback must be
>>>>>>>>>>> registered with Beam, because Beam needs to know not to mark the
>>>>>>>>>>> element as done until all associated callbacks have completed). I
>>>>>>>>>>> think that's basically what Kenn was suggesting, unless I'm missing
>>>>>>>>>>> something.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 9, 2018 at 11:07 PM Romain Manni-Bucau <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, callback based. Beam today is synchronous, and until
>>>>>>>>>>>> bundles and combines are reactive friendly, Beam will stay synchronous
>>>>>>>>>>>> whatever other parts do. Becoming reactive would make it possible to
>>>>>>>>>>>> manage the threading issues properly and to get better scalability of
>>>>>>>>>>>> the overall execution when remote IO is involved.
>>>>>>>>>>>>
>>>>>>>>>>>> However, it requires breaking the source and SDF designs to use
>>>>>>>>>>>> CompletionStage (or an equivalent) to chain the processing properly
>>>>>>>>>>>> and in a unified fashion.
>>>>>>>>>>>>
>>>>>>>>>>>> On 9 March 2018 at 23:48, "Reuven Lax" <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> If you're talking about reactive programming, at a certain
>>>>>>>>>>>> level Beam is already reactive. Are you referring to a specific
>>>>>>>>>>>> way of writing the code?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 9, 2018 at 1:59 PM Reuven Lax <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> What do you mean by reactive?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Mar 9, 2018, 6:58 PM Romain Manni-Bucau <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> @Kenn: why not prefer to make Beam reactive? It would allow
>>>>>>>>>>>>>> scaling much further without heavy multithreading synchronization.
>>>>>>>>>>>>>> Elegant and efficient :). Beam 3?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 9 March 2018 at 22:49, "Kenneth Knowles" <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I will start with the "exciting futuristic" answer, which is
>>>>>>>>>>>>>>> that we envision the new DoFn being able to provide an automatic
>>>>>>>>>>>>>>> ExecutorService parameter that you can use as you wish.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     new DoFn<>() {
>>>>>>>>>>>>>>>       @ProcessElement
>>>>>>>>>>>>>>>       public void process(ProcessContext ctx, ExecutorService executorService) {
>>>>>>>>>>>>>>>           ... launch some futures, put them in instance vars ...
>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>       @FinishBundle
>>>>>>>>>>>>>>>       public void finish(...) {
>>>>>>>>>>>>>>>          ... block on futures, output results if appropriate ...
>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This way, the Java SDK harness can use its overarching
>>>>>>>>>>>>>>> knowledge of what is going on in a computation to, for example,
>>>>>>>>>>>>>>> share a thread pool between different bits. This was one reason to
>>>>>>>>>>>>>>> delete IntraBundleParallelization - it didn't allow the runner and
>>>>>>>>>>>>>>> user code to properly manage how many things were going on
>>>>>>>>>>>>>>> concurrently. And mostly the runner should own parallelizing to max
>>>>>>>>>>>>>>> out cores and what user code needs is asynchrony hooks that can
>>>>>>>>>>>>>>> interact with that. However, this feature is not thoroughly
>>>>>>>>>>>>>>> considered. TBD how much the harness itself manages blocking on
>>>>>>>>>>>>>>> outstanding requests versus it being your responsibility in
>>>>>>>>>>>>>>> FinishBundle, etc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I haven't explored rolling your own here, if you are willing
>>>>>>>>>>>>>>> to do the knob tuning to get the threading acceptable for your
>>>>>>>>>>>>>>> particular use case. Perhaps someone else can weigh in.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Mar 9, 2018 at 1:38 PM Josh Ferge <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello all:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Our team has pipelines that make external network calls.
>>>>>>>>>>>>>>>> These pipelines are currently super slow, and the hypothesis
>>>>>>>>>>>>>>>> is that they are slow because we are not threading our network
>>>>>>>>>>>>>>>> calls. The GitHub issue below provides some discussion around this:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/957
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In Beam 1.0, there was IntraBundleParallelization, which
>>>>>>>>>>>>>>>> helped with this. However, this was removed because it didn't
>>>>>>>>>>>>>>>> comply with a few Beam paradigms.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Questions going forward:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What is advised for jobs that make blocking network calls?
>>>>>>>>>>>>>>>> It seems bundling the elements into groups of size X prior to
>>>>>>>>>>>>>>>> passing to the DoFn, and managing the threading within the
>>>>>>>>>>>>>>>> function, might work. Thoughts?
>>>>>>>>>>>>>>>> Are these types of jobs even suitable for beam?
>>>>>>>>>>>>>>>> Are there any plans to develop features that help with this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>
>>>>>>
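
The batching approach raised in the original question (group elements, then
manage threading inside the function) can be sketched with the JDK alone. This
is a minimal illustration, not a Beam API: `BatchedCalls`, `callAll`, and the
`maxConcurrency` parameter are hypothetical names, and the `Function` stands in
for whatever blocking network call the pipeline makes. It mirrors the
"launch futures, then block on them" shape described in the thread, with a
bounded pool so the number of in-flight calls stays under control.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical helper, not part of Beam. */
public class BatchedCalls {

  /**
   * Applies a blocking call to every element of a batch using at most
   * maxConcurrency threads, and returns the results in input order.
   */
  public static <I, O> List<O> callAll(
      List<I> batch, Function<I, O> blockingCall, int maxConcurrency) {
    ExecutorService pool = Executors.newFixedThreadPool(maxConcurrency);
    try {
      // "Process" phase: launch one future per element.
      List<CompletableFuture<O>> futures = new ArrayList<>();
      for (I element : batch) {
        futures.add(CompletableFuture.supplyAsync(() -> blockingCall.apply(element), pool));
      }
      // "Finish" phase: block until every call has completed.
      List<O> results = new ArrayList<>();
      for (CompletableFuture<O> future : futures) {
        results.add(future.join());
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    // String::toUpperCase stands in for a slow network call.
    List<String> out = callAll(Arrays.asList("a", "b", "c"), String::toUpperCase, 2);
    System.out.println(out); // prints [A, B, C]
  }
}
```

In a real DoFn the pool would more likely be created once and reused across
batches rather than per call; creating it inside the method just keeps the
sketch self-contained.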
