Have you considered drafting in detail what you think this API might look like?
If it's a radically different API, it might be more appropriate as an alternative, parallel Beam API rather than a replacement for the current API (there is also one such fluent API in the works).

On Sat, Mar 10, 2018 at 7:23 AM Romain Manni-Bucau <[email protected]> wrote:

> On 2018-03-10 at 16:19 GMT+01:00, Reuven Lax <[email protected]> wrote:
>
>> This is another version (maybe a better, Java 8 idiomatic one?) of what
>> Kenn suggested.
>>
>> Note that with NewDoFn this need not be incompatible (so it might not
>> require waiting until Beam 3.0). We can recognize new parameters to
>> processElement and populate them as needed.
>
> This is right, however in my head it was a one-way movement, to force the
> design to be reactive rather than faking a reactive API with a synchronous,
> non-reactive implementation, which I fear is what would happen today if
> both were supported.
>
>> On Sat, Mar 10, 2018, 12:13 PM Romain Manni-Bucau <[email protected]> wrote:
>>
>>> Yes. For the DoFn, for instance, instead of ProcessContext.element()
>>> returning T you would get a CompletionStage<T>, and output would accept
>>> one as well.
>>>
>>> This way you register an execution chain. Mixed with streams, you get a
>>> big data Java 8/9/10 API which enables any connectivity in a
>>> well-performing manner ;).
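A minimal sketch of the kind of signature being proposed here, purely for illustration - the CompletionStage-valued element parameter, AsyncOutput, and client are invented names, and no such API exists in Beam today:

    // Hypothetical API, for illustration only: the element arrives as a
    // CompletionStage and the output accepts one, so processElement merely
    // registers an async chain instead of doing blocking work on the element.
    new DoFn<Request, Response>() {
      @ProcessElement
      public void process(CompletionStage<Request> element, AsyncOutput<Response> output) {
        output.emit(element.thenCompose(request -> client.callAsync(request)));
      }
    }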
>>> On Mar 10, 2018 at 13:56, Reuven Lax <[email protected]> wrote:
>>>
>>>> So you mean the user should have a way of registering asynchronous
>>>> activity with a callback (the callback must be registered with Beam,
>>>> because Beam needs to know not to mark the element as done until all
>>>> associated callbacks have completed). I think that's basically what Kenn
>>>> was suggesting, unless I'm missing something.
>>>>
>>>> On Fri, Mar 9, 2018 at 11:07 PM Romain Manni-Bucau <[email protected]> wrote:
>>>>
>>>>> Yes, callback based. Beam today is synchronous, and until bundles and
>>>>> combines are reactive-friendly, Beam will stay synchronous whatever the
>>>>> other parts do. Becoming reactive would make it possible to manage the
>>>>> threading issues properly and to get better scalability in the overall
>>>>> execution when remote IO is involved.
>>>>>
>>>>> However, it requires breaking the source and SDF designs to use
>>>>> CompletionStage - or an equivalent - to chain the processing properly
>>>>> and in a unified fashion.
>>>>>
>>>>> On Mar 9, 2018 at 23:48, Reuven Lax <[email protected]> wrote:
>>>>>
>>>>> If you're talking about reactive programming, at a certain level Beam
>>>>> is already reactive. Are you referring to a specific way of writing
>>>>> the code?
>>>>>
>>>>> On Fri, Mar 9, 2018 at 1:59 PM Reuven Lax <[email protected]> wrote:
>>>>>
>>>>>> What do you mean by reactive?
>>>>>>
>>>>>> On Fri, Mar 9, 2018, 6:58 PM Romain Manni-Bucau <[email protected]> wrote:
>>>>>>
>>>>>>> @Kenn: why not prefer making Beam reactive? It would allow scaling
>>>>>>> much further without having to synchronize multithreading by hand.
>>>>>>> Elegant and efficient :). Beam 3?
>>>>>>>
>>>>>>> On Mar 9, 2018 at 22:49, Kenneth Knowles <[email protected]> wrote:
>>>>>>>
>>>>>>>> I will start with the "exciting futuristic" answer, which is that
>>>>>>>> we envision the new DoFn to be able to provide an automatic
>>>>>>>> ExecutorService parameter that you can use as you wish.
>>>>>>>>
>>>>>>>> new DoFn<>() {
>>>>>>>>   @ProcessElement
>>>>>>>>   public void process(ProcessContext ctx, ExecutorService executorService) {
>>>>>>>>     ... launch some futures, put them in instance vars ...
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   @FinishBundle
>>>>>>>>   public void finish(...) {
>>>>>>>>     ... block on futures, output results if appropriate ...
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> This way, the Java SDK harness can use its overarching knowledge of
>>>>>>>> what is going on in a computation to, for example, share a thread
>>>>>>>> pool between different bits. This was one reason to delete
>>>>>>>> IntraBundleParallelization - it didn't allow the runner and user
>>>>>>>> code to properly manage how many things were going on concurrently.
>>>>>>>> And mostly the runner should own parallelizing to max out cores;
>>>>>>>> what user code needs is asynchrony hooks that can interact with
>>>>>>>> that. However, this feature is not thoroughly considered. TBD how
>>>>>>>> much the harness itself manages blocking on outstanding requests
>>>>>>>> versus it being your responsibility in FinishBundle, etc.
>>>>>>>>
>>>>>>>> I haven't explored rolling your own here, if you are willing to do
>>>>>>>> the knob tuning to get the threading acceptable for your particular
>>>>>>>> use case. Perhaps someone else can weigh in.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Fri, Mar 9, 2018 at 1:38 PM Josh Ferge <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello all:
>>>>>>>>>
>>>>>>>>> Our team has a pipeline that makes external network calls. These
>>>>>>>>> pipelines are currently super slow, and the hypothesis is that
>>>>>>>>> they are slow because we are not threading our network calls. The
>>>>>>>>> GitHub pull request below provides some discussion around this:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/beam/pull/957
>>>>>>>>>
>>>>>>>>> In Beam 1.0 there was IntraBundleParallelization, which helped
>>>>>>>>> with this. However, it was removed because it didn't comply with a
>>>>>>>>> few Beam paradigms.
>>>>>>>>>
>>>>>>>>> Questions going forward:
>>>>>>>>>
>>>>>>>>> What is advised for jobs that make blocking network calls? It
>>>>>>>>> seems bundling the elements into groups of size X prior to passing
>>>>>>>>> them to the DoFn, and managing the threading within the function,
>>>>>>>>> might work. Thoughts?
>>>>>>>>> Are these types of jobs even suitable for Beam?
>>>>>>>>> Are there any plans to develop features that help with this?
>>>>>>>>>
>>>>>>>>> Thanks
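A concrete, hedged reading of the "groups of size X" idea above, using only the current Java SDK: batch elements upstream, then fan the blocking calls out on a private pool inside the DoFn and join before emitting, so every output happens on the processElement thread and the bundle cannot be marked done early. In this sketch, callService, the pool size, and the String types are placeholders rather than a recommended pattern:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;

    /** Fans blocking calls out over a batch of elements, then outputs synchronously. */
    class ParallelBatchCallFn extends DoFn<KV<String, Iterable<String>>, String> {

      private transient ExecutorService executor;

      @Setup
      public void setup() {
        // Pool size is a knob to tune against the external service's limits.
        executor = Executors.newFixedThreadPool(32);
      }

      @ProcessElement
      public void process(ProcessContext ctx) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (String element : ctx.element().getValue()) {
          futures.add(CompletableFuture.supplyAsync(() -> callService(element), executor));
        }
        // Join here so all outputs happen on the processElement thread and the
        // bundle cannot complete before the calls do.
        for (CompletableFuture<String> future : futures) {
          ctx.output(future.join());
        }
      }

      @Teardown
      public void teardown() {
        executor.shutdown();
      }

      // Placeholder for the real blocking network call.
      private String callService(String element) {
        return element;
      }
    }

The batches themselves could come from something like GroupIntoBatches.<String, String>ofSize(50) applied upstream, at the cost of a shuffle.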

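Similarly, Kenn's futures-in-instance-vars sketch can be approximated today with window injection and FinishBundleContext, both of which already exist in the Java SDK. Again a sketch under assumptions - callService and the pool size are invented, and how gracefully a given runner handles the per-bundle blocking is untested here:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.joda.time.Instant;

    /** Launches async work per element, blocks once per bundle in @FinishBundle. */
    class AsyncPerBundleFn extends DoFn<String, String> {

      // One in-flight result plus the metadata needed to emit it later.
      private static class Pending {
        final CompletableFuture<String> result;
        final Instant timestamp;
        final BoundedWindow window;

        Pending(CompletableFuture<String> result, Instant timestamp, BoundedWindow window) {
          this.result = result;
          this.timestamp = timestamp;
          this.window = window;
        }
      }

      private transient ExecutorService executor;
      private transient List<Pending> pending;

      @Setup
      public void setup() {
        executor = Executors.newFixedThreadPool(16);
      }

      @StartBundle
      public void startBundle() {
        pending = new ArrayList<>();
      }

      @ProcessElement
      public void process(ProcessContext ctx, BoundedWindow window) {
        String element = ctx.element();
        pending.add(new Pending(
            CompletableFuture.supplyAsync(() -> callService(element), executor),
            ctx.timestamp(),
            window));
      }

      @FinishBundle
      public void finishBundle(FinishBundleContext ctx) {
        // Block on every outstanding call so the bundle is not committed early.
        for (Pending p : pending) {
          ctx.output(p.result.join(), p.timestamp, p.window);
        }
      }

      @Teardown
      public void teardown() {
        executor.shutdown();
      }

      // Placeholder for the real blocking network call.
      private String callService(String element) {
        return element;
      }
    }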