Re: Limited join with stop condition

2019-10-10 Thread Reza Rokni
Hi, Agreed with the others that this does not sound like a good fit... But to explore ideas... One possible (complicated and error prone) way this could be done, ... Beam does not support cycles, but you could use an external unbounded source as a way of sending impulse out and then back into

Re: Limited join with stop condition

2019-10-10 Thread Kenneth Knowles
Interesting! I agree with Luke that it seems not a great fit for Beam in the most rigorous sense. There are many considerations: 1. We assume ParDo has side effects by default. So the model actual *requires* eager evaluation, not lazy, in order to make all the side effects happen. But for your

Re: Spring with Apache Beam

2019-10-10 Thread Luke Cwik
You shouldn't need to call it before running the pipeline as you are doing (you can if you want but its not necessary). Have you created a service META-INF entry for the JvmInitializer you have created or are using @AutoService? This is the relevant bit of the documentation[1]. Here is some good

ETL with Beam?

2019-10-10 Thread Steve973
Hello, all. I still have not been given the tasking to convert my work project to use Beam, but it is still something that I am looking to do in the fairly near future. Our data workflow consists of ingest and transformation, and I was hoping that there are ETL frameworks that work well with

Re: Joining PCollections to aggregates of themselves

2019-10-10 Thread Robert Bradshaw
Looking at the naive solution PCollectionView agg = input .apply(Windows.sliding(10mins, 1sec hops).trigger(Repeatedly.forever(AfterPane.elementCountAtLeast(1 .apply(Mean.globally()) .apply(View.asSingleton()); PCollection output = input .apply(ParDo.of(new

Re: Joining PCollections to aggregates of themselves

2019-10-10 Thread Eugene Kirpichov
" input elements can pass through the Joiner DoFn before the sideInput corresponding to that element is present" I don't think this is correct. Runners will evaluate a DoFn with side inputs on elements in a given window only after all side inputs are ready (have triggered at least once) in this

Re: Joining PCollections to aggregates of themselves

2019-10-10 Thread rahul patwari
With Stateful DoFn, each instance of DoFn will have elements which belong to the same window and have the same key. So, the parallelism is limited by [no. of keys * no. of Windows] In batch mode, as all the elements belong to the same window, i.e. Global Window, the parallelism will be limited by

Re: Limited join with stop condition

2019-10-10 Thread Luke Cwik
This doesn't seem like a good fit for Apache Beam but have you tried: * using a StatefulDoFn that performs all the joining and signals the service powering the sources to stop sending data once your criteria is met (most services powering these sources won't have a way to be controlled this way)?

Limited join with stop condition

2019-10-10 Thread Alexey Romanenko
Hello, We have a use case and it's not clear how it can be solved/implemented with Beam. I count on community help with this, maybe I miss something that lays on the surface. Let’s say, there are two different bounded sources and one join transform (say GBK) downstream. This Join transform is

Re: Joining PCollections to aggregates of themselves

2019-10-10 Thread Sam Stephens
Hi Rahul, Thanks for the response. I did consider State, but actually I was tentative because of a different requirement that I didn't specify - the same pipeline should work for batch and stream modes. I'm not sure how Stateful DoFn's behave in the batch world: can you get Beam to pass the

Re: Joining PCollections to aggregates of themselves

2019-10-10 Thread rahul patwari
Hi Sam, (Assuming all the tuples have the same key) One solution could be to use ParDo with State(to calculate mean) => For each element as they occur, calculate the Mean(store the sum and count as the state) and emit the tuple with the new average value. But it will limit the parallelism count.

Joining PCollections to aggregates of themselves

2019-10-10 Thread Sam Stephens
My team and I have been puzzling for a while how to solve a specific problem. Say you have an input stream of tuples: And you want to output a stream containing: Where the average is an aggregation over a 10 minute sliding window of the "value" field. There are a couple of extra