Re: Shuffling on apache beam

2019-05-30 Thread pasquale . bonito
Ideally my pipeline requires no shuffling, I just saw that introducing a windowing operation improves performance of BigTable insert. I don't know how to measure time spent in PubSub. I took the time when I message is published, fill the timestamp metadata and than confront that value with the t

Re: Shuffling on apache beam

2019-05-30 Thread Reuven Lax
Do you have any way of knowing how much of this time is being spent in Pub/Sub and how much in the Beam pipeline? If you are using the Dataflow runner and doing any shuffling, 100-150ms is currently not attainable. Writes to shuffle are batched for up to 100ms at a time to keep operational costs d

Re: Shuffling on apache beam

2019-05-30 Thread pasquale . bonito
I'm measuring latency as the difference between the timestamp of the column on BigTable and the one I associate to the message when I publish it to the topic. I also do intermediate measurement after the message is read from PubSub topic and before inserting into BigTable. All timestamps are wri

Re: Shuffling on apache beam

2019-05-30 Thread Reuven Lax
How are you measuring latency? On Thu, May 30, 2019 at 3:08 AM pasquale.bon...@gmail.com < pasquale.bon...@gmail.com> wrote: > This was my first option but I'm using google dataflow as runner and it's > not clear if it supports stateful DoFn. > However my problem is latency, I've been trying diff

Re: Shuffling on apache beam

2019-05-30 Thread Reza Rokni
Hi, Would you mind sharing your latency requirements? For example is it < 1 sec at XX percentile? With regards to Stateful DoFn with a few exceptions it is supported : https://beam.apache.org/documentation/runners/capability-matrix/#cap-full-what Cheers Reza On Thu, 30 May 2019 at 18:08, pa

Re: Shuffling on apache beam

2019-05-30 Thread pasquale . bonito
This was my first option but I'm using google dataflow as runner and it's not clear if it supports stateful DoFn. However my problem is latency, I've been trying different solution but it seems difficult to bring latency under 1s when consuming message (150/s )from PubSub with beam/dataflow. Is

Re: Shuffling on apache beam

2019-05-29 Thread Pablo Estrada
If you add a stateful DoFn to your pipeline, you'll force Beam to shuffle data to their corresponding worker per key. I am not sure what is the latency cost of doing this (as the messages still need to be shuffled). But it may help you accomplish this without adding windowing+triggering. -P. On W

Re: Shuffling on apache beam

2019-05-29 Thread pasquale . bonito
Hi Reza, with GlobalWindow with triggering I was able to reduce hotspot issues gaining satisfying performance for BigTable update. Unfortunately latency when getting messages from PubSub remains around 1.5s that it's too much considering our NFR. This is the code I use to get the messages: PColl

Re: Shuffling on apache beam

2019-05-24 Thread Reza Rokni
PS You can also make use of the GlobalWindow with a stateful DoFn. On Fri, 24 May 2019 at 15:13, Reza Rokni wrote: > Hi, > > Have you explored the use of triggers with your use case? > > > https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/transforms/windowing/Trigger.html > > C

Re: Shuffling on apache beam

2019-05-24 Thread Reza Rokni
Hi, Have you explored the use of triggers with your use case? https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/transforms/windowing/Trigger.html Cheers Reza On Fri, 24 May 2019 at 14:14, pasquale.bon...@gmail.com < pasquale.bon...@gmail.com> wrote: > Hi Reuven, > I would li

Re: Shuffling on apache beam

2019-05-23 Thread pasquale . bonito
Hi Reuven, I would like to know if is possible to guarantee that record are processed by the same thread/task based on a key, as probably happens in a combine/stateful operation, without adding the delay of a windows. This could increase efficiency of caching and reduce same racing condition when

Re: Shuffling on apache beam

2019-05-23 Thread Reuven Lax
Can you explain what you mean by worker? While every runner has workers of course, workers are not part of the programming model. On Thu, May 23, 2019 at 8:13 PM pasquale.bon...@gmail.com < pasquale.bon...@gmail.com> wrote: > Hi all, > I would like to know if Apache Beam has a functionality simil

Shuffling on apache beam

2019-05-23 Thread pasquale . bonito
Hi all, I would like to know if Apache Beam has a functionality similar to fieldsGrouping in Storm that allows to send records to a specific task/worker based on a key. I know that we can achieve that with a combine/grouByKey operation but that implies to add a windowing in our pipeline that we