Re: Sink Parallelism

2016-04-20 Thread Fabian Hueske
In batch / DataSet programs, groupBy() is executed by partitioning the data (usually hash partitioning) and sorting each partition to group all elements with the same key. keyBy() in DataStream programs also partitions the data and results in a KeyedStream. The KeyedStream has information about the
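The mechanism Fabian describes (hash-partition by key, then group within each partition) can be sketched in plain Python. This is a minimal illustration of the idea, not Flink's actual implementation or API; the function names `hash_partition` and `group_by_key` are made up for this example.

```python
from collections import defaultdict

def hash_partition(records, parallelism):
    """Assign each (key, value) record to a partition by hashing its key,
    so all records with the same key land in the same partition."""
    partitions = [[] for _ in range(parallelism)]
    for key, value in records:
        partitions[hash(key) % parallelism].append((key, value))
    return partitions

def group_by_key(partition):
    """Within a single partition, collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in partition:
        groups[key].append(value)
    return dict(groups)

records = [("a", 1), ("b", 1), ("a", 1)]
parts = hash_partition(records, parallelism=2)
# Every ("a", ...) record is in the same partition, so grouping
# within each partition sees all records for a given key.
```

The per-partition sort Flink uses in the DataSet API serves the same purpose as the dictionary here: bringing equal keys together so a grouped operation can run over them.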

Re: Sink Parallelism

2016-04-20 Thread Ravinder Kaur
Hi Fabian, Thank you for the explanation. Could you also explain how keyBy() would work? I assume it should work the same as groupBy(), but in streaming mode, since the data is unbounded, all elements that arrive in the first window are grouped/partitioned by keys and aggregated, and so on until no more

Re: Sink Parallelism

2016-04-20 Thread Fabian Hueske
Hi Ravinder, your drawing is pretty much correct (Flink will inject a combiner between flat map and reduce which locally combines records with the same key). The partitioning between flat map and reduce is done with hash partitioning by default. However, you can also define a custom partitioner
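The combiner Fabian mentions can be illustrated with a word-count-style sketch: each flat map task pre-aggregates its own output locally before records are shuffled to the reducers, so only one record per distinct key crosses the network instead of one record per occurrence. This is a hypothetical Python sketch of the concept, not Flink code; `flat_map` and `combine` are illustrative names.

```python
from collections import Counter

def flat_map(lines):
    """Emit one (word, 1) record per word, like WordCount's flatMap."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def combine(records):
    """Local (per-task) pre-aggregation: sum counts per key *before*
    the shuffle, so fewer records are sent to the reduce tasks."""
    counts = Counter()
    for word, n in records:
        counts[word] += n
    return list(counts.items())

combined = combine(flat_map(["to be or not to be"]))
# ("to", 2) is shuffled once instead of two separate ("to", 1) records.
```

After the shuffle, the reduce side applies the same summation to the pre-combined records, which is why local combining does not change the final result for an associative aggregation like sum.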

Re: Sink Parallelism

2016-04-19 Thread Chesnay Schepler
The picture you reference does not really show how dataflows are connected. For a better picture, visit this link: https://ci.apache.org/projects/flink/flink-docs-release-1.0/concepts/concepts.html#parallel-dataflows Let me know if this doesn't answer your question.

Sink Parallelism

2016-04-19 Thread Ravinder Kaur
Hello All, Considering the following streaming dataflow of the example WordCount, I want to understand how the Sink is parallelised. Source --> flatMap --> groupBy(), sum() --> Sink If I set the parallelism at runtime using -p, as shown here
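The dataflow in the question can be simulated end to end to show what a parallelism setting does: with parallelism p, the keyed aggregation and the sink each run as p parallel instances, and each sink instance receives the results for its key-partition. This is a minimal Python simulation under those assumptions, not the Flink runtime; `word_count` is an invented helper name.

```python
from collections import Counter

def word_count(lines, parallelism):
    """Simulate Source -> flatMap -> groupBy()/sum() -> Sink with
    `parallelism` parallel instances of the aggregation and sink."""
    # flatMap: one (word, 1) pair per word
    pairs = [(word, 1) for line in lines for word in line.split()]
    # hash-partition by key between flatMap and the aggregation
    partitions = [[] for _ in range(parallelism)]
    for word, one in pairs:
        partitions[hash(word) % parallelism].append((word, one))
    # each parallel instance sums its own partition and "emits" to its sink,
    # so there are `parallelism` independent sink outputs
    return [dict(Counter(w for w, _ in part)) for part in partitions]

outputs = word_count(["to be or not to be"], parallelism=2)
# len(outputs) == 2: one result set per parallel sink instance
```

Because the partitioning is by key, each word's total count appears in exactly one sink instance's output; a file sink run this way would typically produce one output file per parallel instance.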