Re: Sink Parallelism

Chesnay Schepler Tue, 19 Apr 2016 07:02:06 -0700

The picture you reference does not really show how dataflows are connected.

For a better picture, visit this link: https://ci.apache.org/projects/flink/flink-docs-release-1.0/concepts/concepts.html#parallel-dataflows


Let me know if this doesn't answer your question.

On 19.04.2016 14:22, Ravinder Kaur wrote:

Hello All,
Considering the following streaming dataflow of the WordCount example, I want to understand how the Sink is parallelised.
Source --> flatMap --> groupBy(), sum() --> Sink
If I set the parallelism at runtime using -p, as shown here: https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-taskmanager-processing-slots
I want to understand how the Sink is executed in parallel and how the global result is distributed.
As far as I understood, groupBy(0) is applied to the tuples <String, Integer> emitted from the flatMap function, which groups them by the String value, and sum(1) aggregates the Integer value to get the count.
That means the streams will be redistributed so that tuples with the same String value are sent to one taskmanager, and the Sink step should write the results to the specified path. When the Sink step is also parallelised, each taskmanager should emit a chunk. These chunks put together must be the global result.
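To make this concrete, here is a rough sketch of the redistribution I have in mind. It is plain Python, not Flink code: the names subtask_for and word_count are hypothetical, crc32 is only a stand-in for whatever key hashing Flink really uses, and it computes final counts only rather than the running updates a streaming job would emit.

```python
from collections import defaultdict
import zlib


def subtask_for(key: str, parallelism: int) -> int:
    # Deterministic hash, so the same word always maps to the same subtask.
    # Assumption: crc32 is a stand-in, not Flink's actual key hashing.
    return zlib.crc32(key.encode("utf-8")) % parallelism


def word_count(lines, parallelism):
    # One count table per parallel subtask of the grouped sum.
    counts = [defaultdict(int) for _ in range(parallelism)]
    for line in lines:
        for word in line.lower().split():  # flatMap: emit (word, 1)
            # groupBy(0) + sum(1): route by key, then add up per key.
            counts[subtask_for(word, parallelism)][word] += 1
    # Each parallel sink instance would write only its own chunk.
    return [dict(c) for c in counts]


chunks = word_count(["to be or not to be", "to see or not to see"], parallelism=3)

# Putting the chunks together gives the global result.
merged = {}
for chunk in chunks:
    for word, count in chunk.items():
        assert word not in merged  # a word never shows up in two chunks
        merged[word] = count
```

Under these assumptions every word ends up in exactly one chunk, so the merged result has no duplicates.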
But when I look at the pictorial representation, it seems that each task slot will run a copy of the streaming dataflow, perform the operations on the chunk of data it gets, and output the result. But if this is the case, the global result would contain duplicate strings and would be wrong.
Could one of you kindly clarify what exactly happens?

Kind Regards,
Ravinder Kaur
