Hi folks,

I would like to start the FLIP discussion thread about supporting local
aggregation in Flink.

In short, this feature can effectively alleviate data skew. This is the
FLIP:

https://cwiki.apache.org/confluence/display/FLINK/FLIP-44%3A+Support+Local+Aggregation+in+Flink


*Motivation* (copied from FLIP)

Currently, keyed streams are widely used to perform aggregating operations
(e.g., reduce, sum and window) on the elements that have the same key. When
executed at runtime, the elements with the same key will be sent to and
aggregated by the same task.

The performance of these aggregating operations is very sensitive to the
distribution of keys. In the cases where the distribution of keys follows a
powerful law, the performance will be significantly downgraded. More
unluckily, increasing the degree of parallelism does not help when a task
is overloaded by a single key.

Local aggregation is a widely-adopted method to reduce the performance
degraded by data skew. We can decompose the aggregating operations into two
phases. In the first phase, we aggregate the elements of the same key at
the sender side to obtain partial results. Then at the second phase, these
partial results are sent to receivers according to their keys and are
combined to obtain the final result. Since the number of partial results
received by each receiver is limited by the number of senders, the
imbalance among receivers can be reduced. Besides, by reducing the amount
of transferred data the performance can be further improved.

*More details*:

Design documentation:
https://docs.google.com/document/d/1gizbbFPVtkPZPRS8AIuH8596BmgkfEa7NRwR6n3pQes/edit?usp=sharing

Old discussion thread:
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Support-Local-Aggregation-in-Flink-td29307.html#a29308

JIRA: FLINK-12786 <https://issues.apache.org/jira/browse/FLINK-12786>

We are looking forwards to your feedback!

Best,
Vino

Reply via email to