[ https://issues.apache.org/jira/browse/FLINK-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616853#comment-16616853 ]

Jiayi Liao commented on FLINK-10348:
------------------------------------

[~twalthr][~Zentol] What do you think of this feature?

> Solve data skew when consuming data from kafka
> ----------------------------------------------
>
>                 Key: FLINK-10348
>                 URL: https://issues.apache.org/jira/browse/FLINK-10348
>             Project: Flink
>          Issue Type: New Feature
>          Components: Kafka Connector
>    Affects Versions: 1.6.0
>            Reporter: Jiayi Liao
>            Assignee: Jiayi Liao
>            Priority: Major
>
> With the KafkaConsumer, the current strategy is to send fetch requests to brokers 
> with a fixed fetch size. Assume topic x has n partitions and there is data skew 
> between them, and we need to consume topic x from the earliest offset; each fetch 
> request returns at most the configured fetch size. The problem is that when a task 
> consumes data from both "big" partitions and "small" partitions, records from the 
> "big" partitions may become late elements because the "small" partitions are 
> consumed faster.
> *Solution:*
> I think we can leverage two parameters to control this:
> 1. data.skew.check // whether to check for data skew
> 2. data.skew.check.interval // the interval between checks
> Every data.skew.check.interval, we check the latest offset of each assigned 
> partition, calculate its lag (latest offset - current offset), and then determine 
> which partitions need to slow down and redefine their fetch size accordingly 
> (a rough sketch of this check follows the quoted description below).
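
For illustration only, here is a minimal sketch of the proposed lag check written against the plain Kafka consumer API rather than Flink's internal fetcher. Since the standard KafkaConsumer cannot change the fetch size of a single partition on the fly, the sketch approximates "redefine their fetch size" with pause()/resume(); the check interval stands in for the proposed data.skew.check.interval, and the topic name, skew threshold, and bootstrap address are placeholders, not part of any existing API.

{code:java}
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SkewAwarePollLoop {

    // Placeholder values standing in for the proposed data.skew.check.interval
    // and for a skew threshold (in records); both are illustrative only.
    private static final long CHECK_INTERVAL_MS = 10_000L;
    private static final long SKEW_THRESHOLD = 10_000L;

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "skew-demo");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        long lastCheck = 0L;

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("x"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // ... process records / hand them to downstream operators ...

                long now = System.currentTimeMillis();
                if (now - lastCheck < CHECK_INTERVAL_MS) {
                    continue;
                }
                lastCheck = now;

                Set<TopicPartition> assigned = consumer.assignment();
                Map<TopicPartition, Long> endOffsets = consumer.endOffsets(assigned);

                // lag = latest offset - current offset, per partition
                Map<TopicPartition, Long> lags = new HashMap<>();
                for (TopicPartition tp : assigned) {
                    lags.put(tp, endOffsets.get(tp) - consumer.position(tp));
                }
                long maxLag = lags.values().stream().mapToLong(Long::longValue).max().orElse(0L);

                // Partitions far ahead of the most-lagging one are the "small",
                // fast-consumed partitions; pausing them until the next check
                // roughly corresponds to "slow down and redefine their fetch size".
                List<TopicPartition> nearlyCaughtUp = new ArrayList<>();
                List<TopicPartition> stillBehind = new ArrayList<>();
                for (Map.Entry<TopicPartition, Long> e : lags.entrySet()) {
                    if (maxLag - e.getValue() > SKEW_THRESHOLD) {
                        nearlyCaughtUp.add(e.getKey());
                    } else {
                        stillBehind.add(e.getKey());
                    }
                }
                consumer.pause(nearlyCaughtUp);
                consumer.resume(stillBehind);
            }
        }
    }
}
{code}

Pausing the nearly caught-up partitions lets the lagging ones catch up so that event-time progress across partitions stays more even; an actual implementation inside the Flink Kafka connector would instead adjust per-partition fetch sizes in its own fetcher, as the description proposes.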



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
