[ 
https://issues.apache.org/jira/browse/FLINK-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616933#comment-16616933
 ] 

Elias Levy commented on FLINK-10348:
------------------------------------

I would suggest the strategy employed by Kafka Streams, which makes a best-effort 
attempt to align streams by selectively fetching from the stream with the lowest 
watermark if there are messages available. 

Rather than implementing something like this within the Kafka connector 
sources, which are independent tasks in Flink, I would suggest implementing it 
within multiple-input operators. The operator can selectively process messages 
from the input stream with the lowest watermark if they are available. Back 
pressure can then take care of slowing down the higher-volume input if 
necessary. 
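A minimal, self-contained sketch of the selection policy described above, outside of any Flink API: given two buffered inputs, prefer the one with the lower watermark, falling back to the other input when the preferred one has nothing buffered. The class and field names are illustrative only, not an actual Flink operator.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class LowestWatermarkSelector {
    // Illustrative stand-in for one operator input: a record buffer
    // plus the last watermark seen on that input.
    static final class Input {
        final Deque<String> buffer = new ArrayDeque<>();
        long watermark = Long.MIN_VALUE;
    }

    final Input left = new Input();
    final Input right = new Input();

    /** Returns the next record, preferring the input with the lower watermark. */
    String poll() {
        Input lower = left.watermark <= right.watermark ? left : right;
        Input higher = (lower == left) ? right : left;
        // Best effort: if the lower-watermark input has nothing buffered,
        // take from the other input rather than stalling the pipeline.
        if (!lower.buffer.isEmpty()) {
            return lower.buffer.pollFirst();
        }
        return higher.buffer.pollFirst(); // may be null if both are empty
    }
}
```

In a real operator the fallback branch is what keeps this "best effort": the faster input is only drained when the slower one is idle, and back pressure then throttles the faster producer.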

> Solve data skew when consuming data from kafka
> ----------------------------------------------
>
>                 Key: FLINK-10348
>                 URL: https://issues.apache.org/jira/browse/FLINK-10348
>             Project: Flink
>          Issue Type: New Feature
>          Components: Kafka Connector
>    Affects Versions: 1.6.0
>            Reporter: Jiayi Liao
>            Assignee: Jiayi Liao
>            Priority: Major
>
> By using KafkaConsumer, our strategy is to send fetch requests to brokers with 
> a fixed fetch size. Assume topic x has n partitions and there is data skew 
> between partitions. If we need to consume data from topic x starting at the 
> earliest offset, each fetch request returns up to the max fetch size of data. 
> The problem is that when a task consumes data from both "big" partitions and 
> "small" partitions, the data in the "big" partitions may arrive as late 
> elements because the "small" partitions are consumed faster.
> *Solution:*
> I think we can leverage two parameters to control this.
> 1. data.skew.check // whether to check data skew
> 2. data.skew.check.interval // the interval between checks
> Every data.skew.check.interval, we check the latest offset of every 
> partition, calculate (latest offset - current offset), and then identify the 
> partitions that need to slow down and redefine their fetch size.
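The lag calculation proposed in the quoted issue could be sketched as follows. This is a hypothetical illustration, not the connector's actual API: per partition it computes lag = latest offset - current offset, then scales each partition's fetch size by its share of the largest lag, so nearly caught-up ("small") partitions fetch less and stop racing ahead in event time. The class name and the proportional scaling rule are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

class SkewAwareFetchSizer {
    final int baseFetchSize; // fetch size used for the most-lagging partition

    SkewAwareFetchSizer(int baseFetchSize) {
        this.baseFetchSize = baseFetchSize;
    }

    /** Both maps are keyed by partition id; returns a fetch size per partition. */
    Map<Integer, Integer> fetchSizes(Map<Integer, Long> latestOffsets,
                                     Map<Integer, Long> currentOffsets) {
        long maxLag = 1; // avoid division by zero when no partition lags
        for (Integer p : latestOffsets.keySet()) {
            maxLag = Math.max(maxLag, latestOffsets.get(p) - currentOffsets.get(p));
        }
        Map<Integer, Integer> sizes = new HashMap<>();
        for (Integer p : latestOffsets.keySet()) {
            long lag = latestOffsets.get(p) - currentOffsets.get(p);
            // Scale fetch size in proportion to this partition's share of the
            // largest lag; partitions that are nearly caught up fetch less.
            sizes.put(p, (int) Math.max(1, baseFetchSize * lag / maxLag));
        }
        return sizes;
    }
}
```

Run every data.skew.check.interval, this would shrink the fetch size of the "small" partitions while the "big" partitions keep the full base fetch size.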



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
