Jamie Grier created FLINK-10886:
-----------------------------------

             Summary: Event time synchronization across sources
                 Key: FLINK-10886
                 URL: https://issues.apache.org/jira/browse/FLINK-10886
             Project: Flink
          Issue Type: Improvement
          Components: Streaming Connectors
            Reporter: Jamie Grier
            Assignee: Jamie Grier


When reading from a source with many parallel partitions, especially when 
reading lots of historical data (or recovering from downtime and there is a 
backlog to read), it's quite common for there to develop an event-time skew 
across those partitions.
 
When doing event-time windowing -- or in fact any event-time driven processing 
-- the event time skew across partitions results directly in increased 
buffering in Flink and of course the corresponding state/checkpoint size growth.
 
As the event-time skew and state size grows larger this can have a major effect 
on application performance and in some cases result in a "death spiral" where 
the application performance get's worse and worse as the state size grows and 
grows.
 
So, one solution to this problem, outside of core changes in Flink itself, 
seems to be to try to coordinate sources across partitions so that they make 
progress through event time at roughly the same rate.  In fact if there is 
large skew the idea would be to slow or even stop reading from some partitions 
with newer data while first reading the partitions with older data.  Anyway, to 
do this we need to share state somehow amongst sub-tasks.
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to