[ https://issues.apache.org/jira/browse/FLINK-10886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687333#comment-16687333 ]
Jamie Grier commented on FLINK-10886:
-------------------------------------

ML discussion: [http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Sharing-state-between-subtasks-td24489.html]

> Event time synchronization across sources
> -----------------------------------------
>
>                 Key: FLINK-10886
>                 URL: https://issues.apache.org/jira/browse/FLINK-10886
>             Project: Flink
>          Issue Type: Improvement
>          Components: Streaming Connectors
>            Reporter: Jamie Grier
>            Assignee: Jamie Grier
>            Priority: Major
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When reading from a source with many parallel partitions, especially when
> reading lots of historical data (or recovering from downtime with a backlog
> to read), it's quite common for event-time skew to develop across those
> partitions.
>
> When doing event-time windowing -- or in fact any event-time driven
> processing -- the event-time skew across partitions results directly in
> increased buffering in Flink and, of course, the corresponding
> state/checkpoint size growth.
>
> As the event-time skew and state size grow, this can have a major effect on
> application performance and in some cases result in a "death spiral" where
> the application performance gets worse and worse as the state size keeps
> growing.
>
> So, one solution to this problem, outside of core changes in Flink itself,
> seems to be to coordinate sources across partitions so that they make
> progress through event time at roughly the same rate. In fact, if there is
> large skew, the idea would be to slow or even stop reading from partitions
> with newer data while first reading the partitions with older data. To do
> this we need to share state somehow amongst sub-tasks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
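
To make the coordination idea from the issue description concrete, here is a minimal in-process sketch in Java. It is not Flink API and not the design proposed in the ticket: the class `EventTimeSkewCoordinator`, its methods, and the `maxSkewMillis` threshold are all hypothetical names chosen for illustration. It also assumes every partition reports into one shared object, whereas real sub-tasks usually run in separate JVMs and would need some external mechanism to share this state, which is the open question in the linked ML discussion.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of Flink): tracks the latest event time seen
 * per partition and tells a reader whether it has run too far ahead.
 */
final class EventTimeSkewCoordinator {
    // Assumed tuning knob: how far ahead of the slowest partition a reader may get.
    private final long maxSkewMillis;
    private final Map<Integer, Long> partitionEventTimes = new ConcurrentHashMap<>();

    EventTimeSkewCoordinator(long maxSkewMillis) {
        this.maxSkewMillis = maxSkewMillis;
    }

    /** Records the highest event time observed so far for a partition. */
    void report(int partition, long eventTimeMillis) {
        partitionEventTimes.merge(partition, eventTimeMillis, Math::max);
    }

    /** Event time of the slowest partition, or Long.MIN_VALUE if none has reported. */
    long minEventTime() {
        return partitionEventTimes.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(Long.MIN_VALUE);
    }

    /** True if this partition is more than maxSkewMillis ahead of the slowest one. */
    boolean shouldPause(int partition) {
        Long current = partitionEventTimes.get(partition);
        if (current == null) {
            return false; // nothing read yet, so nothing to throttle
        }
        return current - minEventTime() > maxSkewMillis;
    }
}
```

A source reader would call `report` after emitting records and check `shouldPause` before fetching more, skipping the fetch while its partition is too far ahead. One caveat of this sketch: a partition that stops reporting (for example, an idle one) would hold every other partition back indefinitely, so a real implementation would need some notion of idleness or timeouts.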