[ https://issues.apache.org/jira/browse/FLINK-29099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Weise resolved FLINK-29099. ---------------------------------- Fix Version/s: 1.17.0 Resolution: Fixed > Deadlock for Single Subtask in Kinesis Consumer > ----------------------------------------------- > > Key: FLINK-29099 > URL: https://issues.apache.org/jira/browse/FLINK-29099 > Project: Flink > Issue Type: Bug > Components: Connectors / Kinesis > Affects Versions: 1.9.3, 1.10.3, 1.11.6, 1.12.7, 1.13.6, 1.14.5, 1.15.3 > Reporter: seth saperstein > Priority: Minor > Labels: connector, consumer, kinesis, pull-request-available > Fix For: 1.17.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > Deadlock is reached as the result of: > * max lookahead reached for local watermark > * idle state for subtask > The lookahead prevents the RecordEmitter from emitting a new record. The idle > state prevents the global watermark from being updated. > To exit this deadlock state, we need to complete the [TODO > here|https://github.com/apache/flink/blob/221d70d9930f72147422ea24b399f006ebbfb8d7/flink-connectors/flink-connector-kinesis/src/main/java/org/apache/flink/streaming/connectors/kinesis/internals/KinesisDataFetcher.java#L1268] > which updates the global watermark while the subtask is marked idle, which > will then allow us to emit a record again as the lookahead is no longer > reached. > > *Context:* > We reached this scenario at Lyft as a result of prolonged CPU throttling on > all FlinkKinesisConsumer threads for multiple minutes. > Walking through the series of events for a single subtask: > * prolonged CPU throttling occurs and no logs are seen from any > FlinkKinesisConsumer thread for up to 15 minutes > * after CPU throttling the subtask is marked idle > * the subtask has reached the lookahead for its local watermark relative to > the global watermark > * WatermarkSyncCallback indicates the subtask as idle and does not update > the global watermark > * emitQueue fills to max > * RecordEmitter cannot emit records due to the max lookahead > * Deadlock on subtask > At this point, we had not realized what had happened and processing of all > other shards/subtasks had continued for multiple days. When we finally > restarted the application, we saw the following behavior: > * global watermark recalculated after all subtasks consumed data based on > the last kinesis record sequence number > * global watermark moved back in time multiple days, to when the subtask was > first marked idle > * the single subtask processed data while all others remained idle due to > the lookahead > This would have continued until the subtask had caught up to the others and > thus the global watermark is within reach of the lookahead for other subtasks. > > *Repro:* > Too difficult to repro the exact scenario. -- This message was sent by Atlassian Jira (v8.20.10#820010)