Damien Gasparina created KAFKA-12951:
----------------------------------------
Summary: Infinite loop while restoring a GlobalKTable
Key: KAFKA-12951
URL: https://issues.apache.org/jira/browse/KAFKA-12951
Project: Kafka
Issue Type: Bug
Components: streams
Reporter: Damien Gasparina

We encountered this issue a few times in some of our Kafka Streams applications. After an unexpected restart of the application, some instances were not able to resume operating: they got stuck while trying to restore the state store of a GlobalKTable. The only way to resume operating was to manually delete their {{state.dir}}.

We observed the following timeline:
 * After the restart, the Kafka Streams application tries to restore its GlobalKTable
 * It seeks to the last checkpoint available in the {{state.dir}}: 382 ([https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L259])
 * The watermark lookup ({{endOffset}} result) returned offset 383
{code:java}
handling ListOffsetResponse response for XX. Fetched offset 383, timestamp -1{code}
 * We enter the restore loop: [https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L279]
 * We then invoke {{poll()}}, but the poll returns nothing, so we reach [https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L306] and crash (x)
{code:java}
Global task did not make progress to restore state within 300000 ms.{code}
 * The pod restarts, and we encounter the same issue until we manually delete the {{state.dir}}

Regarding the topic, by leveraging the {{DumpLogSegment}} tool, I can see:
 * {{Offset 381}} - last business message received
 * {{Offset 382}} - Txn COMMIT (last message)

I think the real culprit is that the checkpoint is {{383}} instead of being {{382}}.
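The timeline above can be sketched as a toy simulation (plain Java, all names hypothetical; the real loop lives in {{GlobalStateManagerImpl#restoreState}}): after seeking to checkpoint 382, the only entry left before end offset 383 is the transaction COMMIT marker, which a {{read_committed}} {{poll()}} filters out. A loop that only advances its offset from returned records therefore never closes the gap and eventually trips the progress deadline.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal simulation (hypothetical names) of the stalled restore: the log
// tail is offset 381 (last business record), offset 382 (txn COMMIT
// marker), end offset 383. Markers are invisible to a read_committed poll.
public class StuckRestoreSimulation {
    static final long FIRST_MARKER = 382, END_OFFSET = 383;

    // read_committed poll over [from, END_OFFSET): markers yield no records
    static List<Long> poll(long from) {
        List<Long> records = new ArrayList<>();
        for (long o = from; o < END_OFFSET; o++) {
            if (o < FIRST_MARKER) records.add(o); // business record only
        }
        return records;
    }

    /** Returns true if restore completes, false if it "times out" (the bug). */
    static boolean restore(long checkpoint, int maxEmptyPolls) {
        long offset = checkpoint;            // seek to the checkpoint
        int emptyPolls = 0;                  // stands in for the 300000 ms deadline
        while (offset < END_OFFSET) {
            List<Long> records = poll(offset);
            if (records.isEmpty()) {
                if (++emptyPolls >= maxEmptyPolls) {
                    return false;            // "Global task did not make progress..."
                }
                continue;
            }
            // offset only advances from returned records, never past markers
            offset = records.get(records.size() - 1) + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(restore(382, 3)); // prints false: stuck behind the marker
        System.out.println(restore(383, 3)); // prints true: already at end offset
    }
}
```

Under this model a checkpoint of 383 would let the loop exit immediately, which matches the observation that 383 (not 382) is the offset the checkpoint "should" correspond to.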
For information, this is a compacted topic, and just before the outage we encountered some ISR shrinking and leader changes.

While experimenting with the API, it seems that the {{consumer.position()}} call is a bit tricky: after a {{seek()}} but before a {{poll()}}, {{position()}} returns the seek position; after the {{poll()}} call, even if no data is returned, {{position()}} returns the LSO. I put together an example at [https://gist.github.com/Dabz/9aa0b4d1804397af6e7b6ad8cba82dcb].

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
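The {{position()}} behavior described above can be illustrated with a stand-in consumer (plain Java, hypothetical names; not the real {{KafkaConsumer}} from the gist): {{position()}} echoes the seek target until {{poll()}} runs, then jumps past the commit marker to the LSO even though the poll returned no records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a read_committed consumer on the topic tail
// described above: offset 381 = last business record, 382 = txn COMMIT
// marker, LSO/end offset = 383.
public class PositionAfterPoll {
    static long position;
    static final long FIRST_MARKER = 382, END_OFFSET = 383;

    static void seek(long offset) { position = offset; }

    // poll() returns business records, silently skips the marker, and
    // advances position to the LSO even when nothing is returned
    static List<Long> poll() {
        List<Long> records = new ArrayList<>();
        for (long o = position; o < END_OFFSET; o++) {
            if (o < FIRST_MARKER) records.add(o); // markers are filtered out
        }
        position = END_OFFSET; // position moves to the LSO regardless
        return records;
    }

    public static void main(String[] args) {
        seek(382);
        System.out.println(position);        // 382: still the seek position
        List<Long> records = poll();
        System.out.println(records.size());  // 0: only the marker remained
        System.out.println(position);        // 383: the LSO, despite an empty poll
    }
}
```

This is why a restore loop that consults {{position()}} after a poll would advance past the marker, while one that relies solely on record offsets stalls.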