[ https://issues.apache.org/jira/browse/KAFKA-12951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias J. Sax updated KAFKA-12951: ------------------------------------ Affects Version/s: (was: 2.7.1) (was: 2.8.0) 2.7.0 > Infinite loop while restoring a GlobalKTable > -------------------------------------------- > > Key: KAFKA-12951 > URL: https://issues.apache.org/jira/browse/KAFKA-12951 > Project: Kafka > Issue Type: Bug > Components: streams > Affects Versions: 2.7.0 > Reporter: Damien Gasparina > Assignee: Matthias J. Sax > Priority: Major > > We encountered an issue a few time in some of our Kafka Streams application. > After an unexpected restart of our applications, some instances have not > been able to resume operating. > They got stuck while trying to restore the state store of a GlobalKTable. The > only way to resume operating was to manually delete their `state.dir`. > We observed the following timeline: > * After the restart of the Kafka Streams application, it tries to restore > its GlobalKTable > * It seeks to the last checkpoint available on the {{{{state.dir}}}}: 382 > ([https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L259]) > * The watermark ({{endOffset}} results) returned the offset 383 > {code:java} > handling ListOffsetResponse response for XX. Fetched offset 383, timestamp > -1{code} > * We enter the loop: > [https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L279] > * Then we invoked the {{poll()}}, but the poll returns nothing, so we enter: > [https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L306] > and we crash (x) > {code:java} > Global task did not make progress to restore state within 300000 ms.{code} > * The POD restart, and we encounter the same issue until we manually delete > the {{state.dir}} > > Regarding the topic, by leveraging the {{DumpLogSegment}} tool, I can see: > * {{Offset 381}} - Last business message received > * {{Offset 382}} - Txn COMMIT (last message) > I think the real culprit is that the checkpoint is {{383}} instead of being > {{382}}. For information, the global topic is a *transactional topic*. > While experimenting with the API, it seems that the {{consumer.position()}} > call is a bit tricky, after a {{seek()}} and a {{poll()}}, it seems that the > {{position()}} is actually returning the seek position. After the {{poll()}} > call, even if no data is returned, the {{position()}} is returning the LSO. I > did an example on > [https://gist.github.com/Dabz/9aa0b4d1804397af6e7b6ad8cba82dcb] . -- This message was sent by Atlassian Jira (v8.3.4#803005)