[
https://issues.apache.org/jira/browse/KAFKA-12951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias J. Sax reassigned KAFKA-12951:
---------------------------------------
Assignee: Matthias J. Sax
> Infinite loop while restoring a GlobalKTable
> --------------------------------------------
>
> Key: KAFKA-12951
> URL: https://issues.apache.org/jira/browse/KAFKA-12951
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Reporter: Damien Gasparina
> Assignee: Matthias J. Sax
> Priority: Major
>
> We encountered an issue a few time in some of our Kafka Streams application.
> After an unexpected restart of our applications, some instances have not
> been able to resume operating.
> They got stuck while trying to restore the state store of a GlobalKTable. The
> only way to resume operating was to manually delete their `state.dir`.
> We observed the following timeline:
> * After the restart of the Kafka Streams application, it tries to restore
> its GlobalKTable
> * It seeks to the last checkpoint available on the {{{{state.dir}}}}: 382
> ([https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L259])
> * The watermark ({{endOffset}} results) returned the offset 383
> {code:java}
> handling ListOffsetResponse response for XX. Fetched offset 383, timestamp
> -1{code}
> * We enter the loop:
> [https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L279]
> * Then we invoked the {{poll()}}, but the poll returns nothing, so we enter:
> [https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L306]
> and we crash (x)
> {code:java}
> Global task did not make progress to restore state within 300000 ms.{code}
> * The POD restart, and we encounter the same issue until we manually delete
> the {{state.dir}}
>
> Regarding the topic, by leveraging the {{DumpLogSegment}} tool, I can see:
> * {{Offset 381}} - Last business message received
> * {{Offset 382}} - Txn COMMIT (last message)
> I think the real culprit is that the checkpoint is {{383}} instead of being
> {{382}}. For information, this is a compacted topic, and just before the
> outage, we encountered some ISR shrinking and leader changes.
> While experimenting with the API, it seems that the {{consumer.position()}}
> call is a bit tricky, after a {{seek()}} and a {{poll()}}, it seems that the
> {{position()}} is actually returning the seek position. After the {{poll()}}
> call, even if no data is returned, the {{position()}} is returning the LSO. I
> did an example on
> [https://gist.github.com/Dabz/9aa0b4d1804397af6e7b6ad8cba82dcb] .
--
This message was sent by Atlassian Jira
(v8.3.4#803005)