Damien Gasparina created KAFKA-12951:
----------------------------------------
Summary: Infinite loop while restoring a GlobalKTable
Key: KAFKA-12951
URL: https://issues.apache.org/jira/browse/KAFKA-12951
Project: Kafka
Issue Type: Bug
Components: streams
Reporter: Damien Gasparina
We have encountered this issue a few times in some of our Kafka Streams
applications. After an unexpected restart, some instances were unable to
resume operating: they got stuck while trying to restore the state store of a
GlobalKTable. The only way to recover was to manually delete their
{{state.dir}}.
We observed the following timeline:
* After the restart of the Kafka Streams application, it tries to restore its
GlobalKTable
* It seeks to the last checkpoint available on the {{state.dir}}: 382
([https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L259])
* The watermark lookup ({{endOffsets}}) returned offset 383
{code:java}
handling ListOffsetResponse response for XX. Fetched offset 383, timestamp
-1{code}
* We enter the loop:
[https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L279]
* We then invoke {{poll()}}, but it returns nothing, so we reach
[https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L306]
and crash (x)
{code:java}
Global task did not make progress to restore state within 300000 ms.{code}
* The pod restarts, and we encounter the same issue until we manually delete
the {{state.dir}}
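The restore loop in the timeline above can be sketched as follows. This is a
simplified model, not the actual {{GlobalStateManagerImpl}} code: the method
and parameter names are illustrative, and the 300000 ms deadline is replaced
by a poll count to keep the sketch deterministic.

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Simplified model of the GlobalKTable restore loop (illustrative names,
// not the actual GlobalStateManagerImpl implementation).
public class RestoreLoopSketch {

    static long restore(long checkpoint, long highWatermark,
                        Supplier<List<Long>> poll, int maxIdlePolls) {
        long offset = checkpoint;          // position after seeking to the checkpoint
        int idlePolls = 0;
        while (offset < highWatermark) {   // e.g. 382 < 383: one more record expected
            List<Long> batch = poll.get(); // offsets of the records returned
            for (long recordOffset : batch) {
                offset = recordOffset + 1; // advance past each restored record
            }
            if (batch.isEmpty() && ++idlePolls >= maxIdlePolls) {
                // In the real code this is a 300000 ms deadline, not a poll count.
                throw new IllegalStateException(
                    "Global task did not make progress to restore state");
            }
        }
        return offset;
    }

    public static void main(String[] args) {
        // Checkpoint 383 with end offset 383: nothing to restore, exits at once.
        System.out.println(restore(383L, 383L, Collections::emptyList, 3));
        // Checkpoint 382 with end offset 383, but offset 382 holds a transaction
        // marker that poll() never returns: the loop can never make progress.
        try {
            restore(382L, 383L, Collections::emptyList, 3);
        } catch (IllegalStateException e) {
            System.out.println("stuck: " + e.getMessage());
        }
    }
}
{code}

With a checkpoint of 383 the loop exits immediately; with 382 it only ever
sees empty polls and times out, which matches the crash above.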
Regarding the topic, by leveraging the {{DumpLogSegments}} tool, I can see:
* {{Offset 381}} - Last business message received
* {{Offset 382}} - Txn COMMIT (last message)
I think the real culprit is that the checkpoint should be {{383}} instead of
{{382}}. For information, this is a compacted topic, and just before the
outage we encountered some ISR shrinking and leader changes.
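The off-by-one comes down to plain offset arithmetic. A sketch (the helper
names are made up; the "end offset = last offset + 1" rule follows the Kafka
log format):

{code:java}
// Plain offset arithmetic behind the off-by-one (helper names are made up).
public class OffsetArithmetic {

    // The end offset is one past the last offset in the log, which here is
    // the Txn COMMIT marker at 382.
    static long endOffset(long lastOffsetInLog) {
        return lastOffsetInLog + 1;
    }

    // The consumer's position after consuming a record is that offset + 1;
    // markers are never delivered, so consumption stops after offset 381.
    static long checkpointAfterConsuming(long lastConsumedOffset) {
        return lastConsumedOffset + 1;
    }

    public static void main(String[] args) {
        long marker = 382L;       // Txn COMMIT (last message in the log)
        long lastBusiness = 381L; // last business message
        long end = endOffset(marker);                             // 383
        long checkpoint = checkpointAfterConsuming(lastBusiness); // 382
        // 382 < 383, but no consumable record exists at offset 382, so a
        // restore from this checkpoint can never reach the end offset.
        System.out.println("endOffset=" + end + ", checkpoint=" + checkpoint);
    }
}
{code}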
While experimenting with the API, the {{consumer.position()}} call seems a
bit tricky: after a {{seek()}} but before a {{poll()}}, {{position()}}
returns the seek position; after the {{poll()}} call, even if no data is
returned, {{position()}} returns the LSO. I put together an example at
[https://gist.github.com/Dabz/9aa0b4d1804397af6e7b6ad8cba82dcb].
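The observed behavior can be summarized with a toy stand-in for the consumer.
This is a stub modeling only what was reported above, not the real
{{KafkaConsumer}}:

{code:java}
// Toy stand-in for the consumer, modeling only the observed behavior:
// position() echoes the seek position until poll() runs; after poll(),
// even with no data returned, position() reports the LSO.
public class PositionModel {
    private final long lso;
    private long position;

    PositionModel(long lso) {
        this.lso = lso;
    }

    void seek(long offset) {
        position = offset;
    }

    // Nothing consumable remains between the seek position and the LSO
    // (only a transaction marker), so poll() returns 0 records but still
    // advances the position to the LSO.
    int poll() {
        position = lso;
        return 0;
    }

    long position() {
        return position;
    }

    public static void main(String[] args) {
        PositionModel consumer = new PositionModel(383L);
        consumer.seek(382L);
        System.out.println(consumer.position()); // 382: the seek position
        System.out.println(consumer.poll());     // 0: no data returned
        System.out.println(consumer.position()); // 383: the LSO, despite the empty poll
    }
}
{code}

If the checkpoint is written from {{position()}} before any {{poll()}} has
run, it records 382 rather than 383, which would explain the stale checkpoint
above.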
--
This message was sent by Atlassian Jira
(v8.3.4#803005)