[jira] [Assigned] (KAFKA-12951) Infinite loop while restoring a GlobalKTable

Matthias J. Sax (Jira) Tue, 15 Jun 2021 22:41:49 -0700


     [ 
https://issues.apache.org/jira/browse/KAFKA-12951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matthias J. Sax reassigned KAFKA-12951:
---------------------------------------

    Assignee: Matthias J. Sax

> Infinite loop while restoring a GlobalKTable
> --------------------------------------------
>
>                 Key: KAFKA-12951
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12951
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Damien Gasparina
>            Assignee: Matthias J. Sax
>            Priority: Major
>
> We encountered an issue a few time in some of our Kafka Streams application.
>  After an unexpected restart of our applications, some instances have not 
> been able to resume operating.
> They got stuck while trying to restore the state store of a GlobalKTable. The 
> only way to resume operating was to manually delete their `state.dir`.
> We observed the following timeline:
>  * After the restart of the Kafka Streams application, it tries to restore 
> its GlobalKTable
>  * It seeks to the last checkpoint available on the {{{{state.dir}}}}: 382 
> ([https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L259])
>  * The watermark ({{endOffset}} results) returned the offset 383 
> {code:java}
> handling ListOffsetResponse response for XX. Fetched offset 383, timestamp 
> -1{code}
>  * We enter the loop: 
> [https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L279]
>  * Then we invoked the {{poll()}}, but the poll returns nothing, so we enter: 
> [https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L306]
>  and we crash (x)
> {code:java}
> Global task did not make progress to restore state within 300000 ms.{code}
>  * The POD restart, and we encounter the same issue until we manually delete 
> the {{state.dir}}
>  
> Regarding the topic, by leveraging the {{DumpLogSegment}} tool, I can see:
>  * {{Offset 381}} - Last business message received
>  * {{Offset 382}} - Txn COMMIT (last message)
> I think the real culprit is that the checkpoint is {{383}} instead of being 
> {{382}}. For information, this is a compacted topic, and just before the 
> outage, we encountered some ISR shrinking and leader changes.
> While experimenting with the API, it seems that the {{consumer.position()}} 
> call is a bit tricky, after a {{seek()}} and a {{poll()}}, it seems that the 
> {{position()}} is actually returning the seek position. After the {{poll()}} 
> call, even if no data is returned, the {{position()}} is returning the LSO. I 
> did an example on 
> [https://gist.github.com/Dabz/9aa0b4d1804397af6e7b6ad8cba82dcb] .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Assigned] (KAFKA-12951) Infinite loop while restoring a GlobalKTable

Reply via email to