[ 
https://issues.apache.org/jira/browse/KAFKA-19853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18035150#comment-18035150
 ] 

Colt McNealy commented on KAFKA-19853:
--------------------------------------

[~lucasbru] also, in our soak tests we were dealing with 500GB+ of compressed 
data. Compression shrinks the data by roughly 80%, so that is close to 2.5TB 
of data uncompressed. We have not been able to reproduce this bug with smaller 
amounts of state.

With that setup and limited CPU resources, we could reproduce the bug 
reliably. I made a few 'hack' changes on a private branch and was able 
to make it work ~90% of the time by:
 * Flushing once every 1M records instead of 10k records in the ChangelogReader 
(`StateManagerUtil.java`) to reduce the number of L0 files
 * Calling `taskManager.commitAll()` at the beginning of the 
`StreamsPartitionAssignor#onAssignment()` method, which helped quite a bit.
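
To illustrate why the first change matters, here is a rough, self-contained sketch of a count-based flush policy. It is not the actual Kafka Streams restoration code: `IntervalFlusher`, `countFlushes`, and the batch size of 500 are hypothetical names and values made up for this example. The only assumption it relies on is that each flush of a RocksDB memtable produces a new L0 file, so fewer flushes during restoration should mean fewer L0 files for compaction to work through.

```java
public class FlushIntervalSketch {

    /** Hypothetical helper: flushes once at least `interval` records have been restored. */
    static final class IntervalFlusher {
        private final long interval;
        private final Runnable flushAction;
        private long sinceLastFlush = 0;
        private long flushCount = 0;

        IntervalFlusher(long interval, Runnable flushAction) {
            if (interval <= 0) {
                throw new IllegalArgumentException("interval must be positive");
            }
            this.interval = interval;
            this.flushAction = flushAction;
        }

        /** Called after each restored batch; flushes when the threshold is reached. */
        void onRecordsRestored(long numRecords) {
            sinceLastFlush += numRecords;
            if (sinceLastFlush >= interval) {
                flushAction.run();   // in a real store this would flush the memtable
                flushCount++;
                sinceLastFlush = 0;
            }
        }

        long flushCount() {
            return flushCount;
        }
    }

    /** Simulates restoring `totalRecords` in batches and returns how many flushes occurred. */
    static long countFlushes(long totalRecords, long batchSize, long interval) {
        IntervalFlusher flusher = new IntervalFlusher(interval, () -> { });
        long remaining = totalRecords;
        while (remaining > 0) {
            long batch = Math.min(batchSize, remaining);
            flusher.onRecordsRestored(batch);
            remaining -= batch;
        }
        return flusher.flushCount();
    }

    public static void main(String[] args) {
        long total = 10_000_000L;
        System.out.println("flush every 10k records: " + countFlushes(total, 500, 10_000));
        System.out.println("flush every 1M records:  " + countFlushes(total, 500, 1_000_000));
    }
}
```

In this toy simulation, raising the interval from 10k to 1M records cuts the number of flushes by a factor of 100 for the same amount of restored data.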

> StreamThread blocks on StateUpdater during onAssignment()
> ---------------------------------------------------------
>
>                 Key: KAFKA-19853
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19853
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.9.0
>            Reporter: Colt McNealy
>            Priority: Major
>         Attachments: image (3).png, image (4).png, image (5).png
>
>
> We've observed that the `StreamThread` blocks waiting for a `Future` from the 
> `StateUpdater` in the `StreamsPartitionAssignor#onAssignment()` method when 
> we are moving a task out of the `StateUpdater` and onto the `StreamThread`.
>  
> This can cause problems because, during restoration or with warmup replicas, 
> the `StateUpdater#runOnce()` method can take a long time (upwards of 20 
> seconds) when RocksDB stalls writes to allow compaction to keep up. In EOS 
> this blockage may cause the transaction to time out, which is a big mess. 
> This is because the `StreamThread` may have an open transaction before the 
> `StreamsPartitionAssignor#onAssignment()` method is called.
>  
> Some screenshots from the JFR below (credit to [~eduwerc]).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
