[ 
https://issues.apache.org/jira/browse/KAFKA-19853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18035148#comment-18035148
 ] 

Colt McNealy commented on KAFKA-19853:
--------------------------------------

[~lucasbru] 

Hi Lucas—the write stalls only occur on RocksDB instances being updated by the 
`StateUpdater`. We don't have any fancy RocksDB configs that would cause this.

By default (and we haven't touched these settings), RocksDB slows down writes 
once there are 20 L0 files and stops writes once there are 36. When you write 
at full speed, as fast as you can, RocksDB compaction (either with the default 
KS configs or with an optimized config that performs better) struggles to keep 
up, so the RocksDB instances under the state updater routinely stall writes. 
In fact, one thing the `ChangelogReader` does exacerbates this: it flushes the 
RocksDB store every 10k records, which means we rapidly accumulate a bunch of 
small (1MB) files in L0.
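
To make the thresholds concrete, here is a rough back-of-the-envelope model in 
Java. It is not RocksDB code: the class and method names are hypothetical, the 
compaction rate is an assumed input, and real RocksDB throttling depends on 
more than the L0 file count. Only the 20/36 triggers and the 10k-record flush 
interval come from the description above.

```java
// Simplified model of L0-based write throttling; an illustration, not RocksDB code.
final class L0StallModel {
    enum WriteState { NORMAL, SLOWED, STOPPED }

    // Default triggers cited above: writes slow at 20 L0 files and stop at 36.
    static final int SLOWDOWN_TRIGGER = 20;
    static final int STOP_TRIGGER = 36;

    // Classify the write state for a given number of L0 files.
    static WriteState classify(int l0Files) {
        if (l0Files >= STOP_TRIGGER) return WriteState.STOPPED;
        if (l0Files >= SLOWDOWN_TRIGGER) return WriteState.SLOWED;
        return WriteState.NORMAL;
    }

    // Seconds until L0 holds targetFiles files, starting from empty, if one file
    // is flushed every recordsPerFlush records and compaction removes
    // compactedFilesPerSec files per second (an assumed, constant rate).
    static double secondsUntil(int targetFiles, double recordsPerSec,
                               int recordsPerFlush, double compactedFilesPerSec) {
        double flushedFilesPerSec = recordsPerSec / recordsPerFlush;
        double netFilesPerSec = flushedFilesPerSec - compactedFilesPerSec;
        if (netFilesPerSec <= 0) {
            return Double.POSITIVE_INFINITY; // compaction keeps up; L0 never fills
        }
        return targetFiles / netFilesPerSec;
    }
}
```

For example, with a hypothetical restore rate of 200k records/s, a flush every 
10k records produces 20 files per second. If compaction clears 10 of those per 
second, `L0StallModel.secondsUntil(20, 200_000, 10_000, 10)` gives 2 seconds 
until the slowdown trigger is reached.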

As to your first question (doesn't this cause problems during normal 
processing?): it doesn't, because the rate of writing to the normal stores 
(active tasks) is much lower, and we also don't flush every 1MB or so. The rate 
of flushing files is therefore dramatically lower, which means RocksDB does not 
stall writes enough to cause problems.

Separately, it is my hope that KIP-1035 will allow us to no longer flush 
manually in the `ChangelogReader`. In my tests, when I disabled this manual 
flushing (it was a hack...don't judge), the restoration throughput improved 4x. 
In most cases, during restoration your throughput is bottlenecked by disk 
bandwidth (used up by compaction), and reducing the rate of flushing by 
allowing the RocksDB WriteBufferManager to flush whenever it needs to can 
dramatically reduce the intensity of compactions.
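
As a rough illustration of why the flush policy matters, the sketch below 
counts how many L0 files each policy would produce for the same restored data. 
The names are hypothetical. The 100-byte average record size is inferred from 
"10k records per ~1MB file", and the 16 MiB memtable size in the usage note is 
an assumed value, not a measured or confirmed setting.

```java
// Counts flushed files under two flush policies; a sketch, not Kafka or RocksDB code.
final class FlushCountEstimate {

    // Files produced when the store is flushed after a fixed number of records.
    static long filesWithRecordCountFlush(long totalRecords, long recordsPerFlush) {
        return (totalRecords + recordsPerFlush - 1) / recordsPerFlush; // ceiling division
    }

    // Files produced when a flush happens only once the memtable is full.
    static long filesWithSizeBasedFlush(long totalRecords, long avgRecordBytes,
                                        long memtableBytes) {
        long totalBytes = totalRecords * avgRecordBytes;
        return (totalBytes + memtableBytes - 1) / memtableBytes; // ceiling division
    }
}
```

Under these assumptions, restoring 10 million records yields 1,000 small files 
when flushing every 10k records, versus 60 larger files when flushing only as 
each 16 MiB memtable fills. Fewer, larger flushes mean far fewer L0 files for 
compaction to merge.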

> StreamThread blocks on StateUpdater during onAssignment()
> ---------------------------------------------------------
>
>                 Key: KAFKA-19853
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19853
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.9.0
>            Reporter: Colt McNealy
>            Priority: Major
>         Attachments: image (3).png, image (4).png, image (5).png
>
>
> We've observed that the `StreamThread` blocks waiting for a `Future` from the 
> `StateUpdater` in the `StreamsPartitionAssignor#onAssignment()` method when 
> we are moving a task out of the `StateUpdater` and onto the `StreamThread`.
>  
> This can cause problems because, during restoration or with warmup replicas, 
> the `StateUpdater#runOnce()` method can take a long time (upwards of 20 
> seconds) when RocksDB stalls writes to allow compaction to keep up. In EOS 
> this blockage may cause the transaction to time out, which is a big mess. 
> This is because the `StreamThread` may have an open transaction before the 
> `StreamsPartitionAssignor#onAssignment()` method is called.
>  
> Some screenshots from the JFR below (credit to [~eduwerc]).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
