[
https://issues.apache.org/jira/browse/KAFKA-19853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18035148#comment-18035148
]
Colt McNealy commented on KAFKA-19853:
--------------------------------------
[~lucasbru]
Hi Lucas, the write stalls only occur on RocksDB instances being updated by the
`StateUpdater`. We don't have any fancy RocksDB configs that would cause this.
With its default settings (which we haven't touched), RocksDB slows down writes
once there are 20 L0 files and stops writes once there are 36. When you write
as fast as you can, RocksDB compaction (either with the default KS configs or
with an optimized config that performs better) struggles to keep up, so the
RocksDB instances under the state updater routinely stall writes. In fact, one
thing the `ChangelogReader` does exacerbates this: it flushes the RocksDB store
every 10k records, which means we rapidly accumulate a bunch of small (roughly
1 MB) files in L0.
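To put rough numbers on that, here is a back-of-the-envelope sketch. It is not Kafka Streams or RocksDB code: the class and method names are made up for illustration, the restore rate is a hypothetical input, and it assumes the worst case where every flush adds one L0 file and compaction removes none in the meantime.

```java
// Back-of-the-envelope sketch only; not Kafka Streams or RocksDB code.
// Assumes each ChangelogReader flush adds exactly one L0 file and that
// compaction drains nothing from L0 in the meantime (worst case).
final class L0StallSketch {
    // RocksDB default L0 triggers quoted in the comment above.
    static final int SLOWDOWN_TRIGGER = 20;
    static final int STOP_TRIGGER = 36;
    // Flush interval quoted in the comment above.
    static final long RECORDS_PER_FLUSH = 10_000L;

    /** Records restored before L0 holds `trigger` files. */
    static long recordsUntil(int trigger) {
        return trigger * RECORDS_PER_FLUSH;
    }

    /** Seconds until `trigger` is reached at a hypothetical restore rate. */
    static double secondsUntil(int trigger, long recordsPerSecond) {
        return (double) recordsUntil(trigger) / recordsPerSecond;
    }

    public static void main(String[] args) {
        long rate = 100_000L; // hypothetical restore rate in records/sec
        System.out.println("records until slowdown: " + recordsUntil(SLOWDOWN_TRIGGER));
        System.out.println("records until stop:     " + recordsUntil(STOP_TRIGGER));
        System.out.println("seconds until slowdown: " + secondsUntil(SLOWDOWN_TRIGGER, rate));
        System.out.println("seconds until stop:     " + secondsUntil(STOP_TRIGGER, rate));
    }
}
```

Under those assumptions the slowdown trigger is hit after 200,000 restored records and the stop trigger after 360,000, so at any realistic restore rate compaction has only a few seconds to drain L0 before writes stall.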
As to your first question (doesn't this cause problems during normal
processing?): it doesn't, because the rate of writing to the normal stores
(active tasks) is much lower and we also don't flush every 1 MB or so. The rate
at which files are flushed to L0 is therefore dramatically lower, and RocksDB
does not stall writes often enough to matter.
Separately, it is my hope that KIP-1035 will allow us to no longer flush
manually in the `ChangelogReader`. In my tests, when I disabled this manual
flushing (it was a hack...don't judge), the restoration throughput improved 4x.
In most cases, restoration throughput is bottlenecked by disk bandwidth (which
compaction uses up), and reducing the rate of flushing by letting the RocksDB
`WriteBufferManager` flush whenever it needs to can dramatically reduce the
intensity of compactions.
> StreamThread blocks on StateUpdater during onAssignment()
> ---------------------------------------------------------
>
> Key: KAFKA-19853
> URL: https://issues.apache.org/jira/browse/KAFKA-19853
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 3.9.0
> Reporter: Colt McNealy
> Priority: Major
> Attachments: image (3).png, image (4).png, image (5).png
>
>
> We've observed that the `StreamThread` blocks waiting for a `Future` from the
> `StateUpdater` in the `StreamsPartitionAssignor#onAssignment()` method when
> we are moving a task out of the `StateUpdater` and onto the `StreamThread`.
>
> This can cause problems because, during restoration or with warmup replicas,
> the `StateUpdater#runOnce()` method can take a long time (upwards of 20
> seconds) when RocksDB stalls writes to allow compaction to keep up. In EOS
> this blockage may cause the transaction to time out, which is a big mess.
> This is because the `StreamThread` may have an open transaction before the
> `StreamsPartitionAssignor#onAssignment()` method is called.
>
> Some screenshots from the JFR below (credit to [~eduwerc]).