[ https://issues.apache.org/jira/browse/KAFKA-19853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18035362#comment-18035362 ]
Colt McNealy commented on KAFKA-19853:
--------------------------------------
Oh, sorry, I misunderstood your question; I didn't realize you were asking
about the code path before the StateUpdater (SU).
That's a good question. I confess that our soak tests in mid-2024 (around 3.8)
were not up to par, so we didn't run into this. However, let's look at the two
cases where we have unbridled writes into RocksDB: 1) we lose Instance X, so
Instance X's standbys get rescheduled onto Instance Y as standbys / warmups;
2) active task restoration.
In the old code path, for Case 1, the writes into RocksDB would occur in the
main processing loop and as such may have been slowed down by the processing
logic (which has to do deserialization / serialization for the store, etc). I'm
not sure if that would be enough, though.
For Case 2, there was a ticket describing a similar, though not identical,
problem, in which a transaction wasn't closed:
https://issues.apache.org/jira/browse/KAFKA-13295
I do like your idea of committing any open transactions at the start of a
rebalance, so long as it's possible.
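To make the commit-before-blocking idea concrete, here is a minimal toy sketch.
It is not Kafka Streams code: `RebalanceCommitSketch`, `ToyTransaction`, and
this `onAssignment` helper are hypothetical names, and the timeout and stall
values are arbitrary. It only models a thread that holds an open transaction
while it blocks on a future that completes after a simulated stall.

```java
import java.util.concurrent.CompletableFuture;

/** Toy model only: none of these names are Kafka Streams APIs. */
final class RebalanceCommitSketch {

    /** Hypothetical stand-in for an EOS transaction with a timeout. */
    static final class ToyTransaction {
        private final long timeoutMs;
        private long openedAtMs = -1; // -1 means no transaction is open

        ToyTransaction(long timeoutMs) { this.timeoutMs = timeoutMs; }

        void begin(long nowMs) { openedAtMs = nowMs; }
        void commit() { openedAtMs = -1; }
        boolean isOpen() { return openedAtMs >= 0; }

        /** True if a transaction is open and has been open longer than the timeout. */
        boolean timedOut(long nowMs) {
            return isOpen() && nowMs - openedAtMs > timeoutMs;
        }
    }

    /** Monotonic clock in milliseconds. */
    static long nowMs() { return System.nanoTime() / 1_000_000L; }

    private static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Simulates the assignment callback: optionally commit the open
     * transaction, then block until the simulated handoff (a sleep of
     * stallMs on another thread) completes. Returns whether the
     * transaction has timed out by the time the wait is over.
     */
    static boolean onAssignment(ToyTransaction txn, long stallMs, boolean commitFirst) {
        if (commitFirst && txn.isOpen()) {
            txn.commit(); // nothing is left open while we wait
        }
        CompletableFuture<Void> handoff =
                CompletableFuture.runAsync(() -> sleepQuietly(stallMs));
        handoff.join(); // the calling thread blocks here for the whole stall
        return txn.timedOut(nowMs());
    }

    public static void main(String[] args) {
        ToyTransaction blocked = new ToyTransaction(50);
        blocked.begin(nowMs());
        System.out.println("block with open txn, timed out: "
                + onAssignment(blocked, 200, false));

        ToyTransaction committed = new ToyTransaction(50);
        committed.begin(nowMs());
        System.out.println("commit first, timed out: "
                + onAssignment(committed, 200, true));
    }
}
```

In this toy model the first call reports a timeout because the stall (200 ms)
exceeds the transaction timeout (50 ms), while the second does not because
nothing is open during the wait. Whether a commit is actually possible at that
point in a real rebalance is exactly the open question above.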
> StreamThread blocks on StateUpdater during onAssignment()
> ---------------------------------------------------------
>
> Key: KAFKA-19853
> URL: https://issues.apache.org/jira/browse/KAFKA-19853
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 3.9.0
> Reporter: Colt McNealy
> Priority: Major
> Attachments: image (3).png, image (4).png, image (5).png
>
>
> We've observed that the `StreamThread` blocks waiting for a `Future` from the
> `StateUpdater` in the `StreamsPartitionAssigner#onAssignment()` method when
> we are moving a task out of the `StateUpdater` and onto the `StreamThread`.
>
> This can cause problems because, during restoration or with warmup replicas,
> the `StateUpdater#runOnce()` method can take a long time (upwards of 20
> seconds) when RocksDB stalls writes to allow compaction to keep up. In EOS
> this blockage may cause the transaction to time out, which is a big mess.
> This is because the `StreamThread` may have an open transaction before the
> `StreamsPartitionAssignor#onAssignment()` method is called.
>
> Some screenshots from the JFR below (credit to [~eduwerc]).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)