[ https://issues.apache.org/jira/browse/KAFKA-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487148#comment-17487148 ]
Bruno Cadonna commented on KAFKA-13600:
---------------------------------------

I am fine with discussing the improvement on the PR and not in a KIP. I actually realized that the other improvements to the assignment algorithm included changes to the public API, and that is why a KIP was needed for them. For me it is just important that we look really carefully at the improvements, because the assignment algorithm is quite a critical part of the system. Additionally, I did not want to discuss a totally new assignment algorithm; I just linked the information for general interest. Looking forward to the PR.

> Rebalances while streams is in degraded state can cause stores to be
> reassigned and restore from scratch
> --------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-13600
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13600
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.1.0, 2.8.1, 3.0.0
>            Reporter: Tim Patterson
>            Priority: Major
>
> Consider this scenario:
> # A node is lost from the cluster.
> # A rebalance is kicked off with a new "target assignment" (i.e. the
> rebalance is attempting to move a lot of tasks; see
> https://issues.apache.org/jira/browse/KAFKA-10121).
> # The Kafka cluster is now a bit more sluggish from the increased load.
> # A rolling deploy happens, triggering rebalances. During the rebalances,
> processing continues but offsets can't be committed (or nodes are restarted
> but fail to commit offsets).
> # The most caught up nodes now aren't within `acceptableRecoveryLag`, so the
> task is started in its "target assignment" location, restoring all state
> from scratch and delaying further processing instead of using the "almost
> caught up" node.
> We've hit this a few times, and since we have a lot of state (~25 TB worth)
> and are heavy users of IQ, this is not ideal for us.
> While we can increase `acceptableRecoveryLag` to larger values to try to get
> around this, that causes other issues (i.e. a warmup becoming active while
> it's still quite far behind).
> The solution seems to be to balance "balanced assignment" against "most
> caught up nodes".
> We've got a fork where we do just this, and it has made a huge difference to
> the reliability of our cluster.
> Our change is simply to use the most caught up node if the "target node" is
> more than `acceptableRecoveryLag` behind.
> This gives up some of the load-balancing behaviour of the existing code, but
> in practice it doesn't seem to matter too much.
> An algorithm that identified candidate nodes as those within
> `acceptableRecoveryLag` of the most caught up node might allow the best of
> both worlds.
>
> Our fork is
> [https://github.com/apache/kafka/compare/trunk...tim-patterson:fix_balance_uncaughtup?expand=1]
> (We also moved the capacity-constraint code to run after all the stateful
> assignment, to prioritise standby tasks over warmup tasks.)
> Ideally we don't want to maintain a fork of Kafka Streams going forward, so
> we are hoping to get some discussion / agreement on the best way to handle
> this.
> More than happy to contribute code, test different algorithms in a production
> system, or do anything else to help with this issue.
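For context on the `acceptableRecoveryLag` knob discussed above: in the public Streams API it surfaces as the `acceptable.recovery.lag` config (`StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG`, default 10000). A minimal sketch of the workaround the description mentions (raising it), with placeholder application id and bootstrap servers:

{code:java}
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class RecoveryLagConfigSketch {
    // Minimal sketch of the workaround mentioned in the description:
    // raising acceptable.recovery.lag so more nodes count as "caught up".
    // Application id and bootstrap servers are placeholders.
    public static Properties streamsProps() {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Maximum offset lag at which a client is still considered caught up
        // enough to be assigned the active task (default 10000). Raising it
        // avoids from-scratch restores but, as noted above, can hand the
        // active task to a warmup that is still quite far behind.
        props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 100_000L);
        return props;
    }
}
{code}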
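The fork's core rule quoted above ("use the most caught up node if the 'target node' is more than `acceptableRecoveryLag` behind") could look roughly like the following sketch. This is illustrative only; `chooseNode` and its inputs are hypothetical names, not the actual assignor code in the fork:

{code:java}
import java.util.Map;

public class MostCaughtUpFallback {
    /**
     * Hypothetical helper: keep the balanced "target" node for a stateful
     * task when it is within acceptableRecoveryLag, otherwise fall back to
     * the most caught-up node so the task avoids restoring from scratch.
     *
     * @param lagByNode offset lag of this task's state store per node
     *                  (smaller = more caught up; absent = no local state)
     */
    static String chooseNode(final Map<String, Long> lagByNode,
                             final String targetNode,
                             final long acceptableRecoveryLag) {
        final long targetLag = lagByNode.getOrDefault(targetNode, Long.MAX_VALUE);
        if (targetLag <= acceptableRecoveryLag) {
            return targetNode; // the balanced choice is caught up enough
        }
        // Give up some balance: pick the node with the least restore work.
        return lagByNode.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(targetNode);
    }
}
{code}

The trade-off matches what the description reports: slightly worse balance in exchange for never scheduling a from-scratch restore while an almost-caught-up copy exists.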
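And the "best of both worlds" idea at the end of the description, where candidates are all nodes within `acceptableRecoveryLag` of the most caught up node and the existing balancer chooses among them, might be sketched like this (again hypothetical names, not the real assignor API):

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class CaughtUpCandidates {
    /**
     * Hypothetical helper: every node whose lag is within
     * acceptableRecoveryLag of the MOST caught-up node qualifies as a host;
     * the balancing step can then pick among candidates for load spread.
     */
    static Set<String> candidates(final Map<String, Long> lagByNode,
                                  final long acceptableRecoveryLag) {
        final long minLag = lagByNode.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(0L);
        return lagByNode.entrySet().stream()
                .filter(e -> e.getValue() - minLag <= acceptableRecoveryLag)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }
}
{code}

This makes the threshold relative rather than absolute: when every node is far behind, the least-behind nodes still qualify (avoiding the restore), and when many nodes are caught up, the balancer keeps its freedom.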