[
https://issues.apache.org/jira/browse/KAFKA-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485343#comment-17485343
]
Bruno Cadonna commented on KAFKA-13600:
---------------------------------------
[~tim.patterson] Coming up with a perfect solution to such an optimization
problem is not easy. If you think your solution is a good trade-off and you
have good arguments, then I think it is worth discussing in a KIP.
I am not sure that changing the definition of caught-up as you proposed is
that straightforward. For example, in the case where all clients have no or
little state, all of them would be considered caught-up and the balance might
suffer. What I want to say is that it is probably not just a change of a
definition.
If you are interested: in the past I looked at the [balanced linear assignment
problem|https://en.wikipedia.org/wiki/Assignment_problem#Balanced_assignment].
I considered the [Hungarian
algorithm|https://en.wikipedia.org/wiki/Hungarian_algorithm] as a solution,
more specifically I looked at [one of the most popular
variants|https://link.springer.com/article/10.1007%2FBF02278710]. But I did
not have time to go deep enough into it.
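To make the problem shape concrete, here is a minimal brute-force sketch of the
balanced assignment formulation: clients are rows, tasks are columns, and each
cell is the restore lag a client would incur for a task; the goal is the
assignment with the smallest total lag. The class name and lag numbers are
invented for illustration, and the exhaustive search is only there for clarity
(the Hungarian algorithm solves the same cost matrix in polynomial time); this
is not the actual Streams assignor code.
{code:java}
// Toy illustration of the balanced assignment problem: cost[i][j] is the
// restore lag client i would incur if assigned task j. We search for the
// assignment that minimises the total lag.
public class AssignmentSketch {

    static long bestCost = Long.MAX_VALUE;
    static int[] bestAssignment;

    public static void main(String[] args) {
        // Hypothetical lags: client 0 is caught up on task 0, client 1 on task 1, etc.
        long[][] cost = {
            {0L,     5_000L, 9_000L},
            {7_000L, 100L,   8_000L},
            {6_000L, 4_000L, 0L}
        };
        permute(cost, new int[cost.length], new boolean[cost.length], 0, 0L);
        System.out.println("total lag = " + bestCost);
        for (int client = 0; client < bestAssignment.length; client++) {
            System.out.println("client " + client + " -> task " + bestAssignment[client]);
        }
    }

    // Try every task permutation, keeping the one with the smallest total lag.
    static void permute(long[][] cost, int[] assignment, boolean[] used, int client, long costSoFar) {
        if (client == cost.length) {
            if (costSoFar < bestCost) {
                bestCost = costSoFar;
                bestAssignment = assignment.clone();
            }
            return;
        }
        for (int task = 0; task < cost.length; task++) {
            if (!used[task]) {
                used[task] = true;
                assignment[client] = task;
                permute(cost, assignment, used, client + 1, costSoFar + cost[client][task]);
                used[task] = false;
            }
        }
    }
}
{code}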
> Rebalances while streams is in degraded state can cause stores to be
> reassigned and restore from scratch
> --------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-13600
> URL: https://issues.apache.org/jira/browse/KAFKA-13600
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 2.8.1, 3.0.0
> Reporter: Tim Patterson
> Priority: Major
>
> Consider this scenario:
> # A node is lost from the cluster.
> # A rebalance is kicked off with a new "target assignment" (i.e. the
> rebalance is attempting to move a lot of tasks - see
> https://issues.apache.org/jira/browse/KAFKA-10121).
> # The Kafka cluster is now a bit more sluggish from the increased load.
> # A rolling deploy happens, triggering rebalances. During the rebalance,
> processing continues but offsets can't be committed (or nodes are restarted
> but fail to commit offsets).
> # The most caught up nodes now aren't within `acceptableRecoveryLag`, so the
> task is started in its "target assignment" location, restoring all state
> from scratch and delaying further processing instead of using the "almost
> caught up" node.
> We've hit this a few times, and since we have lots of state (~25 TB worth)
> and are heavy users of IQ, this is not ideal for us.
> While we can increase `acceptableRecoveryLag` to larger values to try to get
> around this, that causes other issues (e.g. a warmup becoming active when
> it's still quite far behind).
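> For context, the knob referred to here is the Streams config
> `acceptable.recovery.lag` (`StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG`,
> default 10000 records). A minimal example of raising it - the application id,
> bootstrap servers and the value itself are placeholders, not recommendations:
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.streams.StreamsConfig;
>
> public class RecoveryLagConfigExample {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         // Placeholders only - substitute the real application id and brokers.
>         props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");
>         props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
>         // Default is 10000 records; a larger value widens what counts as
>         // "caught up", with the downside described above.
>         props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 1_000_000L);
>     }
> }
> {code}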
> The solution seems to be to balance "balanced assignment" with "most caught
> up nodes".
> We've got a fork where we do just this and it's made a huge difference to the
> reliability of our cluster.
> Our change is to simply use the most caught up node if the "target node" is
> more than `acceptableRecoveryLag` behind.
> This gives up some of the load balancing type behaviour of the existing code
> but in practice doesn't seem to matter too much.
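> In rough Java terms the rule amounts to something like the sketch below. It
> is not the actual fork code; the class and method names are invented, and the
> lag map stands in for whatever per-client lag information the assignor has
> for the task:
> {code:java}
> import java.util.Map;
>
> // Sketch only (not the actual fork code): keep the balanced ("target") choice
> // when it is caught up enough, otherwise fall back to the most caught-up client.
> public class MostCaughtUpRule {
>
>     // lagByClient: restore lag of one task on each client (Long.MAX_VALUE = no local state).
>     static String chooseClient(Map<String, Long> lagByClient, String targetClient, long acceptableRecoveryLag) {
>         long targetLag = lagByClient.getOrDefault(targetClient, Long.MAX_VALUE);
>         if (targetLag <= acceptableRecoveryLag) {
>             return targetClient; // target is within acceptableRecoveryLag, keep the balanced choice
>         }
>         return lagByClient.entrySet().stream()
>             .min(Map.Entry.comparingByValue())
>             .map(Map.Entry::getKey)
>             .orElse(targetClient);
>     }
> }
> {code}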
> Perhaps an algorithm that identified candidate nodes as those within
> `acceptableRecoveryLag` of the most caught up node would allow the best of
> both worlds.
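> Again only a sketch with invented names, assuming the assignor has a
> per-client lag map for the task: filter down to the clients within
> `acceptableRecoveryLag` of the most caught-up one, then let the existing load
> balancing choose among them.
> {code:java}
> import java.util.Map;
> import java.util.Set;
> import java.util.stream.Collectors;
>
> // Sketch with invented names: every client within acceptableRecoveryLag of the
> // most caught-up client is a candidate; normal load balancing picks among them.
> public class CandidateClients {
>
>     static Set<String> candidates(Map<String, Long> lagByClient, long acceptableRecoveryLag) {
>         long minLag = lagByClient.values().stream().mapToLong(Long::longValue).min().orElse(0L);
>         return lagByClient.entrySet().stream()
>             .filter(e -> e.getValue() - minLag <= acceptableRecoveryLag)
>             .map(Map.Entry::getKey)
>             .collect(Collectors.toSet());
>     }
> }
> {code}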
>
> Our fork is
> [https://github.com/apache/kafka/compare/trunk...tim-patterson:fix_balance_uncaughtup?expand=1]
> (We also moved the capacity constraint code to run after all of the stateful
> assignment, to prioritise standby tasks over warmup tasks.)
> Ideally we don't want to maintain a fork of Kafka Streams going forward, so
> we're hoping to get some discussion / agreement on the best way to handle
> this.
> More than happy to contribute code, test different algorithms in our
> production system, or do anything else to help with this issue.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)