[ https://issues.apache.org/jira/browse/KAFKA-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480081#comment-17480081 ]

Bruno Cadonna commented on KAFKA-13600:
---------------------------------------

[~tim.patterson] I checked the assignor code and I agree with you that we only 
distinguish between caught-up and not caught-up. IIRC, we decided to go that 
way because considering task load and the rank of not-caught-up clients turned 
out to be more complicated than we wanted to make the assignment algorithm at 
the time.
If you have found a good trade-off between those dimensions, the best thing 
would be to write up your proposal in a KIP. Although this change would not 
affect the public API, we usually discuss changes to the assignment algorithm 
in KIPs because they imply important behavioral changes. 
[KIP-441|https://cwiki.apache.org/confluence/x/0i4lBg] and 
[KIP-708|https://cwiki.apache.org/confluence/x/UQ5RCg] are examples of such 
KIPs.

> Rebalances while streams is in degraded state can cause stores to be 
> reassigned and restore from scratch
> --------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-13600
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13600
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.8.1, 3.0.0
>            Reporter: Tim Patterson
>            Priority: Major
>
> Consider this scenario:
>  # A node is lost from the cluster.
>  # A rebalance is kicked off with a new "target assignment" (i.e. the 
> rebalance is attempting to move a lot of tasks - see 
> https://issues.apache.org/jira/browse/KAFKA-10121).
>  # The Kafka cluster is now a bit more sluggish from the increased load.
>  # A rolling deploy happens, triggering rebalances; during the rebalance, 
> processing continues but offsets can't be committed (or nodes are restarted 
> but fail to commit offsets).
>  # The most caught-up nodes now aren't within `acceptableRecoveryLag`, so 
> the task is started in its "target assignment" location, restoring all state 
> from scratch and delaying further processing instead of using the "almost 
> caught up" node.
> We've hit this a few times, and since we have lots of state (~25TB worth) 
> and are heavy users of IQ, this is not ideal for us.
> While we can increase `acceptableRecoveryLag` to larger values to try to get 
> around this, that causes other issues (i.e. a warmup becoming active when 
> it's still quite far behind).
> The solution seems to be to balance "balanced assignment" with "most caught 
> up nodes".
> We've got a fork where we do just this and it's made a huge difference to the 
> reliability of our cluster.
> Our change is to simply use the most caught up node if the "target node" is 
> more than `acceptableRecoveryLag` behind.
> This gives up some of the load-balancing behaviour of the existing code but 
> in practice doesn't seem to matter too much.
> Perhaps an algorithm that identified candidate nodes as those within 
> `acceptableRecoveryLag` of the most caught-up node would allow the best of 
> both worlds.
>  
> Our fork is
> [https://github.com/apache/kafka/compare/trunk...tim-patterson:fix_balance_uncaughtup?expand=1]
> (We also moved the capacity constraint code to happen after all the stateful 
> assignment to prioritise standby tasks over warmup tasks)
> Ideally we don't want to maintain a fork of kafka streams going forward so 
> are hoping to get a bit of discussion / agreement on the best way to handle 
> this.
> More than happy to contribute code, test different algorithms in a 
> production system, or do anything else to help with this issue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
