Guozhang Wang created KAFKA-7203:
------------------------------------

             Summary: Improve Streams StickyTaskAssignor
                 Key: KAFKA-7203
                 URL: https://issues.apache.org/jira/browse/KAFKA-7203
             Project: Kafka
          Issue Type: Improvement
          Components: streams
            Reporter: Guozhang Wang


This discussion was inspired while trying to fix KAFKA-7144.

Currently we are not striking a good trade-off between stickiness and workload 
balance: we honor the former more than the latter. One idea to improve on this 
is the following (minimal code sketches of each step follow the quoted 
proposal):

{code}
I'd like to propose a slightly different approach to fix KAFKA-7144 while making 
no worse a trade-off between stickiness and sub-topology balance. The key idea is 
to adjust the assignment so that each client's task distribution gets as close as 
possible to the sub-topologies' num.tasks distribution.

Here is a detailed workflow:

1. At the beginning, we first calculate, for each client C, how many tasks it 
should ideally be assigned: num.total_tasks / num.total_capacity * C_capacity, 
rounded down; call it C_a. Note that since we round this number down, the sum of 
C_a across all clients may be <= num.total_tasks, but this does not matter.

2. Then, for each client C, based on its number of previously assigned tasks 
C_p, we calculate how many tasks it should take over or give up, as C_a - C_p 
(if positive, it should take over some; otherwise it should give up some).

Note that because of the rounding down, when we calculate C_a - C_p for each 
client, we need to make sure that the total number of give-ups equals the total 
number of take-overs; some ad-hoc heuristics can be used here.

3. Then we calculate the task distribution across the sub-topologies as a 
whole. For example, if we have three sub-topologies st0, st1, and st2, where st0 
has 4 total tasks, st1 has 4 total tasks, and st2 has 8 total tasks, then the 
distribution between st0, st1, and st2 should be 1:1:2. Let's call this the 
global distribution; note that since num.tasks per sub-topology never changes, 
this distribution should NEVER change.

4. Then, for each client that should give up some tasks, we decide which tasks 
it should give up so that the remaining task distribution is proportional to the 
global distribution above.

For example, if a client previously owned 4 tasks of st0, no tasks of st1, and 2 
tasks of st2, and now needs to give up 3 tasks, it should give up 3 tasks of 
st0, so that the remaining distribution (1 of st0, 0 of st1, 2 of st2) is as 
close as possible to 1:1:2.

5. Now we have collected a list of given-up tasks, plus any tasks that do not 
have a previous active assignment (in normal operations this should not happen, 
since all tasks should have been created on day one); we migrate them to the 
clients that need to take over some, again proportionally to the global 
distribution.

For example, if a client previously owned 1 task of st0 and nothing of st1 and 
st2, and now needs to take over 3 tasks, we would try to give it 1 task of st1 
and 2 tasks of st2, so that the resulting distribution becomes 1:1:2. And we 
ONLY consider prev-standby tasks when deciding which tasks of st1 / st2 to pick 
for that client.

Now, consider the following scenarios:

a) This is a clean start and there is no prev-assignment at all: step 4 would 
be a no-op, and the result should still be fine.

b) A client leaves the group: no client needs to give up any tasks and all 
remaining clients may need to take over some, so step 4 is a no-op, and the list 
accumulated in step 5 contains only the tasks of the departed client.

c) A new client joins the group: all existing clients need to give up some 
tasks, and only the new client needs to take over the given-up ones. Hence step 
5 is straightforward.
{code}
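
To make the quoted proposal concrete, below are minimal Java sketches of each 
step. All class, method, and variable names are hypothetical illustrations, not 
part of the actual StickyTaskAssignor code; clients are keyed by plain strings 
and a client's capacity is assumed to be its number of stream threads.

Step 1, computing each client's ideal task count C_a:

{code}
import java.util.LinkedHashMap;
import java.util.Map;

public class IdealCounts {
    // Step 1: C_a = floor(num.total_tasks / num.total_capacity * C_capacity).
    // Multiplying before dividing keeps the integer arithmetic exact.
    static Map<String, Integer> idealCounts(final Map<String, Integer> capacities,
                                            final int totalTasks) {
        final int totalCapacity =
            capacities.values().stream().mapToInt(Integer::intValue).sum();
        final Map<String, Integer> ideal = new LinkedHashMap<>();
        for (final Map.Entry<String, Integer> client : capacities.entrySet()) {
            ideal.put(client.getKey(), totalTasks * client.getValue() / totalCapacity);
        }
        return ideal;
    }

    public static void main(final String[] args) {
        final Map<String, Integer> capacities = new LinkedHashMap<>();
        capacities.put("clientA", 2);
        capacities.put("clientB", 1);
        // 16 tasks over total capacity 3: {clientA=10, clientB=5}; the floored
        // sum (15) is <= 16, and step 2 reconciles the remainder.
        System.out.println(idealCounts(capacities, 16));
    }
}
{code}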
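
Step 2, turning C_a - C_p into balanced take-over / give-up counts. Because of 
the flooring in step 1 some tasks are left over; handing them out one at a time 
is one possible ad-hoc heuristic (an assumption here, since the proposal leaves 
the heuristic open):

{code}
import java.util.LinkedHashMap;
import java.util.Map;

public class Deltas {
    // Step 2: positive delta = tasks to take over, negative = tasks to give up.
    // After topping up the floored remainder, take-overs and give-ups balance
    // whenever every task was previously owned by a current client; any surplus
    // corresponds to orphaned tasks that step 5 migrates anyway.
    static Map<String, Integer> deltas(final Map<String, Integer> ideal,
                                       final Map<String, Integer> previous,
                                       final int totalTasks) {
        int assigned = ideal.values().stream().mapToInt(Integer::intValue).sum();
        final Map<String, Integer> target = new LinkedHashMap<>(ideal);
        for (final String client : target.keySet()) { // ad-hoc remainder top-up
            if (assigned >= totalTasks) break;
            target.merge(client, 1, Integer::sum);
            assigned++;
        }
        final Map<String, Integer> deltas = new LinkedHashMap<>();
        for (final Map.Entry<String, Integer> e : target.entrySet()) {
            deltas.put(e.getKey(), e.getValue() - previous.getOrDefault(e.getKey(), 0));
        }
        return deltas;
    }

    public static void main(final String[] args) {
        final Map<String, Integer> ideal = new LinkedHashMap<>();
        ideal.put("clientA", 10);
        ideal.put("clientB", 5);
        final Map<String, Integer> previous = new LinkedHashMap<>();
        previous.put("clientA", 16); // clientA previously owned everything
        previous.put("clientB", 0);
        System.out.println(deltas(ideal, previous, 16)); // {clientA=-5, clientB=5}
    }
}
{code}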
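
Step 3, the global distribution, which depends only on num.tasks per 
sub-topology and hence never changes:

{code}
import java.util.LinkedHashMap;
import java.util.Map;

public class GlobalDistribution {
    // Step 3: reduce per-sub-topology task counts to their smallest ratio.
    static Map<String, Integer> ratio(final Map<String, Integer> taskCounts) {
        final int gcd = taskCounts.values().stream().reduce(0, GlobalDistribution::gcd);
        final Map<String, Integer> ratio = new LinkedHashMap<>();
        taskCounts.forEach((st, n) -> ratio.put(st, n / gcd));
        return ratio;
    }

    static int gcd(final int a, final int b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    public static void main(final String[] args) {
        final Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("st0", 4);
        counts.put("st1", 4);
        counts.put("st2", 8);
        System.out.println(ratio(counts)); // {st0=1, st1=1, st2=2}
    }
}
{code}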
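
Step 4 can be implemented greedily (again an assumed heuristic, not one the 
proposal prescribes): repeatedly give up a task from whichever sub-topology is 
most over-represented relative to the global weights. For the quoted example 
with owned counts (4, 0, 2) and 3 give-ups, this yields giving up 3 tasks of 
st0:

{code}
import java.util.LinkedHashMap;
import java.util.Map;

public class GiveUps {
    // Step 4: greedily give up tasks from the sub-topology whose owned count
    // most exceeds its fair share under the global distribution.
    static Map<String, Integer> giveUps(final Map<String, Integer> owned,
                                        final Map<String, Integer> globalWeights,
                                        final int toGiveUp) {
        final Map<String, Integer> remaining = new LinkedHashMap<>(owned);
        final Map<String, Integer> givenUp = new LinkedHashMap<>();
        final int totalWeight =
            globalWeights.values().stream().mapToInt(Integer::intValue).sum();
        for (int i = 0; i < toGiveUp; i++) {
            final int remainingTotal =
                remaining.values().stream().mapToInt(Integer::intValue).sum();
            String worst = null;
            double worstExcess = Double.NEGATIVE_INFINITY;
            for (final Map.Entry<String, Integer> e : remaining.entrySet()) {
                if (e.getValue() == 0) continue; // cannot give up an unowned task
                final double fairShare = (double) remainingTotal
                    * globalWeights.get(e.getKey()) / totalWeight;
                final double excess = e.getValue() - fairShare;
                if (excess > worstExcess) {
                    worstExcess = excess;
                    worst = e.getKey();
                }
            }
            remaining.merge(worst, -1, Integer::sum);
            givenUp.merge(worst, 1, Integer::sum);
        }
        return givenUp;
    }

    public static void main(final String[] args) {
        final Map<String, Integer> owned = new LinkedHashMap<>();
        owned.put("st0", 4);
        owned.put("st1", 0);
        owned.put("st2", 2);
        final Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("st0", 1);
        weights.put("st1", 1);
        weights.put("st2", 2);
        System.out.println(giveUps(owned, weights, 3)); // {st0=3}
    }
}
{code}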
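
Step 5, migrating the pooled given-up / orphaned tasks to the taking-over 
clients: each round, pick the most under-represented sub-topology and, within 
it, prefer a task the client previously hosted as a standby. The task ids, the 
pool layout, and the standby lookup are all assumptions of this sketch:

{code}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TakeOvers {
    // Step 5: each round, pick the sub-topology with the largest deficit
    // against the global distribution; within it, prefer a prev-standby task.
    static List<String> takeOver(final Map<String, Deque<String>> pool,
                                 final Map<String, Integer> owned,
                                 final Map<String, Integer> globalWeights,
                                 final Set<String> prevStandbys,
                                 final int toTakeOver) {
        final List<String> taken = new ArrayList<>();
        final int totalWeight =
            globalWeights.values().stream().mapToInt(Integer::intValue).sum();
        for (int i = 0; i < toTakeOver; i++) {
            final int ownedTotal = owned.values().stream().mapToInt(Integer::intValue).sum();
            String best = null;
            double bestDeficit = Double.NEGATIVE_INFINITY;
            for (final String st : globalWeights.keySet()) {
                final Deque<String> candidates = pool.get(st);
                if (candidates == null || candidates.isEmpty()) continue;
                final double fairShare =
                    (double) (ownedTotal + 1) * globalWeights.get(st) / totalWeight;
                final double deficit = fairShare - owned.getOrDefault(st, 0);
                if (deficit > bestDeficit) {
                    bestDeficit = deficit;
                    best = st;
                }
            }
            if (best == null) break; // pool exhausted
            final Deque<String> candidates = pool.get(best);
            final String picked = candidates.stream()
                .filter(prevStandbys::contains).findFirst()
                .orElse(candidates.peekFirst());
            candidates.remove(picked);
            owned.merge(best, 1, Integer::sum);
            taken.add(picked);
        }
        return taken;
    }

    public static void main(final String[] args) {
        final Map<String, Deque<String>> pool = new LinkedHashMap<>();
        pool.put("st1", new ArrayDeque<>(List.of("st1-task0")));
        pool.put("st2", new ArrayDeque<>(List.of("st2-task0", "st2-task1")));
        final Map<String, Integer> owned = new LinkedHashMap<>();
        owned.put("st0", 1);
        final Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("st0", 1);
        weights.put("st1", 1);
        weights.put("st2", 2);
        // Prints [st2-task1, st1-task0, st2-task0]: the client that owned only
        // 1 task of st0 ends up at the 1:1:2 distribution, and the standby
        // st2-task1 is picked before the non-standby st2-task0.
        System.out.println(takeOver(pool, owned, weights, Set.of("st2-task1"), 3));
    }
}
{code}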



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
