[
https://issues.apache.org/jira/browse/KAFKA-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias J. Sax resolved KAFKA-7203.
------------------------------------
Resolution: Fixed
> Improve Streams StickyTaskAssignor
> ----------------------------------
>
> Key: KAFKA-7203
> URL: https://issues.apache.org/jira/browse/KAFKA-7203
> Project: Kafka
> Issue Type: Improvement
> Components: streams
> Reporter: Guozhang Wang
> Priority: Major
>
> This discussion was inspired by the attempt to fix KAFKA-7144.
> Currently we do not strike a good trade-off between stickiness and workload
> balance: we honor the former more than the latter. One idea to improve on
> this is the following:
> {code}
> I'd like to propose a slightly different approach to fix KAFKA-7144 while
> making no-worse trade-offs between stickiness and sub-topology balance. The
> key idea is to adjust the assignment so that the distribution gets as close
> as possible to the sub-topologies' num.tasks distribution.
> Here is a detailed workflow:
> 1. at the beginning, we first calculate, for each client C, how many tasks
> it should ideally be assigned, as num.total_tasks / num.total_capacity *
> C_capacity rounded down; call it C_a. Note that since we round this number
> down, the sum of C_a across all C would be <= num.total_tasks, but this
> does not matter.
> 2. then for each client C, based on its number of previously assigned tasks
> C_p, we calculate how many tasks it should take over or give up as C_a - C_p
> (if it is positive, it should take over some; otherwise it should give up
> some). Note that because of the rounding down, when we calculate C_a - C_p
> for each client, we need to make sure that the total number of give-ups and
> the total number of take-overs are equal; some ad-hoc heuristics can be used.
> 3. then we calculate the task distribution across the sub-topologies as a
> whole. For example, if we have three sub-topologies, st0, st1 and st2, and
> st0 has 4 total tasks, st1 has 4 total tasks, and st2 has 8 total tasks,
> then the distribution between st0, st1 and st2 should be 1:1:2. Let's call
> it the global distribution, and note that currently, since num.tasks per
> sub-topology never changes, this distribution should NEVER change.
> 4. then for each client that should give up some, we decide which tasks it
> should give up so that the remaining task distribution is proportional to
> the above global distribution.
> For example, if a client previously owned 4 tasks of st0, no tasks of st1,
> and 2 tasks of st2, and now it needs to give up 3 tasks, it should give up
> 3 of st0 (it owns none of st1 to give up), so that the remaining
> distribution is closer to 1:1:2.
> 5. now we've collected a list of given-up tasks plus the ones that do not
> have any prev active assignment (in normal operations this should not happen
> since all tasks should have been created since day one); we now migrate them
> to those who need to take over some, similarly proportional to the global
> distribution.
> For example, if a client previously owned 1 task of st0, and nothing of st1
> and st2, and now it needs to take over 3 tasks, we would try to give it 1
> task of st1 and 2 tasks of st2, so that the resulting distribution becomes
> 1:1:2. And we ONLY consider prev-standby tasks when we decide which one of
> st1 / st2 we should get for that client.
> Now, consider the following scenarios:
> a) this is a clean start and there is no prev-assignment at all; step 4 would
> be a no-op, and the result should still be fine.
> b) a client leaves the group; no client needs to give up and all clients may
> need to take over some, so step 4 is a no-op, and the list accumulated for
> step 5 only contains the tasks of the departed client.
> c) a new client joins the group; all existing clients need to give up some,
> and only the new client needs to take over all the given-up ones. Hence
> step 5 is straightforward.
> {code}
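The workflow quoted above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the StickyTaskAssignor
code: the function names, the client names A/B/C, their capacities and their
previous task counts are all hypothetical, and the greedy "largest surplus
first" rule is one possible reading of "proportional to the global
distribution". Standby preference and the availability of specific tasks in
the given-up pool are ignored.

```python
from fractions import Fraction


def ideal_counts(total_tasks, capacities):
    """Step 1: C_a = floor(total_tasks / total_capacity * C_capacity)."""
    total_capacity = sum(capacities.values())
    return {c: total_tasks * cap // total_capacity
            for c, cap in capacities.items()}


def deltas(ideal, previous):
    """Step 2: C_a - C_p; positive means take over, negative means give up."""
    return {c: ideal[c] - previous.get(c, 0) for c in ideal}


def _targets(global_tasks, final_size):
    """Per-sub-topology share of a client that ends up with final_size tasks."""
    total = sum(global_tasks.values())
    return {st: Fraction(final_size * n, total)
            for st, n in global_tasks.items()}


def give_up(owned, global_tasks, n):
    """Step 4 (hypothetical greedy rule): release n tasks, always from the
    sub-topology with the largest surplus over its target share.
    Assumes n <= total number of owned tasks."""
    owned = {st: owned.get(st, 0) for st in global_tasks}
    target = _targets(global_tasks, sum(owned.values()) - n)
    released = {st: 0 for st in global_tasks}
    for _ in range(n):
        candidates = [st for st in owned if owned[st] > 0]
        st = max(candidates, key=lambda s: owned[s] - target[s])
        owned[st] -= 1
        released[st] += 1
    return released


def take_over(owned, global_tasks, n):
    """Step 5 (hypothetical greedy rule): acquire n tasks, always from the
    sub-topology with the largest deficit against its target share."""
    owned = {st: owned.get(st, 0) for st in global_tasks}
    target = _targets(global_tasks, sum(owned.values()) + n)
    acquired = {st: 0 for st in global_tasks}
    for _ in range(n):
        st = min(owned, key=lambda s: owned[s] - target[s])
        owned[st] += 1
        acquired[st] += 1
    return acquired


# Step 3: the global distribution 1:1:2 from the description.
GLOBAL = {"st0": 4, "st1": 4, "st2": 8}

# Steps 1-2 with three hypothetical equal-capacity clients; C is new.
ideal = ideal_counts(16, {"A": 1, "B": 1, "C": 1})
print(ideal)                                    # {'A': 5, 'B': 5, 'C': 5}
print(deltas(ideal, {"A": 8, "B": 8, "C": 0}))  # {'A': -3, 'B': -3, 'C': 5}

# Step 5 example from the description: owns 1 of st0, takes over 3.
print(take_over({"st0": 1}, GLOBAL, 3))  # {'st0': 0, 'st1': 1, 'st2': 2}
```

In the step 1-2 output the give-ups total 6 while the take-overs total 5,
which is the rounding mismatch that step 2 says must be reconciled by some
heuristic. For the step 4 starting point (4 of st0, none of st1, 2 of st2,
give up 3), this greedy rule releases three st0 tasks, leaving 1:0:2.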