[ 
https://issues.apache.org/jira/browse/KAFKA-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555999#comment-16555999
 ] 

Matthias J. Sax commented on KAFKA-7203:
----------------------------------------

Maybe KAFKA-4969 even completely contains this ticket.

> Improve Streams StickyTaskAssingor
> ----------------------------------
>
>                 Key: KAFKA-7203
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7203
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Guozhang Wang
>            Priority: Major
>
> This is a inspired discussion while trying to fix KAFKA-7144.
> Currently we are not striking a very good trade-off sweet point between 
> stickiness and workload balance: we are honoring the former more than the 
> latter. One idea to improve on this is the following:
> {code}
> I'd like to propose a slightly different approach to fix 7114 while making 
> no-worse tradeoffs between stickiness and sub-topology balance. The key idea 
> is to try to adjust the assignment to gets the distribution as closer as to 
> the sub-topologies' num.tasks distribution.
> Here is a detailed workflow:
> 1. at the beginning, we first calculate for each client C, how many tasks 
> should it be assigned ideally, as num.total_tasks / num.total_capacity * 
> C_capacity rounded down, call it C_a. Note that since we round down this 
> number, the summing C_a across all C would be <= num.total_tasks, but this 
> does not matter.
> 2. and then for each client C, based on its num. previous assigned tasks C_p, 
> we calculate how many tasks it should take over, or give up as C_a - C_p (if 
> it is positive, it should take over some, otherwise it should give up some).
> Note that because of the round down, when we calculate the C_a - C_p for each 
> client, we need to make sure that the total number of give ups and total 
> number of take overs should be equal, some ad-hoc heuristics can be used.
> 3. then we calculate the tasks distribution across the sub-topologies as a 
> whole. For example, if we have three sub-topologies, st0 and st1, and st0 has 
> 4 total tasks, st1 has 4 total tasks, and st2 has 8 total tasks, then the 
> distribution between st0, st1 and st2 should be 1:1:2. Let's call it the 
> global distribution, and note that currently since num.tasks per sub-topology 
> never change, this distribution should NEVER change.
> 4. then for each client that should give up some, we decides which tasks it 
> should give up so that the remaining tasks distribution is proportional to 
> the above global distribution.
> For example, if a client previously own 4 tasks of st0, no tasks of st1, and 
> 2 tasks of st2, and now it needs to give up 3 tasks, I should then give up 2 
> of st0 and 1 of st1, so that the remaining distribution is closer to 1:1:2.
> 5. now we've collected a list of given-up tasks plus the ones that does not 
> have any prev active assignment (normally operations it should not happen 
> since all tasks should have been created since day one), we now migrate them 
> to those who needs to take over some, similarly proportional to the global 
> distribution.
> For example if a client previously own 1 task of st0, and nothing of st1 and 
> st2, and now it needs to take over 3 tasks, we would try to give it 1 task of 
> st1 and 2 tasks of st2, so that the resulted distribution becomes 1:1:2. And 
> we ONLY consider prev-standby tasks when we decide which one of st1 / st2 
> should we get for that client.
> Now, consider the following scenarios:
> a) this is a clean start and there is no prev-assignment at all, step 4 would 
> be a no-op; the result should still be fine.
> b) a client leaves the group, no client needs to give up and all clients may 
> need to take over some, so step 4 is no-op, and the cumulated step 5 only 
> contains the tasks of the left client.
> c) a new client joins the group, all clients need to give up some, and only 
> the new client need to take over all the given-up ones. Hence step 5 is 
> straight-forward.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to