[ 
https://issues.apache.org/jira/browse/KAFKA-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227067#comment-17227067
 ] 

A. Sophie Blee-Goldman commented on KAFKA-10678:
------------------------------------------------

Thanks for opening a separate ticket for this. There seem to be two main 
problems/unanswered questions here:

1) Why was there a rebalance at all if static membership was enabled?
2) Why did the rebalance result in a large shuffling of tasks?

For 1) it's difficult to say with only the broker-side logs, since they won't 
tell us _why_ the client triggered a new rebalance after it was bounced. Would 
it be possible to collect logs from the client covering the period immediately 
after it was bounced, when it apparently tried to trigger a rebalance?

I was discussing question 2) with [~cadonna] and it seems to be a combination 
of a few things: first, the "eventual" assignment is currently performed 
without regard to the previous placement of tasks. It just tries to distribute 
tasks as evenly as possible, using intermediate assignments and probing 
rebalances as needed. [~vvcephei] wrote up some thoughts on this in 
KAFKA-10121. We're aware of this limitation but haven't addressed it since the 
assignor is deterministic and therefore no-op group changes – such as an 
existing member being bounced – shouldn't result in a different eventual 
assignment than the stable one pre-bounce.

Unfortunately this assignment identifies clients based on the encoded 
processId, which is actually randomly generated during StreamThread startup. So 
the processId would change after a bounce, meaning different initial conditions 
for the assignor function and therefore a different final result :/
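
To make the processId point concrete, here is a minimal sketch. It is not the 
actual Streams assignor: the class name, the assign() helper, the host names and 
the round-robin strategy are all hypothetical stand-ins for "a deterministic 
assignment that ignores previous placement". It only shows that such a function 
returns the same result for the same client ids, and a different result once 
the ids change, even though the hosts and tasks are the same.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

// Hypothetical illustration only; this is NOT the Kafka Streams assignor.
class ProcessIdAssignmentSketch {

    // Deterministic and placement-unaware: sort the clients by processId,
    // then deal the tasks out round-robin in that order.
    static Map<String, List<String>> assign(Map<UUID, String> hostByProcessId,
                                            List<String> tasks) {
        List<UUID> ordered = new ArrayList<>(hostByProcessId.keySet());
        Collections.sort(ordered); // natural UUID ordering

        // Report the result per host so two runs can be compared directly.
        Map<String, List<String>> tasksByHost = new TreeMap<>();
        for (UUID processId : ordered) {
            tasksByHost.put(hostByProcessId.get(processId), new ArrayList<>());
        }
        for (int i = 0; i < tasks.size(); i++) {
            UUID owner = ordered.get(i % ordered.size());
            tasksByHost.get(hostByProcessId.get(owner)).add(tasks.get(i));
        }
        return tasksByHost;
    }

    public static void main(String[] args) {
        List<String> tasks = List.of("0_0", "0_1", "0_2", "0_3");

        // Fixed UUIDs stand in for the random processIds so the run is repeatable.
        Map<UUID, String> beforeBounce = Map.of(
                new UUID(0L, 1L), "host-A",
                new UUID(0L, 2L), "host-B");
        // Same hosts after the bounce, but the ids happen to sort the other way.
        Map<UUID, String> afterBounce = Map.of(
                new UUID(0L, 2L), "host-A",
                new UUID(0L, 1L), "host-B");

        System.out.println("before bounce: " + assign(beforeBounce, tasks));
        System.out.println("after bounce:  " + assign(afterBounce, tasks));
    }
}
```

In this toy run host-A owns 0_0 and 0_2 before the bounce and 0_1 and 0_3 
afterwards, with nothing changed except the ids. With truly random ids the 
ordering may or may not flip on any given bounce, so the shuffle is not 
guaranteed, just possible.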

I think if the shuffling of tasks weren't so bad, then a rebalance that still 
happens despite static membership would hardly be noticeable (given that the 
app can continue to actively process during a cooperative rebalance). We could 
probably improve the majority of cases just by fixing the processId thing, but 
I feel like we might as well skip that and go straight to implementing 
KAFKA-10121, which would improve it for all cases.

> Re-deploying Streams app causes rebalance and task migration
> ------------------------------------------------------------
>
>                 Key: KAFKA-10678
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10678
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.6.0, 2.6.1
>            Reporter: Bradley Peterson
>            Priority: Major
>         Attachments: after, before, broker
>
>
> Re-deploying our Streams app causes a rebalance, even when using static group 
> membership. Worse, the rebalance creates standby tasks, even when the 
> previous task assignment was balanced and stable.
> Our app is currently using Streams 2.6.1-SNAPSHOT (due to [KAFKA-10633]) but 
> we saw the same behavior in 2.6.0. The app runs on 4 EC2 instances, each with 
> 4 streams threads, and data stored on persistent EBS volumes. During a 
> redeploy, all EC2 instances are stopped, new instances are launched, and the 
> EBS volumes are attached to the new instances. We do not use interactive 
> queries. {{session.timeout.ms}} is set to 30 minutes, and the deployment 
> finishes well under that. {{num.standby.replicas}} is 0.
> h2. Expected Behavior
> Given a stable and balanced task assignment prior to deploying, we expect to 
> see the same task assignment after deploying. Even if a rebalance is 
> triggered, we do not expect to see new standby tasks.
> h2. Observed Behavior
> Attached are the "Assigned tasks to clients" log lines from before and after 
> deploying. The "before" is from over 24 hours ago; the task assignment is 
> well balanced and "Finished stable assignment of tasks, no followup 
> rebalances required." is logged. The "after" log lines show the same 
> assignment of active tasks, but some additional standby tasks. There are 
> additional log lines about adding and removing active tasks, which I don't 
> quite understand.
> I've also included logs from the broker showing the rebalance was triggered 
> for "Updating metadata".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
