C0urante commented on PR #14562:
URL: https://github.com/apache/kafka/pull/14562#issuecomment-1780284540

   Okay, I've pushed a couple new commits that:
   - Introduce the notion of a completion time for a callback stage
   - Add granularity to the callback stages for the distributed herder's tick 
thread
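   
   To make the first point concrete, here is a minimal sketch of a stage that carries a completion time. The class and method names (`StageSketch`, `complete`, `summarize`) are hypothetical and chosen for illustration; they are not necessarily what the PR introduces.

```java
// Hypothetical sketch of a callback stage with a completion time.
// Names are illustrative assumptions, not the classes added by the PR.
final class StageSketch {
    private final String description;
    private final long startedMs;
    // null while the stage is still in progress
    private volatile Long completedMs;

    StageSketch(String description, long startedMs) {
        this.description = description;
        this.startedMs = startedMs;
    }

    // Marks the stage as finished; later calls do not overwrite the first completion time.
    void complete(long nowMs) {
        if (completedMs == null) {
            completedMs = nowMs;
        }
    }

    Long completedMs() {
        return completedMs;
    }

    // Builds a message that distinguishes a finished stage from one that is still running.
    String summarize(long nowMs) {
        Long completed = completedMs;
        if (completed != null) {
            return "The last operation was " + description
                + ", which completed after " + (completed - startedMs) + " ms";
        }
        return "The operation " + description
            + " is still in progress after " + (nowMs - startedMs) + " ms";
    }
}
```

   With something like this, a timeout message can still report the most recent stage even when that stage finished shortly before the request timed out.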
   
   I know that this doesn't cover everything we discussed, but I did give the 
rest a try.
   
   I explored an approach where we defined broader tick thread stages 
(declaring them in `DistributedHerder::tick` and not in methods it invokes, and 
following a similar approach for herder requests). This turned out to be 
infeasible because of the control flow during a rebalance, where the herder 
invokes `WorkerGroupMember::poll` or `WorkerGroupMember::ensureActive`, which 
in turn can end up invoking `DistributedHerder.RebalanceListener::onRevoked`, 
which in turn can perform operations that warrant a distinct tick stage from, 
e.g., "ensuring membership in the cluster".
   
   Instead, I've gone with an approach where the tick thread stages are defined 
as narrowly as possible, and only around operations that we can reasonably 
anticipate will block. This does slightly increase the odds of a stage being 
completed when a request times out, but since the information about that stage 
isn't lost anymore, the fallout from that scenario is limited.
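   
   As a rough illustration of "as narrowly as possible", the sketch below wraps only the blocking call in a stage, using try-with-resources so the completion time is recorded when the block exits. `TickStageTracker`, `Scope`, and the injected clock are assumptions made for this example and do not mirror the actual herder code.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongSupplier;

// Hypothetical tracker for narrowly scoped tick thread stages (illustrative names only).
final class TickStageTracker {
    static final class Stage {
        final String description;
        final long startedMs;
        volatile Long completedMs; // null until the scope is closed

        Stage(String description, long startedMs) {
            this.description = description;
            this.startedMs = startedMs;
        }
    }

    // Closing the scope stamps the completion time instead of discarding the stage.
    final class Scope implements AutoCloseable {
        private final Stage stage;

        Scope(Stage stage) {
            this.stage = stage;
        }

        @Override
        public void close() {
            stage.completedMs = clock.getAsLong();
        }
    }

    private final LongSupplier clock;
    private final AtomicReference<Stage> latest = new AtomicReference<>();

    TickStageTracker(LongSupplier clock) {
        this.clock = clock;
    }

    // Starts a new stage and makes it the most recent one visible to request threads.
    Scope begin(String description) {
        Stage stage = new Stage(description, clock.getAsLong());
        latest.set(stage);
        return new Scope(stage);
    }

    Stage latest() {
        return latest.get();
    }
}
```

   A call site would then look roughly like `try (TickStageTracker.Scope ignored = tracker.begin("ensuring membership in the cluster")) { /* blocking call */ }`. If a listener invoked from inside that call begins its own stage, that stage simply becomes the latest one, which is why narrow scopes cope with the rebalance control flow better than one broad scope around the whole tick.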
   
   I also experimented with the `Supplier<Stage>` approach to reduce the 
runtime complexity of stage tracking for herder requests, but found that this 
was more difficult to unit test. Instead of being able to track the set of all 
recorded stages for a callback, we would have to manually query the `Supplier` 
after each anticipated herder stage update, which is more work and can fail to 
collect some stages if not queried at the correct time (especially if it's too 
difficult to query at a specific point in time during a call to 
`DistributedHerder::tick`). Since we both agree that performance shouldn't be a 
concern here, I hope this is acceptable.
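   
   The testing difference can be sketched as follows. This is a simplified stand-in, with stages reduced to strings and every name (`RecordingCallback`, `CurrentStageHolder`, `simulatedTick`) invented for the example; it only shows why a recorded list is easier to assert on than a polled `Supplier`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical comparison of the two tracking styles; not code from the PR.
final class StageTrackingComparison {
    // Push model: each stage update is recorded on the callback as it happens.
    static final class RecordingCallback {
        private final List<String> stages = new ArrayList<>();

        void recordStage(String stage) {
            stages.add(stage);
        }

        List<String> recordedStages() {
            return stages;
        }
    }

    // Pull model: only the stage that is current at query time is visible.
    static final class CurrentStageHolder implements Supplier<String> {
        private volatile String current;

        void set(String stage) {
            current = stage;
        }

        @Override
        public String get() {
            return current;
        }
    }

    // Stand-in for a single tick that passes through several stages.
    static void simulatedTick(RecordingCallback callback, CurrentStageHolder holder) {
        String[] stages = {
            "ensuring membership in the cluster",
            "reading to the end of the config topic",
            "starting connectors and tasks"
        };
        for (String stage : stages) {
            callback.recordStage(stage);
            holder.set(stage);
        }
    }
}
```

   After one simulated tick, the recording callback holds all three stages, while the supplier only exposes the last one. A test built on the supplier would have to query it between stage updates, from inside the tick, to observe the earlier two.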
   
   I've also confirmed that the new unit test passed in three consecutive 
Jenkins runs, so it should finally be flake-free.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
