C0urante commented on PR #14562: URL: https://github.com/apache/kafka/pull/14562#issuecomment-1780284540
Okay, I've pushed a couple new commits that:

- Introduce the notion of a completion time for a callback stage
- Add granularity to the callback stages for the distributed herder's tick thread

I know that this doesn't cover everything we discussed, but I did give the rest a try.

I explored an approach where we defined broader tick thread stages (declaring them in `DistributedHerder::tick` and not in the methods it invokes, and following a similar approach for herder requests). This turned out to be infeasible because of the control flow during a rebalance: the herder invokes `WorkerGroupMember::poll` or `WorkerGroupMember::ensureActive`, which can end up invoking `DistributedHerder.RebalanceListener::onRevoked`, which in turn can perform operations that warrant a distinct tick stage from, e.g., "ensuring membership in the cluster". Instead, I've gone with an approach where the tick thread stages are defined as narrowly as possible, and only around operations that we can reasonably anticipate will block. This does slightly increase the odds that a stage has already completed by the time a request times out, but since the information about that stage isn't lost anymore, the fallout from that scenario is limited.

I also experimented with the `Supplier<Stage>` approach to reduce the runtime complexity of stage tracking for herder requests, but found that it was more difficult to unit test. Instead of being able to track the set of all recorded stages for a callback, we would have to manually query the `Supplier` after each anticipated herder stage update, which is more work and can miss some stages if it isn't queried at the correct time (especially when it's too difficult to query at a specific point during a call to `DistributedHerder::tick`). Since we both agree that performance shouldn't be a concern here, I hope this is acceptable.

I've also verified with three consecutive Jenkins runs that the new unit test should finally be flake-free.
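For illustration, here is a minimal Java sketch of the two ideas described above: a stage that keeps a completion time, and a callback that records every stage it is handed so a test can inspect the full set afterwards. This is not the code in the PR. `Stage`, `RecordingCallback`, `runStage`, and the injected clock are all hypothetical names chosen for the example, and the real classes may look quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

public class StageTrackingSketch {

    /** Hypothetical stage: a description, a start time, and an optional completion time. */
    public static final class Stage {
        private final String description;
        private final long startedMs;
        private volatile Long completedMs; // null while the stage is still in progress

        public Stage(String description, long startedMs) {
            this.description = description;
            this.startedMs = startedMs;
        }

        public void complete(long nowMs) {
            this.completedMs = nowMs;
        }

        public boolean isCompleted() {
            return completedMs != null;
        }

        public String summarize() {
            Long completed = completedMs; // read once to avoid a race with complete()
            return completed == null
                    ? description + " (in progress since " + startedMs + ")"
                    : description + " (completed at " + completed + ")";
        }
    }

    /** Hypothetical callback that keeps every recorded stage, not just the latest one. */
    public static final class RecordingCallback {
        private final List<Stage> stages = new ArrayList<>();

        public synchronized void recordStage(Stage stage) {
            stages.add(stage);
        }

        public synchronized List<Stage> recordedStages() {
            return new ArrayList<>(stages); // defensive copy for callers such as tests
        }

        public synchronized String timeoutMessage() {
            if (stages.isEmpty()) {
                return "Request timed out";
            }
            Stage last = stages.get(stages.size() - 1);
            return "Request timed out. Last stage: " + last.summarize();
        }
    }

    /** Wraps one narrow, potentially blocking operation in its own stage. */
    public static <T> T runStage(RecordingCallback callback, String description,
                                 LongSupplier clock, Supplier<T> operation) {
        Stage stage = new Stage(description, clock.getAsLong());
        callback.recordStage(stage);
        try {
            return operation.get();
        } finally {
            // The stage is marked complete rather than discarded, so a later
            // timeout can still report it.
            stage.complete(clock.getAsLong());
        }
    }

    public static void main(String[] args) {
        RecordingCallback callback = new RecordingCallback();
        runStage(callback, "ensuring membership in the cluster",
                System::currentTimeMillis, () -> null);
        System.out.println(callback.timeoutMessage());
    }
}
```

Because the callback keeps the whole list, a test can assert on `recordedStages()` once the call returns. With a `Supplier<Stage>`, the test would instead have to poll at exactly the right moments during the call.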