junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-681361030
The following is my understand. The current PR introduces a new deadlock through the following path. path 1 hold group lock -> joinPurgatory.tryCompleteElseWatch(delayedJoin) -> watchForOperation (now delayedJoin visible through other threads) -> operation.maybeTryComplete() -> hold delayedJoin.lock path 2 delayedJoin.maybeTryComplete -> hold hold delayedJoin.lock -> tryComplete() -> hold group lock The existing code doesn't have this deadlock since (1) delayedJoin.lock is the same as the group lock held in the caller and (2) a delayed join operation is registered under the group key (so each time we check completeness for a group key, only one delayed join operation will be affected). By switching back to this code, we avoid the new deadlock. The existing code has a different deadlock issue that groupManager.storeGroup() in GroupCoordinator.onCompleteJoin may need to complete to other delayed operations and potentially hold a different group lock while already holding a group lock. This issue will be resolved if we complete those delayed operations due to groupManager.storeGroup() elsewhere without holding any locks. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org