reswqa commented on PR #21137: URL: https://github.com/apache/flink/pull/21137#issuecomment-1296821182
@XComp About your [proposal](https://github.com/reswqa/flink/pull/1): I added a very short sleep after `triggerClassLoaderLeaseRelease.await();`, and with it the deadlock reproduces 100% of the time on my own machine. In theory the deadlock could still fail to occur, but I am confident that this short sleep almost always causes `grantLeadership` to be triggered before the `running` state becomes false. Personally, I think it is acceptable to introduce a very short sleep (I used 5 ms) so that the deadlock recurs almost every time; it will not significantly increase the CI running time. Of course, if we can think of a better way to solve this problem, I would prefer that.

In addition, I found that without the sleep, running the whole test class on my machine and my colleagues' machines hardly ever reproduces the deadlock. However, when `testJobMasterServiceLeadershipRunnerCloseWhenElectionServiceGrantLeaderShip` is run on its own, the probability of a deadlock increases greatly. I don't have a deep understanding of the JUnit testing framework, so I don't know why there is such a difference. You can try running the entire test class directly to help you reproduce the problem.
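To illustrate the idea outside of Flink, here is a minimal stand-alone sketch of widening a race window with a short sleep after a latch. It is not the actual test code: only the latch name `triggerClassLoaderLeaseRelease` comes from the test. The class `RaceWindowSketch`, the method `grantSawRunning`, the `elector` thread, and the assumption that the other thread counts down the latch right before its simulated `grantLeadership` are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; none of this is Flink code.
class RaceWindowSketch {

    /**
     * Runs one round of the race and reports whether the simulated
     * grantLeadership call still observed running == true.
     */
    static boolean grantSawRunning(long sleepMillis) {
        AtomicBoolean running = new AtomicBoolean(true);
        AtomicBoolean sawRunning = new AtomicBoolean(false);
        CountDownLatch triggerClassLoaderLeaseRelease = new CountDownLatch(1);

        // Assumed stand-in for the thread that ends up calling grantLeadership.
        Thread elector = new Thread(() -> {
            triggerClassLoaderLeaseRelease.countDown();
            // Simulated grantLeadership: record the state it observes.
            sawRunning.set(running.get());
        });

        elector.start();
        try {
            triggerClassLoaderLeaseRelease.await();
            // The short sleep gives the other thread time to reach
            // its grantLeadership step before running is flipped.
            Thread.sleep(sleepMillis);
            running.set(false); // simulated close()
            elector.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sawRunning.get();
    }

    public static void main(String[] args) {
        // 5 ms mirrors the value mentioned in the comment above.
        System.out.println("grant saw running=true: " + grantSawRunning(5));
    }
}
```

The sleep does not make the ordering guaranteed; it only makes the "grant before close" interleaving far more likely, which is the trade-off described above.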