reswqa commented on PR #21137:
URL: https://github.com/apache/flink/pull/21137#issuecomment-1296821182

   @XComp Regarding your [proposal](https://github.com/reswqa/flink/pull/1): if I 
add a very short sleep after `triggerClassLoaderLeaseRelease.await();`, the 
deadlock reproduces 100% of the time on my own machine. In theory a run could 
still miss the race, but I am confident that this short sleep almost always 
causes `grantLeadership` to be triggered before the `running` state becomes 
false. 
   Personally, I think it is acceptable to introduce a very short sleep (I used 
5 ms) so that the deadlock recurs almost every time; it will not significantly 
increase the running time of CI. Of course, if we can think of a better way to 
solve this problem, I would be happy to adopt it.
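
   To illustrate why the sleep widens the race window, here is a minimal 
standalone sketch. It is not the Flink test itself: `RaceWindowSketch`, 
`grantObservedRunning`, and `triggerRelease` are hypothetical names, and the 
`running` flag only stands in for the runner's state. One thread waits on a 
latch, optionally sleeps, and then flips `running` to false; the other thread 
releases the latch and immediately checks `running`, as a simulated 
`grantLeadership` would.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class RaceWindowSketch {

    /**
     * Returns true if the simulated grantLeadership call observed running == true,
     * i.e. it won the race against the closing thread.
     * All names here are hypothetical; this is not Flink code.
     */
    static boolean grantObservedRunning(long sleepMillis) {
        AtomicBoolean running = new AtomicBoolean(true);
        CountDownLatch triggerRelease = new CountDownLatch(1);

        // Simulated closing thread: waits for the latch, then marks the runner as stopped.
        Thread closer = new Thread(() -> {
            try {
                triggerRelease.await();
                // The short sleep delays the state change, so the other thread
                // has time to act while running is still true.
                Thread.sleep(sleepMillis);
                running.set(false);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        closer.start();

        // Simulated leader election: release the closer, then "grant leadership" right away.
        triggerRelease.countDown();
        boolean observed = running.get();

        try {
            closer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining closer thread", e);
        }
        return observed;
    }

    public static void main(String[] args) {
        // With the sleep, the grant almost always sees running == true.
        System.out.println("with 5 ms sleep: " + grantObservedRunning(5));
        // Without the sleep, the result depends on thread scheduling.
        System.out.println("without sleep:   " + grantObservedRunning(0));
    }
}
```

   Without the sleep the outcome depends on thread scheduling, which would be 
consistent with the deadlock being hard to hit in some runs.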
   In addition, I found that without the sleep, running the whole test class on 
my machine and on my colleagues' machines hardly ever reproduces the deadlock. 
However, running 
`testJobMasterServiceLeadershipRunnerCloseWhenElectionServiceGrantLeaderShip` 
on its own greatly increases the probability of the deadlock. I don't have a 
deep understanding of the JUnit testing framework, so I don't know why there is 
such a difference. You can try running the entire test class directly to help 
you reproduce the problem.

