[jira] [Comment Edited] (FLINK-29234) Dead lock in DefaultLeaderElectionService
[ https://issues.apache.org/jira/browse/FLINK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627706#comment-17627706 ] Weijie Guo edited comment on FLINK-29234 at 11/2/22 1:35 PM: - [~fpaul] , The PR changes the production code very little, and it may take 1~2 days to complete the review. was (Author: weijie guo): [~fpaul] , The PR changes the production code very little, and it may take 1~2 days to complete the review > Dead lock in DefaultLeaderElectionService > - > > Key: FLINK-29234 > URL: https://issues.apache.org/jira/browse/FLINK-29234 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.13.5, 1.14.5, 1.15.2 >Reporter: Yu Wang >Assignee: Weijie Guo >Priority: Critical > Labels: pull-request-available > > Jobmanager stop working because the deadlock in DefaultLeaderElectionService. > The log stopped at > {code:java} > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - > Stopping DefaultLeaderElectionService. {code} > Which may similar to this ticket > https://issues.apache.org/jira/browse/FLINK-20008 > Here is the jstack info > {code:java} > Found one Java-level deadlock: > = > "flink-akka.actor.default-dispatcher-18": waiting to lock monitor > 0x7f15c7eae3a8 (object 0x000678d395e8, a java.lang.Object), which is > held by "main-EventThread" "main-EventThread": waiting to lock monitor > 0x7f15a3811258 (object 0x000678cf1be0, a java.lang.Object), which is > held by "flink-akka.actor.default-dispatcher-18" Java stack information for > the threads listed above: > === > "flink-akka.actor.default-dispatcher-18": > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:104) > - waiting to lock <0x000678d395e8> (a java.lang.Object) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$closeAsync$0(JobMasterServiceLeadershipRunner.java:147) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$735/1742012752.run(Unknown > Source) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:687) > at > org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$736/6716561.accept(Unknown > Source) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) > at > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) > at > java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:543) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:765) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:795) > at > java.util.concurrent.CompletableFuture.whenCompleteAsync(CompletableFuture.java:2163) > at > org.apache.flink.runtime.concurrent.FutureUtils.runAfterwardsAsync(FutureUtils.java:684) > at > org.apache.flink.runtime.concurrent.FutureUtils.runAfterwards(FutureUtils.java:651) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.closeAsync(JobMasterServiceLeadershipRunner.java:143) > - locked <0x000678cf1be0> (a java.lang.Object) > at > org.apache.flink.runtime.dispatcher.Dispatcher.terminateJob(Dispatcher.java:807) > at > org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobs(Dispatcher.java:799) > at > org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobsAndGetTerminationFuture(Dispatcher.java:812) > at org.apache.flink.runtime.dispatcher.Dispatcher.onStop(Dispatcher.java:268) > at > org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$444/1289054037.apply(Unknown > Source) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) > at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) > at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) > at
[jira] [Comment Edited] (FLINK-29234) Dead lock in DefaultLeaderElectionService
[ https://issues.apache.org/jira/browse/FLINK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17627706#comment-17627706 ] Weijie Guo edited comment on FLINK-29234 at 11/2/22 1:30 PM: - [~fpaul] , The PR changes the production code very little, and it may take 1~2 days to complete the review was (Author: weijie guo): [~fpaul] , The PR changes the production code very little, and it may take 1~2 days to review. > Dead lock in DefaultLeaderElectionService > - > > Key: FLINK-29234 > URL: https://issues.apache.org/jira/browse/FLINK-29234 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.13.5, 1.14.5, 1.15.2 >Reporter: Yu Wang >Assignee: Weijie Guo >Priority: Critical > Labels: pull-request-available > > Jobmanager stop working because the deadlock in DefaultLeaderElectionService. > The log stopped at > {code:java} > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - > Stopping DefaultLeaderElectionService. {code} > Which may similar to this ticket > https://issues.apache.org/jira/browse/FLINK-20008 > Here is the jstack info > {code:java} > Found one Java-level deadlock: > = > "flink-akka.actor.default-dispatcher-18": waiting to lock monitor > 0x7f15c7eae3a8 (object 0x000678d395e8, a java.lang.Object), which is > held by "main-EventThread" "main-EventThread": waiting to lock monitor > 0x7f15a3811258 (object 0x000678cf1be0, a java.lang.Object), which is > held by "flink-akka.actor.default-dispatcher-18" Java stack information for > the threads listed above: > === > "flink-akka.actor.default-dispatcher-18": > at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:104) > - waiting to lock <0x000678d395e8> (a java.lang.Object) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$closeAsync$0(JobMasterServiceLeadershipRunner.java:147) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$735/1742012752.run(Unknown > Source) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:687) > at > org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$736/6716561.accept(Unknown > Source) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) > at > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) > at > java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:543) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:765) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:795) > at > java.util.concurrent.CompletableFuture.whenCompleteAsync(CompletableFuture.java:2163) > at > org.apache.flink.runtime.concurrent.FutureUtils.runAfterwardsAsync(FutureUtils.java:684) > at > org.apache.flink.runtime.concurrent.FutureUtils.runAfterwards(FutureUtils.java:651) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.closeAsync(JobMasterServiceLeadershipRunner.java:143) > - locked <0x000678cf1be0> (a java.lang.Object) > at > org.apache.flink.runtime.dispatcher.Dispatcher.terminateJob(Dispatcher.java:807) > at > org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobs(Dispatcher.java:799) > at > org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobsAndGetTerminationFuture(Dispatcher.java:812) > at org.apache.flink.runtime.dispatcher.Dispatcher.onStop(Dispatcher.java:268) > at > org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$444/1289054037.apply(Unknown > Source) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) > at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) > at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) > at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) > at
[jira] [Comment Edited] (FLINK-29234) Dead lock in DefaultLeaderElectionService
[ https://issues.apache.org/jira/browse/FLINK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602689#comment-17602689 ] Yu Wang edited comment on FLINK-29234 at 9/10/22 9:48 AM: -- [~martijnvisser] I think this issue still exists in the master branch. The issue is caused by the ordering of Flink to get the lock. In this line, the class *DefaultLeaderElectionService* will try to get the lock of itself, then invoke the method of *leaderContender(which is JobMasterServiceLeaderShipRunner in this case).* [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L204] So the order of getting locks is # DefaultLeaderElectionService # JobMasterServiceLeaderShipRunner And in this line *JobMasterServiceLeaderShipRunner* will try to get the lock of itself, then invoke the method of *leaderElectionService(which is* {*}DefaultLeaderElectionService in this case{*}{*}).{*} [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMasterServiceLeadershipRunner.java#L148] So the order of getting the locks is # JobMasterServiceLeaderShipRunner # DefaultLeaderElectionService So if these two functions are invoked nearly at the same time, it will cause the dead lock issue. was (Author: lucentwong): [~martijnvisser] I think this issue still exists in the master branch. The issue is caused by the ordering of Flink to get the lock. In this line, the class *DefaultLeaderElectionService* will try to get the lock of itself, then invoke the method of *leaderContender(which is JobMasterServiceLeaderShipRunner in this case).* [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L204] So the order of getting locks is # DefaultLeaderElectionService # JobMasterServiceLeaderShipRunner{*}{*} And in this line *JobMasterServiceLeaderShipRunner* will try to get the lock of itself, then invoke the method of *leaderElectionService(which is* {*}DefaultLeaderElectionService in this case{*}{*}).{*} [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMasterServiceLeadershipRunner.java#L148] So the order of getting the locks is # JobMasterServiceLeaderShipRunner # DefaultLeaderElectionService So if these two functions are invoked nearly at the same time, it will cause the dead lock issue. > Dead lock in DefaultLeaderElectionService > - > > Key: FLINK-29234 > URL: https://issues.apache.org/jira/browse/FLINK-29234 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.13.5 >Reporter: Yu Wang >Priority: Critical > > Jobmanager stop working because the deadlock in DefaultLeaderElectionService. > The log stopped at > {code:java} > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - > Stopping DefaultLeaderElectionService. {code} > Which may similar to this ticket > https://issues.apache.org/jira/browse/FLINK-20008 > Here is the jstack info > {code:java} > Found one Java-level deadlock: = > "flink-akka.actor.default-dispatcher-18": waiting to lock monitor > 0x7f15c7eae3a8 (object 0x000678d395e8, a java.lang.Object), which is > held by "main-EventThread" "main-EventThread": waiting to lock monitor > 0x7f15a3811258 (object 0x000678cf1be0, a java.lang.Object), which is > held by "flink-akka.actor.default-dispatcher-18" Java stack information for > the threads listed above: === > "flink-akka.actor.default-dispatcher-18": at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:104) > - waiting to lock <0x000678d395e8> (a java.lang.Object) at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$closeAsync$0(JobMasterServiceLeadershipRunner.java:147) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$735/1742012752.run(Unknown > Source) at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:687) > at > org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$736/6716561.accept(Unknown > Source) at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) > at >
[jira] [Comment Edited] (FLINK-29234) Dead lock in DefaultLeaderElectionService
[ https://issues.apache.org/jira/browse/FLINK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602689#comment-17602689 ] Yu Wang edited comment on FLINK-29234 at 9/10/22 9:47 AM: -- [~martijnvisser] I think this issue still exists in the master branch. The issue is caused by the ordering of Flink to get the lock. In this line, the class *DefaultLeaderElectionService* will try to get the lock of itself, then invoke the method of *leaderContender(which is JobMasterServiceLeaderShipRunner in this case).* [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L204] So the order of getting locks is # DefaultLeaderElectionService # JobMasterServiceLeaderShipRunner{*}{*} And in this line *JobMasterServiceLeaderShipRunner* will try to get the lock of itself, then invoke the method of *leaderElectionService(which is* {*}DefaultLeaderElectionService in this case{*}{*}).{*} [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMasterServiceLeadershipRunner.java#L148] So the order of getting the locks is # JobMasterServiceLeaderShipRunner # DefaultLeaderElectionService So if these two functions are invoked nearly at the same time, it will cause the dead lock issue. was (Author: lucentwong): [~martijnvisser] I think this issue still exists in the master branch. The issue is caused by the ordering of Flink to get the lock. In this line, the class *DefaultLeaderElectionService* will try to get the lock of itself, then invoke the method of *leaderContender(which is JobMasterServiceLeaderShipRunner in this case).* [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L204] So the order of getting locks is # DefaultLeaderElectionService # JobMasterServiceLeaderShipRunner{*}{*} And in this line *JobMasterServiceLeaderShipRunner* will try to get the lock of itself, then invoke the method of *leaderElectionService(which is* {*}DefaultLeaderElectionService in this case{*}{*}).{*} [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMasterServiceLeadershipRunner.java#L148] So the order of getting the locks is # JobMasterServiceLeaderShipRunner # DefaultLeaderElectionService So if these two functions are invoked nearly at the same time, it will cause the dead lock issue. > Dead lock in DefaultLeaderElectionService > - > > Key: FLINK-29234 > URL: https://issues.apache.org/jira/browse/FLINK-29234 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.13.5 >Reporter: Yu Wang >Priority: Critical > > Jobmanager stop working because the deadlock in DefaultLeaderElectionService. > The log stopped at > {code:java} > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - > Stopping DefaultLeaderElectionService. {code} > Which may similar to this ticket > https://issues.apache.org/jira/browse/FLINK-20008 > Here is the jstack info > {code:java} > Found one Java-level deadlock: = > "flink-akka.actor.default-dispatcher-18": waiting to lock monitor > 0x7f15c7eae3a8 (object 0x000678d395e8, a java.lang.Object), which is > held by "main-EventThread" "main-EventThread": waiting to lock monitor > 0x7f15a3811258 (object 0x000678cf1be0, a java.lang.Object), which is > held by "flink-akka.actor.default-dispatcher-18" Java stack information for > the threads listed above: === > "flink-akka.actor.default-dispatcher-18": at > org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:104) > - waiting to lock <0x000678d395e8> (a java.lang.Object) at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$closeAsync$0(JobMasterServiceLeadershipRunner.java:147) > at > org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$735/1742012752.run(Unknown > Source) at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:687) > at > org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$736/6716561.accept(Unknown > Source) at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) > at >