[jira] [Updated] (FLINK-29234) Dead lock in DefaultLeaderElectionService

2022-10-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-29234:
---
Labels: pull-request-available  (was: )

> Dead lock in DefaultLeaderElectionService
> -
>
> Key: FLINK-29234
> URL: https://issues.apache.org/jira/browse/FLINK-29234
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.13.5, 1.14.5, 1.15.2
> Reporter: Yu Wang
> Assignee: Weijie Guo
> Priority: Critical
> Labels: pull-request-available
>
> The JobManager stops working because of a deadlock in DefaultLeaderElectionService.
> The log stopped at
> {code:java}
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
> Stopping DefaultLeaderElectionService. {code}
> This may be similar to the following ticket:
> https://issues.apache.org/jira/browse/FLINK-20008
> Here is the jstack info
> {code:java}
> Found one Java-level deadlock:
> =============================
> "flink-akka.actor.default-dispatcher-18":
>   waiting to lock monitor 0x7f15c7eae3a8 (object 0x000678d395e8, a java.lang.Object),
>   which is held by "main-EventThread"
> "main-EventThread":
>   waiting to lock monitor 0x7f15a3811258 (object 0x000678cf1be0, a java.lang.Object),
>   which is held by "flink-akka.actor.default-dispatcher-18"
>
> Java stack information for the threads listed above:
> ===================================================
> "flink-akka.actor.default-dispatcher-18":
>  at org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:104)
>  - waiting to lock <0x000678d395e8> (a java.lang.Object)
>  at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$closeAsync$0(JobMasterServiceLeadershipRunner.java:147)
>  at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$735/1742012752.run(Unknown Source)
>  at org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:687)
>  at org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$736/6716561.accept(Unknown Source)
>  at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>  at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>  at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>  at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
>  at java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:543)
>  at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:765)
>  at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>  at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:795)
>  at java.util.concurrent.CompletableFuture.whenCompleteAsync(CompletableFuture.java:2163)
>  at org.apache.flink.runtime.concurrent.FutureUtils.runAfterwardsAsync(FutureUtils.java:684)
>  at org.apache.flink.runtime.concurrent.FutureUtils.runAfterwards(FutureUtils.java:651)
>  at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.closeAsync(JobMasterServiceLeadershipRunner.java:143)
>  - locked <0x000678cf1be0> (a java.lang.Object)
>  at org.apache.flink.runtime.dispatcher.Dispatcher.terminateJob(Dispatcher.java:807)
>  at org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobs(Dispatcher.java:799)
>  at org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobsAndGetTerminationFuture(Dispatcher.java:812)
>  at org.apache.flink.runtime.dispatcher.Dispatcher.onStop(Dispatcher.java:268)
>  at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
>  at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563)
>  at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186)
>  at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$444/1289054037.apply(Unknown Source)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>  at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>  at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>  at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>  at akka.actor.Actor.aroundReceive(Actor.scala:517)
>  at akka.actor.Actor.aroundReceive$(Actor.scala:515)
>  at akka.actor.AbstractActor.aroundReceive(A

[jira] [Updated] (FLINK-29234) Dead lock in DefaultLeaderElectionService

2022-10-26 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-29234:
--
Description: 
The JobManager stops working because of a deadlock in DefaultLeaderElectionService.

The log stopped at
{code:java}
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
Stopping DefaultLeaderElectionService. {code}
This may be similar to the following ticket:
https://issues.apache.org/jira/browse/FLINK-20008

Here is the jstack info
{code:java}
Found one Java-level deadlock:
=============================
"flink-akka.actor.default-dispatcher-18":
  waiting to lock monitor 0x7f15c7eae3a8 (object 0x000678d395e8, a java.lang.Object),
  which is held by "main-EventThread"
"main-EventThread":
  waiting to lock monitor 0x7f15a3811258 (object 0x000678cf1be0, a java.lang.Object),
  which is held by "flink-akka.actor.default-dispatcher-18"

Java stack information for the threads listed above:
===================================================
"flink-akka.actor.default-dispatcher-18":
 at 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:104)
 - waiting to lock <0x000678d395e8> (a java.lang.Object)
 at 
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$closeAsync$0(JobMasterServiceLeadershipRunner.java:147)
 at 
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$735/1742012752.run(Unknown
 Source)
 at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:687)
 at 
org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$736/6716561.accept(Unknown
 Source)
 at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
 at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
 at 
org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
 at 
java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:543)
 at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:765)
 at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 at 
java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:795)
 at 
java.util.concurrent.CompletableFuture.whenCompleteAsync(CompletableFuture.java:2163)
 at 
org.apache.flink.runtime.concurrent.FutureUtils.runAfterwardsAsync(FutureUtils.java:684)
 at 
org.apache.flink.runtime.concurrent.FutureUtils.runAfterwards(FutureUtils.java:651)
 at 
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.closeAsync(JobMasterServiceLeadershipRunner.java:143)
 - locked <0x000678cf1be0> (a java.lang.Object)
 at 
org.apache.flink.runtime.dispatcher.Dispatcher.terminateJob(Dispatcher.java:807)
 at 
org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobs(Dispatcher.java:799)
 at 
org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobsAndGetTerminationFuture(Dispatcher.java:812)
 at org.apache.flink.runtime.dispatcher.Dispatcher.onStop(Dispatcher.java:268)
 at 
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$444/1289054037.apply(Unknown
 Source)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
 at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
 at akka.actor.Actor.aroundReceive(Actor.scala:517)
 at akka.actor.Actor.aroundReceive$(Actor.scala:515)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


"main-EventThread":
 at 
org.apache.flink.runtime.j

[jira] [Updated] (FLINK-29234) Dead lock in DefaultLeaderElectionService

2022-10-26 Thread Matthias Pohl (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-29234:
--
Description: 
The JobManager stops working because of a deadlock in DefaultLeaderElectionService.

The log stopped at
{code:java}
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
Stopping DefaultLeaderElectionService. {code}
This may be similar to the following ticket:
https://issues.apache.org/jira/browse/FLINK-20008
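
As a reading aid for the dump below: the two threads take the same two monitors in opposite order. The dispatcher thread holds the lock taken in JobMasterServiceLeadershipRunner.closeAsync and waits for the lock used in DefaultLeaderElectionService.stop, while main-EventThread holds that second lock and waits for the first. One common way to break such a cycle is to release the first lock before calling into the other component. The following is a minimal Java sketch of that pattern only. `ElectionService`, `Runner` and `LockOrderSketch` are hypothetical stand-ins, not Flink classes, and this is not necessarily the fix adopted for this ticket.

```java
final class LockOrderSketch {

    /** Hypothetical stand-in for a leader election service guarded by its own lock. */
    static final class ElectionService {
        private final Object lock = new Object();
        private boolean running = true;
        private Runner contender;

        void start(Runner contender) {
            synchronized (lock) {
                this.contender = contender;
            }
        }

        /** Event-thread path: takes the service lock, then the runner lock via the callback. */
        void onGrantLeadership() {
            synchronized (lock) {
                if (running && contender != null) {
                    contender.grantLeadership();
                }
            }
        }

        /** Shutdown path: needs the service lock. */
        void stop() {
            synchronized (lock) {
                running = false;
            }
        }
    }

    /** Hypothetical stand-in for the component that owns the election service. */
    static final class Runner {
        private final Object lock = new Object();
        private final ElectionService service;
        private boolean closed;
        private int grants;

        Runner(ElectionService service) {
            this.service = service;
        }

        void grantLeadership() {
            synchronized (lock) {
                if (!closed) {
                    grants++;
                }
            }
        }

        /** Changes state under the runner lock, then calls stop() only after releasing it. */
        void close() {
            synchronized (lock) {
                if (closed) {
                    return;
                }
                closed = true;
            }
            // The runner lock is no longer held here, so waiting for the
            // service lock cannot form a cycle with the event thread.
            service.stop();
        }
    }

    /** Runs both paths concurrently; returns true if both threads finished in time. */
    static boolean runBothPaths() {
        ElectionService service = new ElectionService();
        Runner runner = new Runner(service);
        service.start(runner);

        Thread eventThread = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                service.onGrantLeadership();
            }
        }, "event-thread");
        Thread dispatcherThread = new Thread(runner::close, "dispatcher-thread");
        eventThread.setDaemon(true);
        dispatcherThread.setDaemon(true);
        eventThread.start();
        dispatcherThread.start();
        try {
            eventThread.join(5_000);
            dispatcherThread.join(5_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !eventThread.isAlive() && !dispatcherThread.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both threads finished: " + runBothPaths());
    }
}
```

If `service.stop()` were moved inside the `synchronized (lock)` block of `close()`, the sketch would reproduce the lock order seen in the dump and could hang in the same way.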

Here is the jstack info
{code:java}
Found one Java-level deadlock:
=============================
"flink-akka.actor.default-dispatcher-18":
  waiting to lock monitor 0x7f15c7eae3a8 (object 0x000678d395e8, a java.lang.Object),
  which is held by "main-EventThread"
"main-EventThread":
  waiting to lock monitor 0x7f15a3811258 (object 0x000678cf1be0, a java.lang.Object),
  which is held by "flink-akka.actor.default-dispatcher-18"

Java stack information for the threads listed above:
===================================================
"flink-akka.actor.default-dispatcher-18":
 at 
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:104)
 - waiting to lock <0x000678d395e8> (a java.lang.Object)
 at 
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$closeAsync$0(JobMasterServiceLeadershipRunner.java:147)
 at 
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$735/1742012752.run(Unknown
 Source)
 at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:687)
 at 
org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$736/6716561.accept(Unknown
 Source)
 at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
 at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
 at 
org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
 at 
java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:543)
 at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:765)
 at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 at 
java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:795)
 at 
java.util.concurrent.CompletableFuture.whenCompleteAsync(CompletableFuture.java:2163)
 at 
org.apache.flink.runtime.concurrent.FutureUtils.runAfterwardsAsync(FutureUtils.java:684)
 at 
org.apache.flink.runtime.concurrent.FutureUtils.runAfterwards(FutureUtils.java:651)
 at 
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.closeAsync(JobMasterServiceLeadershipRunner.java:143)
 - locked <0x000678cf1be0> (a java.lang.Object)
 at 
org.apache.flink.runtime.dispatcher.Dispatcher.terminateJob(Dispatcher.java:807)
 at 
org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobs(Dispatcher.java:799)
 at 
org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobsAndGetTerminationFuture(Dispatcher.java:812)
 at org.apache.flink.runtime.dispatcher.Dispatcher.onStop(Dispatcher.java:268)
 at 
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186)
 at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$444/1289054037.apply(Unknown
 Source)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
 at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
 at akka.actor.Actor.aroundReceive(Actor.scala:517)
 at akka.actor.Actor.aroundReceive$(Actor.scala:515)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


"main-EventThread":
 at 
org.apache.flink.runtim

[jira] [Updated] (FLINK-29234) Dead lock in DefaultLeaderElectionService

2022-11-11 Thread Fabian Paul (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Paul updated FLINK-29234:

Affects Version/s: 1.15.3
   (was: 1.15.2)

> Dead lock in DefaultLeaderElectionService
> -
>
> Key: FLINK-29234
> URL: https://issues.apache.org/jira/browse/FLINK-29234
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.13.5, 1.14.5, 1.15.3
> Reporter: Yu Wang
> Assignee: Weijie Guo
> Priority: Critical
> Labels: pull-request-available
>
> The JobManager stops working because of a deadlock in DefaultLeaderElectionService.
> The log stopped at
> {code:java}
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
> Stopping DefaultLeaderElectionService. {code}
> This may be similar to the following ticket:
> https://issues.apache.org/jira/browse/FLINK-20008
> Here is the jstack info
> {code:java}
> Found one Java-level deadlock:
> =============================
> "flink-akka.actor.default-dispatcher-18":
>   waiting to lock monitor 0x7f15c7eae3a8 (object 0x000678d395e8, a java.lang.Object),
>   which is held by "main-EventThread"
> "main-EventThread":
>   waiting to lock monitor 0x7f15a3811258 (object 0x000678cf1be0, a java.lang.Object),
>   which is held by "flink-akka.actor.default-dispatcher-18"
>
> Java stack information for the threads listed above:
> ===================================================
> "flink-akka.actor.default-dispatcher-18":
>  at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:104)
>  - waiting to lock <0x000678d395e8> (a java.lang.Object)
>  at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$closeAsync$0(JobMasterServiceLeadershipRunner.java:147)
>  at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$735/1742012752.run(Unknown
>  Source)
>  at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:687)
>  at 
> org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$736/6716561.accept(Unknown
>  Source)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>  at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>  at 
> org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
>  at 
> java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:543)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:765)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:795)
>  at 
> java.util.concurrent.CompletableFuture.whenCompleteAsync(CompletableFuture.java:2163)
>  at 
> org.apache.flink.runtime.concurrent.FutureUtils.runAfterwardsAsync(FutureUtils.java:684)
>  at 
> org.apache.flink.runtime.concurrent.FutureUtils.runAfterwards(FutureUtils.java:651)
>  at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.closeAsync(JobMasterServiceLeadershipRunner.java:143)
>  - locked <0x000678cf1be0> (a java.lang.Object)
>  at 
> org.apache.flink.runtime.dispatcher.Dispatcher.terminateJob(Dispatcher.java:807)
>  at 
> org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobs(Dispatcher.java:799)
>  at 
> org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobsAndGetTerminationFuture(Dispatcher.java:812)
>  at org.apache.flink.runtime.dispatcher.Dispatcher.onStop(Dispatcher.java:268)
>  at 
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186)
>  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$444/1289054037.apply(Unknown
>  Source)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>  at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>  at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>  at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>  at akka.actor.Actor.aroundReceive(Actor.scala:517)
>  at akka.actor.Actor.aroundReceive$(Actor.scala:515)

[jira] [Updated] (FLINK-29234) Dead lock in DefaultLeaderElectionService

2022-09-08 Thread Yu Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Wang updated FLINK-29234:

Priority: Critical  (was: Major)

> Dead lock in DefaultLeaderElectionService
> -
>
> Key: FLINK-29234
> URL: https://issues.apache.org/jira/browse/FLINK-29234
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.13.5
> Reporter: Yu Wang
> Priority: Critical
>
> The JobManager stops working because of a deadlock in DefaultLeaderElectionService.
> This may be similar to the following ticket:
> https://issues.apache.org/jira/browse/FLINK-20008
> Here is the jstack info
> {code:java}
> Found one Java-level deadlock:
> =============================
> "flink-akka.actor.default-dispatcher-18":
>   waiting to lock monitor 0x7f15c7eae3a8 (object 0x000678d395e8, a java.lang.Object),
>   which is held by "main-EventThread"
> "main-EventThread":
>   waiting to lock monitor 0x7f15a3811258 (object 0x000678cf1be0, a java.lang.Object),
>   which is held by "flink-akka.actor.default-dispatcher-18"
>
> Java stack information for the threads listed above:
> ===================================================
> "flink-akka.actor.default-dispatcher-18":
>  at org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:104)
>  - waiting to lock <0x000678d395e8> (a java.lang.Object)
>  at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$closeAsync$0(JobMasterServiceLeadershipRunner.java:147)
>  at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$735/1742012752.run(Unknown Source)
>  at org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:687)
>  at org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$736/6716561.accept(Unknown Source)
>  at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>  at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>  at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>  at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
>  at java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:543)
>  at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:765)
>  at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>  at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:795)
>  at java.util.concurrent.CompletableFuture.whenCompleteAsync(CompletableFuture.java:2163)
>  at org.apache.flink.runtime.concurrent.FutureUtils.runAfterwardsAsync(FutureUtils.java:684)
>  at org.apache.flink.runtime.concurrent.FutureUtils.runAfterwards(FutureUtils.java:651)
>  at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.closeAsync(JobMasterServiceLeadershipRunner.java:143)
>  - locked <0x000678cf1be0> (a java.lang.Object)
>  at org.apache.flink.runtime.dispatcher.Dispatcher.terminateJob(Dispatcher.java:807)
>  at org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobs(Dispatcher.java:799)
>  at org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobsAndGetTerminationFuture(Dispatcher.java:812)
>  at org.apache.flink.runtime.dispatcher.Dispatcher.onStop(Dispatcher.java:268)
>  at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
>  at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563)
>  at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186)
>  at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$444/1289054037.apply(Unknown Source)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>  at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>  at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>  at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>  at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>  at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>  at akka.actor.Actor.aroundReceive(Actor.scala:517)
>  at akka.actor.Actor.aroundReceive$(Actor.scala:515)
>  at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>  at akka.dispatch.Ma

[jira] [Updated] (FLINK-29234) Dead lock in DefaultLeaderElectionService

2022-09-08 Thread Yu Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Wang updated FLINK-29234:

Description: 
The JobManager stops working because of a deadlock in DefaultLeaderElectionService.
This may be similar to the following ticket:
https://issues.apache.org/jira/browse/FLINK-20008

Here is the jstack info
{code:java}
Found one Java-level deadlock:
=============================
"flink-akka.actor.default-dispatcher-18":
  waiting to lock monitor 0x7f15c7eae3a8 (object 0x000678d395e8, a java.lang.Object),
  which is held by "main-EventThread"
"main-EventThread":
  waiting to lock monitor 0x7f15a3811258 (object 0x000678cf1be0, a java.lang.Object),
  which is held by "flink-akka.actor.default-dispatcher-18"

Java stack information for the threads listed above:
===================================================
"flink-akka.actor.default-dispatcher-18":
 at org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:104)
 - waiting to lock <0x000678d395e8> (a java.lang.Object)
 at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$closeAsync$0(JobMasterServiceLeadershipRunner.java:147)
 at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$735/1742012752.run(Unknown Source)
 at org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:687)
 at org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$736/6716561.accept(Unknown Source)
 at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
 at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
 at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
 at java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:543)
 at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:765)
 at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:795)
 at java.util.concurrent.CompletableFuture.whenCompleteAsync(CompletableFuture.java:2163)
 at org.apache.flink.runtime.concurrent.FutureUtils.runAfterwardsAsync(FutureUtils.java:684)
 at org.apache.flink.runtime.concurrent.FutureUtils.runAfterwards(FutureUtils.java:651)
 at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.closeAsync(JobMasterServiceLeadershipRunner.java:143)
 - locked <0x000678cf1be0> (a java.lang.Object)
 at org.apache.flink.runtime.dispatcher.Dispatcher.terminateJob(Dispatcher.java:807)
 at org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobs(Dispatcher.java:799)
 at org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobsAndGetTerminationFuture(Dispatcher.java:812)
 at org.apache.flink.runtime.dispatcher.Dispatcher.onStop(Dispatcher.java:268)
 at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$444/1289054037.apply(Unknown Source)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
 at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
 at akka.actor.Actor.aroundReceive(Actor.scala:517)
 at akka.actor.Actor.aroundReceive$(Actor.scala:515)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"main-EventThread":
 at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:468)
 - waiting to lock <0x000678cf1be0> (a java.lang.Object)
 at org.

[jira] [Updated] (FLINK-29234) Dead lock in DefaultLeaderElectionService

2022-09-08 Thread Yu Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Wang updated FLINK-29234:

Description: 
The JobManager stops working because of a deadlock in DefaultLeaderElectionService.

The log stopped at
{code:java}
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
Stopping DefaultLeaderElectionService. {code}
This may be similar to the following ticket:
https://issues.apache.org/jira/browse/FLINK-20008

Here is the jstack info
{code:java}
Found one Java-level deadlock:
=============================
"flink-akka.actor.default-dispatcher-18":
  waiting to lock monitor 0x7f15c7eae3a8 (object 0x000678d395e8, a java.lang.Object),
  which is held by "main-EventThread"
"main-EventThread":
  waiting to lock monitor 0x7f15a3811258 (object 0x000678cf1be0, a java.lang.Object),
  which is held by "flink-akka.actor.default-dispatcher-18"

Java stack information for the threads listed above:
===================================================
"flink-akka.actor.default-dispatcher-18":
 at org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:104)
 - waiting to lock <0x000678d395e8> (a java.lang.Object)
 at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$closeAsync$0(JobMasterServiceLeadershipRunner.java:147)
 at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$735/1742012752.run(Unknown Source)
 at org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:687)
 at org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$736/6716561.accept(Unknown Source)
 at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
 at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
 at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
 at java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:543)
 at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:765)
 at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:795)
 at java.util.concurrent.CompletableFuture.whenCompleteAsync(CompletableFuture.java:2163)
 at org.apache.flink.runtime.concurrent.FutureUtils.runAfterwardsAsync(FutureUtils.java:684)
 at org.apache.flink.runtime.concurrent.FutureUtils.runAfterwards(FutureUtils.java:651)
 at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.closeAsync(JobMasterServiceLeadershipRunner.java:143)
 - locked <0x000678cf1be0> (a java.lang.Object)
 at org.apache.flink.runtime.dispatcher.Dispatcher.terminateJob(Dispatcher.java:807)
 at org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobs(Dispatcher.java:799)
 at org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobsAndGetTerminationFuture(Dispatcher.java:812)
 at org.apache.flink.runtime.dispatcher.Dispatcher.onStop(Dispatcher.java:268)
 at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$444/1289054037.apply(Unknown Source)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
 at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
 at akka.actor.Actor.aroundReceive(Actor.scala:517)
 at akka.actor.Actor.aroundReceive$(Actor.scala:515)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 


"main-EventThread": at 
org.apache.flink.runtime.jobmaster.JobMasterS

[jira] [Updated] (FLINK-29234) Dead lock in DefaultLeaderElectionService

2022-09-10 Thread Yu Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Wang updated FLINK-29234:

Affects Version/s: 1.15.2
   1.14.5

> Dead lock in DefaultLeaderElectionService
> -
>
> Key: FLINK-29234
> URL: https://issues.apache.org/jira/browse/FLINK-29234
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.5, 1.14.5, 1.15.2
>Reporter: Yu Wang
>Priority: Critical
>
> The JobManager stops working because of a deadlock in DefaultLeaderElectionService.
> The log stopped at
> {code:java}
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService [] - 
> Stopping DefaultLeaderElectionService. {code}
> This may be similar to this ticket:
> https://issues.apache.org/jira/browse/FLINK-20008
> Here is the jstack info
> {code:java}
> Found one Java-level deadlock:
> =============================
> "flink-akka.actor.default-dispatcher-18":
>   waiting to lock monitor 0x7f15c7eae3a8 (object 0x000678d395e8, a java.lang.Object),
>   which is held by "main-EventThread"
> "main-EventThread":
>   waiting to lock monitor 0x7f15a3811258 (object 0x000678cf1be0, a java.lang.Object),
>   which is held by "flink-akka.actor.default-dispatcher-18"
>
> Java stack information for the threads listed above:
> ===================================================
> "flink-akka.actor.default-dispatcher-18":
>     at org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:104)
>     - waiting to lock <0x000678d395e8> (a java.lang.Object)
>     at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$closeAsync$0(JobMasterServiceLeadershipRunner.java:147)
>     at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$735/1742012752.run(Unknown Source)
>     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:687)
>     at org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$736/6716561.accept(Unknown Source)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>     at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
>     at java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:543)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:765)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>     at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:795)
>     at java.util.concurrent.CompletableFuture.whenCompleteAsync(CompletableFuture.java:2163)
>     at org.apache.flink.runtime.concurrent.FutureUtils.runAfterwardsAsync(FutureUtils.java:684)
>     at org.apache.flink.runtime.concurrent.FutureUtils.runAfterwards(FutureUtils.java:651)
>     at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.closeAsync(JobMasterServiceLeadershipRunner.java:143)
>     - locked <0x000678cf1be0> (a java.lang.Object)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.terminateJob(Dispatcher.java:807)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobs(Dispatcher.java:799)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.terminateRunningJobsAndGetTerminationFuture(Dispatcher.java:812)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.onStop(Dispatcher.java:268)
>     at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$444/1289054037.apply(Unknown Source)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>     at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>     at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>     at akka.actor.Actor.aroundReceive(Actor.scala:517)
>     at akka.actor.Actor.aroundReceive$(Actor.scala:515)
>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.sc