[ 
https://issues.apache.org/jira/browse/FLINK-34672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848648#comment-17848648
 ] 

Matthias Pohl commented on FLINK-34672:
---------------------------------------

I'm still trying to find a reviewer. It's on my plate. But it's not a blocker 
because the issue already existed in older versions of Flink:
{quote}
I also verified that this is not something that was introduced in Flink 1.18 
with the FLIP-285 changes. AFAIS, it can also happen in 1.17- (I didn't check 
the pre-FLINK-24038 code but only looked into release-1.17).
{quote}

> HA deadlock between JobMasterServiceLeadershipRunner and 
> DefaultLeaderElectionService
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-34672
>                 URL: https://issues.apache.org/jira/browse/FLINK-34672
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.17.2, 1.19.0, 1.18.1, 1.20.0
>            Reporter: Chesnay Schepler
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.18.2, 1.20.0, 1.19.1
>
>
> We recently observed a deadlock in the JM within the HA system.
> (see below for the thread dump)
> [~mapohl] and I looked a bit into it and there appears to be a race condition 
> when leadership is revoked while a JobMaster is being started.
> It appears to be caused by 
> {{JobMasterServiceLeadershipRunner#createNewJobMasterServiceProcess}} 
> forwarding futures while holding a lock; depending on whether the forwarded 
> future is already complete the next stage may or may not run while holding 
> that same lock.
> We haven't determined yet whether we should be holding that lock or not.
> {code}
> "DefaultLeaderElectionService-leadershipOperationExecutor-thread-1" #131 
> daemon prio=5 os_prio=0 cpu=157.44ms elapsed=78749.65s tid=0x00007f531f43d000 
> nid=0x19d waiting for monitor entry  [0x00007f53084fd000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:462)
>         - waiting to lock <0x00000000f1c0e088> (a java.lang.Object)
>         at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.revokeLeadership(JobMasterServiceLeadershipRunner.java:397)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.notifyLeaderContenderOfLeadershipLoss(DefaultLeaderElectionService.java:484)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1252/0x0000000840ddec40.accept(Unknown
>  Source)
>         at java.util.HashMap.forEach(java.base@11.0.22/HashMap.java:1337)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.onRevokeLeadershipInternal(DefaultLeaderElectionService.java:452)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1251/0x0000000840dcf840.run(Unknown
>  Source)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.lambda$runInLeaderEventThread$3(DefaultLeaderElectionService.java:549)
>         - locked <0x00000000f0e3f4d8> (a java.lang.Object)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService$$Lambda$1075/0x0000000840c23040.run(Unknown
>  Source)
>         at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(java.base@11.0.22/CompletableFuture.java:1736)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.22/ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.22/ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)
> {code}
> {code}
> "jobmanager-io-thread-1" #636 daemon prio=5 os_prio=0 cpu=125.56ms 
> elapsed=78699.01s tid=0x00007f5321c6e800 nid=0x396 waiting for monitor entry  
> [0x00007f530567d000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.hasLeadership(DefaultLeaderElectionService.java:366)
>         - waiting to lock <0x00000000f0e3f4d8> (a java.lang.Object)
>         at 
> org.apache.flink.runtime.leaderelection.DefaultLeaderElection.hasLeadership(DefaultLeaderElection.java:52)
>         at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.isValidLeader(JobMasterServiceLeadershipRunner.java:509)
>         at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$forwardIfValidLeader$15(JobMasterServiceLeadershipRunner.java:520)
>         - locked <0x00000000f1c0e088> (a java.lang.Object)
>         at 
> org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner$$Lambda$1320/0x0000000840e1a840.accept(Unknown
>  Source)
>         at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(java.base@11.0.22/CompletableFuture.java:859)
>         at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(java.base@11.0.22/CompletableFuture.java:837)
>         at 
> java.util.concurrent.CompletableFuture.postComplete(java.base@11.0.22/CompletableFuture.java:506)
>         at 
> java.util.concurrent.CompletableFuture.complete(java.base@11.0.22/CompletableFuture.java:2079)
>         at 
> org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.registerJobMasterServiceFutures(DefaultJobMasterServiceProcess.java:124)
>         at 
> org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:114)
>         at 
> org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess$$Lambda$1319/0x0000000840e1a440.accept(Unknown
>  Source)
>         at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(java.base@11.0.22/CompletableFuture.java:859)
>         at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(java.base@11.0.22/CompletableFuture.java:837)
>         at 
> java.util.concurrent.CompletableFuture.postComplete(java.base@11.0.22/CompletableFuture.java:506)
>         at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(java.base@11.0.22/CompletableFuture.java:1705)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.22/ThreadPoolExecutor.java:1128)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.22/ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to