[ https://issues.apache.org/jira/browse/TEZ-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor reassigned TEZ-4334:
---------------------------------

    Assignee: Sungwoo Park

> Fix deadlock in ShuffleScheduler
> --------------------------------
>
>                 Key: TEZ-4334
>                 URL: https://issues.apache.org/jira/browse/TEZ-4334
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Sungwoo Park
>            Assignee: Sungwoo Park
>            Priority: Major
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Deadlock can be generated between a thread calling ShuffleScheduler.close()
> and the ShufflePenaltyReferee thread.
> Example (produced with an earlier version):
> {{"Fetcher_O { attempt_1611850856294_0026_1_03_000000_0_10344 Reducer_3} #13" #2669 daemon prio=5 os_prio=0 tid=0x00002b9de869d000 nid=0xf99 in Object.wait() [0x00002b9de4983000]
>   at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.close(ShuffleScheduler.java:481)
>   at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupShuffleScheduler(Shuffle.java:352)
>   at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupShuffleSchedulerIgnoreErrors(Shuffle.java:343)
>   at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.reportException(Shuffle.java:407)
>   - locked <0x00002b96bbb9d7a8> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle)
>   at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1033)
>   at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:781)
>   - locked <0x00002b96b98a7860> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
>   at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:414)
> "ShufflePenaltyReferee {Reducer_3}" #2645 daemon prio=5 os_prio=0 tid=0x00002b9560fae800 nid=0xf7d waiting for monitor entry [0x00002b9de733b000]
>   java.lang.Thread.State: BLOCKED (on object monitor)
>   at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$Referee.run(ShuffleScheduler.java:1322)
>   - waiting to lock <0x00002b96b98a7860> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)}}
> We can fix the deadlock with:
> 1) do not hold ShuffleScheduler.this when calling exceptionReporter.reportException()
> 2) remove synchronized in copyFailed()

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
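
The first proposed fix (do not hold ShuffleScheduler.this when calling exceptionReporter.reportException()) can be illustrated with a minimal sketch. This is not the Tez code or the actual patch: the names LockScopeSketch, Scheduler, Reporter, failureCount() and helperCanLockDuringCallback() are hypothetical, and the health check is reduced to a counter. The callback here waits for a helper thread that needs the scheduler monitor, which loosely mirrors close() waiting on the referee thread in the dump above. Because the monitor is released before the callback runs, the helper can make progress.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LockScopeSketch {

    /** Hypothetical stand-in for the exception reporter callback. */
    interface Reporter {
        void report(Throwable t);
    }

    /** Hypothetical, heavily simplified scheduler; not the Tez class. */
    static class Scheduler {
        private final Reporter reporter;
        private int failures = 0;

        Scheduler(Reporter reporter) {
            this.reporter = reporter;
        }

        /** Updates state under the monitor, then reports after releasing it. */
        void copyFailed(Throwable cause) {
            boolean unhealthy;
            synchronized (this) {
                failures++;
                unhealthy = failures >= 1; // placeholder health check
            }
            // The monitor is no longer held here, so the callback may block
            // on other threads that need it without causing a deadlock.
            if (unhealthy) {
                reporter.report(cause);
            }
        }

        /** Models another thread's need for the scheduler monitor. */
        synchronized int failureCount() {
            return failures;
        }
    }

    /** Returns true if a helper thread could lock the scheduler during the callback. */
    static boolean helperCanLockDuringCallback() {
        final Scheduler[] holder = new Scheduler[1];
        final boolean[] acquired = {false};
        Reporter reporter = t -> {
            CountDownLatch done = new CountDownLatch(1);
            Thread helper = new Thread(() -> {
                holder[0].failureCount(); // needs the scheduler monitor
                done.countDown();
            });
            helper.setDaemon(true);
            helper.start();
            try {
                // Bounded wait so the sketch terminates even if the lock were held.
                acquired[0] = done.await(2, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        holder[0] = new Scheduler(reporter);
        holder[0].copyFailed(new RuntimeException("fetch failed"));
        return acquired[0];
    }

    public static void main(String[] args) {
        System.out.println("helper acquired lock during callback: "
                + helperCanLockDuringCallback());
    }
}
```

If copyFailed() were declared synchronized and called the reporter inside the locked region, the helper thread would stay blocked until the bounded wait expired, which is the shape of the hang shown in the thread dump.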