[ 
https://issues.apache.org/jira/browse/TEZ-4334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated TEZ-4334:
------------------------------
    Summary: Fix deadlock in ShuffleScheduler between ShuffleScheduler.close() 
and the ShufflePenaltyReferee thread  (was: Fix deadlock in ShuffleScheduler)

> Fix deadlock in ShuffleScheduler between ShuffleScheduler.close() and the 
> ShufflePenaltyReferee thread
> ------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4334
>                 URL: https://issues.apache.org/jira/browse/TEZ-4334
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Sungwoo Park
>            Assignee: Sungwoo Park
>            Priority: Major
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> A deadlock can occur between a thread calling ShuffleScheduler.close() 
> and the ShufflePenaltyReferee thread.
> Example (thread dump produced with an earlier version):
> {{"Fetcher_O { attempt_1611850856294_0026_1_03_000000_0_10344 Reducer_3} #13" #2669 daemon prio=5 os_prio=0 tid=0x00002b9de869d000 nid=0xf99 in Object.wait() [0x00002b9de4983000]
>         at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.close(ShuffleScheduler.java:481)
>         at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupShuffleScheduler(Shuffle.java:352)
>         at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.cleanupShuffleSchedulerIgnoreErrors(Shuffle.java:343)
>         at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle.reportException(Shuffle.java:407)
>         - locked <0x00002b96bbb9d7a8> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle)
>         at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1033)
>         at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:781)
>         - locked <0x00002b96b98a7860> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)
>         at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:414)
>
> "ShufflePenaltyReferee {Reducer_3}" #2645 daemon prio=5 os_prio=0 tid=0x00002b9560fae800 nid=0xf7d waiting for monitor entry [0x00002b9de733b000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler$Referee.run(ShuffleScheduler.java:1322)
>         - waiting to lock <0x00002b96b98a7860> (a org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler)}}
> We can fix the deadlock by:
> 1) not holding the ShuffleScheduler.this monitor when calling 
> exceptionReporter.reportException()
> 2) removing synchronized from copyFailed()
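
The pattern in the thread dump can be reproduced in isolation. The following is a minimal sketch, not the Tez code: the class DeadlockSketch, its nested Scheduler, and all method names are hypothetical stand-ins. It assumes that close() waits for the referee thread to finish while the caller still holds the scheduler monitor that the referee needs. A bounded join() is used in place of an unbounded wait so the demo terminates instead of hanging.

```java
public class DeadlockSketch {

    /** Hypothetical stand-in for the scheduler; this is not the Tez class. */
    static class Scheduler {
        private volatile boolean closed = false;
        private int failures = 0;
        private final Thread referee;

        Scheduler() {
            // Stand-in for the referee thread: every pass needs the scheduler
            // monitor, and it only sees the shutdown flag while holding it.
            referee = new Thread(() -> {
                while (true) {
                    synchronized (this) {
                        if (closed) {
                            return;
                        }
                    }
                    try {
                        Thread.sleep(1);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }, "referee-sketch");
            referee.setDaemon(true);
            referee.start();
        }

        /** Returns true if the referee stopped within the timeout. */
        boolean close(long timeoutMs) {
            closed = true;
            try {
                // Bounded wait so the sketch ends; an unbounded wait here
                // would hang forever when the caller holds the monitor.
                referee.join(timeoutMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            return !referee.isAlive();
        }

        /** Problematic shape: the failure is reported while the monitor is held. */
        synchronized boolean copyFailedHoldingLock() {
            failures++;
            return close(500); // referee cannot enter the monitor, so it never stops
        }

        /** Fixed shape: update state under the monitor, report after releasing it. */
        boolean copyFailedLockReleased() {
            synchronized (this) {
                failures++;
            }
            return close(500); // referee can take the monitor, see the flag, and exit
        }
    }

    static boolean runHoldingLock() {
        return new Scheduler().copyFailedHoldingLock();
    }

    static boolean runLockReleased() {
        return new Scheduler().copyFailedLockReleased();
    }

    public static void main(String[] args) {
        System.out.println("monitor held, referee stopped: " + runHoldingLock());
        System.out.println("monitor released, referee stopped: " + runLockReleased());
    }
}
```

In this sketch the first call should report that the referee did not stop, because the caller keeps the monitor for the whole duration of the join, while the second call should report that it did. This corresponds to fix 1) above: the shared state is still updated under the monitor, but the call that can end up in close() is made only after the monitor is released.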



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
