[
https://issues.apache.org/jira/browse/TEZ-4416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
László Bodor resolved TEZ-4416.
-------------------------------
Resolution: Duplicate
> Dead lock triggered by ShuffleScheduler
> ---------------------------------------
>
> Key: TEZ-4416
> URL: https://issues.apache.org/jira/browse/TEZ-4416
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.10.1
> Reporter: Omega-Ariston
> Priority: Major
> Attachments: container.jstack, screenshot.PNG
>
>
> How this bug was found:
> I was running a SQL query with Hive on Tez on a cluster with low disk
> capacity. An exception was thrown during execution (which is quite
> reasonable). Yet the task did not stop normally; it kept hanging for a
> very long time. I therefore captured a jstack dump and did some
> investigation. Here is what I found.
> (The .jstack file and a screenshot of the relevant jstack segment are attached.)
>
> How this deadlock is triggered:
> # A file copy to local disk fails, so FetcherOrderedGrouped.copyFromHost()
> calls copyFailed(), which is a synchronized method on the ShuffleScheduler
> instance.
> # The call from step 1 eventually reaches ShuffleScheduler.close(), which
> tries to stop the Referee thread by calling referee.interrupt() and
> referee.join().
> # Meanwhile, the Referee is waiting in its run() method for the
> ShuffleScheduler instance lock, which is held by the thread from step 1.
> That thread is itself blocked in join(), so neither can proceed: a deadlock.
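
The three steps above can be reproduced in miniature. The following is a minimal Java sketch, not the Tez source: the Scheduler class, its method names, and the timed join are all assumptions made for illustration (the report describes a plain referee.join(), which would block forever). It shows a thread that interrupts and joins a worker while still holding the monitor the worker needs, and one possible alternative that calls close() after the monitor is released. This alternative is only an illustration and is not claimed to be how the duplicate issue was fixed.

```java
class ShuffleDeadlockSketch {

    /** Hypothetical stand-in for ShuffleScheduler; not the real Tez class. */
    static class Scheduler {
        private final Thread referee = new Thread(this::refereeLoop, "Referee");

        void start() {
            referee.start();
        }

        // Referee: waits on the scheduler's monitor until it is interrupted.
        private void refereeLoop() {
            try {
                synchronized (this) {
                    while (true) {
                        // After an interrupt, wait() must re-acquire the monitor
                        // before it can throw InterruptedException.
                        wait();
                    }
                }
            } catch (InterruptedException e) {
                // Interrupted: fall through and let the thread exit.
            }
        }

        // Problematic shape: close() runs while the caller holds the monitor.
        synchronized boolean copyFailedHoldingLock(long joinMillis) {
            return close(joinMillis);
        }

        // Alternative shape (assumption): only bookkeeping needs the monitor.
        boolean copyFailedReleasingLock(long joinMillis) {
            synchronized (this) {
                // Record the failure here.
            }
            return close(joinMillis);
        }

        // Returns true if the referee terminated within joinMillis (must be > 0).
        // The timeout exists only so this sketch terminates.
        private boolean close(long joinMillis) {
            referee.interrupt();
            try {
                referee.join(joinMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return !referee.isAlive();
        }
    }

    static boolean refereeStopsWhenLockHeld(long joinMillis) {
        Scheduler s = new Scheduler();
        s.start();
        return s.copyFailedHoldingLock(joinMillis);
    }

    static boolean refereeStopsWhenLockReleased(long joinMillis) {
        Scheduler s = new Scheduler();
        s.start();
        return s.copyFailedReleasingLock(joinMillis);
    }

    public static void main(String[] args) {
        System.out.println("lock held:     referee stopped = "
                + refereeStopsWhenLockHeld(200));
        System.out.println("lock released: referee stopped = "
                + refereeStopsWhenLockReleased(5000));
    }
}
```

In the first call the referee cannot exit while the caller holds the monitor, so the timed join expires with the thread still alive; with an untimed join() the caller would never return. In the second call the monitor is free by the time close() runs, so the interrupted referee can re-acquire it and terminate.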
--
This message was sent by Atlassian Jira
(v8.20.10#820010)