[ 
https://issues.apache.org/jira/browse/TEZ-4416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiechuan Chen updated TEZ-4416:
-------------------------------
    Description: 
How this bug was found:

I was executing a SQL query with Hive on Tez on a cluster with low disk 
capacity. An exception was thrown during the execution (which is quite 
reasonable), yet the task did not stop normally; it kept hanging for a very 
long time. I therefore captured a jstack and investigated. Here is what I 
found.

(The .jstack file and a screenshot of the relevant jstack segment are 
attached below.)

 

How this deadlock is triggered:
 # The copy of files on HDFS fails, which triggers copyFailed() from 
FetcherOrderedGrouped.copyFromHost(), a method synchronized on the 
ShuffleScheduler instance.
 # The call from step 1 eventually reaches ShuffleScheduler.close(), which 
tries to stop the Referee thread by calling referee.interrupt() and 
referee.join().
 # Meanwhile, the Referee is waiting in its run() method for the 
ShuffleScheduler instance's lock, which is held by the thread from step 1. 
Hence a deadlock occurs.
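The three steps above can be reproduced as a minimal, self-contained Java sketch (class and field names here are illustrative stand-ins, not the actual Tez code): a thread holding the scheduler's monitor calls interrupt() and join() on a thread that is itself blocked waiting to enter a synchronized block on that same monitor, so the join can never complete, because interrupt() does not wake a thread that is BLOCKED on monitor entry.

```java
// Hypothetical sketch of the lock-ordering bug; not the actual Tez code.
public class ShuffleSchedulerDeadlockSketch {

    // Stands in for the ShuffleScheduler instance whose monitor both
    // threads need.
    static final Object schedulerLock = new Object();

    // Returns true if the "referee" thread is still blocked after a timed
    // join, i.e. the hang was reproduced. The real close() joins with no
    // timeout, so the task hangs indefinitely instead.
    static boolean reproduceHang() throws InterruptedException {
        Thread referee = new Thread(() -> {
            // Referee.run(): waits for the scheduler's monitor.
            synchronized (schedulerLock) {
                // bookkeeping would happen here
            }
        });

        boolean stillBlocked;
        synchronized (schedulerLock) { // the copyFailed()/close() path holds the monitor
            referee.start();
            Thread.sleep(200);         // let the referee block on monitor entry
            referee.interrupt();       // interrupt() cannot wake a BLOCKED thread
            referee.join(1000);        // times out: the referee cannot terminate
            stillBlocked = referee.isAlive(); // checked while still holding the lock
        }
        referee.join();                // lock released, referee can now finish
        return stillBlocked;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("referee still blocked: " + reproduceHang()); // prints true
    }
}
```

Running the sketch shows join() returning only because of the timeout while the referee thread remains alive and BLOCKED, which matches the stuck stacks in the attached jstack.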

  was:
How this bug was found:

I was executing a SQL query with Hive on Tez on a cluster with low disk 
capacity. An exception was thrown during the execution of one of the 
reducers, due to a failure reading intermediate files. The task did not stop 
normally but kept hanging for a long time. I therefore captured a jstack and 
investigated. Here is what I found.

(The jstack file and a screenshot of the corresponding jstack segment are 
attached below.)

 

How this deadlock is triggered:
 # The copy of files on HDFS fails, which triggers copyFailed() from 
FetcherOrderedGrouped.copyFromHost(), a method synchronized on the 
ShuffleScheduler instance.
 # The call from step 1 eventually reaches ShuffleScheduler.close(), which 
tries to stop the Referee thread by calling referee.interrupt() and 
referee.join().
 # Meanwhile, the Referee is waiting in its run() method for the 
ShuffleScheduler's lock, which is held by the thread from step 1. Hence a 
deadlock occurs.


> Dead lock triggered by ShuffleScheduler
> ---------------------------------------
>
>                 Key: TEZ-4416
>                 URL: https://issues.apache.org/jira/browse/TEZ-4416
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.10.1
>            Reporter: Jiechuan Chen
>            Priority: Major
>         Attachments: container.jstack, screenshot.PNG
>
>
> How this bug was found:
> I was executing a SQL query with Hive on Tez on a cluster with low disk 
> capacity. An exception was thrown during the execution (which is quite 
> reasonable), yet the task did not stop normally; it kept hanging for a 
> very long time. I therefore captured a jstack and investigated. Here is 
> what I found.
> (The .jstack file and a screenshot of the relevant jstack segment are 
> attached below.)
>  
> How this deadlock is triggered:
>  # The copy of files on HDFS fails, which triggers copyFailed() from 
> FetcherOrderedGrouped.copyFromHost(), a method synchronized on the 
> ShuffleScheduler instance. 
>  # The call from step 1 eventually reaches ShuffleScheduler.close(), 
> which tries to stop the Referee thread by calling referee.interrupt() 
> and referee.join().
>  # Meanwhile, the Referee is waiting in its run() method for the 
> ShuffleScheduler instance's lock, which is held by the thread from step 1. 
> Hence a deadlock occurs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
