[ 
https://issues.apache.org/jira/browse/HDFS-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395862#comment-13395862
 ] 

Vinay commented on HDFS-3541:
-----------------------------

The attached thread dump shows that the recoverBlock(..) call is waiting to 
join the writer thread in ReplicaInPipeline#stopWriter().

{code}  public void stopWriter() throws IOException {
    if (writer != null && writer != Thread.currentThread() && writer.isAlive()) {
      writer.interrupt();
      try {
        writer.join();
      } catch (InterruptedException e) {
        throw new IOException("Waiting for writer thread is interrupted.");
      }
    }
  }{code}

FSDataSetImpl#initReplicaRecovery calls the above method while it already 
holds the FSDataSet lock.

In the current thread dump, the writer thread is one of the DataXceiver 
threads, each of which is waiting on its respective PacketResponder thread.

# Here *writer.interrupt()* succeeds in interrupting the thread only if it is 
in a waiting/sleeping state; otherwise it does not actually interrupt it, so 
*writer.join()* waits until the thread completes its execution.
# The writer thread is a DataXceiver thread, which is waiting to join its 
PacketResponder thread.
# The PacketResponder threads are waiting on the *fsdataset* lock to finalize 
the block.

So it's a deadlock.
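
The last link of this chain can be reproduced in isolation. The following is a minimal sketch, not HDFS code: the class name InterruptDemo, the method name, and the plain Object standing in for the *fsdataset* lock are all hypothetical. It shows that a thread blocked on entering a synchronized block is not released by interrupt(); it only proceeds once the lock holder lets go.

```java
import java.util.concurrent.CountDownLatch;

public class InterruptDemo {
    /**
     * Hypothetical illustration: returns true if a thread that is blocked on a
     * monitor is still alive after interrupt() plus a bounded join, while the
     * caller keeps holding that monitor.
     */
    static boolean blockedThreadSurvivesInterrupt() throws InterruptedException {
        final Object lock = new Object();           // stands in for the dataset lock
        final CountDownLatch started = new CountDownLatch(1);

        Thread responder = new Thread(() -> {
            started.countDown();
            synchronized (lock) {
                // stands in for "finalize the block"
            }
        });

        boolean alive;
        synchronized (lock) {                       // the "recovery" side holds the lock
            responder.start();
            started.await();                        // responder is running, about to block
            responder.interrupt();                  // only sets the interrupt flag
            responder.join(200);                    // bounded wait, so the demo terminates
            alive = responder.isAlive();
        }
        responder.join();                           // lock released, responder can finish
        return alive;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("still alive after interrupt: "
                + blockedThreadSurvivesInterrupt());
    }
}
```

With an unbounded join() in place of join(200), the caller would never leave the synchronized block, which is the situation described above.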

Here ReplicaInPipeline#stopWriter() should ensure that the thread is 
interrupted successfully.

The following change should work in this case:
{code}  public void stopWriter() throws IOException {
    if (writer != null && writer != Thread.currentThread()) {
      while (writer.isAlive()) {
        writer.interrupt();
        try {
          // join(100), not wait(100): wait() requires holding writer's monitor
          writer.join(100);
        } catch (InterruptedException e) {
          throw new IOException("Waiting for writer thread is interrupted.");
        }
      }
    }
  }{code}
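
As a variation on the loop above, here is a minimal self-contained sketch that also gives up after an overall deadline instead of retrying forever. This is an assumption for illustration, not the committed fix: the class StopWriterSketch, the method stopThread, and both timeout parameters are hypothetical names.

```java
import java.io.IOException;

public class StopWriterSketch {
    /**
     * Hypothetical helper: repeatedly interrupts t and waits for it to exit.
     * pollMs must be greater than 0, because join(0) waits forever.
     * Throws IOException if t is still alive after deadlineMs.
     */
    static void stopThread(Thread t, long pollMs, long deadlineMs) throws IOException {
        if (t == null || t == Thread.currentThread()) {
            return;
        }
        final long end = System.nanoTime() + deadlineMs * 1_000_000L;
        while (t.isAlive()) {
            if (System.nanoTime() - end >= 0) {
                throw new IOException("Timed out waiting for thread " + t.getName());
            }
            t.interrupt();                 // re-send in case an earlier interrupt was swallowed
            try {
                t.join(pollMs);            // bounded wait, then re-check isAlive()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // preserve our own interrupt status
                throw new IOException("Waiting for writer thread is interrupted.", e);
            }
        }
    }
}
```

A caller would use something like stopThread(writer, 100, 60000). A deadline turns a permanent hang into an error the caller can handle, but it does not by itself remove the lock ordering problem if the caller still holds the dataset lock while waiting.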
                
> Deadlock between recovery, xceiver and packet responder
> -------------------------------------------------------
>
>                 Key: HDFS-3541
>                 URL: https://issues.apache.org/jira/browse/HDFS-3541
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>    Affects Versions: 2.0.1-alpha
>            Reporter: suja s
>            Assignee: Vinay
>         Attachments: DN_dump.rar
>
>
> Block Recovery initiated while write in progress at Datanode side. Found a 
> lock between recovery, xceiver and packet responder.
