[ 
https://issues.apache.org/jira/browse/HDFS-9294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080594#comment-15080594
 ] 

Tsz Wo Nicholas Sze commented on HDFS-9294:
-------------------------------------------

+1. We should also cherry-pick this to branch-2.6.

> DFSClient deadlock when closing a file and failing to renew the lease
> -------------------------------------------------------------
>
>                 Key: HDFS-9294
>                 URL: https://issues.apache.org/jira/browse/HDFS-9294
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.2.0, 2.7.1
>         Environment: Hadoop 2.2.0
>            Reporter: 邓飞
>            Assignee: Brahma Reddy Battula
>            Priority: Blocker
>             Fix For: 2.7.2
>
>         Attachments: HDFS-9294-002.patch, HDFS-9294-002.patch, 
> HDFS-9294-branch-2.7.patch, HDFS-9294-branch-2.patch, HDFS-9294.patch
>
>
> We found a deadlock on our HBase (0.98) cluster (running Hadoop 2.2.0). It looks like an HDFS bug; at the time our network was unstable.
> Below is the stack:
> *************************************************************************************************************************************
> Found one Java-level deadlock:
> =============================
> "MemStoreFlusher.1":
>   waiting to lock monitor 0x00007ff27cfa5218 (object 0x00000002fae5ebe0, a org.apache.hadoop.hdfs.LeaseRenewer),
>   which is held by "LeaseRenewer:hbaseadmin@hbase-ns-gdt-sh-marvel"
> "LeaseRenewer:hbaseadmin@hbase-ns-gdt-sh-marvel":
>   waiting to lock monitor 0x00007ff2e67e16a8 (object 0x0000000486ce6620, a org.apache.hadoop.hdfs.DFSOutputStream),
>   which is held by "MemStoreFlusher.0"
> "MemStoreFlusher.0":
>   waiting to lock monitor 0x00007ff27cfa5218 (object 0x00000002fae5ebe0, a org.apache.hadoop.hdfs.LeaseRenewer),
>   which is held by "LeaseRenewer:hbaseadmin@hbase-ns-gdt-sh-marvel"
> Java stack information for the threads listed above:
> ===================================================
> "MemStoreFlusher.1":
>       at org.apache.hadoop.hdfs.LeaseRenewer.addClient(LeaseRenewer.java:216)
>       - waiting to lock <0x00000002fae5ebe0> (a org.apache.hadoop.hdfs.LeaseRenewer)
>       at org.apache.hadoop.hdfs.LeaseRenewer.getInstance(LeaseRenewer.java:81)
>       at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:648)
>       at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:659)
>       at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:1882)
>       - locked <0x000000055b606cb0> (a org.apache.hadoop.hdfs.DFSOutputStream)
>       at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:71)
>       at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:104)
>       at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:250)
>       at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.close(HFileWriterV2.java:402)
>       at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:974)
>       at org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:78)
>       at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75)
>       - locked <0x000000059869eed8> (a java.lang.Object)
>       at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:812)
>       at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:1974)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1795)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1678)
>       at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1591)
>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:472)
>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:211)
>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$500(MemStoreFlusher.java:66)
>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:238)
>       at java.lang.Thread.run(Thread.java:744)
> "LeaseRenewer:hbaseadmin@hbase-ns-gdt-sh-marvel":
>       at org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:1822)
>       - waiting to lock <0x0000000486ce6620> (a org.apache.hadoop.hdfs.DFSOutputStream)
>       at org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:780)
>       at org.apache.hadoop.hdfs.DFSClient.abort(DFSClient.java:753)
>       at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:453)
>       - locked <0x00000002fae5ebe0> (a org.apache.hadoop.hdfs.LeaseRenewer)
>       at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
>       at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
>       at java.lang.Thread.run(Thread.java:744)
> "MemStoreFlusher.0":
>       at org.apache.hadoop.hdfs.LeaseRenewer.addClient(LeaseRenewer.java:216)
>       - waiting to lock <0x00000002fae5ebe0> (a org.apache.hadoop.hdfs.LeaseRenewer)
>       at org.apache.hadoop.hdfs.LeaseRenewer.getInstance(LeaseRenewer.java:81)
>       at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:648)
>       at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:659)
>       at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:1882)
>       - locked <0x0000000486ce6620> (a org.apache.hadoop.hdfs.DFSOutputStream)
>       at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:71)
>       at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:104)
>       at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:250)
>       at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.close(HFileWriterV2.java:402)
>       at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:974)
>       at org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:78)
>       at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75)
>       - locked <0x00000004888f6848> (a java.lang.Object)
>       at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:812)
>       at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:1974)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1795)
>       at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1678)
>       at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1591)
>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:472)
>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:435)
>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:66)
>       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:253)
>       at java.lang.Thread.run(Thread.java:744)
> Found 1 deadlock. 
> **********************************************************************
> The thread "MemStoreFlusher.0" is closing an output stream and removing its lease. On the other side, the daemon thread "LeaseRenewer" failed to reach the active NameNode to renew the lease (it got a SocketTimeoutException because the network was unstable), so it aborts the output streams. That is how the deadlock arises.
> It seems this is still not solved in Hadoop 2.7.1. If confirmed, we can fix the issue.
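> For illustration, the cycle reduces to a plain two-lock ordering inversion: close() holds the stream monitor and then asks for the renewer monitor, while the renewer thread holds its own monitor and then tries to lock the stream in abort(). The sketch below is not HDFS code; the class and lock names are made up, but it reproduces the same hang pattern that jstack reports above.
>
>     // Illustrative only: "streamLock" stands in for the DFSOutputStream monitor,
>     // "renewerLock" for the LeaseRenewer monitor. Neither name exists in HDFS.
>     public class LockOrderInversionDemo {
>       private static final Object streamLock = new Object();   // ~ DFSOutputStream
>       private static final Object renewerLock = new Object();  // ~ LeaseRenewer
>
>       public static void main(String[] args) {
>         // Like MemStoreFlusher.0: DFSOutputStream.close() holds the stream lock,
>         // then endFileLease() needs the renewer lock.
>         Thread closer = new Thread(() -> {
>           synchronized (streamLock) {
>             pause();
>             synchronized (renewerLock) { System.out.println("close finished"); }
>           }
>         }, "closer");
>
>         // Like LeaseRenewer.run(): it holds the renewer lock, then abort()
>         // tries to lock every open output stream.
>         Thread renewer = new Thread(() -> {
>           synchronized (renewerLock) {
>             pause();
>             synchronized (streamLock) { System.out.println("abort finished"); }
>           }
>         }, "renewer");
>
>         closer.start();
>         renewer.start();
>         // With the pauses, both threads almost always block forever, which is
>         // exactly the cycle shown in the jstack output.
>       }
>
>       private static void pause() {
>         try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
>       }
>     }
>
> The usual remedy for this pattern is to take the two monitors in a consistent order, or to avoid holding one while asking for the other.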



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
