Noah Lorang created HDFS-7489:
---------------------------------

             Summary: Slow FsVolumeList checkDirs can hang datanodes
                 Key: HDFS-7489
                 URL: https://issues.apache.org/jira/browse/HDFS-7489
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.6.0, 2.5.0
            Reporter: Noah Lorang


After upgrading to 2.5.0 (CDH 5.2.1), we started to see datanodes hang their 
heartbeats and stop serving client requests. After some digging, I identified 
the culprit as the checkDiskError() call triggered by catching IOExceptions 
(in our case, SocketExceptions triggered on one datanode by 
ReplicaAlreadyExistsExceptions on another datanode).

Thread dumps reveal that the checkDiskErrors() thread is holding a lock on the 
FsVolumeList:
{code}
"Thread-409" daemon prio=10 tid=0x00007f4e50200800 nid=0x5b8e runnable 
[0x00007f4e2f855000]
   java.lang.Thread.State: RUNNABLE
        at java.io.UnixFileSystem.list(Native Method)
        at java.io.File.list(File.java:973)
        at java.io.File.listFiles(File.java:1051)
        at org.apache.hadoop.util.DiskChecker.checkDirs(DiskChecker.java:89)
        at org.apache.hadoop.util.DiskChecker.checkDirs(DiskChecker.java:91)
        at org.apache.hadoop.util.DiskChecker.checkDirs(DiskChecker.java:91)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:257)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.checkDirs(FsVolumeImpl.java:210)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:180)
        - locked <0x000000063b182ea0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:1396)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$5.run(DataNode.java:2832)
        at java.lang.Thread.run(Thread.java:662)
{code}

Other threads would then lock the FsDatasetImpl and block waiting for the 
FsVolumeList, e.g.:

{code}
"DataXceiver for client  at /10.10.0.52:46643 [Receiving block 
BP-1573746465-127.0.1.1-1352244533715:blk_1073770670_1099996962574]" daemon 
prio=10 tid=0x00007f4e55561000 nid=0x406d waiting for monitor entry 
[0x00007f4e3106d000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.getNextVolume(FsVolumeList.java:64)
        - waiting to lock <0x000000063b182ea0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:927)
        - locked <0x000000063b1f9a48> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:101)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:167)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:604)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:126)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:72)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
        at java.lang.Thread.run(Thread.java:662)
{code}

That lock on the FsDatasetImpl then causes other threads to block:

{code}
"Thread-127" daemon prio=10 tid=0x00007f4e4c67d800 nid=0x2e02 waiting for 
monitor entry [0x00007f4e33390000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:228)
        - waiting to lock <0x000000063b1f9a48> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyBlock(BlockPoolSliceScanner.java:436)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyFirstBlock(BlockPoolSliceScanner.java:523)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scan(BlockPoolSliceScanner.java:684)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scanBlockPoolSlice(BlockPoolSliceScanner.java:650)
        at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.run(DataBlockScanner.java:101)

"DataNode: [[[DISK]file:/data/0/dfs/dn/, [DISK]file:/data/1/dfs/dn/, 
[DISK]file:/data/2/dfs/dn/, [DISK]file:/data/3/dfs/dn/, 
[DISK]file:/data/4/dfs/dn/, [DISK]file:/data/5/dfs/dn/, 
[DISK]file:/data/6/dfs/dn/, [DISK]file:/data/7/dfs/dn/, 
[DISK]file:/data/8/dfs/dn/, [DISK]file:/data/9/dfs/dn/, 
[DISK]file:/data/10/dfs/dn/, [DISK]file:/data/11/dfs/dn/, 
[DISK]file:/data/12/dfs/dn/, [DISK]file:/data/13/dfs/dn/, 
[DISK]file:/data/14/dfs/dn/, [DISK]file:/data/15/dfs/dn/, 
[DISK]file:/data/16/dfs/dn/, [DISK]file:/data/17/dfs/dn/, 
[DISK]file:/data/18/dfs/dn/, [DISK]file:/data/19/dfs/dn/, 
[DISK]file:/data/20/dfs/dn/, [DISK]file:/data/21/dfs/dn/, 
[DISK]file:/data/22/dfs/dn/, [DISK]file:/data/23/dfs/dn/, 
[DISK]file:/data/24/dfs/dn/, [DISK]file:/data/25/dfs/dn/, 
[DISK]file:/data/26/dfs/dn/, [DISK]file:/data/27/dfs/dn/, 
[DISK]file:/data/28/dfs/dn/, [DISK]file:/data/29/dfs/dn/, 
[DISK]file:/data/30/dfs/dn/, [DISK]file:/data/31/dfs/dn/, 
[DISK]file:/data/32/dfs/dn/, [DISK]file:/data/33/dfs/dn/, 
[DISK]file:/data/34/dfs/dn/, [DISK]file:/data/35/dfs/dn/]]  heartbeating to 
bigdata-01.sc-chi-int.37signals.com/10.10.0.211:8020" daemon prio=10 
tid=0x00007f4e553a5800 nid=0x2d66 waiting for monitor entry [0x00007f4e361be000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:116)
        - waiting to lock <0x000000063b1f9a48> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:132)
        - locked <0x000000063b182d80> (a java.lang.Object)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:572)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:677)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:847)
        at java.lang.Thread.run(Thread.java:662)

"DataXceiver for client DFSClient_NONMAPREDUCE_-1948416574_103 at 
/10.10.0.169:44229 [Receiving block 
BP-1573746465-127.0.1.1-1352244533715:blk_1073776794_1099996963072]" daemon 
prio=10 tid=0x00007f4e55431000 nid=0x4ab8 waiting for monitor entry 
[0x00007f4e2a448000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:782)
        - waiting to lock <0x000000063b1f9a48> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:101)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:171)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:604)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:126)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:72)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
        at java.lang.Thread.run(Thread.java:662)

{code}


This occurs, I believe, because FsVolumeList#checkDirs() locks the entire 
FsVolumeList to prevent multiple simultaneous checkDirs() calls. On a slow 
system, under heavy IO, or with many disks, a single checkDirs() call can 
empirically take tens of seconds or longer.
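
For illustration, this is roughly the locking shape the stack traces above point at (a paraphrased sketch, not the actual FsVolumeList source): the slow directory scan and the volume lookup synchronize on the same object, so one scan starves every writer that needs a volume.

{code}
// Paraphrased sketch of the coarse-grained locking implied by the stack traces;
// not the actual Hadoop source. Both methods synchronize on the same instance.
class CoarseVolumeList {
    private final java.util.List<String> volumes = new java.util.ArrayList<String>();

    synchronized void checkDirs() {
        for (String volume : volumes) {
            scanDirectoryTree(volume);   // File.listFiles() recursion; can take tens of seconds
        }
    }

    synchronized String getNextVolume() {   // blocked for the whole duration of checkDirs()
        return volumes.get(0);
    }

    private void scanDirectoryTree(String volume) { /* DiskChecker-style directory walk */ }
}
{code}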

I think this can be improved by holding a separate mutex limited to checkDirs() 
(much like datanode.checkDiskErrors() does) and only locking the full 
FsVolumeList when the volume list needs to be updated because disks were 
removed. I'll attach a patch that does this.
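
As a rough sketch of that idea (my own illustration with assumed field and method names; the attached patch is the authoritative version), checkDirs() serializes on a dedicated mutex and only takes the FsVolumeList lock briefly at the end to drop volumes that failed the check:

{code}
// Illustrative sketch only -- field and method names are assumptions; see the
// attached patch for the real change.
class FineGrainedVolumeList {
    private final Object checkDirsMutex = new Object();  // serializes checkDirs() only
    private final java.util.List<String> volumes = new java.util.ArrayList<String>();

    void checkDirs() {
        java.util.List<String> failed = new java.util.ArrayList<String>();
        synchronized (checkDirsMutex) {
            // The long-running scan runs here without holding the FsVolumeList monitor.
            for (String volume : snapshotVolumes()) {
                if (!scanDirectoryTree(volume)) {
                    failed.add(volume);
                }
            }
        }
        if (!failed.isEmpty()) {
            // Only now take the list lock, and only for a short critical section.
            synchronized (this) {
                volumes.removeAll(failed);
            }
        }
    }

    // No longer blocked behind a multi-second directory scan.
    synchronized String getNextVolume() {
        return volumes.get(0);
    }

    private synchronized java.util.List<String> snapshotVolumes() {
        return new java.util.ArrayList<String>(volumes);
    }

    private boolean scanDirectoryTree(String volume) {
        return true;  // placeholder for DiskChecker-style checks
    }
}
{code}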

We're running this patch in production and it's working as expected -- a lock 
is held locally instead of on the entire FsVolumeList while checkDirs() is 
running:

{code}
"Thread-614" daemon prio=10 tid=0x000000004037b000 nid=0x7331 runnable 
[0x00007f4d45391000]
   java.lang.Thread.State: RUNNABLE
        at java.io.UnixFileSystem.createDirectory(Native Method)
        at java.io.File.mkdir(File.java:1157)
        at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
        at org.apache.hadoop.util.DiskChecker.checkDirs(DiskChecker.java:88)
        at org.apache.hadoop.util.DiskChecker.checkDirs(DiskChecker.java:91)
        at org.apache.hadoop.util.DiskChecker.checkDirs(DiskChecker.java:91)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:257)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.checkDirs(FsVolumeImpl.java:210)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:182)
        - locked <0x000000063b24f540> (a java.lang.Object)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:1396)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$5.run(DataNode.java:2832)
        at java.lang.Thread.run(Thread.java:662)
{code}


Feedback would be appreciated!


