[ 
https://issues.apache.org/jira/browse/HDFS-9874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205137#comment-15205137
 ] 

Wei-Chiu Chuang commented on HDFS-9874:
---------------------------------------

Thanks for looking into it. Maybe the NPE is unrelated.
I'm not able to make the test fail, so it could be an intermittent flaky test.
In any case, it would be great if you could improve the test diagnostics 
using {{GenericTestUtils#assertExceptionContains}}. This utility method prints 
the stack trace if the exception message doesn't contain the expected text.
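
For example, a minimal sketch of how the assertion could be used (the helper 
method and the expected message text below are hypothetical; only the 
{{GenericTestUtils.assertExceptionContains}} call itself is the utility being 
suggested):
{code:title=AssertExceptionContainsExample.java|borderStyle=solid}
import static org.junit.Assert.fail;

import java.io.IOException;
import org.apache.hadoop.test.GenericTestUtils;

public class AssertExceptionContainsExample {

  // Hypothetical operation that is expected to fail once its volume
  // has been removed from the dataset.
  static void writeToRemovedVolume() throws IOException {
    throw new IOException("Volume volumeA is removed from FsDataset");
  }

  public static void main(String[] args) throws Exception {
    try {
      writeToRemovedVolume();
      fail("Expected an IOException after the volume was removed");
    } catch (IOException ioe) {
      // If the message does not contain the expected text, this throws an
      // AssertionError that carries the original exception, so the full
      // stack trace shows up in the test output instead of a bare
      // assertion failure.
      GenericTestUtils.assertExceptionContains("is removed", ioe);
    }
  }
}
{code}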

> Long living DataXceiver threads cause volume shutdown to block.
> ---------------------------------------------------------------
>
>                 Key: HDFS-9874
>                 URL: https://issues.apache.org/jira/browse/HDFS-9874
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>            Priority: Critical
>             Fix For: 2.7.3
>
>         Attachments: HDFS-9874-trunk-1.patch, HDFS-9874-trunk-2.patch, 
> HDFS-9874-trunk.patch
>
>
> One of the failed volume shutdown took 3 days to complete.
> Below are the relevant datanode logs while shutting down a volume (due to 
> disk failure)
> {noformat}
> 2016-02-21 10:12:55,333 [Thread-49277] WARN impl.FsDatasetImpl: Removing 
> failed volume volumeA/current: 
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not 
> writable: volumeA/current/BP-1788428031-nnIp-1351700107344/current/finalized
>         at 
> org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:194)
>         at 
> org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:308)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.checkDirs(FsVolumeImpl.java:786)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:242)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-02-21 10:12:55,334 [Thread-49277] INFO datanode.BlockScanner: Removing 
> scanner for volume volumeA (StorageID DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23)
> 2016-02-21 10:12:55,334 [VolumeScannerThread(volumeA)] INFO 
> datanode.VolumeScanner: VolumeScanner(volumeA, 
> DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23) exiting.
> 2016-02-21 10:12:55,335 [VolumeScannerThread(volumeA)] WARN 
> datanode.VolumeScanner: VolumeScanner(volumeA, 
> DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23): error saving 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@4169ad8b.
> java.io.FileNotFoundException: 
> volumeA/current/BP-1788428031-nnIp-1351700107344/scanner.cursor.tmp 
> (Read-only file system)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.save(FsVolumeImpl.java:669)
>         at 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner.saveBlockIterator(VolumeScanner.java:314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)
> 2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to 
> delete old dfsUsed file in 
> volumeA/current/BP-1788428031-nnIp-1351700107344/current
> 2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Failed to 
> write dfsUsed to 
> volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed
> java.io.FileNotFoundException: 
> volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed (Read-only 
> file system)
>               at java.io.FileOutputStream.open(Native Method)
>               at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
>               at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
>               at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.saveDfsUsed(BlockPoolSlice.java:247)
>               at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.shutdown(BlockPoolSlice.java:698)
>               at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.shutdown(FsVolumeImpl.java:815)
>               at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.removeVolume(FsVolumeList.java:328)
>               at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:250)
>               at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
>               at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
>               at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
>               at 
> org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
>               at java.lang.Thread.run(Thread.java:745)
> 2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removed 
> volume: volumeA/current
> 2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Completed 
> checkDirs. Found 1 failure volumes.
> 2016-02-24 16:05:53,286 [Thread-49277] INFO datanode.DataNode: Deactivating 
> volumes (clear failure=false): volumeA
> 2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removing 
> volumeA from FsDataset.
> 2016-02-24 16:05:53,342 [Thread-49277] INFO common.Storage: Removing block 
> level storage: volumeA/current/BP-1788428031-nnIp-1351700107344
> 2016-02-24 16:05:53,345 [Thread-49277] WARN datanode.DataNode: 
> DataNode.handleDiskError: Keep Running: true
> {noformat} 
> The datanode waits for the reference count to become zero before shutting 
> down the volume.
> {code:title=FsVolumeImpl.java|borderStyle=solid}
> while (this.reference.getReferenceCount() > 0) {
>   if (FsDatasetImpl.LOG.isDebugEnabled()) {
>     FsDatasetImpl.LOG.debug(String.format(
>         "The reference count for %s is %d, wait to be 0.",
>         this, reference.getReferenceCount()));
>   }
>   try {
>     Thread.sleep(SLEEP_MILLIS);
>   } catch (InterruptedException e) {
>     throw new IOException(e);
>   }
> }
> {code}
> Just before the datanode logged the following line, 
> {noformat} 
> 2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to 
> delete old dfsUsed file in 
> volumeA/current/BP-1788428031-nnIp-1351700107344/current
> {noformat}
> I saw the following stack trace:
> {noformat}
> 2016-02-24 16:05:53,017 [DataXceiver for client 
> DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO 
> datanode.DataNode: Exception for 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736
> java.io.IOException: Premature EOF from inputStream
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-02-24 16:05:53,018 [PacketResponder: 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, 
> type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: 
> PacketResponder: 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, 
> type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
> 2016-02-24 16:05:53,018 [PacketResponder: 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, 
> type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: 
> PacketResponder: 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, 
> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> 2016-02-24 16:05:53,018 [DataXceiver for client 
> DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO 
> datanode.DataNode: opWriteBlock 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736 received 
> exception java.io.IOException: Prematur
> e EOF from inputStream
> 2016-02-24 16:05:53,018 [DataXceiver for client 
> DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] ERROR 
> datanode.DataNode: thisDNName:1004:DataXceiver error processing WRITE_BLOCK 
> operation  src: /upStreamDNIp:55710 dst: /thisDNIp
> :1004
> java.io.IOException: Premature EOF from inputStream
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Tracking the block blk_7059462432_1109832821906, I see that it was created at 
> 2016-02-17 15:06:28,256:
> {noformat}
> 2016-02-17 15:06:28,928 [DataXceiver for client 
> DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55590 [Receiving block 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO 
> datanode.DataNode: Receiving 
> BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906 src: 
> /upStreamDNIp:55590 dest: /thisDNIp:1004
> {noformat}
> The job which created this file had been running for more than 7 days, and the 
> client eventually failed to renew its delegation token, so the lease manager 
> failed to renew the lease for this file.
> Once commitBlockSynchronization kicked in, it tried to recover the block, and 
> the DataXceiver thread eventually died, decrementing the reference count.
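> To illustrate why a stuck xceiver thread blocks the shutdown loop quoted 
> above, here is a minimal, self-contained sketch of the same wait-for-zero 
> pattern (generic Java, not the actual FsVolumeImpl/DataXceiver code; class 
> and variable names are made up for illustration):
> {code:title=ReferenceCountWaitSketch.java|borderStyle=solid}
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.atomic.AtomicInteger;
> 
> public class ReferenceCountWaitSketch {
>   // Stand-in for the volume's reference count.
>   static final AtomicInteger referenceCount = new AtomicInteger(0);
> 
>   public static void main(String[] args) throws Exception {
>     CountDownLatch blockedOnClient = new CountDownLatch(1);
> 
>     // Stand-in for a DataXceiver: takes a reference, then blocks on I/O
>     // (modelled here by a latch) and only releases the reference when it dies.
>     Thread xceiver = new Thread(() -> {
>       referenceCount.incrementAndGet();
>       try {
>         blockedOnClient.await();            // stuck waiting on the client for days
>       } catch (InterruptedException ignored) {
>       } finally {
>         referenceCount.decrementAndGet();   // released only when the thread exits
>       }
>     });
>     xceiver.start();
>     while (referenceCount.get() == 0) {
>       Thread.sleep(10);                     // wait for the xceiver to take its reference
>     }
> 
>     // Stand-in for the shutdown loop quoted earlier: it spins until the count
>     // reaches zero, i.e. until something finally makes the xceiver thread exit.
>     while (referenceCount.get() > 0) {
>       System.out.println("reference count is " + referenceCount.get() + ", wait to be 0");
>       Thread.sleep(1000);
>       blockedOnClient.countDown();          // simulate block recovery ending the write
>     }
>     System.out.println("reference count is 0, volume shutdown can complete");
>   }
> }
> {code}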


