[ https://issues.apache.org/jira/browse/HDFS-9781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453821#comment-15453821 ]
Manoj Govindassamy commented on HDFS-9781:
------------------------------------------

Ref:
{noformat}
  @Override
  public Map<DatanodeStorage, BlockListAsLongs> getBlockReports(String bpid) {
    Map<DatanodeStorage, BlockListAsLongs> blockReportsMap =
        new HashMap<DatanodeStorage, BlockListAsLongs>();
    Map<String, BlockListAsLongs.Builder> builders =
        new HashMap<String, BlockListAsLongs.Builder>();

    List<FsVolumeImpl> curVolumes = volumes.getVolumes();
    for (FsVolumeSpi v : curVolumes) {
      builders.put(v.getStorageID(), BlockListAsLongs.builder(maxDataLength));
    }

    try (AutoCloseableLock lock = datasetLock.acquire()) {
      for (ReplicaInfo b : volumeMap.replicas(bpid)) {
        switch(b.getState()) {
          case FINALIZED:
          case RBW:
          case RWR:
            builders.get(b.getVolume().getStorageID()).add(b);   <=== NPE
            break;
          case RUR:
{noformat}

[~daryn],
{quote}
One way to fix this might be to call volumes.waitVolumeRemoved() before volumes.removeVolume() in FsDatasetImpl#removeVolumes()
{quote}
{noformat}
        LOG.info("Removing " + absRoot + " from FsDataset.");

        // Disable the volume from the service.
        asyncDiskService.removeVolume(sd.getCurrentDir());
        volumes.removeVolume(absRoot, clearFailure);
        volumes.waitVolumeRemoved(5000, this);
{noformat}
* I believe {{volumes.waitVolumeRemoved(5000, this)}} cannot be moved before {{volumes.removeVolume(absRoot, clearFailure)}}, because waitVolumeRemoved() depends on the volumesBeingRemoved queue, which is filled in by removeVolume().
* Also, the intention of waitVolumeRemoved() is to wait for the removed volume's reference count to reach zero.

[~virajith], [~daryn],
{quote}
make FsDatasetImpl#getBlockReports() hold the lock while it is referring to both volumes and volumeMap i.e., move synchronized(this) before volumes.getVolumes().
{quote}
* I guess your proposal here is something like below.
(In the latest code base, I am assuming datasetLock is the synchronized block you were referring to.)
** {noformat}
    try (AutoCloseableLock lock = datasetLock.acquire()) {
      curVolumes = volumes.getVolumes();
      for (FsVolumeSpi v : curVolumes) {
        builders.put(v.getStorageID(), BlockListAsLongs.builder(maxDataLength));
      }
{noformat}
* IMHO, moving volumes.getVolumes() alone inside the lock will not solve the problem, particularly the NPE we are seeing in TestDataNodeHotSwapVolumes. The NPE happens when calling add() after builders.get(<storageID>), which means the builders map does not contain the storageID being passed in.
* That is,
** {{curVolumes}} is not the problem here; this list has the right set of volumes after the previous volume removal operation.
** It is the {{ReplicaInfo}} from {{volumeMap.replicas(bpid)}} that still holds a reference to the removed volume.
** But the {{builders}} map does not have the removed volume.
** So {{builders.get(b.getVolume().getStorageID()).add(b);}} turns into *null.add(b)*, hence the NPE.

Please let me know if my understanding is wrong.

> FsDatasetImpl#getBlockReports can occasionally throw NullPointerException
> -------------------------------------------------------------------------
>
>                 Key: HDFS-9781
>                 URL: https://issues.apache.org/jira/browse/HDFS-9781
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0-alpha1
>       Environment: Jenkins
>            Reporter: Wei-Chiu Chuang
>            Assignee: Manoj Govindassamy
>      Attachments: HDFS-9781.01.patch
>
>
> FsDatasetImpl#getBlockReports occasionally throws NPE. The NPE caused
> TestFsDatasetImpl#testRemoveVolumeBeingWritten to time out, because the test
> waits for the call to FsDatasetImpl#getBlockReports to complete without
> exceptions.
> Additionally, the test should be updated to identify an expected exception,
> using {{GenericTestUtils.assertExceptionContains()}}
> {noformat}
> Exception in thread "Thread-20" java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1709)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl$1BlockReportThread.run(TestFsDatasetImpl.java:587)
> 2016-02-08 15:47:30,379 [Thread-21] WARN  impl.TestFsDatasetImpl (TestFsDatasetImpl.java:run(606)) - Exception caught. This should not affect the test
> java.io.IOException: Failed to move meta file for ReplicaBeingWritten, blk_0_0, RBW
>   getNumBytes()     = 0
>   getBytesOnDisk()  = 0
>   getVisibleLength()= 0
>   getVolume()       = /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current
>   getBlockFile()    = /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0
>   bytesAcked=0
>   bytesOnDisk=0 from /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0_0.meta to /home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/finalized/subdir0/subdir0/blk_0_0.meta
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.moveBlockFiles(FsDatasetImpl.java:857)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addFinalizedBlock(BlockPoolSlice.java:295)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addFinalizedBlock(FsVolumeImpl.java:819)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeReplica(FsDatasetImpl.java:1620)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:1601)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl$1ResponderThread.run(TestFsDatasetImpl.java:603)
> Caused by: java.io.IOException: renameTo(src=/home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/rbw/blk_0_0.meta, dst=/home/weichiu/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/Nmi6rYndvr/data0/current/bpid-0/current/finalized/subdir0/subdir0/blk_0_0.meta) failed.
> 	at org.apache.hadoop.io.nativeio.NativeIO.renameTo(NativeIO.java:873)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.moveBlockFiles(FsDatasetImpl.java:855)
> 	... 5 more
> 2016-02-08 15:47:34,381 [Thread-19] INFO  impl.FsDatasetImpl (FsVolumeList.java:waitVolumeRemoved(287)) - Volume reference is released.
> 2016-02-08 15:47:34,384 [Thread-19] INFO  impl.TestFsDatasetImpl (TestFsDatasetImpl.java:testRemoveVolumeBeingWritten(622)) - Volumes removed
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
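For readers following the archived thread: the interleaving Manoj describes — {{builders}} keyed only by the volumes present when the report starts, while {{volumeMap}} still holds a replica pointing at a just-removed volume — can be reproduced in miniature outside HDFS. The sketch below is illustrative only; the class and method names ({{RemovedVolumeRaceSketch}}, {{buggyReport}}, {{guardedReport}}) and the use of plain lists in place of {{BlockListAsLongs.Builder}} are made up for this example and are not HDFS code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Standalone sketch of the HDFS-9781 NPE; names are illustrative, not HDFS code. */
public class RemovedVolumeRaceSketch {

    /** Minimal stand-in for a ReplicaInfo that remembers its volume's storage ID. */
    static final class Replica {
        final String storageId;
        Replica(String storageId) { this.storageId = storageId; }
    }

    /** Mirrors the buggy loop: builders.get(...) is null for a removed volume's replica. */
    static Map<String, List<Replica>> buggyReport(List<String> currentVolumes,
                                                  List<Replica> replicaMap) {
        Map<String, List<Replica>> builders = new HashMap<>();
        for (String v : currentVolumes) {
            builders.put(v, new ArrayList<>());
        }
        for (Replica r : replicaMap) {
            // NPE here when r's volume was removed between the two loops
            builders.get(r.storageId).add(r);
        }
        return builders;
    }

    /** Guarded variant: skip replicas whose volume is no longer present. */
    static Map<String, List<Replica>> guardedReport(List<String> currentVolumes,
                                                    List<Replica> replicaMap) {
        Map<String, List<Replica>> builders = new HashMap<>();
        for (String v : currentVolumes) {
            builders.put(v, new ArrayList<>());
        }
        for (Replica r : replicaMap) {
            List<Replica> b = builders.get(r.storageId);
            if (b != null) {   // volume removed mid-report; skip its replica
                b.add(r);
            }
        }
        return builders;
    }

    public static void main(String[] args) {
        // volumeMap still holds a replica on "vol-2", but "vol-2" was just removed,
        // so only "vol-1" appears in the current volume list.
        List<String> currentVolumes = List.of("vol-1");
        List<Replica> replicaMap =
            List.of(new Replica("vol-1"), new Replica("vol-2"));

        boolean npe = false;
        try {
            buggyReport(currentVolumes, replicaMap);
        } catch (NullPointerException e) {
            npe = true;
        }
        System.out.println("buggy version threw NPE: " + npe);

        Map<String, List<Replica>> report = guardedReport(currentVolumes, replicaMap);
        System.out.println("guarded report, vol-1 replicas: " + report.get("vol-1").size());
    }
}
```

This only illustrates why moving {{volumes.getVolumes()}} inside the lock cannot help by itself: the stale reference lives in the replica map entry, not in the volume list, so either the stale replica must be tolerated (as in the guarded variant) or the report must be excluded while a removal is in flight.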