Stephen O'Donnell created HDFS-14459:
----------------------------------------

             Summary: ClosedChannelException silently ignored in 
FsVolumeList.addBlockPool()
                 Key: HDFS-14459
                 URL: https://issues.apache.org/jira/browse/HDFS-14459
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 3.3.0
            Reporter: Stephen O'Donnell
            Assignee: Stephen O'Donnell
             Fix For: 3.3.0


Following on from HDFS-14333, I encountered another scenario where a volume 
with some sort of disk-level error can silently fail to have the block pool 
added to it in FsVolumeList.addBlockPool().

In the logs for a recent issue we see the following pattern:

{code}
2019-04-24 04:21:27,690 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added 
volume - /CDH/sdi1/dfs/dn/current, StorageType: DISK
2019-04-24 04:21:27,691 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added new 
volume: DS-694ae931-8a4e-42d5-b2b3-d946e35c6b47
...
2019-04-24 04:21:27,703 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning 
block pool BP-936404344-xxx-1426594942733 on volume /CDH/sdi1/dfs/dn/current...
...

<HERE WE ARE MISSING THE LOG LIKE:
2019-04-24 04:21:27,722 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken 
to scan block pool BP-936404344-xxx-1426594942733 on /CDH/sdi1/dfs/dn/current: 
19ms
>
...
2019-04-24 04:21:29,871 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Adding 
replicas to map for block pool BP-936404344-xxx-1426594942733 on volume 
/CDH/sdi1/dfs/dn/current...
...
2019-04-24 04:21:29,872 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Caught 
exception while adding replicas from /CDH/sdi1/dfs/dn/current. Will throw later.
java.io.IOException: block pool BP-936404344-10.7.192.215-1426594942733 is not 
found
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getBlockPoolSlice(FsVolumeImpl.java:407)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getVolumeMap(FsVolumeImpl.java:864)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$1.run(FsVolumeList.java:191)
{code}

The notable point is that the 'Scanning block pool' step must not have 
completed properly for this volume, yet nothing was logged, and then a slightly 
confusing error is logged when attempting to add the replicas. That error 
occurs because the block pool was never added to the volume by the 
addBlockPool step.
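
For reference, the 'is not found' error above comes from 
FsVolumeImpl.getBlockPoolSlice(). Roughly (from current trunk), it simply looks 
the block pool up in the volume's bpSlices map, so it throws whenever 
addBlockPool() never completed for that volume:

{code}
BlockPoolSlice getBlockPoolSlice(String bpid) throws IOException {
  BlockPoolSlice bp = bpSlices.get(bpid);
  if (bp == null) {
    // The slice is missing because addBlockPool() never populated it
    // for this volume.
    throw new IOException("block pool " + bpid + " is not found");
  }
  return bp;
}
{code}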

The relevant part of the code in 'addBlockPool()' from current trunk looks like:

{code}
    for (final FsVolumeImpl v : volumes) {
      Thread t = new Thread() {
        public void run() {
          try (FsVolumeReference ref = v.obtainReference()) {
            FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
                " on volume " + v + "...");
            long startTime = Time.monotonicNow();
            v.addBlockPool(bpid, conf);
            long timeTaken = Time.monotonicNow() - startTime;
            FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
                " on " + v + ": " + timeTaken + "ms");
          } catch (ClosedChannelException e) {
            // ignore.
          } catch (IOException ioe) {
            FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
                ". Will throw later.", ioe);
            unhealthyDataDirs.put(v, ioe);
          }
        }
      };
      blockPoolAddingThreads.add(t);
      t.start();
    }
{code}

As we get the first log message ('Scanning block pool ...') but not the second 
('Time taken to scan block pool ...'), and nothing else is logged and no 
exception is thrown, the operation must have encountered a 
ClosedChannelException, which is silently ignored.

I am also not sure we should ignore a ClosedChannelException at all, as it 
means the block pool failed to be fully added to the volume. As 
ClosedChannelException is a subclass of IOException, perhaps we can remove that 
catch block entirely?
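
To illustrate, a minimal sketch of that change (not a tested patch) would let 
the ClosedChannelException fall through to the generic IOException handler, so 
the failure is logged and the volume is recorded as unhealthy like any other 
IO error:

{code}
try (FsVolumeReference ref = v.obtainReference()) {
  FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
      " on volume " + v + "...");
  long startTime = Time.monotonicNow();
  v.addBlockPool(bpid, conf);
  long timeTaken = Time.monotonicNow() - startTime;
  FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
      " on " + v + ": " + timeTaken + "ms");
} catch (IOException ioe) {
  // ClosedChannelException is a subclass of IOException, so with the
  // dedicated catch block removed it is logged and tracked here rather
  // than silently swallowed.
  FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
      ". Will throw later.", ioe);
  unhealthyDataDirs.put(v, ioe);
}
{code}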

Finally, HDFS-14333 refactored the above code to allow the DN to better handle 
a disk failure on DN startup. However, if addBlockPool() does throw an 
exception, getAllVolumesMap() will not get called and the DN will end up only 
partly initialized.

DataNode.initBlockPool() calls FsDatasetImpl.addBlockPool(), which looks like 
the following, calling volumes.addBlockPool() and then getAllVolumesMap():

{code}
public void addBlockPool(String bpid, Configuration conf)
      throws IOException {
    LOG.info("Adding block pool " + bpid);
    try (AutoCloseableLock lock = datasetLock.acquire()) {
      volumes.addBlockPool(bpid, conf);
      volumeMap.initBlockPool(bpid);
    }
    volumes.getAllVolumesMap(bpid, volumeMap, ramDiskReplicaTracker);
  }
{code}

This needs to be refactored to catch any AddBlockPoolException raised in 
addBlockPool(), continue on to call getAllVolumesMap(), and then re-throw the 
caught exceptions so the DN can handle the individual volume failures.
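
A sketch of how that could look (assuming mergeException() and hasExceptions() 
as hypothetical helpers on AddBlockPoolException for accumulating per-volume 
failures):

{code}
public void addBlockPool(String bpid, Configuration conf)
    throws IOException {
  LOG.info("Adding block pool " + bpid);
  // Hypothetical accumulator for per-volume failures.
  AddBlockPoolException volumeExceptions = new AddBlockPoolException();
  try (AutoCloseableLock lock = datasetLock.acquire()) {
    try {
      volumes.addBlockPool(bpid, conf);
    } catch (AddBlockPoolException e) {
      volumeExceptions.mergeException(e);
    }
    volumeMap.initBlockPool(bpid);
  }
  try {
    // Still called even if addBlockPool() failed on some volumes, so the
    // DN does not end up partly initialized.
    volumes.getAllVolumesMap(bpid, volumeMap, ramDiskReplicaTracker);
  } catch (AddBlockPoolException e) {
    volumeExceptions.mergeException(e);
  }
  if (volumeExceptions.hasExceptions()) {
    // Re-throw so the DN can handle the individual volume failures.
    throw volumeExceptions;
  }
}
{code}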



