[ 
https://issues.apache.org/jira/browse/HDFS-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-9908:
----------------------------------
    Priority: Major  (was: Critical)

> Datanode should tolerate disk failure during NN handshake
> ---------------------------------------------------------
>
>                 Key: HDFS-9908
>                 URL: https://issues.apache.org/jira/browse/HDFS-9908
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.5.0
>         Environment: CDH5.3.3
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>
> DN may treat a disk failure exception as NN handshake exception, and this can 
> prevent a DN to join a cluster even if most of its disks are healthy.
> During NN handshake, DN initializes block pools. It will create a lock files 
> per disk, and then scan the volumes. However, if the scanning throws 
> exceptions due to disk failure, DN will think it's an exception because NN is 
> inconsistent with the local storage (see {{DataNode#initBlockPool}}. As a 
> result, it will attempt to reconnect to NN again.
> However, at this point, DN has not deleted its lock files on the disks. If it 
> reconnects to NN again, it will think the same disks are already being used, 
> and then it will fail handshake again because all disks can not be used (due 
> to locking), and repeatedly. This will happen even if the DN has multiple 
> disks, and only one of them fails. The DN will not be able to connect to NN 
> despite just one failing disk. Note that it is possible to successfully 
> create a lock file on a disk, and then has error scanning the disk.
> We saw this on a CDH 5.3.3 cluster (which is based on Apache Hadoop 2.5.0, 
> and we still see the same bug in 3.0.0 trunk branch). The root cause is that 
> DN treats an internal error (single disk failure) as an external one (NN 
> handshake failure) and we should fix it.
> {code:title=DataNode.java}
> /**
>    * One of the Block Pools has successfully connected to its NN.
>    * This initializes the local storage for that block pool,
>    * checks consistency of the NN's cluster ID, etc.
>    * 
>    * If this is the first block pool to register, this also initializes
>    * the datanode-scoped storage.
>    * 
>    * @param bpos Block pool offer service
>    * @throws IOException if the NN is inconsistent with the local storage.
>    */
>   void initBlockPool(BPOfferService bpos) throws IOException {
>     NamespaceInfo nsInfo = bpos.getNamespaceInfo();
>     if (nsInfo == null) {
>       throw new IOException("NamespaceInfo not found: Block pool " + bpos
>           + " should have retrieved namespace info before initBlockPool.");
>     }
>     
>     setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());
>     // Register the new block pool with the BP manager.
>     blockPoolManager.addBlockPool(bpos);
>     
>     // In the case that this is the first block pool to connect, initialize
>     // the dataset, block scanners, etc.
>     initStorage(nsInfo);
>     // Exclude failed disks before initializing the block pools to avoid 
> startup
>     // failures.
>     checkDiskError();
>     data.addBlockPool(nsInfo.getBlockPoolID(), conf);  <----- this line 
> throws disk error exception
>     blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
>     initDirectoryScanner(conf);
>   }
> {code}
> {{FsVolumeList#addBlockPool}} is the source of exception.
> {code:title=FsVolumeList.java}
>   void addBlockPool(final String bpid, final Configuration conf) throws 
> IOException {
>     long totalStartTime = Time.monotonicNow();
>     
>     final List<IOException> exceptions = Collections.synchronizedList(
>         new ArrayList<IOException>());
>     List<Thread> blockPoolAddingThreads = new ArrayList<Thread>();
>     for (final FsVolumeImpl v : volumes) {
>       Thread t = new Thread() {
>         public void run() {
>           try (FsVolumeReference ref = v.obtainReference()) {
>             FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
>                 " on volume " + v + "...");
>             long startTime = Time.monotonicNow();
>             v.addBlockPool(bpid, conf);
>             long timeTaken = Time.monotonicNow() - startTime;
>             FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
>                 " on " + v + ": " + timeTaken + "ms");
>           } catch (ClosedChannelException e) {
>             // ignore.
>           } catch (IOException ioe) {
>             FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
>                 ". Will throw later.", ioe);
>             exceptions.add(ioe);
>           }
>         }
>       };
>       blockPoolAddingThreads.add(t);
>       t.start();
>     }
>     for (Thread t : blockPoolAddingThreads) {
>       try {
>         t.join();
>       } catch (InterruptedException ie) {
>         throw new IOException(ie);
>       }
>     }
>     if (!exceptions.isEmpty()) {
>       throw exceptions.get(0); <----- here's the original of exception
>     }
>     
>     long totalTimeTaken = Time.monotonicNow() - totalStartTime;
>     FsDatasetImpl.LOG.info("Total time to scan all replicas for block pool " +
>         bpid + ": " + totalTimeTaken + "ms");
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to