Wei-Chiu Chuang created HDFS-13672:
--------------------------------------

             Summary: clearCorruptLazyPersistFiles could crash NameNode
                 Key: HDFS-13672
                 URL: https://issues.apache.org/jira/browse/HDFS-13672
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Wei-Chiu Chuang


I started a NameNode on a pretty large fsimage. Since the NameNode was started 
without any DataNodes, all blocks (100 million) were considered "corrupt".

Afterwards, I observed that FSNamesystem#clearCorruptLazyPersistFiles() held the 
write lock for a long time:

{noformat}
18/06/12 12:37:03 INFO namenode.FSNamesystem: FSNamesystem write lock held for 46024 ms via
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:945)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:198)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1689)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.clearCorruptLazyPersistFiles(FSNamesystem.java:5532)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.run(FSNamesystem.java:5543)
java.lang.Thread.run(Thread.java:748)
        Number of suppressed write-lock reports: 0
        Longest write-lock held interval: 46024
{noformat}

Here's the relevant code:

{code}
      writeLock();

      try {
        final Iterator<BlockInfo> it =
            blockManager.getCorruptReplicaBlockIterator();

        while (it.hasNext()) {
          Block b = it.next();
          BlockInfo blockInfo = blockManager.getStoredBlock(b);
          if (blockInfo.getBlockCollection().getStoragePolicyID() ==
              lpPolicy.getId()) {
            filesToDelete.add(blockInfo.getBlockCollection());
          }
        }

        for (BlockCollection bc : filesToDelete) {
          LOG.warn("Removing lazyPersist file " + bc.getName() + " with no 
replicas.");
          changed |= deleteInternal(bc.getName(), false, false, false);
        }
      } finally {
        writeUnlock();
      }
{code}
In essence, the iteration over the corrupt replica list should be broken into 
smaller batches so that the write lock is not held for a single long interval.
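
A minimal, generic sketch of that chunked approach (not a patch against 
FSNamesystem; the class name, batch size, and lock below are purely 
illustrative) might look like the following. In the real code, releasing the 
lock mid-iteration means the corrupt-replica iterator could be invalidated by 
concurrent modifications, so an actual fix would also need to guard the 
iterator or re-fetch it per batch.

{code}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Generic illustration of breaking a long iteration under a write lock into
 * bounded batches, releasing the lock between batches so that other
 * operations can make progress. Names and the batch size are hypothetical.
 */
public class BatchedScrubberSketch {
  // Hypothetical cap on items processed per lock acquisition.
  private static final int MAX_ITEMS_PER_LOCK_HOLD = 10_000;

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  /**
   * Drains the iterator in batches, never holding the write lock for more
   * than MAX_ITEMS_PER_LOCK_HOLD items at a time.
   */
  public <T> List<T> collectInBatches(Iterator<T> it) {
    List<T> collected = new ArrayList<>();
    boolean done = false;
    while (!done) {
      lock.writeLock().lock();
      try {
        int processed = 0;
        while (it.hasNext() && processed < MAX_ITEMS_PER_LOCK_HOLD) {
          collected.add(it.next());
          processed++;
        }
        done = !it.hasNext();
      } finally {
        // Releasing here caps how long the write lock is held per batch.
        lock.writeLock().unlock();
      }
    }
    return collected;
  }

  public static void main(String[] args) {
    List<Integer> items = new ArrayList<>();
    for (int i = 0; i < 25_000; i++) {
      items.add(i);
    }
    List<Integer> out =
        new BatchedScrubberSketch().collectInBatches(items.iterator());
    System.out.println("Collected " + out.size() + " items in batches");
  }
}
{code}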

Since this operation held the NameNode write lock for more than 45 seconds, the 
default ZKFC connection timeout, an extreme case like this (100 million corrupt 
blocks) could lead to a NameNode failover.


