[ https://issues.apache.org/jira/browse/HDFS-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063360#comment-14063360 ]

Hudson commented on HDFS-5809:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #614 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/614/])
HDFS-5809. BlockPoolSliceScanner and high speed hdfs appending make datanode to drop into infinite loop (cmccabe) (cmccabe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1610790)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceScanner.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/DataNodeTestUtils.java


> BlockPoolSliceScanner and high speed hdfs appending make datanode to drop
> into infinite loop
> --------------------------------------------------------------------------------------------
>
>                 Key: HDFS-5809
>                 URL: https://issues.apache.org/jira/browse/HDFS-5809
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.0.0-alpha
>         Environment: jdk1.6, centos6.4, 2.0.0-cdh4.5.0
>            Reporter: ikweesung
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>              Labels: blockpoolslicescanner, datanode, infinite-loop
>             Fix For: 2.6.0
>
>         Attachments: HDFS-5809.001.patch
>
>
> {{BlockPoolSliceScanner#scan}} contains a "while" loop that continues to
> verify (i.e. scan) blocks until the {{blockInfoSet}} is empty (or some other
> conditions like a timeout have occurred.) In order to do this, it calls
> {{BlockPoolSliceScanner#verifyFirstBlock}}. This is intended to grab the
> first block in the {{blockInfoSet}}, verify it, and remove it from that set.
> ({{blockInfoSet}} is sorted by last scan time.) Unfortunately, if we hit a
> certain bug in {{updateScanStatus}}, the block may never be removed from
> {{blockInfoSet}}. When this happens, we keep rescanning the exact same block
> until the timeout hits.
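
The loop behaviour described above can be modelled in a few lines. This is only an illustrative sketch, not the Hadoop code: `ScanLoopSketch`, its `scan` method, and the iteration `budget` (a deterministic stand-in for the scanner's timeout) are hypothetical names, and blocks are reduced to plain `Long` ids.

```java
import java.util.List;
import java.util.TreeSet;
import java.util.function.Consumer;

public class ScanLoopSketch {
    /**
     * Repeatedly "verifies" the first entry of the sorted set until the set
     * is empty or the budget (standing in for the timeout) is used up.
     * Returns the number of verifications performed.
     */
    static int scan(TreeSet<Long> blockInfoSet,
                    Consumer<Long> updateScanStatus,
                    int budget) {
        int verified = 0;
        while (!blockInfoSet.isEmpty() && verified < budget) {
            Long first = blockInfoSet.first(); // always the head of the set
            verified++;                        // pretend to verify the block
            updateScanStatus.accept(first);    // expected to remove the entry
        }
        return verified;
    }

    public static void main(String[] args) {
        // Healthy case: every verified block is removed from the set.
        TreeSet<Long> healthy = new TreeSet<>(List.of(1L, 2L, 3L));
        System.out.println(scan(healthy, healthy::remove, 1000)); // prints 3

        // Stuck case: the update step never removes the entry, so the same
        // first block is verified again and again until the budget runs out.
        TreeSet<Long> stuck = new TreeSet<>(List.of(1L, 2L, 3L));
        System.out.println(scan(stuck, b -> { }, 1000)); // prints 1000
    }
}
```

In the stuck case the set still holds all three entries afterwards; only the first one was ever looked at.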
> The bug is triggered when a block winds up in {{blockInfoSet}} but not in
> {{blockMap}}. You can see it clearly in this code:
> {code}
>   private synchronized void updateScanStatus(Block block,
>                                              ScanType type,
>                                              boolean scanOk) {
>     BlockScanInfo info = blockMap.get(block);
>
>     if ( info != null ) {
>       delBlockInfo(info);
>     } else {
>       // It might already be removed. Thats ok, it will be caught next time.
>       info = new BlockScanInfo(block);
>     }
> {code}
> If {{info == null}}, we never call {{delBlockInfo}}, the function which is
> intended to remove the {{blockInfoSet}} entry.
> Luckily, there is a simple fix here... the variable that {{updateScanStatus}}
> is being passed is actually a {{BlockScanInfo}} object, so we can simply call
> {{delBlockInfo}} on it directly, without doing a lookup in the {{blockMap}}.
> This is both faster and more robust.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
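
The difference between the lookup-based removal and the direct removal proposed in the issue can be sketched as follows. This is a simplified model under stated assumptions, not the committed patch: `UpdateScanStatusSketch`, `ScanInfo`, `updateByLookup`, and `updateDirect` are hypothetical names, and only the removal step is modelled (whatever the real method does after the quoted excerpt is left out).

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class UpdateScanStatusSketch {
    /** Hypothetical stand-in for the scanner's per-block record. */
    static final class ScanInfo {
        final long blockId;
        final long lastScanTime;
        ScanInfo(long blockId, long lastScanTime) {
            this.blockId = blockId;
            this.lastScanTime = lastScanTime;
        }
    }

    final Map<Long, ScanInfo> blockMap = new HashMap<>();
    // Sorted by last scan time, with the block id as a tie-breaker.
    final TreeSet<ScanInfo> blockInfoSet = new TreeSet<>(
        Comparator.comparingLong((ScanInfo i) -> i.lastScanTime)
                  .thenComparingLong(i -> i.blockId));

    /** Removes the entry from both structures. */
    void delBlockInfo(ScanInfo info) {
        blockInfoSet.remove(info);
        blockMap.remove(info.blockId);
    }

    /** Buggy shape: removal only happens if the blockMap lookup succeeds. */
    void updateByLookup(long blockId) {
        ScanInfo info = blockMap.get(blockId);
        if (info != null) {
            delBlockInfo(info);
        }
        // Otherwise the stale blockInfoSet entry is left in place.
    }

    /** Fixed shape: remove the entry the caller already holds. */
    void updateDirect(ScanInfo info) {
        delBlockInfo(info);
    }

    public static void main(String[] args) {
        UpdateScanStatusSketch s = new UpdateScanStatusSketch();
        ScanInfo orphan = new ScanInfo(7L, 100L);
        s.blockInfoSet.add(orphan); // in the set, but never added to blockMap

        s.updateByLookup(orphan.blockId);
        System.out.println(s.blockInfoSet.size()); // prints 1: entry survives

        s.updateDirect(s.blockInfoSet.first());
        System.out.println(s.blockInfoSet.size()); // prints 0: entry removed
    }
}
```

Passing the set entry itself avoids the map lookup, so the removal cannot be skipped when the two structures disagree.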