[ https://issues.apache.org/jira/browse/HDFS-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vishal Rajan updated HDFS-5809:
-------------------------------
    Environment: jdk1.6/java 1.7, centos6.4/debian6, 2.0.0-cdh4.5.0  (was: jdk1.6, centos6.4, 2.0.0-cdh4.5.0)

> BlockPoolSliceScanner and high speed hdfs appending make datanode to drop into infinite loop
> --------------------------------------------------------------------------------------------
>
>                 Key: HDFS-5809
>                 URL: https://issues.apache.org/jira/browse/HDFS-5809
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.0.0-alpha
>         Environment: jdk1.6/java 1.7, centos6.4/debian6, 2.0.0-cdh4.5.0
>            Reporter: ikweesung
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>              Labels: blockpoolslicescanner, datanode, infinite-loop
>             Fix For: 2.6.0
>
>         Attachments: HDFS-5809.001.patch
>
>
> {{BlockPoolSliceScanner#scan}} contains a "while" loop that continues to
> verify (i.e. scan) blocks until the {{blockInfoSet}} is empty (or some other
> conditions like a timeout have occurred.) In order to do this, it calls
> {{BlockPoolSliceScanner#verifyFirstBlock}}. This is intended to grab the
> first block in the {{blockInfoSet}}, verify it, and remove it from that set.
> ({{blockInfoSet}} is sorted by last scan time.) Unfortunately, if we hit a
> certain bug in {{updateScanStatus}}, the block may never be removed from
> {{blockInfoSet}}. When this happens, we keep rescanning the exact same block
> until the timeout hits.
>
> The bug is triggered when a block winds up in {{blockInfoSet}} but not in
> {{blockMap}}. You can see it clearly in this code:
> {code}
>   private synchronized void updateScanStatus(Block block,
>                                              ScanType type,
>                                              boolean scanOk) {
>     BlockScanInfo info = blockMap.get(block);
>
>     if ( info != null ) {
>       delBlockInfo(info);
>     } else {
>       // It might already be removed. Thats ok, it will be caught next time.
>       info = new BlockScanInfo(block);
>     }
> {code}
> If {{info == null}}, we never call {{delBlockInfo}}, the function which is
> intended to remove the {{blockInfoSet}} entry.
>
> Luckily, there is a simple fix here... the variable that {{updateScanStatus}}
> is being passed is actually a BlockInfo object, so we can simply call
> {{delBlockInfo}} on it directly, without doing a lookup in the {{blockMap}}.
> This is both faster and more robust.
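
The failure mode described in the issue can be reproduced in a small standalone model. The sketch below is not the Hadoop code and not the attached patch: `ScanLoopSketch`, `Info`, `updateScanStatusBuggy`, `updateScanStatusFixed`, `withStaleEntry`, and the `maxIterations` cap (standing in for the scan timeout) are all hypothetical names introduced for illustration. It assumes, as the description states, that a successful scan removes the first entry from the sorted set, and it shows how an entry that exists in the set but not in the map is never removed when removal depends on a map lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Simplified, hypothetical model of the scanner's two collections.
// This is NOT the real BlockPoolSliceScanner; names below are illustrative.
class ScanLoopSketch {

    // Stand-in for a per-block scan record, ordered by last scan time, then id.
    static final class Info implements Comparable<Info> {
        final long blockId;
        final long lastScanTime;

        Info(long blockId, long lastScanTime) {
            this.blockId = blockId;
            this.lastScanTime = lastScanTime;
        }

        @Override
        public int compareTo(Info other) {
            int byTime = Long.compare(lastScanTime, other.lastScanTime);
            return byTime != 0 ? byTime : Long.compare(blockId, other.blockId);
        }
    }

    final TreeSet<Info> blockInfoSet = new TreeSet<>();
    final Map<Long, Info> blockMap = new HashMap<>();

    // Removes a record from both collections.
    void delBlockInfo(Info info) {
        blockInfoSet.remove(info);
        blockMap.remove(info.blockId);
    }

    // Buggy variant: removal only happens if the map lookup succeeds.
    void updateScanStatusBuggy(Info first) {
        Info info = blockMap.get(first.blockId);
        if (info != null) {
            delBlockInfo(info);
        }
        // On a miss, the entry stays in blockInfoSet and is picked again.
    }

    // Fixed variant: remove the record that was handed in, no lookup needed.
    void updateScanStatusFixed(Info first) {
        delBlockInfo(first);
    }

    // Processes the first entry until the set is empty or the cap is reached.
    // Returns the number of iterations performed.
    int scan(boolean fixed, int maxIterations) {
        int iterations = 0;
        while (!blockInfoSet.isEmpty() && iterations < maxIterations) {
            Info first = blockInfoSet.first();
            if (fixed) {
                updateScanStatusFixed(first);
            } else {
                updateScanStatusBuggy(first);
            }
            iterations++;
        }
        return iterations;
    }

    // Builds a scanner where block 1 is in the set but missing from the map.
    static ScanLoopSketch withStaleEntry() {
        ScanLoopSketch s = new ScanLoopSketch();
        Info stale = new Info(1L, 100L);
        Info normal = new Info(2L, 200L);
        s.blockInfoSet.add(stale);
        s.blockInfoSet.add(normal);
        s.blockMap.put(normal.blockId, normal);
        return s;
    }

    public static void main(String[] args) {
        System.out.println("buggy iterations: " + withStaleEntry().scan(false, 1000));
        System.out.println("fixed iterations: " + withStaleEntry().scan(true, 1000));
    }
}
```

In this model the buggy variant spins on the stale entry until the cap is hit, while the fixed variant drains both entries and stops. Passing the record itself to the removal step avoids relying on the two collections being in sync.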