[ https://issues.apache.org/jira/browse/HDFS-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733502#action_12733502 ]
Ruyue Ma commented on HDFS-200:
-------------------------------

to: dhruba borthakur

> This is not related to HDFS-4379. Let me explain why.
>
> The problem is actually related to HDFS-xxx. The namenode waits for 10
> minutes after losing heartbeats from a datanode before declaring it dead.
> During these 10 minutes, the NN is free to choose the dead datanode as a
> possible replica for a newly allocated block.
>
> If, during a write, the dfsclient sees that a block replica location for a
> newly allocated block is not connectable, it re-requests the NN for a
> fresh set of replica locations for the block. It tries this
> dfs.client.block.write.retries times (default 3), sleeping 6 seconds between
> each retry (see DFSClient.nextBlockOutputStream).
>
> This setting works well when you have a reasonably sized cluster; if you
> have only 4 datanodes in the cluster, every retry picks the dead datanode
> and the above logic bails out.
>
> One solution is to change the value of dfs.client.block.write.retries to a
> much larger value, say 200 or so. Better still, increase the number of
> nodes in your cluster.

Our modification: when getting block locations from the namenode, we pass the NN the list of excluded datanodes. The list of dead datanodes applies only to a single block allocation.
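The effect of the proposed exclusion list can be sketched in isolation. The sketch below is illustrative only, with hypothetical stand-ins (pickReplica for the NN's block placement, tryConnect for createBlockOutputStream); it is not the actual DFSClient code. It shows why, on a 4-node cluster with one reachable node, retries that remember failed replicas succeed where blind retries can keep drawing the same dead datanode.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExcludeOnRetry {

    // Stand-in for NN block placement: return the first candidate node
    // that the client has not excluded. Hypothetical helper, not HDFS API.
    static String pickReplica(List<String> allNodes, Set<String> excluded) {
        for (String node : allNodes) {
            if (!excluded.contains(node)) {
                return node;
            }
        }
        return null; // no candidates left
    }

    // Stand-in for createBlockOutputStream(): here only "dn4" is reachable.
    static boolean tryConnect(String node) {
        return "dn4".equals(node);
    }

    public static void main(String[] args) {
        List<String> cluster = List.of("dn1", "dn2", "dn3", "dn4");
        // Exclusion list is scoped to this one block allocation,
        // matching the comment above.
        Set<String> excluded = new HashSet<>();
        int retries = 3; // dfs.client.block.write.retries default

        String chosen = null;
        for (int attempt = 0; attempt <= retries; attempt++) {
            String node = pickReplica(cluster, excluded);
            if (node == null) {
                break; // nothing left to try
            }
            if (tryConnect(node)) {
                chosen = node;
                break;
            }
            // Without this line the NN may hand back the same dead node
            // on every retry; with it, each retry makes progress.
            excluded.add(node);
        }
        System.out.println("chosen replica: " + chosen);
    }
}
```

With exclusion, the four attempts walk through dn1, dn2, dn3, and finally reach dn4; dropping the excluded.add(node) line models the pre-patch behavior, where a small cluster can exhaust all retries on one dead node.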
+++ hadoop-new/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java	2009-07-20 00:19:03.000000000 +0800
@@ -2734,6 +2734,7 @@
     LocatedBlock lb = null;
     boolean retry = false;
     DatanodeInfo[] nodes;
+    DatanodeInfo[] excludedNodes = null;
     int count = conf.getInt("dfs.client.block.write.retries", 3);
     boolean success;
     do {
@@ -2745,7 +2746,7 @@
       success = false;

       long startTime = System.currentTimeMillis();
-      lb = locateFollowingBlock(startTime);
+      lb = locateFollowingBlock(startTime, excludedNodes);
       block = lb.getBlock();
       nodes = lb.getLocations();
@@ -2755,6 +2756,19 @@
       success = createBlockOutputStream(nodes, clientName, false);

       if (!success) {
+        LOG.info("Excluding node: " + nodes[errorIndex]);
+        // Mark datanode as excluded
+        DatanodeInfo errorNode = nodes[errorIndex];
+        if (excludedNodes != null) {
+          DatanodeInfo[] newExcludedNodes = new DatanodeInfo[excludedNodes.length + 1];
+          System.arraycopy(excludedNodes, 0, newExcludedNodes, 0, excludedNodes.length);
+          newExcludedNodes[excludedNodes.length] = errorNode;
+          excludedNodes = newExcludedNodes;
+        } else {
+          excludedNodes = new DatanodeInfo[] { errorNode };
+        }
         LOG.info("Abandoning block " + block);
         namenode.abandonBlock(block, src, clientName);

> In HDFS, sync() not yet guarantees data available to the new readers
> --------------------------------------------------------------------
>
>                 Key: HDFS-200
>                 URL: https://issues.apache.org/jira/browse/HDFS-200
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>         Attachments: 4379_20081010TC3.java, fsyncConcurrentReaders.txt,
> fsyncConcurrentReaders11_20.txt, fsyncConcurrentReaders12_20.txt,
> fsyncConcurrentReaders3.patch, fsyncConcurrentReaders4.patch,
> fsyncConcurrentReaders5.txt, fsyncConcurrentReaders6.patch,
> fsyncConcurrentReaders9.patch,
> hadoop-stack-namenode-aa0-000-12.u.powerset.com.log.gz,
> hypertable-namenode.log.gz, namenode.log, namenode.log, Reader.java,
> Reader.java, reopen_test.sh,
> ReopenProblem.java, Writer.java, Writer.java
>
>
> In the append design doc
> (https://issues.apache.org/jira/secure/attachment/12370562/Appends.doc), it
> says
> * A reader is guaranteed to be able to read data that was 'flushed' before
> the reader opened the file
> However, this feature is not yet implemented. Note that the operation
> 'flushed' is now called "sync".

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.