[jira] [Commented] (HDFS-1195) Offer rate limits for replicating data
[ https://issues.apache.org/jira/browse/HDFS-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267740#comment-14267740 ]

Cosmin Lehene commented on HDFS-1195:

[~kevinweil] is this still valid?

Offer rate limits for replicating data
--------------------------------------
Key: HDFS-1195
URL: https://issues.apache.org/jira/browse/HDFS-1195
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode
Affects Versions: 0.20.2
Environment: Linux, Hadoop 0.20.1 CDH
Reporter: Kevin Weil

If a rack of Hadoop nodes goes down, there is a lot of data to re-replicate. It would be great to have a configuration option to rate-limit the amount of bandwidth used for re-replication so as not to saturate network backlinks. There is a similar option for rate-limiting the speed at which a DFS rebalance takes place: dfs.balance.bandwidthPerSec.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
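The dfs.balance.bandwidthPerSec throttle the reporter points to works by pacing writes against a per-period byte budget. As a rough illustration of that mechanism only (class and field names here are invented for the sketch, not actual HDFS code):

```java
// Hypothetical sketch of a per-period byte-budget throttler, similar in
// spirit to what dfs.balance.bandwidthPerSec controls for the balancer.
// Class and field names are invented; this is not actual HDFS code.
public class ThrottlerSketch {
    private final long periodMs = 500;   // budget accounting window
    private final long bytesPerPeriod;   // budget granted per window
    private long curPeriodStart;
    private long bytesLeft;

    public ThrottlerSketch(long bytesPerSec) {
        this.bytesPerPeriod = bytesPerSec * periodMs / 1000;
        this.curPeriodStart = System.currentTimeMillis();
        this.bytesLeft = bytesPerPeriod;
    }

    // Called after sending numBytes; sleeps until enough budget accrues.
    public synchronized void throttle(long numBytes) throws InterruptedException {
        bytesLeft -= numBytes;
        while (bytesLeft < 0) {
            long now = System.currentTimeMillis();
            long wait = curPeriodStart + periodMs - now;
            if (wait > 0) {
                Thread.sleep(wait);
            } else {
                curPeriodStart = now;          // start a new period
                bytesLeft += bytesPerPeriod;   // and grant its budget
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ThrottlerSketch t = new ThrottlerSketch(1000); // limit: 1000 bytes/sec
        long start = System.currentTimeMillis();
        t.throttle(1500); // 1.5x the per-second budget: forces sleeping
        System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
    }
}
```

A re-replication limit along these lines would call throttle() from the datanode's block-transfer path, exactly as the balancer does for its own transfers.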
[jira] Commented: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875519#action_12875519 ]

Cosmin Lehene commented on HDFS-630:

There's a patch for 0.20 adapted by tlipcon. Can we use that?

In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
-------------------------------------------------------------------------------------------------------------------
Key: HDFS-630
URL: https://issues.apache.org/jira/browse/HDFS-630
Project: Hadoop HDFS
Issue Type: Improvement
Components: hdfs client, name-node
Affects Versions: 0.20-append
Reporter: Ruyue Ma
Assignee: Cosmin Lehene
Fix For: 0.21.0
Attachments: 0001-Fix-HDFS-630-0.21-svn-1.patch, 0001-Fix-HDFS-630-0.21-svn-2.patch, 0001-Fix-HDFS-630-0.21-svn.patch, 0001-Fix-HDFS-630-for-0.21-and-trunk-unified.patch, 0001-Fix-HDFS-630-for-0.21.patch, 0001-Fix-HDFS-630-svn.patch, 0001-Fix-HDFS-630-svn.patch, 0001-Fix-HDFS-630-trunk-svn-1.patch, 0001-Fix-HDFS-630-trunk-svn-2.patch, 0001-Fix-HDFS-630-trunk-svn-3.patch, 0001-Fix-HDFS-630-trunk-svn-3.patch, 0001-Fix-HDFS-630-trunk-svn-4.patch, hdfs-630-0.20.txt, HDFS-630.patch

Created from HDFS-200. If, during a write, the DFSClient sees that a block replica location for a newly allocated block is not connectable, it re-requests the NN for a fresh set of replica locations for the block. It tries this dfs.client.block.write.retries times (default 3), sleeping 6 seconds between each retry (see DFSClient.nextBlockOutputStream). This setting works well when you have a reasonably sized cluster; if you have only a few datanodes in the cluster, every retry may pick the same dead datanode, and the above logic bails out.

Our solution: when getting block locations from the namenode, we give the NN the excluded datanodes. The list of dead datanodes is only for one block allocation.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
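The retry-with-exclusion idea described above can be sketched as follows. This is an illustrative toy, not the actual patch: the interface and method names are invented, and the real client passes the excluded nodes through the NameNode's addBlock RPC.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (names invented, not the real HDFS-630 patch) of the
// approach in this issue: the client remembers datanodes it failed to reach
// and hands them back to the NameNode so the next allocation avoids them.
public class ExcludeRetrySketch {
    static final int MAX_RETRIES = 3; // mirrors dfs.client.block.write.retries

    interface NameNode {
        // returns replica locations for the next block, avoiding 'excluded'
        String[] addBlock(String src, List<String> excluded);
    }

    static String[] nextBlockTargets(NameNode nn, String src, List<String> deadNodes) {
        List<String> excluded = new ArrayList<>();
        for (int retry = 0; retry <= MAX_RETRIES; retry++) {
            String[] targets = nn.addBlock(src, excluded);
            String bad = firstUnreachable(targets, deadNodes);
            if (bad == null) {
                return targets;    // every replica location is connectable
            }
            excluded.add(bad);     // avoid this node on the next request
        }
        throw new RuntimeException("could not allocate block for " + src);
    }

    static String firstUnreachable(String[] targets, List<String> dead) {
        for (String t : targets) {
            if (dead.contains(t)) return t;
        }
        return null;
    }

    public static void main(String[] args) {
        // Toy NameNode over three datanodes: returns the first two not excluded.
        NameNode nn = (src, excluded) -> {
            List<String> all = new ArrayList<>(List.of("dn1", "dn2", "dn3"));
            all.removeAll(excluded);
            return all.subList(0, 2).toArray(new String[0]);
        };
        String[] targets = nextBlockTargets(nn, "/f", List.of("dn1"));
        System.out.println(String.join(",", targets)); // prints dn2,dn3
    }
}
```

On a small cluster this converges instead of repeatedly drawing the same dead node, which is the failure mode the reporter describes.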
[jira] Updated: (HDFS-1024) SecondaryNamenode fails to checkpoint because namenode fails with CancelledKeyException
[ https://issues.apache.org/jira/browse/HDFS-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-1024:
Affects Version/s: 0.20.1, 0.20.2, 0.20.3, 0.21.0
Fix Version/s: 0.21.0

Adding affected versions.

SecondaryNamenode fails to checkpoint because namenode fails with CancelledKeyException
---------------------------------------------------------------------------------------
Key: HDFS-1024
URL: https://issues.apache.org/jira/browse/HDFS-1024
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.22.0
Reporter: dhruba borthakur
Assignee: Dmytro Molkov
Priority: Blocker
Fix For: 0.20.3, 0.21.0, 0.22.0
Attachments: HDFS-1024.patch, HDFS-1024.patch.1

The secondary namenode fails to retrieve the entire fsimage from the namenode. It fetches part of the fsimage but believes it has fetched the entire fsimage file and proceeds with the checkpointing. Stack traces will be attached below.
[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834158#action_12834158 ]

Cosmin Lehene commented on HDFS-909:

@Todd what's the state of this patch? This happens more often than I initially thought. I just hit it again.

Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
----------------------------------------------------------------------------------------------------
Key: HDFS-909
URL: https://issues.apache.org/jira/browse/HDFS-909
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
Environment: CentOS
Reporter: Cosmin Lehene
Assignee: Todd Lipcon
Priority: Blocker
Fix For: 0.21.0, 0.22.0
Attachments: hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt

Closing the edits log file can race with a write to the edits log file, resulting in the OP_INVALID end-of-file marker first being overwritten by the concurrent threads (in setReadyToFlush) and then removed twice from the buffer, losing a good byte from the edits log. Example:

{code}
FSNameSystem.rollEditLog() - FSEditLog.divertFileStreams() - FSEditLog.closeStream() - EditLogOutputStream.setReadyToFlush()
FSNameSystem.rollEditLog() - FSEditLog.divertFileStreams() - FSEditLog.closeStream() - EditLogOutputStream.flush() - EditLogFileOutputStream.flushAndSync()

OR

FSNameSystem.rollFSImage() - FSImage.rollFSImage() - FSEditLog.purgeEditLog() - FSEditLog.revertFileStreams() - FSEditLog.closeStream() - EditLogOutputStream.setReadyToFlush()
FSNameSystem.rollFSImage() - FSImage.rollFSImage() - FSEditLog.purgeEditLog() - FSEditLog.revertFileStreams() - FSEditLog.closeStream() - EditLogOutputStream.flush() - EditLogFileOutputStream.flushAndSync()

VERSUS

FSNameSystem.completeFile - FSEditLog.logSync() - EditLogOutputStream.setReadyToFlush()
FSNameSystem.completeFile - FSEditLog.logSync() - EditLogOutputStream.flush() - EditLogFileOutputStream.flushAndSync()

OR any FSEditLog.write
{code}

Access to the edits flush operations is synchronized only at the FSEditLog.logSync() method level. At a lower level, however, access to EditLogOutputStream.setReadyToFlush(), flush() and flushAndSync() is NOT synchronized; these can be called from concurrent threads as in the example above. So if a rollEditLog or rollFSImage happens at the same time as a write operation, they can race in EditLogFileOutputStream.setReadyToFlush, which will overwrite the last byte (normally FSEditLog.OP_INVALID, the end-of-file marker) and then remove it twice (once from each thread) in flushAndSync()! Hence a valid byte goes missing from the edits log, which leads to a silent SecondaryNameNode failure and a full HDFS failure upon cluster restart.

We got to this point after investigating a corrupted edits file that left HDFS unable to start with:

{code:title=namenode.log}
java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is 768.
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
{code}

EDIT: moved the logs to a comment to make this readable
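The unsynchronized marker-append/buffer-swap/marker-trim sequence described above becomes safe once all three steps happen under one lock. A minimal sketch of that idea, with simplified, invented names rather than the actual EditLogFileOutputStream code:

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Simplified sketch of the double-buffered edit log flush discussed above,
// with the marker-append, buffer-swap and marker-trim steps guarded by the
// same monitor so a concurrent roll cannot interleave with logSync().
// Names are invented; this is not the actual EditLogFileOutputStream code.
public class EditLogSketch {
    static final byte OP_INVALID = (byte) 0xFF; // end-of-file marker

    private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();
    private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();

    public synchronized void write(byte op) {
        bufCurrent.write(op);
    }

    // setReadyToFlush + flushAndSync as one atomic step: append the marker,
    // swap buffers, then trim the marker before handing bytes to the file.
    // In the buggy code two threads could run this pair concurrently and
    // trim one good byte too many.
    public synchronized byte[] logSync() {
        bufCurrent.write(OP_INVALID);
        ByteArrayOutputStream tmp = bufReady;  // swap: current becomes ready
        bufReady = bufCurrent;
        bufCurrent = tmp;
        byte[] withMarker = bufReady.toByteArray();
        bufReady.reset();
        return Arrays.copyOf(withMarker, withMarker.length - 1);
    }

    public static void main(String[] args) {
        EditLogSketch log = new EditLogSketch();
        log.write((byte) 0x00);
        log.write((byte) 0x01);
        byte[] flushed = log.logSync();
        System.out.println(flushed.length); // prints 2: no byte lost
    }
}
```

With the lock held across the whole pair, the marker is appended exactly once and trimmed exactly once per flush, so the last real operation byte survives.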
[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834264#action_12834264 ]

Cosmin Lehene commented on HDFS-909:

@Todd, thanks! We're using 0.21 :)
[jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-909:

Description: moved the hex edits-log excerpts into a comment to make the issue readable; the description text is otherwise unchanged.
[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828543#action_12828543 ]

Cosmin Lehene commented on HDFS-909:

@Konstantin I moved the details to a comment and broke them into more lines, but I missed the log entry that really messes up the layout. Unfortunately, I can't edit the comment, so if you can, please break the log entry lines in my comment to get a decent layout on this page. Sorry and thanks.

PS: I'll look at the code again to see the race issue you described.
[jira] Commented: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804947#action_12804947 ]

Cosmin Lehene commented on HDFS-630:

I'm glad it finally got into both 0.21 and trunk. It was a long-lived issue. Thanks for the support! :)
[jira] Created: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
----------------------------------------------------------------------------------------------------
Key: HDFS-909
URL: https://issues.apache.org/jira/browse/HDFS-909
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 0.21.0, 0.22.0
Environment: CentOS
Reporter: Cosmin Lehene
Priority: Blocker
Fix For: 0.21.0, 0.22.0

Closing the edits log file can race with a write to the edits log file, resulting in the OP_INVALID end-of-file marker being overwritten by the concurrent threads (in setReadyToFlush) and then removed twice from the buffer, losing a good byte from the edits log (see the call-chain example and analysis quoted in the comments above).

In the edits file we found the first 2 entries:

{code:title=edits}
FFEC090005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D31323631303832363238303934000836373130383836340003F6CBB87EF376E3E604039665F9549DE069A5735E04039665ADCC71A050B16ABF015A179A00039665066861646F6F700A737570657267726F757001010003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D3132363130383236333138363902
...
{code}

This is the completeFile operation that's missing the last byte:

{code:title=completeFile}
FFEC090005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D31323631303832363238303934000836373130383836340003F6CBB87EF376E3E604039665F9549DE069A5735E04039665ADCC71A050B16ABF015A179A00039665066861646F6F700A737570657267726F757001??
{code}

followed by a rename operation:

{code:title=rename}
010003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D31323631303832363331383639
{code}

The first byte of the rename was instead read as part of the completeFile() operation. As a result the next operation was read as 0x00 (OP_ADD), followed by an int (length) of 0x300, i.e. 768, which failed in the following code:

{code:title=FSEditLog.java}
case OP_ADD:
case OP_CLOSE: {
  // versions > 0 support per file replication
  // get name and replication
  int length = in.readInt();
{code}
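The parse failure above is a general property of length-prefixed record streams: losing one byte shifts every later field, so the opcode position lands on an unrelated byte and the following readInt() assembles a bogus length. A small illustration (the byte values here are invented, not taken from the dump above):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Illustration of how losing a single byte misaligns a length-prefixed
// record stream, as in the namenode.log error above. Byte values invented.
public class MisalignDemo {
    public static void main(String[] args) throws IOException {
        // record: opcode 0x01 (say, rename) then a 4-byte length 0x0000003F
        byte[] intact = {0x01, 0x00, 0x00, 0x00, 0x3F};
        DataInputStream ok = new DataInputStream(new ByteArrayInputStream(intact));
        System.out.println(ok.readByte() + " " + ok.readInt()); // prints "1 63"

        // same stream after the previous record consumed the leading 0x01,
        // so parsing starts one byte late (the trailing 0x00 stands in for
        // the next record's first byte, pulled in to complete the int)
        byte[] shifted = {0x00, 0x00, 0x00, 0x3F, 0x00};
        DataInputStream bad = new DataInputStream(new ByteArrayInputStream(shifted));
        System.out.println(bad.readByte() + " " + bad.readInt()); // prints "0 16128"
    }
}
```

The shifted read yields opcode 0x00 (OP_ADD in the edit log's encoding) with a nonsense length, which is exactly the shape of the "logVersion is -20 but writables.length is 768" failure.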
[jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802941#action_12802941 ]

Cosmin Lehene commented on HDFS-909:

Hi Todd, I haven't checked yet, so it may affect 0.20 as well. I forgot to add that the issue is particularly nasty because it first fails silently. In our case, the log was corrupted on December 17th, but we only discovered it yesterday when we restarted HDFS. It can be detected early by monitoring the secondary-namenode.out log file.
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:

Status: Patch Available (was: Open)
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Status: Open (was: Patch Available)

Tests fail erratically; canceling again.
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Attachment: 0001-Fix-HDFS-630-0.21-svn-2.patch

Attaching 0.21 patch with javadoc link fixed.
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Attachment: 0001-Fix-HDFS-630-trunk-svn-4.patch

Patch for trunk with javadoc link fixed. The TestFiHFlush test that failed previously seems to work fine when running tests using ant, so nothing done regarding that.
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Status: Open (was: Patch Available)

Canceling to restart build.
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Status: Patch Available (was: Open)

Trying the trunk patch one more time. I don't exactly know how to trigger a 0.21 patch/build.
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Status: Patch Available (was: Open)

I have an "it runs on my machine" feeling. Trying once more.
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Attachment: 0001-Fix-HDFS-630-0.21-svn-1.patch
            0001-Fix-HDFS-630-trunk-svn-3.patch

New patches for 0.21 and trunk. ClientProtocol versionID is 53L for 0.21, 54L for trunk.
[jira] Commented: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793640#action_12793640 ]

Cosmin Lehene commented on HDFS-630:

@stack unfortunately, no. The patch needs to be changed for trunk.

{code:title=ClientProtocol.java}
Index: src/java/org/apache/hadoop/hdfs/protocol/ClientProtocol.java
===
--- src/java/org/apache/hadoop/hdfs/protocol/ClientProtocol.java (revision 891402)
+++ src/java/org/apache/hadoop/hdfs/protocol/ClientProtocol.java (working copy)
@@ -44,9 +44,9 @@
  * Compared to the previous version the following changes have been introduced:
  * (Only the latest change is reflected.
  * The log of historical changes can be retrieved from the svn).
- * 50: change LocatedBlocks to include last block information.
+ * 51: changed addBlock to include a list of excluded datanodes.
  */
- public static final long versionID = 50L;
+ public static final long versionID = 51L;
{code}

The versionID in 0.21 changes from 50L to 51L. The problem is that on trunk it is already 52L, so there it should probably change from 52L to 53L. This could, however, be ignored on trunk and changed independently. I'm not sure what the right approach is. I could create another patch for trunk, but that would just render versionID meaningless - it's 51L on 0.21, but on trunk 51L is something else.
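The versionID concern above comes down to the compatibility check comparing only a number, not what the number means on each branch. A minimal plain-Java sketch of that reasoning (this is not Hadoop's actual VersionedProtocol/RPC code; the class and method names here are made up):

```java
// Hypothetical illustration: a version check that only compares numbers
// cannot tell that 51L on 0.21 and 51L on trunk name different APIs.
public class VersionCheck {
    // Stand-in for an RPC handshake: versions match iff the numbers match.
    static boolean compatible(long clientVersion, long serverVersion) {
        return clientVersion == serverVersion;
    }

    public static void main(String[] args) {
        long v021WithPatch = 51L; // 0.21 + this patch (excluded-nodes addBlock)
        long trunkMeaning = 51L;  // but on trunk, 51L denoted a different change
        // The check passes even though the protocols differ semantically.
        System.out.println(compatible(v021WithPatch, trunkMeaning)); // true
    }
}
```

This is why the comment argues that reusing 51L across branches would make the field meaningless: a numeric match no longer implies a semantic match.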
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Attachment: 0001-Fix-HDFS-630-0.21-svn.patch

New patch for 0.21:
- removed previous addBlock method
- changed ClientProtocol version
- changed log level in DFSClient to debug for the node exclusion operation
- refactored TestDFSClientExcludedNodes to JUnit 4
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Status: Open (was: Patch Available)

Can't see that build issue locally and can't figure out what caused it on the build server. Trying one more time.
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Status: Patch Available (was: Open)
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Attachment: 0001-Fix-HDFS-630-trunk-svn-2.patch

I reformatted the code a little, trying to stay close to the files it changes. There's no consistent style across files however.
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Attachment: 0001-Fix-HDFS-630-trunk-svn-1.patch

The last patch doesn't apply on trunk after the commit for HDFS-764. Here's a new patch for trunk that also fixes the previous javac warning.
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Attachment: 0001-Fix-HDFS-630-svn.patch

Fixed the old method in NameNode.addBlock: it returned
{code}
addBlock(src, clientName, null, null);
{code}
instead of
{code}
addBlock(src, clientName, previous, null);
{code}
so when called it never committed the previous block.
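The effect of that delegation bug can be shown with a toy sketch. Everything here is a hypothetical stand-in (AddBlockOverloads, the String "block" types, the committed flag), not the real NameNode source; it only models "forwarding null for previous means the prior block is never committed":

```java
// Hypothetical model of an old-signature overload delegating to the new one.
public class AddBlockOverloads {
    // Simulated NameNode bookkeeping: was the previous block committed?
    static boolean committed;

    // New-signature method: commits the prior block before allocating.
    static String addBlock(String src, String client, String previous, String[] excluded) {
        if (previous != null) committed = true;
        return "block-for-" + src;
    }

    // Buggy delegation: drops `previous`, so the prior block is never committed.
    static String addBlockBuggy(String src, String client, String previous) {
        return addBlock(src, client, null, null);
    }

    // Fixed delegation: forwards `previous`.
    static String addBlockFixed(String src, String client, String previous) {
        return addBlock(src, client, previous, null);
    }

    public static void main(String[] args) {
        committed = false;
        addBlockBuggy("/f", "c", "blk_1");
        System.out.println("buggy committed: " + committed); // false
        committed = false;
        addBlockFixed("/f", "c", "blk_1");
        System.out.println("fixed committed: " + committed); // true
    }
}
```

In the buggy variant the allocation still succeeds, which is why the failure surfaces later as the NameNode-side "block has not been COMMITTED by the client" error rather than at the call site.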
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Status: Patch Available (was: Open)

Fix for 0.21 and trunk.
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Attachment: 0001-Fix-HDFS-630-svn.patch

I've:
{code}
patch -p1 < 0001-Fix-HDFS-630-for-0.21-and-trunk-unified.patch
svn add src/test/hdfs/org/apache/hadoop/hdfs/TestDFSClientExcludedNodes.java
svn diff > 0001-Fix-HDFS-630-svn.patch
{code}

I really hope this works. It appears there's no easy way to generate a patch from git and have it work in this setup. Dhruba: if it still won't work, please run the patch with -p1 and then generate a patch that will work. By the way, a unit test is included with the last 3 patches.
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-630:
---

Attachment: 0001-Fix-HDFS-630-for-0.21-and-trunk-unified.patch

The patch applies on trunk as well. However since it's a git patch I guess it caused some confusion. Here is the unified patch.
[jira] Commented: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777500#action_12777500 ] Cosmin Lehene commented on HDFS-630:

stack: I can't reproduce it on 0.21. I did find it in the NN log from before upgrading the HBase jar to the patched HDFS:

java.io.IOException: Cannot complete block: block has not been COMMITTED by the client
    at org.apache.hadoop.hdfs.server.namenode.BlockInfoUnderConstruction.convertToCompleteBlock(BlockInfoUnderConstruction.java:158)
    at org.apache.hadoop.hdfs.server.namenode.BlockManager.completeBlock(BlockManager.java:288)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1243)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:637)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:621)
    at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:516)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:960)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:958)

I should point out that "at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:621)" means the call came from an unpatched DFSClient that calls the old NameNode interface. Line 621 is:

    return addBlock(src, clientName, null, null);

which is the body of the old overload:

    @Override
    public LocatedBlock addBlock(String src, String clientName, Block previous) throws IOException {
        return addBlock(src, clientName, null, null);
    }

This is different from your stacktrace http://pastie.org/695936, which calls the complete() method. However, could you search for the same error while adding a new block with addBlock() (like mine)? If you find it, you could figure out the entry point into the NameNode, and if it's line 621 you might have an unpatched DFSClient. That said, even with an unpatched DFSClient I still can't figure out why it would cause this. Perhaps I should get a better understanding of the cause of the exception. So far, from the code comments in BlockInfoUnderConstruction, I gather that the state of the block (the generation stamp and the length) either has not been committed by the client or does not have at least a minimal number of replicas reported from datanodes.
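The backward-compatibility point above, an old-interface overload delegating to the new one with no exclusions, can be illustrated with a small self-contained sketch. The class and method names below are hypothetical stand-ins, not the real NameNode or DFSClient code:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the patched NameNode interface (not Hadoop code).
// The old two-argument addBlock delegates to the new overload with no excluded
// nodes, so unpatched DFSClients still work but never exclude a dead datanode,
// mirroring the delegation at NameNode.java:621.
class SketchNameNode {
    private final List<String> liveDatanodes;

    SketchNameNode(List<String> liveDatanodes) {
        this.liveDatanodes = liveDatanodes;
    }

    // Old interface, as called by unpatched clients.
    String addBlock(String src, String clientName) {
        return addBlock(src, clientName, null);
    }

    // New interface: excludedNodes are skipped for this allocation only.
    String addBlock(String src, String clientName, List<String> excludedNodes) {
        for (String dn : liveDatanodes) {
            if (excludedNodes == null || !excludedNodes.contains(dn)) {
                return dn; // first acceptable replica location
            }
        }
        throw new IllegalStateException("no acceptable datanode for " + src);
    }
}
```

An unpatched client always hits the two-argument overload, so on a small cluster the same dead datanode can be returned on every retry, which is exactly the failure mode this issue describes.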
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cosmin Lehene updated HDFS-630: --- Affects Version/s: (was: 0.20.1) Status: Patch Available (was: Open)

Adapted for the 0.21 branch. Added excludedNodes back to BlockPlacementPolicy. Adapted it to use HashMap<Node, Node> instead of List<Node>, since BlockPlacementPolicyDefault was changed to use HashMap. However, I'm not sure it's supposed to be a HashMap... Luckily, Dhruba didn't remove the code that dealt with excludedNodes from BlockPlacementPolicyDefault, so I only had to wire up the methods. I also added a unit test; it's practically a functional test that spins up a MiniDFSCluster with 3 DataNodes and kills one before creating the file.
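The wiring described above, an exclusion map threaded through the placement policy plus a per-allocation retry loop on the client side, can be sketched in a self-contained way. The names SketchPlacementPolicy and allocateBlock are hypothetical, not the actual BlockPlacementPolicyDefault API:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the exclusion wiring (not the real
// BlockPlacementPolicyDefault): candidates present in the excludedNodes
// map are skipped when choosing a target.
class SketchPlacementPolicy {
    String chooseTarget(List<String> candidates, Map<String, String> excludedNodes) {
        for (String dn : candidates) {
            if (!excludedNodes.containsKey(dn)) {
                return dn;
            }
        }
        return null; // no viable target left
    }

    // Client-side retry sketch: a datanode that fails to connect is added to
    // the exclusion map, which lives only for this one block allocation.
    static String allocateBlock(SketchPlacementPolicy policy, List<String> candidates,
                                List<String> deadNodes, int maxRetries) {
        Map<String, String> excluded = new HashMap<>(); // per-allocation scope
        for (int i = 0; i < maxRetries; i++) {
            String target = policy.chooseTarget(candidates, excluded);
            if (target == null) {
                return null; // everything excluded
            }
            if (!deadNodes.contains(target)) {
                return target; // connection succeeded
            }
            excluded.put(target, target); // exclude this node and re-request
        }
        return null;
    }
}
```

With the exclusion map in place, a 3-datanode cluster with one dead node succeeds on the second attempt instead of repeatedly retrying the same dead target, which is the scenario the unit test exercises.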
[jira] Updated: (HDFS-630) In DFSOutputStream.nextBlockOutputStream(), the client can exclude specific datanodes when locating the next block.
[ https://issues.apache.org/jira/browse/HDFS-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cosmin Lehene updated HDFS-630: --- Attachment: 0001-Fix-HDFS-630-for-0.21.patch