[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981798#comment-14981798 ] Hadoop QA commented on HDFS-9289:

(x) -1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 14s | docker + precommit patch detected. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
| +1 | mvninstall | 3m 8s | trunk passed |
| +1 | compile | 1m 4s | trunk passed with JDK v1.8.0_66 |
| +1 | compile | 1m 4s | trunk passed with JDK v1.7.0_79 |
| +1 | checkstyle | 0m 22s | trunk passed |
| +1 | mvneclipse | 0m 29s | trunk passed |
| -1 | findbugs | 2m 3s | hadoop-hdfs-project/hadoop-hdfs in trunk cannot run convertXmlToText from findbugs |
| +1 | javadoc | 1m 39s | trunk passed with JDK v1.8.0_66 |
| +1 | javadoc | 2m 18s | trunk passed with JDK v1.7.0_79 |
| +1 | mvninstall | 1m 12s | the patch passed |
| +1 | compile | 1m 2s | the patch passed with JDK v1.8.0_66 |
| +1 | javac | 1m 2s | the patch passed |
| +1 | compile | 1m 1s | the patch passed with JDK v1.7.0_79 |
| +1 | javac | 1m 1s | the patch passed |
| -1 | checkstyle | 0m 21s | Patch generated 1 new checkstyle issues in hadoop-hdfs-project (total was 247, now 247). |
| +1 | mvneclipse | 0m 29s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | findbugs | 4m 14s | the patch passed |
| +1 | javadoc | 1m 30s | the patch passed with JDK v1.8.0_66 |
| +1 | javadoc | 2m 19s | the patch passed with JDK v1.7.0_79 |
| -1 | unit | 69m 1s | hadoop-hdfs in the patch failed with JDK v1.8.0_66. |
| +1 | unit | 0m 55s | hadoop-hdfs-client in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 67m 54s | hadoop-hdfs in the patch passed with JDK v1.7.0_79. |
| +1 | unit | 0m 57s | hadoop-hdfs-client in the patch passed with JDK v1.7.0_79. |
| -1 | asflicense | 0m 20s | Patch generated 56 ASF License warnings. |
| | | 168m 34s | |

|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | hadoop.hdfs.server.datanode.TestBlockScanner |
| | hadoop.hdfs.server.namenode.TestFSImage |
| | hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA |

|| Subsystem || Report/Notes ||
| Docker | Client=1.7.0 Server=1.7.0 Image:test-patch-base-hadoop-date2015-10-30 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12769654/HDFS-9289.6.patch |
| JIRA Issue | HDFS-9289 |
| Optional T
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981321#comment-14981321 ] Zhe Zhang commented on HDFS-9289:

A small ask for the next rev:
{code}
// BlockInfo#commitBlock
- this.set(getBlockId(), block.getNumBytes(), block.getGenerationStamp());
+ this.setNumBytes(block.getNumBytes());
{code}
We also need to add a test in the 04 patch. Otherwise LGTM.

> check genStamp when complete file
> ---------------------------------
>
>            Key: HDFS-9289
>            URL: https://issues.apache.org/jira/browse/HDFS-9289
>        Project: Hadoop HDFS
>     Issue Type: Bug
>       Reporter: Chang Li
>       Assignee: Chang Li
>       Priority: Critical
>    Attachments: HDFS-9289.1.patch, HDFS-9289.2.patch, HDFS-9289.3.patch, HDFS-9289.4.patch
>
> We have seen a case of a corrupt block caused by a file completing after a pipelineUpdate, where the file completed with the old block genStamp. This caused the replicas on two datanodes in the updated pipeline to be viewed as corrupt. Propose to check the genStamp when committing the block.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981313#comment-14981313 ] Zhe Zhang commented on HDFS-9289:

Thanks Jing for the explanation. I agree it's reasonable to throw an exception in {{commitBlock}} and rely on lease recovery to bring the block back to full strength in this case.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981291#comment-14981291 ] Jing Zhao commented on HDFS-9289:

bq. In general, if a client misreports GS, does it indicate a likelihood of misreported numBytes -- and therefore we should deny the commitBlock?

Currently the NN only depends on the length reported by the client to determine the block length (not considering the lease recovery scenario). So the only check we can do about the length is the existing one: {{assert block.getNumBytes() <= commitBlock.getNumBytes()}}.

bq. But it's still a data loss because the data written by the client after updatePipeline becomes invisible.

Throwing an exception here does not necessarily mean that the data written after updatePipeline will be lost. In most cases the data can still be recovered during lease recovery, considering the replicas have already been persisted on the DataNodes before the client sends out the commit/complete request to the NN (since the client has received the last response from the pipeline at that time). So throwing an exception here should be the correct behavior and may not be that risky.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981262#comment-14981262 ] Zhe Zhang commented on HDFS-9289:

bq. That's silent data corruption!

[~daryn] I agree it's silent data corruption in the current logic, because we update the NN's copy of the GS with the GS reported by the client:
{code}
// BlockInfo#commitBlock
this.set(getBlockId(), block.getNumBytes(), block.getGenerationStamp());
{code}
Throwing an exception (and therefore denying the commitBlock) turns this into an explicit failure, which is better. But it's still a data loss because the data written by the client after {{updatePipeline}} becomes invisible.

So I think, at least for this particular bug (the missing {{volatile}}), the right thing to do is to avoid changing the NN's copy of the GS when committing the block (and we should avoid changing the block ID as well). The only thing we should commit is {{numBytes}}. Of course we should still print a {{WARN}} or {{ERROR}} when the GSes mismatch. As a safer first step, we should at least avoid decrementing the NN's copy of the block GS.

In general, if a client misreports GS, does it indicate a likelihood of misreported {{numBytes}} -- and therefore we should deny the {{commitBlock}}? It's hard to say; the {{volatile}} bug here is only for GS. But since we have already ensured the NN's copy of the block {{numBytes}} never decrements, the harm of a misreported {{numBytes}} is not severe.
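The behavior proposed above can be sketched in a few lines. This is a hedged, self-contained model with hypothetical, simplified stand-in types (not the real org.apache.hadoop.hdfs BlockInfo API): commit only the client-reported length, enforce that it never shrinks, keep the NN's generation stamp, and surface a GS mismatch as a warning rather than absorbing it.

```java
// Hypothetical sketch of the commit policy discussed above. The Block type,
// field names, and commitBlock signature are simplified stand-ins.
public class BlockCommitSketch {

    /** Simplified stand-in for a block triplet: (id, numBytes, genStamp). */
    public static class Block {
        final long id;
        long numBytes;
        long genStamp;

        public Block(long id, long numBytes, long genStamp) {
            this.id = id;
            this.numBytes = numBytes;
            this.genStamp = genStamp;
        }
    }

    /**
     * Commits only the client-reported length. The NN copy's genStamp is
     * never overwritten; a mismatch is reported instead of absorbed.
     */
    public static String commitBlock(Block nnCopy, Block reported) {
        if (nnCopy.id != reported.id) {
            throw new IllegalStateException("block id mismatch");
        }
        // The NN's recorded length must never shrink at commit time.
        if (reported.numBytes < nnCopy.numBytes) {
            throw new IllegalStateException("reported length shrinks block");
        }
        nnCopy.numBytes = reported.numBytes; // the only field committed
        if (nnCopy.genStamp != reported.genStamp) {
            return "WARN: client reported GS " + reported.genStamp
                + ", NN has GS " + nnCopy.genStamp;
        }
        return "OK";
    }

    public static void main(String[] args) {
        // NN copy after updatePipeline bumped the GS; client reports stale GS.
        Block nn = new Block(3773617405L, 0L, 1001L);
        Block client = new Block(3773617405L, 107761275L, 1000L);
        System.out.println(commitBlock(nn, client));
        System.out.println(nn.genStamp);  // NN GS left untouched
        System.out.println(nn.numBytes);  // length committed
    }
}
```

Under this sketch a stale client GS degrades to a logged warning plus a committed length, which matches the "commit only numBytes, warn on GS mismatch" position; swapping the warning for a thrown exception would implement the stricter behavior Daryn argues for.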
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981222#comment-14981222 ] Chang Li commented on HDFS-9289:

Thanks [~jingzhao], [~zhz] and [~daryn] for the review and the valuable discussion! Some additional info about the several cases of mismatched GS we encountered: they all happened after a pipelineUpdate for DataNode close recovery, so there was no mismatched size at commit, only a mismatched GS. Could we reach a consensus on whether we should log a warning with the mismatched-GS block info or throw an exception?
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980535#comment-14980535 ] Daryn Sharp commented on HDFS-9289:

I worked with Chang on this issue and can't think of a scenario in which it's legitimate for the client to misreport the genstamp - whether the pipeline was updated or not. Consider a more extreme case: The client wrote more data after the pipeline recovered and misreports the older genstamp. That's silent data corruption! I'd like to see an exception here rather than later.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979400#comment-14979400 ] Jing Zhao commented on HDFS-9289:

bq. What if the updatePipeline RPC call has successfully finished NN side changes but failed in sending response to client? Should we allow the client to commit the block?

The RPC call will fail on the client side, and the client will not use the old GS to commit the block.

bq. How about we use this JIRA to commit the volatile change (which should fix the reported issue) and dedicate a follow-on JIRA to the commitBlock GS validation change?

I agree with this proposal. Let's only log a warning msg on the NN side and not throw an exception in this jira.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14979367#comment-14979367 ] Zhe Zhang commented on HDFS-9289:

Thanks Jing for sharing the thoughts. I think the GS validation in {{BlockManager#commitBlock}} is a little tricky. What if the {{updatePipeline}} RPC call has successfully finished NN-side changes but failed in sending the response to the client? Should we allow the client to commit the block?

GS is used to determine whether a replica is stale, but the client doesn't have a replica. Among the 3 attributes of a block (ID, size, GS), the client should always have the same ID as the NN, and should always have a fresher size than the NN. So maybe the right thing to do is to discard the client-reported GS in {{commitBlock}}, but I'm not so sure. How about we use this JIRA to commit the {{volatile}} change (which should fix the reported issue) and dedicate a follow-on JIRA to the {{commitBlock}} GS validation change?
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978896#comment-14978896 ] Jing Zhao commented on HDFS-9289:

Making DataStreamer#block volatile is a good change, and the GS validation on the NN side also looks good to me. Maybe we do not need a new {{InvalidGenStampException}} type, though. Logging the detailed information of the block with the mismatching GS on the NN side will also be useful.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978617#comment-14978617 ] Chang Li commented on HDFS-9289:

[~zhz], no, we don't have this log because we didn't enable the blockStateChangeLog. How do you propose we should proceed with this jira?
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977679#comment-14977679 ] Hadoop QA commented on HDFS-9289:

(x) -1 overall

|| Vote || Subsystem || Runtime || Comment ||
| -1 | pre-patch | 30m 31s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 0s | The patch appears to include 2 new or modified test files. |
| +1 | javac | 11m 6s | There were no new javac warning messages. |
| +1 | javadoc | 16m 4s | There were no new javadoc warning messages. |
| +1 | release audit | 0m 39s | The applied patch does not increase the total number of release audit warnings. |
| -1 | checkstyle | 3m 30s | The applied patch generated 1 new checkstyle issues (total was 161, now 161). |
| +1 | whitespace | 0m 1s | The patch has no lines that end in whitespace. |
| +1 | install | 2m 33s | mvn install still works. |
| +1 | eclipse:eclipse | 1m 5s | The patch built with eclipse:eclipse. |
| +1 | findbugs | 8m 5s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| +1 | native | 5m 22s | Pre-build of native portion |
| -1 | hdfs tests | 78m 6s | Tests failed in hadoop-hdfs. |
| +1 | hdfs tests | 0m 59s | Tests passed in hadoop-hdfs-client. |
| | | 159m 31s | |

|| Reason || Tests ||
| Failed unit tests | hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes |
| | hadoop.hdfs.server.datanode.TestDirectoryScanner |
| | hadoop.hdfs.TestRecoverStripedFile |
| | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
| | hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock |
| | hadoop.hdfs.TestEncryptionZones |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12769100/HDFS-9289.3.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 68ce93c |
| Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13238/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/13238/artifact/patchprocess/diffcheckstylehadoop-hdfs.txt |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13238/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| hadoop-hdfs-client test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13238/artifact/patchprocess/testrun_hadoop-hdfs-client.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13238/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13238/console |

This message was automatically generated.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977260#comment-14977260 ] Zhe Zhang commented on HDFS-9289:

bq. I think there probabaly exist some cache coherence issue

This sounds possible. Maybe the {{DFSOutputStream}} thread uses a stale copy of {{block}} in {{completeFile}}, after {{block}} is updated by the {{DataStreamer}} thread.

bq. Then pipelineupdate happen with only d2 and d3 with new GS. Then file complete with old GS and d2 and d3 were marked corrupt.

Do you have any log showing that "replica marked as corrupt because its GS is newer than the block GS on NN"? Regardless, making {{DataStreamer#block}} volatile is a good change. Ideally we should add a test to emulate the cache coherency problem, but it doesn't look easy.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977156#comment-14977156 ] Chang Li commented on HDFS-9289:

[~zhz], yes, the above log is from the same cluster as the first log I posted. The two replicas on the two datanodes from the updated pipeline had the new GS, but they were marked as corrupt because the block was committed with the old genstamp.

The complete story in that cluster: there were initially 3 datanodes in the pipeline, d1, d2, d3. Then a pipelineUpdate happened with only d2 and d3, with a new GS. Then the file completed with the old GS, and d2 and d3 were marked corrupt. After 1 day, the full block report from d1 came in, and the NN found that d1 had the right block with the "correct" old GS but was under-replicated, so the NN told d1 to replicate its replica with the old GS to two other nodes, d4 and d5. So the 3 DNs I showed above were d1, d4, and d5, all having the old GS.

I think there probably exists some cache coherence issue, since {code}protected ExtendedBlock block;{code} lacks volatile. That could also explain why this issue didn't happen frequently.
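The visibility hazard described above can be illustrated with a minimal, self-contained sketch (hypothetical names; this is not the real DataStreamer, and the generation stamp stands in for the ExtendedBlock reference): one thread bumps the GS during pipeline recovery, and the completing thread later reads it. Declaring the field volatile is what guarantees the reader observes the update without any other synchronization.

```java
// Minimal visibility sketch (hypothetical names, not the real DataStreamer).
// A streamer thread bumps the generation stamp during pipeline recovery;
// the main thread later reads it to complete the file. The volatile
// qualifier is the fix under discussion.
public class GenStampVisibility {

    // Stand-in for DataStreamer#block; volatile guarantees the completing
    // thread sees the streamer thread's update.
    private volatile long generationStamp;

    public GenStampVisibility(long initialGs) {
        this.generationStamp = initialGs;
    }

    /** Emulates updatePipeline bumping the GS on the streamer thread. */
    public void bumpOnStreamerThread(long newGs) throws InterruptedException {
        Thread streamer = new Thread(() -> generationStamp = newGs);
        streamer.start();
        streamer.join();
    }

    /** Emulates completeFile reading the GS on the calling thread. */
    public long gsSeenAtComplete() {
        return generationStamp;
    }

    public static void main(String[] args) throws InterruptedException {
        GenStampVisibility s = new GenStampVisibility(1000L);
        s.bumpOnStreamerThread(1001L);
        System.out.println(s.gsSeenAtComplete());
    }
}
```

Note that this sketch joins the writer thread, which itself establishes a happens-before edge; in the real client there is no such join between the streamer's pipeline recovery and the completeFile call, which is exactly why the volatile qualifier must supply the visibility guarantee, and why the stale-GS symptom appeared only rarely.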
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14976925#comment-14976925 ] Zhe Zhang commented on HDFS-9289:

The fact that all 3 DNs have the old GS doesn't mean the client also has an old GS. Is the above log from the same cluster as the previous [logs | https://issues.apache.org/jira/browse/HDFS-9289?focusedCommentId=14972655&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14972655]? In these cases, is there any replica with the correct (new) GS? If so, it doesn't look like a bug. If all replicas of a block have the old GS, then it's more suspicious.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14976815#comment-14976815 ] Chang Li commented on HDFS-9289:

[~zhz], I don't have a log showing the file was completed with an old GS. But by looking up the block from the JSP page right now, I can see that block blk_3773617405 currently has replicas on hosts ***657n26.***.com, ***656n04.***.com, and ***656n38.***.com. By going to those datanodes, I see the replicas there have the old genstamp.
{code}
bash-4.1$ hostname
***657n26.***.com
bash-4.1$ ls -l /grid/2/hadoop/var/hdfs/data/current/BP-1052427332-98.138.108.146-1350583571998/current/finalized/subdir236/subdir212/blk_3773617405*
-rw-r--r-- 1 hdfs users 107761275 Oct 23 18:00 /grid/2/hadoop/var/hdfs/data/current/BP-1052427332-98.138.108.146-1350583571998/current/finalized/subdir236/subdir212/blk_3773617405
-rw-r--r-- 1 hdfs users    841895 Oct 23 18:00 /grid/2/hadoop/var/hdfs/data/current/BP-1052427332-98.138.108.146-1350583571998/current/finalized/subdir236/subdir212/blk_3773617405_1106111498065.meta

bash-4.1$ hostname
***656n04.***.com
bash-4.1$ ls -l /grid/1/hadoop/var/hdfs/data/current/BP-1052427332-98.138.108.146-1350583571998/current/finalized/subdir236/subdir212/blk_3773617405*
-rw-r--r-- 1 hdfs users 107761275 Oct 21 19:14 /grid/1/hadoop/var/hdfs/data/current/BP-1052427332-98.138.108.146-1350583571998/current/finalized/subdir236/subdir212/blk_3773617405
-rw-r--r-- 1 hdfs users    841895 Oct 21 19:14 /grid/1/hadoop/var/hdfs/data/current/BP-1052427332-98.138.108.146-1350583571998/current/finalized/subdir236/subdir212/blk_3773617405_1106111498065.meta

bash-4.1$ hostname
***656n38.***.com
bash-4.1$ ls -l /grid/3/hadoop/var/hdfs/data/current/BP-1052427332-98.138.108.146-1350583571998/current/finalized/subdir236/subdir212/blk_3773617405*
-rw-r--r-- 1 hdfs users 107761275 Oct 23 09:14 /grid/3/hadoop/var/hdfs/data/current/BP-1052427332-98.138.108.146-1350583571998/current/finalized/subdir236/subdir212/blk_3773617405
-rw-r--r-- 1 hdfs users    841895 Oct 23 09:14 /grid/3/hadoop/var/hdfs/data/current/BP-1052427332-98.138.108.146-1350583571998/current/finalized/subdir236/subdir212/blk_3773617405_1106111498065.meta
{code}
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14976735#comment-14976735 ] Zhe Zhang commented on HDFS-9289:

bq. the client after updatepipeline with the new gen stamp it later completed file with the old gen stamp

This looks very strange. But why do you think this happened? Did you see logs showing that the file was completed with an old GS?
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14976471#comment-14976471 ] Chang Li commented on HDFS-9289: Hi, [~walter.k.su], I don't know in which cluster this strange case will happen again, and I can't enable the debug messages of NameNode.blockStateChangeLog across all clusters. I will look into the root cause of how this strange problem happened.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975837#comment-14975837 ] Walter Su commented on HDFS-9289: - The patch hides a potentially bigger bug. We should find it and address it. Hi, [~lichangleo], I'd very much appreciate it if you could enable the debug level of {{NameNode.blockStateChangeLog}} and attach more logs, or provide instructions on how to reproduce it.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975381#comment-14975381 ] Chang Li commented on HDFS-9289: [~zhz], you are right: the client had the new genstamp. But the problem I am trying to point out is that after the client updated the pipeline with the new gen stamp, it later completed the file with the old gen stamp. So my patch tries to prevent the client from completing the file with the old genstamp after it has updated the pipeline with the new genstamp.
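The check Chang Li describes can be sketched roughly as below. This is an illustrative sketch only, not the actual HDFS-9289 patch; the class, method, and message are hypothetical stand-ins for the real commit-block path:

```java
// Hypothetical sketch of the proposed genstamp check at block commit time.
// Not actual HDFS code: the real change would live in the NameNode's
// commit/complete path and differ in detail.
public class CommitCheck {

    /**
     * Reject a completeFile whose block carries a generation stamp that
     * disagrees with the one the NameNode recorded at updatePipeline.
     */
    static void commitBlock(long storedGenStamp, long reportedGenStamp) {
        if (reportedGenStamp != storedGenStamp) {
            throw new IllegalStateException(
                "Commit block with mismatching GS: NN has " + storedGenStamp
                + ", client submits " + reportedGenStamp);
        }
    }

    public static void main(String[] args) {
        // Matching GS (post-updatePipeline value from the log): accepted.
        commitBlock(1106111511603L, 1106111511603L);

        // Client completes with the pre-updatePipeline genstamp: rejected.
        boolean rejected = false;
        try {
            commitBlock(1106111511603L, 1106111498065L);
        } catch (IllegalStateException e) {
            rejected = true;
        }
        if (!rejected) throw new AssertionError("old GS must be rejected");
        System.out.println("ok");
    }
}
```

With such a check in place, the completeFile call in the logs above would fail fast instead of silently committing the block with the stale genstamp.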
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975348#comment-14975348 ] Zhe Zhang commented on HDFS-9289: - [~lichangleo] I think the log below shows that the client does have the new GS {{1106111511603}}, because the parameter {{newBlock}} is passed in from the client. So IIUC, even if we check the GS when completing the file, as the patch does, it won't stop the client from completing/closing the file. Or could you describe how you think the patch can avoid this error? Thanks.
{code}
2015-10-20 19:49:20,392 [IPC Server handler 63 on 8020] INFO namenode.FSNamesystem: updatePipeline(BP-1052427332-98.138.108.146-1350583571998:blk_3773617405_1106111498065) successfully to BP-1052427332-98.138.108.146-1350583571998:blk_3773617405_1106111511603
{code}
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974704#comment-14974704 ] Chang Li commented on HDFS-9289: Hi [~jingzhao], we are currently using the default, and the default is true.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974679#comment-14974679 ] Jing Zhao commented on HDFS-9289: - Hi [~lichangleo], what is the current conf setting of the replace-datanode-on-failure policy in your cluster? From the log it looks like you disabled it?
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974525#comment-14974525 ] Zhe Zhang commented on HDFS-9289: - [~lichangleo] Thanks for sharing the logs! I'll look at the patch and logs and post a review today.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974497#comment-14974497 ] Chang Li commented on HDFS-9289: [~zhz], before we figure out the root cause of this strange case, should we let this jira be a temporary fix?
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974490#comment-14974490 ] Chang Li commented on HDFS-9289: We have hit another case in our cluster:
{code}
2015-10-23 04:38:08,544 [IPC Server handler 11 on 8020] INFO hdfs.StateChange: BLOCK* allocateBlock: /projects/wcc/wcc1/data/2015/10/22/05/Content-9892.temp.gz.temp._COPYING_. BP-1161836467-98.137.240.59-1438814573258 blk_1427767166_354062734{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-a04f60ed-6700-4e93-8a52-555301e07d3b:NORMAL:10.213.43.41:1004|RBW], ReplicaUnderConstruction[[DISK]DS-7e0de56b-17ba-4164-8b19-67a9f9f84c2c:NORMAL:10.213.46.123:1004|RBW], ReplicaUnderConstruction[[DISK]DS-14a850d1-deb9-496b-b5ed-bb57010a8b56:NORMAL:10.213.46.96:1004|RBW]]}
2015-10-23 04:39:35,588 [IPC Server handler 5 on 8020] INFO namenode.FSNamesystem: updatePipeline(block=BP-1161836467-98.137.240.59-1438814573258:blk_1427767166_354062734, newGenerationStamp=354080525, newLength=24505255, newNodes=[10.213.46.123:1004, 10.213.46.96:1004], clientName=DFSClient_NONMAPREDUCE_1262158981_1)
2015-10-23 04:39:35,588 [IPC Server handler 5 on 8020] INFO namenode.FSNamesystem: updatePipeline(BP-1161836467-98.137.240.59-1438814573258:blk_1427767166_354062734) successfully to BP-1161836467-98.137.240.59-1438814573258:blk_1427767166_354080525
2015-10-23 04:39:35,595 [IPC Server handler 50 on 8020] INFO hdfs.StateChange: DIR* completeFile: /projects/wcc/wcc1/data/2015/10/22/05/Content-9892.temp.gz.temp._COPYING_ is closed by DFSClient_NONMAPREDUCE_1262158981_1
{code}
This is also a completeFile right after a pipelineUpdate. The JSP page shows three nodes that currently hold the replica of blk_1427767166. One of the nodes is 10.213.43.41, the first node in the old pipeline, which dropped out of the updated pipeline; the replica currently on that node has the old gen stamp.
The replicas on the other two nodes were created later, after the first node in the old pipeline sent in its block report. The two nodes in the updated pipeline were marked as corrupt until the node 10.213.43.41 sent in its block report.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972655#comment-14972655 ] Chang Li commented on HDFS-9289: Hi [~zhz], here is the log:
{code}
INFO hdfs.StateChange: BLOCK* allocateBlock: /projects/FETLDEV/Benzene/benzene_stg_transient/primer/201510201900/_temporary/1/_temporary/attempt_1444859775697_31140_m_001028_0/part-m-01028. BP-1052427332-98.138.108.146-1350583571998 blk_3773617405_1106111498065{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-0a28b82a-e3fb-4e42-b925-e76ebd98afb4:NORMAL:10.216.32.61:1004|RBW], ReplicaUnderConstruction[[DISK]DS-236c19ee-0a39-4e53-9520-c32941ca1828:NORMAL:10.216.70.49:1004|RBW], ReplicaUnderConstruction[[DISK]DS-fc7c2dab-9309-46be-b5c0-52be8e698591:NORMAL:10.216.70.43:1004|RBW]]}
2015-10-20 19:49:20,392 [IPC Server handler 63 on 8020] INFO namenode.FSNamesystem: updatePipeline(block=BP-1052427332-98.138.108.146-1350583571998:blk_3773617405_1106111498065, newGenerationStamp=1106111511603, newLength=107761275, newNodes=[10.216.70.49:1004, 10.216.70.43:1004], clientName=DFSClient_attempt_1444859775697_31140_m_001028_0_1424303982_1)
2015-10-20 19:49:20,392 [IPC Server handler 63 on 8020] INFO namenode.FSNamesystem: updatePipeline(BP-1052427332-98.138.108.146-1350583571998:blk_3773617405_1106111498065) successfully to BP-1052427332-98.138.108.146-1350583571998:blk_3773617405_1106111511603
2015-10-20 19:49:20,400 [IPC Server handler 96 on 8020] INFO hdfs.StateChange: DIR* completeFile: /projects/FETLDEV/Benzene/benzene_stg_transient/primer/201510201900/_temporary/1/_temporary/attempt_1444859775697_31140_m_001028_0/part-m-01028 is closed by DFSClient_attempt_1444859775697_31140_m_001028_0_1424303982_1
{code}
You can see the file completes after a pipeline update. The block changed its genStamp from blk_3773617405_1106111498065 to blk_3773617405_1106111511603.
But then the two nodes in the updated pipeline are marked as corrupt. When I run fsck, it shows:
{code}
hdfs fsck /projects/FETLDEV/Benzene/benzene_stg_transient/primer/201510201900/part-m-01028
Connecting to namenode via http://uraniumtan-nn1.tan.ygrid.yahoo.com:50070
FSCK started by hdfs (auth:KERBEROS_SSL) from /98.138.131.190 for path /projects/FETLDEV/Benzene/benzene_stg_transient/primer/201510201900/part-m-01028 at Wed Oct 21 15:04:56 UTC 2015
.
/projects/FETLDEV/Benzene/benzene_stg_transient/primer/201510201900/part-m-01028: CORRUPT blockpool BP-1052427332-98.138.108.146-1350583571998 block blk_3773617405
/projects/FETLDEV/Benzene/benzene_stg_transient/primer/201510201900/part-m-01028: Replica placement policy is violated for BP-1052427332-98.138.108.146-1350583571998:blk_3773617405_1106111498065. Block should be additionally replicated on 1 more rack(s).
{code}
It shows the block with the old gen stamp, blk_3773617405_1106111498065.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14972209#comment-14972209 ] Zhe Zhang commented on HDFS-9289: - [~lichangleo] Thanks for reporting the issue.
bq. but the file complete with the old block genStamp.
How did that happen? So the client somehow had an old GS? IIUC the {{updatePipeline}} protocol is as below (using {{client_GS}}, {{DN_GS}}, and {{NN_GS}} to denote the 3 copies of the GS):
# Client asks for a new GS from the NN through {{updateBlockForPipeline}}. After this, {{client_GS}} is new; both {{DN_GS}} and {{NN_GS}} are old.
# Client calls {{createBlockOutputStream}} to update the DN's GS. After this, both {{client_GS}} and {{DN_GS}} are new; {{NN_GS}} is old.
# Client calls {{updatePipeline}}. After this, all 3 GSes should be new.
Maybe step 3 failed, and then the client tried to complete the file? It'd be ideal if you could extend the unit test to reproduce the error without the fix (or paste the error log). Thanks!
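Zhe's three-step handshake can be modeled as a tiny state simulation. This is a toy model, not HDFS code: the method names mirror the RPCs but the bodies only track the three copies of the generation stamp, showing how a lost step 3 leaves the NameNode's copy stale:

```java
// Toy model of the updatePipeline generation-stamp handshake described above.
// clientGS / dnGS / nnGS mirror the three copies of the GS; the methods are
// named after the real RPCs but are illustrative only.
public class GsHandshake {
    long clientGS, dnGS, nnGS;

    GsHandshake(long oldGS) { clientGS = dnGS = nnGS = oldGS; }

    // Step 1: NN hands out a new GS to the client.
    void updateBlockForPipeline(long newGS) { clientGS = newGS; }

    // Step 2: client pushes the new GS to the DNs in the new pipeline.
    void createBlockOutputStream() { dnGS = clientGS; }

    // Step 3: NN records the new GS.
    void updatePipeline() { nnGS = clientGS; }

    public static void main(String[] args) {
        GsHandshake h = new GsHandshake(1106111498065L); // old GS from the log
        h.updateBlockForPipeline(1106111511603L);        // new GS from the log
        h.createBlockOutputStream();
        // If step 3 is skipped or lost, the NN's copy stays old, so a
        // subsequent completeFile can disagree with the NN's block map.
        if (h.nnGS == h.clientGS) throw new AssertionError();
        h.updatePipeline();
        // After step 3 all three copies agree.
        if (h.nnGS != h.clientGS || h.dnGS != h.clientGS) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Note that the logs posted earlier show updatePipeline succeeding on the NN, which is why a failed step 3 alone does not fully explain the reported completeFile with the old GS.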
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971590#comment-14971590 ] Elliott Clark commented on HDFS-9289: - It had all of the data and the same md5sums when I checked, so the only thing different was the genstamps. Not really sure why that happened. But I didn't mean to sidetrack this jira. The test looks nice.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14971302#comment-14971302 ] Chang Li commented on HDFS-9289: [~eclark], the block on 10.210.31.38 should be marked as corrupt because it's from the old pipeline, right?
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970213#comment-14970213 ] Elliott Clark commented on HDFS-9289: -
{code}
15/10/22 09:37:36 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1190230043 added as corrupt on 10.210.31.38:50010 by hbase4678.test.com/10.210.31.38 because reported RBW replica with genstamp 116735085 does not match COMPLETE block's genstamp in block map 116737586
{code}
The block length on the "corrupt" replicas is the same as on the non-corrupt ones. The only difference is the genstamp.
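The NameNode's decision in this log entry boils down to a genstamp comparison against the block map. The sketch below is an illustrative restatement of that rule, not actual BlockManager code, using the genstamps from the log above:

```java
// Illustrative restatement of the rule visible in the log: a reported
// replica whose genstamp differs from the COMPLETE block's genstamp in the
// block map is treated as corrupt. Not actual BlockManager code.
public class StaleReplicaCheck {
    static boolean isCorrupt(long blockMapGenStamp, long reportedGenStamp) {
        return reportedGenStamp != blockMapGenStamp;
    }

    public static void main(String[] args) {
        // Values from the log: block map has 116737586, the RBW replica on
        // 10.210.31.38 reports the pre-update genstamp 116735085.
        if (!isCorrupt(116737586L, 116735085L)) throw new AssertionError();
        if (isCorrupt(116737586L, 116737586L)) throw new AssertionError();
        System.out.println("ok");
    }
}
```

This is why identical block contents (same length, same md5sum) can still be invalidated: the comparison is on genstamps, not data.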
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969984#comment-14969984 ] Chang Li commented on HDFS-9289: I will update the patch soon with the expected and encountered gen stamps in the message, plus a unit test.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969979#comment-14969979 ] Chang Li commented on HDFS-9289: Hi [~eclark], I think the case you gave is not the same, and the corrupt block doesn't seem to be caused by the gen stamp mismatch described in this jira. Your initial pipeline has nodes 33, 48, 38. After the pipeline update it has nodes 33, 45, 29. Node 38 is then marked corrupt due to a gen stamp mismatch, which is what should happen. Then node 29 (with the correct gen stamp) was told to replicate to some other node, and the client reported the block on node 29 as corrupt. This case of corruption doesn't seem to be caused by a gen stamp mismatch on the namenode side but by a report from the client ("because client machine reported it").
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969946#comment-14969946 ] Hadoop QA commented on HDFS-9289: -
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch | 18m 50s | Findbugs (version ) appears to be broken on trunk. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 8m 59s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 11m 27s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 32s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 46s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 2m 9s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. |
| {color:red}-1{color} | findbugs | 0m 30s | Post-patch findbugs hadoop-hdfs-project/hadoop-hdfs compilation is broken. |
| {color:green}+1{color} | findbugs | 0m 30s | The patch does not introduce any new Findbugs (version ) warnings. |
| {color:green}+1{color} | native | 0m 13s | Pre-build of native portion |
| {color:red}-1{color} | hdfs tests | 0m 25s | Tests failed in hadoop-hdfs. |
| | | | 44m 31s | |
|| Reason || Tests ||
| Failed build | hadoop-hdfs |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12768113/HDFS-9289.1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 0fce5f9 |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13136/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13136/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13136/console |
This message was automatically generated.
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969933#comment-14969933 ] Elliott Clark commented on HDFS-9289: - Also, can we add the expected and encountered genstamps to the exception message?
[jira] [Commented] (HDFS-9289) check genStamp when complete file
[ https://issues.apache.org/jira/browse/HDFS-9289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969890#comment-14969890 ] Elliott Clark commented on HDFS-9289: - We just had something very similar happen on a prod cluster. Then the datanode holding the only complete block was shut off for repair.
{code}
15/10/22 06:29:32 INFO hdfs.StateChange: BLOCK* allocateBlock: /TESTCLUSTER-HBASE/WALs/hbase4544.test.com,16020,1444266312515/hbase4544.test.com%2C16020%2C1444266312515.default.1445520572440. BP-1735829752-10.210.49.21-1437433901380 blk_1190230043_116735085{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-8d0a91de-8a69-4f39-816e-de3a0fa8a3aa:NORMAL:10.210.81.33:50010|RBW], ReplicaUnderConstruction[[DISK]DS-52d9a122-a46a-4129-ab3d-d9041de109f8:NORMAL:10.210.31.48:50010|RBW], ReplicaUnderConstruction[[DISK]DS-c734b72e-27de-4dd4-a46c-7ae59f6ef792:NORMAL:10.210.31.38:50010|RBW]]}
15/10/22 06:32:48 INFO namenode.FSNamesystem: updatePipeline(block=BP-1735829752-10.210.49.21-1437433901380:blk_1190230043_116735085, newGenerationStamp=116737586, newLength=201675125, newNodes=[10.210.81.33:50010, 10.210.81.45:50010, 10.210.64.29:50010], clientName=DFSClient_NONMAPREDUCE_1976436475_1)
15/10/22 06:32:48 INFO namenode.FSNamesystem: updatePipeline(BP-1735829752-10.210.49.21-1437433901380:blk_1190230043_116735085) successfully to BP-1735829752-10.210.49.21-1437433901380:blk_1190230043_116737586
15/10/22 06:32:50 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.210.64.29:50010 is added to blk_1190230043_116737586{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-8d0a91de-8a69-4f39-816e-de3a0fa8a3aa:NORMAL:10.210.81.33:50010|RBW], ReplicaUnderConstruction[[DISK]DS-d5f7fff9-005d-4804-a223-b6e6624d3af2:NORMAL:10.210.81.45:50010|RBW], ReplicaUnderConstruction[[DISK]DS-0620aef7-b6b2-4a23-950c-09373f68a815:NORMAL:10.210.64.29:50010|FINALIZED]]} size 201681322
15/10/22 06:32:50 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.210.81.45:50010 is added to blk_1190230043_116737586{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-8d0a91de-8a69-4f39-816e-de3a0fa8a3aa:NORMAL:10.210.81.33:50010|RBW], ReplicaUnderConstruction[[DISK]DS-0620aef7-b6b2-4a23-950c-09373f68a815:NORMAL:10.210.64.29:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-52a0a4ba-cf64-4763-99a8-6c9bb5946879:NORMAL:10.210.81.45:50010|FINALIZED]]} size 201681322
15/10/22 06:32:50 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.210.81.33:50010 is added to blk_1190230043_116737586{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-0620aef7-b6b2-4a23-950c-09373f68a815:NORMAL:10.210.64.29:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-52a0a4ba-cf64-4763-99a8-6c9bb5946879:NORMAL:10.210.81.45:50010|FINALIZED], ReplicaUnderConstruction[[DISK]DS-4d937567-7184-40b7-a822-c7e3b5d588d4:NORMAL:10.210.81.33:50010|FINALIZED]]} size 201681322
15/10/22 09:37:36 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1190230043 added as corrupt on 10.210.31.38:50010 by hbase4678.test.com/10.210.31.38 because reported RBW replica with genstamp 116735085 does not match COMPLETE block's genstamp in block map 116737586
15/10/22 09:37:36 INFO BlockStateChange: BLOCK* invalidateBlock: blk_1190230043_116735085(stored=blk_1190230043_116737586) on 10.210.31.38:50010
15/10/22 09:37:36 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_1190230043_116735085 to 10.210.31.38:50010
15/10/22 09:37:39 INFO BlockStateChange: BLOCK* BlockManager: ask 10.210.31.38:50010 to delete [blk_1190230043_116735085]
15/10/22 12:45:03 INFO BlockStateChange: BLOCK* ask 10.210.64.29:50010 to replicate blk_1190230043_116737586 to datanode(s) 10.210.64.56:50010
15/10/22 12:45:07 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1190230043 added as corrupt on 10.210.64.29:50010 by hbase4496.test.com/10.210.64.56 because client machine reported it
15/10/22 12:50:49 INFO BlockStateChange: BLOCK* ask 10.210.81.45:50010 to replicate blk_1190230043_116737586 to datanode(s) 10.210.49.49:50010
15/10/22 12:50:55 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1190230043 added as corrupt on 10.210.81.45:50010 by hbase4478.test.com/10.210.49.49 because client machine reported it
15/10/22 12:56:01 WARN blockmanagement.BlockManager: PendingReplicationMonitor timed out blk_1190230043_116737586
{code}
The patch will help, but the issue will still be there. Is there some way to keep the genstamps from getting out of sync?