[jira] [Updated] (HDFS-7587) Edit log corruption can happen if append fails with a quota violation
[ https://issues.apache.org/jira/browse/HDFS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-7587:
--------------------------
    Attachment: HDFS-7587-branch-2.6.patch

For the 2.6.1 release effort, the backport isn't straightforward because of differences between 2.6 and 2.7. The branch-2.6 patch differs from the original patch as follows:
* Include part of HDFS-7509 so that prepareFileForWrite has the expected function signature.
* Use Quota.Counts instead of QuotaCounts, which was introduced in HDFS-7584.
* Skip the check for the storage-type-specific quota introduced in HDFS-7584.
* Add the necessary definitions for INodesInPath#length and FSDirectory#shouldSkipQuotaChecks.

> Edit log corruption can happen if append fails with a quota violation
> ---------------------------------------------------------------------
>
>         Key: HDFS-7587
>         URL: https://issues.apache.org/jira/browse/HDFS-7587
>     Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>    Reporter: Kihwal Lee
>    Assignee: Jing Zhao
>    Priority: Blocker
>      Labels: 2.6.1-candidate
>     Fix For: 2.7.0
> Attachments: HDFS-7587-branch-2.6.patch, HDFS-7587.001.patch, HDFS-7587.002.patch, HDFS-7587.003.patch, HDFS-7587.patch
>
> We have seen a standby namenode crashing due to edit log corruption. It was
> complaining that {{OP_CLOSE}} cannot be applied because the file is not
> under-construction.
> When a client was trying to append to the file, the remaining space quota was
> very small. This caused a failure in {{prepareFileForWrite()}}, but after the
> inode was already converted for writing and a lease added. Since these were
> not undone when the quota violation was detected, the file was left in
> under-construction with an active lease without edit logging {{OP_ADD}}.
> A subsequent {{append()}} eventually caused a lease recovery after the soft
> limit period. This resulted in {{commitBlockSynchronization()}}, which closed
> the file with {{OP_CLOSE}} being logged. Since there was no corresponding
> {{OP_ADD}}, edit replaying could not apply this.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
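The failure sequence in the description above can be modelled in a few lines. The following is a toy sketch in Java, not NameNode code: the class EditLogReplaySketch, its Namespace type, and the methods appendBuggy, recoverLease, and replay are all hypothetical names, and the edit log is reduced to a list of strings. It only illustrates how mutating state before a failing quota check, without undoing it, can later produce an OP_CLOSE that has no matching OP_ADD.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only; none of these names are real HDFS classes or methods. */
class EditLogReplaySketch {

    /** In-memory namespace state plus the list of ops that were logged. */
    static final class Namespace {
        final Set<String> underConstruction = new HashSet<>();
        final List<String> editLog = new ArrayList<>();
        long remainingQuota;

        Namespace(long remainingQuota) {
            this.remainingQuota = remainingQuota;
        }

        /** Buggy ordering: state is changed before the quota check and never undone. */
        void appendBuggy(String path, long needed) {
            underConstruction.add(path); // stands in for "inode converted, lease added"
            if (needed > remainingQuota) {
                // The exception escapes before OP_ADD is logged.
                throw new IllegalStateException("quota exceeded for " + path);
            }
            remainingQuota -= needed;
            editLog.add("OP_ADD " + path);
        }

        /** Lease recovery closes a file that is under construction and logs OP_CLOSE. */
        void recoverLease(String path) {
            if (underConstruction.remove(path)) {
                editLog.add("OP_CLOSE " + path);
            }
        }
    }

    /** Replays the log on a fresh state; returns the first op that cannot be applied, or null. */
    static String replay(List<String> editLog) {
        Set<String> underConstruction = new HashSet<>();
        for (String op : editLog) {
            String[] parts = op.split(" ", 2);
            if (parts[0].equals("OP_ADD")) {
                underConstruction.add(parts[1]);
            } else if (parts[0].equals("OP_CLOSE") && !underConstruction.remove(parts[1])) {
                return op; // file is not under construction on the replaying side
            }
        }
        return null;
    }

    /** Runs the scenario: failed append, then lease recovery, then replay. */
    static String demo() {
        Namespace ns = new Namespace(10);
        try {
            ns.appendBuggy("/f", 100);
        } catch (IllegalStateException expected) {
            // The client sees a quota error; the namespace is left inconsistent.
        }
        ns.recoverLease("/f");
        return replay(ns.editLog);
    }

    public static void main(String[] args) {
        System.out.println("first unreplayable op: " + demo());
    }
}
```

In this model the replaying side stops at the OP_CLOSE entry because it never saw an OP_ADD for the same path, which mirrors the standby namenode symptom reported in the issue.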
[jira] [Updated] (HDFS-7587) Edit log corruption can happen if append fails with a quota violation
[ https://issues.apache.org/jira/browse/HDFS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated HDFS-7587:
------------------------------------------
    Fix Version/s: 2.6.1

[~sjlee0] backported this to 2.6.1. I just pushed the commit to 2.6.1 after running compilation and TestDiskspaceQuotaUpdate, which was changed in the patch.
[~mingma], I didn't actually see a diff between the branch-2 patch and yours / Sangjin's. I would appreciate any cross-verification on the 2.6.1 branch of whether I got it right. Thanks.

> Edit log corruption can happen if append fails with a quota violation
> ---------------------------------------------------------------------
>
>         Key: HDFS-7587
>         URL: https://issues.apache.org/jira/browse/HDFS-7587
>     Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>    Reporter: Kihwal Lee
>    Assignee: Jing Zhao
>    Priority: Blocker
>      Labels: 2.6.1-candidate
>     Fix For: 2.7.0, 2.6.1
> Attachments: HDFS-7587-branch-2.6.patch, HDFS-7587.001.patch, HDFS-7587.002.patch, HDFS-7587.003.patch, HDFS-7587.patch
[jira] [Updated] (HDFS-7587) Edit log corruption can happen if append fails with a quota violation
[ https://issues.apache.org/jira/browse/HDFS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-7587:
-----------------------------
    Assignee: Daryn Sharp

> Edit log corruption can happen if append fails with a quota violation
> ---------------------------------------------------------------------
>
>         Key: HDFS-7587
>         URL: https://issues.apache.org/jira/browse/HDFS-7587
>     Project: Hadoop HDFS
>  Issue Type: Bug
>    Reporter: Kihwal Lee
>    Assignee: Daryn Sharp
>    Priority: Blocker
[jira] [Updated] (HDFS-7587) Edit log corruption can happen if append fails with a quota violation
[ https://issues.apache.org/jira/browse/HDFS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo Nicholas Sze updated HDFS-7587:
--------------------------------------
    Component/s: namenode

> Edit log corruption can happen if append fails with a quota violation
> ---------------------------------------------------------------------
>
>         Key: HDFS-7587
>         URL: https://issues.apache.org/jira/browse/HDFS-7587
>     Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>    Reporter: Kihwal Lee
>    Assignee: Daryn Sharp
>    Priority: Blocker
[jira] [Updated] (HDFS-7587) Edit log corruption can happen if append fails with a quota violation
[ https://issues.apache.org/jira/browse/HDFS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-7587:
-----------------------------
    Attachment: HDFS-7587.patch

> Edit log corruption can happen if append fails with a quota violation
> ---------------------------------------------------------------------
>
>         Key: HDFS-7587
>         URL: https://issues.apache.org/jira/browse/HDFS-7587
>     Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>    Reporter: Kihwal Lee
>    Assignee: Daryn Sharp
>    Priority: Blocker
> Attachments: HDFS-7587.patch
[jira] [Updated] (HDFS-7587) Edit log corruption can happen if append fails with a quota violation
[ https://issues.apache.org/jira/browse/HDFS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-7587:
-----------------------------
    Status: Patch Available  (was: Open)

> Edit log corruption can happen if append fails with a quota violation
> ---------------------------------------------------------------------
>
>         Key: HDFS-7587
>         URL: https://issues.apache.org/jira/browse/HDFS-7587
>     Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>    Reporter: Kihwal Lee
>    Assignee: Daryn Sharp
>    Priority: Blocker
> Attachments: HDFS-7587.patch
[jira] [Updated] (HDFS-7587) Edit log corruption can happen if append fails with a quota violation
[ https://issues.apache.org/jira/browse/HDFS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp updated HDFS-7587:
------------------------------
    Assignee:     (was: Daryn Sharp)

> Edit log corruption can happen if append fails with a quota violation
> ---------------------------------------------------------------------
>
>         Key: HDFS-7587
>         URL: https://issues.apache.org/jira/browse/HDFS-7587
>     Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>    Reporter: Kihwal Lee
>    Priority: Blocker
> Attachments: HDFS-7587.patch
[jira] [Updated] (HDFS-7587) Edit log corruption can happen if append fails with a quota violation
[ https://issues.apache.org/jira/browse/HDFS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-7587:
----------------------------
    Attachment: HDFS-7587.001.patch

Rebase Daryn's patch. Also make changes based on Nicholas's comments, i.e., verify the quota first and update the quota only after the action.

With the fix from HDFS-7943 we will not have blocks with a size greater than the preferred block size, so we can avoid "earning back" quota scenarios.

Truncate may have a similar issue when the data to truncate is only part of the original last block. I will update the patch later to fix this part.

> Edit log corruption can happen if append fails with a quota violation
> ---------------------------------------------------------------------
>
>         Key: HDFS-7587
>         URL: https://issues.apache.org/jira/browse/HDFS-7587
>     Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>    Reporter: Kihwal Lee
>    Priority: Blocker
> Attachments: HDFS-7587.001.patch, HDFS-7587.patch
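The ordering described in this update (verify the quota first, apply the action, then update the quota usage) could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of HDFS-7587.001.patch: VerifyThenMutateSketch, Dir, FileState, deltaForReopenedBlock, prepareForAppend, and tryAppend are hypothetical names. The sketch also assumes that a reopened last block is charged at its preferred block size and that space usage is multiplied by the replication factor.

```java
/** Toy model only; the names below are illustrative and are not the HDFS API. */
class VerifyThenMutateSketch {

    /** Hypothetical stand-in for a quota violation. */
    static final class QuotaExceededSketchException extends Exception {
        QuotaExceededSketchException(String message) {
            super(message);
        }
    }

    /** Directory-level space quota and current usage, in bytes. */
    static final class Dir {
        long quota;
        long used;

        Dir(long quota, long used) {
            this.quota = quota;
            this.used = used;
        }
    }

    /** The few file attributes this sketch needs. */
    static final class FileState {
        final long preferredBlockSize;
        final long lastBlockBytes;
        final short replication;
        boolean underConstruction;

        FileState(long preferredBlockSize, long lastBlockBytes, short replication) {
            this.preferredBlockSize = preferredBlockSize;
            this.lastBlockBytes = lastBlockBytes;
            this.replication = replication;
        }
    }

    /**
     * Extra space charged when the last block is reopened for append.
     * Assumption: the reopened block is counted at its preferred size, and the
     * block never exceeds that size, so the delta is never negative.
     */
    static long deltaForReopenedBlock(FileState file) {
        return (file.preferredBlockSize - file.lastBlockBytes) * file.replication;
    }

    /** Verify first, then mutate, then account for the usage. */
    static void prepareForAppend(Dir dir, FileState file) throws QuotaExceededSketchException {
        long delta = deltaForReopenedBlock(file);

        // Step 1: verification only. Nothing has been modified if this throws.
        if (dir.used + delta > dir.quota) {
            throw new QuotaExceededSketchException(
                "need " + delta + " bytes, only " + (dir.quota - dir.used) + " left");
        }

        // Step 2: the action (stands in for converting the inode and adding a lease).
        file.underConstruction = true;

        // Step 3: update the usage after the action has succeeded.
        dir.used += delta;
    }

    /** Convenience wrapper: returns true if the append could be prepared. */
    static boolean tryAppend(Dir dir, FileState file) {
        try {
            prepareForAppend(dir, file);
            return true;
        } catch (QuotaExceededSketchException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Dir dir = new Dir(1000, 900);
        FileState file = new FileState(128, 28, (short) 3);
        System.out.println("prepared: " + tryAppend(dir, file)
            + ", underConstruction: " + file.underConstruction
            + ", used: " + dir.used);
    }
}
```

Because the check runs before any state changes, a quota violation leaves the file closed and the usage untouched, so there is nothing to undo and nothing missing from the edit log.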
[jira] [Updated] (HDFS-7587) Edit log corruption can happen if append fails with a quota violation
[ https://issues.apache.org/jira/browse/HDFS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-7587:
----------------------------
    Attachment: HDFS-7587.002.patch

Add fix for truncate.

> Edit log corruption can happen if append fails with a quota violation
> ---------------------------------------------------------------------
>
>         Key: HDFS-7587
>         URL: https://issues.apache.org/jira/browse/HDFS-7587
>     Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>    Reporter: Kihwal Lee
>    Assignee: Jing Zhao
>    Priority: Blocker
> Attachments: HDFS-7587.001.patch, HDFS-7587.002.patch, HDFS-7587.patch
[jira] [Updated] (HDFS-7587) Edit log corruption can happen if append fails with a quota violation
[ https://issues.apache.org/jira/browse/HDFS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-7587:
----------------------------
    Attachment: HDFS-7587.003.patch

Thanks for the review, Nicholas! I have updated the patch to address your comments. I will separate the truncate fix into another jira.

bq. Non-copy-on-truncate OR Copy-on-truncate for upgrade but not snapshot: Quota usage count is decreased. No quota check is needed.

We may also need to check/update the quota here, since the current logic counts a UC block's storage usage using the preferred size.

> Edit log corruption can happen if append fails with a quota violation
> ---------------------------------------------------------------------
>
>         Key: HDFS-7587
>         URL: https://issues.apache.org/jira/browse/HDFS-7587
>     Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>    Reporter: Kihwal Lee
>    Assignee: Jing Zhao
>    Priority: Blocker
> Attachments: HDFS-7587.001.patch, HDFS-7587.002.patch, HDFS-7587.003.patch, HDFS-7587.patch
[jira] [Updated] (HDFS-7587) Edit log corruption can happen if append fails with a quota violation
[ https://issues.apache.org/jira/browse/HDFS-7587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-7587:
----------------------------
       Resolution: Fixed
    Fix Version/s: 2.7.0
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

Thanks again for the review, Nicholas. I've committed this to 2.7.

> Edit log corruption can happen if append fails with a quota violation
> ---------------------------------------------------------------------
>
>         Key: HDFS-7587
>         URL: https://issues.apache.org/jira/browse/HDFS-7587
>     Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>    Reporter: Kihwal Lee
>    Assignee: Jing Zhao
>    Priority: Blocker
>     Fix For: 2.7.0
> Attachments: HDFS-7587.001.patch, HDFS-7587.002.patch, HDFS-7587.003.patch, HDFS-7587.patch