[jira] [Updated] (HDFS-9119) Discrepancy between edit log tailing interval and RPC timeout for transitionToActive
[ https://issues.apache.org/jira/browse/HDFS-9119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du updated HDFS-9119:
-----------------------------
    Target Version/s:   (was: 2.8.0)

> Discrepancy between edit log tailing interval and RPC timeout for
> transitionToActive
> ------------------------------------------------------------------
>
>                 Key: HDFS-9119
>                 URL: https://issues.apache.org/jira/browse/HDFS-9119
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.7.1
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>         Attachments: HDFS-9119.00.patch
>
>
> {{EditLogTailer}} on the standby NameNode tails edits from the active NameNode
> every 2 minutes, but the {{transitionToActive}} RPC call has a timeout of
> 1 minute. If the active NameNode encounters a very intensive metadata workload
> (in particular, a lot of {{AddOp}} and {{MkDir}} operations to create new
> files and directories), the updates accumulated during the 2-minute edit log
> tailing interval are hard for the standby NameNode to catch up on within the
> 1-minute timeout window. If that happens, the FailoverController will time out
> and give up trying to transition the standby to active. The old active
> NameNode (ANN) will resume adding more edits. When the standby NameNode (SbNN)
> finally finishes catching up on the edits and tries to become active, it will
> crash.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
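The timing mismatch in the issue description can be illustrated with a little arithmetic. The sketch below is not taken from HDFS-9119.00.patch or from any HDFS class. It is a minimal back-of-the-envelope model in which only the 2-minute tailing interval and the 1-minute RPC timeout come from the report. The class name, the method names, and both throughput figures (edits written per second by the active NameNode, edits replayed per second by the standby) are hypothetical and chosen purely for illustration.

```java
import java.time.Duration;

/** Back-of-the-envelope model of the timing mismatch; hypothetical, not HDFS code. */
public final class FailoverTimingSketch {

    /** Worst-case number of edits the standby has not yet applied: one full tailing interval. */
    public static long worstCaseBacklog(Duration tailInterval, long activeEditsPerSecond) {
        return Math.multiplyExact(tailInterval.getSeconds(), activeEditsPerSecond);
    }

    /** Time the standby needs to replay the backlog, rounded up to whole seconds. */
    public static Duration catchUpTime(long backlogOps, long standbyReplayOpsPerSecond) {
        long seconds = (backlogOps + standbyReplayOpsPerSecond - 1) / standbyReplayOpsPerSecond;
        return Duration.ofSeconds(seconds);
    }

    /** True if the replay finishes before the transitionToActive RPC would time out. */
    public static boolean fitsInTimeout(Duration catchUp, Duration rpcTimeout) {
        return catchUp.compareTo(rpcTimeout) <= 0;
    }

    public static void main(String[] args) {
        Duration tailInterval = Duration.ofMinutes(2); // from the issue description
        Duration rpcTimeout = Duration.ofMinutes(1);   // from the issue description
        long activeEditsPerSecond = 10_000;            // hypothetical workload
        long standbyReplayOpsPerSecond = 15_000;       // hypothetical replay speed

        long backlog = worstCaseBacklog(tailInterval, activeEditsPerSecond);
        Duration catchUp = catchUpTime(backlog, standbyReplayOpsPerSecond);
        System.out.println("backlog=" + backlog + " ops, catchUp=" + catchUp.getSeconds()
                + "s, fits=" + fitsInTimeout(catchUp, rpcTimeout));
    }
}
```

With these assumed rates the standby faces a backlog of 1,200,000 edits and needs 80 seconds to replay them, which exceeds the 60-second timeout even though the standby replays faster than the active writes. Under this simplified model, the backlog fits only if the replay rate is at least twice the write rate.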
[jira] [Updated] (HDFS-9119) Discrepancy between edit log tailing interval and RPC timeout for transitionToActive
[ https://issues.apache.org/jira/browse/HDFS-9119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhe Zhang updated HDFS-9119:
----------------------------
    Attachment: HDFS-9119.00.patch
[jira] [Updated] (HDFS-9119) Discrepancy between edit log tailing interval and RPC timeout for transitionToActive
[ https://issues.apache.org/jira/browse/HDFS-9119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhe Zhang updated HDFS-9119:
----------------------------
    Status: Patch Available  (was: Open)

Submitting initial patch to trigger Jenkins. Will add a new test in the next rev.