[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully
[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380003#comment-14380003 ]

Kihwal Lee commented on HDFS-6353:
----------------------------------

+1 The command and the protocol change look fine. Thanks for the patch, [~jingzhao].

Handle checkpoint failure more gracefully
-----------------------------------------

Key: HDFS-6353
URL: https://issues.apache.org/jira/browse/HDFS-6353
Project: Hadoop HDFS
Issue Type: Sub-task
Components: namenode
Reporter: Suresh Srinivas
Assignee: Jing Zhao
Attachments: HDFS-6353.000.patch, HDFS-6353.001.patch, HDFS-6353.002.patch

One of the failure patterns I have seen is that, in some rare circumstances, due to some inconsistency the secondary or standby fails to consume the editlog. The only solution when this happens is to save the namespace at the current active namenode. But sometimes when this happens, an unsuspecting admin might end up restarting the namenode, requiring a more complicated solution to the problem (such as ignoring editlog records that cannot be consumed, etc.).

How about adding the following functionality: when the checkpointer (standby or secondary) fails to consume the editlog, let the active namenode know about this failure, controlled by a configurable flag (on/off). The active namenode can then enter safemode and save the namespace. While in this type of safemode, the namenode UI also shows information about the checkpoint failure and that the namenode is saving the namespace. Once the namespace is saved, the namenode can come out of safemode. This means service unavailability (even in an HA cluster), but it might be worth it to avoid long startup times or the need for other manual fixes. Thoughts?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
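The proposed flow can be sketched as below. This is only an illustration of the behavior described in the issue, not the committed HDFS implementation; all class, method, and flag names here are hypothetical stand-ins.

```python
# Sketch of the proposal (hypothetical names throughout): when the
# checkpointer reports that it failed to consume the editlog, the active
# NN enters safemode, saves the namespace, then leaves safemode.

class ActiveNameNode:
    def __init__(self, notify_on_checkpoint_failure=True):
        # stand-in for the proposed on/off configuration flag
        self.notify_on_checkpoint_failure = notify_on_checkpoint_failure
        self.in_safemode = False
        self.saved_namespaces = 0

    def save_namespace(self):
        # stand-in for saving a new fsimage, after which the edits that
        # the checkpointer could not replay are no longer needed
        self.saved_namespaces += 1

    def on_checkpoint_failure(self):
        """Called when the standby/secondary reports a checkpoint failure."""
        if not self.notify_on_checkpoint_failure:
            return
        self.in_safemode = True       # block writes while saving
        try:
            self.save_namespace()
        finally:
            self.in_safemode = False  # service resumes once the image is saved

nn = ActiveNameNode()
nn.on_checkpoint_failure()
print(nn.saved_namespaces, nn.in_safemode)  # 1 False
```

The `try`/`finally` mirrors the requirement that the NN come out of safemode once the save completes, so the unavailability window is bounded by the save itself.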
[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully
[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378278#comment-14378278 ]

Jing Zhao commented on HDFS-6353:
---------------------------------

Thanks for the review, Jitendra! One concern with the current approach is whether the dfsadmin commands may block the NN shutdown, but it looks like the current RPC timeout (1 min by default) should be able to handle this. Another concern is that for an HA setup the saveNamespace commands will be sent to both NNs. But I think this should be fine, since the checkpoint is made if and only if no checkpoint has been done during the past several checkpoint periods. [~kihwal], do you have any further comments about the current approach/patch?
[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully
[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372484#comment-14372484 ]

Hadoop QA commented on HDFS-6353:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12706082/HDFS-6353.002.patch
against trunk revision fe5c23b.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 14 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal:

org.apache.hadoop.tracing.TestTracing

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/10014//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10014//console

This message is automatically generated.
[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully
[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372090#comment-14372090 ]

Jitendra Nath Pandey commented on HDFS-6353:
--------------------------------------------

+1, the patch looks good to me.
[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully
[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372108#comment-14372108 ]

Hadoop QA commented on HDFS-6353:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12691823/HDFS-6353.001.patch
against trunk revision 586348e.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/10008//console

This message is automatically generated.
[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully
[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274738#comment-14274738 ]

Hadoop QA commented on HDFS-6353:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12691823/HDFS-6353.001.patch
against trunk revision 5188153.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 14 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:red}-1 findbugs{color}. The patch appears to introduce 3 new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9195//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/9195//artifact/patchprocess/newPatchFindbugsWarningsbkjournal.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9195//console

This message is automatically generated.
[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully
[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999176#comment-13999176 ]

Kihwal Lee commented on HDFS-6353:
----------------------------------

We run a monitoring tool that watches the name.dir for fsimages. If a new one does not appear within configured_checkpoint_interval * factor, it alerts operators. We could at least show it on the namenode UI. If we expose the last checkpoint time, interval (time and # of tx), and txid in jmx, javascript can take care of the rest.
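The staleness check described above can be sketched as follows. The metric names match the ones discussed in this thread (`LastCheckpointTime` etc.), but the exact JMX bean they hang off (`FSNamesystem` below) and the helper names are assumptions for illustration, not taken from the patch.

```python
# A minimal monitoring sketch: poll the NameNode's /jmx servlet and alert
# when no checkpoint has appeared within checkpoint_interval * factor.
# The bean query string is an assumption; verify it against your NN's /jmx.
import json
import time
from urllib.request import urlopen

def fetch_fsnamesystem(nn_http_addr):
    """Fetch the FSNamesystem bean from the NN's built-in JMX servlet."""
    url = (f"http://{nn_http_addr}/jmx"
           "?qry=Hadoop:service=NameNode,name=FSNamesystem")
    with urlopen(url) as resp:
        return json.load(resp)["beans"][0]

def checkpoint_is_stale(bean, checkpoint_interval_sec, factor=2.0, now=None):
    """True if the last checkpoint is older than interval * factor."""
    now = time.time() if now is None else now
    last_ms = bean["LastCheckpointTime"]  # assumed to be epoch millis
    return (now - last_ms / 1000.0) > checkpoint_interval_sec * factor

# Example against a canned bean (no live NameNode needed): last checkpoint
# at epoch 0, interval 1h, checked 10h later -> stale.
bean = {"LastCheckpointTime": 0}
print(checkpoint_is_stale(bean, 3600, factor=2.0, now=10 * 3600))  # True
```

A cron job or the namenode UI's javascript could run the same comparison; the point is only that the threshold is `configured_checkpoint_interval * factor`, exactly as the monitoring tool above does with fsimage files.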
[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully
[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1426#comment-1426 ]

Jing Zhao commented on HDFS-6353:
---------------------------------

Thanks for the comments, [~kihwal]. I think we already have LastCheckpointTime, MostRecentCheckpointTxId, and LastAppliedOrWrittenTxId exposed through jmx. Ambari already uses these for monitoring and alerting.
[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully
[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998160#comment-13998160 ]

Jing Zhao commented on HDFS-6353:
---------------------------------

Can this be integrated with HDFS-4923? So if the SNN or SBN fails to do a checkpoint, the ANN will get notified and set a "save namespace on shutdown" flag to true. The flag will be reset to false if a checkpoint is done successfully afterwards. Otherwise, the NN will trigger saveNamespace when it gets shut down by a careless admin (but without purging the edits and the last checkpoint).
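The flag lifecycle suggested above can be sketched as below. All names are hypothetical illustrations of the comment, not the actual HDFS-4923 integration.

```python
# Sketch (hypothetical names): the active NN remembers that the
# checkpointer failed, forgets it once a later checkpoint succeeds, and
# saves the namespace during shutdown only while the flag is still set.

class CheckpointFailureTracker:
    def __init__(self):
        self.save_namespace_on_shutdown = False

    def on_checkpoint_failed(self):
        # SNN/SBN reported it could not consume the editlog
        self.save_namespace_on_shutdown = True

    def on_checkpoint_succeeded(self):
        # a later successful checkpoint clears the flag
        self.save_namespace_on_shutdown = False

def shutdown(tracker, save_namespace):
    """Shutdown hook: save the namespace if a checkpoint failure is pending."""
    if tracker.save_namespace_on_shutdown:
        # per the comment: save, but do not purge the existing edits
        # and the last checkpoint
        save_namespace(purge_old=False)

t = CheckpointFailureTracker()
t.on_checkpoint_failed()
saved = []
shutdown(t, lambda purge_old: saved.append(purge_old))
print(saved)  # [False]
```

The `purge_old=False` argument stands in for the "without purging edits and the last checkpoint" caveat: the old image and edits stay on disk in case the forced save itself is suspect.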