[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully

2015-03-25 Thread Kihwal Lee (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380003#comment-14380003 ]

Kihwal Lee commented on HDFS-6353:
--

+1 The command and the protocol change look fine. Thanks for the patch, 
[~jingzhao]

 Handle checkpoint failure more gracefully
 -

 Key: HDFS-6353
 URL: https://issues.apache.org/jira/browse/HDFS-6353
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: namenode
Reporter: Suresh Srinivas
Assignee: Jing Zhao
 Attachments: HDFS-6353.000.patch, HDFS-6353.001.patch, 
 HDFS-6353.002.patch


 One of the failure patterns I have seen is, in some rare circumstances, due 
 to some inconsistency the secondary or standby fails to consume the editlog. 
 The only solution when this happens is to save the namespace at the current 
 active namenode. But sometimes when this happens, an unsuspecting admin might 
 end up restarting the namenode, requiring a more complicated solution to the 
 problem (such as ignoring editlog records that cannot be consumed, etc.).
 How about adding the following functionality:
 When the checkpointer (standby or secondary) fails to consume the editlog, 
 based on a configurable flag (on/off), let the active namenode know about 
 this failure. The active namenode can then enter safemode and save the 
 namespace. When in this type of safemode, the namenode UI also shows 
 information about the checkpoint failure and that it is saving the namespace. 
 Once the namespace is saved, the namenode can come out of safemode.
 This means service unavailability (even in an HA cluster). But it might be 
 worth it to avoid long startup times or the need for other manual fixes. 
 Thoughts?
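A minimal sketch of the proposed reaction on the active namenode, assuming a hypothetical on/off flag and NN interface — none of these names come from an actual patch:

```python
# Hypothetical flag and NN interface; purely illustrative.
def handle_checkpoint_failure(nn, flag_enabled):
    """Active NN reaction when the checkpointer cannot consume the editlog."""
    if not flag_enabled:
        return False
    nn.enter_safemode(reason="checkpoint failure; saving namespace")
    try:
        nn.save_namespace()   # write a fresh fsimage
    finally:
        nn.leave_safemode()   # availability restored once the image is saved
    return True

# Demo stub standing in for the active namenode.
class DemoNN:
    def __init__(self):
        self.calls = []
    def enter_safemode(self, reason):
        self.calls.append("enter")
    def save_namespace(self):
        self.calls.append("save")
    def leave_safemode(self):
        self.calls.append("leave")

nn = DemoNN()
print(handle_checkpoint_failure(nn, True), nn.calls)
```

The try/finally captures the trade-off discussed above: the NN is unavailable only for the duration of the save, and leaves safemode even if the save fails.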



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully

2015-03-24 Thread Jing Zhao (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378278#comment-14378278 ]

Jing Zhao commented on HDFS-6353:
-

Thanks for the review, Jitendra! 

One concern with the current approach is whether the dfsadmin commands may block 
the NN shutdown. But it looks like the current RPC timeout (1 minute by default) 
should be able to handle this. Another concern is that in an HA setup the 
saveNamespace command will be sent to both NNs. But I think this should be fine, 
since the checkpoint is made if and only if no checkpoint has been done during 
the past several checkpoint periods.

[~kihwal], do you have any further comments about the current approach/patch?



[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully

2015-03-20 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372484#comment-14372484 ]

Hadoop QA commented on HDFS-6353:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12706082/HDFS-6353.002.patch
  against trunk revision fe5c23b.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 14 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs 
hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal:

  org.apache.hadoop.tracing.TestTracing

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/10014//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-HDFS-Build/10014//console

This message is automatically generated.



[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully

2015-03-20 Thread Jitendra Nath Pandey (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372090#comment-14372090 ]

Jitendra Nath Pandey commented on HDFS-6353:


+1, the patch looks good to me.



[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully

2015-03-20 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372108#comment-14372108 ]

Hadoop QA commented on HDFS-6353:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12691823/HDFS-6353.001.patch
  against trunk revision 586348e.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HDFS-Build/10008//console

This message is automatically generated.



[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully

2015-01-12 Thread Hadoop QA (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274738#comment-14274738 ]

Hadoop QA commented on HDFS-6353:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12691823/HDFS-6353.001.patch
  against trunk revision 5188153.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 14 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 3 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs 
hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9195//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/9195//artifact/patchprocess/newPatchFindbugsWarningsbkjournal.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9195//console

This message is automatically generated.



[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully

2014-05-16 Thread Kihwal Lee (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999176#comment-13999176 ]

Kihwal Lee commented on HDFS-6353:
--

We run a monitoring tool that watches the name.dir for fsimages. If a new one 
does not appear within configured_checkpoint_interval * factor, it alerts 
operators. We could at least show it on the namenode UI. If we expose the 
last checkpoint time, interval (time & # of tx), and txid in jmx, javascript 
can take care of the rest.
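A staleness check along these lines can be sketched against a parsed response from the NN's /jmx servlet. The bean name, the exact metric name, and the default factor below are assumptions to verify against your Hadoop version:

```python
import time

# Hypothetical staleness check over a parsed /jmx response.
def checkpoint_is_stale(jmx, interval_sec, factor=2.0, now_ms=None):
    """True if no fsimage checkpoint landed within interval_sec * factor."""
    bean = next((b for b in jmx.get("beans", [])
                 if b.get("name", "").endswith("name=FSNamesystem")), None)
    if bean is None or "LastCheckpointTime" not in bean:
        return True  # cannot tell; treat as alert-worthy
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    return (now_ms - bean["LastCheckpointTime"]) > interval_sec * factor * 1000

# Canned response: checkpoint 3h old, configured interval 1h, factor 2.0.
sample = {"beans": [{"name": "Hadoop:service=NameNode,name=FSNamesystem",
                     "LastCheckpointTime": 1_000_000}]}
print(checkpoint_is_stale(sample, interval_sec=3600, factor=2.0,
                          now_ms=1_000_000 + 3 * 3600 * 1000))  # True
```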





[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully

2014-05-16 Thread Jing Zhao (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1426#comment-1426 ]

Jing Zhao commented on HDFS-6353:
-

Thanks for the comments, [~kihwal]. I think we already have LastCheckpointTime, 
MostRecentCheckpointTxId, and LastAppliedOrWrittenTxId exposed through jmx. 
Ambari already uses these for monitoring and alerting.
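A minimal way to pull those three values out of a /jmx response; the NN host/port and servlet path in the fetch helper are deployment-specific assumptions:

```python
import json
from urllib.request import urlopen

# The three metric names come from this thread; everything else is assumed.
CHECKPOINT_KEYS = ("LastCheckpointTime", "MostRecentCheckpointTxId",
                   "LastAppliedOrWrittenTxId")

def extract_checkpoint_metrics(beans, keys=CHECKPOINT_KEYS):
    """Collect the named attributes from whichever beans expose them."""
    found = {}
    for bean in beans:
        for key in keys:
            if key in bean:
                found[key] = bean[key]
    return found

def fetch_checkpoint_metrics(nn_http="http://namenode.example.com:50070"):
    # /jmx is the NN's built-in JSON view of its MBeans.
    with urlopen(nn_http + "/jmx") as resp:
        return extract_checkpoint_metrics(json.load(resp)["beans"])
```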



[jira] [Commented] (HDFS-6353) Handle checkpoint failure more gracefully

2014-05-15 Thread Jing Zhao (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998160#comment-13998160 ]

Jing Zhao commented on HDFS-6353:
-

Can this be integrated with HDFS-4923? If the SNN or SBN fails to do a 
checkpoint, the ANN will get notified and set a "save namespace on shutdown" 
flag to true. The flag will be reset to false if a checkpoint is done 
successfully afterwards. Otherwise, the NN will trigger saveNamespace when it 
is shut down by a careless admin (but without purging edits and the last 
checkpoint).
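The flag lifecycle described here can be sketched as follows; class and method names are made up for this example and are not the actual patch's:

```python
# Illustrative sketch of the "save namespace on shutdown" flag lifecycle.
class CheckpointGuard:
    def __init__(self):
        self.save_namespace_on_shutdown = False

    def on_checkpoint_failure(self):
        # SNN/SBN told the ANN it could not consume the editlog.
        self.save_namespace_on_shutdown = True

    def on_checkpoint_success(self):
        # A later checkpoint succeeded; the emergency save is no longer needed.
        self.save_namespace_on_shutdown = False

    def on_shutdown(self, save_namespace):
        # save_namespace: callable that writes a fresh fsimage, without
        # purging edits or the previous checkpoint.
        if self.save_namespace_on_shutdown:
            save_namespace()
```

Keeping the flag sticky until the next successful checkpoint is what protects against the "careless admin restart" scenario the comment describes.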
