[ 
https://issues.apache.org/jira/browse/HDFS-6353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999176#comment-13999176
 ] 

Kihwal Lee commented on HDFS-6353:
----------------------------------

We run a monitoring tool that watches the name.dir for fsimages. If a new one 
does not appear within configured_checkpoint_interval * factor, it alerts 
operators.  We could at least show this on the namenode UI.  If we expose the 
last checkpoint time, the checkpoint interval (time & # of txns) and the txid 
in JMX, JavaScript can take care of the rest.
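
A rough sketch of the kind of check such a monitoring tool might perform is 
below. It is not the actual tool. The function names, the "fsimage_<txid>" 
file-name pattern, and the use of file modification time as the checkpoint 
time are all assumptions made for illustration.

```python
import os
import re

# Assumption: checkpoint images are files named "fsimage_<txid>" (digits only).
# Companion files such as "fsimage_<txid>.md5" are deliberately not matched.
FSIMAGE_RE = re.compile(r"^fsimage_(\d+)$")


def newest_fsimage(image_dir):
    """Return (txid, mtime) of the fsimage with the highest txid, or None."""
    best = None
    for name in os.listdir(image_dir):
        match = FSIMAGE_RE.match(name)
        if match is None:
            continue
        txid = int(match.group(1))
        mtime = os.path.getmtime(os.path.join(image_dir, name))
        if best is None or txid > best[0]:
            best = (txid, mtime)
    return best


def checkpoint_overdue(last_checkpoint_time, now, interval_secs, factor):
    """True if no checkpoint has appeared within interval_secs * factor."""
    return (now - last_checkpoint_time) > interval_secs * factor
```

A caller would pass the mtime returned by newest_fsimage() together with 
time.time(), the configured checkpoint interval in seconds, and a tolerance 
factor, and alert when checkpoint_overdue() returns True. The same comparison 
could be done in the UI if the last checkpoint time and interval were exposed.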



> Handle checkpoint failure more gracefully
> -----------------------------------------
>
>                 Key: HDFS-6353
>                 URL: https://issues.apache.org/jira/browse/HDFS-6353
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>            Reporter: Suresh Srinivas
>            Assignee: Jing Zhao
>
> One of the failure patterns I have seen is that, in some rare circumstances, 
> the secondary or standby fails to consume the editlog due to some 
> inconsistency. The only solution when this happens is to save the namespace 
> at the current active namenode. But sometimes when this happens, an 
> unsuspecting admin might end up restarting the namenode, requiring a more 
> complicated solution to the problem (such as ignoring the editlog record 
> that cannot be consumed, etc.).
> How about adding the following functionality:
> When the checkpointer (standby or secondary) fails to consume the editlog, 
> it can, based on a configurable flag (on/off), let the active namenode know 
> about this failure. The active namenode can then enter safemode and save the 
> namespace. While in this type of safemode, the namenode UI also shows 
> information about the checkpoint failure and that it is saving the 
> namespace. Once the namespace is saved, the namenode can come out of 
> safemode.
> This means service unavailability (even in HA cluster). But it might be worth 
> it to avoid long startup times or need for other manual fixes. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
