[jira] [Commented] (HDFS-3217) ZKFC should restart NN when healthmonitor gets a SERVICE_NOT_RESPONDING exception

2012-04-06 Thread Hari Mankude (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248621#comment-13248621
 ] 

Hari Mankude commented on HDFS-3217:


bq. I disagree. It is an explicit decision to not have the ZKFC act as a service 
supervisor, because it adds a lot of complexity. There already exist lots of 
solutions for service management - we assume that the user is already using 
something like puppet, daemontools, supervisord, cron, etc, to make sure the 
daemon restarts eventually.

I did not find a reference to an external monitoring tool in the HA design 
docs, so apologies there. If the scanning interval of the external tool is 
significant, it might still make sense for the FC to restart the NN directly. 
With one of the NN processes down, the cluster is functioning in a degraded 
state, and the longer it takes to restart the standby NN process, the longer 
the recovery time is going to be.
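
To make the scanning-interval point concrete, here is a rough sketch of the worst-case window during which the NN process stays down. The function names and all numbers are hypothetical and chosen only for illustration; they are not measurements from a real cluster or values from the HA design docs.

```python
def degraded_window_polling(poll_interval_s, restart_time_s):
    """Worst-case downtime when an external tool polls for a dead process.

    Assumes the process dies right after a poll, so the failure is noticed
    only one full interval later, and the restart then takes restart_time_s.
    """
    return poll_interval_s + restart_time_s


def degraded_window_direct(detect_time_s, restart_time_s):
    """Downtime when the component that detects the failure also restarts."""
    return detect_time_s + restart_time_s


# Hypothetical inputs: a 5-minute scan interval versus detection within
# 5 seconds, with the same 60-second process restart in both cases.
print(degraded_window_polling(300, 60))  # 360
print(degraded_window_direct(5, 60))     # 65
```

With a short scan interval the two numbers converge, which is why the interval of the external tool is the deciding factor here.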



> ZKFC should restart NN when healthmonitor gets a SERVICE_NOT_RESPONDING 
> exception
> -
>
> Key: HDFS-3217
> URL: https://issues.apache.org/jira/browse/HDFS-3217
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: auto-failover, ha
>Reporter: Hari Mankude
>Assignee: Hari Mankude
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3217) ZKFC should restart NN when healthmonitor gets a SERVICE_NOT_RESPONDING exception

2012-04-06 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248589#comment-13248589
 ] 

Todd Lipcon commented on HDFS-3217:
---

I disagree. It is an explicit decision to not have the ZKFC act as a service 
supervisor, because it adds a lot of complexity. There already exist lots of 
solutions for service management - we assume that the user is already using 
something like puppet, daemontools, supervisord, cron, etc, to make sure the 
daemon restarts eventually.
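
As a rough illustration of what such an external supervisor does, here is a minimal polling loop. This is only a sketch under assumptions: the `supervise` function, its parameters, and the restart cap are hypothetical, and it does not reproduce how puppet, daemontools, supervisord, or cron actually work.

```python
import subprocess
import sys
import time


def supervise(cmd, poll_interval=1.0, max_restarts=3):
    """Start cmd and restart it whenever it exits, up to max_restarts times.

    Returns the number of restarts performed. A real supervisor would
    normally run without a restart cap; the cap keeps this sketch finite.
    """
    restarts = 0
    proc = subprocess.Popen(cmd)
    while True:
        # poll() returns None while the child process is still running.
        if proc.poll() is not None:
            if restarts >= max_restarts:
                return restarts
            restarts += 1
            proc = subprocess.Popen(cmd)
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Stand-in for a daemon: a child process that exits immediately.
    count = supervise([sys.executable, "-c", "pass"],
                      poll_interval=0.1, max_restarts=2)
    print("restarts:", count)
```

The loop notices a dead process only at the next poll, so `poll_interval` bounds how long the daemon can stay down before a restart is attempted.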







[jira] [Commented] (HDFS-3217) ZKFC should restart NN when healthmonitor gets a SERVICE_NOT_RESPONDING exception

2012-04-06 Thread Hari Mankude (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248504#comment-13248504
 ] 

Hari Mankude commented on HDFS-3217:


ZKFC should restart the NN when it sees a SERVICE_NOT_RESPONDING exception. 
The NN might have aborted due to loss of quorum, and unless there is manual 
intervention, the NN will not be restarted.


