[jira] [Commented] (HDFS-3217) ZKFC should restart NN when healthmonitor gets a SERVICE_NOT_RESPONDING exception
[ https://issues.apache.org/jira/browse/HDFS-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248621#comment-13248621 ]

Hari Mankude commented on HDFS-3217:
------------------------------------

bq. I disagree. It is an explicit decision to not have the ZKFC act as a service supervisor, because it adds a lot of complexity. There already exist lots of solutions for service management - we assume that the user is already using something like puppet, daemontools, supervisord, cron, etc, to make sure the daemon restarts eventually.

I did not find a reference to an external monitoring tool in the HA design docs, so apologies there. If the scanning interval of the external tool is significant, it might still make sense for the FC to restart the NN directly. With one of the NN processes down, the cluster is functioning in a degraded state, and the longer it takes to restart the standby NN process, the longer the recovery time is going to be.

> ZKFC should restart NN when healthmonitor gets a SERVICE_NOT_RESPONDING exception
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-3217
>                 URL: https://issues.apache.org/jira/browse/HDFS-3217
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: auto-failover, ha
>            Reporter: Hari Mankude
>            Assignee: Hari Mankude

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
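The scanning-interval argument in the comment above can be made concrete with a little arithmetic. The following is only a rough sketch: the function name `worst_case_downtime` and the 30-second startup time are illustrative assumptions, not values from the HA design docs or from any measurement. The one-minute figure reflects cron's minimum scheduling granularity.

```python
def worst_case_downtime(scan_interval_s: float, startup_s: float) -> float:
    """Upper bound on the time from process death to a running process again.

    Assumes a supervisor that polls every scan_interval_s seconds and
    restarts immediately on detecting the failure. In the worst case the
    process dies just after a scan, so a full interval passes before the
    failure is noticed.
    """
    return scan_interval_s + startup_s


# Illustrative numbers only: a 30 s startup time is an assumption.
for label, interval_s in [
    ("cron job running once a minute", 60.0),
    ("hypothetical 1 s in-process health check", 1.0),
]:
    print(label, worst_case_downtime(interval_s, startup_s=30.0), "seconds")
```

Under these assumptions the detection delay matters only when the scan interval is large compared with the process startup time.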
[jira] [Commented] (HDFS-3217) ZKFC should restart NN when healthmonitor gets a SERVICE_NOT_RESPONDING exception
[ https://issues.apache.org/jira/browse/HDFS-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248589#comment-13248589 ]

Todd Lipcon commented on HDFS-3217:
-----------------------------------

I disagree. It is an explicit decision to not have the ZKFC act as a service supervisor, because it adds a lot of complexity. There already exist lots of solutions for service management - we assume that the user is already using something like puppet, daemontools, supervisord, cron, etc, to make sure the daemon restarts eventually.
[jira] [Commented] (HDFS-3217) ZKFC should restart NN when healthmonitor gets a SERVICE_NOT_RESPONDING exception
[ https://issues.apache.org/jira/browse/HDFS-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248504#comment-13248504 ]

Hari Mankude commented on HDFS-3217:
------------------------------------

ZKFC should restart the NN when it sees a SERVICE_NOT_RESPONDING exception. The NN might have aborted due to loss of quorum, and unless there is manual intervention, it will not be restarted.
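To illustrate the behavior proposed in this comment, here is a minimal sketch of a monitor loop that triggers a restart when health checks report SERVICE_NOT_RESPONDING. It is not ZKFC code and does not use any Hadoop API: the `State` enum, the `monitor` function, the `restart` callback, and the `threshold` parameter are all hypothetical names introduced for this example.

```python
from enum import Enum
from typing import Callable, Iterable


class State(Enum):
    """Hypothetical health states; only the second name appears in the issue."""
    SERVICE_HEALTHY = "SERVICE_HEALTHY"
    SERVICE_NOT_RESPONDING = "SERVICE_NOT_RESPONDING"


def monitor(states: Iterable[State],
            restart: Callable[[], None],
            threshold: int = 1) -> int:
    """Call restart() after `threshold` consecutive failed health checks.

    Returns the number of restarts that were triggered.
    """
    misses = 0
    restarts = 0
    for state in states:
        if state is State.SERVICE_NOT_RESPONDING:
            misses += 1
            if misses >= threshold:
                restart()      # hypothetical hook, e.g. run a start script
                restarts += 1
                misses = 0     # start counting again after the restart
        else:
            misses = 0         # a healthy check resets the counter
    return restarts
```

A threshold above 1 is one way to avoid restarting on a single slow response. The sketch leaves out questions such as restart back-off and fencing, which any real supervisor would also have to handle.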