[ https://issues.apache.org/jira/browse/HADOOP-11328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223378#comment-14223378 ]
Stephen Chu commented on HADOOP-11328:
--------------------------------------

Thanks, Tianyin. I agree the log will be helpful.

+1 (non-binding)

> ZKFailoverController.java does not log Exception and causes latent problems during failover
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11328
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11328
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.5.1
>            Reporter: Tianyin Xu
>         Attachments: ZKFailoverController.log.exception.1.patch
>
>
> In _ZKFailoverController.java_, the _Exception_ caught by the _run()_ method
> is never logged. This causes latent problems that are only manifested during
> failover.
> h5. The problem we encountered
> An _Exception_ is thrown from the _doRun()_ method during _initHM()_ (caused
> by a configuration error). To reproduce it, set
> "_ha.health-monitor.connect-retry-interval.ms_" to any nonsensical value.
> {code:title=ZKFailoverController.java|borderStyle=solid}
>   private int doRun(String[] args)
>     ...
>     initRPC();
>     initHM();
>     startRPC();
>     ....
>   }
> {code}
> The Exception is caught in the _run()_ method, as follows:
> {code:title=ZKFailoverController.java|borderStyle=solid}
>   public int run(final String[] args) throws Exception {
>     ...
>     try {
>       ...
>         @Override
>         public Integer run() {
>           try {
>             return doRun(args);
>           } catch (Exception t) {
>             throw new RuntimeException(t);
>           } finally {
>             if (elector != null) {
>               elector.terminateConnection();
>             }
>           }
>         }
>       });
>     } catch (RuntimeException rte) {
>       throw (Exception)rte.getCause();
>     }
>   }
> {code}
> Unfortunately, the Exception (which causes the process to shut down) is *not
> logged at all*. This causes a latent error that is only manifested during
> failover (because ZKFC is dead).
> The tricky thing here is that everything
> looks perfectly fine: the _jps_ command shows a running
> DFSZKFailoverController process and the two NameNodes (active and standby)
> work fine.
> h5. Patch
> We strongly suggest adding an error log to report the caught exception, such as:
> --- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (revision 1641307)
> +++ hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (working copy)
> {code:title=@@ -178,6 +178,7 @@|borderStyle=solid}
>        }
>      });
>    } catch (RuntimeException rte) {
> +    LOG.fatal("The failover controller encounters runtime error: " + rte);
>      throw (Exception)rte.getCause();
>    }
>  }
> {code}
> Thanks!
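
For anyone who wants to try the log-then-rethrow pattern outside Hadoop, here is a minimal standalone sketch. It is not the attached patch: the class name FailoverRunnerSketch and the helper runWrapped are hypothetical, and java.util.logging stands in for the LOG object used in ZKFailoverController. One deliberate difference from the proposed one-liner is that the sketch passes the underlying cause to the logger as a Throwable, so the stack trace is recorded rather than only the exception's string form.

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical stand-in for the run()/doRun() pattern; not Hadoop code. */
class FailoverRunnerSketch {
    // Assumption: java.util.logging replaces the logger used by ZKFailoverController.
    static final Logger LOG = Logger.getLogger(FailoverRunnerSketch.class.getName());

    /** Runs the body and logs the underlying failure before rethrowing it. */
    static int run(Callable<Integer> doRun) throws Exception {
        try {
            return runWrapped(doRun);
        } catch (RuntimeException rte) {
            // Unwrap the original exception so the log shows the real failure.
            Throwable cause = rte.getCause() != null ? rte.getCause() : rte;
            LOG.log(Level.SEVERE, "The failover controller encountered a runtime error", cause);
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw rte;
        }
    }

    /** Mirrors the inner run(): any checked exception is wrapped in a RuntimeException. */
    private static int runWrapped(Callable<Integer> doRun) {
        try {
            return doRun.call();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        try {
            // Simulates a startup failure caused by a bad configuration value.
            run(() -> { throw new IllegalArgumentException("bad connect-retry-interval"); });
        } catch (Exception e) {
            System.out.println("rethrown: " + e.getMessage());
        }
    }
}
```

With this shape, a startup failure leaves a SEVERE record with the original exception attached before the process exits, which is the information that was missing in the scenario described in the issue.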