[ https://issues.apache.org/jira/browse/HDFS-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272226#comment-16272226 ]
Brahma Reddy Battula commented on HDFS-12834:
---------------------------------------------

Sure, will look into HDFS-11751. Sorry for the delay.

> DFSZKFailoverController on error exits with 0 error code
> --------------------------------------------------------
>
> Key: HDFS-12834
> URL: https://issues.apache.org/jira/browse/HDFS-12834
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.7.3, 3.0.0-alpha4
> Reporter: Zbigniew Kostrzewa
> Assignee: Bharat Viswanadham
> Attachments: HDFS-12834.00.patch, HDFS-12834.01.patch
>
>
> On error {{DFSZKFailoverController}} exits with 0 return code which leads to
> problems when integrating it with scripts and monitoring tools, e.g. systemd,
> which when configured to restart the service only on failure does not restart
> ZKFC because it exited with 0.
> For example, in my case, systemd reported zkfc exited with success but in
> logs I have found this:
> {noformat}
> 2017-11-14 05:33:55,075 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x15fb794bd240001, closing socket connection and attempting reconnect
> 2017-11-14 05:33:55,178 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
> 2017-11-14 05:33:55,564 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.9.4.73/10.9.4.73:2182. Will not attempt to authenticate using SASL (unknown error)
> 2017-11-14 05:33:55,566 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.9.4.73/10.9.4.73:2182, initiating session
> 2017-11-14 05:33:55,569 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.9.4.73/10.9.4.73:2182, sessionid = 0x15fb794bd240001, negotiated timeout = 5000
> 2017-11-14 05:33:55,570 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
> 2017-11-14 05:33:58,230 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x15fb794bd240001, likely server has closed socket, closing socket connection and attempting reconnect
> 2017-11-14 05:33:58,335 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
> 2017-11-14 05:33:58,402 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.9.4.138/10.9.4.138:2181. Will not attempt to authenticate using SASL (unknown error)
> 2017-11-14 05:33:58,403 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.9.4.138/10.9.4.138:2181, initiating session
> 2017-11-14 05:33:58,406 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x15fb794bd240001, likely server has closed socket, closing socket connection and attempting reconnect
> 2017-11-14 05:33:59,218 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.9.4.228/10.9.4.228:2183. Will not attempt to authenticate using SASL (unknown error)
> 2017-11-14 05:33:59,219 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.9.4.228/10.9.4.228:2183, initiating session
> 2017-11-14 05:33:59,221 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x15fb794bd240001, likely server has closed socket, closing socket connection and attempting reconnect
> 2017-11-14 05:34:01,094 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.9.4.73/10.9.4.73:2182. Will not attempt to authenticate using SASL (unknown error)
> 2017-11-14 05:34:01,094 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 1773ms for sessionid 0x15fb794bd240001, closing socket connection and attempting reconnect
> 2017-11-14 05:34:01,196 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
> 2017-11-14 05:34:02,153 INFO org.apache.zookeeper.ZooKeeper: Session: 0x15fb794bd240001 closed
> 2017-11-14 05:34:02,154 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
> 2017-11-14 05:34:02,154 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2017-11-14 05:34:05,208 INFO org.apache.hadoop.ipc.Server: Stopping server on 8019
> 2017-11-14 05:34:05,487 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8019
> 2017-11-14 05:34:05,488 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
> 2017-11-14 05:34:05,487 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
> 2017-11-14 05:34:05,488 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
> 2017-11-14 05:34:05,490 FATAL org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Got a fatal error, exiting now
> java.lang.RuntimeException: ZK Failover Controller failed: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
>         at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:369)
>         at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:238)
>         at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
>         at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
>         at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
>         at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
>         at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
>         at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:181)
> {noformat}
> The code that seems responsible is in {{DFSZKFailoverController.java}}:
> {code}
> public static void main(String args[])
>     throws Exception {
>   ...
>   int retCode = 0;
>   try {
>     retCode = zkfc.run(parser.getRemainingArgs());
>   } catch (Throwable t) {
>     LOG.fatal("Got a fatal error, exiting now", t);
>   }
>   System.exit(retCode);
> }
> {code}
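
The exit-code behaviour reported in the quoted description can be reproduced in isolation. The following is a minimal sketch, not the contents of the attached patches: the class name `ExitCodeSketch`, both method names, and the failure value `1` are hypothetical, and a `Callable<Integer>` stands in for `zkfc.run(...)`. It contrasts the reported pattern (return code initialised to 0 and left untouched when an exception is caught) with one possible alternative that starts from a failure value.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration only; not the code from the HDFS-12834 patches.
public class ExitCodeSketch {

  // Mirrors the reported main(): retCode stays 0 when run() throws,
  // so the process would exit "successfully" after a fatal error.
  public static int exitCodeAsReported(Callable<Integer> run) {
    int retCode = 0;
    try {
      retCode = run.call();
    } catch (Throwable t) {
      System.err.println("Got a fatal error, exiting now: " + t);
    }
    return retCode;
  }

  // One possible alternative (an assumption, not the committed fix):
  // start from a failure code so a caught exception yields non-zero.
  public static int exitCodeFailingByDefault(Callable<Integer> run) {
    int retCode = 1;
    try {
      retCode = run.call();
    } catch (Throwable t) {
      System.err.println("Got a fatal error, exiting now: " + t);
    }
    return retCode;
  }

  public static void main(String[] args) {
    // Stand-in for a run() that fails the way the stack trace shows.
    Callable<Integer> failing = () -> {
      throw new RuntimeException("ZK Failover Controller failed");
    };
    System.out.println("as reported:        " + exitCodeAsReported(failing));
    System.out.println("failing by default: " + exitCodeFailingByDefault(failing));
  }
}
```

A real `main` would pass the returned value to `System.exit`; with a non-zero value, a supervisor configured to restart only on failure would then treat the ZKFC exit as a failure.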