We have analyzed the exit patterns of all the other request processors and could not find this pattern in any of them.
We found that this was introduced as part of ZOOKEEPER-121. Calling System.exit from a thread and then calling join() on that same thread (from the shutdown hook) is what causes this hang. I've also gone through Ted's earlier response on the disk-full scenario:
http://mail-archives.apache.org/mod_mbox/zookeeper-user/201106.mbox/%3CBANLkTimzQjXZvDKnP6xQLF9jHfsaz6JstA@mail.gmail.com%3E

We feel that even when one cluster member's disk is full, the complete service should not be interrupted. So we have raised a new JIRA for this issue:
https://issues.apache.org/jira/browse/ZOOKEEPER-1109

-----Original Message-----
From: Laxman
Sent: Wednesday, June 22, 2011 1:54 PM
To: [email protected]
Subject: Zookeeper service is down when Leader disk is full

Hi Everyone,

We found an issue while testing the disk-full scenario. Please validate our observations; we will log an issue if they are found to be valid.

Problem:
ZooKeeper does not shut down completely when the dataDir disk is full, and the ZK cluster goes into an unserviceable state.

Version:
ZooKeeper 3.3.3

Scenario:
When the leader ZooKeeper's disk is made full, ZooKeeper tries to shut down, but it waits indefinitely while shutting down the SyncRequestProcessor thread.

Root Cause:
this.join() is invoked on the same thread that triggered System.exit(11).
When the disk fills up, the SyncRequestProcessor thread gets the exception 'No space left on device' and invokes System.exit(11) (the logs below show this). Before the JVM exits, it runs the QuorumPeerMain shutdown hook, and that flow reaches SyncRequestProcessor.shutdown(). There, this.join() waits for the SyncRequestProcessor thread to finish, but that thread is itself blocked inside System.exit(11), waiting for the shutdown hook to complete. Each thread waits on the other, so neither ever finishes.

Thread dumps:
The following thread dump shows the QuorumPeerMain shutdown hook thread (Thread-2) waiting indefinitely inside SyncRequestProcessor.shutdown(), while SyncThread:2 waits inside System.exit for that hook to finish:
"Thread-2" prio=10 tid=0x0810a400 nid=0x1695 in Object.wait() [0xac783000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0xb804f5e8> (a org.apache.zookeeper.server.SyncRequestProcessor)
    at java.lang.Thread.join(Thread.java:1143)
    - locked <0xb804f5e8> (a org.apache.zookeeper.server.SyncRequestProcessor)
    at java.lang.Thread.join(Thread.java:1196)
    at org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:171)
    at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.shutdown(ProposalRequestProcessor.java:79)
    at org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:513)
    at org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:413)
    at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:411)
    at org.apache.zookeeper.server.quorum.QuorumPeer.shutdown(QuorumPeer.java:694)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain$1.run(QuorumPeerMain.java:126)

"SyncThread:2" prio=10 tid=0xad7fd800 nid=0x4acb in Object.wait() [0xac9ba000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0xb8030d00> (a org.apache.zookeeper.server.quorum.QuorumPeerMain$1)
    at java.lang.Thread.join(Thread.java:1143)
    - locked <0xb8030d00> (a org.apache.zookeeper.server.quorum.QuorumPeerMain$1)
    at java.lang.Thread.join(Thread.java:1196)
    at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79)
    at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24)
    at java.lang.Shutdown.runHooks(Shutdown.java:79)
    at java.lang.Shutdown.sequence(Shutdown.java:123)
    at java.lang.Shutdown.exit(Shutdown.java:168)
    - locked <0xf01ff3b0> (a java.lang.Class for java.lang.Shutdown)
    at java.lang.Runtime.exit(Runtime.java:90)
    at java.lang.System.exit(System.java:904)
    at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:149)

Logs:
2011-06-21 10:09:59,730 - FATAL [SyncThread:2:SyncRequestProcessor@148] - Severe unrecoverable error, exiting
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:260)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:305)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:324)
    at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484)
    at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:158)
    at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:98)
2011-06-21 10:09:59,732 - INFO [Thread-2:QuorumPeer@691] - The Quorum server is going for shutdown
2011-06-21 10:09:59,732 - INFO [Thread-2:Leader@393] - Shutdown called
java.lang.Exception: shutdown Leader! reason: quorum Peer shutdown
    at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:393)
    at org.apache.zookeeper.server.quorum.QuorumPeer.shutdown(QuorumPeer.java:694)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain$1.run(QuorumPeerMain.java:126)
2011-06-21 10:09:59,733 - INFO [Thread-6:Leader$LearnerCnxAcceptor@243] - exception while shutting down acceptor: java.net.SocketException: Socket closed
2011-06-21 10:09:59,758 - INFO [ProcessThread:-1:PrepRequestProcessor@120] - PrepRequestProcessor exited loop!
2011-06-21 10:09:59,758 - INFO [CommitProcessor:2:CommitProcessor@150] - CommitProcessor exited loop!
2011-06-21 10:09:59,759 - INFO [Thread-2:FinalRequestProcessor@379] - shutdown of request processor complete
2011-06-21 10:10:00,000 - INFO [SessionTracker:SessionTrackerImpl@165] - SessionTrackerImpl exited loop!

--
Thanks,
Laxman
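The circular wait in the root cause can be reproduced in miniature outside ZooKeeper. The sketch below is a hypothetical model, not ZooKeeper code: ShutdownJoinDemo, Processor, the exitingFromSelf flag and simulate() are invented for illustration. The worker thread plays SyncThread:2; instead of calling System.exit (which would kill the JVM running the model), it starts a stand-in hook thread and joins it, the way System.exit waits in ApplicationShutdownHooks.runHooks in the SyncThread:2 dump. The hook calls shutdown(), which joins the worker, as SyncRequestProcessor.shutdown() does in the Thread-2 dump. The unguarded join uses a timeout only so the model terminates. The guard shows one possible mitigation, assuming the processor records that it is the thread initiating exit; it is not necessarily the fix adopted for ZOOKEEPER-1109.

```java
/**
 * Hypothetical model of the shutdown hang (not ZooKeeper code). No real
 * System.exit() is called, so the model can run inside an ordinary JVM.
 */
public class ShutdownJoinDemo {

    /** Stands in for SyncRequestProcessor: a Thread whose shutdown() joins itself. */
    static class Processor extends Thread {
        volatile boolean exitingFromSelf = false; // hypothetical flag used by the guard
        final boolean guarded;                    // whether to apply the hypothetical guard
        final long joinTimeoutMs;                 // only so the unguarded model terminates
        Thread hook;                              // stands in for QuorumPeerMain's shutdown hook

        Processor(boolean guarded, long joinTimeoutMs) {
            super("SyncThread-model");
            this.guarded = guarded;
            this.joinTimeoutMs = joinTimeoutMs;
        }

        @Override
        public void run() {
            // Fatal error path (e.g. "No space left on device"): record that this
            // thread is the one initiating exit, then run the "shutdown hook" and
            // wait for it, as System.exit() waits for the real shutdown hooks.
            exitingFromSelf = true;
            hook.start();
            try {
                hook.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        /** Called from the hook thread; returns false if the join gave up on a live thread. */
        boolean shutdown() {
            if (guarded && exitingFromSelf) {
                // This thread is blocked waiting for the very hook that is calling
                // us, so joining it could never succeed: skip the join.
                return true;
            }
            try {
                // Original pattern: wait for this thread to die. The real code has
                // no timeout and therefore waits forever.
                this.join(joinTimeoutMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return !this.isAlive();
        }
    }

    /** Runs one failure/shutdown cycle and reports whether shutdown() completed. */
    public static boolean simulate(boolean guarded) {
        Processor p = new Processor(guarded, 500);
        boolean[] completed = new boolean[1];
        p.hook = new Thread(() -> completed[0] = p.shutdown(), "shutdown-hook-model");
        p.start();
        try {
            p.join(); // returns in both cases; the unguarded case only after the timeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed[0];
    }

    public static void main(String[] args) {
        System.out.println("original pattern completed: " + simulate(false));
        System.out.println("guarded pattern completed: " + simulate(true));
    }
}
```

Running main should print false for the original pattern (once the 500 ms join timeout expires with the worker still alive) and true for the guarded one. In the real server the join has no timeout, which is why the original pattern never returns.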
