Hi all,

In my topology, I observe that one of the supervisor machines gets repeatedly disconnected from ZooKeeper, and the ZooKeeper server prints the following error:
EndOfStreamException: Unable to read additional data from client sessionid 0x146193a4b70073d, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:662)
2014-05-20 06:51:20,631 [myid:] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /204.13.85.2:37938 which had sessionid 0x146193a4b70073d
2014-05-20 06:51:20,631 [myid:] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x146193a4b700741, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:662)
2014-05-20 06:51:20,632 [myid:] - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /204.13.85.2:37942 which had sessionid 0x146193a4b700741
2014-05-20 06:51:20,634 [myid:] - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception

Alongside the ZooKeeper errors above, the supervisor log prints the following:

2014-05-20 06:59:33 b.s.d.supervisor [INFO] dfa06019-0c29-4782-94da-c37fcc75243d still hasn't started
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Worker dfa06019-0c29-4782-94da-c37fcc75243d failed to start
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Worker 4677c74c-8239-4cd3-8ff7-c95c3724e40e failed to start
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Worker 39c70558-c144-4da6-b685-841d7a531ec0 failed to start
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Worker 983c05ff-107e-483c-97e6-bb5c309606ec failed to start
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down and clearing state for id 39c70558-c144-4da6-b685-841d7a531ec0. Current supervisor time: 1400594374. State: :not-started, Heartbeat: nil
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down 5dd6583a-a5a4-4d76-8797-e885eacdf18f:39c70558-c144-4da6-b685-841d7a531ec0
2014-05-20 06:59:34 b.s.util [INFO] Error when trying to kill 25682. Process is probably already dead.
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shut down 5dd6583a-a5a4-4d76-8797-e885eacdf18f:39c70558-c144-4da6-b685-841d7a531ec0
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down and clearing state for id 983c05ff-107e-483c-97e6-bb5c309606ec. Current supervisor time: 1400594374. State: :not-started, Heartbeat: nil
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down 5dd6583a-a5a4-4d76-8797-e885eacdf18f:983c05ff-107e-483c-97e6-bb5c309606ec
2014-05-20 06:59:34 b.s.util [INFO] Error when trying to kill 25684. Process is probably already dead.
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shut down 5dd6583a-a5a4-4d76-8797-e885eacdf18f:983c05ff-107e-483c-97e6-bb5c309606ec
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down and clearing state for id 4677c74c-8239-4cd3-8ff7-c95c3724e40e. Current supervisor time: 1400594374. State: :not-started, Heartbeat: nil
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down 5dd6583a-a5a4-4d76-8797-e885eacdf18f:4677c74c-8239-4cd3-8ff7-c95c3724e40e
2014-05-20 06:59:34 b.s.util [INFO] Error when trying to kill 25680. Process is probably already dead.
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shut down 5dd6583a-a5a4-4d76-8797-e885eacdf18f:4677c74c-8239-4cd3-8ff7-c95c3724e40e
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down and clearing state for id dfa06019-0c29-4782-94da-c37fcc75243d. Current supervisor time: 1400594374. State: :not-started, Heartbeat: nil
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shutting down 5dd6583a-a5a4-4d76-8797-e885eacdf18f:dfa06019-0c29-4782-94da-c37fcc75243d
2014-05-20 06:59:34 b.s.util [INFO] Error when trying to kill 25679. Process is probably already dead.
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Shut down 5dd6583a-a5a4-4d76-8797-e885eacdf18f:dfa06019-0c29-4782-94da-c37fcc75243d
2014-05-20 06:59:34 b.s.d.supervisor [INFO] Launching worker with assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "LatencyMeasureTopology-17-1400594223", :executors [[4 4]]} for this supervisor 5dd6583a-a5a4-4d76-8797-e885eacdf18f on port 6700 with id 05c9e509-c29f-4310-b959-02f083224518

What is going wrong here? I suspect this is a heartbeat-expiry issue. If so, which parameters should I tweak to avoid it?

Thanks,
Sajith.
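For reference, these are the timeout-related settings in storm.yaml that I think are relevant; the values below are only illustrative (defaults vary by Storm version), and I am not sure which of them are the right knobs:

```yaml
# storm.yaml -- illustrative values, not recommendations

# How long a worker may take to come up before the supervisor gives up
# on it (the "failed to start" messages above). Typically 120 by default.
supervisor.worker.start.timeout.secs: 120

# How long a running worker may miss heartbeats before the supervisor
# restarts it.
supervisor.worker.timeout.secs: 30

# ZooKeeper client session/connection timeouts, in milliseconds. If the
# session timeout is shorter than a long GC pause on the client, the
# session expires and the EndOfStreamException / reconnect cycle above
# can occur.
storm.zookeeper.session.timeout: 20000
storm.zookeeper.connection.timeout: 15000

# How long Nimbus waits for task heartbeats before reassigning work.
nimbus.task.timeout.secs: 30
```

Note also that the ZooKeeper server caps client session timeouts at maxSessionTimeout in zoo.cfg (by default 20 * tickTime), so raising storm.zookeeper.session.timeout alone may have no effect.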