I have storm-0.9.2 installed on a 5-node cluster. I have a simple topology with 1 spout and a varying number of bolts (4, 9, 22, 31). For each configuration I have configured (#bolts + 1) workers; thus for 4 bolts I have 5 workers, for 22 bolts 23 workers, etc. I have observed failed worker processes in the worker log files, with a corresponding *EndOfStreamException* in the zookeeper.out log file. When I do get a clean test run, the total number of tuples is accounted for and each worker processes the same number of tuples. I attribute this to the shuffle grouping's even distribution.
On a non-clean test run, the workers that fail attempt to reconnect. However, since the total number of tuples is finite and the spout has already emitted all of them, there are no more tuples left to process, and the total number of tuples processed ends up less than the actual finite amount. What are the possible causes for a worker to die in the middle of processing?

*Cluster environment:*

- Apache Storm 0.9.2
- ZooKeeper 3.4.6
- Ubuntu 13.10

*Excerpt from zookeeper.out:*

    2014-10-29 11:40:52,136 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
    EndOfStreamException: Unable to read additional data from client sessionid 0x1495d312a4f002d, likely client has closed socket
            at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
            at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
            at java.lang.Thread.run(Thread.java:744)
    2014-10-29 11:40:52,137 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /192.168.0.4:38970 which had sessionid 0x1495d312a4f002d

Thanks,
Dennis
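P.S. In case it matters, here are the heartbeat/timeout settings I believe are in play. I have not overridden any of these, so the values below are the Storm 0.9.x defaults from conf/defaults.yaml as I understand them (assumptions, not values read off my cluster):

```yaml
# Assumed Storm 0.9.x defaults (conf/defaults.yaml) -- not overridden in my storm.yaml.
storm.zookeeper.session.timeout: 20000     # ms before ZooKeeper expires a worker's session
storm.zookeeper.connection.timeout: 15000  # ms allowed for establishing a ZooKeeper connection
supervisor.worker.timeout.secs: 30         # supervisor restarts a worker silent for this long
nimbus.task.timeout.secs: 30               # nimbus reassigns tasks whose heartbeats stop
```

If a worker's heartbeats stall (e.g. a long GC pause) past these thresholds, I'd expect the session to expire and the socket to drop, which seems consistent with the EndOfStreamException above, but I'd appreciate confirmation.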
