Hi all,

We are seeing our workers constantly being killed by Storm, with the following logs:

worker:
2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, closing socket and attempting reconnect

supervisor:
2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
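
As a quick check on the numbers in the supervisor line, here is a minimal Python sketch that pulls out the two timestamps and the state. The helper names heartbeat_age_secs and worker_state are made up for illustration, and the regexes assume the log format exactly as pasted above; this is not part of Storm.

```python
import re

# Supervisor log message copied verbatim from the line quoted above.
SUPERVISOR_LINE = (
    'Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. '
    'Current supervisor time: 1400876250. State: :disallowed, '
    'Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat'
    '{{:time-secs 1400876249, :storm-id "test-46-1400863199", '
    ':executors #{[-1 -1]}, :port 6700}'
)


def heartbeat_age_secs(line):
    """Return supervisor time minus heartbeat time in seconds, or None if a field is missing."""
    now = re.search(r"Current supervisor time: (\d+)", line)
    beat = re.search(r":time-secs (\d+)", line)
    if not (now and beat):
        return None
    return int(now.group(1)) - int(beat.group(1))


def worker_state(line):
    """Return the state keyword the supervisor logged (e.g. ':disallowed'), or None."""
    match = re.search(r"State: (:[\w-]+)", line)
    return match.group(1) if match else None


if __name__ == "__main__":
    print("heartbeat age (s):", heartbeat_age_secs(SUPERVISOR_LINE))
    print("state:", worker_state(SUPERVISOR_LINE))
```

For the line above, the heartbeat timestamp is 1 second behind the supervisor clock and the logged state is :disallowed.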

Eventually Storm decides to just kill the worker and restart it, as you can see in the supervisor log. We theorize that this is the ZooKeeper heartbeat thread being choked out by very high CPU load on the machine (near 100%). I have increased the connection timeouts in the storm.yaml config file, yet Storm seems to continue to use some unknown value for the client session timeout messages above:

storm.zookeeper.connection.timeout: 300000
storm.zookeeper.session.timeout: 300000

1) Which timeout config is appropriate for the above timeout message?
2) Is it expected behavior for Storm to be unable to keep up with heartbeat threads under high CPU, or is our theory incorrect?

Thanks,
Michael