GitHub user revans2 commented on the pull request:
https://github.com/apache/storm/pull/797#issuecomment-151279210
@HeartSaVioR I am seeing similar things to what @harshach is seeing. I
really want to trace this down and fix it. How many nodes do you have? Which
daemons are running on which nodes? What version of Java are you running?
What OS are you running on? Can you share some information about the hardware?
I know they are VMs, but the number of cores and clock frequency would be good. What is the
network connection between the nodes?
The failures you are seeing look like what I would see when ZK or the
network got overloaded: the heartbeats could not make it to ZK, so it
did not show any change in the data some of the time. But with only 3 workers
and none of them getting rescheduled I find that hard to believe. Can you
share any of the logs? Have you tried to run
[zktop](https://github.com/phunt/zktop/blob/master/zktop.py) to see if any of
the nodes in the ensemble are showing signs of slowness? Have you looked at
the network and disk utilization?