While reviewing client logs as part of our Cassandra testing, I noticed several Hector "All host pools marked down" exceptions. Further investigation showed a consistent pattern of "java.net.SocketException: Broken pipe" and "java.net.SocketException: Connection reset" messages. These errors occur for all 36 hosts in the client's local datacenter within a few seconds, as Hector tries to find a working host to connect to. Failing to find one results in the "All host pools marked down" messages. These messages recur for a period ranging from several seconds up to almost 15 minutes, clustering around two to three minutes. Then connectivity returns, and Hector's next reconnect attempt succeeds.
The clients are instances of a JBoss 5 web application. We use Hector 0.7.0-29 (plus a patch that was pulled in advance of -30). The Cassandra cluster has 72 nodes split between two datacenters. It's running 0.7.5 plus a couple of bug fixes pulled in advance of 0.7.6. The keyspace uses NetworkTopologyStrategy with RF=6 (3 in each datacenter). The clients read and write at LOCAL_QUORUM to the 36 nodes in their own datacenter. Right now the second datacenter is for failover only, so no clients are actually writing there.

There's nothing else obvious in the JBoss logs at around the same time, e.g. other application errors or GC events. The Cassandra system.log files at INFO level show nothing out of the ordinary. I have a capture of one of the incidents at DEBUG level where again I see nothing abnormal, but there's so much data that it would be easy to miss something.

Other observations:

* It only happens on weekdays. (Our weekends are much lower load.)
* It has occurred every weekday for the last month except for Monday May 30, the Memorial Day holiday in the US.
* Most days it occurs only once, but six times it has occurred twice, never more often than that.
* It generally happens in the late afternoon, but there have been occurrences earlier in the afternoon and twice in the late morning. The earliest occurrence was at 11:19 and the latest at 18:11. Our peak loads are between 10:00 and 14:00, so most occurrences do *not* correspond with peak load times.
* It only happens on a single client JBoss instance at a time.
* Generally, it affects a different client host each day, but the same host was affected on consecutive days once.
* Out of 40 clients, one has been affected three times, seven have been affected twice, 11 have been affected once and 21 have not been affected.
* The cluster is lightly loaded.
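
One check that could separate a Hector pool problem from a lower-level connectivity problem is a standalone TCP probe run on the affected client while an incident is in progress. The following is only a minimal sketch, not something taken from our deployment: the class name `ClusterProbe`, the 2-second timeout, and the use of Cassandra's default Thrift port 9160 are all assumptions, and the host names are passed on the command line.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic; not part of Hector or Cassandra.
class ClusterProbe {

    // Assumption: the nodes listen on Cassandra's default Thrift rpc_port.
    static final int THRIFT_PORT = 9160;

    /** Returns true if a plain TCP connect to host:port completes within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, resets and unresolved host names.
            return false;
        }
    }

    /** Returns the subset of hosts that could not be reached. */
    static List<String> unreachable(List<String> hosts, int port, int timeoutMs) {
        List<String> failed = new ArrayList<String>();
        for (String host : hosts) {
            if (!canConnect(host, port, timeoutMs)) {
                failed.add(host);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(args);
        List<String> failed = unreachable(hosts, THRIFT_PORT, 2000);
        System.out.println(failed.size() + " of " + hosts.size() + " hosts unreachable");
        for (String host : failed) {
            System.out.println("  unreachable: " + host);
        }
    }
}
```

It would be run as `java ClusterProbe host1 host2 ...` with the node names from the local datacenter. If every node were unreachable from the affected client while the same probe succeeded from another client, that would point at the client machine rather than at the cluster or at Hector.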
Given that the problem affects a single client machine at a time and that machine loses the ability to connect to the entire cluster, it seems unlikely that the problem is on the C* server side. Even a network problem seems hard to explain: since the clients are on the same subnet, I would expect all of them to fail if it were a network issue. I'm hoping that perhaps someone has seen a similar issue or can suggest things to try.

Thanks in advance for any help!

Jim