In reviewing client logs as part of our Cassandra testing, I noticed
several Hector "All host pools marked down" exceptions in the logs.
Further investigation showed a consistent pattern of
"java.net.SocketException: Broken pipe" and "java.net.SocketException:
Connection reset" messages. These errors occur for all 36 hosts in the
client's datacenter over a period of seconds, as Hector tries to find a
working host to connect to. Failing to find a host results in the "All host
pools marked down" messages. These messages recur for a period ranging
from several seconds up to almost 15 minutes, clustering around two to
three minutes. Then connectivity returns, and Hector's next attempt to
reconnect succeeds.

The clients are instances of a JBoss 5 web application. We use Hector
0.7.0-29 (plus a patch that was pulled in advance of -30). The
Cassandra cluster has 72 nodes split between two datacenters. It's
running 0.7.5 plus a couple of bug fixes pulled in advance of 0.7.6.
The keyspace uses NetworkTopologyStrategy and RF=6 (3 in each
datacenter). The clients are reading and writing at LOCAL_QUORUM to
the 36 nodes in their own data center. Right now the second datacenter
is for failover only, so there are no clients actually writing there.

There's nothing else obvious in the JBoss logs at around the same
time, such as other application errors or GC events. The Cassandra
system.log files at INFO level show nothing out of the ordinary. I
have a capture of one of the incidents at DEBUG level where again I
see nothing abnormal-looking, but there's so much data that it would
be easy to miss something.

Other observations:
* It only happens on weekdays (our weekend load is much lower).
* It has occurred every weekday for the last month except for Monday
May 30, the Memorial Day holiday in the US.
* Most days it occurs only once, but six times it has occurred twice,
never more often than that.
* It generally happens in the late afternoon, but there have been
occurrences earlier in the afternoon and twice in the late morning.
Earliest occurrence is 11:19, latest is 18:11. Our peak loads
are between 10:00 and 14:00, so most occurrences do *not* correspond
with peak load times.
* It only happens on a single client JBoss instance at a time.
* Generally, it affects a different client each day, but the same client
was affected on consecutive days once.
* Out of 40 clients, one has been affected three times, seven have
been affected twice, 11 have been affected once and 21 have not been
affected.
* The cluster is lightly loaded.

Given that the problem affects a single client machine at a time and
that machine loses the ability to connect to the entire cluster, it
seems unlikely that the problem is on the C* server side. Even a
network problem seems hard to explain: the clients are all on the same
subnet, so I would expect all of them to fail if it were a network
issue.
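
One thing I could try is running a plain TCP probe from the affected
client while an incident is in progress, to see whether the failure
is below Hector or inside it. A minimal sketch, assuming the default
Thrift port 9160 and host names passed on the command line (the class
name HostProbe and its methods are hypothetical, not part of Hector).
It only tests whether a fresh connection can be opened, so it would not
by itself explain "Broken pipe" on already-established connections:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HostProbe {

    /** Tries a plain TCP connect; returns "OK" or the exception type and message. */
    static String probe(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return "OK";
        } catch (IOException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    /** Probes each host in order and returns host -> result. */
    static Map<String, String> probeAll(List<String> hosts, int port, int timeoutMs) {
        Map<String, String> results = new LinkedHashMap<String, String>();
        for (String host : hosts) {
            results.put(host, probe(host, port, timeoutMs));
        }
        return results;
    }

    public static void main(String[] args) {
        int port = 9160;      // assumption: Cassandra's default Thrift rpc_port
        int timeoutMs = 2000; // assumption: arbitrary connect timeout
        Map<String, String> results = probeAll(Arrays.asList(args), port, timeoutMs);
        for (Map.Entry<String, String> e : results.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

If every host reports a failure from the affected client while the same
probe succeeds from a neighboring client, that would point at the
client machine or its network path rather than at Hector or Cassandra.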

I'm hoping that perhaps someone has seen a similar issue or can
suggest things to try.

Thanks in advance for any help!

Jim
