In our case, the timeouts were happening because internode authentication was turned on, and by default the user column family in the system_auth keyspace is replicated on only 1 node, so all authentication requests went to that one node. We set the replication factor on the system_auth keyspace to n (the number of nodes). We also had to raise permissions_validity_in_ms from its default of 2000 ms to a larger value.
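For reference, here is a rough sketch of the replication change. It only builds the CQL statement; the helper name is made up for illustration, and it assumes SimpleStrategy (a single data center). A multi-DC cluster would likely want NetworkTopologyStrategy with per-DC factors instead.

```python
def system_auth_replication_cql(node_count: int) -> str:
    """Build the CQL that sets the system_auth replication factor to node_count.

    Hypothetical helper for illustration. Assumes SimpleStrategy (single data
    center); adjust the replication map for NetworkTopologyStrategy.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return (
        "ALTER KEYSPACE system_auth WITH replication = "
        "{'class': 'SimpleStrategy', 'replication_factor': %d};" % node_count
    )


if __name__ == "__main__":
    # Example: a 6-node cluster (the node count here is made up).
    print(system_auth_replication_cql(6))
    # After running the statement in cqlsh, repair the keyspace on each node
    # so the existing auth rows reach the new replicas:
    #   nodetool repair system_auth
    # The permissions cache lifetime is set in cassandra.yaml, e.g.:
    #   permissions_validity_in_ms: 60000   # illustrative value; default is 2000
```

The repair step matters because changing the replication factor does not by itself copy existing rows to the new replicas.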
Hope this helps.

Parag

From: Robert Coli <rc...@eventbrite.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, November 24, 2014 at 2:52 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: What causes NoHostAvailableException, WriteTimeoutException, and UnavailableException?

On Mon, Nov 24, 2014 at 12:57 PM, Kevin Burton <bur...@spinn3r.com> wrote:

> I’m trying to track down some exceptions in our production cluster. I bumped up our write load and now I’m getting a non-trivial number of these exceptions. Somewhere on the order of 100 per hour.
>
> All machines have a somewhat high CPU load because they’re doing other tasks. I’m worried that perhaps my background tasks are just overloading cassandra and one way to mitigate this is to nice them to least favorable priority (this is my first tasks).

Two out of three of them are timeouts or lack of availability. Seeing this across your cluster is usually associated with hitting a "pre-fail" condition in terms of GC, where the amount of data stored per node makes the steady state working set larger than available non-fragmented heap.

If you're graphing GC time, I would expect to see a concomitant spike there.

=Rob