> Most of the time, we got a few timeouts on the failover (unexpected, but not 
> the end of the world) and then quickly recovered; 
For read or write requests? I'm guessing with 3 nodes you are using RF 3. In 
Cassandra 1.x the default read repair chance is only 10%, so 90% of the time 
only CL (consistency level) replicas are involved in a read request. If one of 
the nodes involved dies during the request, the coordinator will time out 
waiting for its response.
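To put numbers on it, here is a rough sketch in plain Python, just to 
illustrate the arithmetic (RF 3 is my assumption from the 3-node description):

    # QUORUM blocks for a majority of replicas
    replication_factor = 3
    blocked_for = replication_factor // 2 + 1   # -> 2 of 3 replicas

    # ~90% of reads in 1.x contact only those 2 replicas (read repair not
    # triggered), so losing one of them mid-request leaves the coordinator
    # short a response until rpc_timeout_in_ms (10 seconds by default)
    # expires, which lines up with the 10 second timeouts you describe.

That is why a replica dying during the request shows up as a timeout rather 
than an immediate error.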
 
> We see B making a request in the logs (on debug) and 10 seconds later timing 
> out.  We see nothing happening in C's log (also debug).  
What were the log messages from the nodes? In particular, the ones from 
StorageProxy on node B and RowMutationVerbHandler on node C.

> In retrospect, I should have put it in trace (will do this next time)
TRACE logs a lot of output; I'd hold off on turning it on globally.
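If you do want more detail next time without the full firehose, you can raise 
the level for just the classes involved. A sketch, assuming the stock 
conf/log4j-server.properties that ships with 1.0.x (double-check the package 
names against your build):

    # leave the root logger alone and add per-class overrides for the
    # write path and the messaging layer
    log4j.logger.org.apache.cassandra.service.StorageProxy=TRACE
    log4j.logger.org.apache.cassandra.db.RowMutationVerbHandler=TRACE
    log4j.logger.org.apache.cassandra.net.MessagingService=TRACE

That keeps the volume manageable while still showing the request hand-off 
between B and C.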

> I also noticed a few other crazy log messages on C in that time period. 
What were the log messages?

>  There were two instances of "invalid protocol header", which in code seems 
> to only happen when PROTOCOL_MAGIC doesn't match (MessagingService.java), 
> which seems like an impossible state.
That often means something other than Cassandra connected to the storage port.
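The check is just a fixed magic number at the front of internode traffic, so 
anything else that pokes the storage port (a load balancer health check, a 
port scanner, a client pointed at the wrong port) will trip it. A toy 
illustration in Python, not Cassandra's actual code; the real constant and 
check live in MessagingService.java:

    import struct

    PROTOCOL_MAGIC = 0xDEADBEEF   # placeholder value, for illustration only

    def validate_header(first_four_bytes):
        (magic,) = struct.unpack('>I', first_four_bytes)
        if magic != PROTOCOL_MAGIC:
            # this is the case that gets logged as an invalid protocol header:
            # whatever connected was not speaking the internode protocol
            raise IOError('invalid protocol header')

    # e.g. an HTTP health check hitting the storage port starts with b'GET '
    # and fails the check immediately

In other words, not an impossible state, just a connection from something that 
does not speak the internode protocol.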

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 15/05/2012, at 1:00 AM, E S wrote:

> Hello,
> 
> I am having some very strange issues with a cassandra setup.  I recognize 
> that this is not the ideal cluster setup, but I'd still like to try and 
> understand what is going wrong.
> 
> The cluster has 3 machines (A,B,C) running Cassandra 1.0.9 with JNA.  A & B 
> are in datacenter1 while C is in datacenter2.  Cassandra knows about the 
> different datacenters because of the rack-inferred snitch.  However, we are 
> currently using a simple placement strategy on the keyspace.  All reads and 
> writes are done with quorum.  Hinted handoffs are enabled.  Most of the 
> Cassandra settings are at their defaults, with the exception of the Thrift 
> message size, which we have upped to 256 MB (while very rare, we can 
> sometimes have a few larger rows, so we wanted a big buffer).  There is a 
> firewall between the two datacenters.  We have enabled TCP traffic for the 
> Thrift and storage ports (but not JMX, and no UDP).
> 
> Another odd thing is that there are actually 2 cassandra clusters hosted on 
> these machines (although with the same setup).  Each machine has 2 cassandra 
> processes, but everything is running on different ports and with different 
> cluster names.
> 
> On one of the two clusters we were doing some failover testing.  We would 
> take nodes down quickly in succession and make sure the system remained 
> up.
> 
> Most of the time, we got a few timeouts on the failover (unexpected, but not 
> the end of the world) and then quickly recovered; however, twice we were able 
> to put the cluster in an unusable state.  We found that sometimes node C, 
> while seemingly up (no load, and marked as UP in the ring by other nodes), 
> was unresponsive to B (when A was down) when B was coordinating a quorum 
> write.  We see B making a request in the logs (on debug) and 10 seconds later 
> timing out.  We see nothing happening in C's log (also debug).  The box is 
> just idling.  In retrospect, I should have put it in trace (will do this next 
> time).  We had it come back after 30 minutes once.  Another time, it came 
> back earlier after cycling it.
> 
> I also noticed a few other crazy log messages on C in that time period.  
> There were two instances of "invalid protocol header", which in code seems to 
> only happen when PROTOCOL_MAGIC doesn't match (MessagingService.java), which 
> seems like an impossible state.
> 
> I'm currently at a loss trying to explain what is going on.  Has anyone seen 
> anything like this?  I'd appreciate any additional debugging ideas!  Thanks 
> for any help.
> 
> Regards,
> Eddie  
> 
