[ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2058:
--------------------------------------

    Attachment: 2058.txt

Brandon's testing has narrowed the culprit down to CASSANDRA-1959.  As 
discussed on CASSANDRA-2054, the main problem there is the 
NonBlockingHashMap introduced to track the latencies of timed-out requests.

This patch reverts that change and takes a different approach: tracking the 
latency in the callback map.  This means we need a unique messageId for each 
target we send a message to.  The Right Way to do this would be to have Message 
objects contain only the data to send, not the From address and not the 
messageId.  Refactoring Message is outside our scope here, though, so instead we 
create a new Message for each target.
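
For illustration only, here is a minimal sketch of the per-target idea.  The 
names (SimpleMessage, CallbackInfo, sendToAll, transportSend) are invented 
stand-ins for this sketch, not the Message/MessagingService classes the patch 
actually touches; the shape is just one message, one id, and one callback-map 
entry per target, with the send time recorded so latency can be tracked there.

{code}
import java.net.InetAddress;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative stand-ins only -- not the actual Cassandra classes.
public class PerTargetSendSketch
{
    // hypothetical message holder: it carries its own id, so each target needs its own copy
    static class SimpleMessage
    {
        final String id;
        final byte[] body;
        SimpleMessage(String id, byte[] body) { this.id = id; this.body = body; }
    }

    // one callback-map entry per messageId/target; the send timestamp lets the
    // response path compute latency without a separate timed-out-latency map
    static class CallbackInfo
    {
        final InetAddress target;
        final long sentAtNanos = System.nanoTime();
        CallbackInfo(InetAddress target) { this.target = target; }
    }

    static final AtomicLong idGenerator = new AtomicLong();
    static final ConcurrentMap<String, CallbackInfo> callbacks = new ConcurrentHashMap<String, CallbackInfo>();

    // instead of reusing one message (and one id) for every replica, build a fresh
    // message per target so every callback-map key is unique
    static void sendToAll(byte[] body, List<InetAddress> targets)
    {
        for (InetAddress target : targets)
        {
            String id = Long.toString(idGenerator.incrementAndGet());
            callbacks.put(id, new CallbackInfo(target));
            transportSend(new SimpleMessage(id, body), target);
        }
    }

    static void transportSend(SimpleMessage message, InetAddress target)
    {
        // placeholder for the actual wire send
    }
}
{code}

Because each target gets its own id, the callback-map entry doubles as the 
latency bookkeeping, which is what makes the separate NonBlockingHashMap 
unnecessary.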

This does let us clean up the callback map in ResponseVerbHandler instead of in 
each Callback.  (That is what the changes to QRH, WRH, and AR are about.)
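
Roughly, the response side then looks like the sketch below.  Again the names 
(handleResponse, recordLatency, sentAtNanos) are made up for illustration; in 
the real patch the removal and latency recording live in ResponseVerbHandler 
rather than in each callback.

{code}
import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

// Continues the same invented model; illustrative only.
public class ResponseSideSketch
{
    interface Callback
    {
        void response(byte[] payload);
    }

    static final ConcurrentMap<String, Long> sentAtNanos = new ConcurrentHashMap<String, Long>();
    static final ConcurrentMap<String, Callback> callbacks = new ConcurrentHashMap<String, Callback>();

    // the response handler removes the map entry and records latency in one place,
    // so the individual callbacks no longer have to clean up after themselves
    static void handleResponse(String messageId, InetAddress from, byte[] payload)
    {
        Callback callback = callbacks.remove(messageId);
        Long start = sentAtNanos.remove(messageId);
        if (callback == null || start == null)
            return; // entry already expired (timed out); nothing to deliver

        long latencyMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        recordLatency(from, latencyMillis);
        callback.response(payload);
    }

    static void recordLatency(InetAddress from, long millis)
    {
        // placeholder: in the real code this would feed the latency trackers
    }
}
{code}

Centralizing the removal this way also means the cleanup logic exists once 
instead of being duplicated across each callback class.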

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>         Attachments: 2058.txt, cassandra.pmc01.log.bz2, 
> cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on 
> the DES, and moved some CFs from one KS into another (drain whole cluster, 
> take it down, move files, change schema, put it back up). Since then, I've 
> had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu 
> machine) and become totally unresponsive. After a moment or two like that, 
> its neighbour dies too, and the failure cascades around the ring. 
> Unfortunately, because of the high load I'm not able to get into the machine 
> to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load but 
> recovers. This may or may not be the same issue as the one above from which 
> the nodes don't recover, but both are new behaviour.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
