[ 
https://issues.apache.org/jira/browse/CASSANDRA-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607826#comment-13607826
 ] 

Brandon Williams commented on CASSANDRA-5367:
---------------------------------------------

It looks like the hints aren't stuck: there's a thread trying to deliver to a 
host, and there's a large compaction of hints going on.  The host that the 
hints are for is the problem.
                
> Hints stuck on compaction
> -------------------------
>
>                 Key: CASSANDRA-5367
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5367
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.2.2
>         Environment: 80 Node cluster on 1.2.2 (problem has been around since 
> before 1.0)
>            Reporter: Brooke Bryan
>         Attachments: thread.log
>
>
> When our cluster is handling hints, we very often see hints get stuck on a 
> node if it is unable to communicate with another node.  The problem is not 
> that the other node is down; the other node will be busy doing compactions or 
> running out of memory.  While that node is a problem and needs to be fixed, 
> every other node in the cluster will stick waiting to handle the hints 
> between itself and that node.
> This causes a pretty major knock-on effect throughout the entire cluster, 
> causing hints to back up.  We are seeing some nodes backed up with 14GB of 
> hints after 2 days of the hints being stuck.
> Also, during this "stuck" session, compactionstats will show a compaction on 
> the system hints column family, and the completed bytes amount will not 
> change.
> From what I have experienced, this is the only thing that gets an entire 
> cluster very bogged down, and it requires a lot of manual intervention to get 
> everything back online.
> After putting a node into debug mode, and based on the log output and on 
> pausing handoffs etc., I have narrowed the issue down to:
> startColumn = hint.name(); (line ~361 HintedHandoffManager) and line 390
