[ https://issues.apache.org/jira/browse/CASSANDRA-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stanislav Vishnevskiy updated CASSANDRA-13687:
----------------------------------------------
    Attachment: 3.0.9heap.png
                3.0.14heap.png
                3.0.14cpu.png

We just had this happen again. I am attaching screenshots of a similar time range, 
again from before and after the upgrade.

As you can see in the [^3.0.14heap.png] image, at 1PM the heap spikes to 6GB and 
we have to take down the node because it makes the cluster start failing. We then 
change MAX_HEAP_SIZE to 12GB, bring the node back up and repair. This time the 
heap spikes to 8GB and stays there through the whole repair, then drops to 600MB 
without a huge CMS pause, almost as if it were one big object. The node calling 
repair (1-1) is the only one with the heap growth. As [^3.0.9heap.png] shows, this 
did not happen during repair on 3.0.9 and all nodes looked similar.
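
For reference, the heap bump was nothing fancy; roughly the usual cassandra-env.sh 
override plus a drain and restart (exact path and service name depend on the install):

{noformat}
# cassandra-env.sh on the affected node (path depends on the install)
MAX_HEAP_SIZE="12G"

# then bounce the node before re-running repair
nodetool drain
sudo service cassandra restart   # or systemctl, depending on the box
{noformat}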

Another interesting thing is CPU usage, as seen in [^3.0.14cpu.png]. The node 
performing the nodetool repair (in blue) is using far more CPU than the other 
node in the cluster. When we compared this a week ago on 3.0.9, that was not the 
case either.
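
For context, the nightly job is essentially just a nodetool repair of the keyspace, 
along these lines (on 3.0.x that should already default to parallel and incremental; 
the keyspace name below is a placeholder):

{noformat}
# nightly cron entry on the coordinating node; <keyspace> is a placeholder
nodetool repair <keyspace>
{noformat}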

This feels like a bug in repair?


> Abnormal heap growth and long GC during repair.
> -----------------------------------------------
>
>                 Key: CASSANDRA-13687
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13687
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Stanislav Vishnevskiy
>         Attachments: 3.0.14cpu.png, 3.0.14heap.png, 3.0.14.png, 
> 3.0.9heap.png, 3.0.9.png
>
>
> We recently upgraded from 3.0.9 to 3.0.14 to get the fix from CASSANDRA-13004.
> Sadly, on 3 of the last 7 nights we have had to wake up due to Cassandra dying 
> on us. We currently don't have any data to help reproduce this, but since there 
> aren't many commits between the two versions the cause might be obvious.
> Basically, we trigger a parallel incremental repair from a single node every 
> night at 1AM. That node will sometimes start allocating heavily, keeping the 
> heap maxed out and triggering GC. Some of these GC pauses can last up to 2 
> minutes. This effectively destroys the whole cluster due to timeouts to this 
> node.
> The only workaround we currently have is to drain the node and restart the 
> repair; the second attempt has worked fine every time.
> I attached heap charts from 3.0.9 and 3.0.14 during repair.


