[ https://issues.apache.org/jira/browse/CASSANDRA-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stanislav Vishnevskiy updated CASSANDRA-13687:
----------------------------------------------
    Attachment: 3.0.9heap.png
                3.0.14heap.png
                3.0.14cpu.png

We just had this happen again. I am attaching screenshots of a similar time range, again from before and after the upgrade. As you can see in the [^3.0.14heap.png] image, at 1PM the heap spikes to 6GB, and we have to take the node down because it makes the cluster start failing. We then changed MAX_HEAP_SIZE to 12GB, brought the node back up, and repaired again. This time the heap spiked to 8GB and stayed there through the whole repair. It then dropped to 600MB without a huge CMS pause, almost as if it were one big object. The node calling repair (1-1) is the only one with the heap growth. As [^3.0.9heap.png] shows, this did not happen during repair on 3.0.9, and all nodes looked similar. Another interesting thing is CPU usage, as seen in [^3.0.14cpu.png]: the node performing the nodetool repair (in blue) is using far more CPU than the other node in the cluster. We compared this a week ago on 3.0.9 and saw no such difference. This feels like a bug in repair?

> Abnormal heap growth and long GC during repair.
> -----------------------------------------------
>
>                 Key: CASSANDRA-13687
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13687
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Stanislav Vishnevskiy
>        Attachments: 3.0.14cpu.png, 3.0.14heap.png, 3.0.14.png, 3.0.9heap.png, 3.0.9.png
>
>
> We recently upgraded from 3.0.9 to 3.0.14 to get the fix from CASSANDRA-13004.
> Sadly, 3 out of the last 7 nights we have had to wake up due to Cassandra dying on us. We currently don't have any data to help reproduce this, but since there aren't many commits between the two versions, the cause might be obvious.
> Basically, we trigger a parallel incremental repair from a single node every night at 1AM. That node will sometimes start allocating a lot, keeping the heap maxed out and triggering GC. Some of these GC pauses can last up to 2 minutes.
> This effectively destroys the whole cluster due to timeouts to this node.
> The only solution we currently have is to drain the node and restart the repair; it has worked fine on the second attempt every time.
> I attached heap charts from 3.0.9 and 3.0.14 during repair.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
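For reference, the nightly job and the recovery steps described in this report could be sketched roughly as below. The keyspace name, service command, and heap values are illustrative assumptions, not taken from the report; in Cassandra 3.0, `nodetool repair` runs in parallel and incrementally by default, matching the repair described here.

```shell
#!/bin/sh
# Nightly 1AM cron job: repair triggered from a single node.
# In 3.0, plain `nodetool repair` is parallel and incremental by default.
nodetool repair my_keyspace   # "my_keyspace" is an assumed name

# Recovery when the heap spikes and long GC pauses stall the cluster:
# 1. Flush memtables and stop accepting client connections on the node.
nodetool drain

# 2. Raise the heap in conf/cassandra-env.sh before restarting, e.g.:
#      MAX_HEAP_SIZE="12G"    # bumped from the previous setting, per the report
#    (restart command varies by install; this assumes a service-managed node)
sudo service cassandra restart

# 3. Re-run the repair; per the report it completed fine on the second attempt.
nodetool repair my_keyspace
```

This is an operational sketch against a live cluster, so it is not runnable standalone; the `nodetool drain` / restart sequence is the workaround the reporter describes, not a fix for the underlying heap growth.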