[ https://issues.apache.org/jira/browse/CASSANDRA-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020819#comment-14020819 ]
Jacek Furmankiewicz edited comment on CASSANDRA-7361 at 6/7/14 2:35 PM:
------------------------------------------------------------------------

Guys, this is unreal. Let me recap:

a) we start off with a small 2 GB input file of data
b) when processed, this data explodes to a mere 30-40 GB on disk, which should be nothing for Cassandra
c) halfway through processing this data, Cassandra suddenly grabs itself by its own throat and starts choking itself
d) when we back off with a retry policy, Cassandra is still choking itself
e) when we shut down the app, Cassandra is still choking itself
f) the only option is to restart Cassandra, potentially every node in the cluster. We still have not been able to process the data for the customer
g) this will happen every day
h) we use the off-heap cache, which is supposed to let us escape the limits of the Java heap

Please explain to me how this issue is Minor and Resolved. What is the solution that you recommend? Are we supposed to disable the row cache and never use it in production, as it can cause a total server meltdown?

P.S. Every key that we read is only read 2-3 times; there is no reason why it could not expire from the cache.

P.P.S. This worked fine in 1.1.x; we were able to process this data a few months ago with the row cache enabled. It is a regression that only started appearing once we upgraded to 2.0.6.

> Cassandra locks up in full GC when you assign the entire heap to row cache
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7361
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7361
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Ubuntu, RedHat, JDK 1.7
>            Reporter: Jacek Furmankiewicz
>            Priority: Minor
>         Attachments: histogram.png, leaks_report.png, top_consumers.png
>
> We have a long-running batch load process which runs for many hours: a massive amount of writes, in large mutation batches (we increase the Thrift frame size to 45 MB).
> Everything goes well, but after about 3 hrs of processing everything locks up. We start getting NoHostsAvailable exceptions on the Java application side (with Astyanax as our driver), and eventually socket timeouts.
> Looking at Cassandra, we can see that it is using nearly the full 8 GB of heap and is unable to free it. It spends most of its time in full GC, but the amount of memory used does not go down.
> Here is a long sample from jstat showing this over an extended time period:
> http://aep.appspot.com/display/NqqEagzGRLO_pCP2q8hZtitnuVU/
> This continues even after we shut down our app. Nothing is connected to Cassandra any more, yet it is still stuck in full GC and cannot free up memory.
> Running nodetool tpstats shows that nothing is pending, all seems OK:
> {quote}
> Pool Name                  Active   Pending   Completed   Blocked   All time blocked
> ReadStage                       0         0    69555935         0                  0
> RequestResponseStage            0         0           0         0                  0
> MutationStage                   0         0    73123690         0                  0
> ReadRepairStage                 0         0           0         0                  0
> ReplicateOnWriteStage           0         0           0         0                  0
> GossipStage                     0         0           0         0                  0
> CacheCleanupExecutor            0         0           0         0                  0
> MigrationStage                  0         0          46         0                  0
> MemoryMeter                     0         0        1125         0                  0
> FlushWriter                     0         0         824         0                 30
> ValidationExecutor              0         0           0         0                  0
> InternalResponseStage           0         0          23         0                  0
> AntiEntropyStage                0         0           0         0                  0
> MemtablePostFlusher             0         0        1783         0                  0
> MiscStage                       0         0           0         0                  0
> PendingRangeCalculator          0         0           1         0                  0
> CompactionExecutor              0         0       74330         0                  0
> commitlog_archiver              0         0           0         0                  0
> HintedHandoff                   0         0           0         0                  0
>
> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR                  0
> PAGED_RANGE                  0
> BINARY                       0
> READ                       585
> MUTATION                 75775
> _TRACE                       0
> REQUEST_RESPONSE             0
> COUNTER_MUTATION             0
> {quote}
> We had this happen on 2 separate boxes, one with 2.0.6, the other with 2.0.8.
> Right now this is a total blocker for us. We are unable to process the customer data and have to abort in the middle of a large processing run.
> This is a new customer, so we did not have a chance to see if this occurred with 1.1 or 1.2 in the past (we moved to 2.0 recently).
> We have the Cassandra process still running; please let us know if there is anything else we could run to give you more insight.

-- This message was sent by Atlassian JIRA (v6.2#6252)
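For reference, the off-heap row cache discussed above is configured in cassandra.yaml in the 2.0 line. A minimal sketch of the relevant knobs (the values shown are illustrative assumptions, not a recommendation for this workload):

```yaml
# cassandra.yaml (Cassandra 2.0.x) -- illustrative values only
row_cache_size_in_mb: 512      # 0 disables the row cache entirely
row_cache_save_period: 0       # do not persist cache contents across restarts
# In the 2.0 line the row cache is serialized off-heap by default;
# setting row_cache_size_in_mb: 0 is the blunt workaround if the
# cache is suspected of causing GC pressure, as in this report.
```

Disabling the cache (`row_cache_size_in_mb: 0`) is the workaround implied by the reporter's question about never using it in production.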
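The jstat sample linked in the description shows the signature of this failure mode: the old generation pinned near 100% while the full-GC count climbs without any memory being reclaimed. A minimal sketch of how one might flag that pattern programmatically from `jstat -gcutil <pid> <interval>` output (the sample rows below are hypothetical, not taken from the linked dump):

```python
# Sketch: detect a "stuck in full GC" pattern from `jstat -gcutil` samples.
# Columns follow jstat's -gcutil header:
#   S0 S1 E O P YGC YGCT FGC FGCT GCT

def gc_bound(samples, old_gen_threshold=99.0):
    """Return True if the old gen stays saturated while full-GC count rises."""
    rows = [line.split() for line in samples if line.strip()]
    old_gen = [float(r[3]) for r in rows]  # O column: old gen utilisation %
    fgc = [int(r[7]) for r in rows]        # FGC column: cumulative full GC count
    saturated = all(o >= old_gen_threshold for o in old_gen)
    climbing = fgc[-1] > fgc[0]
    return saturated and climbing

# Hypothetical samples resembling the reported state: old gen ~100%,
# thousands of full GCs, and the count still climbing between samples.
sample = [
    "  0.00   0.00 100.00  99.98  59.61    215   12.341   4387  9012.554  9024.895",
    "  0.00   0.00 100.00  99.98  59.61    215   12.341   4391  9020.101  9032.442",
]
print(gc_bound(sample))  # → True: heap full, full GCs climbing, nothing reclaimed
```

A healthy node would show the O column dropping after collections and a stable FGC count, so the same check would return False.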