[ https://issues.apache.org/jira/browse/CASSANDRA-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcus Eriksson reopened CASSANDRA-9033:
----------------------------------------

reopening to close as duplicate

> Upgrading from 2.1.1 to 2.1.3 with LCS and many sstable files makes nodes unresponsive
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-9033
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9033
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: * Ubuntu 14.04.2 - Linux ip-10-0-2-122 3.13.0-46-generic #79-Ubuntu SMP Tue Mar 10 20:06:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> * EC2 m2-xlarge instances [4cpu, 16GB RAM, 1TB storage on 3 platters]
> * 12 nodes running a mix of 2.1.1 and 2.1.3
> * 8GB stack size with offheap objects
>            Reporter: Brent Haines
>            Assignee: Marcus Eriksson
>         Attachments: cassandra-env.sh, cassandra.yaml, system.log.1.zip
>
>
> We have an Event Log table using LCS that has grown fast. There are more than 100K sstable files of around 1 KB each. Increasing the number of compactors and raising compaction throttling does not make a difference. It had been running great, though, until we upgraded to 2.1.3. Those nodes needed more RAM for the stack (12 GB) to even have a prayer of responding to queries. They bog down and become unresponsive. There are no GC messages that I can see, and no compaction either.
> The only work-around I have found is to decommission, blow away the big CF, and rejoin. That takes about 20 minutes and everything is freaking happy again. The size of the files is more like what I'd expect as well.
> Our schema:
> {code}
> cqlsh> describe columnfamily data.stories
>
> CREATE TABLE data.stories (
>     id timeuuid PRIMARY KEY,
>     action_data timeuuid,
>     action_name text,
>     app_id timeuuid,
>     app_instance_id timeuuid,
>     data map<text, text>,
>     objects set<timeuuid>,
>     time_stamp timestamp,
>     user_id timeuuid
> ) WITH bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = 'Stories represent the timeline and are placed in the dashboard for the brand manager to see'
>     AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99.0PERCENTILE';
>
> cqlsh>
> {code}
> There were no log entries that stood out; the log pretty much consisted of "x is down" / "x is up" repeated ad infinitum. I have attached the zipped system.log, which covers the situation after the upgrade and then after I stopped the node, removed system, system_traces, OpsCenter, and data/stories-/*, and restarted. It has rejoined the cluster now and is busy read-repairing to recover its data.
> On another note, we see a lot of this during repair now (on all the nodes):
> {code}
> ERROR [AntiEntropySessions:5] 2015-03-24 20:03:10,207 RepairSession.java:303 - [repair #c5043c40-d260-11e4-a2f2-8bb3e2bbdb35] session completed with the following error
> java.io.IOException: Failed during snapshot creation.
>         at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:146) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> ERROR [AntiEntropySessions:5] 2015-03-24 20:03:10,208 CassandraDaemon.java:167 - Exception in thread Thread[AntiEntropySessions:5,5,RMI Runtime]
> java.lang.RuntimeException: java.io.IOException: Failed during snapshot creation.
>         at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.jar:na]
>         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_55]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> Caused by: java.io.IOException: Failed during snapshot creation.
>         at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:146) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
>         ... 3 common frames omitted
> ERROR [RepairJobTask:2] 2015-03-24 20:03:20,227 RepairJob.java:145 - Error occurred during snapshot phase
> java.lang.RuntimeException: Could not create snapshot at /10.0.2.144
>         at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:77) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.net.MessagingService$5$1.run(MessagingService.java:349) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_55]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> I am thinking that this means that my work-around of blowing away and rebuilding the CF may not be working anymore. I don't know of another way to force LCS compaction. The node doesn't ever seem to recover enough to compact on its own.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
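For reference, the decommission / wipe / rejoin work-around described in the report maps roughly onto the sequence below. This is only a sketch: the service name, the default /var/lib/cassandra data layout, and the exact directory names (the suffix after stories- is the table's local ID) are assumptions, not details taken from the reporter's environment.

{code}
# Stream this node's ranges to the other replicas and leave the ring
nodetool decommission

# Stop the service and remove the oversized column family's data, plus the
# system keyspaces so the node will bootstrap fresh on restart
# (paths assume the default /var/lib/cassandra layout)
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data/data/stories-*
sudo rm -rf /var/lib/cassandra/data/system /var/lib/cassandra/data/system_traces

# Restart; the node re-bootstraps and re-streams its data from the replicas
sudo service cassandra start
nodetool status
{code}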
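On the reporter's point that there is no other known way to force LCS compaction: the sketch below shows two approaches that are sometimes used instead of a full wipe/rejoin, using the data.stories table from the schema above. Both are assumptions rather than anything confirmed in this ticket, and whether nodetool compact performs a meaningful major compaction under LCS on 2.1.x is version-dependent; re-declaring (or switching) the compaction strategy is the more commonly cited way to make the strategy re-evaluate every sstable.

{code}
# Ask for a major compaction of the one table
# (well defined for STCS; behaviour under LCS varies by version)
nodetool compact data stories

# Or re-declare / switch the compaction strategy so every sstable is
# reconsidered (here: LCS with a larger target sstable size)
cqlsh <<'EOF'
ALTER TABLE data.stories
  WITH compaction = {'class': 'LeveledCompactionStrategy',
                     'sstable_size_in_mb': '160'};
EOF

# Watch progress
nodetool compactionstats
{code}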