tuning for read performance
Hi, I have a small 2 node cassandra cluster that seems to be constrained by read throughput. There are about 100 writes/s and 60 reads/s, mostly against a skinny column family. Here's the cfstats for that family:

SSTable count: 13
Space used (live): 231920026568
Space used (total): 231920026568
Number of Keys (estimate): 356899200
Memtable Columns Count: 1385568
Memtable Data Size: 359155691
Memtable Switch Count: 26
Read Count: 40705879
Read Latency: 25.010 ms.
Write Count: 9680958
Write Latency: 0.036 ms.
Pending Tasks: 0
Bloom Filter False Postives: 28380
Bloom Filter False Ratio: 0.00360
Bloom Filter Space Used: 874173664
Compacted row minimum size: 61
Compacted row maximum size: 152321
Compacted row mean size: 1445

iostat shows almost no write activity; here's a typical line:

Device: rrqm/s wrqm/s   r/s  w/s rMB/s wMB/s avgrq-sz avgqu-sz  await svctm %util
sdb       0.00   0.00 312.87 0.00  6.61  0.00    43.27    23.35 105.06  2.28 71.19

and nodetool tpstats always shows pending tasks in the ReadStage. The data set has grown beyond physical memory (250GB/node w/ 64GB of RAM) so I know disk access is required, but are there particular settings I should experiment with that could help relieve some read I/O pressure? I already put memcached in front of cassandra so the row cache probably won't help much.

Also, this column family stores smallish documents (usually 1-100K) along with metadata. The document is only occasionally accessed; usually only the metadata is read/written. Would splitting the document out into a separate column family help?

Thanks,
Kireet
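As a back-of-envelope check, the iostat and cfstats figures quoted above can be combined to estimate the per-read disk cost. This is a rough sketch using only the numbers in this message; it assumes all disk reads are serving application reads, which is only approximately true.

```python
# Rough arithmetic from the iostat line quoted above: how much data each
# disk read returns, and how many disk reads each application read costs.
# These are estimates from the quoted figures, not measured values.
disk_reads_per_sec = 312.87   # r/s from iostat
read_mb_per_sec = 6.61        # rMB/s from iostat
app_reads_per_sec = 60        # application read rate from the message

kb_per_disk_read = read_mb_per_sec * 1024 / disk_reads_per_sec
disk_reads_per_app_read = disk_reads_per_sec / app_reads_per_sec

print(round(kb_per_disk_read, 1))        # ~21.6 KB per disk read
print(round(disk_reads_per_app_read, 1)) # ~5.2 disk reads per app read
```

With 13 SSTables and a mean row of ~1.4KB, roughly five seeks per application read is plausible; anything that cuts the number of SSTables touched per read (compaction, key cache, or moving the rarely-read document blob to its own column family) should shrink that multiplier.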
frequent node up/downs
Hello, I recently set up a 2 node cassandra cluster on dedicated hardware. In the logs there have been a lot of "InetAddress xxx is now dead" or UP messages. Comparing the log messages between the 2 nodes, they seem to coincide with extremely long ParNew collections. I have seen some of up to 50 seconds. The installation is pretty vanilla: I didn't change any settings, and the machines don't seem particularly busy - cassandra is the only thing running on the machine, with an 8GB heap. The machine has 64GB of RAM and CPU/IO usage looks pretty light. I do see a lot of 'Heap is xxx full. You may need to reduce memtable and/or cache sizes' messages. Would reducing those sizes help with the long ParNew collections? That message seems to be triggered on a full collection.
Re: frequent node up/downs
Yeah, I noticed the leap second problem and ran the suggested fix, but I had been facing these problems since before Saturday and still see the occasional failures after running the fix.

Thanks.

On Mon, Jul 2, 2012 at 11:17 AM, Marcus Both wrote:
> Yeah! Look at that.
>
> http://arstechnica.com/business/2012/07/one-day-later-the-leap-second-v-the-internet-scorecard/
> I had the same problem. The solution was rebooting.
>
> --
> Marcus Both
Re: frequent node up/downs
Couple more details. I confirmed that swap space is not being used (free -m shows 0 swap) and cassandra.log has a message like "JNA mlockall successful". top shows the process with 9g resident but 21.6g virtual. What accounts for the much larger virtual number? Some kind of off-heap memory? I'm a little puzzled as to why I would get such long pauses without swapping. I uncommented all the GC logging options in cassandra-env.sh to try to see what is going on when the node freezes.

Thanks,
Kireet
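One likely explanation for VIRT being so much larger than RES is memory-mapped files (Cassandra's disk_access_mode maps SSTables into the address space on 64-bit JVMs). A small standalone sketch, plain Python rather than Cassandra code, showing that mapping a file inflates virtual size without consuming resident memory:

```python
# Standalone demonstration (not Cassandra code): memory-mapping a file
# adds to a process's virtual size (VIRT in top) without using resident
# memory (RES) until pages are actually touched. Memory-mapped SSTables
# can inflate a Cassandra process's VIRT the same way.
import mmap
import os
import tempfile

size = 256 * 1024 * 1024  # reserve 256 MB of address space
fd, path = tempfile.mkstemp()
try:
    os.truncate(path, size)  # sparse file: no data blocks allocated
    with open(path, "rb") as f:
        m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
        # The mapping adds 256 MB to VIRT; RES only grows for pages we read.
        print(len(m))
        m.close()
finally:
    os.close(fd)
    os.remove(path)
```

Because mapped pages are backed by the file, the kernel can drop them under pressure without swapping, which is why a big VIRT number on its own doesn't imply swap activity.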
Re: frequent node up/downs
I reduced the load and the problem hasn't been happening as much. After enabling GC logging, I see "promotion failed" messages when the pauses happen, so it looks like the long pauses coincide with promotion failures. From reading on the web it looks like I could try reducing the CMSInitiatingOccupancyFraction value and/or decreasing the young gen size to try to avoid this scenario.

Also, is it normal to see the "Heap is xx full. You may need to reduce memtable and/or cache sizes" message quite often? I haven't turned on row caches or changed any default memtable size settings, so I am wondering why the old gen fills up.

On Wed, Jul 4, 2012 at 6:28 AM, aaron morton wrote:
>> What accounts for the much larger virtual number? some kind of off-heap
>> memory?
>
> http://wiki.apache.org/cassandra/FAQ#mmap
>
>> I'm a little puzzled as to why I would get such long pauses without
>> swapping.
>
> The two are not related. On startup the JVM memory is locked so it will
> not swap; from then on memory management is pretty much up to the JVM.
>
> Getting a lot of ParNew activity does not mean the JVM is low on memory,
> it means there is a lot of activity in the new heap.
>
> If you have a lot of insert activity (typically in a load test) you can
> generate a lot of GC activity. Try reducing the load to a point where it
> does not hit GC problems and then increase to find the cause. Also if you
> can connect JConsole to the JVM you may get a better view of the heap usage.
>
> Hope that helps.
>
> -
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
Re: frequent node up/downs
responses below. thanks!

On Fri, Jul 6, 2012 at 3:09 PM, aaron morton wrote:
>> It looks like this happens when there is a promotion failure.
>
> Java Heap is full.
> Memory is fragmented.
> Use C for web scale.

unfortunately i became too dumb to use C around 2004. camping accident.

>> Also is it normal to see the "Heap is xx full. You may need to reduce
>> memtable and/or cache sizes" message quite often? I haven't turned on row
>> caches or changed any default memtable size settings so I am wondering
>> why the old gen fills up.
>
> It's odd to get that out of the box with an 8GB heap on a 1.1.X install.
>
> What sort of work load? Is it under heavy inserts?

opscenter shows between 60-120 writes/sec and between 80-150 reads/sec total for both machines. i am not sure if that is considered heavy or not. the machines don't seem particularly busy. load seems pretty even across both.

> Do you have a lot of CFs? A lot of secondary indexes?

i have 15 column families with maybe 4 that are larger and active. there are a couple of secondary indexes. opscenter uses 8 CFs and system 7. total data is ~100GB.

> After the messages is it able to reduce heap usage?

seems like it, they occur every few minutes for awhile and then stop.

> Does it seem to correlate to compactions?

no.

> Is the node able to get back to a healthy state?

yes. after the gc finishes it rejoins the cluster.

> If this is testing are you able to pull back to a workload where the
> issues do not appear?

i am guessing so. i am running a data-heavy background processing job. when i reduced thread count from 20 to 15 the problem has happened only once in the past 2 days vs 2-3 times a day. we are just starting to use cassandra so i am more worried about when more critical web traffic hits.
> Cheers
>
> -----
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
high i/o usage on one node
I am having an issue where one node of a 2 node cluster seems to be using much more I/O than the other. The cassandra read/write requests seem to be balanced, but iostat shows the data disk maxed at 100% utilization on one machine and <50% on the other, with r/s about 3x greater on the high-I/O node. I am using a RF of 2 and a consistency level of ALL for reads and ONE for writes (current requests are very read heavy). User CPU is fairly low and about the same on both machines, but the high-I/O machine shows an OS load of 34 (!) while the other reports 7. I ran nodetool compactionstats and there are no tasks pending, which I assume means no compaction is going on, and the logs look ok as well. The only difference is that on the high-I/O node I am doing full GC logging, but that's on a separate disk from the data. Another oddity is that the high-I/O node shows a data size of 86GB while the other shows 71GB. I understand there could be differences, but with a RF of 2 I would think they would be roughly equal? I am using version 1.0.10.
get_slice on wide rows
I have a column family that I am using for consistency purposes. Basically, a marker column is written to a row in this family before some actions take place and is deleted only after all the actions complete. The idea is that if something goes horribly wrong, this column family can be read to see what needs to be fixed. In my dev environment things worked as planned, but in a larger scale/higher traffic environment, the slice query times out and then cassandra quickly runs out of memory. The main difference here is that there is a very large number of writes (and deleted columns) in the row my code is attempting to read. Is the problem that cassandra is attempting to load all the deleted columns into memory? I did an sstable2json dump and saw that the "d" deletion marker seemed to be present for the columns, though I didn't write any code to check all the values. Is the solution here partitioning the wide row into multiple narrower rows?
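One common way to partition such a row is to bucket the physical row key by time, so no single row accumulates millions of (mostly tombstoned) columns. A minimal sketch; the bucket width and key naming here are assumptions for illustration, not from the thread:

```python
# Hypothetical sketch: derive a time-bucketed physical row key for each
# marker write, so slices scan one narrow bucket instead of one huge row.
BUCKET_SECONDS = 3600  # assumed bucket width: one row per hour per key

def bucketed_row_key(logical_key: str, ts: float) -> str:
    """Physical row key for a marker written at unix time ts."""
    bucket = int(ts) // BUCKET_SECONDS
    return f"{logical_key}:{bucket}"

# Markers written in the same hour land in the same (narrow) row.
print(bucketed_row_key("pending-actions", 1_000_000))  # → pending-actions:277
```

A recovery scan then only needs to slice the few recent buckets, and whole expired buckets can be deleted as rows rather than leaving per-column tombstones in one ever-growing row.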
Re: How to set LeveledCompactionStrategy for an existing table
in cassandra-cli, i did something like:

update column family pns_credentials with compaction_strategy='LeveledCompactionStrategy';

On Thu, Aug 30, 2012 at 5:20 AM, Jean-Armel Luce wrote:
> Hello,
>
> I am using Cassandra 1.1.1 and CQL3.
> I have a cluster with 1 node (test environment).
> Could you tell me how to set the compaction strategy to Leveled for an
> existing table?
>
> I have a table pns_credentials:
>
> jal@jal-VirtualBox:~/cassandra/apache-cassandra-1.1.1/bin$ ./cqlsh -3
> Connected to Test Cluster at localhost:9160.
> [cqlsh 2.2.0 | Cassandra 1.1.1 | CQL spec 3.0.0 | Thrift protocol 19.32.0]
> Use HELP for help.
> cqlsh> use test1;
> cqlsh:test1> describe table pns_credentials;
>
> CREATE TABLE pns_credentials (
>   ise text PRIMARY KEY,
>   isnew int,
>   ts timestamp,
>   mergestatus int,
>   infranetaccount text,
>   user_level int,
>   msisdn bigint,
>   mergeusertype int
> ) WITH
>   comment='' AND
>   comparator=text AND
>   read_repair_chance=0.10 AND
>   gc_grace_seconds=864000 AND
>   default_validation=text AND
>   min_compaction_threshold=4 AND
>   max_compaction_threshold=32 AND
>   replicate_on_write='true' AND
>   compaction_strategy_class='SizeTieredCompactionStrategy' AND
>   compression_parameters:sstable_compression='SnappyCompressor';
>
> I want to set the Leveled compaction strategy for this table, so I execute
> the following ALTER TABLE:
>
> cqlsh:test1> alter table pns_credentials
>          ... WITH compaction_strategy_class='LeveledCompactionStrategy'
>          ... AND compaction_strategy_options:sstable_size_in_mb=10;
>
> In the Cassandra logs, I see some information:
>
> INFO 10:23:52,532 Enqueuing flush of Memtable-schema_columnfamilies@965212657(1391/1738 serialized/live bytes, 20 ops)
> INFO 10:23:52,533 Writing Memtable-schema_columnfamilies@965212657(1391/1738 serialized/live bytes, 20 ops)
> INFO 10:23:52,629 Completed flushing /var/lib/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-hd-94-Data.db (1442 bytes) for commitlog position ReplayPosition(segmentId=3556583843054, position=1987)
>
> However, when I look at the description of the table, it is still using
> SizeTieredCompactionStrategy:
>
> cqlsh:test1> describe table pns_credentials;
> [same CREATE TABLE output as above, still with
> compaction_strategy_class='SizeTieredCompactionStrategy']
>
> In the schema_columnfamilies table (in the system keyspace), pns_credentials
> is also still using SizeTieredCompactionStrategy:
>
> cqlsh:test1> use system;
> cqlsh:system> select * from schema_columnfamilies;
> ...
> test1 | pns_credentials | ... |
> org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy | ...
>
> I stopped/started the Cassandra node, but the table is still using
> SizeTieredCompactionStrategy.
>
> I tried using cassandra-cli, but the alter is still unsuccessful.
>
> Is there anything I am missing?
>
> Thanks.
>
> Jean-Armel