Decommissioning a datacenter deletes the data (on decommissioned datacenter)
tl;dr: Decommissioning a datacenter by running nodetool decommission on its nodes deletes the data on the decommissioned nodes - is this expected?

I am trying out some tests on my multi-datacenter setup. Somewhere in the docs I read that decommissioning a node will stream its data to other nodes but that the node still retains its own copy of the data. I was expecting the same behavior with multiple datacenters. I am using Cassandra 1.2.12. Here are my observations:

Let's say I have a datacenter DC1 which has keyspace keyspace_dc_1, and another datacenter DC2 which has keyspace keyspace_dc_2. They already have some data in them. I add DC2 to DC1 and update the replication factors on both keyspaces. Looking at gossipinfo, I can see that the schemas are synced. I then look at the cfstats output and I can see that both keyspaces are replicated on both datacenters (also on disk, as I can see a non-zero sstable count).

Now, I decommission DC2: 1) Update the replication factors for the keyspaces. 2) Run nodetool decommission on all the nodes. I see that I have lost all my keyspaces (and data), the keyspaces from DC1 and DC2. This does not seem normal to me - is this expected?

Thanks, Sandeep
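A rough sketch of the retire-a-datacenter sequence being attempted here, in the order the 1.2-era docs describe it (keyspace names, DC names and replication counts are placeholders from this thread, not a verified procedure):

  # 1. make sure the surviving DC has all the data while DC2 can still serve as a source
  nodetool repair
  # 2. drop DC2 from every keyspace's replication (run in cqlsh)
  ALTER KEYSPACE keyspace_dc_1 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};
  # 3. only then, on each DC2 node in turn
  nodetool decommission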
Re: Decommissioning a datacenter deletes the data (on decommissioned datacenter)
Hello Rob, Sorry for being ambiguous. By "deletes" I mean that after running decommission I can no longer see, using the cfstats command, any keyspaces owned by this node or replicated on other nodes. I am also seeing the same behavior when I remove a single node from a cluster (without datacenters). On Thu, Aug 7, 2014 at 11:43 AM, Robert Coli rc...@eventbrite.com wrote: On Thu, Aug 7, 2014 at 8:26 AM, srmore comom...@gmail.com wrote: tl;dr: Decommissioning datacenters by running nodetool decommission on a node deletes the data on the decommissioned node - is this expected? What does "deletes" mean? What does "lost all my keyspaces (and data)" mean? =Rob
Re: Decommissioning a datacenter deletes the data (on decommissioned datacenter)
On Thu, Aug 7, 2014 at 12:27 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Aug 7, 2014 at 10:04 AM, srmore comom...@gmail.com wrote: Sorry for being ambiguous. By "deletes" I mean that after running decommission I can no longer see any keyspaces owned by this node or replicated on other nodes using the cfstats command. I am also seeing the same behavior when I remove a single node from a cluster (without datacenters).

I'm still not fully parsing you, but clusters should never forget schema as a result of decommission. Is that what you are saying is happening?

Yes, this is what is happening.

(In fact, even the decommissioned node itself does not forget its schema, which I personally consider a bug.)

Ok, so I am assuming this is not normal behavior and is possibly a bug - is this correct?

=Rob
Re: Decommissioning a datacenter deletes the data (on decommissioned datacenter)
Thanks for the detailed reply Ken, this really helps. I also realized, after reading your email, that I wasn't doing a 'nodetool rebuild'. I was following the steps mentioned here: http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_decomission_dc_t.html I'll do a test with nodetool rebuild and see what happens.

On Thu, Aug 7, 2014 at 1:27 PM, Ken Hancock ken.hanc...@schange.com wrote:

My reading is it didn't forget the schema. It lost the data. My reading is decommissioning worked fine. Possibly when you changed the replication on a keyspace to include a second data center, the data didn't get replicated.

Probably not, because I could see the sstables for the keyspace from the other datacenter being created. My understanding could be wrong though.

When you ADD a datacenter, you need to do a nodetool rebuild to get the data streamed to the new data center. When you alter a keyspace to include another datacenter in its replication schema, a nodetool repair is required -- was this done? http://www.datastax.com/documentation/cql/3.0/cql/cql_using/update_ks_rf_t.html

I missed the 'nodetool rebuild' step, that could be my issue; yes, I did run repair.

When you use nodetool decommission, you're effectively deleting the partitioning token from the cluster. The node being decommissioned will stream its data to the new owners of its original token range. This streaming should in no way affect any other datacenter, because you have not changed the tokens or data ownership for any datacenter but the one in which you are decommissioning a node.

That is what my understanding was, but when I decommission it does clear out (remove) all the keyspaces.

When you eventually decommission the last node in the datacenter, all data is gone as there are no tokens in that datacenter to own any data. If you had a keyspace that was only replicated within that datacenter, that data is gone (though you could probably add nodes back in and resurrect it).

The (now outdated) documentation [1] says that data remains on the node even after decommissioning. So I do not understand why the data would go away.

If you had a keyspace where you changed the replication to include another datacenter, and that datacenter had never received the data, then it may have the schema but would have none of the data (other than new data that was written AFTER you changed the replication).

I would expect the repair to fix this, i.e. to stream the old data to the newly added datacenter. So, does nodetool rebuild help here?

[1] https://wiki.apache.org/cassandra/Operations#Removing_nodes_entirely

On Thu, Aug 7, 2014 at 2:11 PM, srmore comom...@gmail.com wrote: On Thu, Aug 7, 2014 at 12:27 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Aug 7, 2014 at 10:04 AM, srmore comom...@gmail.com wrote: Sorry for being ambiguous. By "deletes" I mean that after running decommission I can no longer see any keyspaces owned by this node or replicated on other nodes using the cfstats command. I am also seeing the same behavior when I remove a single node from a cluster (without datacenters). I'm still not fully parsing you, but clusters should never forget schema as a result of decommission. Is that what you are saying is happening? Yes, this is what is happening. (In fact, even the decommissioned node itself does not forget its schema, which I personally consider a bug.) Ok, so I am assuming this is not normal behavior and is possibly a bug - is this correct?
=Rob
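A minimal sketch of the add-a-datacenter path Ken describes (replication counts are illustrative; run the ALTER in cqlsh):

  ALTER KEYSPACE keyspace_dc_1 WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 2, 'DC2': 2};
  # on every node in the newly added DC2, stream the pre-existing data from DC1:
  nodetool rebuild DC1
  # if replication of an existing keyspace is changed afterwards, repair back-fills the old data:
  nodetool repair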
Re: Decommissioning a datacenter deletes the data (on decommissioned datacenter)
I tried using 'nodetool rebuild' after I add the datacenters - same outcome: after I decommission, my keyspaces are getting wiped out. I don't understand this.
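Before and after each step, it may help to confirm where the data actually lives; on 1.2 something like (keyspace, table and key are placeholders):

  nodetool status                              # ownership and load per datacenter
  nodetool getendpoints <ks> <table> <key>     # nodes holding replicas of a given key
  nodetool cfstats                             # per-node memtable and sstable counts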
Re: Question 1: JMX binding, Question 2: Logging
Hello Kyle, For your first question, you need to create aliases to localhost, e.g. 127.0.0.2, 127.0.0.3, etc.; that should get you going. About the logging issue, I think your instance is failing before it gets to log anything; as a check, you can start one instance and make sure it logs correctly. Hope that helps. Sandeep

On Tue, Feb 4, 2014 at 4:25 PM, Kyle Crumpton (kcrumpto) kcrum...@cisco.com wrote: Hi all, I'm fairly new to Cassandra. I'm deploying it to a PaaS. One thing this entails is that it must be able to have more than one instance on a single node. I'm running into the problem that JMX binds to 0.0.0.0:7199. My question is this: Is there a way to configure this? I have actually found the post that said to change the following: JVM_OPTS=$JVM_OPTS -Djava.rmi.server.hostname=127.1.246.3, where 127.1.246.3 is the IP I want to bind to. This actually did not change the JMX binding at all for me. I saw a post about a JMX listen address in cassandra.yaml and this also did not work. Any clarity on whether this is bindable at all? Or if there are plans for it? Also - I have logging turned on. For some reason, though, my Cassandra is not actually logging as intended. My log folder is actually empty after each (failed) run (due to the port being taken by my other cassandra process). Here is an actual copy of my log4j-server.properties file: http://fpaste.org/74470/15510941/ Any idea why this might not be logging? Thank you and best regards Kyle
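A minimal sketch of running two instances side by side on one host: give each a loopback alias and its own JMX port (addresses and ports below are placeholders):

  # loopback aliases, one per instance (Linux):
  ip addr add 127.0.0.2/8 dev lo
  ip addr add 127.0.0.3/8 dev lo
  # in each instance's conf/cassandra-env.sh, use a distinct JMX port:
  JMX_PORT="7199"   # instance 1
  JMX_PORT="7299"   # instance 2
  # and set listen_address / rpc_address in each instance's cassandra.yaml to its alias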
Re: Lots of deletions results in death by GC
Sorry to hear that Robert, I ran into a similar issue a while ago. I had an extremely heavy write and update load; as a result Cassandra (1.2.9) was constantly flushing to disk and constantly GCing. I tried exactly the same steps you tried (tuning memtable_flush_writers (to 2) and memtable_flush_queue_size (to 8)) with no luck. Almost all of the issues went away when I migrated to 1.2.13; this release also had some fixes which I badly needed. What version are you running? (I tried to look in the thread but couldn't find one, sorry if this is a repeat question.) Dropped messages are a sign that Cassandra is taking heavy load - that's the load shedding mechanism. I would love to see some sort of back-pressure implemented. -sandeep

On Tue, Feb 4, 2014 at 6:10 PM, Robert Wille rwi...@fold3.com wrote:

I ran my test again, and Flush Writer's "All time blocked" increased to 2 and then shortly thereafter GC went into its death spiral. I doubled memtable_flush_writers (to 2) and memtable_flush_queue_size (to 8) and tried again. This time, the table that always sat with Memtable data size = 0 now showed increases in Memtable data size. That was encouraging. It never flushed, which isn't too surprising, because that table has relatively few rows and they are pretty wide. However, on the fourth table to clean, Flush Writer's "All time blocked" went to 1, and then there were no more completed events, and about 10 minutes later GC went into its death spiral. I assume that each time Flush Writer completes an event, that means a table was flushed. Is that right? Also, I got two dropped mutation messages at the same time that Flush Writer's "All time blocked" incremented. I then increased the writers and queue size to 3 and 12, respectively, and ran my test again. This time "All time blocked" remained at 0, but I still suffered death by GC. I would almost think that this is caused by high load on the server, but I've never seen CPU utilization go above about two of my eight available cores. If high load triggers this problem, then that is very disconcerting. That means that a CPU spike could permanently cripple a node. Okay, not permanently, but until a manual flush occurs. If anyone has any further thoughts, I'd love to hear them. I'm quite at the end of my rope. Thanks in advance Robert

From: Nate McCall n...@thelastpickle.com Reply-To: user@cassandra.apache.org Date: Saturday, February 1, 2014 at 9:25 AM To: Cassandra Users user@cassandra.apache.org Subject: Re: Lots of deletions results in death by GC

What's the output of 'nodetool tpstats' while this is happening? Specifically, is Flush Writer "All time blocked" increasing? If so, play around with turning up memtable_flush_writers and memtable_flush_queue_size and see if that helps.

On Sat, Feb 1, 2014 at 9:03 AM, Robert Wille rwi...@fold3.com wrote:

A few days ago I posted about an issue I'm having where GC takes a long time (20-30 seconds), and it happens repeatedly and basically no work gets done. I've done further investigation, and I now believe that I know the cause. If I do a lot of deletes, it creates memory pressure until the memtables are flushed, but Cassandra doesn't flush them. If I manually flush, then life is good again (although that takes a very long time because of the GC issue). If I just leave the flushing to Cassandra, then I end up with death by GC. I believe that when the memtables are full of tombstones, Cassandra doesn't realize how much memory the memtables are actually taking up, and so it doesn't proactively flush them in order to free up heap.
As I was deleting records out of one of my tables, I was watching it via nodetool cfstats, and I found a very curious thing:

Memtable cell count: 1285
Memtable data size, bytes: 0
Memtable switch count: 56

As the deletion process was chugging away, the memtable cell count increased, as expected, but the data size stayed at 0. No flushing occurred. Here's the schema for this table:

CREATE TABLE bdn_index_pub (
    tshard VARCHAR,
    pord INT,
    ord INT,
    hpath VARCHAR,
    page BIGINT,
    PRIMARY KEY (tshard, pord)
) WITH gc_grace_seconds = 0
  AND compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

I have a few tables that I run this cleaning process on, and not all of them exhibit this behavior. One of them reported an increasing number of bytes, as expected, and it also flushed as expected. Here's the schema for that table:

CREATE TABLE bdn_index_child (
    ptshard VARCHAR,
    ord INT,
    hpath VARCHAR,
    PRIMARY KEY (ptshard, ord)
) WITH gc_grace_seconds = 0
  AND compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

In both cases, I'm deleting the entire record (i.e. specifying just the first component of the primary key in the delete statement). Most records in bdn_index_pub have 10,000 rows per record. bdn_index_child usually has
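The knobs discussed in this thread can be exercised and observed with the following (a sketch; keyspace and table names are placeholders):

  nodetool flush <keyspace> <table>   # force the memtable to disk manually
  nodetool tpstats                    # watch FlushWriter "All time blocked" and the dropped-message counters
  nodetool cfstats                    # memtable cell count / data size per column family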
Re: MUTATION messages dropped
What version of Cassandra are you running? I used to see these a lot with 1.2.9; I could correlate the dropped messages with the heap usage almost every time, so check in the logs whether you are getting GC'd. In this respect 1.2.12 appears to be more stable; moving to 1.2.12 took care of this for us. Thanks, Sandeep

On Thu, Dec 19, 2013 at 6:12 AM, Alexander Shutyaev shuty...@gmail.com wrote: Hi all! We've had a problem with cassandra recently. We had 2 one-minute periods when we got a lot of timeouts on the client side (the only timeouts during the 9 days we have been using cassandra in production). In the logs we've found corresponding messages saying something about MUTATION messages dropped. Now, the official FAQ [1] says that this is an indicator that the load is too high. We've checked our monitoring and found out that the 1-minute average cpu load had a local peak at the time of the problem, but it was like 0.8 against the usual 0.2, which I guess is nothing for a 2-core virtual machine. We've also checked java threads - there was no peak there and their count was reasonable, ~240-250. Can anyone give us a hint - what should we monitor to see this high load and what should we tune to make it acceptable? Thanks in advance, Alexander [1] http://wiki.apache.org/cassandra/FAQ#dropped_messages
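A quick way to check the GC correlation suggested above (the log path is the packaged default and may differ on your install):

  grep GCInspector /var/log/cassandra/system.log | tail   # long ParNew / CMS pauses get logged here
  nodetool tpstats                                         # dropped counts per message type at the bottom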
Re: Write performance with 1.2.12
On Wed, Dec 11, 2013 at 10:49 PM, Aaron Morton aa...@thelastpickle.comwrote: It is the write latency, read latency is ok. Interestingly the latency is low when there is one node. When I join other nodes the latency drops about 1/3. To be specific, when I start sending traffic to the other nodes the latency for all the nodes increases, if I stop traffic to other nodes the latency drops again, I checked, this is not node specific it happens to any node. Is this the local write latency or the cluster wide write request latency ? This is a cluster wide write latency. What sort of numbers are you seeing ? I have a custom application that writes data to the cassandra node, so the numbers might be different than the standard stress test but it should be good enough for comparison. With the previous release 1.0.12 I was getting around 10K requests/ sec and with 1.2.12 I am getting around 6K requests/ sec. Everything else is the same. This is a three node cluster. With a single node I get 3K for cassandra 1.0.12 and 1.2.12. So I suspect there is some network chatter. I have started looking at the sources, hoping to find something. -sandeep Cheers - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 12/12/2013, at 3:39 pm, srmore comom...@gmail.com wrote: Thanks Aaron On Wed, Dec 11, 2013 at 8:15 PM, Aaron Morton aa...@thelastpickle.comwrote: Changed memtable_total_space_in_mb to 1024 still no luck. Reducing memtable_total_space_in_mb will increase the frequency of flushing to disk, which will create more for compaction to do and result in increased IO. You should return it to the default. You are right, had to revert it back to default. when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. What are you measuring, request latency or local read/write latency ? If it’s write latency it’s probably GC, if it’s read is probably IO or data model. It is the write latency, read latency is ok. Interestingly the latency is low when there is one node. When I join other nodes the latency drops about 1/3. To be specific, when I start sending traffic to the other nodes the latency for all the nodes increases, if I stop traffic to other nodes the latency drops again, I checked, this is not node specific it happens to any node. I don't see any GC activity in logs. Tried to control the compaction by reducing the number of threads, did not help much. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/12/2013, at 8:05 am, srmore comom...@gmail.com wrote: Changed memtable_total_space_in_mb to 1024 still no luck. On Fri, Dec 6, 2013 at 11:05 AM, Vicky Kak vicky@gmail.com wrote: Can you set the memtable_total_space_in_mb value, it is defaulting to 1/3 which is 8/3 ~ 2.6 gb in capacity http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management The flushing of 2.6 gb to the disk might slow the performance if frequently called, may be you have lots of write operations going on. On Fri, Dec 6, 2013 at 10:06 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configurations and not the cassandra configurations which is in cassandra.yaml. Apologies, was tuning JVM and that's what was in my mind. 
Here are the cassandra settings http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with 1.7 gb heap. Are these spikes causing any issue at your end? There are no big spikes, the overall performance seems to be about 40% low. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.comwrote: Hard to say much without knowing about the cassandra configurations. The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes compactions/GC's could skipe the CPU, I had similar behavior with my setup. Were you able to get around it ? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12, they are pretty big machines 64G ram with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both the cases we were writing
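To separate the cluster-wide (coordinator) latency from the local write latency being discussed, these two views are worth comparing, if available on your version (keyspace and table are placeholders):

  nodetool proxyhistograms              # request latency as seen by the coordinator
  nodetool cfhistograms <ks> <table>    # local read/write latency for one table on this node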
Re: Write performance with 1.2.12
On Thu, Dec 12, 2013 at 11:15 AM, J. Ryan Earl o...@jryanearl.us wrote: Why did you switch to RandomPartitioner away from Murmur3Partitioner? Have you tried with Murmur3? 1. # partitioner: org.apache.cassandra.dht.Murmur3Partitioner 2. partitioner: org.apache.cassandra.dht.RandomPartitioner Since I am comparing between the two versions I am keeping all the settings same. I see Murmur3Partitioner has some performance improvement but then switching back to RandomPartitioner should not cause performance to tank, right ? or am I missing something ? Also, is there an easier way to update the data from RandomPartitioner to Murmur3 ? (upgradesstable ?) On Fri, Dec 6, 2013 at 10:36 AM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configurations and not the cassandra configurations which is in cassandra.yaml. Apologies, was tuning JVM and that's what was in my mind. Here are the cassandra settings http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with 1.7 gb heap. Are these spikes causing any issue at your end? There are no big spikes, the overall performance seems to be about 40% low. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing about the cassandra configurations. The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes compactions/GC's could skipe the CPU, I had similar behavior with my setup. Were you able to get around it ? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12, they are pretty big machines 64G ram with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both the cases we were writing with LOCAL_QUORUM. Changing CL to ONE make a slight improvement but not much. The read_Repair_chance is 0.1. We see some compactions running. following is my iostat -x output, sda is the ssd (for commit log) and sdb is the spinner. avg-cpu: %user %nice %system %iowait %steal %idle 66.460.008.950.010.00 24.58 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sda2 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.60 0.00 4.80 8.00 0.005.33 2.67 0.16 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-3 0.00 0.00 0.00 24.80 0.00 198.40 8.00 0.249.80 0.13 0.32 dm-4 0.00 0.00 0.00 6.60 0.0052.80 8.00 0.011.36 0.55 0.36 dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-6 0.00 0.00 0.00 24.80 0.00 198.40 8.00 0.29 11.60 0.13 0.32 I can see I am cpu bound here but couldn't figure out exactly what is causing it, is this caused by GC or Compaction ? 
I am thinking it is compaction, I see a lot of context switches and interrupts in my vmstat output. I don't see GC activity in the logs but see some compaction activity. Has anyone seen this ? or know what can be done to free up the CPU. Thanks, Sandeep
Re: Write performance with 1.2.12
Thanks Aaron On Wed, Dec 11, 2013 at 8:15 PM, Aaron Morton aa...@thelastpickle.comwrote: Changed memtable_total_space_in_mb to 1024 still no luck. Reducing memtable_total_space_in_mb will increase the frequency of flushing to disk, which will create more for compaction to do and result in increased IO. You should return it to the default. You are right, had to revert it back to default. when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. What are you measuring, request latency or local read/write latency ? If it’s write latency it’s probably GC, if it’s read is probably IO or data model. It is the write latency, read latency is ok. Interestingly the latency is low when there is one node. When I join other nodes the latency drops about 1/3. To be specific, when I start sending traffic to the other nodes the latency for all the nodes increases, if I stop traffic to other nodes the latency drops again, I checked, this is not node specific it happens to any node. I don't see any GC activity in logs. Tried to control the compaction by reducing the number of threads, did not help much. Hope that helps. - Aaron Morton New Zealand @aaronmorton Co-Founder Principal Consultant Apache Cassandra Consulting http://www.thelastpickle.com On 7/12/2013, at 8:05 am, srmore comom...@gmail.com wrote: Changed memtable_total_space_in_mb to 1024 still no luck. On Fri, Dec 6, 2013 at 11:05 AM, Vicky Kak vicky@gmail.com wrote: Can you set the memtable_total_space_in_mb value, it is defaulting to 1/3 which is 8/3 ~ 2.6 gb in capacity http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management The flushing of 2.6 gb to the disk might slow the performance if frequently called, may be you have lots of write operations going on. On Fri, Dec 6, 2013 at 10:06 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configurations and not the cassandra configurations which is in cassandra.yaml. Apologies, was tuning JVM and that's what was in my mind. Here are the cassandra settings http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with 1.7 gb heap. Are these spikes causing any issue at your end? There are no big spikes, the overall performance seems to be about 40% low. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing about the cassandra configurations. The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes compactions/GC's could skipe the CPU, I had similar behavior with my setup. Were you able to get around it ? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12, they are pretty big machines 64G ram with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both the cases we were writing with LOCAL_QUORUM. Changing CL to ONE make a slight improvement but not much. The read_Repair_chance is 0.1. We see some compactions running. 
following is my iostat -x output, sda is the ssd (for commit log) and sdb is the spinner. avg-cpu: %user %nice %system %iowait %steal %idle 66.460.008.950.010.00 24.58 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sda2 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.60 0.00 4.80 8.00 0.005.33 2.67 0.16 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-3 0.00 0.00 0.00 24.80 0.00 198.40 8.00 0.249.80 0.13 0.32 dm-4 0.00 0.00 0.00 6.60 0.0052.80
Write performance with 1.2.12
We have a 3 node cluster running cassandra 1.2.12; they are pretty big machines, 64G RAM with 16 cores, and the cassandra heap is 8G. The interesting observation is that when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same boxes and we observed a slight dip, but not half as with 1.2.12. In both cases we were writing with LOCAL_QUORUM. Changing CL to ONE makes a slight improvement but not much. The read_repair_chance is 0.1. We see some compactions running.

Following is my iostat -x output; sda is the ssd (for commit log) and sdb is the spinner.

avg-cpu:  %user %nice %system %iowait %steal %idle
          66.46  0.00    8.95    0.01   0.00 24.58

Device: rrqm/s wrqm/s  r/s   w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda       0.00  27.60 0.00  4.40   0.00 256.00    58.18     0.01  2.55  1.32  0.58
sda1      0.00   0.00 0.00  0.00   0.00   0.00     0.00     0.00  0.00  0.00  0.00
sda2      0.00  27.60 0.00  4.40   0.00 256.00    58.18     0.01  2.55  1.32  0.58
sdb       0.00   0.00 0.00  0.00   0.00   0.00     0.00     0.00  0.00  0.00  0.00
sdb1      0.00   0.00 0.00  0.00   0.00   0.00     0.00     0.00  0.00  0.00  0.00
dm-0      0.00   0.00 0.00  0.00   0.00   0.00     0.00     0.00  0.00  0.00  0.00
dm-1      0.00   0.00 0.00  0.60   0.00   4.80     8.00     0.00  5.33  2.67  0.16
dm-2      0.00   0.00 0.00  0.00   0.00   0.00     0.00     0.00  0.00  0.00  0.00
dm-3      0.00   0.00 0.00 24.80   0.00 198.40     8.00     0.24  9.80  0.13  0.32
dm-4      0.00   0.00 0.00  6.60   0.00  52.80     8.00     0.01  1.36  0.55  0.36
dm-5      0.00   0.00 0.00  0.00   0.00   0.00     0.00     0.00  0.00  0.00  0.00
dm-6      0.00   0.00 0.00 24.80   0.00 198.40     8.00     0.29 11.60  0.13  0.32

I can see I am cpu bound here but couldn't figure out exactly what is causing it; is this caused by GC or compaction? I am thinking it is compaction - I see a lot of context switches and interrupts in my vmstat output. I don't see GC activity in the logs but see some compaction activity. Has anyone seen this? Or know what can be done to free up the CPU?

Thanks, Sandeep
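A couple of ways to tell GC from compaction load here (a sketch; the log path is the packaged default):

  nodetool compactionstats                         # pending tasks and currently running compactions
  nodetool setcompactionthroughput 8               # temporarily throttle compaction to see if CPU drops
  grep GCInspector /var/log/cassandra/system.log   # GC pauses, if any are long enough to be logged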
Re: Write performance with 1.2.12
On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing about the cassandra configurations. The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes compactions/GC's could skipe the CPU, I had similar behavior with my setup. Were you able to get around it ? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12, they are pretty big machines 64G ram with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both the cases we were writing with LOCAL_QUORUM. Changing CL to ONE make a slight improvement but not much. The read_Repair_chance is 0.1. We see some compactions running. following is my iostat -x output, sda is the ssd (for commit log) and sdb is the spinner. avg-cpu: %user %nice %system %iowait %steal %idle 66.460.008.950.010.00 24.58 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sda2 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.60 0.00 4.80 8.00 0.005.33 2.67 0.16 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-3 0.00 0.00 0.00 24.80 0.00 198.40 8.00 0.249.80 0.13 0.32 dm-4 0.00 0.00 0.00 6.60 0.0052.80 8.00 0.011.36 0.55 0.36 dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-6 0.00 0.00 0.00 24.80 0.00 198.40 8.00 0.29 11.60 0.13 0.32 I can see I am cpu bound here but couldn't figure out exactly what is causing it, is this caused by GC or Compaction ? I am thinking it is compaction, I see a lot of context switches and interrupts in my vmstat output. I don't see GC activity in the logs but see some compaction activity. Has anyone seen this ? or know what can be done to free up the CPU. Thanks, Sandeep
Re: Write performance with 1.2.12
On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configurations and not the cassandra configurations which is in cassandra.yaml. Apologies, was tuning JVM and that's what was in my mind. Here are the cassandra settings http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with 1.7 gb heap. Are these spikes causing any issue at your end? There are no big spikes, the overall performance seems to be about 40% low. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing about the cassandra configurations. The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes compactions/GC's could skipe the CPU, I had similar behavior with my setup. Were you able to get around it ? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12, they are pretty big machines 64G ram with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both the cases we were writing with LOCAL_QUORUM. Changing CL to ONE make a slight improvement but not much. The read_Repair_chance is 0.1. We see some compactions running. following is my iostat -x output, sda is the ssd (for commit log) and sdb is the spinner. avg-cpu: %user %nice %system %iowait %steal %idle 66.460.008.950.010.00 24.58 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sda2 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.60 0.00 4.80 8.00 0.005.33 2.67 0.16 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-3 0.00 0.00 0.00 24.80 0.00 198.40 8.00 0.249.80 0.13 0.32 dm-4 0.00 0.00 0.00 6.60 0.0052.80 8.00 0.011.36 0.55 0.36 dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-6 0.00 0.00 0.00 24.80 0.00 198.40 8.00 0.29 11.60 0.13 0.32 I can see I am cpu bound here but couldn't figure out exactly what is causing it, is this caused by GC or Compaction ? I am thinking it is compaction, I see a lot of context switches and interrupts in my vmstat output. I don't see GC activity in the logs but see some compaction activity. Has anyone seen this ? or know what can be done to free up the CPU. Thanks, Sandeep
Re: Write performance with 1.2.12
Looks like I am spending some time in GC. java.lang:type=GarbageCollector,name=ConcurrentMarkSweep CollectionTime = 51707; CollectionCount = 103; java.lang:type=GarbageCollector,name=ParNew CollectionTime = 466835; CollectionCount = 21315; On Fri, Dec 6, 2013 at 9:58 AM, Jason Wee peich...@gmail.com wrote: Hi srmore, Perhaps if you use jconsole and connect to the jvm using jmx. Then uner MBeans tab, start inspecting the GC metrics. /Jason On Fri, Dec 6, 2013 at 11:40 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing about the cassandra configurations. The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes compactions/GC's could skipe the CPU, I had similar behavior with my setup. Were you able to get around it ? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12, they are pretty big machines 64G ram with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both the cases we were writing with LOCAL_QUORUM. Changing CL to ONE make a slight improvement but not much. The read_Repair_chance is 0.1. We see some compactions running. following is my iostat -x output, sda is the ssd (for commit log) and sdb is the spinner. avg-cpu: %user %nice %system %iowait %steal %idle 66.460.008.950.010.00 24.58 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sda2 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.60 0.00 4.80 8.00 0.005.33 2.67 0.16 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-3 0.00 0.00 0.00 24.80 0.00 198.40 8.00 0.249.80 0.13 0.32 dm-4 0.00 0.00 0.00 6.60 0.0052.80 8.00 0.011.36 0.55 0.36 dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-6 0.00 0.00 0.00 24.80 0.00 198.40 8.00 0.29 11.60 0.13 0.32 I can see I am cpu bound here but couldn't figure out exactly what is causing it, is this caused by GC or Compaction ? I am thinking it is compaction, I see a lot of context switches and interrupts in my vmstat output. I don't see GC activity in the logs but see some compaction activity. Has anyone seen this ? or know what can be done to free up the CPU. Thanks, Sandeep
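To watch GC continuously instead of reading the cumulative JMX counters, something like (the pid is a placeholder):

  jstat -gcutil <cassandra-pid> 5000   # heap-region occupancy and GC time, sampled every 5 seconds
  # or enable GC logging via cassandra-env.sh: -Xloggc:/var/log/cassandra/gc.log -XX:+PrintGCDetails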
Re: Write performance with 1.2.12
Not long: Uptime (seconds) : 6828 Token: 56713727820156410577229101238628035242 ID : c796609a-a050-48df-bf56-bb09091376d9 Gossip active: true Thrift active: true Native Transport active: false Load : 49.71 GB Generation No: 1386344053 Uptime (seconds) : 6828 Heap Memory (MB) : 2409.71 / 8112.00 Data Center : DC Rack : RAC-1 Exceptions : 0 Key Cache: size 56154704 (bytes), capacity 104857600 (bytes), 27 hits, 155669426 requests, 0.000 recent hit rate, 14400 save period in seconds Row Cache: size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds On Fri, Dec 6, 2013 at 11:15 AM, Vicky Kak vicky@gmail.com wrote: Since how long the server had been up, hours,days,months? On Fri, Dec 6, 2013 at 10:41 PM, srmore comom...@gmail.com wrote: Looks like I am spending some time in GC. java.lang:type=GarbageCollector,name=ConcurrentMarkSweep CollectionTime = 51707; CollectionCount = 103; java.lang:type=GarbageCollector,name=ParNew CollectionTime = 466835; CollectionCount = 21315; On Fri, Dec 6, 2013 at 9:58 AM, Jason Wee peich...@gmail.com wrote: Hi srmore, Perhaps if you use jconsole and connect to the jvm using jmx. Then uner MBeans tab, start inspecting the GC metrics. /Jason On Fri, Dec 6, 2013 at 11:40 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing about the cassandra configurations. The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes compactions/GC's could skipe the CPU, I had similar behavior with my setup. Were you able to get around it ? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12, they are pretty big machines 64G ram with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both the cases we were writing with LOCAL_QUORUM. Changing CL to ONE make a slight improvement but not much. The read_Repair_chance is 0.1. We see some compactions running. following is my iostat -x output, sda is the ssd (for commit log) and sdb is the spinner. avg-cpu: %user %nice %system %iowait %steal %idle 66.460.008.950.010.00 24.58 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sda2 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.60 0.00 4.80 8.00 0.005.33 2.67 0.16 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-3 0.00 0.00 0.00 24.80 0.00 198.40 8.00 0.249.80 0.13 0.32 dm-4 0.00 0.00 0.00 6.60 0.0052.80 8.00 0.011.36 0.55 0.36 dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-6 0.00 0.00 0.00 24.80 0.00 198.40 8.00 0.29 11.60 0.13 0.32 I can see I am cpu bound here but couldn't figure out exactly what is causing it, is this caused by GC or Compaction ? 
I am thinking it is compaction, I see a lot of context switches and interrupts in my vmstat output. I don't see GC activity in the logs but see some compaction activity. Has anyone seen this ? or know what can be done to free up the CPU. Thanks, Sandeep
Re: Write performance with 1.2.12
Changed memtable_total_space_in_mb to 1024 still no luck. On Fri, Dec 6, 2013 at 11:05 AM, Vicky Kak vicky@gmail.com wrote: Can you set the memtable_total_space_in_mb value, it is defaulting to 1/3 which is 8/3 ~ 2.6 gb in capacity http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management The flushing of 2.6 gb to the disk might slow the performance if frequently called, may be you have lots of write operations going on. On Fri, Dec 6, 2013 at 10:06 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:59 AM, Vicky Kak vicky@gmail.com wrote: You have passed the JVM configurations and not the cassandra configurations which is in cassandra.yaml. Apologies, was tuning JVM and that's what was in my mind. Here are the cassandra settings http://pastebin.com/uN42GgYT The spikes are not that significant in our case and we are running the cluster with 1.7 gb heap. Are these spikes causing any issue at your end? There are no big spikes, the overall performance seems to be about 40% low. On Fri, Dec 6, 2013 at 9:10 PM, srmore comom...@gmail.com wrote: On Fri, Dec 6, 2013 at 9:32 AM, Vicky Kak vicky@gmail.com wrote: Hard to say much without knowing about the cassandra configurations. The cassandra configuration is -Xms8G -Xmx8G -Xmn800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly Yes compactions/GC's could skipe the CPU, I had similar behavior with my setup. Were you able to get around it ? -VK On Fri, Dec 6, 2013 at 7:40 PM, srmore comom...@gmail.com wrote: We have a 3 node cluster running cassandra 1.2.12, they are pretty big machines 64G ram with 16 cores, cassandra heap is 8G. The interesting observation is that, when I send traffic to one node its performance is 2x more than when I send traffic to all the nodes. We ran 1.0.11 on the same box and we observed a slight dip but not half as seen with 1.2.12. In both the cases we were writing with LOCAL_QUORUM. Changing CL to ONE make a slight improvement but not much. The read_Repair_chance is 0.1. We see some compactions running. following is my iostat -x output, sda is the ssd (for commit log) and sdb is the spinner. avg-cpu: %user %nice %system %iowait %steal %idle 66.460.008.950.010.00 24.58 Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sda2 0.0027.60 0.00 4.40 0.00 256.00 58.18 0.012.55 1.32 0.58 sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.60 0.00 4.80 8.00 0.005.33 2.67 0.16 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-3 0.00 0.00 0.00 24.80 0.00 198.40 8.00 0.249.80 0.13 0.32 dm-4 0.00 0.00 0.00 6.60 0.0052.80 8.00 0.011.36 0.55 0.36 dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.000.00 0.00 0.00 dm-6 0.00 0.00 0.00 24.80 0.00 198.40 8.00 0.29 11.60 0.13 0.32 I can see I am cpu bound here but couldn't figure out exactly what is causing it, is this caused by GC or Compaction ? I am thinking it is compaction, I see a lot of context switches and interrupts in my vmstat output. I don't see GC activity in the logs but see some compaction activity. Has anyone seen this ? or know what can be done to free up the CPU. Thanks, Sandeep
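For reference, the cassandra.yaml knobs from this exchange (1.2.x names; the values shown are illustrative, not recommendations):

  memtable_total_space_in_mb: 2048   # defaults to one third of the heap when left commented out
  memtable_flush_writers: 2          # defaults to the number of data directories
  memtable_flush_queue_size: 4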
Cassandra high heap utilization under heavy reads and writes.
Hello, We moved to cassandra 1.2.9 from 1.0.11 to take advantage of the off-heap bloom filters and other improvements. We see a lot of messages dropped under high load conditions. We noticed that when we do heavy reads AND writes simultaneously (we read first and check whether the key exists; if not, we write it), the Cassandra heap increases dramatically and then gossip marks the node down (as a result of the high load on the node). Under heavy 'reads only' we don't see this behavior. Has anyone seen this behavior? Any suggestions? Thanks!
Re: java.io.FileNotFoundException when setting up internode_compression
Thanks Christopher! I don't think glibc is the issue (as it did get that far). /usr/tmp/snappy-1.0.5-libsnappyjava.so is not there, and permissions look ok; are there any special settings (like JVM args) that I should be using? I can see libsnappyjava.so in the jar though (snappy-java-1.0.5.jar\org\xerial\snappy\native\Linux\i386\). One other thing: I am using RedHat 6. I will try updating glibc and see what happens. Thanks!

On Mon, Nov 11, 2013 at 5:01 PM, Christopher Wirt chris.w...@struq.com wrote: I had this the other day when we were accidentally provisioned a centos5 machine (instead of 6). Think it relates to the version of glibc. Notice it wants the native binary .so, not the .jar. So maybe update to a newer version of glibc? Or possibly make sure the .so exists at /usr/tmp/snappy-1.0.5-libsnappyjava.so? I was lucky and just did an OS reload to centos6. Here is someone having a similar issue. http://mail-archives.apache.org/mod_mbox/cassandra-commits/201307.mbox/%3CJIRA.12616012.1352862646995.6820.1373083550278@arcas%3E

From: srmore [mailto:comom...@gmail.com] Sent: 11 November 2013 21:32 To: user@cassandra.apache.org Subject: java.io.FileNotFoundException when setting up internode_compression

I might be missing something obvious here; for some reason I cannot seem to get internode_compression = all to work. I am getting the following exception. I am using cassandra 1.2.9 and have snappy-java-1.0.5.jar in my classpath. A Google search did not return any useful result; has anyone seen this before?

java.io.FileNotFoundException: /usr/tmp/snappy-1.0.5-libsnappyjava.so (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.init(FileOutputStream.java:194)
at java.io.FileOutputStream.init(FileOutputStream.java:145)
at org.xerial.snappy.SnappyLoader.extractLibraryFile(SnappyLoader.java:394)
at org.xerial.snappy.SnappyLoader.findNativeLibrary(SnappyLoader.java:468)
at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:318)
at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229)
at org.xerial.snappy.Snappy.clinit(Snappy.java:48)
at org.apache.cassandra.io.compress.SnappyCompressor.create(SnappyCompressor.java:45)
at org.apache.cassandra.io.compress.SnappyCompressor.isAvailable(SnappyCompressor.java:55)
at org.apache.cassandra.io.compress.SnappyCompressor.clinit(SnappyCompressor.java:37)
at org.apache.cassandra.config.CFMetaData.clinit(CFMetaData.java:82)
at org.apache.cassandra.config.KSMetaData.systemKeyspace(KSMetaData.java:81)
at org.apache.cassandra.config.DatabaseDescriptor.loadYaml(DatabaseDescriptor.java:471)
at org.apache.cassandra.config.DatabaseDescriptor.clinit(DatabaseDescriptor.java:123)
Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1738)
at java.lang.Runtime.loadLibrary0(Runtime.java:823)
at java.lang.System.loadLibrary(System.java:1028)
at org.xerial.snappy.SnappyNativeLoader.loadLibrary(SnappyNativeLoader.java:52)
... 18 more
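If the default temp location is the problem (e.g. /usr/tmp missing, not writable, or mounted noexec), snappy-java can be pointed at another directory; the property name comes from the snappy-java library, so verify it against the bundled version:

  # in conf/cassandra-env.sh
  JVM_OPTS="$JVM_OPTS -Dorg.xerial.snappy.tempdir=/var/tmp"
  # the directory must exist, be writable, and allow executing the extracted .so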
Re: A lot of MUTATION and REQUEST_RESPONSE messages dropped
The problem was cross_node_timeout value,I had it set to true and my ntp clocks were not synchronized as a result, some of the requests were dropped. Thanks, Sandeep On Sat, Nov 9, 2013 at 6:02 PM, srmore comom...@gmail.com wrote: I recently upgraded to 1.2.9 and I am seeing a lot of REQUEST_RESPONSE and MUTATION messages are being dropped. This happens when I have multiple nodes in the cluster (about 3 nodes) and I send traffic to only one node. I don't think the traffic is that high, it is around 400 msg/sec with 100 threads. When I take down other two nodes I don't see any errors (at least on the client side) I am using Pelops. On the client I get UnavailableException, but the nodes are up. Initially I thought I am hitting CASSANDRA-6297 (gossip thread blocking) so I changed memtable_flush_writers to 3. Still no luck. UnavailableException: org.scale7.cassandra.pelops.exceptions.UnavailableException: null at org.scale7.cassandra.pelops.exceptions.IExceptionTranslator$ExceptionTranslator.translate(IExceptionTranslator.java:61) ~[na:na] at In the debug log on the cassandra node this is the exception I see DEBUG [Thrift:78] 2013-11-09 16:47:28,212 CustomTThreadPoolServer.java Thrift transport error occurred during processing of message. org.apache.thrift.transport.TTransportException at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129) at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378) at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:22) at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) Could this be because of high load ? with Cassandra 1.0.011 I did not see this issue. Thanks, Sandeep
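For reference, the setting involved and a quick clock-sync check (the 1.2 default for this option is false):

  # cassandra.yaml - only enable if NTP keeps all nodes' clocks in sync
  cross_node_timeout: false
  # verify NTP peers and offsets on each node:
  ntpq -p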
java.io.FileNotFoundException when setting up internode_compression
I might be missing something obvious here, for some reason I cannot seem to get internode_compression = all to work. I am getting the following exception. I am using cassandra 1.2.9 and have snappy-java-1.0.5.jar in my classpath. Google search did not return any useful result, has anyone seen this before ? java.io.FileNotFoundException: /usr/tmp/snappy-1.0.5-libsnappyjava.so (No such file or directory) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:194) at java.io.FileOutputStream.init(FileOutputStream.java:145) at org.xerial.snappy.SnappyLoader.extractLibraryFile(SnappyLoader.java:394) at org.xerial.snappy.SnappyLoader.findNativeLibrary(SnappyLoader.java:468) at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:318) at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229) at org.xerial.snappy.Snappy.clinit(Snappy.java:48) at org.apache.cassandra.io.compress.SnappyCompressor.create(SnappyCompressor.java:45) at org.apache.cassandra.io.compress.SnappyCompressor.isAvailable(SnappyCompressor.java:55) at org.apache.cassandra.io.compress.SnappyCompressor.clinit(SnappyCompressor.java:37) at org.apache.cassandra.config.CFMetaData.clinit(CFMetaData.java:82) at org.apache.cassandra.config.KSMetaData.systemKeyspace(KSMetaData.java:81) at org.apache.cassandra.config.DatabaseDescriptor.loadYaml(DatabaseDescriptor.java:471) at org.apache.cassandra.config.DatabaseDescriptor.clinit(DatabaseDescriptor.java:123) Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1738) at java.lang.Runtime.loadLibrary0(Runtime.java:823) at java.lang.System.loadLibrary(System.java:1028) at org.xerial.snappy.SnappyNativeLoader.loadLibrary(SnappyNativeLoader.java:52) ... 18 more
A lot of MUTATION and REQUEST_RESPONSE messages dropped
I recently upgraded to 1.2.9 and I am seeing a lot of REQUEST_RESPONSE and MUTATION messages being dropped. This happens when I have multiple nodes in the cluster (about 3 nodes) and I send traffic to only one node. I don't think the traffic is that high, it is around 400 msg/sec with 100 threads. When I take down the other two nodes I don't see any errors (at least on the client side). I am using Pelops. On the client I get UnavailableException, but the nodes are up. Initially I thought I was hitting CASSANDRA-6297 (gossip thread blocking), so I changed memtable_flush_writers to 3. Still no luck. UnavailableException: org.scale7.cassandra.pelops.exceptions.UnavailableException: null at org.scale7.cassandra.pelops.exceptions.IExceptionTranslator$ExceptionTranslator.translate(IExceptionTranslator.java:61) ~[na:na] at In the debug log on the cassandra node this is the exception I see DEBUG [Thrift:78] 2013-11-09 16:47:28,212 CustomTThreadPoolServer.java Thrift transport error occurred during processing of message. org.apache.thrift.transport.TTransportException at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129) at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378) at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:22) at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) Could this be because of high load? With Cassandra 1.0.11 I did not see this issue. Thanks, Sandeep
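Dropped MUTATION and REQUEST_RESPONSE counts can be watched without grepping the logs; in 1.2, nodetool tpstats prints a dropped-message summary per message type, which makes it easier to correlate drops with load or clock problems. For example:

    nodetool -h localhost tpstats
    # The output ends with a "Message type / Dropped" section listing MUTATION,
    # REQUEST_RESPONSE, READ, etc. Non-zero, steadily growing counters indicate
    # requests that timed out (rpc_timeout / cross_node_timeout) before being served.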
Re: heap issues - looking for advices on gc tuning
We ran into similar heap issues a while ago on 1.0.11; I am not sure whether you have the luxury of upgrading to at least 1.2.9, we did not. After a lot of painful attempts and weeks of testing (just as in your case) the following settings worked (they did not completely relieve the heap pressure but helped a lot). We still see some heap issues but at least it is a bit more stable. Unlike in your case we had very heavy reads and writes. But it's good to know that this happens for light load, I was thinking this was a symptom of heavy load. -XX:NewSize=1200M -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=2 Not sure whether this will help you or not but I think it's worth a try. -sandeep On Wed, Oct 30, 2013 at 4:34 AM, Jason Tang ares.t...@gmail.com wrote: What's the configuration of the following parameters? memtable_flush_queue_size: concurrent_compactors: 2013/10/30 Piavlo lolitus...@gmail.com Hi, Below I try to give a full picture of the problem I'm facing. This is a 12 node cluster, running on ec2 with m2.xlarge instances (17G ram, 2 cpus). Cassandra version is 1.0.8. The cluster normally has between 3000 - 1500 reads per second (depending on time of the day) and 1700 - 800 writes per second - according to OpsCenter. RF=3, no row caches are used. Memory relevant configs from cassandra.yaml: flush_largest_memtables_at: 0.85 reduce_cache_sizes_at: 0.90 reduce_cache_capacity_to: 0.75 commitlog_total_space_in_mb: 4096 relevant JVM options used are: -Xms8000M -Xmx8000M -Xmn400M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly Now what happens is that with these settings, after a cassandra process restart, the GC is working fine at the beginning, and heap used looks like a saw with perfect teeth; eventually the teeth size starts to diminish until the teeth become not noticeable, and then cassandra starts to spend lots of CPU time doing gc. It takes about 2 weeks for such a cycle, and then I need to restart the cassandra process to improve performance. During all this time there are no memory related messages in cassandra system.log, except a GC for ParNew a little above 200ms once in a while. Things I've already done trying to reduce this eventual heap pressure: 1) reducing bloom_filter_fp_chance, resulting in a reduction from ~700MB to ~280MB total per node based on all Filter.db files on the node. 2) reducing key cache sizes, and dropping key_caches for CFs which do not have many reads 3) the heap size was increased from 7000M to 8000M All these have not really helped; just the increase from 7000M to 8000M helped to stretch the cycle until excessive gc from ~9 days to ~14 days. I've tried to graph over time the data that is supposed to be in heap vs actual heap size, by summing up all CFs' bloom filter sizes + all CFs' key cache capacities multiplied by average key size + all CFs' memtables data size reported (I've overestimated the data size a bit on purpose to be on the safe side). Here is a link to a graph showing the last 2 days of metrics for a node which could not effectively do GC, and then the cassandra process was restarted: http://awesomescreenshot.com/0401w5y534 You can clearly see that before and after restart, the size of data that is supposed to be in heap is pretty much the same, which makes me think that what I really need is GC tuning.
Also I suppose that this is not due to the number of total keys each node has, which is between 300 - 200 million keys for all CF key estimates summed on a node. The nodes have data sizes between 75G to 45G, corresponding to the millions of keys. And all nodes are starting to have heavy GC load after about 14 days. Also the excessive GC and heap usage are not affected by load, which varies depending on time of the day (see read/write rates at the beginning of the mail). So again based on this, I assume this is not due to a large number of keys or too much load on the cluster, but due to a pure GC misconfiguration issue. Things I remember that I've tried for GC tuning: 1) Changing -XX:MaxTenuringThreshold=1 to values like 8 - did not help. 2) Adding -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=10 -XX:ParallelGCThreads=2 -XX:ParallelCMSThreads=1 - this actually made things worse. 3) Adding -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=8 - did not help. Also since it takes like 2 weeks to verify that changing a GC setting did not help, the process is painfully slow to try all the possibilities :) I'd highly appreciate any help and hints on the GC tuning. tnx Alex
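For anyone wanting to try the new-generation settings suggested at the top of this thread, they are normally applied through cassandra-env.sh rather than cassandra.yaml. A sketch, with the values taken from the suggestion above (treat them as a starting point, not a recommendation):

    # cassandra-env.sh -- illustrative
    HEAP_NEWSIZE="1200M"          # sets the new generation size (equivalent to -Xmn)
    # Adjust the existing survivor/tenuring flags; the stock file ships with
    # SurvivorRatio=8 and MaxTenuringThreshold=1:
    JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=4"
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=2"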
Re: Query a datacenter
Thanks Rob that helps ! On Fri, Oct 25, 2013 at 7:34 PM, Robert Coli rc...@eventbrite.com wrote: On Fri, Oct 25, 2013 at 2:47 PM, srmore comom...@gmail.com wrote: I don't know whether this is possible but was just curious, can you query for the data in the remote datacenter with a CL.ONE ? A coordinator at CL.ONE picks which replica(s) to query based in large part on the dynamic snitch. If your remote data center has a lower badness score from the perspective of the dynamic snitch, a CL.ONE request might go there. 1.2.11 adds [1] a LOCAL_ONE consistencylevel which does the opposite of what you are asking, restricting CL.ONE from going cross-DC. There could be a case where one might not have a QUORUM and would like to read the most recent data which includes the data from the other datacenter. AFAIK to reliably read the data from other datacenter we only have CL.EACH_QUORUM. Using CL.QUORUM requires a QUORUM number of responses, it does not care from which data center those responses come. Also, is there a way one can control how frequently the data is replicated across the datacenters ? Data centers don't really exist in this context [2], so your question is can one control how frequently data is replicated between replicas and the answer is no. All replication always goes to every replica. =Rob [1] https://issues.apache.org/jira/browse/CASSANDRA-6202 [2] this is slightly glib/reductive/inaccurate, but accurate enough for the purposes of this response.
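As an illustration of the LOCAL_ONE option Rob mentions (available from 1.2.11 per CASSANDRA-6202), the consistency level can be set per session in cqlsh; the keyspace and table below are placeholders:

    cqlsh> CONSISTENCY LOCAL_ONE;
    cqlsh> SELECT * FROM my_keyspace.my_table WHERE key = 'abc';
    -- reads now stay inside the coordinator's datacenter instead of possibly
    -- being routed cross-DC by the dynamic snitch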
Query a datacenter
I don't know whether this is possible but was just curious, can you query for the data in the remote datacenter with a CL.ONE ? There could be a case where one might not have a QUORUM and would like to read the most recent data which includes the data from the other datacenter. AFAIK to reliably read the data from other datacenter we only have CL.EACH_QUORUM. Also, is there a way one can control how frequently the data is replicated across the datacenters ? Thanks !
Re: Cassandra Heap Size for data more than 1 TB
Thanks Mohit and Michael, That's what I thought. I have tried all the avenues, will give ParNew a try. With the 1.0.xx I have issues when data sizes go up, hopefully that will not be the case with 1.2. Just curious, has anyone tried 1.2 with large data set, around 1 TB ? Thanks ! On Thu, Oct 3, 2013 at 7:20 AM, Michał Michalski mich...@opera.com wrote: I was experimenting with 128 vs. 512 some time ago and I was unable to see any difference in terms of performance. I'd probably check 1024 too, but we migrated to 1.2 and heap space was not an issue anymore. M. W dniu 02.10.2013 16:32, srmore pisze: I changed my index_interval from 128 to index_interval: 128 to 512, does it make sense to increase more than this ? On Wed, Oct 2, 2013 at 9:30 AM, cem cayiro...@gmail.com wrote: Have a look to index_interval. Cem. On Wed, Oct 2, 2013 at 2:25 PM, srmore comom...@gmail.com wrote: The version of Cassandra I am using is 1.0.11, we are migrating to 1.2.X though. We had tuned bloom filters (0.1) and AFAIK making it lower than this won't matter. Thanks ! On Tue, Oct 1, 2013 at 11:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Which Cassandra version are you on? Essentially heap size is function of number of keys/metadata. In Cassandra 1.2 lot of the metadata like bloom filters were moved off heap. On Tue, Oct 1, 2013 at 9:34 PM, srmore comom...@gmail.com wrote: Does anyone know what would roughly be the heap size for cassandra with 1TB of data ? We started with about 200 G and now on one of the nodes we are already on 1 TB. We were using 8G of heap and that served us well up until we reached 700 G where we started seeing failures and nodes flipping. With 1 TB of data the node refuses to come back due to lack of memory. needless to say repairs and compactions takes a lot of time. We upped the heap from 8 G to 12 G and suddenly everything started moving rapidly i.e. the repair tasks and the compaction tasks. But soon (in about 9-10 hrs) we started seeing the same symptoms as we were seeing with 8 G. So my question is how do I determine what is the optimal size of heap for data around 1 TB ? Following are some of my JVM settings -Xms8G -Xmx8G -Xmn800m -XX:NewSize=1200M XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=4 Thanks !
Re: Cassandra Heap Size for data more than 1 TB
The version of Cassandra I am using is 1.0.11, we are migrating to 1.2.X though. We had tuned bloom filters (0.1) and AFAIK making it lower than this won't matter. Thanks ! On Tue, Oct 1, 2013 at 11:54 PM, Mohit Anchlia mohitanch...@gmail.comwrote: Which Cassandra version are you on? Essentially heap size is function of number of keys/metadata. In Cassandra 1.2 lot of the metadata like bloom filters were moved off heap. On Tue, Oct 1, 2013 at 9:34 PM, srmore comom...@gmail.com wrote: Does anyone know what would roughly be the heap size for cassandra with 1TB of data ? We started with about 200 G and now on one of the nodes we are already on 1 TB. We were using 8G of heap and that served us well up until we reached 700 G where we started seeing failures and nodes flipping. With 1 TB of data the node refuses to come back due to lack of memory. needless to say repairs and compactions takes a lot of time. We upped the heap from 8 G to 12 G and suddenly everything started moving rapidly i.e. the repair tasks and the compaction tasks. But soon (in about 9-10 hrs) we started seeing the same symptoms as we were seeing with 8 G. So my question is how do I determine what is the optimal size of heap for data around 1 TB ? Following are some of my JVM settings -Xms8G -Xmx8G -Xmn800m -XX:NewSize=1200M XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=4 Thanks !
Re: Cassandra Heap Size for data more than 1 TB
I changed my index_interval from 128 to 512, does it make sense to increase it more than this ? On Wed, Oct 2, 2013 at 9:30 AM, cem cayiro...@gmail.com wrote: Have a look to index_interval. Cem. On Wed, Oct 2, 2013 at 2:25 PM, srmore comom...@gmail.com wrote: The version of Cassandra I am using is 1.0.11, we are migrating to 1.2.X though. We had tuned bloom filters (0.1) and AFAIK making it lower than this won't matter. Thanks ! On Tue, Oct 1, 2013 at 11:54 PM, Mohit Anchlia mohitanch...@gmail.comwrote: Which Cassandra version are you on? Essentially heap size is function of number of keys/metadata. In Cassandra 1.2 lot of the metadata like bloom filters were moved off heap. On Tue, Oct 1, 2013 at 9:34 PM, srmore comom...@gmail.com wrote: Does anyone know what would roughly be the heap size for cassandra with 1TB of data ? We started with about 200 G and now on one of the nodes we are already on 1 TB. We were using 8G of heap and that served us well up until we reached 700 G where we started seeing failures and nodes flipping. With 1 TB of data the node refuses to come back due to lack of memory. needless to say repairs and compactions takes a lot of time. We upped the heap from 8 G to 12 G and suddenly everything started moving rapidly i.e. the repair tasks and the compaction tasks. But soon (in about 9-10 hrs) we started seeing the same symptoms as we were seeing with 8 G. So my question is how do I determine what is the optimal size of heap for data around 1 TB ? Following are some of my JVM settings -Xms8G -Xmx8G -Xmn800m -XX:NewSize=1200M XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=4 Thanks !
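For context, in the 1.0/1.1 line index_interval is a global cassandra.yaml setting (it later became a per-table property). A minimal illustrative fragment:

    # cassandra.yaml -- illustrative
    # Default is 128. Larger values shrink the index samples kept on the heap at the
    # cost of slightly more disk I/O per key lookup; 256-512 is a common compromise
    # when heap is tight. A restart is needed for the change to take effect.
    index_interval: 512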
Re: Cassandra Heap Size for data more than 1 TB
Sure, I was testing using high traffic with about 6K - 7K req/sec reads and writes combined I added a node and ran repair, at this time the traffic was stopped and heap was 8G. I saw a lot of flushing and GC activity and finally it died saying out of memory. So I gave it more memory 12 G and started the nodes. This sped up the compactions and validations for around 12 hours and now I am back to the flushing and high GC activity at this point there was no traffic for more than 24 hours. Again, thanks for the help ! On Wed, Oct 2, 2013 at 10:19 AM, cem cayiro...@gmail.com wrote: I think 512 is fine. Could you tell more about your traffic characteristics? Cem On Wed, Oct 2, 2013 at 4:32 PM, srmore comom...@gmail.com wrote: I changed my index_interval from 128 to index_interval: 128 to 512, does it make sense to increase more than this ? On Wed, Oct 2, 2013 at 9:30 AM, cem cayiro...@gmail.com wrote: Have a look to index_interval. Cem. On Wed, Oct 2, 2013 at 2:25 PM, srmore comom...@gmail.com wrote: The version of Cassandra I am using is 1.0.11, we are migrating to 1.2.X though. We had tuned bloom filters (0.1) and AFAIK making it lower than this won't matter. Thanks ! On Tue, Oct 1, 2013 at 11:54 PM, Mohit Anchlia mohitanch...@gmail.comwrote: Which Cassandra version are you on? Essentially heap size is function of number of keys/metadata. In Cassandra 1.2 lot of the metadata like bloom filters were moved off heap. On Tue, Oct 1, 2013 at 9:34 PM, srmore comom...@gmail.com wrote: Does anyone know what would roughly be the heap size for cassandra with 1TB of data ? We started with about 200 G and now on one of the nodes we are already on 1 TB. We were using 8G of heap and that served us well up until we reached 700 G where we started seeing failures and nodes flipping. With 1 TB of data the node refuses to come back due to lack of memory. needless to say repairs and compactions takes a lot of time. We upped the heap from 8 G to 12 G and suddenly everything started moving rapidly i.e. the repair tasks and the compaction tasks. But soon (in about 9-10 hrs) we started seeing the same symptoms as we were seeing with 8 G. So my question is how do I determine what is the optimal size of heap for data around 1 TB ? Following are some of my JVM settings -Xms8G -Xmx8G -Xmn800m -XX:NewSize=1200M XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=4 Thanks !
Cassandra Heap Size for data more than 1 TB
Does anyone know what would roughly be the heap size for cassandra with 1TB of data ? We started with about 200 G and now on one of the nodes we are already on 1 TB. We were using 8G of heap and that served us well up until we reached 700 G where we started seeing failures and nodes flipping. With 1 TB of data the node refuses to come back due to lack of memory. Needless to say, repairs and compactions take a lot of time. We upped the heap from 8 G to 12 G and suddenly everything started moving rapidly, i.e. the repair tasks and the compaction tasks. But soon (in about 9-10 hrs) we started seeing the same symptoms as we were seeing with 8 G. So my question is how do I determine what the optimal size of heap is for data around 1 TB ? Following are some of my JVM settings: -Xms8G -Xmx8G -Xmn800m -XX:NewSize=1200M -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=4 Thanks !
Re: Error during startup - java.lang.OutOfMemoryError: unable to create new native thread
I hit this issue again today and looks like changing -Xss option does not work :( I am on 1.0.11 (I know its old, we are upgrading to 1.2.9 right now) and have about 800-900GB of data. I can see cassandra is spending a lot of time reading the data files before it quits with java.lang.OutOfMemoryError: unable to create new native thread error. My hard and soft limits seems to be ok as well Datastax recommends [1] * soft nofile 32768 * hard nofile 32768 and I have hardnofile 65536 softnofile 65536 My ulimit -u output is 515038 (which again should be sufficient) complete output ulimit -a core file size (blocks, -c)0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 515038 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time (seconds, -t) unlimited max user processes (-u) 515038 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Has anyone run into this ? [1] http://www.datastax.com/docs/1.1/troubleshooting/index On Wed, Sep 11, 2013 at 8:47 AM, srmore comom...@gmail.com wrote: Thanks Viktor, - check (cassandra-env.sh) -Xss size, you may need to increase it for your JVM; This seems to have done the trick ! Thanks ! On Tue, Sep 10, 2013 at 12:46 AM, Viktor Jevdokimov viktor.jevdoki...@adform.com wrote: For start: - check (cassandra-env.sh) -Xss size, you may need to increase it for your JVM; - check (cassandra-env.sh) -Xms and -Xmx size, you may need to increase it for your data load/bloom filter/index sizes. ** ** ** ** ** ** Best regards / Pagarbiai *Viktor Jevdokimov* Senior Developer [image: Adform News] http://www.adform.com *Visit us at Dmexco: *Hall 6 Stand B-52 September 18-19 Cologne, Germany Email: viktor.jevdoki...@adform.com Phone: +370 5 212 3063, Fax +370 5 261 0453 J. Jasinskio 16C, LT-03163 Vilnius, Lithuania Follow us on Twitter: @adforminsiderhttp://twitter.com/#!/adforminsider Take a ride with Adform's Rich Media Suitehttp://vimeo.com/adform/richmedia [image: Dmexco 2013] http://www.dmexco.de/ Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies. *From:* srmore [mailto:comom...@gmail.com] *Sent:* Tuesday, September 10, 2013 6:16 AM *To:* user@cassandra.apache.org *Subject:* Error during startup - java.lang.OutOfMemoryError: unable to create new native thread [heur] ** ** I have a 5 node cluster with a load of around 300GB each. A node went down and does not come up. I can see the following exception in the logs. 
ERROR [main] 2013-09-09 21:50:56,117 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[main,5,main] java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:640) at java.util.concurrent.ThreadPoolExecutor.addIfUnderCorePoolSize(ThreadPoolExecutor.java:703) at java.util.concurrent.ThreadPoolExecutor.prestartAllCoreThreads(ThreadPoolExecutor.java:1392) at org.apache.cassandra.concurrent.JMXEnabledThreadPoolExecutor.init(JMXEnabledThreadPoolExecutor.java:77) at org.apache.cassandra.concurrent.JMXEnabledThreadPoolExecutor.init(JMXEnabledThreadPoolExecutor.java:65) at org.apache.cassandra.concurrent.JMXConfigurableThreadPoolExecutor.init(JMXConfigurableThreadPoolExecutor.java:34) at org.apache.cassandra.concurrent.StageManager.multiThreadedConfigurableStage(StageManager.java:68) at org.apache.cassandra.concurrent.StageManager.clinit(StageManager.java:42) at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:344) at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:173) The *ulimit -u* output is *515042*, which is far more than what is recommended [1] (10240) and I am skeptical to set it to unlimited as recommended here [2]. Any pointers as to what could be the issue and how to get the node up. [1] http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html?pagename
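The ulimit -a output quoted above is worth a second look: max user processes is high, but open files is only 1024, which suggests the limits.conf entries may not be applied to the session that actually starts Cassandra. A hedged example of the usual limits.conf entries (the user name is a placeholder for whatever account runs Cassandra):

    # /etc/security/limits.conf -- illustrative
    cassandra  soft  nofile  65536
    cassandra  hard  nofile  65536
    cassandra  soft  nproc   32768
    cassandra  hard  nproc   32768
    # On RHEL6-style systems also check /etc/security/limits.d/90-nproc.conf, which
    # can silently cap nproc, and restart Cassandra from a fresh login session so
    # the new limits actually apply to the daemon.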
Re: Error during startup - java.lang.OutOfMemoryError: unable to create new native thread
Was too fast on the send button, sorry. The thing I wanted to add was the pending signals (-i) 515038 that looks odd to me, could that be related. On Thu, Sep 19, 2013 at 4:53 PM, srmore comom...@gmail.com wrote: I hit this issue again today and looks like changing -Xss option does not work :( I am on 1.0.11 (I know its old, we are upgrading to 1.2.9 right now) and have about 800-900GB of data. I can see cassandra is spending a lot of time reading the data files before it quits with java.lang.OutOfMemoryError: unable to create new native thread error. My hard and soft limits seems to be ok as well Datastax recommends [1] * soft nofile 32768 * hard nofile 32768 and I have hardnofile 65536 softnofile 65536 My ulimit -u output is 515038 (which again should be sufficient) complete output ulimit -a core file size (blocks, -c)0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 515038 max locked memory (kbytes, -l) 32 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size(512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 10240 cpu time (seconds, -t) unlimited max user processes (-u) 515038 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited Has anyone run into this ? [1] http://www.datastax.com/docs/1.1/troubleshooting/index On Wed, Sep 11, 2013 at 8:47 AM, srmore comom...@gmail.com wrote: Thanks Viktor, - check (cassandra-env.sh) -Xss size, you may need to increase it for your JVM; This seems to have done the trick ! Thanks ! On Tue, Sep 10, 2013 at 12:46 AM, Viktor Jevdokimov viktor.jevdoki...@adform.com wrote: For start: - check (cassandra-env.sh) -Xss size, you may need to increase it for your JVM; - check (cassandra-env.sh) -Xms and -Xmx size, you may need to increase it for your data load/bloom filter/index sizes. ** ** ** ** ** ** Best regards / Pagarbiai *Viktor Jevdokimov* Senior Developer [image: Adform News] http://www.adform.com *Visit us at Dmexco: *Hall 6 Stand B-52 September 18-19 Cologne, Germany Email: viktor.jevdoki...@adform.com Phone: +370 5 212 3063, Fax +370 5 261 0453 J. Jasinskio 16C, LT-03163 Vilnius, Lithuania Follow us on Twitter: @adforminsiderhttp://twitter.com/#!/adforminsider Take a ride with Adform's Rich Media Suitehttp://vimeo.com/adform/richmedia [image: Dmexco 2013] http://www.dmexco.de/ Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies. *From:* srmore [mailto:comom...@gmail.com] *Sent:* Tuesday, September 10, 2013 6:16 AM *To:* user@cassandra.apache.org *Subject:* Error during startup - java.lang.OutOfMemoryError: unable to create new native thread [heur] ** ** I have a 5 node cluster with a load of around 300GB each. A node went down and does not come up. I can see the following exception in the logs. 
ERROR [main] 2013-09-09 21:50:56,117 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[main,5,main] java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:640) at java.util.concurrent.ThreadPoolExecutor.addIfUnderCorePoolSize(ThreadPoolExecutor.java:703) at java.util.concurrent.ThreadPoolExecutor.prestartAllCoreThreads(ThreadPoolExecutor.java:1392) at org.apache.cassandra.concurrent.JMXEnabledThreadPoolExecutor.init(JMXEnabledThreadPoolExecutor.java:77) at org.apache.cassandra.concurrent.JMXEnabledThreadPoolExecutor.init(JMXEnabledThreadPoolExecutor.java:65) at org.apache.cassandra.concurrent.JMXConfigurableThreadPoolExecutor.init(JMXConfigurableThreadPoolExecutor.java:34) at org.apache.cassandra.concurrent.StageManager.multiThreadedConfigurableStage(StageManager.java:68) at org.apache.cassandra.concurrent.StageManager.clinit(StageManager.java:42) at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:344) at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:173) The *ulimit -u* output is *515042*, which is far more than what
Re: Error during startup - java.lang.OutOfMemoryError: unable to create new native thread
Thanks Viktor, - check (cassandra-env.sh) -Xss size, you may need to increase it for your JVM; This seems to have done the trick ! Thanks ! On Tue, Sep 10, 2013 at 12:46 AM, Viktor Jevdokimov viktor.jevdoki...@adform.com wrote: For start: - check (cassandra-env.sh) -Xss size, you may need to increase it for your JVM; - check (cassandra-env.sh) -Xms and -Xmx size, you may need to increase it for your data load/bloom filter/index sizes. Best regards / Pagarbiai Viktor Jevdokimov Senior Developer, Adform http://www.adform.com From: srmore [mailto:comom...@gmail.com] Sent: Tuesday, September 10, 2013 6:16 AM To: user@cassandra.apache.org Subject: Error during startup - java.lang.OutOfMemoryError: unable to create new native thread I have a 5 node cluster with a load of around 300GB each. A node went down and does not come up. I can see the following exception in the logs. ERROR [main] 2013-09-09 21:50:56,117 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[main,5,main] java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:640) at java.util.concurrent.ThreadPoolExecutor.addIfUnderCorePoolSize(ThreadPoolExecutor.java:703) at java.util.concurrent.ThreadPoolExecutor.prestartAllCoreThreads(ThreadPoolExecutor.java:1392) at org.apache.cassandra.concurrent.JMXEnabledThreadPoolExecutor.init(JMXEnabledThreadPoolExecutor.java:77) at org.apache.cassandra.concurrent.JMXEnabledThreadPoolExecutor.init(JMXEnabledThreadPoolExecutor.java:65) at org.apache.cassandra.concurrent.JMXConfigurableThreadPoolExecutor.init(JMXConfigurableThreadPoolExecutor.java:34) at org.apache.cassandra.concurrent.StageManager.multiThreadedConfigurableStage(StageManager.java:68) at org.apache.cassandra.concurrent.StageManager.clinit(StageManager.java:42) at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:344) at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:173) The *ulimit -u* output is *515042*, which is far more than what is recommended [1] (10240) and I am skeptical to set it to unlimited as recommended here [2]. Any pointers as to what could be the issue and how to get the node up. [1] http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html?pagename=docsversion=1.2file=install/recommended_settings#cassandra/install/installRecommendSettings.html [2] http://mail-archives.apache.org/mod_mbox/cassandra-user/201303.mbox/%3CCAPqEvGE474Omea1BFLJ6U_pbAkOwWxk=dwo35_pc-atwb4_...@mail.gmail.com%3E Thanks !
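The -Xss change Viktor suggests is made in cassandra-env.sh; a sketch of what it could look like (the value is only an example, not a recommendation from the thread):

    # cassandra-env.sh -- illustrative
    # Per-thread stack size. The 1.0-era default (e.g. 128k-180k) can be too small
    # for some JVM builds and can surface as thread-creation or stack errors at
    # startup; raising it slightly increases per-thread memory use.
    JVM_OPTS="$JVM_OPTS -Xss256k"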
Error during startup - java.lang.OutOfMemoryError: unable to create new native thread
I have a 5 node cluster with a load of around 300GB each. A node went down and does not come up. I can see the following exception in the logs. ERROR [main] 2013-09-09 21:50:56,117 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[main,5,main] java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:640) at java.util.concurrent.ThreadPoolExecutor.addIfUnderCorePoolSize(ThreadPoolExecutor.java:703) at java.util.concurrent.ThreadPoolExecutor.prestartAllCoreThreads(ThreadPoolExecutor.java:1392) at org.apache.cassandra.concurrent.JMXEnabledThreadPoolExecutor.init(JMXEnabledThreadPoolExecutor.java:77) at org.apache.cassandra.concurrent.JMXEnabledThreadPoolExecutor.init(JMXEnabledThreadPoolExecutor.java:65) at org.apache.cassandra.concurrent.JMXConfigurableThreadPoolExecutor.init(JMXConfigurableThreadPoolExecutor.java:34) at org.apache.cassandra.concurrent.StageManager.multiThreadedConfigurableStage(StageManager.java:68) at org.apache.cassandra.concurrent.StageManager.clinit(StageManager.java:42) at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:344) at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:173) The *ulimit -u* output is *515042* Which is far more than what is recommended [1] (10240) and I am skeptical to set it to unlimited as recommended here [2] Any pointers as to what could be the issue and how to get the node up. [1] http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html?pagename=docsversion=1.2file=install/recommended_settings#cassandra/install/installRecommendSettings.html [2] http://mail-archives.apache.org/mod_mbox/cassandra-user/201303.mbox/%3CCAPqEvGE474Omea1BFLJ6U_pbAkOwWxk=dwo35_pc-atwb4_...@mail.gmail.com%3E Thanks !
Re: Best way to track backups/delays for cross DC replication
I would be interested to know that too, it would be great if anyone can share how they do (or do not) track or monitor cross datacenter migrations. Thanks ! On Wed, Sep 4, 2013 at 10:13 AM, Anand Somani meatfor...@gmail.com wrote: Hi, Scenario is a cluster spanning across datacenters and we use Local_quorum and want to know when things are not getting replicated across data centers. What is the best way to track/alert on that? I was planning on using the HintedHandOffManager (JMX) = org.apache.cassandra.db:type=HintedHandoffManager countPendingHints. Are there other metrics (maybe exposed via nodetool) I should be looking at. At this point we are on 1.1.6 cassandra. Thanks Anand
Distributed lock for cassandra
All, There are some operations that demand the use of a lock and I was wondering whether Cassandra has a built-in locking mechanism. After hunting the web for a while it appears that the answer is no, although I found this outdated wiki page which describes an algorithm: http://wiki.apache.org/cassandra/Locking - was this implemented ? It would be great if people on the list could share their experiences / best practices around locking. Does anyone use cages https://code.google.com/p/cages/ ? If yes, it would be nice if you could share your experiences. Thanks, Sandeep
Re: Distributed lock for cassandra
On Mon, Aug 12, 2013 at 2:49 PM, Robert Coli rc...@eventbrite.com wrote: On Mon, Aug 12, 2013 at 12:31 PM, srmore comom...@gmail.com wrote: There are some operations that demand the use lock and I was wondering whether Cassandra has a built in locking mechanism. After hunting the web for a while it appears that the answer is no, although I found this outdated wiki page which describes the algorithm http://wiki.apache.org/cassandra/Locking was this implemented ? It would be great if people on the list can share their experiences / best practices about locking. If your application needs a lot of locking, it is probably not ideal for a distributed, log structured database with immutable data files. This was the answer I was afraid of ... , not a lot of locking but now and then I do need it, that said creating the username problem described in the bug pretty much describes my problem. That said, Cassandra 2.0 will support CAS via Paxos. Presumably at a much, much lower throughput than the base system. https://issues.apache.org/jira/browse/CASSANDRA-5062 Thanks a lot for the pointers I will look at some of the solutions described there. =Rob
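For anyone landing on this thread later, the CAS support Rob refers to surfaced in Cassandra 2.0 as CQL lightweight transactions. A sketch against a hypothetical users table (the unique-username case discussed in the JIRA ticket):

    -- CQL 3, Cassandra 2.0+; table and column names are placeholders
    INSERT INTO users (username, email) VALUES ('sandeep', 'sandeep@example.com')
    IF NOT EXISTS;

    UPDATE users SET email = 'new@example.com'
    WHERE username = 'sandeep'
    IF email = 'sandeep@example.com';
    -- Each such statement runs a Paxos round, so expect noticeably lower throughput
    -- than plain writes, as Rob notes above.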
Re: Alternate major compaction
Thanks Takenori, Looks like the tool provides some good info that people can use. It would be great if you can share it with the community. On Thu, Jul 11, 2013 at 6:51 AM, Takenori Sato ts...@cloudian.com wrote: Hi, I think it is a common headache for users running a large Cassandra cluster in production. Running a major compaction is not the only cause, but more. For example, I see two typical scenario. 1. backup use case 2. active wide row In the case of 1, say, one data is removed a year later. This means, tombstone on the row is 1 year away from the original row. To remove an expired row entirely, a compaction set has to include all the rows. So, when do the original, 1 year old row, and the tombstoned row are included in a compaction set? It is likely to take one year. In the case of 2, such an active wide row exists in most of sstable files. And it typically contains many expired columns. But none of them wouldn't be removed entirely because a compaction set practically do not include all the row fragments. Btw, there is a very convenient MBean API is available. It is CompactionManager's forceUserDefinedCompaction. You can invoke a minor compaction on a file set you define. So the question is how to find an optimal set of sstable files. Then, I wrote a tool to check garbage, and print outs some useful information to find such an optimal set. Here's a simple log output. # /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504071)] === ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES === hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db --- TOTAL, 40, 40 === REMAINNING_SSTABLE_FILES means any other sstable files that contain the respective row. So, the following is an optimal set. # /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504131)] === ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES === hello5/100.txt.1373502926003, 223, 0, YES, YES --- TOTAL, 223, 0 === This tool relies on SSTableReader and an aggregation iterator as Cassandra does in compaction. I was considering to share this with the community. So let me know if anyone is interested. Ah, note that it is based on 1.0.7. So I will need to check and update for newer versions. Thanks, Takenori On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núnez tomas.nu...@groupalia.comwrote: Hi About a year ago, we did a major compaction in our cassandra cluster (a n00b mistake, I know), and since then we've had huge sstables that never get compacted, and we were condemned to repeat the major compaction process every once in a while (we are using SizeTieredCompaction strategy, and we've not avaluated yet LeveledCompaction, because it has its downsides, and we've had no time to test all of them in our environment). I was trying to find a way to solve this situation (that is, do something like a major compaction that writes small sstables, not huge as major compaction does), and I couldn't find it in the documentation. I tried cleanup and scrub/upgradesstables, but they don't do that (as documentation states). 
Then I tried deleting all data in a node and then bootstrapping it (or nodetool rebuild-ing it), hoping that this way the sstables would get cleaned from deleted records and updates. But the deleted node just copied the sstables from another node as they were, cleaning nothing. So I tried a new approach: I switched the sstable compaction strategy (SizeTiered to Leveled), forcing the sstables to be rewritten from scratch, and then switching it back (Leveled to SizeTiered). It took a while (but so do the major compaction process) and it worked, I have smaller sstables, and I've regained a lot of disk space. I'm happy with the results, but it doesn't seem a orthodox way of cleaning the sstables. What do you think, is it something wrong or crazy? Is there a different way to achieve the same thing? Let's put an example: Suppose you have a
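For reference, the strategy flip Tomàs describes can be expressed in CQL 3 on 1.2+; keyspace/table names and the sstable size are placeholders, and older clusters may need the equivalent cassandra-cli "update column family" command instead:

    -- switch to Leveled so every SSTable gets rewritten into small, fixed-size files
    ALTER TABLE my_keyspace.my_table
      WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};

    -- once the rewrite has finished, switch back if SizeTiered is still preferred
    ALTER TABLE my_keyspace.my_table
      WITH compaction = {'class': 'SizeTieredCompactionStrategy'};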
Re: Migrating data from 2 node cluster to a 3 node cluster
On Fri, Jul 5, 2013 at 6:08 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Jul 4, 2013 at 10:03 AM, srmore comom...@gmail.com wrote: We are planning to move data from a 2 node cluster to a 3 node cluster. We are planning to copy the data from the two nodes (snapshot) to the new 2 nodes and hoping that Cassandra will sync it to the third node. Will this work ? are there any other commands to run after we are done migrating, like nodetool repair. What RF are old and new cluster? RF of old and new cluster is the same RF=3. Keyspaces and schema info is also same. What are the tokens of old and new nodes? tokens for old cluster ( 2-node ) node 0 - 0 node 1 - 85070591730234615865843651857942052864 Tokens for new cluster (3-node) node 0 - 0 node 1 - 56713727820156407428984779325531226112 node 2 - 113427455640312814857969558651062452224 http://www.palominodb.com/blog/2012/09/25/bulk-loading-options-cassandra Thanks this helps a lot ! =Rob
Migrating data from 2 node cluster to a 3 node cluster
We are planning to move data from a 2 node cluster to a 3 node cluster. We are planning to copy the data from the two nodes (snapshot) to the new 2 nodes and hoping that Cassandra will sync it to the third node. Will this work ? are there any other commands to run after we are done migrating, like nodetool repair. Thanks all.
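A sketch of the post-copy steps implied by this plan, assuming the snapshots have been copied into the matching data directories on the new nodes and the schema has been recreated there (host names are placeholders; whether this is sufficient depends on the token layout discussed in the reply above):

    # run against each new node after it has joined the new ring
    nodetool -h new-node-1 repair     # rebuilds missing replicas, including on the third node
    nodetool -h new-node-1 cleanup    # afterwards, drops data outside the node's new ranges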
Re: Heap is not released and streaming hangs at 0%
On Wed, Jun 26, 2013 at 12:16 AM, aaron morton aa...@thelastpickle.comwrote: bloom_filter_fp_chance value that was changed from default to 0.1, looked at the filters and they are about 2.5G on disk and I have around 8G of heap. I will try increasing the value to 0.7 and report my results. You need to re-write the sstables on disk using nodetool upgradesstables. Otherwise only the new tables with have the 0.1 setting. I will try increasing the value to 0.7 and report my results. No need to, it will probably be something like Oh no, really, what, how, please make it stop :) 0.7 will mean reads will hit most / all of the SSTables for the CF. Changing the bloom_filter_fp_chance to 0.7 did seem to correct the problem in short run. I do not see the out of heap errors but I am taking a bit of a performance hit. Planning to run some more tests, also my BloomFilterFalseRatio is 0.8367977262013025 this was the reason behind bumping bloom_filter_fp_chance. I covered a high row situation in on of my talks at the summit this month, the slide deck is here http://www.slideshare.net/aaronmorton/cassandra-sf-2013-in-case-of-emergency-break-glass and the videos will soon be up at Planet Cassandra. This was/is extremely helpful Aaron, cannot thank you enough for sharing this with the community, eagerly looking forward for the video. Rebuild the sstables, then reduce the index_interval if you still need to reduce mem pressure. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 22/06/2013, at 1:17 PM, sankalp kohli kohlisank...@gmail.com wrote: I will take a heap dump and see whats in there rather than guessing. On Fri, Jun 21, 2013 at 4:12 PM, Bryan Talbot btal...@aeriagames.comwrote: bloom_filter_fp_chance = 0.7 is probably way too large to be effective and you'll probably have issues compacting deleted rows and get poor read performance with a value that high. I'd guess that anything larger than 0.1 might as well be 1.0. -Bryan On Fri, Jun 21, 2013 at 5:58 AM, srmore comom...@gmail.com wrote: On Fri, Jun 21, 2013 at 2:53 AM, aaron morton aa...@thelastpickle.comwrote: nodetool -h localhost flush didn't do much good. Do you have 100's of millions of rows ? If so see recent discussions about reducing the bloom_filter_fp_chance and index_sampling. Yes, I have 100's of millions of rows. If this is an old schema you may be using the very old setting of 0.000744 which creates a lot of bloom filters. bloom_filter_fp_chance value that was changed from default to 0.1, looked at the filters and they are about 2.5G on disk and I have around 8G of heap. I will try increasing the value to 0.7 and report my results. It also appears to be a case of hard GC failure (as Rob mentioned) as the heap is never released, even after 24+ hours of idle time, the JVM needs to be restarted to reclaim the heap. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 20/06/2013, at 6:36 AM, Wei Zhu wz1...@yahoo.com wrote: If you want, you can try to force the GC through Jconsole. Memory-Perform GC. It theoretically triggers a full GC and when it will happen depends on the JVM -Wei -- *From: *Robert Coli rc...@eventbrite.com *To: *user@cassandra.apache.org *Sent: *Tuesday, June 18, 2013 10:43:13 AM *Subject: *Re: Heap is not released and streaming hangs at 0% On Tue, Jun 18, 2013 at 10:33 AM, srmore comom...@gmail.com wrote: But then shouldn't JVM C G it eventually ? 
I can still see Cassandra alive and kicking but looks like the heap is locked up even after the traffic is long stopped. No, when GC system fails this hard it is often a permanent failure which requires a restart of the JVM. nodetool -h localhost flush didn't do much good. This adds support to the idea that your heap is too full, and not full of memtables. You could try nodetool -h localhost invalidatekeycache, but that probably will not free enough memory to help you. =Rob
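To make the bloom filter advice above concrete: changing bloom_filter_fp_chance only affects newly written SSTables, so the existing ones have to be rewritten, as Aaron notes. A sketch using the cassandra-cli syntax of that era (keyspace and column family names are placeholders):

    [cassandra-cli]
    update column family MyCF with bloom_filter_fp_chance = 0.1;

    [shell] rewrite the existing sstables so the new setting takes effect on disk
    nodetool -h localhost upgradesstables MyKeyspace MyCF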
Re: Heap is not released and streaming hangs at 0%
On Fri, Jun 21, 2013 at 2:53 AM, aaron morton aa...@thelastpickle.comwrote: nodetool -h localhost flush didn't do much good. Do you have 100's of millions of rows ? If so see recent discussions about reducing the bloom_filter_fp_chance and index_sampling. Yes, I have 100's of millions of rows. If this is an old schema you may be using the very old setting of 0.000744 which creates a lot of bloom filters. bloom_filter_fp_chance value that was changed from default to 0.1, looked at the filters and they are about 2.5G on disk and I have around 8G of heap. I will try increasing the value to 0.7 and report my results. It also appears to be a case of hard GC failure (as Rob mentioned) as the heap is never released, even after 24+ hours of idle time, the JVM needs to be restarted to reclaim the heap. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 20/06/2013, at 6:36 AM, Wei Zhu wz1...@yahoo.com wrote: If you want, you can try to force the GC through Jconsole. Memory-Perform GC. It theoretically triggers a full GC and when it will happen depends on the JVM -Wei -- *From: *Robert Coli rc...@eventbrite.com *To: *user@cassandra.apache.org *Sent: *Tuesday, June 18, 2013 10:43:13 AM *Subject: *Re: Heap is not released and streaming hangs at 0% On Tue, Jun 18, 2013 at 10:33 AM, srmore comom...@gmail.com wrote: But then shouldn't JVM C G it eventually ? I can still see Cassandra alive and kicking but looks like the heap is locked up even after the traffic is long stopped. No, when GC system fails this hard it is often a permanent failure which requires a restart of the JVM. nodetool -h localhost flush didn't do much good. This adds support to the idea that your heap is too full, and not full of memtables. You could try nodetool -h localhost invalidatekeycache, but that probably will not free enough memory to help you. =Rob
Heap is not released and streaming hangs at 0%
I see an issue when I run high traffic to the Cassandra nodes: the heap gets full to about 94% (which is expected), but the thing that confuses me is that the heap usage never goes down after the traffic is stopped (at least, it appears to be so). I kept the nodes up for a day after stopping the traffic and the logs still tell me Heap is 0.9430032942657169 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically Things go back to normal when I restart Cassandra. nodetool netstats tells me the following: Mode: Normal Not sending streams and a bunch of keyspaces streaming from other nodes which are at 0%, and this stays this way until I restart Cassandra. Also I see this at the bottom: Pool Name / Active / Pending / Completed: Commands - n/a, 0, 8267930; Responses - n/a, 0, 15184810 Any ideas as to how I can speed this up and reclaim the heap ? Thanks !
Re: Heap is not released and streaming hangs at 0%
Thanks Rob, But then shouldn't JVM C G it eventually ? I can still see Cassandra alive and kicking but looks like the heap is locked up even after the traffic is long stopped. nodetool -h localhost flush didn't do much good. the version I am running is 1.0.12 (I know its due for a upgrade but gotto work with this for now). On Tue, Jun 18, 2013 at 12:13 PM, Robert Coli rc...@eventbrite.com wrote: On Tue, Jun 18, 2013 at 8:25 AM, srmore comom...@gmail.com wrote: I see an issues when I run high traffic to the Cassandra nodes, the heap gets full to about 94% (which is expected) Which is expected to cause GC failure? ;) But seriously, the reason your node is unable to GC is that you have filled your heap too fast for it to keep up. The JVM has seized up like Joe Namath with vapor lock. Any ideas as to how I can speed up this up and reclaim the heap ? Don't exhaust the ability of GC to C G. :) =Rob PS - What version of cassandra? If you nodetool -h localhost flush does it help?
Re: Multiple data center performance
I am seeing the similar behavior, in my case I have 2 nodes in each datacenter and one node always has high latency (equal to the latency between the two datacenters). When one of the datacenters is shutdown the latency drops. I am curious to know whether anyone else has these issues and if yes how did to get around it. Thanks ! On Fri, Jun 7, 2013 at 11:49 PM, Daning Wang dan...@netseer.com wrote: We have deployed multi-center but got performance issue. When the nodes on other center are up, the read response time from clients is 4 or 5 times higher. when we take those nodes down, the response time becomes normal(compare to the time before we changed to multi-center). We have high volume on the cluster, the consistency level is one for read. so my understanding is most of traffic between data center should be read repair. but seems that could not create much delay. What could cause the problem? how to debug this? Here is the keyspace, [default@dsat] describe dsat; Keyspace: dsat: Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy Durable Writes: true Options: [dc2:1, dc1:3] Column Families: ColumnFamily: categorization_cache Ring Datacenter: dc1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN xx.xx.xx..111 59.2 GB256 37.5% 4d6ed8d6-870d-4963-8844-08268607757e rac1 DN xx.xx.xx..121 99.63 GB 256 37.5% 9d0d56ce-baf6-4440-a233-ad6f1d564602 rac1 UN xx.xx.xx..120 66.32 GB 256 37.5% 0fd912fb-3187-462b-8c8a-7d223751b649 rac1 UN xx.xx.xx..118 63.61 GB 256 37.5% 3c6e6862-ab14-4a8c-9593-49631645349d rac1 UN xx.xx.xx..117 68.16 GB 256 37.5% ee6cdf23-d5e4-4998-a2db-f6c0ce41035a rac1 UN xx.xx.xx..116 32.41 GB 256 37.5% f783eeef-1c51-4f91-ab7c-a60669816770 rac1 UN xx.xx.xx..115 64.24 GB 256 37.5% e75105fb-b330-4f40-aa4f-8e6e11838e37 rac1 UN xx.xx.xx..112 61.32 GB 256 37.5% 2547ee54-88dd-4994-a1ad-d9ba367ed11f rac1 Datacenter: dc2 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack DN xx.xx.xx.19958.39 GB 256 50.0% 6954754a-e9df-4b3c-aca7-146b938515d8 rac1 DN xx.xx.xx..61 33.79 GB 256 50.0% 91b8d510-966a-4f2d-a666-d7edbe986a1c rac1 Thank you in advance, Daning
Cassandra optimizations for multi-core machines
Hello All, We are thinking of going with Cassandra on a 8 core machine, are there any optimizations that can help us here ? I have seen that during startup stage Cassandra uses only one core, is there a way we can speed up the startup process ? Thanks !
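On the multi-core question, the knobs that scale with cores and disks live in cassandra.yaml; the stock file's own guidance is roughly 16 x number of data drives for reads and 8 x number of cores for writes. An illustrative fragment for an 8-core box (values are examples, not measurements):

    # cassandra.yaml -- illustrative sizing for 8 cores
    concurrent_reads: 32        # ~16 x number of data drives
    concurrent_writes: 64       # ~8 x number of cores
    concurrent_compactors: 4    # optional; caps how many compactions run in parallel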
Re: Cassandra performance decreases drastically with increase in data size.
Thanks all for the help. I ran the traffic over the weekend surprisingly, my heap was doing OK (around 5.7G of 8G) but GC activity went nuts and dropped the throughput. I will probably increase the number of nodes. The other interesting thing I noticed was that there were some objects with finalize() methods, this could potentially cause GC issues. On Fri, May 31, 2013 at 1:47 AM, Aiman Parvaiz ai...@grapheffect.comwrote: I believe you should roll out more nodes as a temporary fix to your problem, 400GB on all nodes means (as correctly mentioned in other mails of this thread) you are spending more time on GC. Check out the second comment in this link by Aaron Morton, he says the more than 300GB can be problematic, though this post is about older version of cassandra but I believe concept still stands true: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Is-it-safe-to-stop-a-read-repair-and-any-suggestion-on-speeding-up-repairs-td6607367.html Thanks On May 29, 2013, at 9:32 PM, srmore comom...@gmail.com wrote: Hello, I am observing that my performance is drastically decreasing when my data size grows. I have a 3 node cluster with 64 GB of ram and my data size is around 400GB on all the nodes. I also see that when I re-start Cassandra the performance goes back to normal and then again starts decreasing after some time. Some hunting landed me to this page http://wiki.apache.org/cassandra/LargeDataSetConsiderations which talks about the large data sets and explains that it might be because I am going through multiple layers of OS cache, but does not tell me how to tune it. So, my question is, are there any optimizations that I can do to handle these large datatasets ? and why does my performance go back to normal when I restart Cassandra ? Thanks !
Consistency level for multi-datacenter setup
I am a bit confused about using consistency levels with a multi-datacenter setup. Following is my setup: I have 4 nodes, set up as follows: Node 1 DC 1 - N1DC1 Node 2 DC 1 - N2DC1 Node 1 DC 2 - N1DC2 Node 2 DC 2 - N2DC2 I set up a delay between the two datacenters (DC1 and DC2, around 1 sec one way). I am observing that when I use consistency level 2, for some reason the coordinator node is picking nodes from the other datacenter. My understanding was that Cassandra picks nodes which are close by (from the local datacenter), determined by gossip, but it looks like that's not the case. I found the following comment on the DataStax website: If using a consistency level of ONE or LOCAL_QUORUM, only the nodes in the same data center as the coordinator node must respond to the client request in order for the request to succeed. Does this mean that for multiple datacenters we can only use ONE or LOCAL_QUORUM if we want to use the local datacenter to avoid cross-datacenter latency? I am using the GossipingPropertyFileSnitch. Thanks !
Re: Consistency level for multi-datacenter setup
With CL=TWO it appears that one node randomly picks the node from other datacenter to get the data. i.e. one node in the datacenter consistently underperforms. On Mon, Jun 3, 2013 at 3:21 PM, Hiller, Dean dean.hil...@nrel.gov wrote: What happens when you use CL=TWO. Dean From: srmore comom...@gmail.commailto:comom...@gmail.com Reply-To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Date: Monday, June 3, 2013 2:09 PM To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Subject: Consistency level for multi-datacenter setup I am a bit confused when using the consistency level for multi datacenter setup. Following is my setup: I have 4 nodes the way these are set up are Node 1 DC 1 - N1DC1 Node 2 DC 1 - N2DC1 Node 1 DC 2 - N1DC2 Node 2 DC 2 - N2DC2 I setup a delay in between two datacenters (DC1 and DC2 around 1 sec one way) I am observing that when I use consistency level 2 for some reason the coordinate node is picking up the nodes from other datacenter. My understanding was that Cassandra picks up nodes which are close by (from local datacenter), determined by Gossip but looks like that's not the case. I found the following comment on Datastax website : If using a consistency level of ONE or LOCAL_QUORUM, only the nodes in the same data center as the coordinator node must respond to the client request in order for the request to succeed. Does this mean that for multi datacenter we can only use ONE or LOCAL_QUORUM if we want to use the local datacenter to avoid cross datacenter latency. I am using the GossipingPropertyFileSnitch. Thanks !
Re: Consistency level for multi-datacenter setup
We observed that as well, please let us know what you find out it would be extremely helpful. There is also this property that you can play with to take care of slow nodes *dynamic_snitch_badness_threshold*. http://www.datastax.com/docs/1.1/configuration/node_configuration#dynamic-snitch-badness-threshold Thanks ! On Mon, Jun 3, 2013 at 3:24 PM, Hiller, Dean dean.hil...@nrel.gov wrote: Also, we had to put a fix into cassandra so it removed slow nodes from the list of nodes to read from. With that fix our QUOROM(not local quorom) started working again and would easily take the other DC nodes out of the list of reading from for you as well. I need to circle back to with my teammate to check if he got his fix posted to the dev list or not. Later, Dean From: srmore comom...@gmail.commailto:comom...@gmail.com Reply-To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Date: Monday, June 3, 2013 2:09 PM To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Subject: Consistency level for multi-datacenter setup I am a bit confused when using the consistency level for multi datacenter setup. Following is my setup: I have 4 nodes the way these are set up are Node 1 DC 1 - N1DC1 Node 2 DC 1 - N2DC1 Node 1 DC 2 - N1DC2 Node 2 DC 2 - N2DC2 I setup a delay in between two datacenters (DC1 and DC2 around 1 sec one way) I am observing that when I use consistency level 2 for some reason the coordinate node is picking up the nodes from other datacenter. My understanding was that Cassandra picks up nodes which are close by (from local datacenter), determined by Gossip but looks like that's not the case. I found the following comment on Datastax website : If using a consistency level of ONE or LOCAL_QUORUM, only the nodes in the same data center as the coordinator node must respond to the client request in order for the request to succeed. Does this mean that for multi datacenter we can only use ONE or LOCAL_QUORUM if we want to use the local datacenter to avoid cross datacenter latency. I am using the GossipingPropertyFileSnitch. Thanks !
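The snitch setting mentioned above sits in cassandra.yaml; an illustrative fragment (0.1 is the value Dean reports using later in this thread):

    # cassandra.yaml -- illustrative
    # How much worse a replica's dynamic score must be before the dynamic snitch
    # stops preferring the statically closest (e.g. local) replica; 0 routes purely
    # by measured latency, higher values pin reads to the preferred replicas more.
    dynamic_snitch_badness_threshold: 0.1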
Re: Consistency level for multi-datacenter setup
Yup, RF is 2 for both the datacenters. On Mon, Jun 3, 2013 at 3:36 PM, Sylvain Lebresne sylv...@datastax.com wrote: What's your replication factor? Do you have RF=2 on both datacenters? On Mon, Jun 3, 2013 at 10:09 PM, srmore comom...@gmail.com wrote: I am a bit confused about using the consistency level for a multi-datacenter setup. Following is my setup: I have 4 nodes, set up as follows: Node 1 DC 1 - N1DC1, Node 2 DC 1 - N2DC1, Node 1 DC 2 - N1DC2, Node 2 DC 2 - N2DC2. I set up a delay between the two datacenters (DC1 and DC2, around 1 sec one way). I am observing that when I use consistency level TWO, for some reason the coordinator node is picking up nodes from the other datacenter. My understanding was that Cassandra picks nodes which are close by (from the local datacenter), determined by gossip, but it looks like that's not the case. I found the following comment on the DataStax website: If using a consistency level of ONE or LOCAL_QUORUM, only the nodes in the same data center as the coordinator node must respond to the client request in order for the request to succeed. Does this mean that for a multi-datacenter setup we can only use ONE or LOCAL_QUORUM if we want to stay within the local datacenter and avoid cross-datacenter latency? I am using the GossipingPropertyFileSnitch. Thanks!
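For completeness, RF=2 in each datacenter would look something like the following in CQL3 (a sketch; the keyspace name is a placeholder, and 'DC1'/'DC2' must match whatever datacenter names your snitch actually reports):

    ALTER KEYSPACE my_ks WITH replication =
      {'class': 'NetworkTopologyStrategy', 'DC1': 2, 'DC2': 2};

followed by a repair so existing data is actually streamed to the new replicas.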
Re: Consistency level for multi-datacenter setup
After some more investigation it does not appear to be a CL issue. Every time I start up the node in the other datacenter with the 1 sec delay, my throughput starts degrading, even with CL=ONE and CL=LOCAL_QUORUM. I will put the logs on debug, investigate more, and report back the findings. On Mon, Jun 3, 2013 at 3:37 PM, Hiller, Dean dean.hil...@nrel.gov wrote: Our badness threshold is 0.1 currently (just checked). Our website used to get slow during a slow-node period until we rolled our own patch out. Dean From: srmore comom...@gmail.com Reply-To: user@cassandra.apache.org Date: Monday, June 3, 2013 2:31 PM To: user@cassandra.apache.org Subject: Re: Consistency level for multi-datacenter setup We observed that as well; please let us know what you find out, it would be extremely helpful. There is also this property that you can play with to take care of slow nodes: dynamic_snitch_badness_threshold. http://www.datastax.com/docs/1.1/configuration/node_configuration#dynamic-snitch-badness-threshold Thanks! On Mon, Jun 3, 2013 at 3:24 PM, Hiller, Dean dean.hil...@nrel.gov wrote: Also, we had to put a fix into cassandra so it removed slow nodes from the list of nodes to read from. With that fix our QUORUM (not LOCAL_QUORUM) started working again and would easily take the other DC nodes out of the read list for you as well. I need to circle back with my teammate to check whether he got his fix posted to the dev list or not. Later, Dean From: srmore comom...@gmail.com Reply-To: user@cassandra.apache.org Date: Monday, June 3, 2013 2:09 PM To: user@cassandra.apache.org Subject: Consistency level for multi-datacenter setup I am a bit confused about using the consistency level for a multi-datacenter setup. Following is my setup: I have 4 nodes, set up as follows: Node 1 DC 1 - N1DC1, Node 2 DC 1 - N2DC1, Node 1 DC 2 - N1DC2, Node 2 DC 2 - N2DC2. I set up a delay between the two datacenters (DC1 and DC2, around 1 sec one way). I am observing that when I use consistency level TWO, for some reason the coordinator node is picking up nodes from the other datacenter. My understanding was that Cassandra picks nodes which are close by (from the local datacenter), determined by gossip, but it looks like that's not the case. I found the following comment on the DataStax website: If using a consistency level of ONE or LOCAL_QUORUM, only the nodes in the same data center as the coordinator node must respond to the client request in order for the request to succeed. Does this mean that for a multi-datacenter setup we can only use ONE or LOCAL_QUORUM if we want to stay within the local datacenter and avoid cross-datacenter latency? I am using the GossipingPropertyFileSnitch. Thanks!
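If it helps anyone trying to reproduce this, one way to watch what the dynamic snitch is doing on a 1.2-era node is to raise its log level in conf/log4j-server.properties (a sketch only; I am assuming that class emits something useful at DEBUG on this build, so verify before relying on it):

    # conf/log4j-server.properties
    log4j.logger.org.apache.cassandra.locator.DynamicEndpointSnitch=DEBUG

and then tail system.log while the remote-DC node is up, to see how the coordinator is ordering replicas as latency changes.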
Re: Cassandra performance decreases drastically with increase in data size.
You are right, it looks like I am doing a lot of GC. Is there any short-term solution for this other than bumping up the heap? Because even if I increase the heap I will run into the same issue; only the time before I hit OOM will be lengthened. It will be a while before we go to the latest and greatest Cassandra. Thanks! On Thu, May 30, 2013 at 12:05 AM, Jonathan Ellis jbel...@gmail.com wrote: Sounds like you're spending all your time in GC, which you can verify by checking what GCInspector and StatusLogger say in the log. The fix is to increase your heap size or upgrade to 1.2: http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 On Wed, May 29, 2013 at 11:32 PM, srmore comom...@gmail.com wrote: Hello, I am observing that my performance decreases drastically as my data size grows. I have a 3 node cluster with 64 GB of RAM and my data size is around 400 GB on each of the nodes. I also see that when I restart Cassandra the performance goes back to normal and then starts decreasing again after some time. Some hunting landed me on this page http://wiki.apache.org/cassandra/LargeDataSetConsiderations which talks about large data sets and explains that it might be because I am going through multiple layers of OS cache, but it does not tell me how to tune for it. So, my question is, are there any optimizations that I can do to handle these large datasets? And why does my performance go back to normal when I restart Cassandra? Thanks! -- Jonathan Ellis Project Chair, Apache Cassandra co-founder, http://www.datastax.com @spyced
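For what it's worth, the heap is usually adjusted in conf/cassandra-env.sh; a minimal sketch (the numbers below are placeholders, not a recommendation, and the usual guidance for CMS-era heaps was to stay around 8 GB or below even on a 64 GB box):

    # conf/cassandra-env.sh
    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="800M"    # commonly sized at roughly 100 MB per CPU core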
Cassandra performance decreases drastically with increase in data size.
Hello, I am observing that my performance decreases drastically as my data size grows. I have a 3 node cluster with 64 GB of RAM and my data size is around 400 GB on each of the nodes. I also see that when I restart Cassandra the performance goes back to normal and then starts decreasing again after some time. Some hunting landed me on this page http://wiki.apache.org/cassandra/LargeDataSetConsiderations which talks about large data sets and explains that it might be because I am going through multiple layers of OS cache, but it does not tell me how to tune for it. So, my question is, are there any optimizations that I can do to handle these large datasets? And why does my performance go back to normal when I restart Cassandra? Thanks!
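A quick way to confirm the GC theory mentioned in the reply above is to grep the server log for the GC and status messages Cassandra emits (a sketch; the log path is the common packaged-install default and may differ on your hosts):

    grep -E 'GCInspector|StatusLogger' /var/log/cassandra/system.log | tail -50

Long or frequent GCInspector pauses around the time throughput drops would point at heap pressure rather than the OS cache layers.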
Re: Cannot resolve schema disagreement
Thanks Rob! Tried the steps; that did not work, however I was able to resolve the problem by syncing the clocks. The thing that confuses me is that the FAQ says "Before 0.7.6, this can also be caused by cluster system clocks being substantially out of sync with each other." The version I am using is 1.0.12. This raises an important question: where does Cassandra get the time information from? And is it required (I know it is highly advisable) to keep clocks in sync? Any suggestions/best practices on how to keep the clocks in sync? /srm On Thu, May 9, 2013 at 1:58 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, May 8, 2013 at 5:40 PM, srmore comom...@gmail.com wrote: After running the commands, I get back to the same issue. I cannot afford to lose the data so I guess this is the only option for me. And unfortunately I am using 1.0.12 (cannot upgrade as of now). Any ideas on what might be happening or any pointers will be greatly appreciated. If you can afford downtime on the cluster, the solution to this problem with the highest chance of success is: 1) dump the existing schema from a good node 2) nodetool drain on all nodes 3) stop cluster 4) move schema and migration CF tables out of the way on all nodes 5) start cluster 6) re-load schema, being careful to explicitly check for schema agreement on all nodes between schema-modifying statements. In many/most cases of schema disagreement, people try the FAQ approach and it doesn't work, and they end up being forced to do the above anyway. In general, if you can tolerate the downtime, you should save yourself the effort and just do the above process. =Rob
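For anyone else hitting this, Rob's procedure translates to roughly the following shell steps (a sketch only; the schema/migration file locations, the service commands, and the cassandra-cli invocation are from memory for a 1.0.x packaged install, so verify paths on your own nodes before moving anything):

    # 1) dump the schema from a node whose schema you trust
    echo 'show schema;' | cassandra-cli -h good-node > /tmp/schema.txt

    # 2-3) flush and stop every node
    nodetool -h <node> drain
    sudo service cassandra stop

    # 4) move the schema/migration system sstables out of the way (1.0.x layout assumed)
    mkdir -p /var/lib/cassandra/schema-backup
    mv /var/lib/cassandra/data/system/Schema*     /var/lib/cassandra/schema-backup/
    mv /var/lib/cassandra/data/system/Migrations* /var/lib/cassandra/schema-backup/

    # 5) start the cluster, then 6) reload the schema statement by statement,
    #    checking 'describe cluster;' for agreement between statements
    sudo service cassandra start
    cassandra-cli -h good-node -f /tmp/schema.txt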
Re: Cannot resolve schema disagreement
Thought so. Thanks Aaron! On Thu, May 9, 2013 at 6:09 PM, aaron morton aa...@thelastpickle.com wrote: This raises an important question: where does Cassandra get the time information from? http://docs.oracle.com/javase/6/docs/api/java/lang/System.html Normally milliseconds; not sure if 1.0.12 may use nanoTime(), which is less reliable on some VMs. And is it required (I know it is highly advisable) to keep clocks in sync? Any suggestions/best practices on how to keep the clocks in sync? http://en.wikipedia.org/wiki/Network_Time_Protocol Hope that helps. - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 10/05/2013, at 9:16 AM, srmore comom...@gmail.com wrote: Thanks Rob! Tried the steps; that did not work, however I was able to resolve the problem by syncing the clocks. The thing that confuses me is that the FAQ says "Before 0.7.6, this can also be caused by cluster system clocks being substantially out of sync with each other." The version I am using is 1.0.12. This raises an important question: where does Cassandra get the time information from? And is it required (I know it is highly advisable) to keep clocks in sync? Any suggestions/best practices on how to keep the clocks in sync? /srm On Thu, May 9, 2013 at 1:58 PM, Robert Coli rc...@eventbrite.com wrote: On Wed, May 8, 2013 at 5:40 PM, srmore comom...@gmail.com wrote: After running the commands, I get back to the same issue. I cannot afford to lose the data so I guess this is the only option for me. And unfortunately I am using 1.0.12 (cannot upgrade as of now). Any ideas on what might be happening or any pointers will be greatly appreciated. If you can afford downtime on the cluster, the solution to this problem with the highest chance of success is: 1) dump the existing schema from a good node 2) nodetool drain on all nodes 3) stop cluster 4) move schema and migration CF tables out of the way on all nodes 5) start cluster 6) re-load schema, being careful to explicitly check for schema agreement on all nodes between schema-modifying statements. In many/most cases of schema disagreement, people try the FAQ approach and it doesn't work, and they end up being forced to do the above anyway. In general, if you can tolerate the downtime, you should save yourself the effort and just do the above process. =Rob
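On keeping the clocks in sync: running an NTP daemon on every node is the usual answer. A minimal sketch for a Debian/Ubuntu-style host (package names and commands vary by distro, so treat this as illustrative):

    sudo apt-get install ntp        # install and start the NTP daemon
    ntpq -p                         # list peers and current offsets
    ntpdate -q pool.ntp.org         # one-off offset query without stepping the clock

Offsets of a few milliseconds are fine; anything drifting toward seconds is worth fixing, since Cassandra resolves conflicting writes by timestamp.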
Cannot resolve schema disagreement
Hello, I have a cluster of 4 nodes and two of them are on a different schema. I tried running the commands described in the FAQ section (http://wiki.apache.org/cassandra/FAQ#schema_disagreement) but no luck. After running the commands, I get back to the same issue. I cannot afford to lose the data so I guess this is the only option for me. And unfortunately I am using 1.0.12 (cannot upgrade as of now). Any ideas on what might be happening or any pointers will be greatly appreciated. /srm
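One quick check before going further is to see exactly which nodes disagree. A sketch for a 1.0.x cluster (on later versions "nodetool describecluster" reports the same information; the cli syntax here is from memory, so double-check it):

    echo 'describe cluster;' | cassandra-cli -h <any-node>

The output should list each schema version UUID together with the hosts holding it, which tells you which two of the four nodes are the outliers.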