Re: Row cache and counters
Reads during a write still occur during a counter increment at CL ONE, but that latency is not counted in the request latency for the write. Your local node write latency of 45 microseconds is pretty quick. What timeout and write request latency do you see? In our deployment we had some issues and could trace the timeouts to ParNew GC collections, which were quite frequent. You might just want to take a look there too.

On Sat, Dec 29, 2012 at 4:44 PM, André Cruz andre.c...@co.sapo.pt wrote:

Hello. I was recently having some timeout issues while updating counters and turned on the row cache for that particular CF. These are its stats:

Column Family: UserQuotas
SSTable count: 3
Space used (live): 2687239
Space used (total): 2687239
Number of Keys (estimate): 22912
Memtable Columns Count: 25766
Memtable Data Size: 180975
Memtable Switch Count: 17
Read Count: 356900
Read Latency: 1.004 ms.
Write Count: 548996
Write Latency: 0.045 ms.
Pending Tasks: 0
Bloom Filter False Postives: 17
Bloom Filter False Ratio: 0.0
Bloom Filter Space Used: 44232
Compacted row minimum size: 125
Compacted row maximum size: 770
Compacted row mean size: 308

Since it is rather small I was hoping that it would eventually be cached entirely, and the timeouts would go away. I'm updating the counters with a CL of ONE, so I thought the timeouts were caused by the read step and that the cache would help here. But I still get timeouts, and the cache hit rate is rather low:

Row Cache: size 1436291 (bytes), capacity 524288000 (bytes), 125310 hits, 442760 requests, 0.247 recent hit rate, 0 save period in seconds

Am I assuming something wrong about the row cache? Isn't it updated when a counter update occurs, or is it just invalidated?

Best regards,
André Cruz
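For reference, a minimal sketch of the write path under discussion: a counter increment at CL ONE using Hector (the client era of this thread). The cluster, keyspace, row and column names are hypothetical illustrations, not the posters' actual code:

    import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.HConsistencyLevel;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class CounterIncrement {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "localhost:9160");
            // Write at CL ONE, as in the thread.
            ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
            ccl.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);
            Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster, ccl);
            Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
            // A replica still reads the current counter state to apply the
            // delta, but that read is not billed to the write request latency.
            m.incrementCounter("user42", "UserQuotas", "used_bytes", 1L);
        }
    }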
Re: Row cache and counters
I assume you mean 8 seconds and not 8 ms; that's pretty huge to be caused by GC. Is there a lot of load on your servers? You might also want to check for memory contention. Regarding GC, since it's ParNew, all you can really do is increase the heap and young gen size, or modify the tenuring threshold. But that can't be the reason for an 8-second timeout.

On Sat, Dec 29, 2012 at 11:37 PM, André Cruz andre.c...@co.sapo.pt wrote:

On 29/12/2012, at 16:59, rohit bhatia rohit2...@gmail.com wrote:

Reads during a write still occur during a counter increment at CL ONE, but that latency is not counted in the request latency for the write. Your local node write latency of 45 microseconds is pretty quick. What timeout and write request latency do you see?

Most of the time the increments are pretty quick, in the millisecond range. I have an 8s timeout, and sometimes the timeouts happen in bursts.

In our deployment we had some issues and could trace the timeouts to ParNew GC collections, which were quite frequent. You might just want to take a look there too.

What can we do about that? Which settings did you tune?

Thanks,
André
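As a concrete way to follow the ParNew suggestion above, the JVM's standard GC MXBeans report collection counts and cumulative pause time; the same beans are visible over JMX on a running Cassandra node. A minimal polling sketch (plain JDK, nothing Cassandra-specific):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcWatch {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    // With CMS configured, the beans are typically named
                    // "ParNew" and "ConcurrentMarkSweep".
                    System.out.printf("%s: %d collections, %d ms total%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                Thread.sleep(5000); // diff successive samples to get the frequency
            }
        }
    }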
Re: Astyanax empty column check
See: If you attempt to retrieve an entire row and it returns a result with no columns, it effectively means that row does not exist. Essentially, a row without columns doesn't exist (except those with tombstones). From here: http://stackoverflow.com/questions/8072253/is-there-a-difference-between-an-empty-key-and-a-key-that-doesnt-exist

On Wed, Oct 17, 2012 at 2:17 PM, Xu Renjie xrjxrjxrj...@gmail.com wrote:

Sorry, about the version: I am using Astyanax 1.0.1.

On Wed, Oct 17, 2012 at 4:44 PM, Xu Renjie xrjxrjxrj...@gmail.com wrote:

Hello guys, I am currently using Astyanax as a client (new to Astyanax). But I am not clear how to differentiate the following two situations:
a. A row which has only a key, without columns.
b. The row does not exist in the database.
When I use RowQuery to query Cassandra with a given key, both of the above situations return a ColumnList of size 0. I also didn't find any other API that can handle this. Do you have a better way to do this? Thanks in advance.

Cheers,
Xu
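Following the advice above, the check in Astyanax reduces to testing whether the returned ColumnList is empty. A sketch, with a hypothetical column family (string keys and column names assumed):

    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.model.ColumnList;
    import com.netflix.astyanax.serializers.StringSerializer;

    public class RowExistence {
        private static final ColumnFamily<String, String> CF =
                new ColumnFamily<String, String>("MyCF",
                        StringSerializer.get(), StringSerializer.get());

        static boolean rowHasLiveColumns(Keyspace keyspace, String rowKey)
                throws ConnectionException {
            ColumnList<String> columns = keyspace.prepareQuery(CF)
                    .getKey(rowKey)
                    .execute()
                    .getResult();
            // An empty list covers both cases from the question: the row was
            // never written, or all of its columns are gone (tombstoned).
            // Thrift-era Cassandra does not distinguish them on a row read.
            return !columns.isEmpty();
        }
    }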
Re: Persistent connection among nodes to communicate and redirect request
I guess 7000 is only for the gossip protocol; Cassandra still uses 9160 for RPC, even among nodes. I also see connections over port 9160 among the various Cassandra nodes in my cluster. Please correct me if I am wrong.

PS: mentioned here: http://wiki.apache.org/cassandra/CloudConfig

On Tue, Oct 2, 2012 at 4:56 PM, Viktor Jevdokimov viktor.jevdoki...@adform.com wrote:

9160 is a client port. Nodes use the messaging service on storage_port (7000) for intra-node communication.

Best regards / Pagarbiai
Viktor Jevdokimov
Senior Developer
Email: viktor.jevdoki...@adform.com
Phone: +370 5 212 3063
Fax: +370 5 261 0453
J. Jasinskio 16C, LT-01112 Vilnius, Lithuania

-----Original Message-----
From: Niteesh kumar [mailto:nitees...@directi.com]
Sent: Tuesday, October 02, 2012 12:32
To: user@cassandra.apache.org
Subject: Persistent connection among nodes to communicate and redirect request

While looking at the netstat output I observed that my cluster nodes are not using persistent connections on port 9160 to talk among themselves when redirecting requests. I also observed that local write latency is around 30-40 microseconds, while it takes around 0.5 milliseconds if the chosen node is not the node responsible for the key, at 50K QPS. I think this is attributable to connection setup time between servers, as my servers are on the same rack. How can I configure my servers to use persistent connections on port 9160, and thus exclude connection setup time for each redirected request?
Re: Cassandra Counters
@Edward: We use counters in production with Cassandra 1.0.5. Our application is sensitive to write latency, and we are seeing problems with frequent young-generation garbage collections; also, we only do increments (decrements have caused problems for some people). We don't see inconsistencies in our data. So if you want 99.99% accurate counters and can manage with eventual consistency, Cassandra works nicely.

On Tue, Sep 25, 2012 at 4:52 PM, Edward Kibardin infa...@gmail.com wrote:

I've recently noticed several threads about Cassandra counter inconsistencies and have started to seriously think about possible workarounds, like storing realtime counters in Redis and dumping them daily to Cassandra. So, general question: should I rely on counters if I want 100% accuracy?

Thanks, Ed

On Tue, Sep 25, 2012 at 8:15 AM, Robin Verlangen ro...@us2.nl wrote:

From my point of view, another problem with using a standard column family for counting is transactions. Cassandra lacks them, so if you're updating counters from multiple threads, how will you keep track of that? Yes, I'm aware of software like Zookeeper to do that, however I'm not sure whether that's the best option. I think you should stick with Cassandra counter column families.

Best regards,
Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E ro...@us2.nl

2012/9/25 Roshni Rajagopal roshni_rajago...@hotmail.com

Thanks for the reply, and sorry for being bull-headed. Once you're past the stage where you've decided it's distributed, and NoSQL, and Cassandra out of all the NoSQL options, you can count something in different ways in Cassandra. In all of them you want to use Cassandra's best features of availability, tunable consistency, partition tolerance etc. Given this, what are the performance trade-offs of using counters vs. a standard column family for counting? Because as I see it, if the number in a counter column family becomes wrong, it will not be 'eventually consistent': you will need intervention to correct it. So the key aspect is how much faster a counter column family would be, and at what numbers we start seeing a difference.

--
Date: Tue, 25 Sep 2012 07:57:08 +0200
Subject: Re: Cassandra Counters
From: oleksandr.pet...@gmail.com
To: user@cassandra.apache.org

Maybe I'm missing the point, but counting in a standard column family would be a little overkill. I assume that distributed counting here was more of a map/reduce approach, where Hadoop (+ Cascading, Pig, Hive, Cascalog) would help you a lot. We're doing some more complex counting (e.g. based on sets of rules) like that. Of course, that would perform _way_ slower than counting beforehand. On the other side, you will always have a consistent result for a consistent dataset.
On the other hand, if you use things like AMQP or Storm (sorry to lump them together like that, as the tools are mostly either orthogonal or complementary, but I hope you get my point), you could build a topology that makes fault-tolerant writes independently of your original write. Of course, it would still have a consistency trade-off, mostly because of race conditions, different network latencies etc. So I would say that building a data model in a distributed system often depends more on your problem than on the common patterns, because everything has a trade-off. Want an immediate result? Modify your counter while writing the row. Can you sacrifice speed for more counting opportunities? Go with offline distributed counting. Want kind of both? Dispatch a message and react upon it, keeping the processing logic and writes decoupled from the main application, allowing you to care less about speed. However, I may have missed the point somewhere (early morning, you know), so I may be wrong in any given statement. Cheers

On Tue, Sep 25, 2012 at 6:53 AM, Roshni Rajagopal roshni_rajago...@hotmail.com wrote:

Thanks Milind. Has anyone implemented counting in a standard column family in Cassandra, where you can have increments and decrements to the count? Any comparisons in performance to using counter column families?

Regards,
Roshni

--
Date: Mon, 24 Sep 2012 11:02:51 -0700
Subject: RE: Cassandra Counters
From: milindpar...@gmail.com
To: user@cassandra.apache.org

IMO You would use Cassandra Counters (or
Re: Cassandra Counters
@Sylvain: In a relatively untroubled cluster, even timed-out writes go through, provided no messages are dropped, which you can monitor on the Cassandra nodes. We have 100% consistency on our production servers because we don't see messages being dropped on our servers. Though, as you mention, there would be no way to repair dropped messages.

On Tue, Sep 25, 2012 at 6:57 PM, Sylvain Lebresne sylv...@datastax.com wrote:

So, general question: should I rely on counters if I want 100% accuracy?

No. Even setting aside potential bugs, counters are not idempotent, so if you get a TimeoutException during a write (which can happen even in relatively normal conditions), you won't know whether the increment was applied or not (and you have no way to know unless you have an external way to check the value). This is probably fine if you use counters for, say, real-time analytics, but not if you need 100% accuracy.

--
Sylvain
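Sylvain's point about non-idempotence is worth spelling out in code. A sketch with Hector (hypothetical row, CF and column names): the catch block has no correct generic action, which is exactly why a timed-out increment cannot be made 100% accurate:

    import me.prettyprint.hector.api.exceptions.HTimedOutException;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class UnsafeRetry {
        static void increment(Mutator<String> mutator) {
            try {
                mutator.incrementCounter("pageviews", "Counters", "2012-09-25", 1L);
            } catch (HTimedOutException e) {
                // The increment may or may not have been applied on a replica.
                // Retrying risks counting twice; giving up risks counting zero
                // times. Without an external source of truth there is no safe
                // recovery, unlike retrying an idempotent column insert.
            }
        }
    }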
Re: are counters stable enough for production?
We use counters in an 8-node cluster with RF 2 on Cassandra 1.0.5. We use phpcassa and execute CQL queries through Thrift to work with composite types. We do not have any problem with overcounts, as we tally against an RDBMS daily. It works fine, but we are seeing some GC pressure in the young generation. By my calculation, around 50-100 KB of garbage is generated per counter increment. Is this memory usage expected of counters?

On Tue, Sep 18, 2012 at 7:16 AM, Bartłomiej Romański b...@sentia.pl wrote:

Hi, does anyone have any experience with using Cassandra counters in production? We rely heavily on them, and recently we've had a few very serious problems. Our counter values suddenly became a few times higher than expected. From the business point of view this is a disaster :/ There are also a few open major bugs related to them, some open for quite a long time (months). We are seriously considering going back to other solutions (e.g. SQL databases). We simply cannot afford incorrect counter values. We can tolerate losing a few increments from time to time, but we cannot tolerate counters suddenly being 3 times higher or lower than the expected values. What is the current status of counters? Should I consider them a production-ready feature and assume we just had some bad luck? Or should I rather consider them an experimental feature and look for other solutions? Do you have any experience with them? Any comments would be very helpful for us!

Thanks,
Bartek
Re: are counters stable enough for production?
@Robin: I'm pretty sure the GC issue is due to counters only, since we have only write-heavy counter-incrementing traffic. GC frequency also increases linearly with write load.

@Bartlomiej: In stress testing, we see GC frequency, and consequently write latency, increase to several milliseconds. At 50k qps we had GC running every 1-2 seconds, and since each ParNew takes around 100 ms, we were spending 10% of each server's time GCing. We don't use persistent connections, but testing with persistent connections gives roughly the same result. At a traffic of roughly 20k qps for 8 nodes with RF 2, we have young-gen GC running on each node approximately every 4 seconds. We have a young-gen heap size of 3200M, which is already too big by any standard. Also, decreasing the replication factor from 2 to 1 reduced the GC frequency 5-6 times. Any advice? Also, our traffic is evenly distributed.

On Tue, Sep 18, 2012 at 1:36 PM, Robin Verlangen ro...@us2.nl wrote:

We've not been trying to create inconsistencies as you describe above, but it seems plausible that those situations cause problems. Sometimes you can see log messages indicating that counters are out of sync in the cluster and that they get repaired. My guess would be that the repair actually destroys them; however, I have no knowledge of the underlying techniques. I think this because those read repairs happen a lot (as you mention: lots of reads) and might get over-repaired or something? However, this is all just a guess. I hope someone with deep knowledge of Cassandra internals can shed some light on this.

Best regards,
Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E ro...@us2.nl

2012/9/18 Bartłomiej Romański b...@sentia.pl

Garbage is one more issue we are having with counters. We are operating under very heavy load. The counters are spread over 7 nodes with SSD drives and we often see CPU usage between 90-100%. We are doing mostly reads. Latency is very important for us, so GC pauses taking longer than 10 ms (often around 50-100 ms) are very annoying. I don't have actual numbers right now, but we've also got the impression that Cassandra generates too much garbage. Is it possible that the counters are somehow at fault?

@Rohit: Did you try something more stressful? Like sending more traffic to a node than it can actually handle, turning nodes up and down, or changing the topology (moving/adding nodes)? I believe our problems come from very high load and operations like these (adding new nodes, replacing dead ones etc.). I was expecting that Cassandra would fail some requests, lose consistency temporarily, or something like that in such cases, but generating highly incorrect values was very disappointing.

Thanks,
Bartek

On Tue, Sep 18, 2012 at 9:30 AM, Robin Verlangen ro...@us2.nl wrote:

@Rohit: We also use counters quite a lot (let's say 2000 increments/sec), but don't see the 50-100 KB of garbage per increment. Are you sure that memory is coming from your counters?
Best regards,
Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E ro...@us2.nl
Re: Cassandra 1.1.1 on Java 7
@dong, any reason to do so?

On Sun, Sep 9, 2012 at 4:43 PM, dong.yajun dongt...@gmail.com wrote:

Running for a while. You should set -Xss to more than 160k when using JDK 1.7.

On Sun, Sep 9, 2012 at 3:39 AM, Peter Schuller peter.schul...@infidyne.com wrote:

Has anyone tried running 1.1.1 on Java 7?

Have been running JDK 1.7 on several clusters on 1.1 for a while now.

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

--
Ric Dong
Newegg Ecommerce, MIS department
Re: Memory Usage of a connection
On Fri, Aug 31, 2012 at 11:27 AM, Peter Schuller peter.schul...@infidyne.com wrote:

Could these 500 connections/second cause (on average) 2600 MB of memory usage per 2 seconds, i.e. ~1300 MB/second, or around 2-3 MB per connection?

In terms of garbage generated, it's much less about the number of connections than about what you're doing with them. Are you, for example, requesting large amounts of data? Large or many columns (or both), etc. Essentially all working data that your request touches is allocated on the heap and contributes to the allocation rate and ParNew frequency.

Our write requests are simple counter increments against memtables already in memory. There is negligible read traffic (100-200 reads/second). Also, it is increasing write traffic that increases GC frequency, with read traffic held constant, so the GC load should be independent of reads.

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Memory Usage of a connection
PS: everything above is in bytes, not bits.

On Fri, Aug 31, 2012 at 11:03 AM, rohit bhatia rohit2...@gmail.com wrote:

I was wondering how much memory an established connection uses in Cassandra's heap space. We are noticing extremely frequent young-generation garbage collections (3.2 GB young generation, a ParNew GC every 2 seconds) at a traffic of 20,000 qps for 8 nodes. We do connection pooling, but with 1 connection per 6 requests with phpcassa, so essentially every node has on average 500 connections created/destroyed every second. Could these 500 connections/second account for (on average) 2600 MB of memory usage per 2 seconds, i.e. ~1300 MB/second, or around 2-3 MB per connection? Is this value expected? (Our write requests are simple counter increments and cannot take up ~500 KB per request as the calculation suggests; they should take up only a few hundred bytes.)

Thanks,
Rohit
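For what it's worth, the arithmetic in the question works out as follows (a back-of-envelope sketch; every input is the poster's own figure, not a measurement):

    public class AllocationEstimate {
        public static void main(String[] args) {
            double mbPerParNewCycle = 2600;  // young-gen garbage per cycle (MB)
            double cycleIntervalSec = 2.0;   // a ParNew roughly every 2 seconds
            double connectionsPerSec = 500;  // pooled connections churned per node
            double requestsPerSec = 2500;    // ~20,000 qps across 8 nodes

            double mbPerSec = mbPerParNewCycle / cycleIntervalSec;   // 1300 MB/s
            double mbPerConnection = mbPerSec / connectionsPerSec;   // 2.6 MB
            double kbPerRequest = mbPerSec * 1024 / requestsPerSec;  // ~532 KB
            System.out.printf("%.0f MB/s, %.1f MB/conn, %.0f KB/req%n",
                    mbPerSec, mbPerConnection, kbPerRequest);
        }
    }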
Re: Schema advice: (Single row or multiple row!?) How do I store millions of columns when I need to read a set of around 500 columns at a single read query using column names ?
You should probably try to break the one-row scheme into a 2*number_of_nodes-row scheme. This should ensure proper distribution of rows while still letting you query a small, fixed number of rows. How you do it depends on how you are going to choose your 200-500 columns when reading (try to keep them in the same row). Even if you are forced to put them in separate rows, you can make the row key some modulus of a hash of the column name, ensuring symmetry and easy access of columns; see the sketch below.

On Mon, Jul 23, 2012 at 6:02 PM, Ertio Lew ertio...@gmail.com wrote:

Any ideas/suggestions please?
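A minimal sketch of the hash-modulus sharding suggested above; the shard count and key format are illustrative choices, not from the thread. Because a given column name always lands in the same shard, a read for a known set of 200-500 columns touches a small, fixed set of rows:

    public class ShardedRowKey {
        static String rowKeyFor(String baseKey, String columnName, int numberOfNodes) {
            int shards = 2 * numberOfNodes;  // e.g. 2 * 8 nodes = 16 rows
            // Mask the sign bit so the modulus is never negative.
            int shard = (columnName.hashCode() & 0x7fffffff) % shards;
            return baseKey + ":" + shard;    // e.g. "metrics:11"
        }
    }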
Re: Composite Column Expiration Behavior
Hi, I don't think that composite columns have parent columns. Your point might be true for super columns, but each composite column is independent.

On Wed, Jul 18, 2012 at 9:14 PM, Thomas Van de Velde thomase...@gmail.com wrote:

Hi there, I am trying to understand the expiration behavior of composite columns. Assume I have two entries that share the same parent column name but each have a different TTL. Would expiration be applied at the parent column level (taking into account the TTLs set per column under the parent and expiring all of the child columns when the most recent TTL is met), or is each child entry expired independently? Would this be correct?

A:B -> ttl=5
A:C -> ttl=10

t+5: Nothing gets expired (because A:C's expiration has not yet been reached)
t+10: Both A:B and A:C are expired

Thanks,
Thomas
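If each composite column is indeed independent, as the reply says, then A:B disappears at t+5 and A:C survives until t+10. A sketch with Astyanax (hypothetical CF and keys; plain string names stand in for the composite prefix):

    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.MutationBatch;
    import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.serializers.StringSerializer;

    public class PerColumnTtl {
        private static final ColumnFamily<String, String> CF =
                new ColumnFamily<String, String>("MyCF",
                        StringSerializer.get(), StringSerializer.get());

        static void write(Keyspace keyspace) throws ConnectionException {
            MutationBatch batch = keyspace.prepareMutationBatch();
            batch.withRow(CF, "rowKey")
                 .putColumn("A:B", "value1", 5)    // expires after 5s, on its own
                 .putColumn("A:C", "value2", 10);  // expires after 10s, unaffected
            batch.execute();
        }
    }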
Re: Using a node in separate cluster without decommissioning.
Hi, just wanted to say that it worked. I also made sure to modify the Thrift rpc_port and the storage_port so that the two clusters don't interfere. Thanks for the suggestion.

Thanks,
Rohit

On Thu, Jul 12, 2012 at 10:01 AM, aaron morton aa...@thelastpickle.com wrote:

Since the replication factor is 2 in the first cluster, I won't lose any data.

Assuming you have been running repair or working at CL QUORUM (which is the same as CL ALL for RF 2).

Is it advisable and safe to go ahead?

Um, so the plan is to turn off 2 nodes in the first cluster, re-task them into the new cluster, and then reverse the process? If you simply turn two nodes off in the first cluster you will have reduced availability for a portion of the ring. 25% of the keys will have at best 1 node they can be stored on. If a node that is a replica for one of the down nodes has any sort of problem, the cluster will appear down for 12.5% of the keyspace. If you work at QUORUM you will not have enough nodes available to write/read 25% of the keys. If you decommission the nodes, you will still have 2 replicas available for each key range; this is the path I would recommend. If you _really_ need to do it, what you suggest will probably work. Some tips:

* Do safe shutdowns: nodetool disablegossip, disablethrift, drain.
* Don't forget to copy the yaml file.
* In the first cluster the other nodes will collect hints for the first hour the nodes are down. You are not going to want these, so disable HH.
* Get the nodes back into the first cluster before gc_grace_seconds expires.
* Bring them back and repair them.
* When you bring them back, reading at CL ONE will give inconsistent results. Reading at QUORUM may result in a lot of repair activity.

Hope that helps.

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 11/07/2012, at 6:35 AM, rohit bhatia wrote:

Hi, I want to take 2 nodes out of an 8-node cluster and use them in another cluster, but can't afford the overhead of streaming the data and rebalancing the cluster. Since the replication factor is 2 in the first cluster, I won't lose any data. I'm planning to save my commit log and data directories and bootstrap the nodes in the second cluster. Afterwards I'll just restore both directories and join the nodes back to the original cluster. This should work, since Cassandra saves all the cluster and schema info in the system keyspace. Is it advisable and safe to go ahead?

Thanks,
Rohit
High RecentWriteLatencyMicro
Hi, as I understand it, writes in Cassandra are pushed directly to memory, and incrementing counters at CL.ONE shouldn't count the counter read latency against the request, so counter-increment writes at CL.ONE should basically be really fast. But in my 8-node cluster (16 cores/32G RAM/Cassandra 1.0.5/Java 7 each) with RF=2, at a traffic of 55k qps = 14k increments per node/7k write requests per node, the write latency (from JMX) increases to around 7-8 ms from the low-traffic value of 0.5 ms. The nodes aren't even being pushed: almost no I/O, lots of free RAM, 30% CPU idle time, OS load ~20. The write latency from cfstats (supposedly the latency for one node to increment its counter) is tiny (~0.05 ms).

1) Is the whole of the 7-8 ms being spent in Thrift overheads and scheduling delays? (There is an insignificant 0.1 ms ping time between machines.)
2) Does keeping a large number of CFs (17 in our case) adversely affect write performance (apart from the extreme-flushing scenario)?
3) I see a lot of threads (4,000-10,000) with names like pool-2-thread-* (identified as client connection threads on this mailing list before) periodically building up. But with idle CPU time and zero pending tasks in tpstats, why do requests keep piling up? (GC stops threads for 100 ms every 1-2 seconds, effectively pausing Cassandra 5-10% of the time, but this doesn't seem to be the reason.)

Thanks,
Rohit
Using a node in separate cluster without decommissioning.
Hi, I want to take 2 nodes out of an 8-node cluster and use them in another cluster, but can't afford the overhead of streaming the data and rebalancing the cluster. Since the replication factor is 2 in the first cluster, I won't lose any data. I'm planning to save my commit log and data directories and bootstrap the nodes in the second cluster. Afterwards I'll just restore both directories and join the nodes back to the original cluster. This should work, since Cassandra saves all the cluster and schema info in the system keyspace. Is it advisable and safe to go ahead?

Thanks,
Rohit
Re: MeteredFlusher in system.log entries
@boris: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/MeteredFlusher.java#L51

On Sun, Jul 8, 2012 at 8:44 AM, Boris Yen yulin...@gmail.com wrote:

I am not sure, but I think there should be only 6 memtables (max) based on the example: 1 is active, 4 are in the queue, 1 is being flushed. Is this correct?

On Wed, Jun 6, 2012 at 9:08 PM, rohit bhatia rohit2...@gmail.com wrote:

Also, could someone please explain how the factor of 7 comes into the picture in this sentence: For example if memtable_total_space_in_mb is 100MB, and memtable_flush_writers is the default 1 (with one data directory), and memtable_flush_queue_size is the default 4, and a Column Family has no secondary indexes. The CF will not be allowed to get above one seventh of 100MB or 14MB, as if the CF filled the flush pipeline with 7 memtables of this size it would take 98MB.

On Wed, Jun 6, 2012 at 6:22 PM, rohit bhatia rohit2...@gmail.com wrote:

Hi, the link http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/ mentions that From version 0.7 onwards the worse case scenario is up to CF Count + Secondary Index Count + memtable_flush_queue_size (defaults to 4) + memtable_flush_writers (defaults to 1 per data directory) memtables in memory the JVM at once. So it implies that for flushing, Cassandra copies the memtable's contents. Does this mean that writes to a column family are not stopped even while it is being flushed?

On Wed, Jun 6, 2012 at 9:42 AM, rohit bhatia rohit2...@gmail.com wrote:

Hi Aaron, thanks for the link, I have gone through it. But this doesn't explain why nodes with exactly the same config/specs differ in their flushing frequency. The traffic on all nodes is the same, as we are using RandomPartitioner.

Thanks,
Rohit

On Wed, Jun 6, 2012 at 12:24 AM, aaron morton aa...@thelastpickle.com wrote:

See the section on memtable_total_space_in_mb here: http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/

Cheers,
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 6/06/2012, at 2:27 AM, rohit bhatia wrote:

I am trying to understand the variance in flush frequency in an 8-node Cassandra cluster. All the flushes are of the same type and initiated by MeteredFlusher.java:

INFO [OptionalTasks:1] 2012-06-05 06:32:05,873 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='Stats', ColumnFamily='Minutewise_Channel_Stats') (estimated 501695882 bytes) [taken from system.log]

The number of flushes for one column family varies from 6 per day to 24 per day among nodes of the same configuration and hardware. Could you please shed light on what conditions MeteredFlusher uses to trigger memtable flushes? Also, how accurate is the estimated size in the above log entry?

Regards,
Rohit Bhatia
Software Engineer, Media.net
Finding bottleneck of a cluster
Our Cassandra cluster consists of 8 nodes (16 cores, 32G RAM, 12G heap, 1600MB young gen, Cassandra 1.0.5, JDK 1.7, 128 concurrent writer threads). The replication factor is 2 with 10 column families, and we serve counter-incrementing, write-intensive traffic (CL=ONE). I am trying to figure out the bottleneck.

1) Is using JDK 1.7 in any way detrimental to Cassandra?

2) What max write qps should be expected? Is the Netflix benchmark also applicable to counter-incrementing workloads? http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

3) At around 50,000 qps for the cluster (~12,500 qps per node), the CPU idle time is around 30%, Cassandra is not disk-bound (insignificant read operations and CPU iowait around 0.05%) and is not swapping (around 15 GB of RAM is free or inactive). The average GC pause for ParNew is 100 ms, occurring every second, so Cassandra spends 10% of its time in the stop-the-world collector. The OS load is around 16-20 and the average write latency is 3 ms. tpstats does not show any significant pending tasks. At this point, suddenly, several nodes start dropping Mutation messages. There are also lots of pending MutationStage and ReplicateOnWriteStage tasks in tpstats. The number of threads in the Java process increases to around 25,000 from the usual 300-400; almost all of the new threads are named pool-2-thread-*. The OS load jumps to around 30-40, the write request latency spikes to more than 500 ms (even to several tens of seconds sometimes), and even the local write latency quadruples from 50 to 200 microseconds. This happens across all the nodes within around 2-3 minutes.

My guess is that this might be due to the 128 writer threads not being able to perform more writes (though with an average local write latency of 100-150 microseconds, each thread should be able to serve ~10,000 qps, and with 128 writer threads a node should be able to serve ~1,280,000 qps). Could there be any other reason for this? What else should I monitor, since system.log does not seem to say anything conclusive before messages are dropped?

Thanks,
Rohit
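For context, the thread-capacity guess in the last paragraph can be sanity-checked with simple arithmetic (inputs are the poster's figures):

    public class WriterPoolBound {
        public static void main(String[] args) {
            double localWriteLatencySec = 100e-6; // ~100 us local write latency
            int writerThreads = 128;              // concurrent_writes setting
            double perThreadQps = 1.0 / localWriteLatencySec;     // 10,000
            double nodeCeilingQps = perThreadQps * writerThreads; // 1,280,000
            System.out.printf("naive ceiling: %.0f qps/node%n", nodeCeilingQps);
            // Observed load was ~12,500 qps/node, two orders of magnitude below
            // this bound, so on paper the writer pool is an unlikely ceiling.
        }
    }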
Re: Finding bottleneck of a cluster
Also, looking at the GC logs, I see messages like this across different servers before they start dropping messages:

2012-07-04T10:48:20.336+0000: 96771.117: [GC 96771.118: [ParNew: 1367297K->57371K(1474560K), 0.0617350 secs] 6641571K->5340088K(12419072K), 0.0634460 secs] [Times: user=0.56 sys=0.01, real=0.06 secs]
Total time for which application threads were stopped: 0.0850010 seconds
Total time for which application threads were stopped: 16.7663710 seconds

The 16-second pause doesn't seem to be caused by the minor/major GCs, which are quite fast and are also logged. The Total time for which ... messages are produced by the PrintGCApplicationStoppedTime parameter, which logs whenever threads reach a safepoint. Is there any way I can figure out what caused the Java threads to pause?

Thanks,
Rohit
Re: Upgrade for Cassandra 0.8.4 to 1.+
http://cassandra.apache.org/ says 1.1.2.

On Thu, Jul 5, 2012 at 7:46 PM, Raj N raj.cassan...@gmail.com wrote:

Hi experts, I am planning to upgrade from 0.8.4 to 1.+. What's the latest stable version?

Thanks,
-Rajesh
Re: Finding bottleneck of a cluster
On Fri, Jul 6, 2012 at 4:47 AM, aaron morton aa...@thelastpickle.com wrote:

12G Heap, 1600Mb Young gen

Is a bit higher than the normal recommendation. A 1600MB young gen can cause some extra ParNew pauses.

Thanks for the heads-up, I'll try tinkering with this.

128 Concurrent writer threads

Unless you are on SSD this is too many.

I mean http://www.datastax.com/docs/0.8/configuration/node_configuration#concurrent-writes, not memtable flush queue writers. The suggested value is 8 * number of cores (16) = 128 itself.

1) Is using JDK 1.7 in any way detrimental to Cassandra?

As far as I know it's not fully certified; thanks for trying it :)

2) What max write qps should be expected? Is the Netflix benchmark also applicable to counter-incrementing workloads?

Counters use a different write path than normal writes and are a bit slower. To benchmark, take a single node and work out the max throughput, then multiply by the number of nodes and divide by the RF to get a rough idea.

the CPU idle time is around 30%, Cassandra is not disk-bound (insignificant read operations and CPU iowait around 0.05%)

Wait until compaction kicks in and handles all your inserts.

The OS load is around 16-20 and the average write latency is 3 ms. tpstats does not show any significant pending tasks.

The node is overloaded. What is the write latency for a single thread doing a single increment against a node that has no other traffic? The latency for a request is the time spent working plus the time spent waiting; once you reach the max throughput, the time spent waiting increases. The SEDA architecture is designed to limit the time spent working.

At this point, suddenly, several nodes start dropping Mutation messages. There are also lots of pending tasks.

The cluster is overwhelmed.

Almost all of the new threads are named pool-2-thread-*.

These are client connection threads.

My guess is that this might be due to the 128 writer threads not being able to perform more writes.

Yes. https://github.com/apache/cassandra/blob/trunk/conf/cassandra.yaml#L214 Work out the latency for a single client and a single node, then start adding replication, nodes and load. When the latency increases you are getting to the max throughput for that config.

Also, as mentioned in my second mail, I am seeing messages like Total time for which application threads were stopped: 16.7663710 seconds. If a node pauses for this long, it might be overwhelmed by the hints stored at other nodes, which can further cause it to wait on/drop a lot of client connection threads. I'll look into what is causing these non-GC pauses. Thanks for the help.

Hope that helps.

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
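Aaron's rough capacity rule above, as arithmetic (the single-node figure here is a placeholder one would measure, not a number from the thread):

    public class ClusterCapacityEstimate {
        public static void main(String[] args) {
            long singleNodeMaxQps = 20000; // measured on one isolated node (assumed)
            int nodes = 8;
            int replicationFactor = 2;
            // Each write is performed RF times, so RF divides the aggregate.
            long roughClusterQps = singleNodeMaxQps * nodes / replicationFactor;
            System.out.println("rough capacity: " + roughClusterQps + " qps"); // 80,000
        }
    }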
Re: Finding bottleneck of a cluster
On Fri, Jul 6, 2012 at 9:44 AM, rohit bhatia rohit2...@gmail.com wrote:

On Fri, Jul 6, 2012 at 4:47 AM, aaron morton aa...@thelastpickle.com wrote:

The node is overloaded. What is the write latency for a single thread doing a single increment against a node that has no other traffic? The latency for a request is the time spent working plus the time spent waiting; once you reach the max throughput, the time spent waiting increases. The SEDA architecture is designed to limit the time spent working.

The write latency I reported is the total latency of a client's request as reported by DataStax OpsCenter; it bottoms out at 0.5 ms. In contrast, the local write latency reported by cfstats is around 50 microseconds, jumping to 150 microseconds during the crash.
Re: GC freeze just after repair session
@ravi: you can increase the young gen size, keep a high tenuring threshold, or increase the survivor ratio.

On Fri, Jul 6, 2012 at 4:03 AM, aaron morton aa...@thelastpickle.com wrote:

Ideally we would like to collect maximum garbage from ParNew itself, during compactions. What are the steps to take towards achieving this?

I'm not sure what you are asking.

Cheers,
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 5/07/2012, at 6:56 PM, Ravikumar Govindarajan wrote:

We have modified maxTenuringThreshold from 1 to 5. Maybe it is causing problems; we will change it back to 1 and see how the system behaves. concurrent_compactors=8. We will reduce this, as our system won't be able to handle this many simultaneous compactions anyway; I think it will also ease GC to some extent. Ideally we would like to collect maximum garbage from ParNew itself, during compactions. What are the steps to take towards achieving this?

On Wed, Jul 4, 2012 at 4:07 PM, aaron morton aa...@thelastpickle.com wrote:

It *may* have been compaction from the repair, but it's not a big CF. I would look at the logs to see how much data was transferred to the node. Was there a compaction going on while the GC storm was happening? Do you have a lot of secondary indexes? If you think it is correlated to compaction you can try reducing concurrent_compactors.

Cheers,
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/07/2012, at 6:33 PM, Ravikumar Govindarajan wrote:

Recently we faced a severe freeze [around 30-40 mins] on one of our servers. Many mutations/reads were dropped. The issue happened just after a routine nodetool repair for the below CF completed [1.0.7, NTS, DC1:3, DC2:2].

Column Family: MsgIrtConv
SSTable count: 12
Space used (live): 17426379140
Space used (total): 17426379140
Number of Keys (estimate): 122624
Memtable Columns Count: 31180
Memtable Data Size: 81950175
Memtable Switch Count: 31
Read Count: 8074156
Read Latency: 15.743 ms.
Write Count: 2172404
Write Latency: 0.037 ms.
Pending Tasks: 0
Bloom Filter False Postives: 1258
Bloom Filter False Ratio: 0.03598
Bloom Filter Space Used: 498672
Key cache capacity: 20
Key cache size: 20
Key cache hit rate: 0.9965579513062582
Row cache: disabled
Compacted row minimum size: 51
Compacted row maximum size: 89970660
Compacted row mean size: 226626

Our heap config is as follows:
-Xms8G -Xmx8G -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=5 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly

From the yaml:
in_memory_compaction_limit=64
compaction_throughput_mb_sec=8
multi_threaded_compaction=false

INFO [AntiEntropyStage:1] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 762) [repair #2b6fcbf0-c1f9-11e1--2ea8811bfbff] MsgIrtConv is fully synced
INFO [AntiEntropySessions:8] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 698) [repair #2b6fcbf0-c1f9-11e1--2ea8811bfbff] session completed successfully
INFO [CompactionExecutor:857] 2012-06-29 09:21:31,219 CompactionTask.java (line 221) Compacted to [/home/sas/system/data/ZMail/MsgIrtConv-hc-858-Data.db,]. 47,907,012 to 40,554,059 (~84% of original) bytes for 4,564 keys at 6.252080MB/s. Time: 6,186ms.

After this, the logs were completely filled with GC [ParNew/CMS]. ParNew ran every 3 seconds, while CMS ran approximately every 30 seconds, continuously for 40 minutes.
INFO [ScheduledTasks:1] 2012-06-29 09:23:39,921 GCInspector.java (line 122) GC for ParNew: 776 ms for 2 collections, 2901990208 used; max is 8506048512
INFO [ScheduledTasks:1] 2012-06-29 09:23:42,265 GCInspector.java (line 122) GC for ParNew: 2028 ms for 2 collections, 3831282056 used; max is 8506048512
...
INFO [ScheduledTasks:1] 2012-06-29 10:07:53,884 GCInspector.java (line 122) GC for ParNew: 817 ms for 2 collections, 2808685768 used; max is 8506048512
INFO [ScheduledTasks:1] 2012-06-29 10:07:55,632 GCInspector.java (line 122) GC for ParNew: 1165 ms for 3 collections, 3264696776 used; max is 8506048512
INFO [ScheduledTasks:1] 2012-06-29 10:07:57,773 GCInspector.java (line 122) GC for ParNew: 1444 ms for 3 collections, 4234372296 used; max is 8506048512
INFO [ScheduledTasks:1] 2012-06-29 10:07:59,387 GCInspector.java (line 122) GC for ParNew: 1153 ms for 2 collections, 4910279080 used; max is 8506048512
INFO [ScheduledTasks:1] 2012-06-29 10:08:00,389 GCInspector.java (line 122) GC for ParNew: 697 ms for 2 collections, 4873857072 used; max is 8506048512
INFO [ScheduledTasks:1] 2012-06-29 10:08:01,443 GCInspector.java (line 122) GC for ParNew: 726 ms for 2 collections, 4941511184 used; max is 8506048512

After this, the node got stable and was back up and running. Any
Re: Interpreting system.log MeteredFlusher messages
On Wed, Jun 27, 2012 at 2:27 PM, aaron morton aa...@thelastpickle.com wrote:

, but I do not understand the remedy to the problem. Is increasing this variable my only option?

There is nothing to be fixed. This is Cassandra flushing data to disk to free memory and checkpoint the commit log.

Yes, but it induces simultaneous flushes of around 7-8 column families, which exceeds the flush queue size; I believe this can lead Cassandra to stop accepting writes.

I see memtables with a serialized size of 100-200 MB and an estimated live size of 500 MB get flushed to produce sstables of around 10-15 MB. Are these factors of 10-20 between serialized-on-disk and in-memory size, and 3-5 for liveRatio, expected?

Do you have some log messages for this? The elevated estimated size may be due to a lot of overwrites.

Sample log messages:

INFO [OptionalTasks:1] 2012-06-27 07:14:25,720 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='Stats', ColumnFamily='Minutewise_Adtype_Customer_Stats') (estimated 529810674 bytes)
INFO [OptionalTasks:1] 2012-06-27 07:14:25,721 ColumnFamilyStore.java (line 688) Enqueuing flush of Memtable-Minutewise_Adtype_Customer_Stats@1651281270(163641387/529810674 serialized/live bytes, 1633074 ops)
INFO [FlushWriter:3808] 2012-06-27 07:14:25,727 Memtable.java (line 239) Writing Memtable-Minutewise_Adtype_Customer_Stats@1651281270(163641387/529810674 serialized/live bytes, 1633074 ops)
INFO [FlushWriter:3808] 2012-06-27 07:14:26,131 Memtable.java (line 275) Completed flushing /mnt/data/cassandra/data/Stats/Minutewise_Adtype_Customer_Stats-hb-70-Data.db (6315581 bytes)

Yes, there are overwrites. Since these are counter column families, they see a lot of increments. Does Cassandra store all the history for a column (and is there some way to not store it)?

Since the formula is CF Count + Secondary Index Count + memtable_flush_queue_size (defaults to 4) + memtable_flush_writers (defaults to 1 per data directory) memtables in memory the JVM at once., shouldn't the limit be 6 (and not 7) memtables in memory?

It's 7 because https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/MeteredFlusher.java#L51

Thanks a lot for this. I should have looked this up myself.

Cheers,
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/06/2012, at 4:41 AM, rohit bhatia wrote:

Hi, we have 8 Cassandra 1.0.5 nodes with 16 cores and 32G RAM; the heap size is 12G and memtable_total_space_in_mb is one third of that = 4G. There are 12 hot CFs (write:read ratio of 10), memtable_flush_queue_size = 4 and memtable_flush_writers = 2. I got this log entry: MeteredFlusher.java (line 74) estimated 423318 bytes used by all memtables pre-flush, following which Cassandra flushed several of its largest memtables. I understand that this message is due to the memtable_total_space_in_mb setting being reached, but I do not understand the remedy to the problem. Is increasing this variable my only option?

Also, in standard MeteredFlusher flushes (the ones that trigger due to the if my entire flush pipeline were full of memtables of this size, how big could I allow them to be logic), I see memtables with a serialized size of 100-200 MB and an estimated live size of 500 MB get flushed to produce sstables of around 10-15 MB. Are these factors of 10-20 between serialized-on-disk and in-memory size, and 3-5 for liveRatio, expected?
Interpreting system.log MeteredFlusher messages
Hi, we have 8 Cassandra 1.0.5 nodes with 16 cores and 32G RAM; the heap size is 12G and memtable_total_space_in_mb is one third of that = 4G. There are 12 hot CFs (write:read ratio of 10), memtable_flush_queue_size = 4 and memtable_flush_writers = 2.

I got this log entry: MeteredFlusher.java (line 74) estimated 423318 bytes used by all memtables pre-flush, following which Cassandra flushed several of its largest memtables. I understand that this message is due to the memtable_total_space_in_mb setting being reached, but I do not understand the remedy to the problem. Is increasing this variable my only option?

Also, in standard MeteredFlusher flushes (the ones that trigger due to the if my entire flush pipeline were full of memtables of this size, how big could I allow them to be logic), I see memtables with a serialized size of 100-200 MB and an estimated live size of 500 MB get flushed to produce sstables of around 10-15 MB. Are these factors of 10-20 between serialized-on-disk and in-memory size, and 3-5 for liveRatio, expected?

Also, this very informative article http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/ has this to say: For example if memtable_total_space_in_mb is 100MB, and memtable_flush_writers is the default 1 (with one data directory), and memtable_flush_queue_size is the default 4, and a Column Family has no secondary indexes. The CF will not be allowed to get above one seventh of 100MB or 14MB, as if the CF filled the flush pipeline with 7 memtables of this size it would take 98MB. Since the formula is CF Count + Secondary Index Count + memtable_flush_queue_size (defaults to 4) + memtable_flush_writers (defaults to 1 per data directory) memtables in memory the JVM at once., shouldn't the limit be 6 (and not 7) memtables in memory?

Thanks,
Rohit
Re: Cassandra out of Heap memory
I am using 1.0.5. The logs suggest it was a single instance of failure, and I'm unable to reproduce it. From the logs, in a span of 30 seconds heap usage went from 4.8 GB to 8.8 GB, with the stop-the-world GC running 20 times. I believe ParNew was unable to clean up memory due to some problem. I will report back if I can reproduce this failure.

On Mon, Jun 18, 2012 at 6:14 AM, aaron morton aa...@thelastpickle.com wrote:

Not commenting on the GC advice, but Cassandra memory usage has improved a lot since that was written. I would take a look at what was happening and see if tweaking the Cassandra config helped before modifying GC settings.

GCInspector.java (line 88): Heap is 0.9934 full. Is this expected? Or should I adjust my flush_largest_memtable_at variable?

flush_largest_memtable_at is a safety valve only. Reducing it may help avoid OOM, but it will not treat the cause. What version are you using? 1.0.0 had an issue where deletes were not taken into consideration (https://github.com/apache/cassandra/blob/trunk/CHANGES.txt#L33), but this does not sound like the same problem. Take a look in the logs on the machine and see if it was associated with a compaction or repair operation. I would also consider experimenting on one node with 8GB / 800MB heap sizes. More is not always better.

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
Re: Cassandra out of Heap memory
Looking at http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html and server logs, I think my situation is this: "The default cassandra settings has the highest peak heap usage. The problem with this is that it raises the possibility that during the CMS cycle, a collection of the young generation runs out of memory to migrate objects to the old generation (a so-called concurrent mode failure), leading to stop-the-world full garbage collection. However, with a slightly lower setting of the CMS threshold, we get a bit more headroom, and more stable overall performance." I see ConcurrentMarkSweep system.log entries trying to GC 2-4 collections. Any suggestions for preemptive measures would be welcome.
Cassandra out of Heap memory
Hi. My Cassandra node ran out of heap memory with this message: GCInspector.java (line 88): Heap is .9934 full. Is this expected, or should I adjust my flush_largest_memtable_at variable? Also, one change I made in my cluster was to add 5 column families which are empty. Should empty column families cause a significant increase in Cassandra heap usage? Thanks Rohit
Re: Cassandra out of Heap memory
To clarify things: our setup consists of 8 nodes with 32 GB RAM each, a max heap size of 12 GB, and a new-gen heap size of 1.6 GB. The load on our nodes has a write:read ratio of 10, with 6 main column families. The column family flushes occur every hour with sstable sizes of around 50-100 MB, while the memtable size for those seems to be around 500 MB (is this 10-20x overhead expected?). Also, this is the first time I'm seeing max-heap-size-reached exceptions. Could there be a significant reason for this, other than that the Cassandra servers had been running without a restart for 2 months?

On Wed, Jun 13, 2012 at 6:30 PM, rohit bhatia rohit2...@gmail.com wrote: Hi. My Cassandra node ran out of heap memory with this message: GCInspector.java (line 88): Heap is .9934 full. Is this expected, or should I adjust my flush_largest_memtable_at variable? Also, one change I made in my cluster was to add 5 column families which are empty. Should empty column families cause a significant increase in Cassandra heap usage? Thanks Rohit
Re: Problem in getting data from a 2 node cluster of Cassandra
Run nodetool -h localhost cfstats on the nodes... this gives node-specific, per-column-family data... just run it on both nodes...

On Fri, Jun 8, 2012 at 12:46 PM, Prakrati Agrawal prakrati.agra...@mu-sigma.com wrote: Yes, the code is the same for both the 1 node and 2 node clusters. It's Hector code. How do I get the number of rows and columns from the Cassandra CLI, as the data is very large? Thanks and Regards Prakrati

-Original Message- From: Roshni Rajagopal [mailto:roshni.rajago...@wal-mart.com] Sent: Friday, June 08, 2012 12:43 PM To: user@cassandra.apache.org Subject: Re: Problem in getting data from a 2 node cluster of Cassandra Hi Prakrati, In an ideal situation no data should be lost when a node is added. How are you getting the statistics below? The output looks like it is from code using Hector or Thrift. Is the code that gets statistics from the 1 node cluster and the 2 node cluster exactly the same, with the only change being a node added or removed? Could you verify the number of rows and columns in the column family using the CLI or CQL? Regards, Roshni

From: Prakrati Agrawal prakrati.agra...@mu-sigma.com Reply-To: user@cassandra.apache.org Date: Friday 8 June 2012 11:50 AM To: user@cassandra.apache.org Subject: Problem in getting data from a 2 node cluster of Cassandra Dear all, I originally had a 1 node cluster. Then I added one more node to it, with the initial token configured appropriately. Now when I run my queries I am not getting all my data, i.e. all columns.

Output on 2 nodes: Time taken to retrieve columns 43707 of key range is 1276. Time taken to retrieve columns 2084199 of all tickers is 54334. Time taken to count is 230776. Total number of rows in the database are 183. Total number of columns in the database are 7903753.

Output on 1 node: Time taken to retrieve columns 43707 of key range is 767. Time taken to retrieve columns 382 of all tickers is 52793. Time taken to count is 268135. Total number of rows in the database are 396. Total number of columns in the database are 16316426.

Please help me: where is my data going, or how should I retrieve it? I have the consistency level specified as ONE and I did not specify any replication factor. Prakrati Agrawal | Developer - Big Data(ID) | 9731648376 | www.mu-sigma.com
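On counting rows from client code when the data is too large for the CLI: a paged, keys-only range-slices query is the usual Hector pattern. A minimal sketch; the CF name "Tickers" and the page size are illustrative, the serializers should match your schema, and note that rows existing only as tombstones will still be counted:

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.OrderedRows;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.RangeSlicesQuery;

    // Sketch: count rows by paging a keys-only range-slices query.
    public static long countRows(Keyspace keyspace) {
        final int PAGE = 1000; // rows fetched per round trip
        RangeSlicesQuery<String, String, String> q = HFactory
                .createRangeSlicesQuery(keyspace, StringSerializer.get(),
                                        StringSerializer.get(), StringSerializer.get())
                .setColumnFamily("Tickers")   // illustrative CF name
                .setReturnKeysOnly()          // keys suffice for a row count
                .setRowCount(PAGE);

        long total = 0;
        String start = "";
        while (true) {
            OrderedRows<String, String, String> rows =
                    q.setKeys(start, "").execute().get();
            total += rows.getCount();
            if (rows.getCount() < PAGE) break;
            start = rows.peekLast().getKey(); // next page starts at the last key seen...
            total--;                          // ...which is fetched twice, so uncount it
        }
        return total;
    }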
Re: Time taken to retrieve data from a 2 node cluster is more than 1 node cluster
Is your client code making asynchronous requests? And what are your replication factor and read consistency level? In any case, 2 nodes might take as much time as one, but they should not be slower (unless you also doubled the data)...

On Fri, Jun 8, 2012 at 2:41 PM, Prakrati Agrawal prakrati.agra...@mu-sigma.com wrote: Dear all, Initially I had a one node cluster and I flooded my data into it. I then ran my Hector code to get some rows and columns. It took around 52.793 seconds. Then I added one more node to the cluster. I ran the same code again and it took around 112.065 seconds. My belief was that Cassandra should perform faster with more nodes. Is my belief wrong, or am I doing something wrong? Please help me. Thanks and Regards Prakrati
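On the consistency-level point: in Hector the default read/write CL is set per keyspace via a consistency-level policy. A minimal sketch, assuming illustrative cluster and keyspace names (the API calls themselves are stock Hector):

    import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.HConsistencyLevel;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;

    // Sketch: pin reads and writes to CL ONE so a 2-node read
    // does not wait on more replicas than a 1-node read did.
    ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
    ccl.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);
    ccl.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);
    Cluster cluster = HFactory.getOrCreateCluster("my-cluster", "host1:9160"); // illustrative
    Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster, ccl);   // illustrative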
Re: Cassandra 1 node crashed in ring
Restart Cassandra on the new node with auto_bootstrap set to true, the seed set to the existing node in the cluster, and an appropriate token... You should not need to run nodetool repair, as autobootstrap takes care of it.

On Thu, Jun 7, 2012 at 12:22 PM, Adeel Akbar adeel.ak...@panasiangroup.com wrote: Hi, I am running 2 nodes of Cassandra 0.8.1 in a ring with replication factor 2. Last night one of the Cassandra servers crashed and now we are running on a single node. Please help me with how to add a new node to the ring so that it gets all the updates/data that were lost on the crashed server. Thanks Regards Adeel Akbar
Re: Cassandra 1 node crashed in ring
Pardon me for assuming that your new node was the same as the failed node. Please see http://www.datastax.com/docs/1.0/operations/cluster_management#replacing-a-dead-node. You should be able to proceed with the above link after decommissioning the new node...

On Thu, Jun 7, 2012 at 1:12 PM, Adeel Akbar adeel.ak...@panasiangroup.com wrote: Hi, I have done the same and now it displays three nodes in the ring. How do I remove the crashed node, and what about its data?

root@zerg:~/apache-cassandra-0.8.1/bin# ./nodetool -h XXX.XX.XXX.XX ring
Address DC Rack Status State Load Owns Token
147906224866113468886003862620136792702
XX.XX.XX.XX 16 100 Up Normal 17.37 MB 14.93% 3159755813495848170708142250209621026
XX.XX.XX.XX 16 100 Down Normal ? 23.56% 43237339313998282086051322460691860905
XX.XX.XX.XX 16 100 Up Normal 15.21 KB 61.52% 147906224866113468886003862620136792702

Thanks Regards Adeel Akbar

-Original Message- From: rohit bhatia [mailto:rohit2...@gmail.com] Sent: Thursday, June 07, 2012 12:28 PM To: user@cassandra.apache.org Subject: Re: Cassandra 1 node crashed in ring Restart Cassandra on the new node with auto_bootstrap set to true, the seed set to the existing node in the cluster, and an appropriate token... You should not need to run nodetool repair, as autobootstrap takes care of it.

On Thu, Jun 7, 2012 at 12:22 PM, Adeel Akbar adeel.ak...@panasiangroup.com wrote: Hi, I am running 2 nodes of Cassandra 0.8.1 in a ring with replication factor 2. Last night one of the Cassandra servers crashed and now we are running on a single node. Please help me with how to add a new node to the ring so that it gets all the updates/data that were lost on the crashed server. Thanks Regards Adeel Akbar
Re: Cassandra 1 node crashed in ring
For 0.8: http://www.datastax.com/docs/0.8/operations/cluster_management#replacing-a-dead-node

On Thu, Jun 7, 2012 at 1:22 PM, rohit bhatia rohit2...@gmail.com wrote: Pardon me for assuming that your new node was the same as the failed node. Please see http://www.datastax.com/docs/1.0/operations/cluster_management#replacing-a-dead-node. You should be able to proceed with the above link after decommissioning the new node...

On Thu, Jun 7, 2012 at 1:12 PM, Adeel Akbar adeel.ak...@panasiangroup.com wrote: Hi, I have done the same and now it displays three nodes in the ring. How do I remove the crashed node, and what about its data?

root@zerg:~/apache-cassandra-0.8.1/bin# ./nodetool -h XXX.XX.XXX.XX ring
Address DC Rack Status State Load Owns Token
147906224866113468886003862620136792702
XX.XX.XX.XX 16 100 Up Normal 17.37 MB 14.93% 3159755813495848170708142250209621026
XX.XX.XX.XX 16 100 Down Normal ? 23.56% 43237339313998282086051322460691860905
XX.XX.XX.XX 16 100 Up Normal 15.21 KB 61.52% 147906224866113468886003862620136792702

Thanks Regards Adeel Akbar

-Original Message- From: rohit bhatia [mailto:rohit2...@gmail.com] Sent: Thursday, June 07, 2012 12:28 PM To: user@cassandra.apache.org Subject: Re: Cassandra 1 node crashed in ring Restart Cassandra on the new node with auto_bootstrap set to true, the seed set to the existing node in the cluster, and an appropriate token... You should not need to run nodetool repair, as autobootstrap takes care of it.

On Thu, Jun 7, 2012 at 12:22 PM, Adeel Akbar adeel.ak...@panasiangroup.com wrote: Hi, I am running 2 nodes of Cassandra 0.8.1 in a ring with replication factor 2. Last night one of the Cassandra servers crashed and now we are running on a single node. Please help me with how to add a new node to the ring so that it gets all the updates/data that were lost on the crashed server. Thanks Regards Adeel Akbar
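For reference, the removal step that doc describes amounts to telling a live node to take over the dead node's range. A sketch; removetoken is the 0.8-era command name, and the token below is the one shown as Down in the ring output above:

    # Run from any live node; the token is the Down node's token.
    bin/nodetool -h localhost removetoken 43237339313998282086051322460691860905

    # If the operation stalls, these subcommands may help, but verify they
    # exist on your exact 0.8.x build with `nodetool help` first:
    bin/nodetool -h localhost removetoken status
    bin/nodetool -h localhost removetoken force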
memtable_flush_queue_size and memtable_flush_writers
Hi. I can't find this in any documentation online, so I just wanted to ask: do all flush writers share the same flush queue, or do they maintain separate queues? Thanks Rohit
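For reference, the two settings involved, as they would appear in cassandra.yaml. This is a sketch rather than an excerpt from a specific release, and the shared-queue remark is my reading of the flush-writer executor in the source, not something the docs state:

    # cassandra.yaml -- flush pipeline
    # Writers appear to pull from a single shared queue of pending memtables.
    memtable_flush_writers: 1      # defaults to 1 per data directory
    memtable_flush_queue_size: 4   # memtables allowed to wait for a writer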
Re: MeteredFlusher in system.log entries
Hi. The link http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/ mentions that "From version 0.7 onwards the worst case scenario is up to CF Count + Secondary Index Count + memtable_flush_queue_size (defaults to 4) + memtable_flush_writers (defaults to 1 per data directory) memtables in memory in the JVM at once." So it implies that, for flushing, Cassandra copies the memtable's contents. Does this imply that writes to a column family are not stopped even while it is being flushed? Thanks Rohit

On Wed, Jun 6, 2012 at 9:42 AM, rohit bhatia rohit2...@gmail.com wrote: Hi Aaron, Thanks for the link, I have gone through it. But this doesn't explain why nodes of exactly the same config/specs differ in their flushing frequency. The traffic on all nodes is the same, as we are using RandomPartitioner. Thanks Rohit

On Wed, Jun 6, 2012 at 12:24 AM, aaron morton aa...@thelastpickle.com wrote: See the section on memtable_total_space_in_mb here http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/ Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 6/06/2012, at 2:27 AM, rohit bhatia wrote: I am trying to understand the variance in flush frequency in an 8 node Cassandra cluster. All the flushes are of the same type and initiated by MeteredFlusher.java: INFO [OptionalTasks:1] 2012-06-05 06:32:05,873 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='Stats', ColumnFamily='Minutewise_Channel_Stats') (estimated 501695882 bytes) [taken from system.log] The number of flushes for one column family varies from 6 to 24 per day among nodes of the same configuration and hardware. Could you please shed light on what conditions MeteredFlusher uses to trigger memtable flushes? Also, how accurate is the estimated size in the above log entry? Regards Rohit Bhatia Software Engineer, Media.net
Re: MeteredFlusher in system.log entries
Also, could someone please explain how the factor of 7 comes into the picture in this sentence: "For example if memtable_total_space_in_mb is 100MB, and memtable_flush_writers is the default 1 (with one data directory), and memtable_flush_queue_size is the default 4, and a Column Family has no secondary indexes. The CF will not be allowed to get above one seventh of 100MB or 14MB, as if the CF filled the flush pipeline with 7 memtables of this size it would take 98MB."

On Wed, Jun 6, 2012 at 6:22 PM, rohit bhatia rohit2...@gmail.com wrote: Hi. The link http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/ mentions that "From version 0.7 onwards the worst case scenario is up to CF Count + Secondary Index Count + memtable_flush_queue_size (defaults to 4) + memtable_flush_writers (defaults to 1 per data directory) memtables in memory in the JVM at once." So it implies that, for flushing, Cassandra copies the memtable's contents. Does this imply that writes to a column family are not stopped even while it is being flushed? Thanks Rohit

On Wed, Jun 6, 2012 at 9:42 AM, rohit bhatia rohit2...@gmail.com wrote: Hi Aaron, Thanks for the link, I have gone through it. But this doesn't explain why nodes of exactly the same config/specs differ in their flushing frequency. The traffic on all nodes is the same, as we are using RandomPartitioner. Thanks Rohit

On Wed, Jun 6, 2012 at 12:24 AM, aaron morton aa...@thelastpickle.com wrote: See the section on memtable_total_space_in_mb here http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/ Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 6/06/2012, at 2:27 AM, rohit bhatia wrote: I am trying to understand the variance in flush frequency in an 8 node Cassandra cluster. All the flushes are of the same type and initiated by MeteredFlusher.java: INFO [OptionalTasks:1] 2012-06-05 06:32:05,873 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='Stats', ColumnFamily='Minutewise_Channel_Stats') (estimated 501695882 bytes) [taken from system.log] The number of flushes for one column family varies from 6 to 24 per day among nodes of the same configuration and hardware. Could you please shed light on what conditions MeteredFlusher uses to trigger memtable flushes? Also, how accurate is the estimated size in the above log entry? Regards Rohit Bhatia Software Engineer, Media.net
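One hedged reading of the arithmetic (my guess at the discrepancy, not something the article states): the quoted formula counts

    1 (CF) + 0 (indexes) + 4 (queue) + 1 (writer) = 6  ->  100 MB / 6 = ~16.7 MB

while the example's 14 MB only falls out if one extra slot is counted, for instance the live memtable currently accepting writes on top of those queued or being flushed:

    1 (CF) + 0 (indexes) + 4 (queue) + 1 (writer) + 1 (extra slot) = 7  ->  100 MB / 7 = ~14 MB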
MeteredFlusher in system.log entries
I am trying to understand the variance in flush frequency in an 8 node Cassandra cluster. All the flushes are of the same type and initiated by MeteredFlusher.java: INFO [OptionalTasks:1] 2012-06-05 06:32:05,873 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='Stats', ColumnFamily='Minutewise_Channel_Stats') (estimated 501695882 bytes) [taken from system.log] The number of flushes for one column family varies from 6 to 24 per day among nodes of the same configuration and hardware. Could you please shed light on what conditions MeteredFlusher uses to trigger memtable flushes? Also, how accurate is the estimated size in the above log entry? Regards Rohit Bhatia Software Engineer, Media.net
Re: MeteredFlusher in system.log entries
Hi Aaron, Thanks for the link, I have gone through it. But this doesn't explain why nodes of exactly the same config/specs differ in their flushing frequency. The traffic on all nodes is the same, as we are using RandomPartitioner. Thanks Rohit

On Wed, Jun 6, 2012 at 12:24 AM, aaron morton aa...@thelastpickle.com wrote: See the section on memtable_total_space_in_mb here http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/ Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 6/06/2012, at 2:27 AM, rohit bhatia wrote: I am trying to understand the variance in flush frequency in an 8 node Cassandra cluster. All the flushes are of the same type and initiated by MeteredFlusher.java: INFO [OptionalTasks:1] 2012-06-05 06:32:05,873 MeteredFlusher.java (line 62) flushing high-traffic column family CFS(Keyspace='Stats', ColumnFamily='Minutewise_Channel_Stats') (estimated 501695882 bytes) [taken from system.log] The number of flushes for one column family varies from 6 to 24 per day among nodes of the same configuration and hardware. Could you please shed light on what conditions MeteredFlusher uses to trigger memtable flushes? Also, how accurate is the estimated size in the above log entry? Regards Rohit Bhatia Software Engineer, Media.net