Re: clearing tombstones?
Levelled Compaction is a wholly different beast when it comes to tombstones. The tombstones are inserted, like any other write really, at the lower levels in the leveldb hierarchy. They are only removed after they have had the chance to "naturally" migrate upwards in the leveldb hierarchy to the highest level in your data store. How long that takes depends on:

1. The amount of data in your store and the number of levels your LCS strategy has
2. The amount of new writes entering the bottom funnel of your leveldb, forcing upwards compaction and combining

To give you an idea, I had a similar scenario and ran a (slow, throttled) delete job on my cluster around December-January. Here's a graph of the disk space usage on one node. Notice the still-declining usage long after the cleanup job finished (sometime in January). I tend to think of tombstones in LCS as little bombs that get to explode much later in time: http://mina.naguib.ca/images/tombstones-cassandra-LCS.jpg

On 2014-04-11, at 11:20 AM, Paulo Ricardo Motta Gomes wrote:
> I have a similar problem here. I deleted about 30% of a very large CF using LCS (about 80GB per node), but still my data hasn't shrunk, even though I used 1 day for gc_grace_seconds. Would nodetool scrub help? Does nodetool scrub force a minor compaction?
>
> Cheers,
>
> Paulo
>
> On Fri, Apr 11, 2014 at 12:12 PM, Mark Reddy wrote:
> Yes, running nodetool compact (major compaction) creates one large SSTable. This will mess up the heuristics of the SizeTiered strategy (is this the compaction strategy you are using?), leading to multiple 'small' SSTables alongside the single large SSTable, which results in increased read latency. You will incur the operational overhead of having to manage compactions if you wish to compact these smaller SSTables. For all these reasons it is generally advised to stay away from running compactions manually.
> Assuming that this is a production environment and you want to keep everything running as smoothly as possible, I would reduce the gc_grace on the CF, allow automatic minor compactions to kick in, and then increase the gc_grace once again after the tombstones have been removed.
>
> On Fri, Apr 11, 2014 at 3:44 PM, William Oberman wrote:
> So, if I was impatient and just "wanted to make this happen now", I could:
>
> 1.) Change GCGraceSeconds of the CF to 0
> 2.) run nodetool compact (*)
> 3.) Change GCGraceSeconds of the CF back to 10 days
>
> Since I have ~900M tombstones, even if I miss a few due to impatience, I don't care *that* much as I could re-run my clean up tool against the now much smaller CF.
>
> (*) A long long time ago I seem to recall reading advice about "don't ever run nodetool compact", but I can't remember why. Is there any bad long term consequence? Short term there are several:
> - a heavy operation
> - temporary 2x disk space
> - one big SSTable afterwards
> But moving forward, everything is ok right? CommitLog/MemTable->SSTables, minor compactions that merge SSTables, etc... The only flaw I can think of is it will take forever until the SSTable minor compactions build up enough to consider including the big SSTable in a compaction, making it likely I'll have to self-manage compactions.
>
> On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy wrote:
> Correct, a tombstone will only be removed after the gc_grace period has elapsed. The default value is set to 10 days, which allows a great deal of time for consistency to be achieved prior to deletion. If you are operationally confident that you can achieve consistency via anti-entropy repairs within a shorter period, you can always reduce that 10 day interval.
> Mark
>
> On Fri, Apr 11, 2014 at 3:16 PM, William Oberman wrote:
> I'm seeing a lot of articles about a dependency between removing tombstones and GCGraceSeconds, which might be my problem (I just checked, and this CF has a GCGraceSeconds of 10 days).
>
> On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli wrote:
> compaction should take care of it; for me it never worked, so I run nodetool compact on every node; that does it.
>
> 2014-04-11 16:05 GMT+02:00 William Oberman:
> I'm wondering what will clear tombstoned rows? nodetool cleanup, nodetool repair, or time (as in just wait)?
>
> I had a CF that was more or less storing session information. After some time, we decided that one piece of this information was pointless to track (and was 90%+ of the columns, and in 99% of those cases was ALL columns for a row). I wrote a process to remove all of those columns (which again, in a vast majority of cases, had the effect of removing the whole row).
>
> This CF had ~1 billion rows, so I expect to be left with ~100M rows. After I did this mass delete, everything was the same size on disk (which I expected, knowing how tombstoning works).
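For reference, the "impatient" procedure William outlines (drop gc_grace, force a major compaction, restore gc_grace) would look roughly like the following sketch. The keyspace/table names are illustrative, and the exact ALTER syntax depends on your Cassandra/CQL version:

```shell
# Sketch only - ks.sessions is a hypothetical keyspace/table.
# 1.) Drop gc_grace_seconds to 0 so tombstones are immediately purgeable
cqlsh -e "ALTER TABLE ks.sessions WITH gc_grace_seconds = 0;"

# 2.) Force a major compaction on that CF (heavy; needs up to 2x disk space)
nodetool compact ks sessions

# 3.) Restore the default 10-day gc_grace
cqlsh -e "ALTER TABLE ks.sessions WITH gc_grace_seconds = 864000;"
```

Note that, as discussed above, running gc_grace at 0 while a node is down or inconsistent risks resurrecting deleted data, so the window should be kept as short as possible.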
Re: cassandra error on restart
There was mention of a similar crash on the mailing list. Does this apply to your case? http://mail-archives.apache.org/mod_mbox/cassandra-user/201306.mbox/%3ccdecfcfa.11e95%25agundabatt...@threatmetrix.com%3E

--
Mina Naguib
AdGear Technologies Inc.
http://adgear.com/

On 2013-09-10, at 10:09 AM, "Langston, Jim" wrote:
> Hi all,
>
> I restarted my cassandra ring this morning, but it is refusing to start. Everything was fine, but now I get this error in the log:
>
> ....
> INFO 14:05:14,420 Compacting
> [SSTableReader(path='/raid0/cassandra/data/system/local/system-local-ic-20-Data.db'),
> SSTableReader(path='/raid0/cassandra/data/system/local/system-local-ic-21-Data.db'),
> SSTableReader(path='/raid0/cassandra/data/system/local/system-local-ic-23-Data.db'),
> SSTableReader(path='/raid0/cassandra/data/system/local/system-local-ic-22-Data.db')]
> INFO 14:05:14,493 Compacted 4 sstables to [/raid0/cassandra/data/system/local/system-local-ic-24,]. 1,086 bytes to 486 (~44% of original) in 66ms = 0.007023MB/s. 4 total rows, 1 unique. Row merge counts were {1:0, 2:0, 3:0, 4:1, }
> INFO 14:05:14,543 Starting Messaging Service on port 7000
> java.lang.NullPointerException
>     at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:745)
>     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:554)
>     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:451)
>     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
>     at org.apache.cassandra.service.CassandraDaemon.init(CassandraDaemon.java:381)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.commons.daemon.support.DaemonLoader.load(DaemonLoader.java:212)
> Cannot load daemon
>
> and cassandra will not start. I get the same error on all the nodes in the ring.
>
> Thoughts?
>
> Thanks,
>
> Jim
Re: Why don't you start off with a "single & small" Cassandra server as you usually do it with MySQL ?
On 2013-08-27, at 6:04 AM, Aklin_81 wrote:
> For any website just starting out, the load initially is minimal & grows at a slow pace. People usually start their MySQL-based sites with a single server (*that too a VPS, not a dedicated server) running as both app server and DB server, get quite far with this setup, & only as they feel the need do they separate the DB from the app server, giving it a separate VPS. This is what a startup expects things to be like while planning its resource procurement.
>
> But from what I have seen so far, it's something very different with Cassandra. People usually recommend starting out with at least a 3 node cluster (on dedicated servers) with lots & lots of RAM. 4GB or 8GB RAM is what they suggest to start with. So is it that Cassandra requires more hardware resources in comparison to MySQL for a website to deliver similar performance, serve similar load/traffic & the same amount of data? I understand the higher storage requirements of Cassandra due to replication, but what about other hardware resources?
>
> Can't we start off with Cassandra-based apps just like MySQL, starting with 1 or 2 VPSes & adding more whenever there's a need? Renting out dedicated servers with lots of RAM right from the beginning may be viable for very well funded startups, but not for all.

Yes you can, just make sure you do your homework, and evaluate and measure things.

MySQL is a row-oriented RDBMS. Cassandra is a distributed columnar key-value store. While both are "databases", they serve different use cases.

I think it's an illusion that a startup can "get by" on just a single virtual instance somewhere. It's certainly doable, but very risky. Doing that means that if the server catches on fire, your startup's data and other IP is lost. Any reasonable architecture in this day and age must account for such disasters.
Cassandra is built around failure-is-the-norm, and this is handled by encouraging multiple servers and an increased replication factor as the default. You can certainly scale that back down to a single machine if you want, provided you understand what risks you're taking.

Performance-wise, cassandra's quite fast even in a single-node scenario. But don't just take that at face value - do your own benchmarks using your use cases and workloads.
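As a starting point for such benchmarks, the stress tool that ships with Cassandra can drive a synthetic write/read workload. The binary location and flags vary between versions, so treat this as an illustrative sketch rather than a recipe:

```shell
# Hypothetical invocation of the bundled stress tool (circa 1.x it lived
# in tools/bin/stress); adjust host, op counts and thread counts to taste.
tools/bin/stress -d 127.0.0.1 -o insert -n 100000 -t 50   # write workload
tools/bin/stress -d 127.0.0.1 -o read   -n 100000 -t 50   # then read it back
```

Run it against hardware comparable to what you'd actually deploy on, with row sizes resembling your real data, or the numbers won't tell you much.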
Re: C language - cassandra
Hi Apostolis

I'm the author of libcassie, a C library for cassandra that wraps the C++ libcassandra library. It's in use in production where I work; however, it has not received much traction elsewhere as far as I know. You can get it here: https://github.com/minaguib/libcassandra/tree/kickstart-libcassie-0.7

It has not been updated for a while (for example, no CQL support, no pooling support). I've been waiting for either the thrift C glib interface to mature, or the thriftless CQL binary protocol to mature, before putting effort into updating/rewriting it. It might however satisfy your needs with its current functionality.

On 2013-05-17, at 10:42 AM, Apostolis Xekoukoulotakis wrote:
> Hello, new here. What are my options for using cassandra from a program written in C?
>
> A) Thrift has no documentation, so it will take me time to understand. Thrift also doesn't have a balancing pool, asking different nodes every time, which is a big problem.
>
> B) Should I use the hector (java) client and then send the data to my program with my own protocol? Seems like a lot of unnecessary work.
>
> Any other suggestions?
>
> --
> Sincerely yours,
> Apostolis Xekoukoulotakis
Re: Is this how to read the output of nodetool cfhistograms?
On 2013-01-22, at 8:59 AM, Brian Tarbox wrote:
> The output of this command seems to make no sense unless I think of it as 5 completely separate histograms that just happen to be displayed together.
>
> Using this example output, should I read it as: my reads all took either 1 or 2 sstables. And separately, I had write latencies of 3, 7, 19. And separately I had read latencies of 2, 8, 69, etc.?
>
> In other words... each row isn't really a row... i.e. on those 16033 reads from a single SSTable I didn't have 0 write latency, 0 read latency, 0 row size and 0 column count. Is that right?

Correct. A number in any of the metric columns is a count value bucketed in the offset on that row. There are no relationships between the other columns on the same row.

So your first row says "16033 reads were satisfied by 1 sstable". The other metrics for those reads (for example, their latency) are reflected in the histogram under "Read Latency", under various other bucketed offsets.

> Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
> 1          16033              0             0         0             0
> 2            303              0             0         0             1
> 3              0              0             0         0             0
> 4              0              0             0         0             0
> 5              0              0             0         0             0
> 6              0              0             0         0             0
> 7              0              0             0         0             0
> 8              0              0             2         0             0
> 10             0              0             0         0          6261
> 12             0              0             2         0           117
> 14             0              0             8         0             0
> 17             0              3            69         0           255
> 20             0              7           163         0             0
> 24             0             19          1369         0             0
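Since each column is its own histogram, per-column totals come from summing counts down that column independently of the others. A quick illustrative sanity check (sample numbers taken from the SSTables column of the output in this thread):

```shell
# Sum one histogram column: total reads = sum of the per-bucket counts.
# In the sample, offset 1 held 16033 reads and offset 2 held 303.
printf '1 16033\n2 303\n' | awk '{ total += $2 } END { print total }'
# prints 16336: 16033 reads hit 1 sstable, 303 reads hit 2 sstables
```

The same summation applied to the Read Latency column would give the total number of latency samples, which need not equal the SSTables total.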
Re: continue seeing "Finished hinted handoff of 0 rows to endpoint"
On 2012-11-24, at 10:37 AM, Chuan-Heng Hsiao wrote: > However, I continue seeing the following in /var/log/cassandra/system.log > > INFO [HintedHandoff:1] 2012-11-24 22:58:28,088 > HintedHandOffManager.java (line 296) Started hinted handoff for token: > 27949589543905115548813332729343195104 with IP: /192.168.0.10 > INFO [HintedHandoff:1] 2012-11-24 22:58:28,089 > HintedHandOffManager.java (line 392) Finished hinted handoff of 0 rows > to endpoint /192.168.0.10 > > every ten mins. See if https://issues.apache.org/jira/browse/CASSANDRA-4740 is relevant in your case.
Re: leveled compaction and tombstoned data
On 2012-11-08, at 1:12 PM, B. Todd Burruss wrote:
> we are having the problem where we have huge SSTABLEs with tombstoned data in them that is not being compacted soon enough (because size tiered compaction requires, by default, 4 like-sized SSTABLEs). this is using more disk space than we anticipated.
>
> we are very write heavy compared to reads, and we delete the data after N number of days (depends on the column family, but N is around 7 days)
>
> my question is would leveled compaction help to get rid of the tombstoned data faster than size tiered, and therefore reduce the disk space usage

From my experience, levelled compaction makes space reclamation after deletes even less predictable than size-tiered. The reason is that deletes, like all mutations, are just recorded into sstables. They enter level0, and get slowly, over time, promoted upwards to levelN. Depending on your *total* mutation volume vs. your data set size, this may be quite a slow process.

This is made even worse when the data you're deleting (say, an entire row worth several hundred kilobytes) is to be deleted by a small row-level tombstone. If the row is sitting in level 4, the tombstone won't impact it until enough data has pushed over all existing data in level3, level2, level1, level0.

Finally, to guard against the tombstone missing any data, the tombstone itself is not a candidate for removal (I believe even after gc_grace has passed) unless it has reached the highest populated level in levelled compaction. This means if you have 4 levels and issue a ton of deletes (even deletes that will never impact existing data), these tombstones are dead weight that cannot be purged until they hit level4.

For a write-heavy workload, I recommend you stick with size-tiered. You have several options at your disposal (compaction min/max thresholds, gc_grace) to move things along.
If that doesn't help, I've heard of some fairly reputable people doing some fairly blasphemous things (major compactions every night).
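The size-tiered knobs mentioned above can be turned at runtime; a sketch, with the keyspace and CF names purely illustrative:

```shell
# Lower the min threshold so like-sized sstables compact sooner
# (args: <keyspace> <cfname> <minthreshold> <maxthreshold>)
nodetool -h localhost setcompactionthreshold ks events 2 32

# Shrink gc_grace on the CF so tombstones become purgeable earlier
# (cassandra-cli syntax of that era; newer versions use ALTER TABLE in CQL)
echo "use ks; update column family events with gc_grace = 86400;" | cassandra-cli -h localhost
```

Lowering min_compaction_threshold trades extra compaction I/O for faster space reclamation, so watch disk throughput after changing it.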
Re: virtual memory of all cassandra-nodes is growing extremly since Cassandra 1.1.0
All our servers (cassandra and otherwise) get monitored with nagios + get many basic metrics graphed by pnp4nagios. This covers a large chunk of a box's health, as well as cassandra basics (specifically the pending tasks and JVM heap state). IMO it's not possible to clearly debug a cassandra issue if you don't have a good holistic view of the boxes' health (CPU, RAM, swap, disk throughput, etc.)

Separate from that we have an operational dashboard. It's a bunch of manually-defined RRD files and custom scripts that grab metrics, store, and graph the health of various layers in the infrastructure in an easy-to-digest way (for example, each data center gets a color scheme - stacked machines within multiple DCs can just be eyeballed). There we can see, for example, our total read volume, total write volume, struggling boxes, dynamic endpoint snitch reaction, etc...

Finally, almost all the software we write integrates with statsd + graphite. In graphite we have more metrics than we know what to do with, but it's better than the other way around. From there, for example, we can see cassandra's response time including things cassandra itself can't measure (network, thrift, etc), across the various client softwares that talk to it. Within graphite we have several dashboards defined (users make their own; some infrastructure components have shared dashboards.)

--
Mina Naguib :: Director, Infrastructure Engineering
Bloom Digital Platforms :: T 514.394.7951 #208
http://bloom-hq.com/

On 2012-08-01, at 3:43 PM, Greg Fausak wrote:
> Mina,
>
> Thanks for that post. Very interesting :-)
>
> What sort of things are you graphing? Standard *nix stuff (mem/cpu/etc)? Or do you have some hooks into the C* process (I saw something about port 1414 in the .yaml file).
>
> Best,
>
> -g
>
> On Thu, Jul 26, 2012 at 9:27 AM, Mina Naguib wrote:
>>
>> Hi Thomas
>>
>> On a modern 64bit server, I recommend you pay little attention to the virtual size.
It's made up of almost everything within the process's >> address space, including on-disk files mmap()ed in for zero-copy access. >> It's not unreasonable for a machine with N amount RAM to have a process >> whose virtual size is several times the value of N. That in and of itself >> is not problematic >> >> In a default cassandra 1.1.x setup, the bulk of that will be your sstables' >> data and index files. On linux you can invoke the "pmap" tool on the >> cassandra process's PID to see what's in there. Much of it will be >> anonymous memory allocations (the JVM heap itself, off-heap data structures, >> etc), but lots of it will be references to files on disk (binaries, >> libraries, mmap()ed files, etc). >> >> What's more important to keep an eye on is the JVM heap - typically >> statically allocated to a fixed size at cassandra startup. You can get info >> about its used/capacity values via "nodetool -h localhost info". You can >> also hook up jconsole and trend it over time. >> >> The other critical piece is the process's RESident memory size, which >> includes the JVM heap but also other off-heap data structures and >> miscellanea. Cassandra has recently been making more use of off-heap >> structures (for example, row caching via SerializingCacheProvider). This is >> done as a matter of efficiency - a serialized off-heap row is much smaller >> than a classical object sitting in the JVM heap - so you can do more with >> less. >> >> Unfortunately, in my experience, it's not perfect. They still have a cost, >> in terms of on-heap usage, as well as off-heap growth over time. 
>> >> Specifically, my experience with cassandra 1.1.0 showed that off-heap row >> caches incurred a very high on-heap cost (ironic) - see my post at >> http://mail-archives.apache.org/mod_mbox/cassandra-user/201206.mbox/%3c6feb097f-287b-471d-bea2-48862b30f...@bloomdigital.com%3E >> - as documented in that email, I managed that with regularly scheduled full >> GC runs via System.gc() >> >> I have, since then, moved away from scheduled System.gc() to scheduled row >> cache invalidations. While this had the same effect as System.gc() I >> described in my email, it eliminated the 20-30 second pause associated with >> it. It did however introduce (or may be I never noticed earlier), slow >> creep in memory usage outside of the heap. >> >> It's typical in my case for example for a process configured with 6G of JVM >> heap to start up, stabilize at 6.5 - 7GB RESident usage, th
Re: virtual memory of all cassandra-nodes is growing extremly since Cassandra 1.1.0
Hi Thomas

On a modern 64bit server, I recommend you pay little attention to the virtual size. It's made up of almost everything within the process's address space, including on-disk files mmap()ed in for zero-copy access. It's not unreasonable for a machine with N amount of RAM to have a process whose virtual size is several times the value of N. That in and of itself is not problematic.

In a default cassandra 1.1.x setup, the bulk of that will be your sstables' data and index files. On linux you can invoke the "pmap" tool on the cassandra process's PID to see what's in there. Much of it will be anonymous memory allocations (the JVM heap itself, off-heap data structures, etc), but lots of it will be references to files on disk (binaries, libraries, mmap()ed files, etc).

What's more important to keep an eye on is the JVM heap - typically statically allocated to a fixed size at cassandra startup. You can get info about its used/capacity values via "nodetool -h localhost info". You can also hook up jconsole and trend it over time.

The other critical piece is the process's RESident memory size, which includes the JVM heap but also other off-heap data structures and miscellanea. Cassandra has recently been making more use of off-heap structures (for example, row caching via SerializingCacheProvider). This is done as a matter of efficiency - a serialized off-heap row is much smaller than a classical object sitting in the JVM heap - so you can do more with less.

Unfortunately, in my experience, it's not perfect. They still have a cost, in terms of on-heap usage, as well as off-heap growth over time.

Specifically, my experience with cassandra 1.1.0 showed that off-heap row caches incurred a very high on-heap cost (ironic) - see my post at http://mail-archives.apache.org/mod_mbox/cassandra-user/201206.mbox/%3c6feb097f-287b-471d-bea2-48862b30f...@bloomdigital.com%3E - as documented in that email, I managed that with regularly scheduled full GC runs via System.gc()

I have, since then, moved away from scheduled System.gc() to scheduled row cache invalidations. While this had the same effect as the System.gc() I described in my email, it eliminated the 20-30 second pause associated with it. It did however introduce (or maybe I never noticed it earlier) a slow creep in memory usage outside of the heap.

It's typical in my case, for example, for a process configured with 6G of JVM heap to start up, stabilize at 6.5 - 7GB RESident usage, then creep up slowly throughout a week to the 10-11GB range. Depending on what else the box is doing, I've experienced the linux OOM killer killing cassandra as you've described, or heavy swap usage bringing everything down (we're latency-sensitive), etc..

And now for the good news. Since I've upgraded to 1.1.2:
1. There's no more need for regularly scheduled System.gc()
2. There's no more need for regularly scheduled row cache invalidation
3. The HEAP usage within the JVM is stable over time
4. The RESident size of the process appears also stable over time

Point #4 above is still pending as I only have 3 days of graphs since the upgrade, but they show promising results compared to the slope of the same graph before the upgrade to 1.1.2.

So my advice is give 1.1.2 a shot - just be mindful of https://issues.apache.org/jira/browse/CASSANDRA-4411

On 2012-07-26, at 2:18 AM, Thomas Spengler wrote:
> I saw this.
> All works fine up to version 1.1.0:
> the 0.8.x takes 5GB of memory on an 8GB machine
> the 1.0.x takes between 6 and 7 GB on an 8GB machine
> and the 1.1.0 takes all
>
> and it is a problem. For me it is no solution to wait for the OOM-killer from the linux kernel and restart the cassandra process.
>
> When my machine has less than 100MB ram available, then I have a problem.
>
> On 07/25/2012 07:06 PM, Tyler Hobbs wrote:
>> Are you actually seeing any problems from this? High virtual memory usage on its own really doesn't mean anything. See http://wiki.apache.org/cassandra/FAQ#mmap
>>
>> On Wed, Jul 25, 2012 at 1:21 AM, Thomas Spengler <thomas.speng...@toptarif.de> wrote:
>>
>>> No one has any idea?
>>>
>>> We tried:
>>> update to 1.1.2
>>> DiskAccessMode standard, indexAccessMode standard
>>> row_cache_size_in_mb: 0
>>> key_cache_size_in_mb: 0
>>>
>>> Our next try will be to change SerializingCacheProvider to ConcurrentLinkedHashCacheProvider.
>>> Any other proposals are welcome.
>>>
>>> On 07/04/2012 02:13 PM, Thomas Spengler wrote:
>>>> Hi @all,
>>>>
>>>> since our upgrade from cassandra 1.0.3 to 1.1.0 the virtual memory usage of the cassandra-nodes explodes
>>>>
>>>> our setup is:
>>>> * 5 - centos 5.8 nodes
>>>> * each 4 CPU's and 8 GB RAM
>>>> * each node holds about 100 GB on data
>>>> * each jvm uses 2GB Ram
>>>> * DiskAccessMode is standard, indexAccessMode is standard
>>>>
>>>> The memory usage grows up to the whole memory is used. Just for in
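The inspection steps Mina describes earlier in the thread (pmap on the PID, nodetool info for the heap) can be sketched as follows; the pgrep pattern is an assumption and may need adjusting for how your init scripts launch the JVM:

```shell
# Find the cassandra JVM (pattern is an assumption; adjust as needed)
pid=$(pgrep -f CassandraDaemon)

# Largest resident mappings in the address space: anonymous JVM memory
# plus mmap()ed sstable Data/Index files (pmap -x column 3 is RSS)
pmap -x "$pid" | sort -k3 -n | tail -20

# JVM heap used/capacity as reported by Cassandra itself
nodetool -h localhost info
```

Comparing the anonymous mappings against the configured heap size shows how much is off-heap; the file-backed mappings are the mmap()ed sstables that inflate virtual size harmlessly.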
High CPU usage as of 8pm eastern time
Hi folks

Our cassandra (and other java-based apps) started experiencing extremely high CPU usage as of 8pm eastern time (midnight UTC). The issue appears to be related to specific versions of java + linux + ntpd.

There are many solutions floating around on IRC, twitter, stackexchange, LKML. The simplest one that worked for us is simply to run this command on each affected machine:

    date; date `date +"%m%d%H%M%C%y.%S"`; date;

CPU drop was instantaneous - there was no need to restart the server, ntpd, or any of the affected JVMs.
Re: Random slow connects.
On 2012-06-14, at 10:38 AM, Henrik Schröder wrote:
> Hi everyone,
>
> We have a problem with our Cassandra cluster, and that is that sometimes it takes several seconds to open a new Thrift connection to the server. We had this issue when we ran on windows, and we have this issue now that we run on Ubuntu. We've had it with our old networking setup, and we have it with our new networking setup where we're running it over a dedicated gigabit network. Normally establishing a new connection is instant, but once in a while it seems like it's not accepting any new connections until three seconds have passed.
>
> We're of course running a connection-pooling client which mitigates this, since once a connection is established, it's rock solid.
>
> We tried switching the rpc_server_type to hsha, but that seems to have made the problem worse; we're seeing more connection timeouts because of this.
>
> For what it's worth, we're running Cassandra version 1.0.10 on Ubuntu, and our connection pool is configured to abort a connection attempt after two seconds, and each connection lives for six hours and then it's recycled. Under current load we do about 500 writes/s and 100 reads/s, we have 20 clients, but each has a very small connection pool of maybe up to 5 simultaneous connections against each Cassandra server. We see these connection issues maybe once a day, but always at random intervals.
>
> We've tried to get more information through Datastax Opscenter, the JMX console, and our own application monitoring and logging, but we can't see anything out of the ordinary. Sometimes, seemingly at random, it's just really slow to connect. We're all out of ideas. Does anyone here have suggestions on where to look and what to do next?

Have you ironed out non-cassandra potential causes? A constant 3 seconds sounds like it could be a timeout/retry somewhere. Do you contact cassandra via a hostname or an IP address? If via hostname, iron out DNS.
Either way, I'd fire up tcpdump on both the client and the server, and observe the TCP handshake. Specifically, see if the SYN packet is sent and received, whether the SYN-ACK is sent back right away and received, and the final ACK. If that looks good, then TCP-wise you're in good shape and the problem is in a higher layer (thrift). If not, see where the delay/drop/retry happens. If it's in the first packet, it may be a networking/routing issue. If in the second, it may be capacity at the server (investigate with lsof/netstat/JMX), etc..
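A capture along those lines might look like this; the interface name and the Thrift port (9160 was the default at the time) are assumptions to adjust for your setup:

```shell
# Show all packets to/from the Thrift port with inter-packet timing (-ttt).
# The first three packets of each new connection are the handshake:
# SYN, SYN-ACK, ACK - a gap or retransmit there points at the network,
# a clean handshake followed by silence points at a higher layer.
tcpdump -i eth0 -nn -ttt 'tcp port 9160'
```

Running the same capture simultaneously on client and server lets you tell whether a delayed SYN-ACK was sent late or delivered late.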
Re: memory issue on 1.1.0
Hi Wade

I don't know if your scenario matches mine, but I've been struggling with memory pressure in 1.x as well. I made the jump from 0.7.9 to 1.1.0, along with enabling compression and levelled compactions, so I don't know which specifically is the main culprit.

Specifically, all my nodes seem to "lose" heap memory. As parnew and CMS do their job, over any reasonable period of time, the "floor" of memory after a GC keeps rising. This is quite visible if you leave jconsole connected for a day or so, and manifests itself as a funny-looking cone like so: http://mina.naguib.ca/images/cassandra_jconsole.png

Once memory pressure reaches a point where the heap can't be maintained reliably below 75%, cassandra goes into survival mode - via a bunch of tunables in cassandra.yaml it'll do things like flush memtables, drop caches, etc - all of which, in my experience, especially with the recent off-heap data structures, exacerbate the problem.

I've been meaning, of course, to collect enough technical data to file a bug report, but haven't had the time. I have not yet tested 1.1.1 to see if it improves the situation.

What I have found, however, is a band-aid, which you see at the rightmost section of the graph in the screenshot I posted. That is simply to hit the "Perform GC" button in jconsole. It seems that a full System.gc() *DOES* reclaim heap memory that parnew and CMS fail to reclaim.

On my production cluster I have a full GC via JMX scheduled in a rolling fashion every 4 hours. It's extremely expensive (20-40 seconds of unresponsiveness) but is a necessary evil in my situation. Without it, my nodes enter a nasty spiral of constant flushing, constant compactions, high heap usage, instability and high latency.

On 2012-06-05, at 2:56 PM, Poziombka, Wade L wrote:
> Alas, upgrading to 1.1.1 did not solve my issue.
>
> -----Original Message-----
> From: Brandon Williams [mailto:dri...@gmail.com]
> Sent: Monday, June 04, 2012 11:24 PM
> To: user@cassandra.apache.org
> Subject: Re: memory issue on 1.1.0
>
> Perhaps the deletes: https://issues.apache.org/jira/browse/CASSANDRA-3741
>
> -Brandon
>
> On Sun, Jun 3, 2012 at 6:12 PM, Poziombka, Wade L wrote:
>> Running a very write intensive (new column, delete old column etc.) process and failing on memory. Log file attached.
>>
>> Curiously, when I add new data I have never seen this - I have in the past sent hundreds of millions of "new" transactions. It seems to be when I modify. My process is as follows:
>>
>> key slice to get columns to modify in batches of 100; in separate threads modify those columns. I advance the slice with the start key each time using the last key in the previous batch. Mutations done are: update a column value in one column family (token), delete column and add new column in another (pan).
>>
>> Runs well until after about 5 million rows, then it seems to run out of memory. Note that these column families are quite small.
>>
>> WARN [ScheduledTasks:1] 2012-06-03 17:49:01,558 GCInspector.java (line 145) Heap is 0.7967470834946492 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory.
>> Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
>> INFO [ScheduledTasks:1] 2012-06-03 17:49:01,559 StorageService.java (line 2772) Unable to reduce heap usage since there are no dirty column families
>> INFO [GossipStage:1] 2012-06-03 17:49:01,999 Gossiper.java (line 797) InetAddress /10.230.34.170 is now UP
>> INFO [ScheduledTasks:1] 2012-06-03 17:49:10,048 GCInspector.java (line 122) GC for ParNew: 206 ms for 1 collections, 7345969520 used; max is 8506048512
>> INFO [ScheduledTasks:1] 2012-06-03 17:49:53,187 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 12770 ms for 1 collections, 5714800208 used; max is 8506048512
>>
>> Keyspace: keyspace
>>     Read Count: 50042632
>>     Read Latency: 0.23157864418482224 ms.
>>     Write Count: 44948323
>>     Write Latency: 0.019460829472992797 ms.
>>     Pending Tasks: 0
>>     Column Family: pan
>>     SSTable count: 5
>>     Space used (live): 1977467326
>>     Space used (total): 1977467326
>>     Number of Keys (estimate): 16334848
>>     Memtable Columns Count: 0
>>     Memtable Data Size: 0
>>     Memtable Switch Count: 74
>>     Read Count: 14985122
>>     Read Latency: 0.408 ms.
>>     Write Count: 19972441
>>     Write Latency: 0.022 ms.
>>     Pending Tasks: 0
>>     Bloom Filter False Postives: 829
>>     Bloom Filter False Ratio: 0.00073
>>     Bloom Filter Space Used: 37048400
>>     Compacted row minimum size: 125
>>     Compac
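For reference, the kind of rolling scheduled full GC Mina mentions can be driven from cron via any JMX command-line client. jmxterm is one such tool; this invocation is an illustrative assumption (tool, jar name, and port), not a drop-in recipe:

```shell
# Trigger a full System.gc() in the Cassandra JVM over JMX
# (assumes the jmxterm uber-jar downloaded locally and JMX listening
# on Cassandra's default port 7199)
echo "run -b java.lang:type=Memory gc" | \
    java -jar jmxterm.jar -l localhost:7199 -n
```

Staggering the cron schedule across nodes keeps only one node at a time in the 20-40 second full-GC pause.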
Re: Cassandra C client implementation
Hi Vlad

I'm the author of libcassie. For what it's worth, it's in production where I work, consuming a heavily-used cassandra 0.7.9 cluster.

We do have plans to upgrade the cluster to 1.x, to benefit from all the improvements, CQL, etc., but that includes revising all our clients (across several programming languages). So, it's definitely on my todo list to address our C clients by either upgrading libcassie or possibly completely rewriting it. Currently it's a wrapper around the C++ parent project libcassandra. I haven't been fond of having that many layered abstractions, and the thrift Glib2 interface has definitely piqued my interest, so I'm leaning towards a complete rewrite. While we're at it, it would also be nice to have features like asynchronous modes for popular event loops, connection pooling, etc. Unfortunately, I have no milestones set for any of this, nor the time (currently) to experiment and proof-of-concept it.

I'd be curious to hear from other C hackers whether they've experimented with the thrift Glib2 interface and gotten a "hello world" to work against cassandra 1.x. Perhaps there's room for some code sharing/collaboration on a new library to supersede the existing libcassie+libcassandra.

On 2011-12-14, at 5:16 PM, Vlad Paiu wrote:

> Hello Eric,
> 
> We have that, thanks a lot for the contribution. The idea is to not play around with including C++ code in a C app if there's an alternative (the thrift g_libc).
> 
> Unfortunately, since thrift does not generate a skeleton for the glibc code, I don't know how to find out what the API functions are called, and guessing them is not going that well :)
> 
> I'll wait a little longer & see if anybody can help with the C thrift, or at least tell me it's not working. 
:) > > Regards, > Vlad > > Eric Tamme wrote: > >> On 12/14/2011 04:18 PM, Vlad Paiu wrote: >>> Hi, >>> >>> Just tried libcassie and seems it's not compatible with latest cassandra, >>> as even simple inserts and fetches fail with InvalidRequestException... >>> >>> So can anybody please provide a very simple example in C for connecting& >>> fetching columns with thrift ? >>> >>> Regards, >>> Vlad >>> >>> Vlad Paiu wrote: >>> >> >> Vlad, >> >> We have written a specific cassandra db module for usrloc with opensips >> and have open sourced it on github. We use the thrift generated c++ >> bindings and extern stuff to c. I spoke to bogdan about this a while >> ago, and gave him the github link, but here it is for your reference >> https://github.com/junction/db_jnctn_usrloc >> >> Hopefully that helps. I idle in #opensips too, just ask about >> cassandra in there and I'll probably see it. >> >> - Eric Tamme >>
Re: Peculiar imbalance affecting 2 machines in a 6 node cluster
Hi Aaron

Thank you very much for the reply and the pointers to the previous list discussions. The second one was particularly telling.

I'm happy to say that the problem is fixed, and it's so trivial it's quite embarrassing, but I'll state it here for the sake of the archives. There was an extra colon in the topology file, in the line defining IPLA3. It's just as visible in my prod config as it is in my example below ;-)

I'm guessing the parser splits tuples on ":", so it probably parsed the IPLA3 entry as ("DCLA", ":RAC1"), whose rack ":RAC1" differs from the others' "RAC1", and so the NTS did its thing distributing evenly between racks: IPLA3 got more of the data and IPLA2 got less.

I've fixed it, and the reads/s and writes/s immediately equalized. I'm now doing a round of repairs/compactions/cleanups to equalize the data load as well.

Unfortunately it's not easy in cassandra 0.7.8 to actually see the parsed topology state (unlike 0.8's nice ring output which shows the DC and rack), so I'm ashamed to say it took much longer than it should've to troubleshoot.

Thanks for your help.

On 2011-08-10, at 5:12 AM, aaron morton wrote:

> WRT the load imbalance, checking the basics: you've run cleanup after any token moves? Repair is running? Also, sometimes nodes get a bit bloated from repair and will settle down with compaction.
> 
> Your slightly odd tokens in the MTL DC are making it a little tricky to understand what's going on. But I'm trying to check if you've followed the multi-DC token selection here: http://wiki.apache.org/cassandra/Operations#Token_selection . Background about what can happen in a multi-DC deployment if the tokens are not right: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Replica-data-distributing-between-racks-td6324819.html
> 
> This is what you currently have….
> 
> DC: LA
> IPLA1   Up  Normal  34.57 GB  11.11%  0
> IPLA2   Up  Normal  17.55 GB  11.11%  56713727820156410577229101238628035242
> IPLA3   Up  Normal  51.37 GB  11.11%  113427455640312821154458202477256070485
> 
> DC: MTL
> IPMTL1  Up  Normal  34.43 GB  22.22%  37809151880104273718152734159085356828
> IPMTL2  Up  Normal  34.56 GB  22.22%  94522879700260684295381835397713392071
> IPMTL3  Up  Normal  34.71 GB  22.22%  151236607520417094872610936636341427313
> 
> Using the bump approach you would have:
> 
> IPLA1   0
> IPLA2   56713727820156410577229101238628035242
> IPLA3   113427455640312821154458202477256070484
> IPMTL1  1
> IPMTL2  56713727820156410577229101238628035243
> IPMTL3  113427455640312821154458202477256070485
> 
> Using the interleaving approach you would have:
> 
> IPLA1   0
> IPMTL1  28356863910078205288614550619314017621
> IPLA2   56713727820156410577229101238628035242
> IPMTL2  85070591730234615865843651857942052863
> IPLA3   113427455640312821154458202477256070484
> IPMTL3  141784319550391026443072753096570088105
> 
> The current setup in LA gives each node in LA 33% of the LA-local ring, which should be right; just checking.
> 
> If cleanup / repair / compaction is all good and you are confident the tokens are right, try poking around with nodetool getendpoints to see which nodes keys are sent to. Like you, I cannot see anything obvious in NTS that would cause load to be imbalanced if they are all in the same rack.
> 
> Cheers
> 
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 10 Aug 2011, at 11:24, Mina Naguib wrote:
> 
>> Hi everyone
>> 
>> I'm observing a very peculiar type of imbalance and I'd appreciate any help or ideas to try. This is on cassandra 0.7.8.
>> 
>> The original cluster was 3 machines in the DCMTL, equally balanced at 33.33% each and each holding roughly 34G.
>> 
>> Then, I added to it 3 machines in the LA data center. The ring is currently as follows (IP addresses redacted for clarity):
>> 
>> Address  Status  State   Load      Owns    Token
>>                                            151236607520417094872610936636341427313
>> IPLA1    Up      Normal  34.57 GB  11.11%  0
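For the archives, the mis-parse can be sketched in a few lines. This is a hypothetical parser splitting on ":" as Mina guesses above (the real snitch may tokenize differently, but the effect is the same): the rack extracted from the doubled-colon entry differs from the others, which is enough for rack-aware placement to treat IPLA3 as a rack of its own.

```python
# Rough sketch of how a ':'-split parser reads the two entries; the rack
# extracted from the doubled-colon line differs from the others, which is
# enough for NTS rack-awareness to treat IPLA3 as its own rack.
def parse_topology_line(line):
    node, dc, rack = line.split(":", 2)
    return node, dc, rack

print(parse_topology_line("IPLA2:DCLA:RAC1"))   # ('IPLA2', 'DCLA', 'RAC1')
print(parse_topology_line("IPLA3:DCLA::RAC1"))  # ('IPLA3', 'DCLA', ':RAC1')
```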
Peculiar imbalance affecting 2 machines in a 6 node cluster
Hi everyone

I'm observing a very peculiar type of imbalance and I'd appreciate any help or ideas to try. This is on cassandra 0.7.8.

The original cluster was 3 machines in the DCMTL, equally balanced at 33.33% each and each holding roughly 34G. Then, I added to it 3 machines in the LA data center. The ring is currently as follows (IP addresses redacted for clarity):

Address  Status  State   Load      Owns    Token
                                           151236607520417094872610936636341427313
IPLA1    Up      Normal  34.57 GB  11.11%  0
IPMTL1   Up      Normal  34.43 GB  22.22%  37809151880104273718152734159085356828
IPLA2    Up      Normal  17.55 GB  11.11%  56713727820156410577229101238628035242
IPMTL2   Up      Normal  34.56 GB  22.22%  94522879700260684295381835397713392071
IPLA3    Up      Normal  51.37 GB  11.11%  113427455640312821154458202477256070485
IPMTL3   Up      Normal  34.71 GB  22.22%  151236607520417094872610936636341427313

The bump in the 3 MTL nodes (22.22%) is in anticipation of 3 more machines in yet another data center, but they're not ready yet to join the cluster. Once that third DC joins, all nodes will be at 11.11%. However, I don't think this is related.

The problem I'm currently observing is visible in the LA machines, specifically IPLA2 and IPLA3. IPLA2 has 50% of the expected volume, and IPLA3 has 150% of the expected volume. Putting their load side by side shows the peculiar ratio of 2:1:3 between the 3 LA nodes: 34.57 / 17.55 / 51.37 (the same 2:1:3 ratio is reflected in our internal tools trending reads/second and writes/second).

I've tried several iterations of compactions/cleanups to no avail. 
In terms of config, this is the main keyspace:

Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
Options: [DCMTL:2, DCLA:2]

And this is the cassandra-topology.properties file (IPs again redacted for clarity):

IPMTL1:DCMTL:RAC1
IPMTL2:DCMTL:RAC1
IPMTL3:DCMTL:RAC1
IPLA1:DCLA:RAC1
IPLA2:DCLA:RAC1
IPLA3:DCLA::RAC1
IPLON1:DCLON:RAC1
IPLON2:DCLON:RAC1
IPLON3:DCLON:RAC1
# default for unknown nodes
default=DCBAD:RACBAD

One thing that did occur to me while reading the source code for the NetworkTopologyStrategy's calculateNaturalEndpoints is that it prefers placing data on different racks. Since all my machines are defined as being in the same rack, I believe the 2-pass approach would still yield balanced placement. However, just to test, I modified the topology file live to specify that IPLA1, IPLA2 and IPLA3 are in 3 different racks, and sure enough I saw immediately that the reads/second and writes/second equalized to the expected fair volume (I quickly reverted that change).

So, it seems somehow related to rack awareness, but I've been racking my brain and I can't figure out how/why, or why the three MTL machines are not affected the same way. If the solution is to specify them in different racks and run repair on everything, I'm okay with that, but I hate doing that without first understanding *why* the current behavior is the way it is.

Any ideas would be hugely appreciated. Thank you.
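For the archives, the rack-awareness effect can be simulated in a few lines. This is a deliberately simplified model of NetworkTopologyStrategy's per-DC walk (the real placement logic is more involved), with hypothetical rack labels, assuming one of the three LA nodes ends up on a rack of its own, which is exactly what the follow-up in this thread found had happened via a mis-parsed topology entry:

```python
# Simplified model of NTS placement within one DC at RF=2: walk the ring
# clockwise from the key's position, take the first node, then prefer the
# next node on a rack not yet used. If IPLA3 alone sits on a different
# rack (e.g. a mis-parsed ":RAC1"), it receives a replica of every key,
# while IPLA2 is skipped whenever IPLA1 holds the first copy.
NODES = [  # (position on a unit ring, name, rack)
    (0 / 3, "IPLA1", "RAC1"),
    (1 / 3, "IPLA2", "RAC1"),
    (2 / 3, "IPLA3", ":RAC1"),  # the odd one out
]

def replicas(key_pos, rf=2):
    start = next((i for i, (t, _, _) in enumerate(NODES) if t >= key_pos), 0)
    order = NODES[start:] + NODES[:start]
    chosen = [order[0]]
    for cand in order[1:]:          # first pass: distinct racks only
        if len(chosen) == rf:
            break
        if cand[2] not in {r for _, _, r in chosen}:
            chosen.append(cand)
    for cand in order[1:]:          # second pass: fill up regardless of rack
        if len(chosen) == rf:
            break
        if cand not in chosen:
            chosen.append(cand)
    return [name for _, name, _ in chosen]

counts = {"IPLA1": 0, "IPLA2": 0, "IPLA3": 0}
for i in range(3000):               # keys spread evenly around the ring
    for name in replicas((i + 0.5) / 3000):
        counts[name] += 1
print(counts)  # {'IPLA1': 2000, 'IPLA2': 1000, 'IPLA3': 3000} -- the 2:1:3 ratio
```

With all three nodes on the same rack the first pass never fires and placement is an even ring walk, which matches the observation that declaring three distinct racks (or fixing the lone odd rack) equalizes the traffic.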
Re: Read latency is over 1 minute on a column family with 400,000 rows
Did you run that verbatim? Or did you appropriately substitute "keyspace" and "columnfamily1"? Also, anything in cassandra's log file (system.log)? Compacting 150GB over 2057 SSTables should take a fair amount of time...

On 2011-07-31, at 11:47 PM, myreasoner wrote:

> Thanks.
> 
> I did *./nodetool -h localhost compact keyspace columnfamily1*. But it came back really quickly and the cfstats don't seem to change much.
> 
> After compaction:
>    Column Family: Fingerprint
>    SSTable count: 2057
>    Space used (live): 164351343468
>    Space used (total): 164742957014
>    Memtable Columns Count: 33224
>    Memtable Data Size: 22410133
>    Memtable Switch Count: 378
>    Read Count: 7
>    Read Latency: NaN ms.
>    Write Count: 30972
>    Write Latency: 1.579 ms.
>    Pending Tasks: 0
>    Key cache capacity: 20
>    Key cache size: 8157
>    Key cache hit rate: 0.0
>    Row cache: disabled
>    Compacted row minimum size: 104
>    Compacted row maximum size: 315852
>    Compacted row mean size: 33846
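One quick way to confirm a compaction actually did something is to diff a couple of cfstats fields before and after. A throwaway parser (a hypothetical helper, with field names as they appear in the 0.7/0.8-era cfstats output quoted above) might look like:

```python
import re

# Hypothetical helper: pull numeric fields out of `nodetool cfstats` output
# so before/after snapshots can be compared mechanically.
def cfstats_fields(text, fields=("SSTable count", "Space used (live)")):
    out = {}
    for name in fields:
        m = re.search(re.escape(name) + r":\s*(\d+)", text)
        if m:
            out[name] = int(m.group(1))
    return out

before = "SSTable count: 2057\nSpace used (live): 164351343468\n"
print(cfstats_fields(before))
# {'SSTable count': 2057, 'Space used (live)': 164351343468}
```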
Re: Cassandra 0.7.8 and 0.8.1 fail when major compaction on 37GB database
From experience with similar-sized data sets, 1.5GB may be too little. Recently I bumped our Java heap limit from 3GB to 4GB to get past an OOM doing a major compaction.

Check "nodetool -h localhost info" while the compaction is running for a simple view into the memory state. If you can, also hook in jconsole and you'll get a better view, over time, of how cassandra's memory usage trends, the effect of GC, and the pressure of various operations such as compactions.

On 2011-07-24, at 8:08 AM, lebron james wrote:

> Hi, please help me with my problem. For better performance I turned off compaction and ran massive inserts; after the database reached 37GB I stopped the inserts and started compaction with "NodeTool compaction Keyspace CFamily". After half an hour of work cassandra fell over with an "Out of memory" error. I gave 1500M to the JVM; all parameters in the yaml file are defaults. Testing OSes are ubuntu 11.04 and windows server 2008 dc edition. Thanks!
Re: Equalizing nodes storage load
Hi Peter

That was precisely it. Thank you :)

Doing a major compaction on the heaviest node (74.65GB) reduced it to 33.55GB. I'll compact the other 2 nodes as well; I anticipate they will also settle around that size.

On 2011-07-22, at 5:00 PM, Peter Tillotson wrote:

> I'm not sure if this is the answer, but try major compaction on each node for each column family. I suspect the data shuffle has left quite a few deleted keys which may get cleaned out on major compaction. As I remember, major compaction doesn't happen automatically in 7.x; I'm not sure if it is triggered by repair.
> 
> p
> 
> On 22/07/11 16:08, Mina Naguib wrote:
>> 
>> I'm trying to balance Load ( 41.98GB vs 59.4GB vs 74.65GB )
>> 
>> Owns looks ok. They're all 33.33%, which is what I want. It was calculated simply by 2^127 / num_nodes. The only reason the first one doesn't start at 0 is that I've actually carved the ring planning for 9 machines (2 new data centers of 3 machines each). However, only 1 data center (DCMTL) is currently up.
>> 
>> On 2011-07-22, at 10:56 AM, Sasha Dolgy wrote:
>> 
>>> are you trying to balance "load" or "owns"? "owns" looks fine ... 33.33% each ... which to me says balanced.
>>> 
>>> how did you calculate your tokens?
>>> 
>>> On Fri, Jul 22, 2011 at 4:37 PM, Mina Naguib wrote:
>>>> 
>>>> Address      Status  State   Load      Owns    Token
>>>> xx.xx.x.105  Up      Normal  41.98 GB  33.33%  37809151880104273718152734159085356828
>>>> xx.xx.x.107  Up      Normal  59.4 GB   33.33%  94522879700260684295381835397713392071
>>>> xx.xx.x.18   Up      Normal  74.65 GB  33.33%  151236607520417094872610936636341427313
Re: Equalizing nodes storage load
I'm trying to balance Load ( 41.98GB vs 59.4GB vs 74.65GB ).

Owns looks ok. They're all 33.33%, which is what I want. It was calculated simply by 2^127 / num_nodes. The only reason the first one doesn't start at 0 is that I've actually carved the ring planning for 9 machines (2 new data centers of 3 machines each). However, only 1 data center (DCMTL) is currently up.

On 2011-07-22, at 10:56 AM, Sasha Dolgy wrote:

> are you trying to balance "load" or "owns"? "owns" looks fine ... 33.33% each ... which to me says balanced.
> 
> how did you calculate your tokens?
> 
> On Fri, Jul 22, 2011 at 4:37 PM, Mina Naguib wrote:
>> 
>> Address      Status  State   Load      Owns    Token
>> xx.xx.x.105  Up      Normal  41.98 GB  33.33%  37809151880104273718152734159085356828
>> xx.xx.x.107  Up      Normal  59.4 GB   33.33%  94522879700260684295381835397713392071
>> xx.xx.x.18   Up      Normal  74.65 GB  33.33%  151236607520417094872610936636341427313
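For the archives, the 2^127 / num_nodes carving described above can be reproduced exactly. RandomPartitioner's ring spans 0..2^127, and the three tokens in the ring output turn out to be slots 2, 5 and 8 of a 9-way split (multiplying before the floor-division keeps the slots from drifting):

```python
# Evenly spaced RandomPartitioner tokens: floor(i * 2**127 / num_nodes).
def balanced_tokens(num_nodes, ring_size=2**127):
    # multiply before the floor-division to avoid accumulating truncation
    return [i * ring_size // num_nodes for i in range(num_nodes)]

nine = balanced_tokens(9)
print(nine[2])  # 37809151880104273718152734159085356828
print(nine[5])  # 94522879700260684295381835397713392071
print(nine[8])  # 151236607520417094872610936636341427313
```

These match the three live nodes' tokens, consistent with the ring having been carved for 9 planned machines with only the DCMTL trio occupying its slots so far.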
Equalizing nodes storage load
Hi everyone

I've been struggling trying to get the data volume ("load") to equalize across a balanced cluster, and I'm not sure what else I can try.

Background: This was originally a 5-node cluster. We re-balanced the 3 faster machines across the ring, and decommissioned the 2 older ones. We also upgraded cassandra a few times, from 0.7.4 through 0.7.5 and 0.7.6-2 to 0.7.7. The ring currently looks like so:

Address      Status  State   Load      Owns    Token
                                               151236607520417094872610936636341427313
xx.xx.x.105  Up      Normal  41.98 GB  33.33%  37809151880104273718152734159085356828
xx.xx.x.107  Up      Normal  59.4 GB   33.33%  94522879700260684295381835397713392071
xx.xx.x.18   Up      Normal  74.65 GB  33.33%  151236607520417094872610936636341427313

What I've tried so far:

1. Running repair on each node (sequentially, of course).
2. Running cleanup on the largest node (.18), hoping it would shed unneeded data.

The repairs helped a bit by slightly bumping up the load of the first 2 machines, but the cleanup on the 3rd failed to reduce its data volume. So, at this point, I'm out of ideas.

In terms of tpstats metrics, each of the 3 nodes is serving roughly the same volume of ReadStage and MutationStage, so they're balanced in that respect. However, I'm concerned about the imbalance of the data load ( 24% / 34% / 42% ) and being unable to equalize it. 
For the record, there's only 1 keyspace of meaningful data in the cluster, with the following schema settings:

Keyspace: ZZ:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
  Options: [DCMTL:2]
  Column Families:
    ColumnFamily: AA
      default_validation_class: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period in seconds: 256000.0/0
      Key cache size / save period in seconds: 20.0/14400
      Memtable thresholds: 0.88125/1440/188 (millions of ops/minutes/MB)
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.1
      Built indexes: []
    ColumnFamily: B (Super)
      default_validation_class: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type/org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period in seconds: 75000.0/0
      Key cache size / save period in seconds: 20.0/14400
      Memtable thresholds: 0.88125/1440/188 (millions of ops/minutes/MB)
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.25
      Built indexes: []

Any tips or ideas to help get the nodes' load equalized would be highly appreciated. If this is normal behaviour and I shouldn't be trying too hard to get it equalized, I'd appreciate any notes/links explaining why.

Thank you.