Re: Testing row cache feature in trunk: write should put record in cache
We don't use native Java serialization for anything but the on-disk BitSets in our bloom filters (those are deserialized once at startup, so the overhead doesn't matter), btw. We're talking about adding compression after https://issues.apache.org/jira/browse/CASSANDRA-674.

On Sat, Feb 20, 2010 at 3:12 PM, Tatu Saloranta wrote:
> On Fri, Feb 19, 2010 at 11:44 AM, Weijun Li wrote:
>> I see. How much is the overhead of Java serialization? Does it slow down the
>> system a lot? It seems to be a tradeoff between CPU usage and memory.
>
> This should be relatively easy to measure as a stand-alone thing, or
> maybe even from profiler stack traces.
>
> If native Java serialization is used, there may be more efficient
> alternatives, depending on the data -- default serialization is highly
> inefficient for small object graphs (like individual objects), but OK
> for larger graphs. This is because much of the class metadata is
> included, so the result is very self-contained.
>
> Beyond default serialization, there are more efficient general-purpose
> Java serialization frameworks, like Kryo, or fast(est) JSON-based
> serializers (Jackson); see
> http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
> for some idea of the alternatives.
>
> In fact, one interesting idea would be to further trade some CPU for
> less memory by using fast compression (like LZF). I hope to experiment
> with this idea at some point. The challenge is that this would help
> most with a clustered scheme (compressing more than one distinct
> item), which is much trickier to make work. Compression does OK with
> individual items, but the real boost comes from redundancy between
> similar items.
>
> -+ Tatu +-
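The point about redundancy is easy to demonstrate. LZF has no JDK implementation, so the sketch below uses java.util.zip.Deflater purely as a stand-in to show the shape of the trade-off; the class and method names are illustrative, not from Cassandra.

```java
import java.util.Arrays;
import java.util.zip.Deflater;

public class CacheCompressionSketch {
    // LZF is not in the JDK, so Deflater (at its fastest level) stands in
    // here purely to show the shape of the trade-off.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64];
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) {
        // Redundancy between similar items is where the real boost is:
        // 4 KB of identical bytes collapses to a handful of bytes, while a
        // single small, unique item barely shrinks at all.
        byte[] similarItems = new byte[4096];
        System.out.println(similarItems.length + " -> "
                + compress(similarItems).length + " bytes");
    }
}
```

This is why a clustered scheme (compressing a block of similar rows together) pays off so much more than compressing each cached item on its own.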
Re: Testing row cache feature in trunk: write should put record in cache
On Fri, Feb 19, 2010 at 11:44 AM, Weijun Li wrote:
> I see. How much is the overhead of Java serialization? Does it slow down the
> system a lot? It seems to be a tradeoff between CPU usage and memory.

This should be relatively easy to measure as a stand-alone thing, or maybe even from profiler stack traces.

If native Java serialization is used, there may be more efficient alternatives, depending on the data -- default serialization is highly inefficient for small object graphs (like individual objects), but OK for larger graphs. This is because much of the class metadata is included, so the result is very self-contained.

Beyond default serialization, there are more efficient general-purpose Java serialization frameworks, like Kryo, or fast(est) JSON-based serializers (Jackson); see http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking for some idea of the alternatives.

In fact, one interesting idea would be to further trade some CPU for less memory by using fast compression (like LZF). I hope to experiment with this idea at some point. The challenge is that this would help most with a clustered scheme (compressing more than one distinct item), which is much trickier to make work. Compression does OK with individual items, but the real boost comes from redundancy between similar items.

-+ Tatu +-
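The metadata overhead really is easy to measure stand-alone. A minimal sketch (the Row class here is illustrative, not Cassandra's): default serialization writes stream headers and full class descriptors that dwarf a small payload, while writing the fields by hand does not.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationOverhead {
    // A small serializable value object, similar in spirit to one cached row.
    static class Row implements Serializable {
        private static final long serialVersionUID = 1L;
        final long key;
        final byte[] value;
        Row(long key, byte[] value) { this.key = key; this.value = value; }
    }

    // Size of the default Java serialization of one Row.
    static int javaSerializedSize(Row r) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(r);
            oos.close();
            return bos.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Size when the fields are written directly, skipping class metadata.
    static int handRolledSize(Row r) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeLong(r.key);
            dos.writeInt(r.value.length);
            dos.write(r.value);
            dos.flush();
            return bos.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Row row = new Row(42L, new byte[16]);
        // Payload is 8 + 4 + 16 = 28 bytes; the default form also carries
        // stream headers and class descriptors for Row and byte[].
        System.out.println("default: " + javaSerializedSize(row) + " bytes");
        System.out.println("by hand: " + handRolledSize(row) + " bytes");
    }
}
```

For a large graph the descriptors are written once and amortized away, which is why the default form is only painful for small, individual objects.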
Re: StackOverflowError on high load
If OPP is configured w/ imbalanced ranges (or less balanced than RP) then that would explain it. OPP is actually slightly faster in terms of raw speed.

On Sat, Feb 20, 2010 at 2:31 PM, Ran Tavory wrote:
> interestingly, I ran the same load but this time with a random partitioner
> and, although from time to time test2 was a little behind with its
> compaction task, it did not crash and was able to eventually close the gaps
> that were opened.
> Does this make sense? Is there a reason why the random partitioner is less
> likely to be faulty in this scenario? The scenario is about 1300
> writes/sec of small amounts of data to a single CF on a cluster with two
> nodes and no replication. With the order-preserving partitioner, after a few
> hours of load the compaction pool is behind on one of the hosts and
> eventually this host crashes, but with the random partitioner it doesn't
> crash.
> thanks
>
> On Sat, Feb 20, 2010 at 6:27 AM, Jonathan Ellis wrote:
>> looks like test1 started gc storming, so test2 treats it as dead and
>> starts doing hinted handoff for it, which increases test2's load, even
>> though test1 is not completely dead yet.
>>
>> On Thu, Feb 18, 2010 at 1:16 AM, Ran Tavory wrote:
>>> I found another interesting graph, attached.
>>> I looked at the write-count and write-latency of the CF I'm writing to,
>>> and I see a few interesting things:
>>> 1. the host test2 crashed at 18:00
>>> 2. At 16:00, after a few hours of load, both hosts dropped their
>>> write-count. test1 (which did not crash) started slowing down first and
>>> then test2 slowed.
>>> 3. At 16:00 I start seeing high write-latency on test2 only. This takes
>>> about 2h until finally at 18:00 it crashes.
>>> Does this help?
>>>
>>> On Thu, Feb 18, 2010 at 7:44 AM, Ran Tavory wrote:
>>>> I ran the process again and after a few hours the same node crashed the
>>>> same way. Now I can tell for sure this is indeed what Jonathan proposed --
>>>> the data directory needs to be 2x of what it is, but it looks like a
>>>> design problem; how large do I need to tell my admin to set it then?
>>>> Here's what I see when the server crashes:
>>>> $ df -h /outbrain/cassandra/data/
>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>> /dev/mapper/cassandra-data
>>>>                        97G   46G   47G  50% /outbrain/cassandra/data
>>>> The directory is 97G and when the host crashes it's at 50% use.
>>>> I'm also monitoring various JMX counters and I see that COMPACTION-POOL
>>>> PendingTasks grows for a while on this host (not on the other host, btw,
>>>> which is fine, just this host) and then stays flat for 3 hours. After 3
>>>> hours of flat it crashes. I'm attaching the graph.
>>>> When I restart cassandra on this host (no change to file allocation
>>>> size, just a restart) it does manage to compact the data files pretty
>>>> fast, so after a minute I get 12% use, so I wonder what made it crash
>>>> before that doesn't now? (could be the load that's not running now)
>>>> $ df -h /outbrain/cassandra/data/
>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>> /dev/mapper/cassandra-data
>>>>                        97G   11G   82G  12% /outbrain/cassandra/data
>>>> The question is what size does the data directory need to be? It's not
>>>> 2x the size of the data I expect to have (I only have 11G of real data
>>>> after compaction and the dir is 97G, so it should have been enough). If
>>>> it's 2x of something dynamic that keeps growing and isn't bounded then
>>>> it'll just grow infinitely, right? What's the bound?
>>>> Alternatively, what JMX counter thresholds are the best indicators for
>>>> the crash that's about to happen?
>>>> Thanks
>>>>
>>>> On Wed, Feb 17, 2010 at 9:00 PM, Tatu Saloranta wrote:
>>>>> On Wed, Feb 17, 2010 at 6:40 AM, Ran Tavory wrote:
>>>>>> If it's the data directory, then I have a pretty big one. Maybe it's
>>>>>> something else
>>>>>> $ df -h /outbrain/cassandra/data/
>>>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>>>> /dev/mapper/cassandra-data
>>>>>>                        97G   11G   82G  12% /outbrain/cassandra/data
>>>>>
>>>>> Perhaps a temporary file? The JVM defaults to /tmp, which may be on a
>>>>> smaller (root) partition?
>>>>>
>>>>> -+ Tatu +-
Re: StackOverflowError on high load
interestingly, I ran the same load but this time with a random partitioner and, although from time to time test2 was a little behind with its compaction task, it did not crash and was able to eventually close the gaps that were opened.

Does this make sense? Is there a reason why the random partitioner is less likely to be faulty in this scenario? The scenario is about 1300 writes/sec of small amounts of data to a single CF on a cluster with two nodes and no replication. With the order-preserving partitioner, after a few hours of load the compaction pool is behind on one of the hosts and eventually this host crashes, but with the random partitioner it doesn't crash.

thanks

On Sat, Feb 20, 2010 at 6:27 AM, Jonathan Ellis wrote:
> looks like test1 started gc storming, so test2 treats it as dead and
> starts doing hinted handoff for it, which increases test2's load, even
> though test1 is not completely dead yet.
>
> On Thu, Feb 18, 2010 at 1:16 AM, Ran Tavory wrote:
>> I found another interesting graph, attached.
>> I looked at the write-count and write-latency of the CF I'm writing to,
>> and I see a few interesting things:
>> 1. the host test2 crashed at 18:00
>> 2. At 16:00, after a few hours of load, both hosts dropped their
>> write-count. test1 (which did not crash) started slowing down first and
>> then test2 slowed.
>> 3. At 16:00 I start seeing high write-latency on test2 only. This takes
>> about 2h until finally at 18:00 it crashes.
>> Does this help?
>>
>> On Thu, Feb 18, 2010 at 7:44 AM, Ran Tavory wrote:
>>> I ran the process again and after a few hours the same node crashed the
>>> same way. Now I can tell for sure this is indeed what Jonathan proposed --
>>> the data directory needs to be 2x of what it is, but it looks like a
>>> design problem; how large do I need to tell my admin to set it then?
>>> Here's what I see when the server crashes:
>>> $ df -h /outbrain/cassandra/data/
>>> Filesystem            Size  Used Avail Use% Mounted on
>>> /dev/mapper/cassandra-data
>>>                        97G   46G   47G  50% /outbrain/cassandra/data
>>> The directory is 97G and when the host crashes it's at 50% use.
>>> I'm also monitoring various JMX counters and I see that COMPACTION-POOL
>>> PendingTasks grows for a while on this host (not on the other host, btw,
>>> which is fine, just this host) and then stays flat for 3 hours. After 3
>>> hours of flat it crashes. I'm attaching the graph.
>>> When I restart cassandra on this host (no change to file allocation
>>> size, just a restart) it does manage to compact the data files pretty
>>> fast, so after a minute I get 12% use, so I wonder what made it crash
>>> before that doesn't now? (could be the load that's not running now)
>>> $ df -h /outbrain/cassandra/data/
>>> Filesystem            Size  Used Avail Use% Mounted on
>>> /dev/mapper/cassandra-data
>>>                        97G   11G   82G  12% /outbrain/cassandra/data
>>> The question is what size does the data directory need to be? It's not
>>> 2x the size of the data I expect to have (I only have 11G of real data
>>> after compaction and the dir is 97G, so it should have been enough). If
>>> it's 2x of something dynamic that keeps growing and isn't bounded then
>>> it'll just grow infinitely, right? What's the bound?
>>> Alternatively, what JMX counter thresholds are the best indicators for
>>> the crash that's about to happen?
>>> Thanks
>>>
>>> On Wed, Feb 17, 2010 at 9:00 PM, Tatu Saloranta wrote:
>>>> On Wed, Feb 17, 2010 at 6:40 AM, Ran Tavory wrote:
>>>>> If it's the data directory, then I have a pretty big one. Maybe it's
>>>>> something else
>>>>> $ df -h /outbrain/cassandra/data/
>>>>> Filesystem            Size  Used Avail Use% Mounted on
>>>>> /dev/mapper/cassandra-data
>>>>>                        97G   11G   82G  12% /outbrain/cassandra/data
>>>>
>>>> Perhaps a temporary file? The JVM defaults to /tmp, which may be on a
>>>> smaller (root) partition?
>>>>
>>>> -+ Tatu +-
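Given the rule of thumb discussed above (compaction may temporarily need roughly as much free space as the live data it rewrites), one pragmatic guard is a headroom check that alerts before the disk runs out. A minimal sketch; the path comes from the df output in the thread, and the 2x rule is the assumption under discussion, not a confirmed bound:

```java
import java.io.File;

public class CompactionHeadroomCheck {
    // Assumption from the thread: a compaction pass may temporarily need
    // about as much free space as the live data it rewrites (2x in total).
    static boolean hasCompactionHeadroom(long liveDataBytes, long freeBytes) {
        return freeBytes >= liveDataBytes;
    }

    public static void main(String[] args) {
        // Path taken from the df output above; adjust to your layout. Note
        // that "used" here counts everything on the partition, not only
        // live SSTables, so this errs on the conservative side.
        File dataDir = new File("/outbrain/cassandra/data");
        long free = dataDir.getUsableSpace();
        long used = dataDir.getTotalSpace() - free;
        System.out.printf("used=%dG free=%dG headroom=%b%n",
                used >> 30, free >> 30, hasCompactionHeadroom(used, free));
    }
}
```

Against the numbers in the thread this also matches the observed crash: at 46G used with 47G free the check barely passes, and any further growth before compaction catches up tips it over.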
Re: cassandra freezes
On Fri, Feb 19, 2010 at 7:40 PM, Santal Li wrote:
> I meet almost the same thing as you. When I do some benchmark write tests,
> sometimes one Cassandra node will freeze, and the other nodes will consider
> it shut down, then up again after 30+ seconds. I am using 5 nodes, each
> node with 8G mem for the Java heap.
>
> From my investigation, it was caused by the GC thread: I started JConsole
> and monitored the memory heap usage, and each time the GC happened, heap
> usage dropped from 6G to 1G; checking the Cassandra log, I found the
> freezes happened at exactly the same times.

With such a big heap, old-generation GCs can definitely take a while. With just a 1.5 gig heap, and with somewhat efficient parallel collection (on a multi-core machine), we had trouble keeping collections below 5 seconds. But this depends a lot on the survival ratio -- the less garbage there is (and the more live objects), the slower things are. And the relationship is super-linear too, so processing 6 gigs (or whatever part of that is old-generation space) can take a long time.

It is certainly worth keeping in mind that more memory generally means longer GC collection times. But Jonathan is probably right in that this alone would not cause the appearance of a freeze -- rather, overload, with GC blocking processing AND accumulation of new requests, sounds more plausible. It is still good to consider both parts of the puzzle: preventing the overflow that can turn a bad situation into a catastrophe, and trying to reduce the impact of GC.

> So I think when using huge memory (>2G), maybe one needs a different GC
> strategy than the default one provided by the Cassandra launch script.
> Has anyone else met this situation? Can you please provide some guidance?

There are many ways to change GC settings, specifically to reduce the impact of old-gen collections (young-generation ones are less often problematic, although they can be tuned as well). Often there is a trade-off between the frequency and the impact of GC: to simplify, the less often you configure it to occur (for example by increasing the heap), the more impact it usually has when it does occur. Concurrent collectors (like the traditional CMS) are good for steady state, and can keep old-gen GC from occurring for hours at a time (doing incremental, concurrent "partial" collections), but they can also lead to a GC-from-hell when they must fall back to a full GC, since that one is of the stop-the-world kind.

There is a ton of information on how to deal with GC settings, but unfortunately it is a bit of a black art and very dependent on your specific use case. There being dozens (more than a hundred, I think) of different switches makes it trickier still, since you also need to learn which ones matter, and in which combinations.

One somewhat counter-intuitive suggestion is to reduce the size of the heap, at least with respect to caching. Mostly try to just keep the live working set in memory, and don't do caching inside the Java process. Operating systems are pretty good at caching disk pages; and if the storage engine is out of process (like native BDB), this can significantly reduce GC. In-process caches can be really bad for GC activity, because their contents are potentially long-lived yet relatively transient (that is, neither mostly live nor mostly garbage, making the GC optimizer try in vain to compact things). But once again, this may or may not help, and needs to be experimented with.

Not sure if the above helps, but I hope it gives at least some ideas,

-+ Tatu +-
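As a concrete starting point, CMS-style settings of that era look something like the fragment below. The flag names are standard HotSpot options, but the values are illustrative experiments to measure against your own workload, not recommendations:

```shell
# Illustrative CMS settings for a large heap (HotSpot, circa JDK 6).
# Values are starting points to experiment with, not recommendations.
JVM_OPTS="$JVM_OPTS -Xms6G -Xmx6G"                # fixed-size heap avoids resize pauses
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"             # parallel young-gen collections
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"      # concurrent old-gen collections
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"  # start CMS before oldgen fills
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"  # log pause lengths
```

Watching the GC log while varying one flag at a time is the only reliable way to learn which of these actually matter for a given survival ratio.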
Re: cassandra freezes
haproxy should be fine.

Normal GCs aren't a problem; you don't need to worry about those. What is a problem is when you shove more requests into Cassandra than it can handle, so it GCs to get enough memory to handle them, then you shove even more requests in, so it GCs again, and it spirals out of control and "freezes." https://issues.apache.org/jira/browse/CASSANDRA-685 will address this by not allowing more requests than it can handle.

On Sat, Feb 20, 2010 at 10:22 AM, Simon Smith wrote:
> I'm still in the experimentation stage, so perhaps forgive this hypothetical
> question/idea. I am planning to load balance by putting haproxy in front of
> the Cassandra cluster. First of all, is that a bad idea?
>
> Secondly, if I have high enough replication and # of nodes, is it possible
> and a good idea to proactively cause GCing to happen? (I.e. take a node out
> of the haproxy LB pool, somehow cause it to GC, and then put the node back
> in... repeat at intervals for each node?)
>
> Simon Smith
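The idea behind that ticket amounts to bounded admission: reject new work when the server is already full instead of buffering it without limit. A minimal sketch of the principle only, not the actual patch:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedAdmission {
    private final BlockingQueue<Runnable> pending;

    BoundedAdmission(int capacity) {
        pending = new ArrayBlockingQueue<>(capacity);
    }

    // Admit work only while there is room; a full queue sheds load instead
    // of accumulating requests faster than they can be served, which is
    // what feeds the GC spiral described above.
    boolean submit(Runnable request) {
        return pending.offer(request); // false when the queue is full
    }

    public static void main(String[] args) {
        BoundedAdmission pool = new BoundedAdmission(2);
        System.out.println(pool.submit(() -> {})); // admitted
        System.out.println(pool.submit(() -> {})); // admitted
        System.out.println(pool.submit(() -> {})); // rejected: shed the load
    }
}
```

Rejecting early keeps memory pressure bounded, so clients see fast failures they can retry elsewhere rather than a node that "freezes" under a GC storm.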
Re: cassandra freezes
I'm still in the experimentation stage, so perhaps forgive this hypothetical question/idea. I am planning to load balance by putting haproxy in front of the Cassandra cluster. First of all, is that a bad idea?

Secondly, if I have high enough replication and # of nodes, is it possible and a good idea to proactively cause GCing to happen? (I.e. take a node out of the haproxy LB pool, somehow cause it to GC, and then put the node back in... repeat at intervals for each node?)

Simon Smith
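For the "somehow cause it to GC" step: a collection can be requested out-of-band through the platform MBeans, though (as the reply above notes) this is rarely necessary, and the call is only a hint the VM may ignore. A minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class ManualGc {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage before = mem.getHeapMemoryUsage();
        // Equivalent to System.gc(): a request, not a guarantee; the VM
        // may ignore it entirely (e.g. with -XX:+DisableExplicitGC).
        mem.gc();
        MemoryUsage after = mem.getHeapMemoryUsage();
        System.out.printf("heap used: %d -> %d bytes%n",
                before.getUsed(), after.getUsed());
    }
}
```

The same MemoryMXBean is reachable remotely over JMX, which is how a script rotating nodes out of the haproxy pool could trigger it without shelling into each host.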