Re: Testing row cache feature in trunk: write should put record in cache

2010-02-20 Thread Jonathan Ellis
We don't use native java serialization for anything but the on-disk
BitSets in our bloom filters (because those are deserialized once at
startup, so the overhead doesn't matter), btw.
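
(For context, a minimal sketch of the kind of round-trip involved -- illustrative only,
not the actual bloom filter serializer; the class and helper names here are made up:)

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.BitSet;

public class BitSetSerializationSketch {
    // Write the filter's bits with plain Java serialization; this happens rarely.
    static void save(BitSet bits, String path) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path));
        try {
            out.writeObject(bits);
        } finally {
            out.close();
        }
    }

    // Read them back once at startup, where the serialization overhead is a one-time cost.
    static BitSet load(String path) throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(path));
        try {
            return (BitSet) in.readObject();
        } finally {
            in.close();
        }
    }
}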

We're talking about adding compression after
https://issues.apache.org/jira/browse/CASSANDRA-674.

On Sat, Feb 20, 2010 at 3:12 PM, Tatu Saloranta  wrote:
> On Fri, Feb 19, 2010 at 11:44 AM, Weijun Li  wrote:
>> I see. How much is the overhead of java serialization? Does it slow down the
>> system a lot? It seems to be a tradeoff between CPU usage and memory.
>
> This should be relatively easy to measure as a stand-alone thing, or
> maybe even from profiler stack traces.
> If native Java serialization is used, there may be more efficient
> alternatives, depending on the data -- default serialization is highly
> inefficient for small object graphs (like individual objects), but OK
> for larger graphs; this is because much of the class metadata is
> included, so the result is very self-contained.
> Beyond default serialization, there are more efficient general-purpose
> Java serialization frameworks, like Kryo, or fast(est) JSON-based
> serializers (Jackson); see
> [http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking]
> for some ideas on alternatives.
>
> In fact, one interesting idea would be to further trade some CPU for
> less memory by using fast compression (like LZF). I hope to experiment
> with this idea some time in the future. The challenge is that this
> would help most with a clustered scheme (compressing more than one
> distinct item), which is much trickier to make work. Compression does
> OK with individual items, but the real boost comes from redundancy
> between similar items.
>
> -+ Tatu +-
>


Re: Testing row cache feature in trunk: write should put record in cache

2010-02-20 Thread Tatu Saloranta
On Fri, Feb 19, 2010 at 11:44 AM, Weijun Li  wrote:
> I see. How much is the overhead of java serialization? Does it slow down the
> system a lot? It seems to be a tradeoff between CPU usage and memory.

This should be relatively easy to measure as a stand-alone thing, or
maybe even from profiler stack traces.
If native Java serialization is used, there may be more efficient
alternatives, depending on the data -- default serialization is highly
inefficient for small object graphs (like individual objects), but OK
for larger graphs; this is because much of the class metadata is
included, so the result is very self-contained.
Beyond default serialization, there are more efficient general-purpose
Java serialization frameworks, like Kryo, or fast(est) JSON-based
serializers (Jackson); see
[http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking]
for some ideas on alternatives.
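
(As a rough, stand-alone illustration of the small-object-graph overhead -- the Row
class below is just a hypothetical stand-in for a cached value, not a Cassandra type:)

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSizeCheck {
    // Hypothetical small value object, standing in for a single cached item.
    static class Row implements Serializable {
        private static final long serialVersionUID = 1L;
        final String key;
        final byte[] value;
        Row(String key, byte[] value) { this.key = key; this.value = value; }
    }

    static byte[] javaSerialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(o);
        out.close();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Row row = new Row("user:12345", new byte[64]);
        // Default serialization writes class descriptors and field metadata,
        // so for a tiny graph like this the output is noticeably larger than
        // the raw payload.
        System.out.println("payload bytes:    " + (row.key.length() + row.value.length));
        System.out.println("serialized bytes: " + javaSerialize(row).length);
    }
}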

In fact, one interesting idea would be to further trade some CPU for
less memory by using fast compression (like LZF). I hope to experiment
with this idea some time in the future. The challenge is that this
would help most with a clustered scheme (compressing more than one
distinct item), which is much trickier to make work. Compression does
OK with individual items, but the real boost comes from redundancy
between similar items.
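
(A sketch of the per-item version of that trade, using the JDK's Deflater at its
fastest setting as a stand-in for an LZF codec -- a real LZF library would be
faster, but the shape of the code is the same:)

import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Holds a cached value in compressed form, trading a bit of CPU on each
// access for a smaller in-memory footprint.
public class CompressedValue {
    private final byte[] compressed;
    private final int originalLength;

    public CompressedValue(byte[] raw) {
        originalLength = raw.length;
        // BEST_SPEED trades compression ratio for low CPU cost, in the
        // spirit of LZF-style "fast" compression.
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length / 2 + 32);
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        compressed = out.toByteArray();
    }

    public byte[] get() throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[originalLength];
        int off = 0;
        while (!inflater.finished() && off < out.length) {
            off += inflater.inflate(out, off, out.length - off);
        }
        inflater.end();
        return out;
    }
}

The clustered variant would compress a block of adjacent items together instead,
which is where the redundancy between similar items would actually pay off.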

-+ Tatu +-


Re: StackOverflowError on high load

2010-02-20 Thread Jonathan Ellis
If OPP is configured with imbalanced ranges (i.e. less balanced than RP
would give you), then that would explain it.

OPP is actually slightly faster in terms of raw speed.

On Sat, Feb 20, 2010 at 2:31 PM, Ran Tavory  wrote:
> interestingly, I ran the same load but this time with a random partitioner
> and, although from time to time test2 was a little behind with its
> compaction task, it did not crash and was able to eventually close the gaps
> that were opened.
> Does this make sense? Is there a reason why the random partitioner is less
> likely to be faulty in this scenario? The scenario involves about 1300
> writes/sec of small amounts of data to a single CF on a cluster with two
> nodes and no replication. With the order-preserving partitioner, after a few
> hours of load the compaction pool falls behind on one of the hosts and
> eventually this host crashes, but with the random partitioner it doesn't
> crash.
> thanks
>
> On Sat, Feb 20, 2010 at 6:27 AM, Jonathan Ellis  wrote:
>>
>> looks like test1 started gc storming, so test2 treats it as dead and
>> starts doing hinted handoff for it, which increases test2's load, even
>> though test1 is not completely dead yet.
>>
>> On Thu, Feb 18, 2010 at 1:16 AM, Ran Tavory  wrote:
>> > I found another interesting graph, attached.
>> > I looked at the write-count and write-latency of the CF I'm writing to
>> > and I
>> > see a few interesting things:
>> > 1. the host test2 crashed at 18:00
>> > 2. At 16:00, after a few hours of load both hosts dropped their
>> > write-count.
>> > test1 (which did not crash) started slowing down first and then test2
>> > slowed.
>> > 3. At 16:00 I start seeing high write-latency on test2 only. This takes
>> > about 2h until finally at 18:00 it crashes.
>> > Does this help?
>> >
>> > On Thu, Feb 18, 2010 at 7:44 AM, Ran Tavory  wrote:
>> >>
>> >> I ran the process again and after a few hours the same node crashed the
>> >> same way. Now I can tell for sure this is indeed what Jonathan proposed --
>> >> the data directory needs to be 2x of what it is, but it looks like a
>> >> design problem: how large do I need to tell my admin to set it, then?
>> >> Here's what I see when the server crashes:
>> >> $ df -h /outbrain/cassandra/data/
>> >> Filesystem            Size  Used Avail Use% Mounted on
>> >> /dev/mapper/cassandra-data
>> >>                        97G   46G   47G  50% /outbrain/cassandra/data
>> >> The directory is 97G and when the host crashes it's at 50% use.
>> >> I'm also monitoring various JMX counters and I see that COMPACTION-POOL
>> >> PendingTasks grows for a while on this host (not on the other host, btw,
>> >> which is fine, just this host) and then stays flat for 3 hours. After 3
>> >> hours of staying flat it crashes. I'm attaching the graph.
>> >> When I restart cassandra on this host (no change to the file allocation
>> >> size, just a restart) it does manage to compact the data files pretty
>> >> fast, so after a minute I get 12% use. So I wonder: what made it crash
>> >> before but not now? (It could be the load, which isn't running now.)
>> >> $ df -h /outbrain/cassandra/data/
>> >> Filesystem            Size  Used Avail Use% Mounted on
>> >> /dev/mapper/cassandra-data
>> >>                        97G   11G   82G  12% /outbrain/cassandra/data
>> >> The question is what size does the data directory need to be? It's not
>> >> 2x the size of the data I expect to have (I only have 11G of real data
>> >> after compaction and the dir is 97G, so it should have been enough). If
>> >> it's 2x of something dynamic that keeps growing and isn't bounded, then
>> >> it'll just grow infinitely, right? What's the bound?
>> >> Alternatively, which JMX counter thresholds are the best indicators of
>> >> the crash that's about to happen?
>> >> Thanks
>> >>
>> >> On Wed, Feb 17, 2010 at 9:00 PM, Tatu Saloranta 
>> >> wrote:
>> >>>
>> >>> On Wed, Feb 17, 2010 at 6:40 AM, Ran Tavory  wrote:
>> >>> > If it's the data directory, then I have a pretty big one. Maybe it's
>> >>> > something else
>> >>> > $ df -h /outbrain/cassandra/data/
>> >>> > Filesystem            Size  Used Avail Use% Mounted on
>> >>> > /dev/mapper/cassandra-data
>> >>> >                        97G   11G   82G  12% /outbrain/cassandra/data
>> >>>
>> >>> Perhaps a temporary file? JVM defaults to /tmp, which may be on a
>> >>> smaller (root) partition?
>> >>>
>> >>> -+ Tatu +-
>> >>
>> >
>> >
>
>


Re: StackOverflowError on high load

2010-02-20 Thread Ran Tavory
interestingly, I ran the same load but this time with a random partitioner
and, although from time to time test2 was a little behind with its
compaction task, it did not crash and was able to eventually close the gaps
that were opened.
Does this make sense? Is there a reason why the random partitioner is less
likely to be faulty in this scenario? The scenario involves about 1300
writes/sec of small amounts of data to a single CF on a cluster with two
nodes and no replication. With the order-preserving partitioner, after a few
hours of load the compaction pool falls behind on one of the hosts and
eventually this host crashes, but with the random partitioner it doesn't
crash.
thanks

On Sat, Feb 20, 2010 at 6:27 AM, Jonathan Ellis  wrote:

> looks like test1 started gc storming, so test2 treats it as dead and
> starts doing hinted handoff for it, which increases test2's load, even
> though test1 is not completely dead yet.
>
> On Thu, Feb 18, 2010 at 1:16 AM, Ran Tavory  wrote:
> > I found another interesting graph, attached.
> > I looked at the write-count and write-latency of the CF I'm writing to
> and I
> > see a few interesting things:
> > 1. the host test2 crashed at 18:00
> > 2. At 16:00, after a few hours of load both hosts dropped their
> write-count.
> > test1 (which did not crash) started slowing down first and then test2
> > slowed.
> > 3. At 16:00 I start seeing high write-latency on test2 only. This takes
> > about 2h until finally at 18:00 it crashes.
> > Does this help?
> >
> > On Thu, Feb 18, 2010 at 7:44 AM, Ran Tavory  wrote:
> >>
> >> I ran the process again and after a few hours the same node crashed the
> >> same way. Now I can tell for sure this is indeed what Jonathan proposed --
> >> the data directory needs to be 2x of what it is, but it looks like a
> >> design problem: how large do I need to tell my admin to set it, then?
> >> Here's what I see when the server crashes:
> >> $ df -h /outbrain/cassandra/data/
> >> Filesystem            Size  Used Avail Use% Mounted on
> >> /dev/mapper/cassandra-data
> >>                        97G   46G   47G  50% /outbrain/cassandra/data
> >> The directory is 97G and when the host crashes it's at 50% use.
> >> I'm also monitoring various JMX counters and I see that COMPACTION-POOL
> >> PendingTasks grows for a while on this host (not on the other host, btw,
> >> which is fine, just this host) and then stays flat for 3 hours. After 3
> >> hours of staying flat it crashes. I'm attaching the graph.
> >> When I restart cassandra on this host (no change to the file allocation
> >> size, just a restart) it does manage to compact the data files pretty
> >> fast, so after a minute I get 12% use. So I wonder: what made it crash
> >> before but not now? (It could be the load, which isn't running now.)
> >> $ df -h /outbrain/cassandra/data/
> >> Filesystem            Size  Used Avail Use% Mounted on
> >> /dev/mapper/cassandra-data
> >>                        97G   11G   82G  12% /outbrain/cassandra/data
> >> The question is what size does the data directory need to be? It's not
> >> 2x the size of the data I expect to have (I only have 11G of real data
> >> after compaction and the dir is 97G, so it should have been enough). If
> >> it's 2x of something dynamic that keeps growing and isn't bounded, then
> >> it'll just grow infinitely, right? What's the bound?
> >> Alternatively, which JMX counter thresholds are the best indicators of
> >> the crash that's about to happen?
> >> Thanks
> >>
> >> On Wed, Feb 17, 2010 at 9:00 PM, Tatu Saloranta 
> >> wrote:
> >>>
> >>> On Wed, Feb 17, 2010 at 6:40 AM, Ran Tavory  wrote:
> >>> > If it's the data directory, then I have a pretty big one. Maybe it's
> >>> > something else
> >>> > $ df -h /outbrain/cassandra/data/
> >>> > Filesystem            Size  Used Avail Use% Mounted on
> >>> > /dev/mapper/cassandra-data
> >>> >                        97G   11G   82G  12% /outbrain/cassandra/data
> >>>
> >>> Perhaps a temporary file? JVM defaults to /tmp, which may be on a
> >>> smaller (root) partition?
> >>>
> >>> -+ Tatu +-
> >>
> >
> >
>
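
(A minimal sketch of watching that counter from outside the process over JMX;
the MBean name and port below are assumptions about this Cassandra build and
should be checked against what jconsole actually shows on the node:)

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CompactionBacklogWatcher {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        // Port and MBean name are assumptions -- verify them in jconsole.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":8080/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            ObjectName pool = new ObjectName(
                    "org.apache.cassandra.concurrent:type=COMPACTION-POOL");
            while (true) {
                Object pending = mbeans.getAttribute(pool, "PendingTasks");
                // A backlog that grows (or stays flat) for hours is the
                // warning sign described above.
                System.out.println(System.currentTimeMillis() + " PendingTasks=" + pending);
                Thread.sleep(60000L);
            }
        } finally {
            connector.close();
        }
    }
}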


Re: cassandra freezes

2010-02-20 Thread Tatu Saloranta
On Fri, Feb 19, 2010 at 7:40 PM, Santal Li  wrote:
> I ran into almost the same thing as you. When I do some benchmark write tests,
> sometimes one Cassandra node will freeze, and the other nodes will consider it
> down, and it comes back up after 30+ seconds. I am using 5 nodes, each with 8G
> of memory for the Java heap.
>
> From my investigation, it was caused by the GC thread: I started JConsole and
> monitored the memory heap usage, and each time the GC happened, heap usage
> dropped from 6G to 1G. Checking the cassandra log, I found the freeze happened
> at exactly the same times.

With such a big heap, old-generation GCs can definitely take a while.
Even with just a 1.5 gig heap, and with somewhat efficient parallel
collection (on a multi-core machine), we had trouble keeping collections
below 5 seconds. But this depends a lot on the survival ratio -- the
less garbage there is (and the more live objects), the slower things
are. And the relationship is super-linear too, so processing 6 gigs (or
whatever part of that is old-generation space) can take a long time.

It is certainly worth keeping in mind that more memory generally means
longer GC collection times.

But Jonathan is probably right that this alone would not cause the
appearance of a freeze -- rather, the combination of GC blocking
processing AND the accumulation of new requests sounds more plausible.
It is still good to consider both parts of the puzzle: preventing the
overload that can turn a bad situation into a catastrophe, and trying
to reduce the impact of GC.

> So I think when using huge memory (>2G), maybe we need to use a different GC
> strategy than the default one provided by the Cassandra launch script.
> Has anyone else met this situation? Can you please provide some guidance?

There are many ways to change GC settings, specifically to reduce the
impact of old-generation collections (young-generation ones are less
often problematic, although they can be tuned as well).
Often there is a trade-off between the frequency and the impact of GC:
to simplify, the less often you configure it to occur (for example by
increasing the heap), the more impact it usually has when it does occur.
Concurrent collectors (like the traditional CMS) are good for steady
state, and can keep old-gen GC from occurring for hours at a time (doing
incremental, concurrent "partial" collections). But they can also lead
to a GC-from-hell when they must fall back to a full, stop-the-world
collection.
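
(Purely as an illustration of that trade-off -- not a recommendation, and the
values would have to be tuned and verified against your own heap, survival
ratio and launch script -- a CMS-oriented set of JVM options might look like:)

  -Xms6G -Xmx6G
  -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC
  -XX:+CMSParallelRemarkEnabled
  -XX:CMSInitiatingOccupancyFraction=75
  -XX:+UseCMSInitiatingOccupancyOnly
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

(The last line just makes the pauses visible in the log, which is the first
step regardless of which collector you end up with.)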

There is tons of information on how to deal with GC settings, but
unfortunately it is a bit of a black art and very dependent on your
specific use case. The fact that there are dozens (more than a hundred,
I think) of different switches makes it trickier still, since you also
need to learn which ones matter, and in what combinations.

One somewhat counter-intuitive suggestion is to reduce the size of the
heap, at least with respect to caching: mostly try to just keep the live
working set in memory, and avoid caching inside the Java process.
Operating systems are pretty good at caching disk pages, and if the
storage engine is out of process (like native BDB), this can
significantly reduce GC. In-process caches can be really bad for GC
activity, because their contents are potentially long-lived yet
relatively transient (that is, neither mostly live nor mostly garbage,
making the GC optimizer try in vain to compact things).
But once again, this may or may not help, and needs to be experimented with.

Not sure if the above helps, but I hope it gives at least some ideas,

-+ Tatu +-


Re: cassandra freezes

2010-02-20 Thread Jonathan Ellis
haproxy should be fine.

Normal GCs aren't a problem; you don't need to worry about those.  What
is a problem is when you shove more requests into cassandra than it can
handle, so it tries to GC to get enough memory to handle them, then you
shove in even more requests, so it GCs again, and it spirals out of
control and "freezes."

https://issues.apache.org/jira/browse/CASSANDRA-685 will address this
by not allowing more requests than it can handle.
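
(The general shape of that kind of fix -- not the actual CASSANDRA-685 patch,
just a minimal sketch of bounding the work queue so producers get slowed down
instead of work piling up without limit:)

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedStage {
    // At most 32 worker threads and 4096 queued requests. Once the queue is
    // full, CallerRunsPolicy makes the submitting thread execute the task
    // itself, which naturally throttles how fast new work can arrive.
    private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
            32, 32, 60, TimeUnit.SECONDS,
            new ArrayBlockingQueue<Runnable>(4096),
            new ThreadPoolExecutor.CallerRunsPolicy());

    public void submit(Runnable request) {
        executor.execute(request);
    }
}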

On Sat, Feb 20, 2010 at 10:22 AM, Simon Smith  wrote:
> I'm still in the experimentation stage, so please forgive this hypothetical
> question/idea.  I am planning to load balance by putting haproxy in front of
> the cassandra cluster.  First of all, is that a bad idea?
>
> Secondly, if I have high enough replication and # of nodes, is it possible
> and a good idea to proactively cause GCing to happen?  (I.e. take a node out
> of the haproxy LB pool, somehow cause it to gc, and then put the node back
> in... repeat at intervals for each node?)
>
> Simon Smith
>


Re: cassandra freezes

2010-02-20 Thread Simon Smith
I'm still in the experimentation stage, so please forgive this hypothetical
question/idea.  I am planning to load balance by putting haproxy in front of
the cassandra cluster.  First of all, is that a bad idea?

Secondly, if I have high enough replication and # of nodes, is it possible
and a good idea to proactively cause GCing to happen?  (I.e. take a node out
of the haproxy LB pool, somehow cause it to gc, and then put the node back
in... repeat at intervals for each node?)
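
(One way to do the "somehow cause it to gc" step from outside the process is
the gc() operation on the java.lang:type=Memory MBean -- the same thing
jconsole's "Perform GC" button invokes. A minimal sketch, assuming JMX is
enabled on the node; the port here is just a placeholder:)

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RemoteGc {
    public static void main(String[] args) throws Exception {
        String host = args[0];
        String port = args.length > 1 ? args[1] : "8080"; // assumed JMX port
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            // Triggers a full collection on the remote JVM.
            mbeans.invoke(new ObjectName("java.lang:type=Memory"), "gc", null, null);
        } finally {
            connector.close();
        }
    }
}

Whether forcing collections on a schedule actually helps, versus just tuning
the collector, is a separate question.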

Simon Smith