Peter,

Thanks for your input.  Can you tell me more about what we should be
looking for in the gc log?   We've already got the gc logging turned
on, and we've already done the plotting to show that in most
cases the outliers are happening periodically (with a period of
tens of seconds to a few minutes, depending on load and tuning).
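
For reference, here is a minimal sketch of how the period of the outliers
could be estimated from logged response times. It assumes the samples are
(timestamp in seconds, latency in ms) pairs; the function names, the
threshold, and the 1-second merge window are illustrative assumptions,
not part of any existing tool.

```python
from statistics import median

def outlier_times(samples, threshold_ms):
    """Return timestamps (seconds) of samples whose latency exceeds threshold_ms."""
    return [t for t, latency in samples if latency > threshold_ms]

def estimate_period(times, merge_window=1.0):
    """Median gap between the starts of outlier bursts.

    Outliers less than merge_window seconds after the previous outlier are
    treated as part of the same burst. Returns None if fewer than two
    bursts are found.
    """
    bursts = []
    last = None
    for t in sorted(times):
        if last is None or t - last > merge_window:
            bursts.append(t)  # start of a new burst
        last = t
    gaps = [b - a for a, b in zip(bursts, bursts[1:])]
    return median(gaps) if gaps else None

# Hypothetical samples: (timestamp_s, latency_ms)
samples = [(0.0, 5), (10.0, 250), (10.5, 300), (20.0, 4), (40.0, 260), (70.0, 400)]
period = estimate_period(outlier_times(samples, threshold_ms=200))
```

A fairly stable median gap suggests a periodic cause; widely varying
gaps suggest something load-dependent.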

I've tried to correlate the times of the outliers with messages either
in the system log or the gc log.   There seems to be some (but not
complete) correlation between the outliers and system log messages about
memtable flushing.   I cannot find anything in the gc log that
seems to be an obvious problem, or that matches up with the
times of the outliers.
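
A rough way to put a number on that kind of correlation is sketched
below. It assumes the outlier times and the log event times (memtable
flush messages, GC pauses, and so on) have already been extracted as
seconds on a common clock; the function name and the 1-second window
are hypothetical choices, and no particular log format is assumed.

```python
from bisect import bisect_left

def fraction_near_events(outliers, events, window=1.0):
    """Fraction of outlier timestamps within `window` seconds of any event."""
    if not outliers:
        return 0.0
    events = sorted(events)
    hits = 0
    for t in outliers:
        i = bisect_left(events, t)
        # The nearest event is either the one just before t or the one at/after t.
        neighbours = events[max(i - 1, 0):i + 1]
        if any(abs(t - e) <= window for e in neighbours):
            hits += 1
    return hits / len(outliers)

# Hypothetical timestamps, in seconds
outliers = [10.0, 40.0, 70.0]
flushes = [9.5, 41.5, 100.0]
ratio = fraction_near_events(outliers, flushes, window=1.0)
```

Running this once against the flush timestamps and once against the GC
pause timestamps gives two ratios that can be compared directly.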


On Mon, Dec 19, 2011 at 09:47:32PM -0800, Peter Schuller wrote:
> > I'm trying to understand if this is expected or not, and if there is
> 
> Without careful tuning, outliers around a couple of hundred ms are
> definitely expected in general (not *necessarily*, depending on
> workload) as a result of garbage collection pauses. The impact will be
> worsened a bit if you are running under high CPU load (or even maxing
> it out with stress) because post-pause, if you are close to max CPU
> usage you will take considerably longer to "catch up".
> 
> Personally, I would just log each response time and feed it to gnuplot
> or something. It should be pretty obvious whether or not the latencies
> are due to periodic pauses.
> 
> If you are concerned with eliminating or reducing outliers, I would:
> 
> (1) Make sure that when you're benchmarking, that you're putting
> Cassandra under a reasonable amount of load. Latency benchmarks are
> usually useless if you're benchmarking against a saturated system. At
> least, start by achieving your latency goals at 25% or less CPU usage,
> and then go from there if you want to up it.
> 
> (2) One can affect GC pauses, but it's non-trivial to eliminate the
> problem completely. For example, the length of frequent young-gen
> pauses can typically be decreased by decreasing the size of the young
> generation, leading to more frequent shorter GC pauses. But that
> instead causes more promotion into the old generation, which will
> result in more frequent very long pauses (relative to normal; they
> would still be infrequent relative to young gen pauses) - IF your
> workload is such that you are suffering from fragmentation and
> eventually seeing Cassandra fall back to full compacting GC:s
> (stop-the-world) for the old generation.
> 
> I would start by adjusting young gen so that your frequent pauses are
> at an acceptable level, and then see whether or not you can sustain
> that in terms of old-gen.
> 
> Start with this in any case: Run Cassandra with -XX:+PrintGC
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
> 
> -- 
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)

-- 
Peter Fales
Alcatel-Lucent
Member of Technical Staff
1960 Lucent Lane
Room: 9H-505
Naperville, IL 60566-7033
Email: peter.fa...@alcatel-lucent.com
Phone: 630 979 8031
