Hi,

I've looked at the metrics' expense at a high level: it's around ~4% of the total CPU time on my machine. But the problem with that higher-level measurement is that it does not show waits. When I push writes to Cassandra (through CQL), the kernel-level measurements show mostly stalls - that is, the CPU can't do useful work. IIRC (I'm on the wrong computer now, so I can't check the exact numbers) about ~70% of the CPU cycles are wasted when running a write test against a live Cassandra instance. But at that level there's too much data to actually get into the details of what is blocking the CPU from executing - cache misses, branch misses (these are actually quite high) - and where they happen.

That's why I'm trying to look at the smaller parts first: they are easier to measure and debug (even though that approach can't easily solve architectural issues). I also have some higher-level versions of that gist microbenchmark (omitted for clarity) and I intend to create more of them.

  - Micke


On 02/22/2018 06:32 PM, Jonathan Haddad wrote:
Hey Micke, very cool you're looking to improve C*'s performance, we would
absolutely benefit from it.

Have you done any other benchmarks besides the micro one to determine the
total effect of these metrics on the system overall?  Microbenchmarks are a
great way to tune small sections of code but they aren't a great starting
point.  It would be good if we could put some context around the idea by
benchmarking a tuned, single node (so there's less network overhead)
running on fast disks with compaction disabled so we can see what kind of
impact these metrics are adding.  Ideally we'd look at GC promotion and CPU
time using something like YourKit to identify the overall effect of the
metrics, so we can set our expectations and goals in a reasonable manner.
Happy to coordinate with you on this!

On Thu, Feb 22, 2018 at 8:08 AM Jeremiah D Jordan <jeremiah.jor...@gmail.com>
wrote:

re: nanoTime vs currentTimeMillis, there is a good blog post here about the
timing of both and how your choice of Linux clock source can drastically
affect the speed of the calls, and also showing that in general on Linux
there is no perf improvement for one over the other.
http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html
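
If anyone wants to check the call costs on their own hardware, a minimal JMH sketch along these lines (my own illustration, not taken from the blog post) shows the per-call cost of each under whatever clock source the machine is using (e.g. tsc vs hpet):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class ClockBench {

    // Per-call cost of System.nanoTime(); heavily dependent on the clock source.
    @Benchmark
    public long nanoTime() {
        return System.nanoTime();
    }

    // Per-call cost of System.currentTimeMillis() for comparison.
    @Benchmark
    public long currentTimeMillis() {
        return System.currentTimeMillis();
    }
}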

On Feb 22, 2018, at 11:01 AM, Blake Eggleston <beggles...@apple.com>
wrote:
Hi Micke,

This is really cool, thanks for taking the time to investigate this. I
believe the metrics around memtable insert time come in handy in
identifying high partition contention in the memtable. I know I've been
involved in a situation over the past year where we got actionable info
from this metric. Reducing resolution to milliseconds is probably a no-go
since most things in this path should complete in less than a millisecond.
Revisiting the use of the codahale metrics in the hot path like this
definitely seems like a good idea though. I don't think it's been something
we've talked about a lot, and it definitely looks like we could benefit
from using something more specialized here. I think it's worth doing,
especially since there won't be any major changes to how we do threading in
4.0. It's probably also worth opening a JIRA and investigating the calls to
nanoTime. We at least need microsecond resolution here, and there could be
something we haven't thought of - it's worth a look at least.
Thanks,

Blake
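
To give a rough idea of what "more specialized" could mean - purely an illustrative sketch, not existing Cassandra code or a concrete proposal - the hot path could be reduced to one striped counter increment per write, with percentile math deferred to whoever reads the histogram:

import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch only: a fixed-bucket latency recorder whose write-path
// cost is a single LongAdder increment, with no locks and no reservoir sampling.
public class CheapLatencyRecorder {
    // Power-of-two microsecond buckets: [1us, 2us, 4us, ... up to ~1s].
    private static final int BUCKETS = 21;
    private final LongAdder[] counts = new LongAdder[BUCKETS];

    public CheapLatencyRecorder() {
        for (int i = 0; i < BUCKETS; i++)
            counts[i] = new LongAdder();
    }

    // Hot path: bucket index is floor(log2(latency)), capped at the last bucket.
    public void record(long latencyMicros) {
        int bucket = 63 - Long.numberOfLeadingZeros(Math.max(latencyMicros, 1));
        counts[Math.min(bucket, BUCKETS - 1)].increment();
    }

    // Read path: snapshot the buckets; aggregation and percentile estimation
    // would happen here, not on every write.
    public long[] snapshot() {
        long[] out = new long[BUCKETS];
        for (int i = 0; i < BUCKETS; i++)
            out[i] = counts[i].sum();
        return out;
    }
}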

On 2/22/18, 6:10 AM, "Michael Burman" <mibur...@redhat.com> wrote:

    Hi,

    I wanted to get some input from the mailing list before making a JIRA
    and potential fixes. I'll get to the performance details later, but first
    there's one important question regarding where the write latency metric
    is recorded. Currently we measure the writeLatency (and the metric write
    sampler..) in ColumnFamilyStore.apply(), and this is also the metric we
    then replicate to the Keyspace metrics etc.

    This is an odd place for writeLatency. Not only is it in the hot path of
    Memtable modifications, it also does not measure the real write latency,
    since it completely ignores the CommitLog latency in that same process.
    Is the intention really to measure Memtable-modification latency only, or
    the actual write latency?

    Then the real issue: this single metric is the cause of huge overhead in
    Memtable processing. There are several metrics / events in the CFS apply
    method, including the metric sampler, storageHook reportWrite,
    colUpdateTimeDeltaHistogram and metric.writeLatency. These are not free
    at all when it comes to processing. I made a small JMH benchmark here:
    https://gist.github.com/burmanm/b5b284bc9f1d410b1d635f6d3dac3ade
    that I'll be referring to.
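
    As a rough illustration of the shape of the measurement (a simplified
    stand-in, not the gist itself - the gist runs the real Memtable.put), the
    benchmark compares the operation with and without the codahale Timer
    update:

    import java.util.concurrent.TimeUnit;
    import com.codahale.metrics.Timer;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Benchmark)
    public class MetricOverheadBench {
        private final Timer writeLatency = new Timer();

        // Baseline: just the clock read that would bracket the real work.
        @Benchmark
        public long baseline() {
            return System.nanoTime();
        }

        // Adds the per-write metric cost: a second nanoTime call plus the
        // codahale histogram/reservoir update.
        @Benchmark
        public long withTimer() {
            long start = System.nanoTime();
            long end = System.nanoTime();
            writeLatency.update(end - start, TimeUnit.NANOSECONDS);
            return end;
        }
    }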

    The most offending of all these metrics is the writeLatency metric. What
    it does is update the latency in codahale's timer, doing a histogram
    update and then walking all the parent metrics as well, which update the
    keyspace writeLatency and the global writeLatency. When measuring the
    performance of Memtable.put with a single partition (to reduce the
    ConcurrentSkipListMap search impact - that's a separate issue and will
    take a bit longer to solve, although I've started to prototype
    something..) I see 1.3M ops/s on my machine with the metric enabled, and
    when it is disabled the performance climbs to 4M ops/s. So the overhead
    of this single metric is ~2/3 of the total performance. That's insane. My
    perf stats indicate that the CPU is starved as it can't get enough data
    in.

    Removing the replication from TableMetrics to the Keyspace & global
    latencies at write time (and doing this when metrics are requested
    instead) improves the performance to 2.1M ops/s on my machine. That's an
    improvement, but it's still a huge amount of overhead. Even when we
    pressure the ConcurrentSkipListMap with 100 000 partitions in one active
    Memtable, the performance drops by about ~40% due to this metric, so it's
    never free.
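
    A rough sketch of what that read-time aggregation could look like (the
    class and method names here are made up for illustration, this is not
    the actual TableMetrics code):

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import com.codahale.metrics.Timer;

    // Hypothetical sketch: the write path only updates the table-level timer;
    // the keyspace/global view is computed lazily when metrics are scraped.
    public class LazyParentWriteLatency {
        private final List<Timer> tableTimers = new CopyOnWriteArrayList<>();

        public void register(Timer tableWriteLatency) {
            tableTimers.add(tableWriteLatency);
        }

        // Called by the metrics reporter, never by the write path. Percentile
        // merging would need a mergeable histogram (e.g. HdrHistogram) rather
        // than codahale's sampling reservoir; the count just shows the idea.
        public long aggregatedCount() {
            long sum = 0;
            for (Timer t : tableTimers)
                sum += t.getCount();
            return sum;
        }
    }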
    I did not find any discussion about replacing the metric processing with
    something faster, so has this been considered before, at least for these
    performance-sensitive ones? The other issue is obviously the use of
    System.nanoTime(), which by itself is very slow (two System.nanoTime()
    calls eat another ~1M ops/s from the performance).

    My personal quick fix would be to move writeLatency to Keyspace.apply,
    change write-time aggregation to read-time processing (metrics are read
    far less often than we write data), and maybe even switch from nanoTime
    to currentTimeMillis (despite its relative lack of precision). That is,
    if these metrics make any sense at the CFS level at all. Maybe they
    should instead be measured from the network processing time (including
    all the deserializations and such)? Especially if at some point the
    smarter threading / event-looping changes go forward (in which case
    writes might sleep in some "queue" for a while).
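
    Structurally the quick fix would look something like the following - the
    method names are placeholders, not the real CommitLog/Memtable calls -
    with one measurement around the whole apply so the CommitLog time is
    included and the metric is updated once per write:

    import java.util.concurrent.TimeUnit;
    import com.codahale.metrics.Timer;

    // Placeholder sketch only: a single latency measurement around the whole
    // apply path instead of deep inside the Memtable modification hot path.
    public class KeyspaceApplySketch {
        private final Timer writeLatency = new Timer();

        public void apply(Runnable appendToCommitLog, Runnable applyToMemtable) {
            long start = System.nanoTime();
            appendToCommitLog.run();   // hypothetical stand-in for the CommitLog write
            applyToMemtable.run();     // hypothetical stand-in for the Memtable insert
            writeLatency.update(System.nanoTime() - start, TimeUnit.NANOSECONDS);
        }
    }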

       - Micke

