Hi Micke,

This is really cool, thanks for taking the time to investigate this. I believe 
the metrics around memtable insert time come in handy in identifying high 
partition contention in the memtable. I know I've been involved in a situation 
over the past year where we got actionable info from this metric. Reducing 
resolution to milliseconds is probably a no go since most things in this path 
should complete in less than a millisecond. 

Revisiting the use of the codahale metrics in the hot path like this definitely 
seems like a good idea though. I don't think it's been something we've talked 
about a lot, and it definitely looks like we could benefit from using something 
more specialized here. I think it's worth doing, especially since there won't 
be any major changes to how we do threading in 4.0. It's probably also worth 
opening a JIRA and investigating the calls to nano time. We at least need 
microsecond resolution here, and there could be something we haven't thought 
of? It's worth a look at least.

Thanks,

Blake

On 2/22/18, 6:10 AM, "Michael Burman" <mibur...@redhat.com> wrote:

    Hi,
    
    I wanted to get some input from the mailing list before making a JIRA 
    and potential fixes. I'll touch the performance more on latter part, but 
    there's one important question regarding the write latency metric 
    recording place. Currently we measure the writeLatency (and metric write 
    sampler..) in ColumnFamilyStore.apply() and this is also the metric we 
    then replicate to Keyspace metrics etc.
    
    This is an odd place for writeLatency. Not to mention it is in a 
    hot-path of Memtable-modifications, but it also does not measure the 
    real write latency, since it completely ignores the CommitLog latency in 
    that same process. Is the intention really to measure 
    Memtable-modification latency only or the actual write latencies?
    
    Then the real issue.. this single metric is a cause of huge overhead in 
    Memtable processing. There are several metrics / events in the CFS apply 
    method, including metric sampler, storageHook reportWrite, 
    colUpdateTimeDeltaHistogram and metric.writeLatency. These are not free 
    at all when it comes to the processing. I made a small JMH benchmark 
    here: https://gist.github.com/burmanm/b5b284bc9f1d410b1d635f6d3dac3ade 
    that I'll be referring to.
    
    The most offending of all these metrics is the writeLatency metric. What 
    it does is update the latency in codahale's timer, doing a histogram 
    update and then going through all the parent metrics also which update 
    the keyspace writeLatency and globalWriteLatency. When measuring the 
    performance of Memtable.put with parameter of 1 partition (to reduce the 
    ConcurrentSkipListMap search speed impact - that's separate issue and 
    takes a little bit longer to solve although I've started to prototype 
    something..) on my machine I see 1.3M/s performance with the metric and 
    when it is disabled the performance climbs to 4M/s. So the overhead for 
    this single metric is ~2/3 of total performance. That's insane. My perf 
    stats indicate that the CPU is starved as it can't get enough data in.
    
    Removing the replication from TableMetrics to the Keyspace & global 
    latencies in the write time (and doing this when metrics are requested 
    instead) improves the performance to 2.1M/s on my machine. It's an 
    improvement, but it's still huge amount. Even when we pressure the 
    ConcurrentSkipListMap with 100 000 partitions in one active Memtable, 
    the performance drops by about ~40% due to this metric, so it's never free.
    
    i did not find any discussion replacing the metric processing with 
    something faster, so has this been considered before? At least for these 
    performance sensitive ones. The other issue is obviously the use of 
    System.nanotime() which by itself is very slow (two System.nanotime() 
    calls eat another ~1M/s from the performance)
    
    My personal quick fix would be to move writeLatency to Keyspace.apply, 
    change write time aggregates to read time processing (metrics are read 
    less often than we write data) and maybe even reduce the nanotime -> 
    currentTimeMillis (even given it's relative lack of precision). That is 
    - if these metrics make any sense at all at CFS level? Maybe these 
    should be measured from the network processing time (including all the 
    deserializations and such) ? Especially if at some point the smarter 
    threading / eventlooping changes go forward (in which case they might 
    sleep at some "queue" for a while).
    
       - Micke
    
    
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
    For additional commands, e-mail: dev-h...@cassandra.apache.org
    
    



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org

Reply via email to