Hi,

I filed CASSANDRA-14281 for the initial idea and for where I ended up with my current prototype. It maintains the current layout of the JMX metrics, so it shouldn't be visible to users. "Shouldn't", because I couldn't really find any definition of our metrics. For example, our histograms built through a Meter use an ExponentiallyDecayingReservoir, while our directly built histograms use the DecayingEstimatedHistogramReservoir algorithm. So the two kinds of histograms behave differently, but the JMX requester has no idea which way a given one was built.
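
To make the difference concrete, here's a minimal standalone sketch using only the plain Dropwizard/Codahale API (illustrated with a Timer, whose default reservoir is the exponentially decaying one; the class name and the comment about where Cassandra's own reservoir slots in are mine, not Cassandra code):

import java.util.concurrent.TimeUnit;

import com.codahale.metrics.ExponentiallyDecayingReservoir;
import com.codahale.metrics.Histogram;
import com.codahale.metrics.Timer;

public class ReservoirSketch
{
    public static void main(String[] args)
    {
        // A plain Codahale Timer defaults to an ExponentiallyDecayingReservoir,
        // i.e. a forward-decayed sample of the recorded values.
        Timer timer = new Timer();
        timer.update(42, TimeUnit.MICROSECONDS);

        // A Histogram with an explicit reservoir; in Cassandra the directly
        // built histograms plug in DecayingEstimatedHistogramReservoir here
        // (a bucketed estimate, not a sample), so the two decay and report
        // percentiles differently even though they look the same over JMX.
        Histogram sampled = new Histogram(new ExponentiallyDecayingReservoir());
        sampled.update(42);

        System.out.println(timer.getSnapshot().get99thPercentile());
        System.out.println(sampled.getSnapshot().get99thPercentile());
    }
}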

Also, recentValues() for example does not follow either of those behaviors, but introduces yet another one (!). CASSANDRA-13642 added recentValues(), which uses values() underneath. The ticket itself did not really describe what behavior should be exposed (luckily it went into 4.0, so technically it is not in any released version yet).
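
For reference, a rough standalone sketch of the delta-since-last-call pattern that recentValues() appears to implement on top of values() (my own illustration, not the actual Cassandra code; the bucket array stands in for whatever values() returns):

import java.util.Arrays;

public class RecentValuesSketch
{
    private long[] previous = new long[0];

    // Return how much each bucket grew since the previous call, then remember
    // the current counts for the next call.
    public synchronized long[] recentValues(long[] currentBuckets)
    {
        long[] delta = new long[currentBuckets.length];
        for (int i = 0; i < currentBuckets.length; i++)
            delta[i] = currentBuckets[i] - (i < previous.length ? previous[i] : 0);
        previous = Arrays.copyOf(currentBuckets, currentBuckets.length);
        return delta;
    }
}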

I doubt anyone really knows how our metrics are supposed to behave. What if I change how they decay? Or should we use a single decay strategy for all our histograms (and related metrics)?

I guess my ticket suddenly became slightly more complex ;)

  - Micke

On 02/28/2018 12:25 AM, Nate McCall wrote:
Hi Micke,
There is some good research in here - have you had a chance to create
some issues in Jira from this?

On Fri, Feb 23, 2018 at 6:28 AM, Michael Burman <mibur...@redhat.com> wrote:
Hi,

I was referring to this article by Shipilev (the article at the url you pasted
overgoes a few small issues that this one covers):

https://shipilev.net/blog/2014/nanotrusting-nanotime/

And his lovely recommendation on it: "System.nanoTime is as bad as
String.intern now: you can use it, but use it wisely." Cassandra uses it
quite a lot, in the write path at least. There isn't necessarily a better
option for it in Java, but that's exactly why we shouldn't push calls to it
everywhere in the code "for fun".

   - Micke



On 02/22/2018 06:08 PM, Jeremiah D Jordan wrote:
re: nanoTime vs currentTimeMillis, there is a good blog post here about the
timing of both and how your choice of Linux clock source can drastically
affect the speed of the calls; it also shows that, in general, on Linux
there is no perf improvement of one over the other.
http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html
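
A quick and dirty way to check this on a given box (naive loop timing rather than JMH, so take the numbers with a grain of salt; the clocksource path is the usual Linux sysfs location):

import java.nio.file.Files;
import java.nio.file.Paths;

public class ClockProbe
{
    public static void main(String[] args) throws Exception
    {
        // Which clock source the kernel is currently using (e.g. tsc, hpet, xen).
        System.out.println(new String(Files.readAllBytes(
            Paths.get("/sys/devices/system/clocksource/clocksource0/current_clocksource"))).trim());

        int iterations = 10_000_000;
        long sink = 0;

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++)
            sink += System.nanoTime();
        System.out.println("nanoTime: " + (System.nanoTime() - start) / iterations + " ns/call");

        start = System.nanoTime();
        for (int i = 0; i < iterations; i++)
            sink += System.currentTimeMillis();
        System.out.println("currentTimeMillis: " + (System.nanoTime() - start) / iterations + " ns/call");

        System.out.println(sink); // keep the loops from being optimized away
    }
}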

On Feb 22, 2018, at 11:01 AM, Blake Eggleston <beggles...@apple.com>
wrote:

Hi Micke,

This is really cool, thanks for taking the time to investigate this. I
believe the metrics around memtable insert time come in handy for identifying
high partition contention in the memtable. I know I've been involved in a
situation over the past year where we got actionable info from this metric.
Reducing resolution to milliseconds is probably a no-go, since most things in
this path should complete in less than a millisecond.

Revisiting the use of the Codahale metrics in the hot path like this
definitely seems like a good idea though. I don't think it's been something
we've talked about a lot, and it definitely looks like we could benefit from
using something more specialized here. I think it's worth doing, especially
since there won't be any major changes to how we do threading in 4.0. It's
probably also worth opening a JIRA and investigating the calls to nanoTime.
We at least need microsecond resolution here, and there could be something
we haven't thought of. It's worth a look at least.

Thanks,

Blake

On 2/22/18, 6:10 AM, "Michael Burman" <mibur...@redhat.com> wrote:

     Hi,

     I wanted to get some input from the mailing list before creating a JIRA
     and potential fixes. I'll touch on the performance more in the latter
     part, but there's one important question first regarding where the write
     latency metric is recorded. Currently we measure the writeLatency (and
     the metric write sampler..) in ColumnFamilyStore.apply(), and this is
     also the metric we then replicate to the Keyspace metrics etc.

     This is an odd place for writeLatency. Not only is it in the hot path of
     Memtable modifications, it also does not measure the real write latency,
     since it completely ignores the CommitLog latency in that same process.
     Is the intention really to measure Memtable-modification latency only, or
     the actual write latencies?

     Then the real issue.. this single metric is the cause of a huge overhead
     in Memtable processing. There are several metrics / events in the CFS
     apply method, including the metric sampler, storageHook reportWrite,
     colUpdateTimeDeltaHistogram and metric.writeLatency. These are not free
     at all when it comes to processing. I made a small JMH benchmark here:
     https://gist.github.com/burmanm/b5b284bc9f1d410b1d635f6d3dac3ade
     that I'll be referring to.
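
     (For illustration only, a stripped-down JMH skeleton in the spirit of the
     gist above - not the gist itself; the writeWithTimer/writeWithoutTimer
     pair is hypothetical and just isolates the Codahale Timer.update cost:)

     import java.util.concurrent.TimeUnit;

     import com.codahale.metrics.Timer;
     import org.openjdk.jmh.annotations.Benchmark;
     import org.openjdk.jmh.annotations.BenchmarkMode;
     import org.openjdk.jmh.annotations.Mode;
     import org.openjdk.jmh.annotations.OutputTimeUnit;
     import org.openjdk.jmh.annotations.Scope;
     import org.openjdk.jmh.annotations.State;

     @State(Scope.Thread)
     @BenchmarkMode(Mode.Throughput)
     @OutputTimeUnit(TimeUnit.SECONDS)
     public class WriteMetricBench
     {
         private final Timer writeLatency = new Timer();

         @Benchmark
         public long writeWithTimer()
         {
             long start = System.nanoTime();
             // ... the actual Memtable.put(...) would go here ...
             writeLatency.update(System.nanoTime() - start, TimeUnit.NANOSECONDS);
             return start;
         }

         @Benchmark
         public long writeWithoutTimer()
         {
             // Same work without the metric, to see the delta.
             return System.nanoTime();
         }
     }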

     The most offending of all these metrics is the writeLatency metric. What
     it does is update the latency in Codahale's timer, doing a histogram
     update and then also going through all the parent metrics, which update
     the keyspace writeLatency and the globalWriteLatency. When measuring the
     performance of Memtable.put with a parameter of 1 partition (to reduce
     the ConcurrentSkipListMap search-speed impact - that's a separate issue
     and takes a little bit longer to solve, although I've started to
     prototype something..) on my machine I see 1.3M/s performance with the
     metric, and when it is disabled the performance climbs to 4M/s. So the
     overhead of this single metric is ~2/3 of total performance. That's
     insane. My perf stats indicate that the CPU is starved as it can't get
     enough data in.
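
     (Roughly what each write pays for today, sketched with plain Codahale
     timers; the names are illustrative, not the actual TableMetrics fields:)

     import java.util.concurrent.TimeUnit;

     import com.codahale.metrics.Timer;

     public class LatencyChainSketch
     {
         private final Timer tableWriteLatency = new Timer();
         private final Timer keyspaceWriteLatency = new Timer();
         private final Timer globalWriteLatency = new Timer();

         // Under the current scheme every Memtable.put ends up doing three
         // histogram updates: table, keyspace and global.
         public void recordWrite(long nanos)
         {
             tableWriteLatency.update(nanos, TimeUnit.NANOSECONDS);
             keyspaceWriteLatency.update(nanos, TimeUnit.NANOSECONDS);
             globalWriteLatency.update(nanos, TimeUnit.NANOSECONDS);
         }
     }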

     Removing the replication from TableMetrics to the Keyspace & global
     latencies at write time (and doing it when metrics are requested instead)
     improves the performance to 2.1M/s on my machine. It's an improvement,
     but the overhead is still a huge amount. Even when we pressure the
     ConcurrentSkipListMap with 100 000 partitions in one active Memtable, the
     performance drops by about ~40% due to this metric, so it's never free.
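
     (The read-time alternative, very roughly: only the per-table timer is
     updated on the write path, and the keyspace/global views sum their
     children when somebody asks; again just a sketch with made-up names:)

     import java.util.List;
     import java.util.concurrent.CopyOnWriteArrayList;

     import com.codahale.metrics.Timer;

     public class ReadTimeAggregationSketch
     {
         private final List<Timer> children = new CopyOnWriteArrayList<>();

         public void register(Timer tableWriteLatency)
         {
             children.add(tableWriteLatency);
         }

         // The hot path only touches the per-table timer; the aggregate is
         // computed lazily when the keyspace/global metric is requested.
         public long aggregatedCount()
         {
             long total = 0;
             for (Timer child : children)
                 total += child.getCount();
             return total;
         }
     }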

     I did not find any discussion about replacing the metric processing with
     something faster, so has this been considered before? At least for these
     performance-sensitive ones. The other issue is obviously the use of
     System.nanoTime(), which by itself is very slow (two System.nanoTime()
     calls eat another ~1M/s from the performance).

     My personal quick fix would be to move writeLatency to Keyspace.apply,
     change write-time aggregates to read-time processing (metrics are read
     less often than we write data), and maybe even drop from nanoTime to
     currentTimeMillis (even given its relative lack of precision). That is -
     if these metrics make any sense at all at the CFS level? Maybe they
     should be measured from the network processing time (including all the
     deserializations and such)? Especially if at some point the smarter
     threading / event-looping changes go forward (in which case writes might
     sleep in some "queue" for a while).

        - Micke






