Hi,
I wanted to get some input from the mailing list before making a JIRA
and potential fixes. I'll touch the performance more on latter part, but
there's one important question regarding the write latency metric
recording place. Currently we measure the writeLatency (and metric write
sampler..) in ColumnFamilyStore.apply() and this is also the metric we
then replicate to Keyspace metrics etc.
This is an odd place for writeLatency. Not to mention it is in a
hot-path of Memtable-modifications, but it also does not measure the
real write latency, since it completely ignores the CommitLog latency in
that same process. Is the intention really to measure
Memtable-modification latency only or the actual write latencies?
Then the real issue.. this single metric is a cause of huge overhead in
Memtable processing. There are several metrics / events in the CFS apply
method, including metric sampler, storageHook reportWrite,
colUpdateTimeDeltaHistogram and metric.writeLatency. These are not free
at all when it comes to the processing. I made a small JMH benchmark
here: https://gist.github.com/burmanm/b5b284bc9f1d410b1d635f6d3dac3ade
that I'll be referring to.
The most offending of all these metrics is the writeLatency metric. What
it does is update the latency in codahale's timer, doing a histogram
update and then going through all the parent metrics also which update
the keyspace writeLatency and globalWriteLatency. When measuring the
performance of Memtable.put with parameter of 1 partition (to reduce the
ConcurrentSkipListMap search speed impact - that's separate issue and
takes a little bit longer to solve although I've started to prototype
something..) on my machine I see 1.3M/s performance with the metric and
when it is disabled the performance climbs to 4M/s. So the overhead for
this single metric is ~2/3 of total performance. That's insane. My perf
stats indicate that the CPU is starved as it can't get enough data in.
Removing the replication from TableMetrics to the Keyspace & global
latencies in the write time (and doing this when metrics are requested
instead) improves the performance to 2.1M/s on my machine. It's an
improvement, but it's still huge amount. Even when we pressure the
ConcurrentSkipListMap with 100 000 partitions in one active Memtable,
the performance drops by about ~40% due to this metric, so it's never free.
i did not find any discussion replacing the metric processing with
something faster, so has this been considered before? At least for these
performance sensitive ones. The other issue is obviously the use of
System.nanotime() which by itself is very slow (two System.nanotime()
calls eat another ~1M/s from the performance)
My personal quick fix would be to move writeLatency to Keyspace.apply,
change write time aggregates to read time processing (metrics are read
less often than we write data) and maybe even reduce the nanotime ->
currentTimeMillis (even given it's relative lack of precision). That is
- if these metrics make any sense at all at CFS level? Maybe these
should be measured from the network processing time (including all the
deserializations and such) ? Especially if at some point the smarter
threading / eventlooping changes go forward (in which case they might
sleep at some "queue" for a while).
- Micke
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org