C. Scott Andreas updated CASSANDRA-14611:
-----------------------------------------
    Component/s: Coordination

Quorum reads of wide partitions are expensive due to inefficient hashing in AbstractCell
-----------------------------------------------------------------------------------------

                 Key: CASSANDRA-14611
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14611
             Project: Cassandra
          Issue Type: Improvement
          Components: Coordination
         Environment: AWS, i3.4xlarge (very fast NVMe drives), Linux 4.13, Cassandra 3.0.16, Java 8u162
            Reporter: Joseph Lynch
            Priority: Minor
              Labels: performance
         Attachments: quorum_reads_lots_of_small_columns.svg

While analyzing flame graphs as part of investigating CASSANDRA-14605, I noticed that the digest overhead for quorum reads of partitions with many small columns is much higher than I would expect.

In this particular use case, {{LOCAL_QUORUM}} reads of partitions with (p50, p95, p99) = (60, 5k, 20k) small columns (a {{set<int>}}), digest requests make up 45%+ of the CPU time that Cassandra is using. About 20% goes to reading the columns into the iterator abstraction (bummer), but surprisingly another ~20% is spent _calculating the digests themselves_. I would expect digest calculation itself to be closer to 2% of the total. I've attached a flame graph showing this behavior.

The part that particularly concerns me is that ~17% of CPU time (about a third of the overall digest time) is spent in {{[AbstractCell::digest|https://github.com/apache/cassandra/blob/d52c7b8c595cc0d06fc3607bf16e3f595f016bb6/src/java/org/apache/cassandra/db/rows/AbstractCell.java#L54-L56]}}, while only ~2% is spent digesting the actual cell values. Looking deeper, the implementations of {{[FBUtilities::updateWithInt|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/utils/FBUtilities.java#L834]}} and related functions ({{updateWithBoolean}}, {{updateWithLong}}, etc.) are very inefficient: they call update on the underlying {{MessageDigest}} one byte at a time. When a partition has lots of small cells, the overhead of this approach is significant. In particular, the underlying MD5 implementation is not as efficient as it should be for many single-byte updates; it does buffer them into a 64-byte block, but that is too small to reach maximum throughput. In my local benchmarking, simply grouping the per-cell metadata updates into a single 13-byte array yields a 3x performance improvement, at the cost of 13 bytes of allocation per cell during the hash that we previously didn't allocate.

I've been playing around with a patch that swaps the plain MD5 digest for a buffered one (e.g. a fixed 1 KB byte array per read thread that rolls up digest updates and flushes to the underlying digest whenever the buffer fills). So far, such a buffered MessageDigest shows 2x better performance hashing the metadata of 10k columns in a jmh microbenchmark I wrote. Combining it with grouping the metadata into a 13-byte array gives 10x more throughput. I need to do more profiling, but I think buffering our digest calls could really help for wide partitions of smaller columns (as opposed to fat single columns, where the metadata hashing overhead is probably negligible).
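For readers following along, here is a minimal sketch of the metadata-grouping idea described above. It assumes the 13 bytes per cell are the 8-byte timestamp, 4-byte TTL, and 1-byte counter flag that {{AbstractCell::digest}} feeds through the {{FBUtilities}} helpers; the class and method names are hypothetical, not taken from the actual patch:

{code:java}
import java.nio.ByteBuffer;
import java.security.MessageDigest;

public final class CellMetadataDigest
{
    // 8 bytes timestamp + 4 bytes TTL + 1 byte counter flag = 13 bytes,
    // the assumed per-cell metadata that AbstractCell::digest hashes.
    private static final int METADATA_SIZE = 13;

    /**
     * Packs the cell metadata into one small array and issues a single
     * MessageDigest.update(byte[]) call, instead of the 13 separate
     * single-byte update() calls made by FBUtilities.updateWithLong,
     * updateWithInt, and updateWithBoolean.
     */
    public static void updateWithCellMetadata(MessageDigest digest,
                                              long timestamp,
                                              int ttl,
                                              boolean isCounterCell)
    {
        // This is the 13-byte per-cell allocation the description mentions.
        ByteBuffer buf = ByteBuffer.allocate(METADATA_SIZE);
        buf.putLong(timestamp)
           .putInt(ttl)
           .put((byte) (isCounterCell ? 1 : 0));
        digest.update(buf.array(), 0, METADATA_SIZE);
    }
}
{code}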
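And a sketch of the buffered-digest idea: a plain wrapper around {{MessageDigest}} that rolls small updates into a fixed byte array (e.g. 1 KB, one instance per read thread) and flushes to the wrapped digest only when the buffer fills or the final digest is requested. Again, the names are illustrative and this is not the patch itself:

{code:java}
import java.security.MessageDigest;

public final class BufferedDigest
{
    private final MessageDigest delegate;
    private final byte[] buffer;
    private int position;

    public BufferedDigest(MessageDigest delegate, int bufferSize)
    {
        this.delegate = delegate;
        this.buffer = new byte[bufferSize];
    }

    public void update(byte b)
    {
        if (position == buffer.length)
            flush();
        buffer[position++] = b;
    }

    public void update(byte[] input, int offset, int len)
    {
        // Inputs larger than the buffer bypass it, avoiding a useless copy.
        if (len > buffer.length)
        {
            flush();
            delegate.update(input, offset, len);
            return;
        }
        if (len > buffer.length - position)
            flush();
        System.arraycopy(input, offset, buffer, position, len);
        position += len;
    }

    public byte[] digest()
    {
        flush();
        return delegate.digest();
    }

    // Push buffered bytes to the underlying digest in one update() call,
    // so MD5 sees large chunks instead of a stream of single bytes.
    private void flush()
    {
        if (position > 0)
        {
            delegate.update(buffer, 0, position);
            position = 0;
        }
    }
}
{code}

The design choice here is simply to amortize the per-call overhead of {{MessageDigest.update()}}: MD5 internally processes 64-byte blocks, so handing it one large chunk per kilobyte rather than one byte per call keeps the hot loop in the hash code instead of in call dispatch and bounds checks.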