[https://issues.apache.org/jira/browse/CASSANDRA-15213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012511#comment-17012511]

Jordan West edited comment on CASSANDRA-15213 at 1/10/20 4:14 PM:
------------------------------------------------------------------

I've updated [https://github.com/jrwest/cassandra/tree/jwest/15213] with the
changes you suggested. Regarding the extra bucket, I realized that, even more
simply, we could return {{bucketOffsets.length}} whenever the estimate is
greater than or equal to it, saving us an array access in that case altogether.
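For concreteness, here is a minimal sketch of what that fast path could look like, assuming the bucket offsets grow by a ~1.2x factor and the log-based estimate feeds a short linear search. The method name and layout are illustrative only, not the actual {{DecayingEstimatedHistogramReservoir}} internals:

{code:java}
// Illustrative sketch only -- names and layout are assumptions, not the
// actual DecayingEstimatedHistogramReservoir code. Offsets grow by ~1.2x,
// so log_1.2(value) approximates the bucket index.
static int bucketIndex(long[] bucketOffsets, long value)
{
    if (value <= 0)
        return 0; // smallest bucket; also avoids taking the log of a non-positive value

    int estimate = (int) (Math.log(value) / Math.log(1.2));

    // The fast path discussed above: an estimate at or past the end means the
    // overflow bucket, and we can return without touching the array at all.
    if (estimate >= bucketOffsets.length)
        return bucketOffsets.length;

    // The estimate can be off by a little, so finish with a short linear search.
    while (estimate > 0 && value <= bucketOffsets[estimate - 1])
        estimate--;
    while (estimate < bucketOffsets.length && value > bucketOffsets[estimate])
        estimate++;
    return estimate;
}
{code}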

I also added a very simple striping approach, without any attempt at
distributing the buckets more uniformly. The performance is about the same (the
runs are within each other's margin of error) and still significantly better
than before CASSANDRA-14281. From the memory-consumption perspective, however,
4 stripes end up allocating more than 12MiB after startup. This isn't
surprising: with a single stripe (or pre-CASSANDRA-14281) the footprint after
startup was a little more than 3MiB, and we're quadrupling the size of each
array. However, these arrays won't grow under contention like {{LongAdder[]}}
would: memory consumption scales with the number of histograms created, not
with both the number created and the contention, as it did before. I've found
comparable performance using 2 stripes on my 4-core machine. I have an 8-core
machine available that I can test with tomorrow. If the memory consumption is
still a concern I can investigate on-demand striping, but I prefer the simpler
approach and think the trade-off of 2-4 stripes is reasonable. A minimal sketch
of the striping idea follows the benchmark numbers below.

 
{code:java}
3.0 (4 core):
     [java] Benchmark                                       Mode  Cnt        Score        Error  Units
     [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  5848504.963 ± 147881.511  ops/s
     [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5  1946154.537 ± 623185.290  ops/s

Trunk (4 core):
     [java] Benchmark                                       Mode  Cnt         Score         Error  Units
     [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  21887453.678 ± 2108481.285  ops/s
     [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5   8908817.316 ±  115453.642  ops/s

15213 (4 core, linear search changes only):
     [java] Benchmark                                       Mode  Cnt         Score         Error  Units
     [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  24646022.304 ± 1052105.818  ops/s
     [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5   9175928.594 ±  269984.204  ops/s

15213 (4 core, 4 stripes, striping and linear search):
     [java] Benchmark                                       Mode  Cnt         Score        Error  Units
     [java] LatencyTrackingBench.benchInsertToDEHR         thrpt    5  18818181.576 ± 506997.366  ops/s
     [java] LatencyTrackingBench.benchLatencyMetricsWrite  thrpt    5   8895569.814 ± 154219.113  ops/s
{code}
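As promised above, here is a minimal sketch of the fixed-striping idea, under the assumption that each stripe is a full copy of the bucket array and a thread picks a stripe from its id. The class and field names are hypothetical, not the patch's actual structure:

{code:java}
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch: stripes laid out contiguously in one AtomicLongArray.
// Memory is fixed at nStripes * nBuckets longs regardless of contention --
// unlike LongAdder[], which grows its cell array under contention.
final class StripedBuckets
{
    private final AtomicLongArray counts;
    private final int nStripes;
    private final int nBuckets;

    StripedBuckets(int nStripes, int nBuckets)
    {
        this.nStripes = nStripes;
        this.nBuckets = nBuckets;
        this.counts = new AtomicLongArray(nStripes * nBuckets);
    }

    void increment(int bucket)
    {
        // Threads hash to a stripe by id, so concurrent writers mostly land
        // on different copies of the bucket and CAS retries drop.
        int stripe = (int) (Thread.currentThread().getId() % nStripes);
        counts.incrementAndGet(stripe * nBuckets + bucket);
    }

    long count(int bucket)
    {
        // Reads sum the bucket across all stripes to recover the total.
        long sum = 0;
        for (int s = 0; s < nStripes; s++)
            sum += counts.get(s * nBuckets + bucket);
        return sum;
    }
}
{code}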
 



> DecayingEstimatedHistogramReservoir Inefficiencies
> --------------------------------------------------
>
>                 Key: CASSANDRA-15213
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15213
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Observability/Metrics
>            Reporter: Benedict Elliott Smith
>            Assignee: Jordan West
>            Priority: Normal
>             Fix For: 4.0-beta
>
>
> * {{LongAdder}} introduced to trunk consumes 9MiB of heap without user 
> schemas, and this will grow significantly under contention and user schemas 
> with many tables.  This is because {{LongAdder}} is a very heavy class 
> designed for single contended values.  
>  ** This can likely be improved significantly, without significant loss of 
> performance in the contended case, by simply increasing the size of our 
> primitive backing array and providing multiple buckets, with each thread 
> picking a bucket to increment, or simply multiple backing arrays.  Probably a 
> better way still to do this would be to introduce some competition detection 
> to the update, much like {{LongAdder}} utilises, that increases the number of 
> backing arrays under competition.
>  ** To save memory this approach could partition the space into chunks that 
> are likely to be updated together, so that we do not need to duplicate the 
> entire array under competition.
>  * Similarly, binary search is costly and a measurable cost as a share of the 
> new networking work (without filtering it was > 10% of the CPU used overall). 
>  We can compute an approximation floor(log2 n / log2 1.2) extremely cheaply, 
> to save the random memory access costs.


