[GitHub] [pinot] gortiz opened a new pull request, #8766: Optimize ColumnValueSegmentPruner by caching value hashes

GitBox Tue, 24 May 2022 09:04:58 -0700


gortiz opened a new pull request, #8766:
URL: https://github.com/apache/pinot/pull/8766


   `ColumnValueSegmentPruner` was calculating the hash of each value once per 
segment. This PR introduces a hash cache that is populated the first time each 
value is hashed, so later segments reuse the cached hash instead of recomputing 
it.
   
   The cost of this map is that it may have more entries than a normal map, but 
the number of entries is always bounded by the number of literal expressions in 
the WHERE clause.
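
As a rough illustration of the idea (not the code in this PR), the cache can be sketched as a map from literal to hash that is filled lazily. The class name `ValueHashCache`, the `simulate` helper and the FNV-1a-style hash are all hypothetical stand-ins chosen for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: hash each literal once, then reuse the hash for every segment. */
final class ValueHashCache {
  /** Counts how many times the stand-in hash function actually ran. */
  int hashComputations = 0;

  // At most one entry per distinct literal in the filter, so the map stays small.
  private final Map<String, Long> cache = new HashMap<>();

  /** Returns the cached hash for the literal, computing it on first use only. */
  long hashOf(String literal) {
    Long cached = cache.get(literal);
    if (cached == null) {
      cached = fnv1a64(literal);
      hashComputations++;
      cache.put(literal, cached);
    }
    return cached;
  }

  int size() {
    return cache.size();
  }

  /** FNV-1a-style 64-bit hash over UTF-16 code units; an assumed stand-in hash. */
  static long fnv1a64(String s) {
    long h = 0xcbf29ce484222325L;
    for (int i = 0; i < s.length(); i++) {
      h ^= s.charAt(i);
      h *= 0x100000001b3L;
    }
    return h;
  }

  /** Simulates pruning: every segment asks for the hash of every literal. */
  static ValueHashCache simulate(List<String> literals, int numSegments) {
    ValueHashCache cache = new ValueHashCache();
    for (int segment = 0; segment < numSegments; segment++) {
      for (String literal : literals) {
        cache.hashOf(literal);
      }
    }
    return cache;
  }
}
```

With two distinct literals and 100 segments, `simulate` performs 200 lookups but only two hash computations, which is the saving the cache is meant to provide.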
   
   The PR includes two JMH microbenchmarks:
   - `BenchmarkBloomFilter` shows the performance difference between calling 
`mightContain(String)` several times and calling `mightContain(long, long)` 
several times.
   - `BenchmarkServerSegmentPruner` tests `ColumnValueSegmentPruner`. It first 
creates several IndexSegments and then, given a query, measures how much time 
is spent pruning segments for it.
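
To make the difference between the two lookup styles concrete, here is a toy Bloom filter with both overloads. It is only a sketch: `ToyBloomFilter`, its two hash functions and the double-hashing scheme are assumptions made for the example, not Pinot's Bloom filter reader.

```java
import java.util.BitSet;

/** Toy Bloom filter (not Pinot's implementation) exposing both lookup styles. */
final class ToyBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  ToyBloomFilter(int numBits, int numHashes) {
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
  }

  /** First hash: FNV-1a-style over UTF-16 code units; an assumed stand-in. */
  static long hash1(String value) {
    long h = 0xcbf29ce484222325L;
    for (int i = 0; i < value.length(); i++) {
      h ^= value.charAt(i);
      h *= 0x100000001b3L;
    }
    return h;
  }

  /** Second hash: a simple multiplicative mix with another seed; also a stand-in. */
  static long hash2(String value) {
    long h = 0x9E3779B97F4A7C15L;
    for (int i = 0; i < value.length(); i++) {
      h = (h ^ value.charAt(i)) * 0xFF51AFD7ED558CCDL;
      h ^= h >>> 33;
    }
    return h;
  }

  void add(String value) {
    long h1 = hash1(value);
    long h2 = hash2(value);
    for (int i = 0; i < numHashes; i++) {
      bits.set(index(h1, h2, i));
    }
  }

  /** Hashes the value on every call. */
  boolean mightContain(String value) {
    return mightContain(hash1(value), hash2(value));
  }

  /** Skips hashing: the caller supplies hashes it computed (and possibly cached) earlier. */
  boolean mightContain(long h1, long h2) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(index(h1, h2, i))) {
        return false;
      }
    }
    return true;
  }

  /** Double hashing: the i-th probe position is (h1 + i * h2) mod numBits. */
  private int index(long h1, long h2, int i) {
    return (int) Math.floorMod(h1 + i * h2, (long) numBits);
  }
}
```

A caller that checks the same value against many filters can compute `hash1` and `hash2` once and pass them to the `(long, long)` overload for each filter.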
   
   `BenchmarkServerSegmentPruner` has been executed with the previous 
`ColumnValueSegmentPruner` version and with the optimized one. The results on 
my laptop are the following:
   
   ```
   Without optimization
   Benchmark                           (_numRows)  (_numSegments)  Mode  Cnt    Score    Error  Units
   BenchmarkServerSegmentPruner.query          10              10  avgt    5    1.426 ±  0.003  us/op
   BenchmarkServerSegmentPruner.query          10             100  avgt    5   14.719 ±  1.342  us/op
   BenchmarkServerSegmentPruner.query          10            1000  avgt    5  174.879 ±  5.115  us/op
   BenchmarkServerSegmentPruner.query         100              10  avgt    5    1.459 ±  0.009  us/op
   BenchmarkServerSegmentPruner.query         100             100  avgt    5   13.966 ±  0.520  us/op
   BenchmarkServerSegmentPruner.query        1000              10  avgt    5    1.445 ±  0.004  us/op
   BenchmarkServerSegmentPruner.query        1000             100  avgt    5   15.349 ±  0.071  us/op
   
   With optimization
   Benchmark                           (_numRows)  (_numSegments)  Mode  Cnt    Score    Error  Units
   BenchmarkServerSegmentPruner.query          10              10  avgt    5    0.972 ±  0.003  us/op
   BenchmarkServerSegmentPruner.query          10             100  avgt    5   10.149 ±  0.046  us/op
   BenchmarkServerSegmentPruner.query          10            1000  avgt    5  135.063 ± 14.670  us/op
   BenchmarkServerSegmentPruner.query         100              10  avgt    5    0.960 ±  0.006  us/op
   BenchmarkServerSegmentPruner.query         100             100  avgt    5   10.512 ±  0.054  us/op
   BenchmarkServerSegmentPruner.query        1000              10  avgt    5    0.796 ±  0.003  us/op
   BenchmarkServerSegmentPruner.query        1000             100  avgt    5   10.636 ±  0.021  us/op
   ```
   
   As expected, the cost seems to be linear in the number of segments, and the 
number of rows does not seem to affect the results in a significant way. Given 
that the tests with 1000 segments take a while to build, I decided not to run 
them with different numbers of rows. In this benchmark the time spent pruning 
has been reduced by roughly 23% to 45% depending on the configuration, with 
most cases close to 30%.
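
The per-configuration reduction can be recomputed from the two result tables as `(before - after) / before`. The snippet below is a small sketch of that arithmetic; the class name `PruningSpeedup` is made up for the example and the scores are copied from the tables above.

```java
/** Recomputes the relative time reduction from the benchmark scores (us/op). */
final class PruningSpeedup {
  /** Percentage of time saved when going from `before` to `after`. */
  static double reductionPercent(double before, double after) {
    return (before - after) / before * 100.0;
  }

  public static void main(String[] args) {
    // Columns: numRows, numSegments, score without optimization, score with optimization.
    double[][] results = {
      {10, 10, 1.426, 0.972},
      {10, 100, 14.719, 10.149},
      {10, 1000, 174.879, 135.063},
      {100, 10, 1.459, 0.960},
      {100, 100, 13.966, 10.512},
      {1000, 10, 1.445, 0.796},
      {1000, 100, 15.349, 10.636},
    };
    for (double[] r : results) {
      System.out.printf(
          "rows=%d segments=%d reduction=%.1f%%%n",
          (int) r[0], (int) r[1], reductionPercent(r[2], r[3]));
    }
  }
}
```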

