gortiz opened a new pull request, #8766: URL: https://github.com/apache/pinot/pull/8766
`ColumnValueSegmentPruner` was calculating the hash of each value once per segment. This PR introduces a hash cache that is populated the first time each value is hashed, so every subsequent segment reuses the stored hash. The cost of this cache is that it may hold more entries than a normal map, but the number of entries is always bounded by the number of literal expressions in the WHERE clause.

The PR includes two JMH microbenchmarks:

- `BenchmarkBloomFilter` shows the performance difference between calling `mightContain(String)` several times and calling `mightContain(long, long)` several times.
- `BenchmarkServerSegmentPruner` tests `ColumnValueSegmentPruner`. It first creates several `IndexSegment` instances and then, given a query, measures how much time is spent pruning them.

`BenchmarkServerSegmentPruner` has been executed with the previous `ColumnValueSegmentPruner` version and with the optimized one. The results on my laptop are the following:

```
Without optimization
Benchmark                           (_numRows)  (_numSegments)  Mode  Cnt    Score    Error  Units
BenchmarkServerSegmentPruner.query          10              10  avgt    5    1.426 ±  0.003  us/op
BenchmarkServerSegmentPruner.query          10             100  avgt    5   14.719 ±  1.342  us/op
BenchmarkServerSegmentPruner.query          10            1000  avgt    5  174.879 ±  5.115  us/op
BenchmarkServerSegmentPruner.query         100              10  avgt    5    1.459 ±  0.009  us/op
BenchmarkServerSegmentPruner.query         100             100  avgt    5   13.966 ±  0.520  us/op
BenchmarkServerSegmentPruner.query        1000              10  avgt    5    1.445 ±  0.004  us/op
BenchmarkServerSegmentPruner.query        1000             100  avgt    5   15.349 ±  0.071  us/op

With optimization
Benchmark                           (_numRows)  (_numSegments)  Mode  Cnt    Score    Error  Units
BenchmarkServerSegmentPruner.query          10              10  avgt    5    0.972 ±  0.003  us/op
BenchmarkServerSegmentPruner.query          10             100  avgt    5   10.149 ±  0.046  us/op
BenchmarkServerSegmentPruner.query          10            1000  avgt    5  135.063 ± 14.670  us/op
BenchmarkServerSegmentPruner.query         100              10  avgt    5    0.960 ±  0.006  us/op
BenchmarkServerSegmentPruner.query         100             100  avgt    5   10.512 ±  0.054  us/op
BenchmarkServerSegmentPruner.query        1000              10  avgt    5    0.796 ±  0.003  us/op
BenchmarkServerSegmentPruner.query        1000             100  avgt    5   10.636 ±  0.021  us/op
```

As expected, the cost seems to be linear in the number of segments, and the number of rows doesn't seem to affect the results significantly. Given that the test with 1000 segments takes a while to build, I've decided not to run it with different numbers of rows.

In this benchmark the time spent pruning has been reduced by between 23% and 45%, depending on the configuration.
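
The caching idea described above can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: `HashCacheSketch`, `TinyBloomFilter`, `segmentOf` and `countKeptSegments` are hypothetical names, the hash is a simple FNV-1a style stand-in, and the only thing borrowed from the description is the idea of a `mightContain(long, long)` call that accepts a precomputed hash. Each literal is hashed the first time it is needed and the result is reused for every later segment.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Pinot codebase.
final class HashCacheSketch {
  static final int NUM_BITS = 1024;
  static final int NUM_HASHES = 3;
  static int hashCalls = 0; // counts how often the (assumed expensive) hash runs

  // Stand-in hash: two FNV-1a style passes with different seeds.
  static long[] hash(String value) {
    hashCalls++;
    long h1 = 0xcbf29ce484222325L;
    long h2 = 0x9e3779b97f4a7c15L;
    for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
      h1 = (h1 ^ (b & 0xff)) * 0x100000001b3L;
      h2 = (h2 ^ (b & 0xff)) * 0x100000001b3L;
    }
    return new long[] {h1, h2};
  }

  // Toy per-segment bloom filter that is probed with a precomputed hash.
  static final class TinyBloomFilter {
    private final BitSet bits = new BitSet(NUM_BITS);

    void add(long h1, long h2) {
      for (int i = 0; i < NUM_HASHES; i++) {
        bits.set(index(h1, h2, i));
      }
    }

    boolean mightContain(long h1, long h2) {
      for (int i = 0; i < NUM_HASHES; i++) {
        if (!bits.get(index(h1, h2, i))) {
          return false;
        }
      }
      return true;
    }

    // Double hashing: the i-th probe position is derived from h1 + i * h2.
    private static int index(long h1, long h2, int i) {
      return (int) Math.floorMod(h1 + i * h2, (long) NUM_BITS);
    }
  }

  static TinyBloomFilter segmentOf(String... values) {
    TinyBloomFilter filter = new TinyBloomFilter();
    for (String value : values) {
      long[] h = hash(value);
      filter.add(h[0], h[1]);
    }
    return filter;
  }

  // Keeps a segment if any literal might be present; hashes each literal at most once.
  static int countKeptSegments(List<TinyBloomFilter> segments, List<String> literals) {
    Map<String, long[]> cache = new HashMap<>(); // bounded by the number of literals
    int kept = 0;
    for (TinyBloomFilter segment : segments) {
      for (String literal : literals) {
        long[] h = cache.computeIfAbsent(literal, HashCacheSketch::hash);
        if (segment.mightContain(h[0], h[1])) {
          kept++;
          break;
        }
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<TinyBloomFilter> segments =
        Arrays.asList(segmentOf("a"), segmentOf("b"), segmentOf("c"));
    hashCalls = 0; // ignore the hashing done while building the segments
    int kept = countKeptSegments(segments, Arrays.asList("a", "b"));
    System.out.println("kept=" + kept + ", hashCalls=" + hashCalls);
  }
}
```

In this sketch the number of hash computations during pruning depends only on the number of distinct literals, not on the number of segments, which is the property the cache is meant to provide.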
