guiyanakuang commented on pull request #915: URL: https://github.com/apache/orc/pull/915#issuecomment-936629718
I have completed a benchmark (dba9a1a) using the current implementation for the time being, to show the benefit of custom statistics. It adds OptimizeFilterBenchmark, which measures the default query and the same query with its filter conditions re-ordered according to the percentage of values each filter matches, as estimated by TDigest.

proportion: ratio of cardinality between columns
quota: minimum cardinality

```
Benchmark                             (proportion)  (quota)  Mode  Cnt     Score     Error  Units
OptimizeFilterBenchmark.noUseTDigest             2       10  avgt   20  1052.305 ±  10.632  us/op
OptimizeFilterBenchmark.noUseTDigest             2      100  avgt   20  1109.375 ±  10.162  us/op
OptimizeFilterBenchmark.noUseTDigest             2     1000  avgt   20  1173.790 ±  11.696  us/op
OptimizeFilterBenchmark.noUseTDigest             3       10  avgt   20  1056.139 ±   8.359  us/op
OptimizeFilterBenchmark.noUseTDigest             3      100  avgt   20  1154.665 ±   9.152  us/op
OptimizeFilterBenchmark.noUseTDigest             3     1000  avgt   20  1168.113 ±   9.115  us/op
OptimizeFilterBenchmark.useTDigest               2       10  avgt   20  1116.076 ±   6.330  us/op
OptimizeFilterBenchmark.useTDigest               2      100  avgt   20  1162.956 ±   9.865  us/op
OptimizeFilterBenchmark.useTDigest               2     1000  avgt   20  1220.028 ±  22.544  us/op
OptimizeFilterBenchmark.useTDigest               3       10  avgt   20  1114.617 ±  10.220  us/op
OptimizeFilterBenchmark.useTDigest               3      100  avgt   20  1219.488 ± 138.798  us/op
OptimizeFilterBenchmark.useTDigest               3     1000  avgt   20   651.001 ±  20.784  us/op
```

The results show some overhead from reading the custom statistics structures: roughly 45 to 65 us/op in five of the six configurations. When the filter conditions match only sparse values, reordering them gives a larger gain, because it allows early pruning. With proportion=3 and quota=1000, useTDigest takes 651.001 us/op against 1168.113 us/op for noUseTDigest.
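To illustrate the reordering idea, here is a minimal sketch and not the code from this PR. It assumes each column has some statistics object that can estimate the fraction of values at or below a literal. `ColumnStats`, `LessEqualsFilter` and `reorder` are hypothetical names, and `ColumnStats` is an exact empirical CDF over a small sample standing in for a real TDigest. The filter expected to keep the fewest rows is evaluated first, so later filters see fewer rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FilterReorderSketch {

    /** Hypothetical stand-in for a per-column digest: an exact empirical CDF over a sample. */
    static final class ColumnStats {
        private final double[] sorted;

        ColumnStats(double[] sample) {
            this.sorted = sample.clone();
            Arrays.sort(this.sorted);
        }

        /** Fraction of sampled values that are less than or equal to x. */
        double cdf(double x) {
            int count = 0;
            for (double v : sorted) {
                if (v <= x) {
                    count++;
                } else {
                    break; // values are sorted, so no later value can match
                }
            }
            return sorted.length == 0 ? 0.0 : (double) count / sorted.length;
        }
    }

    /** A "column <= literal" predicate with the estimated fraction of rows it keeps. */
    static final class LessEqualsFilter {
        final String column;
        final double literal;
        final double selectivity;

        LessEqualsFilter(String column, double literal, ColumnStats stats) {
            this.column = column;
            this.literal = literal;
            this.selectivity = stats.cdf(literal);
        }
    }

    /** Returns a copy ordered so the filter expected to keep the fewest rows comes first. */
    static List<LessEqualsFilter> reorder(List<LessEqualsFilter> filters) {
        List<LessEqualsFilter> ordered = new ArrayList<>(filters);
        ordered.sort(Comparator.comparingDouble((LessEqualsFilter f) -> f.selectivity));
        return ordered;
    }

    public static void main(String[] args) {
        ColumnStats a = new ColumnStats(new double[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10});
        ColumnStats b = new ColumnStats(
            new double[] {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000});

        List<LessEqualsFilter> filters = new ArrayList<>();
        filters.add(new LessEqualsFilter("a", 9, a));   // matches most of the sample
        filters.add(new LessEqualsFilter("b", 100, b)); // matches little of the sample

        for (LessEqualsFilter f : reorder(filters)) {
            System.out.println(f.column + " <= " + f.literal
                + " (estimated selectivity " + f.selectivity + ")");
        }
    }
}
```

The sort only changes evaluation order, so the conjunction of the filters is unchanged. Computing the estimates is extra work per query, which is consistent with the overhead seen in the rows where reordering does not help.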