[ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487977#comment-17487977 ]
ASF subversion and git services commented on LUCENE-10315: ---------------------------------------------------------- Commit 28ba89b8c951cc757e99eea22095375ddeb49f70 in lucene's branch refs/heads/branch_9x from gf2121 [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=28ba89b ] LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil (#541) > Speed up BKD leaf block ids codec by a 512 ints ForUtil > ------------------------------------------------------- > > Key: LUCENE-10315 > URL: https://issues.apache.org/jira/browse/LUCENE-10315 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Feng Guo > Assignee: Feng Guo > Priority: Major > Attachments: addall.svg > > Time Spent: 6h 20m > Remaining Estimate: 0h > > Elasticsearch (which based on lucene) can automatically infers types for > users with its dynamic mapping feature. When users index some low cardinality > fields, such as gender / age / status... they often use some numbers to > represent the values, while ES will infer these fields as {{{}long{}}}, and > ES uses BKD as the index of {{long}} fields. When the data volume grows, > building the result set of low-cardinality fields will make the CPU usage and > load very high. > This is a flame graph we obtained from the production environment: > [^addall.svg] > It can be seen that almost all CPU is used in addAll. When we reindex > {{long}} to {{{}keyword{}}}, the cluster load and search latency are greatly > reduced ( We spent weeks of time to reindex all indices... ). I know that ES > recommended to use {{keyword}} for term/terms query and {{long}} for range > query in the document, but there are always some users who didn't realize > this and keep their habit of using sql database, or dynamic mapping > automatically selects the type for them. All in all, users won't realize that > there would be such a big difference in performance between {{long}} and > {{keyword}} fields in low cardinality fields. So from my point of view it > will make sense if we can make BKD works better for the low/medium > cardinality fields. > As far as i can see, for low cardinality fields, there are two advantages of > {{keyword}} over {{{}long{}}}: > 1. {{ForUtil}} used in {{keyword}} postings is much more efficient than BKD's > delta VInt, because its batch reading (readLongs) and SIMD decode. > 2. When the query term count is less than 16, {{TermsInSetQuery}} can lazily > materialize of its result set, and when another small result clause > intersects with this low cardinality condition, the low cardinality field can > avoid reading all docIds into memory. > This ISSUE is targeting to solve the first point. The basic idea is trying to > use a 512 ints {{ForUtil}} for BKD ids codec. I benchmarked this optimization > by mocking some random {{LongPoint}} and querying them with > {{PointInSetQuery}}. > *Benchmark Result* > |doc count|field cardinality|query point|baseline QPS|candidate QPS|diff > percentage| > |100000000|32|1|51.44|148.26|188.22%| > |100000000|32|2|26.8|101.88|280.15%| > |100000000|32|4|14.04|53.52|281.20%| > |100000000|32|8|7.04|28.54|305.40%| > |100000000|32|16|3.54|14.61|312.71%| > |100000000|128|1|110.56|350.26|216.81%| > |100000000|128|8|16.6|89.81|441.02%| > |100000000|128|16|8.45|48.07|468.88%| > |100000000|128|32|4.2|25.35|503.57%| > |100000000|128|64|2.13|13.02|511.27%| > |100000000|1024|1|536.19|843.88|57.38%| > |100000000|1024|8|109.71|251.89|129.60%| > |100000000|1024|32|33.24|104.11|213.21%| > |100000000|1024|128|8.87|30.47|243.52%| > |100000000|1024|512|2.24|8.3|270.54%| > |100000000|8192|1|3333.33|5000|50.00%| > |100000000|8192|32|139.47|214.59|53.86%| > |100000000|8192|128|54.59|109.23|100.09%| > |100000000|8192|512|15.61|36.15|131.58%| > |100000000|8192|2048|4.11|11.14|171.05%| > |100000000|1048576|1|2597.4|3030.3|16.67%| > |100000000|1048576|32|314.96|371.75|18.03%| > |100000000|1048576|128|99.7|116.28|16.63%| > |100000000|1048576|512|30.5|37.15|21.80%| > |100000000|1048576|2048|10.38|12.3|18.50%| > |100000000|8388608|1|2564.1|3174.6|23.81%| > |100000000|8388608|32|196.27|238.95|21.75%| > |100000000|8388608|128|55.36|68.03|22.89%| > |100000000|8388608|512|15.58|19.24|23.49%| > |100000000|8388608|2048|4.56|5.71|25.22%| > The indices size is reduced for low cardinality fields and flat for high > cardinality fields. > {code:java} > 113M index_100000000_doc_32_cardinality_baseline > 114M index_100000000_doc_32_cardinality_candidate > 140M index_100000000_doc_128_cardinality_baseline > 133M index_100000000_doc_128_cardinality_candidate > 193M index_100000000_doc_1024_cardinality_baseline > 174M index_100000000_doc_1024_cardinality_candidate > 241M index_100000000_doc_8192_cardinality_baseline > 233M index_100000000_doc_8192_cardinality_candidate > 314M index_100000000_doc_1048576_cardinality_baseline > 315M index_100000000_doc_1048576_cardinality_candidate > 392M index_100000000_doc_8388608_cardinality_baseline > 391M index_100000000_doc_8388608_cardinality_candidate > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org