[ https://issues.apache.org/jira/browse/LUCENE-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412204#comment-17412204 ]
Marc D'Mello commented on LUCENE-10080:
---------------------------------------

Here is the pull request link for the change: [https://github.com/apache/lucene/pull/288]

I ran this change on {{wikimediumall}} and got the following results:
{code:java}
Task                       QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff                  p-value
HighTermMonthSort                 31.46  (18.1%)                    30.46  (17.8%)     -3.2%  ( -33% -  39%)  0.573
IntNRQ                            26.95  (37.9%)                    26.40  (35.2%)     -2.1%  ( -54% - 114%)  0.859
TermDTSort                        44.88  (17.3%)                    44.04  (16.8%)     -1.9%  ( -30% -  39%)  0.728
Fuzzy1                            51.03   (8.8%)                    50.21   (9.3%)     -1.6%  ( -18% -  18%)  0.578
HighTermDayOfYearSort             62.41  (11.2%)                    61.41  (10.7%)     -1.6%  ( -21% -  22%)  0.646
AndHighMed                        61.78   (7.2%)                    60.91   (6.9%)     -1.4%  ( -14% -  13%)  0.524
BrowseDayOfYearSSDVFacets          4.43   (7.5%)                     4.39   (6.8%)     -1.0%  ( -14% -  14%)  0.647
OrNotHighHigh                    454.43   (7.4%)                   449.88   (6.7%)     -1.0%  ( -14% -  14%)  0.654
BrowseMonthSSDVFacets              4.87   (7.3%)                     4.82   (6.6%)     -1.0%  ( -13% -  13%)  0.653
AndHighLow                       484.80   (7.8%)                   480.21   (8.4%)     -0.9%  ( -15% -  16%)  0.712
HighIntervalsOrdered               1.14   (7.8%)                     1.13   (7.8%)     -0.9%  ( -15% -  16%)  0.730
HighSloppyPhrase                  11.92   (7.8%)                    11.82   (7.6%)     -0.8%  ( -15% -  15%)  0.737
BrowseMonthTaxoFacets              0.82   (8.6%)                     0.81   (8.0%)     -0.8%  ( -16% -  17%)  0.764
OrHighMed                         53.48   (7.0%)                    53.09   (6.6%)     -0.7%  ( -13% -  13%)  0.734
OrHighHigh                         8.81   (6.4%)                     8.75   (6.1%)     -0.7%  ( -12% -  12%)  0.716
BrowseDateTaxoFacets               0.78   (8.4%)                     0.77   (8.2%)     -0.7%  ( -15% -  17%)  0.786
Wildcard                          62.05   (8.6%)                    61.66  (10.0%)     -0.6%  ( -17% -  19%)  0.830
BrowseDayOfYearTaxoFacets          0.78   (8.4%)                     0.77   (8.2%)     -0.6%  ( -15% -  17%)  0.821
MedPhrase                         19.38   (8.0%)                    19.27   (7.8%)     -0.6%  ( -15% -  16%)  0.819
Respell                           28.41   (7.7%)                    28.25   (7.2%)     -0.5%  ( -14% -  15%)  0.820
AndHighHigh                       55.43   (7.1%)                    55.15   (6.2%)     -0.5%  ( -12% -  13%)  0.812
MedSloppyPhrase                   88.90   (8.9%)                    88.47   (8.5%)     -0.5%  ( -16% -  18%)  0.860
LowPhrase                         18.44   (6.8%)                    18.37   (7.4%)     -0.4%  ( -13% -  14%)  0.849
MedIntervalsOrdered                6.51   (7.8%)                     6.49   (7.2%)     -0.4%  ( -14% -  15%)  0.880
LowSpanNear                       10.95   (7.0%)                    10.91   (6.7%)     -0.3%  ( -13% -  14%)  0.878
OrHighNotMed                     544.36   (8.1%)                   542.80   (7.6%)     -0.3%  ( -14% -  16%)  0.909
MedSpanNear                        9.89   (7.8%)                     9.87   (6.9%)     -0.2%  ( -13% -  15%)  0.925
LowSloppyPhrase                   14.41   (7.4%)                    14.38   (6.8%)     -0.2%  ( -13% -  15%)  0.931
LowIntervalsOrdered               36.25   (7.8%)                    36.22   (7.1%)     -0.1%  ( -13% -  16%)  0.972
HighPhrase                       115.92   (8.5%)                   115.85   (8.1%)     -0.1%  ( -15% -  18%)  0.982
PKLookup                         160.74   (6.7%)                   160.80   (6.1%)      0.0%  ( -11% -  13%)  0.985
Prefix3                           72.60   (8.9%)                    72.64   (9.2%)      0.1%  ( -16% -  20%)  0.984
HighSpanNear                       2.85   (7.9%)                     2.86   (6.6%)      0.3%  ( -13% -  16%)  0.912
OrHighNotLow                     570.42   (8.5%)                   572.40   (8.4%)      0.3%  ( -15% -  18%)  0.896
OrNotHighMed                     581.16   (7.9%)                   584.23   (9.2%)      0.5%  ( -15% -  19%)  0.845
OrNotHighLow                     564.66   (8.5%)                   568.35   (9.5%)      0.7%  ( -15% -  20%)  0.819
HighTerm                         713.52   (7.0%)                   718.50   (7.0%)      0.7%  ( -12% -  15%)  0.751
HighTermTitleBDVSort              66.21  (17.1%)                    66.79  (18.1%)      0.9%  ( -29% -  43%)  0.875
MedTerm                         1140.30   (6.1%)                  1150.77   (7.4%)      0.9%  ( -11% -  15%)  0.670
OrHighNotHigh                    433.10   (7.9%)                   440.81   (8.3%)      1.8%  ( -13% -  19%)  0.488
Fuzzy2                            46.81  (14.9%)                    47.66  (13.7%)      1.8%  ( -23% -  35%)  0.688
OrHighLow                        331.28   (8.7%)                   338.74   (9.4%)      2.3%  ( -14% -  22%)  0.432
LowTerm                         1459.96   (8.9%)                  1493.52   (7.5%)      2.3%  ( -12% -  20%)  0.377
{code}
This change shouldn't affect the QPS, but it should ideally improve heap usage. I could do method profiling to get an idea of heap usage changes, but are there other ways I can test to see if there is a heap usage reduction? Thanks!

> Use a bit set to count long-tail of singleton FacetLabels?
> ----------------------------------------------------------
>
>                 Key: LUCENE-10080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10080
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Priority: Major
>
> I was talking about this with [~rcmuir ] about LUCENE-9969, and he had a neat
> idea for more efficient facet counting.
>
> Today we accumulate counts directly in an HPPC native int/int map, or a
> non-sparse {{int[]}} (if enough hits match the query).
>
> But it is likely that many of these facet counts are singletons (occur only
> once in each query). To be more space efficient, we could wrap a bit set
> around the map or {{int[]}}. The first time we see an ordinal, we set its
> bit. The second and subsequent times, we increment the count as we do today.
>
> If we use a non-sparse bitset (e.g. {{FixedBitSet}}) that will add some
> non-sparse heap cost O(maxDoc) for each segment, but if there are enough
> ordinals to count, that can be a win over just the HPPC native int map for
> some cases?
>
> Maybe this could be an intermediate implementation, since we already cover
> the "very low hit count" (use HPPC int/int map) and "very high hit count"
> (using {{int[]}}) today?
>
> Also, this bit set would be able to quickly iterate over the sorted ordinals,
> which might be helpful if we move the three big {{int[]}} into numeric doc
> values?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
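
For readers following the idea in the issue description, the counting scheme (set a bit on the first sighting of an ordinal, fall back to a count on later sightings) might be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the pull request: {{java.util.BitSet}} and {{HashMap}} stand in for Lucene's {{FixedBitSet}} and the HPPC int/int map, and the class name {{SingletonAwareCounter}} and all of its methods are hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of bit-set-backed facet counting; not Lucene's implementation. */
final class SingletonAwareCounter {
  // One bit per ordinal: set once the ordinal has been seen at least once.
  // (Stand-in for a FixedBitSet.)
  private final BitSet seen;
  // Occurrences beyond the first, stored only for ordinals seen two or more times.
  // (Stand-in for an HPPC int/int map.)
  private final Map<Integer, Integer> extra = new HashMap<>();

  SingletonAwareCounter(int maxOrdinal) {
    this.seen = new BitSet(maxOrdinal);
  }

  /** Records one occurrence of the given ordinal. */
  void increment(int ordinal) {
    if (!seen.get(ordinal)) {
      seen.set(ordinal); // first sighting costs only a bit
    } else {
      extra.merge(ordinal, 1, Integer::sum); // later sightings go to the map
    }
  }

  /** Returns the total count for the ordinal (0 if never seen). */
  int get(int ordinal) {
    if (!seen.get(ordinal)) {
      return 0;
    }
    return 1 + extra.getOrDefault(ordinal, 0);
  }

  /** Number of map entries, i.e. ordinals that occurred more than once. */
  int mapEntries() {
    return extra.size();
  }

  /** All seen ordinals in ascending order, read straight from the bit set. */
  int[] sortedOrdinals() {
    return seen.stream().toArray();
  }
}
```

In this sketch a singleton ordinal never creates a map entry, which is where any heap saving would come from; whether that outweighs the fixed cost of the bit set depends on how many ordinals are counted, as the description notes.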