[jira] [Commented] (LUCENE-10080) Use a bit set to count long-tail of singleton FacetLabels?

Marc D'Mello (Jira) Wed, 08 Sep 2021 15:01:06 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412204#comment-17412204
 ]


Marc D'Mello commented on LUCENE-10080:
---------------------------------------

Here is the pull request link for the change: 
[https://github.com/apache/lucene/pull/288]

I ran this change on {{wikimediumall}} and got the following results:
{code:java}
                    TaskQPS baseline      StdDevQPS my_modified_version      
StdDev                Pct diff p-value
       HighTermMonthSort       31.46     (18.1%)       30.46     (17.8%)   
-3.2% ( -33% -   39%) 0.573
                  IntNRQ       26.95     (37.9%)       26.40     (35.2%)   
-2.1% ( -54% -  114%) 0.859
              TermDTSort       44.88     (17.3%)       44.04     (16.8%)   
-1.9% ( -30% -   39%) 0.728
                  Fuzzy1       51.03      (8.8%)       50.21      (9.3%)   
-1.6% ( -18% -   18%) 0.578
   HighTermDayOfYearSort       62.41     (11.2%)       61.41     (10.7%)   
-1.6% ( -21% -   22%) 0.646
              AndHighMed       61.78      (7.2%)       60.91      (6.9%)   
-1.4% ( -14% -   13%) 0.524
BrowseDayOfYearSSDVFacets        4.43      (7.5%)        4.39      (6.8%)   
-1.0% ( -14% -   14%) 0.647
           OrNotHighHigh      454.43      (7.4%)      449.88      (6.7%)   
-1.0% ( -14% -   14%) 0.654
   BrowseMonthSSDVFacets        4.87      (7.3%)        4.82      (6.6%)   
-1.0% ( -13% -   13%) 0.653
              AndHighLow      484.80      (7.8%)      480.21      (8.4%)   
-0.9% ( -15% -   16%) 0.712
    HighIntervalsOrdered        1.14      (7.8%)        1.13      (7.8%)   
-0.9% ( -15% -   16%) 0.730
        HighSloppyPhrase       11.92      (7.8%)       11.82      (7.6%)   
-0.8% ( -15% -   15%) 0.737
   BrowseMonthTaxoFacets        0.82      (8.6%)        0.81      (8.0%)   
-0.8% ( -16% -   17%) 0.764
               OrHighMed       53.48      (7.0%)       53.09      (6.6%)   
-0.7% ( -13% -   13%) 0.734
              OrHighHigh        8.81      (6.4%)        8.75      (6.1%)   
-0.7% ( -12% -   12%) 0.716
    BrowseDateTaxoFacets        0.78      (8.4%)        0.77      (8.2%)   
-0.7% ( -15% -   17%) 0.786
                Wildcard       62.05      (8.6%)       61.66     (10.0%)   
-0.6% ( -17% -   19%) 0.830
BrowseDayOfYearTaxoFacets        0.78      (8.4%)        0.77      (8.2%)   
-0.6% ( -15% -   17%) 0.821
               MedPhrase       19.38      (8.0%)       19.27      (7.8%)   
-0.6% ( -15% -   16%) 0.819
                 Respell       28.41      (7.7%)       28.25      (7.2%)   
-0.5% ( -14% -   15%) 0.820
             AndHighHigh       55.43      (7.1%)       55.15      (6.2%)   
-0.5% ( -12% -   13%) 0.812
         MedSloppyPhrase       88.90      (8.9%)       88.47      (8.5%)   
-0.5% ( -16% -   18%) 0.860
               LowPhrase       18.44      (6.8%)       18.37      (7.4%)   
-0.4% ( -13% -   14%) 0.849
     MedIntervalsOrdered        6.51      (7.8%)        6.49      (7.2%)   
-0.4% ( -14% -   15%) 0.880
             LowSpanNear       10.95      (7.0%)       10.91      (6.7%)   
-0.3% ( -13% -   14%) 0.878
            OrHighNotMed      544.36      (8.1%)      542.80      (7.6%)   
-0.3% ( -14% -   16%) 0.909
             MedSpanNear        9.89      (7.8%)        9.87      (6.9%)   
-0.2% ( -13% -   15%) 0.925
         LowSloppyPhrase       14.41      (7.4%)       14.38      (6.8%)   
-0.2% ( -13% -   15%) 0.931
     LowIntervalsOrdered       36.25      (7.8%)       36.22      (7.1%)   
-0.1% ( -13% -   16%) 0.972
              HighPhrase      115.92      (8.5%)      115.85      (8.1%)   
-0.1% ( -15% -   18%) 0.982
                PKLookup      160.74      (6.7%)      160.80      (6.1%)    
0.0% ( -11% -   13%) 0.985
                 Prefix3       72.60      (8.9%)       72.64      (9.2%)    
0.1% ( -16% -   20%) 0.984
            HighSpanNear        2.85      (7.9%)        2.86      (6.6%)    
0.3% ( -13% -   16%) 0.912
            OrHighNotLow      570.42      (8.5%)      572.40      (8.4%)    
0.3% ( -15% -   18%) 0.896
            OrNotHighMed      581.16      (7.9%)      584.23      (9.2%)    
0.5% ( -15% -   19%) 0.845
            OrNotHighLow      564.66      (8.5%)      568.35      (9.5%)    
0.7% ( -15% -   20%) 0.819
                HighTerm      713.52      (7.0%)      718.50      (7.0%)    
0.7% ( -12% -   15%) 0.751
    HighTermTitleBDVSort       66.21     (17.1%)       66.79     (18.1%)    
0.9% ( -29% -   43%) 0.875
                 MedTerm     1140.30      (6.1%)     1150.77      (7.4%)    
0.9% ( -11% -   15%) 0.670
           OrHighNotHigh      433.10      (7.9%)      440.81      (8.3%)    
1.8% ( -13% -   19%) 0.488
                  Fuzzy2       46.81     (14.9%)       47.66     (13.7%)    
1.8% ( -23% -   35%) 0.688
               OrHighLow      331.28      (8.7%)      338.74      (9.4%)    
2.3% ( -14% -   22%) 0.432
                 LowTerm     1459.96      (8.9%)     1493.52      (7.5%)    
2.3% ( -12% -   20%) 0.377{code}
This change shouldn't affect the QPS, but it should ideally improve heap usage. 
I could do method profiling to get an idea of heap usage changes, but is there 
other ways I can test to see if there is a heap usage reduction? Thanks!

> Use a bit set to count long-tail of singleton FacetLabels?
> ----------------------------------------------------------
>
>                 Key: LUCENE-10080
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10080
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Priority: Major
>
> I was talking about this with [~rcmuir ] about LUCENE-9969, and he had a neat 
> idea for more efficient facet counting.
> Today we accumulate counts directly in an HPPC native int/int map, or a 
> non-sparse {{int[]}} (if enough hits match the query).
> But it is likely that many of these facet counts are singletons (occur only 
> once in each query). To be more space efficient, we could wrap a bit set 
> around the map or {{int[]}}.  The first time we see an ordinal, we set its 
> bit.  The second and subsequent times, we increment the count as we do today.
> If we use a non-sparse bitset (e.g. {{FixedBitSet}}) that will add some 
> non-sparse heap cost O(maxDoc) for each segment, but if there are enough 
> ordinals to count, that can be a win over just the HPPC native int map for 
> some cases?
> Maybe this could be an intermediate implementation, since we already cover 
> the "very low hit count" (use HPPC int/int map) and "very high hit count" 
> (using {{int[]}}) today?
> Also, this bit set would be able to quickly iterate over the sorted ordinals, 
> which might be helpful if we move the three big {{int[]}} into numeric doc 
> values?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-10080) Use a bit set to count long-tail of singleton FacetLabels?

Reply via email to