[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets

Michael McCandless (JIRA) Fri, 08 Feb 2013 05:37:17 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13574481#comment-13574481 ]


Michael McCandless commented on LUCENE-4609:
--------------------------------------------

OK, the new format doesn't do very well.  This is all Wikipedia (6.6M "big" 
docs), 7 facet dims:

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                 MedTerm       46.85      (2.4%)       28.22      (0.7%)  -39.8% ( -41% -  -37%)
                HighTerm       19.09      (2.5%)       12.27      (0.9%)  -35.7% ( -38% -  -33%)
               OrHighLow       16.83      (2.8%)       11.21      (1.0%)  -33.4% ( -36% -  -30%)
               OrHighMed       16.35      (2.8%)       11.00      (1.0%)  -32.7% ( -35% -  -29%)
                 Prefix3       12.87      (2.8%)        8.81      (0.9%)  -31.5% ( -34% -  -28%)
                Wildcard       27.22      (2.2%)       18.68      (0.7%)  -31.4% ( -33% -  -29%)
                 LowTerm      110.58      (1.8%)       79.25      (0.6%)  -28.3% ( -30% -  -26%)
              OrHighHigh        8.61      (2.9%)        6.19      (1.3%)  -28.1% ( -31% -  -24%)
                  IntNRQ        3.54      (2.9%)        2.55      (1.2%)  -27.9% ( -31% -  -24%)
             AndHighHigh       23.19      (1.4%)       17.67      (0.7%)  -23.8% ( -25% -  -22%)
                  Fuzzy1       46.94      (1.7%)       40.34      (1.6%)  -14.1% ( -17% -  -10%)
               MedPhrase      110.00      (5.6%)       98.08      (4.2%)  -10.8% ( -19% -   -1%)
         MedSloppyPhrase       25.93      (2.5%)       23.37      (1.6%)   -9.9% ( -13% -   -5%)
             MedSpanNear       28.43      (2.5%)       25.68      (1.2%)   -9.7% ( -13% -   -6%)
              AndHighMed      105.06      (0.9%)       95.74      (1.0%)   -8.9% ( -10% -   -7%)
               LowPhrase       21.26      (6.2%)       19.86      (5.3%)   -6.6% ( -16% -    5%)
            HighSpanNear        3.53      (2.0%)        3.30      (1.2%)   -6.5% (  -9% -   -3%)
                  Fuzzy2       52.61      (2.6%)       49.64      (2.5%)   -5.6% ( -10% -    0%)
              HighPhrase       17.44     (10.2%)       16.66      (9.5%)   -4.5% ( -21% -   16%)
        HighSloppyPhrase        0.92      (7.3%)        0.88      (5.7%)   -4.5% ( -16% -    9%)
         LowSloppyPhrase       20.28      (3.1%)       19.59      (2.0%)   -3.4% (  -8% -    1%)
                 Respell       46.30      (3.2%)       45.27      (3.4%)   -2.2% (  -8% -    4%)
             LowSpanNear        8.36      (2.8%)        8.20      (1.9%)   -1.9% (  -6% -    2%)
              AndHighLow      578.66      (3.0%)      569.71      (3.1%)   -1.5% (  -7% -    4%)
{noformat}

Also, it consumes quite a bit more RAM / disk: 306 MB of .dvm/d files on disk 
vs 178 MB for trunk (and remember that part of this is the title SortedDV field).
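
For reference, the "Pct diff" values in the table are consistent with the 
relative change in QPS from base to comp, and the same arithmetic applied to 
the index sizes puts the new format at roughly 72% larger on disk. A minimal 
sketch of that calculation follows; the class and method names are 
hypothetical, and the range shown in parentheses in the table comes from the 
benchmark tool and is not reproduced here.

```java
class PctDiffSketch {

    // Relative change from base to comp, in percent (negative means comp is lower).
    static double pctDiff(double base, double comp) {
        return 100.0 * (comp - base) / base;
    }

    public static void main(String[] args) {
        // MedTerm row: QPS base 46.85 vs QPS comp 28.22
        System.out.printf("MedTerm QPS: %.1f%%%n", pctDiff(46.85, 28.22));
        // .dvm/d files on disk: 178 MB (trunk) vs 306 MB (new format)
        System.out.printf("Disk size:   %.1f%%%n", pctDiff(178.0, 306.0));
    }
}
```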
                
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
>                 Key: LUCENE-4609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4609
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Priority: Minor
>         Attachments: LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch, 
> LUCENE-4609.patch, LUCENE-4609.patch, SemiPackedEncoder.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the 
> category ordinals. We have several such encoders, including VInt (default), 
> and block encoders.
> It would be interesting to implement and benchmark a 
> PackedIntsEncoder/Decoder, with potentially two variants: (1) receives 
> bitsPerValue up front, when you e.g. know that you have a small taxonomy and 
> the max value you can see and (2) one that decides for each doc on the 
> optimal bitsPerValue, writes it as a header in the byte[] or something.
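
Variant (2) in the description could be sketched roughly as below. This is 
only an illustration in plain Java under stated assumptions: it does not use 
Lucene's PackedInts utilities or the facets IntEncoder/IntDecoder API, and 
the class name and the header layout (one byte for bitsPerValue, four bytes 
for the value count) are hypothetical choices, not something taken from the 
issue or its patches.

```java
class PerDocPackedSketch {

    // Layout (assumed): [bitsPerValue: 1 byte][count: 4 bytes][values packed MSB-first].
    static byte[] encode(int[] ordinals) {
        int max = 0;
        for (int v : ordinals) {
            if (v < 0) throw new IllegalArgumentException("ordinals must be non-negative");
            max = Math.max(max, v);
        }
        // Smallest width that can hold the largest ordinal of this doc (at least 1 bit).
        int bitsPerValue = Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
        int packedBytes = (int) (((long) ordinals.length * bitsPerValue + 7) / 8);
        byte[] out = new byte[5 + packedBytes];
        out[0] = (byte) bitsPerValue;
        out[1] = (byte) (ordinals.length >>> 24);
        out[2] = (byte) (ordinals.length >>> 16);
        out[3] = (byte) (ordinals.length >>> 8);
        out[4] = (byte) ordinals.length;
        long bitPos = 0;
        for (int v : ordinals) {
            // Write the value one bit at a time, most significant bit first.
            for (int b = bitsPerValue - 1; b >= 0; b--, bitPos++) {
                if (((v >>> b) & 1) != 0) {
                    int shift = (int) (bitPos & 7);
                    out[5 + (int) (bitPos >>> 3)] |= (byte) (0x80 >>> shift);
                }
            }
        }
        return out;
    }

    static int[] decode(byte[] in) {
        int bitsPerValue = in[0];
        int count = ((in[1] & 0xFF) << 24) | ((in[2] & 0xFF) << 16)
                  | ((in[3] & 0xFF) << 8) | (in[4] & 0xFF);
        int[] ordinals = new int[count];
        long bitPos = 0;
        for (int i = 0; i < count; i++) {
            int v = 0;
            for (int b = 0; b < bitsPerValue; b++, bitPos++) {
                int shift = 7 - (int) (bitPos & 7);
                int bit = (in[5 + (int) (bitPos >>> 3)] >>> shift) & 1;
                v = (v << 1) | bit;
            }
            ordinals[i] = v;
        }
        return ordinals;
    }

    public static void main(String[] args) {
        int[] ords = {3, 17, 250, 4};
        byte[] enc = encode(ords);
        System.out.println("bitsPerValue=" + enc[0] + ", bytes=" + enc.length);
        System.out.println(java.util.Arrays.toString(decode(enc)));
    }
}
```

A fixed five-byte header is costly for documents with only a few ordinals, so 
a real encoder would likely want a more compact header; the bit-at-a-time 
loops are also written for clarity rather than speed.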
