[jira] [Commented] (LUCENE-5316) Taxonomy tree traversing improvement

Michael McCandless (JIRA) Mon, 04 Nov 2013 03:21:41 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812757#comment-13812757 ]


Michael McCandless commented on LUCENE-5316:
--------------------------------------------

bq. How about if you make up a hierarchical category, e.g. 
charCount/0-100K/0-10K/0-1K/0-100/0-10? 

Oh, right, I forgot I already have this hierarchical dim ... but I ran
PrintFacetStats and it yields only 1086 ords (fewer than the 3279 ords
of the date dim).

I tested Gilad's latest patch still using NO_PARENTS (change one thing
at a time):

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
              AndHighLow       60.84      (3.5%)       45.81      (0.9%)  -24.7% ( -28% -  -21%)
               MedPhrase       43.63      (2.9%)       35.16      (1.0%)  -19.4% ( -22% -  -15%)
                 LowTerm       40.50      (2.0%)       33.31      (0.9%)  -17.8% ( -20% -  -15%)
            OrNotHighLow       33.70      (4.9%)       28.34      (2.9%)  -15.9% ( -22% -   -8%)
                  Fuzzy1       30.86      (1.9%)       26.61      (0.8%)  -13.8% ( -16% -  -11%)
         LowSloppyPhrase       24.22      (1.7%)       21.51      (0.6%)  -11.2% ( -13% -   -9%)
                  Fuzzy2       25.05      (1.7%)       22.29      (1.0%)  -11.0% ( -13% -   -8%)
            OrNotHighMed       20.09      (3.9%)       18.07      (2.5%)  -10.1% ( -15% -   -3%)
             MedSpanNear       18.16      (2.7%)       16.49      (1.7%)   -9.2% ( -13% -   -4%)
              AndHighMed       16.39      (1.4%)       15.03      (0.9%)   -8.3% ( -10% -   -6%)
             AndHighHigh       14.05      (1.2%)       13.11      (0.7%)   -6.7% (  -8% -   -4%)
               LowPhrase        9.71      (4.6%)        9.06      (4.8%)   -6.6% ( -15% -    2%)
                 Prefix3       13.39      (1.5%)       12.57      (0.8%)   -6.2% (  -8% -   -3%)
                 MedTerm       12.86      (1.1%)       12.11      (1.3%)   -5.8% (  -8% -   -3%)
             LowSpanNear        7.54      (3.8%)        7.19      (3.3%)   -4.7% ( -11% -    2%)
           OrNotHighHigh        9.94      (2.4%)        9.48      (1.5%)   -4.6% (  -8% -    0%)
                HighTerm        8.65      (1.5%)        8.34      (1.4%)   -3.6% (  -6% -    0%)
              HighPhrase        2.81      (5.1%)        2.72      (5.8%)   -3.3% ( -13% -    8%)
            OrHighNotMed        7.23      (1.6%)        7.01      (1.2%)   -3.1% (  -5% -    0%)
               OrHighMed        5.67      (1.9%)        5.52      (0.8%)   -2.7% (  -5% -    0%)
            HighSpanNear        3.26      (3.1%)        3.18      (2.2%)   -2.5% (  -7% -    2%)
           OrHighNotHigh        4.84      (2.1%)        4.73      (1.3%)   -2.4% (  -5% -    1%)
            OrHighNotLow        4.14      (1.3%)        4.05      (1.0%)   -2.1% (  -4% -    0%)
                Wildcard        4.61      (1.4%)        4.52      (1.1%)   -1.8% (  -4% -    0%)
               OrHighLow        2.82      (1.5%)        2.78      (0.7%)   -1.4% (  -3% -    0%)
         MedSloppyPhrase        3.27      (6.2%)        3.23      (7.2%)   -1.2% ( -13% -   13%)
        HighSloppyPhrase        3.39      (6.9%)        3.36      (8.9%)   -0.9% ( -15% -   15%)
              OrHighHigh        2.10      (1.5%)        2.08      (1.0%)   -0.9% (  -3% -    1%)
                  IntNRQ        1.51      (1.3%)        1.50      (0.4%)   -0.5% (  -2% -    1%)
                 Respell       52.58      (2.8%)       53.34      (1.9%)    1.4% (  -3% -    6%)
{noformat}

Curiously, returning null when there are no children didn't seem to
help?

Then I switched all dims to ALL_BUT_DIM (I haven't added per-dim ord
policy control to luceneutil yet):

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
              AndHighLow      103.19      (4.5%)       83.12      (1.2%)  -19.4% ( -24% -  -14%)
               MedPhrase       60.47      (3.8%)       53.05      (1.7%)  -12.3% ( -17% -   -7%)
                 LowTerm       54.28      (3.0%)       48.32      (1.2%)  -11.0% ( -14% -   -7%)
            OrNotHighLow       42.72      (5.3%)       38.99      (4.0%)   -8.7% ( -17% -    0%)
                  Fuzzy1       38.65      (1.9%)       35.70      (1.0%)   -7.6% ( -10% -   -4%)
         LowSloppyPhrase       28.79      (1.8%)       27.02      (1.0%)   -6.2% (  -8% -   -3%)
                  Fuzzy2       29.02      (1.6%)       27.31      (1.2%)   -5.9% (  -8% -   -3%)
             MedSpanNear       20.48      (2.2%)       19.55      (2.3%)   -4.5% (  -8% -    0%)
            OrNotHighMed       22.49      (4.0%)       21.52      (3.3%)   -4.3% ( -11% -    3%)
              AndHighMed       17.94      (1.2%)       17.24      (0.7%)   -3.9% (  -5% -   -2%)
             AndHighHigh       15.04      (1.0%)       14.55      (0.5%)   -3.2% (  -4% -   -1%)
                 Prefix3       14.02      (1.2%)       13.64      (1.0%)   -2.7% (  -4% -    0%)
                 MedTerm       13.47      (1.5%)       13.12      (1.0%)   -2.6% (  -5% -    0%)
               LowPhrase       10.15      (5.5%)        9.94      (5.3%)   -2.1% ( -12% -    9%)
             LowSpanNear        7.82      (3.8%)        7.67      (3.7%)   -1.9% (  -9% -    5%)
                HighTerm        8.83      (1.3%)        8.68      (1.1%)   -1.7% (  -3% -    0%)
           OrNotHighHigh        9.79      (2.0%)        9.63      (1.7%)   -1.6% (  -5% -    2%)
            OrHighNotMed        7.30      (1.7%)        7.22      (1.2%)   -1.1% (  -3% -    1%)
            HighSpanNear        3.27      (1.9%)        3.24      (2.8%)   -0.9% (  -5% -    3%)
               OrHighMed        5.64      (1.6%)        5.59      (1.3%)   -0.9% (  -3% -    2%)
           OrHighNotHigh        4.78      (1.9%)        4.75      (1.5%)   -0.6% (  -3% -    2%)
                Wildcard        4.59      (1.5%)        4.57      (1.4%)   -0.6% (  -3% -    2%)
            OrHighNotLow        4.08      (1.6%)        4.07      (1.5%)   -0.3% (  -3% -    2%)
        HighSloppyPhrase        3.42      (8.7%)        3.42      (8.3%)   -0.2% ( -15% -   18%)
               OrHighLow        2.72      (1.7%)        2.72      (1.5%)   -0.1% (  -3% -    3%)
              OrHighHigh        2.03      (1.8%)        2.03      (1.4%)   -0.0% (  -3% -    3%)
         MedSloppyPhrase        3.30      (6.2%)        3.30      (6.1%)   -0.0% ( -11% -   13%)
                  IntNRQ        1.43      (1.4%)        1.43      (1.2%)   -0.0% (  -2% -    2%)
              HighPhrase        2.74      (5.7%)        2.74      (6.1%)    0.1% ( -10% -   12%)
                 Respell       52.73      (2.5%)       53.18      (3.0%)    0.8% (  -4% -    6%)
{noformat}

Cost is definitely less ... but not "trivial".

Then, I did ALL_BUT_DIM, but only facet on the 7 "easy" dims
(i.e. exclude categories and username):

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
               LowPhrase       12.23      (6.4%)       12.07      (6.0%)   -1.4% ( -12% -   11%)
                 Respell       55.11      (3.1%)       54.44      (2.8%)   -1.2% (  -6% -    4%)
                  Fuzzy1       62.58      (1.6%)       61.95      (1.8%)   -1.0% (  -4% -    2%)
                  Fuzzy2       46.58      (1.5%)       46.17      (1.7%)   -0.9% (  -3% -    2%)
              AndHighLow      402.55      (1.9%)      399.72      (3.2%)   -0.7% (  -5% -    4%)
              HighPhrase        4.01      (8.0%)        3.99      (7.9%)   -0.5% ( -15% -   16%)
               MedPhrase      126.62      (4.6%)      126.18      (4.5%)   -0.3% (  -9% -    9%)
                HighTerm       20.11      (1.5%)       20.06      (1.6%)   -0.3% (  -3% -    2%)
         LowSloppyPhrase       39.10      (1.7%)       39.03      (1.7%)   -0.2% (  -3% -    3%)
                 LowTerm      127.98      (1.9%)      127.81      (2.0%)   -0.1% (  -3% -    3%)
                 Prefix3       28.04      (1.1%)       28.02      (1.3%)   -0.1% (  -2% -    2%)
              AndHighMed       27.39      (0.9%)       27.36      (0.7%)   -0.1% (  -1% -    1%)
             AndHighHigh       25.00      (1.0%)       24.98      (0.7%)   -0.1% (  -1% -    1%)
                 MedTerm       26.14      (1.5%)       26.13      (1.6%)   -0.0% (  -3% -    3%)
                  IntNRQ        2.39      (1.4%)        2.39      (1.5%)    0.0% (  -2% -    2%)
                Wildcard        8.90      (1.7%)        8.90      (1.1%)    0.1% (  -2% -    2%)
            OrHighNotLow        8.52      (1.5%)        8.57      (2.5%)    0.5% (  -3% -    4%)
        HighSloppyPhrase        3.95      (8.5%)        3.98     (12.7%)    0.8% ( -18% -   24%)
              OrHighHigh        3.48      (1.6%)        3.51      (1.6%)    0.8% (  -2% -    4%)
            OrHighNotMed       14.44      (1.6%)       14.57      (1.7%)    0.9% (  -2% -    4%)
               OrHighLow        4.85      (1.6%)        4.90      (1.8%)    0.9% (  -2% -    4%)
               OrHighMed       11.37      (1.4%)       11.48      (1.4%)    1.0% (  -1% -    3%)
             MedSpanNear       25.99      (3.5%)       26.25      (2.8%)    1.0% (  -5% -    7%)
             LowSpanNear        8.84      (4.7%)        8.93      (4.4%)    1.1% (  -7% -   10%)
         MedSloppyPhrase        3.62      (7.0%)        3.67      (8.1%)    1.3% ( -12% -   17%)
            HighSpanNear        4.98      (5.2%)        5.05      (2.9%)    1.3% (  -6% -    9%)
           OrNotHighHigh       14.23      (1.9%)       14.42      (2.0%)    1.4% (  -2% -    5%)
           OrHighNotHigh        8.52      (1.9%)        8.65      (2.0%)    1.5% (  -2% -    5%)
            OrNotHighMed       30.98      (3.8%)       32.05      (4.9%)    3.5% (  -5% -   12%)
            OrNotHighLow       60.16      (5.6%)       62.95      (7.0%)    4.6% (  -7% -   18%)
{noformat}

Which looks like basically noise ... so it's only the high-cardinality
dims that are affected.
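
For readers checking the tables by hand, here is a minimal Java sketch of how the Pct diff column relates to the two QPS columns, plus one possible reading of "basically noise". The class and method names are hypothetical, and the "range spans zero" test is an assumption made for illustration, not something luceneutil is known to apply. The sketch does not attempt to reproduce how the bracketed range itself is derived.

```java
public class PctDiffSketch {

    /** Relative change of the patched run (comp) versus the baseline, in percent. */
    static double pctDiff(double qpsBase, double qpsComp) {
        return 100.0 * (qpsComp - qpsBase) / qpsBase;
    }

    /**
     * Assumed noise criterion: the reported range (the two bracketed
     * percentages in a row) includes zero, so the sign of the change
     * is not established.
     */
    static boolean rangeSpansZero(int rangeLow, int rangeHigh) {
        return rangeLow <= 0 && rangeHigh >= 0;
    }

    public static void main(String[] args) {
        // AndHighLow with all dims and NO_PARENTS: 60.84 -> 45.81, range (-28% - -21%)
        System.out.println(pctDiff(60.84, 45.81));      // about -24.7
        System.out.println(rangeSpansZero(-28, -21));   // false: a real slowdown

        // AndHighLow with only the 7 "easy" dims: 402.55 -> 399.72, range (-5% - 4%)
        System.out.println(pctDiff(402.55, 399.72));    // about -0.7
        System.out.println(rangeSpansZero(-5, 4));      // true: within noise
    }
}
```

Under that criterion, every row of the 7-dim table has a range that spans zero, which is consistent with reading the run as noise.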


> Taxonomy tree traversing improvement
> ------------------------------------
>
>                 Key: LUCENE-5316
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5316
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Gilad Barkai
>            Priority: Minor
>         Attachments: LUCENE-5316.patch, LUCENE-5316.patch
>
>
> Taxonomy traversal today relies on {{ParallelTaxonomyArrays}}: in
> particular, two taxonomy-size {{int}} arrays that hold, for each ordinal,
> its (array #1) youngest child and (array #2) older sibling.
> This is a compact way of holding the tree information in memory, but it's not 
> perfect:
> * Large (8 bytes per ordinal in memory)
> * Exposes internal implementation
> * Using these arrays for tree traversal is not straightforward
> * Loses reference locality while traversing (entries are accessed in 
> increasing order only, but they may be distant from one another)
> * In NRT, a reopen always (not just in the worst case) costs O(Taxonomy-size)
> This issue is about making traversal easier, the code more readable, 
> and opening it up for future improvements (e.g. memory footprint and NRT 
> cost) - without changing any of the internals. 
> A later issue(s?) could be opened to address the gaps once this one is done.
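
The youngest-child / older-sibling layout described in the issue text can be sketched in plain Java as follows. This is an illustration only: the class `TaxonomyArraysSketch`, the method `childrenOf`, and the toy five-ordinal taxonomy are hypothetical, and the arrays are built from a parent array on the assumption that a parent always receives a smaller ordinal than its children. It is not the code of `ParallelTaxonomyArrays` or of the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TaxonomyArraysSketch {
    static final int INVALID_ORDINAL = -1;

    final int[] children;  // children[ord] = youngest (highest-ordinal) child of ord
    final int[] siblings;  // siblings[ord] = next older sibling of ord

    TaxonomyArraysSketch(int[] parents) {
        children = new int[parents.length];
        siblings = new int[parents.length];
        Arrays.fill(children, INVALID_ORDINAL);
        Arrays.fill(siblings, INVALID_ORDINAL);
        // Ordinal 0 is the root. Visiting ordinals in increasing order means
        // the last child seen for a parent ends up as its youngest child.
        for (int ord = 1; ord < parents.length; ord++) {
            int parent = parents[ord];
            siblings[ord] = children[parent];  // previous youngest becomes older sibling
            children[parent] = ord;
        }
    }

    /** Walks youngest child, then older siblings, until none is left. */
    List<Integer> childrenOf(int ord) {
        List<Integer> result = new ArrayList<>();
        for (int child = children[ord]; child != INVALID_ORDINAL; child = siblings[child]) {
            result.add(child);
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy taxonomy: 0=root, 1=date, 2=date/2012, 3=date/2013, 4=username
        int[] parents = {INVALID_ORDINAL, 0, 1, 1, 0};
        TaxonomyArraysSketch taxo = new TaxonomyArraysSketch(parents);
        System.out.println(taxo.childrenOf(0)); // [4, 1]
        System.out.println(taxo.childrenOf(1)); // [3, 2]
        System.out.println(taxo.childrenOf(2)); // []
    }
}
```

The sketch makes two of the listed drawbacks visible: each ordinal costs two `int` slots (8 bytes), and callers must know the array convention and the sentinel value to iterate children at all.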



