[ 
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809247#comment-13809247
 ] 

Michael McCandless commented on LUCENE-5316:
--------------------------------------------

Thanks Gilad, I ran a performance test, on full English Wikipedia.  I
indexed all dims as NO_PARENTS so we exercise visiting all children
(rollup).  Also, two flat dims (username, categories) have high
cardinality ... it looks like there is a high fixed cost, because the
relatively easy queries seem to lose the most perf:

{noformat}
                    Task    QPS base      StdDev    QPS comp      StdDev        
        Pct diff
              AndHighLow       59.93      (2.6%)       44.81      (4.9%)  
-25.2% ( -31% -  -18%)
               MedPhrase       42.82      (2.1%)       34.59      (4.1%)  
-19.2% ( -24% -  -13%)
                 LowTerm       40.13      (1.8%)       32.68      (3.3%)  
-18.6% ( -23% -  -13%)
            OrNotHighLow       33.20      (3.8%)       27.64      (4.2%)  
-16.8% ( -23% -   -9%)
                  Fuzzy1       30.74      (1.6%)       26.15      (3.0%)  
-14.9% ( -19% -  -10%)
                  Fuzzy2       24.95      (1.8%)       21.78      (2.9%)  
-12.7% ( -17% -   -8%)
         LowSloppyPhrase       24.11      (1.2%)       21.22      (2.8%)  
-12.0% ( -15% -   -8%)
            OrNotHighMed       19.85      (2.8%)       17.68      (3.4%)  
-10.9% ( -16% -   -4%)
             MedSpanNear       17.94      (2.3%)       16.34      (3.4%)   
-8.9% ( -14% -   -3%)
              AndHighMed       16.26      (1.3%)       14.88      (2.1%)   
-8.4% ( -11% -   -5%)
             AndHighHigh       13.95      (0.9%)       12.91      (1.9%)   
-7.4% ( -10% -   -4%)
                 Prefix3       13.24      (1.2%)       12.34      (1.9%)   
-6.8% (  -9% -   -3%)
                 MedTerm       12.85      (0.8%)       12.01      (1.8%)   
-6.6% (  -9% -   -3%)
           OrNotHighHigh        9.79      (1.5%)        9.29      (2.2%)   
-5.1% (  -8% -   -1%)
               LowPhrase        9.50      (5.5%)        9.05      (4.9%)   
-4.7% ( -14% -    6%)
                HighTerm        8.65      (1.2%)        8.28      (1.7%)   
-4.2% (  -7% -   -1%)
             LowSpanNear        7.40      (3.8%)        7.13      (4.6%)   
-3.7% ( -11% -    4%)
            OrHighNotMed        7.19      (1.4%)        6.96      (1.7%)   
-3.2% (  -6% -    0%)
               OrHighMed        5.64      (1.3%)        5.51      (1.4%)   
-2.3% (  -4% -    0%)
           OrHighNotHigh        4.80      (1.3%)        4.71      (2.1%)   
-1.9% (  -5% -    1%)
                Wildcard        4.60      (1.8%)        4.51      (1.5%)   
-1.8% (  -5% -    1%)
            OrHighNotLow        4.12      (1.1%)        4.06      (1.5%)   
-1.5% (  -3% -    1%)
               OrHighLow        2.81      (1.1%)        2.78      (1.2%)   
-1.0% (  -3% -    1%)
            HighSpanNear        3.24      (2.9%)        3.22      (3.0%)   
-0.8% (  -6% -    5%)
              OrHighHigh        2.09      (1.1%)        2.08      (1.4%)   
-0.6% (  -3% -    1%)
                  IntNRQ        1.50      (1.1%)        1.50      (0.8%)   
-0.6% (  -2% -    1%)
         MedSloppyPhrase        3.20      (5.2%)        3.19      (7.0%)   
-0.5% ( -11% -   12%)
              HighPhrase        2.73      (6.3%)        2.72      (5.7%)   
-0.3% ( -11% -   12%)
                 Respell       52.93      (2.4%)       52.76      (3.2%)   
-0.3% (  -5% -    5%)
        HighSloppyPhrase        3.28      (6.9%)        3.32      (9.3%)    
1.2% ( -14% -   18%)
{noformat}

Maybe ... we should allow .getChildren to return null when that ord
has no children?  I.e., I think the cost might be because are visiting
a great many ords for the "flat" fields.

Separately, perhaps taxo index could tell us when a dim is flat and
then we can avoid doing rollup.  Really for such dims, I should index
them as ALL_BUT_DIM, but it'd be nice if I do NO_PARENTS if the facets
code detected that the dim is actually flat and optimized
accordingly...


> Taxonomy tree traversing improvement
> ------------------------------------
>
>                 Key: LUCENE-5316
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5316
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Gilad Barkai
>            Priority: Minor
>         Attachments: LUCENE-5316.patch
>
>
> The taxonomy traversing is done today utilizing the 
> {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays 
> which hold for each ordinal it's (array #1) youngest child and (array #2) 
> older sibling.
> This is a compact way of holding the tree information in memory, but it's not 
> perfect:
> * Large (8 bytes per ordinal in memory)
> * Exposes internal implementation
> * Utilizing these arrays for tree traversing is not straight forward
> * Lose reference locality while traversing (the array is accessed in 
> increasing only entries, but they may be distant from one another)
> * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size)
> This issue is about making the traversing more easy, the code more readable, 
> and open it for future improvements (i.e memory footprint and NRT cost) - 
> without changing any of the internals. 
> A later issue(s?) could be opened to address the gaps once this one is done.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to