[
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809247#comment-13809247
]
Michael McCandless commented on LUCENE-5316:
--------------------------------------------
Thanks Gilad, I ran a performance test, on full English Wikipedia. I
indexed all dims as NO_PARENTS so we exercise visiting all children
(rollup). Also, two flat dims (username, categories) have high
cardinality ... it looks like there is a high fixed cost, because the
relatively easy queries seem to lose the most perf:
{noformat}
Task QPS base StdDev QPS comp StdDev
Pct diff
AndHighLow 59.93 (2.6%) 44.81 (4.9%)
-25.2% ( -31% - -18%)
MedPhrase 42.82 (2.1%) 34.59 (4.1%)
-19.2% ( -24% - -13%)
LowTerm 40.13 (1.8%) 32.68 (3.3%)
-18.6% ( -23% - -13%)
OrNotHighLow 33.20 (3.8%) 27.64 (4.2%)
-16.8% ( -23% - -9%)
Fuzzy1 30.74 (1.6%) 26.15 (3.0%)
-14.9% ( -19% - -10%)
Fuzzy2 24.95 (1.8%) 21.78 (2.9%)
-12.7% ( -17% - -8%)
LowSloppyPhrase 24.11 (1.2%) 21.22 (2.8%)
-12.0% ( -15% - -8%)
OrNotHighMed 19.85 (2.8%) 17.68 (3.4%)
-10.9% ( -16% - -4%)
MedSpanNear 17.94 (2.3%) 16.34 (3.4%)
-8.9% ( -14% - -3%)
AndHighMed 16.26 (1.3%) 14.88 (2.1%)
-8.4% ( -11% - -5%)
AndHighHigh 13.95 (0.9%) 12.91 (1.9%)
-7.4% ( -10% - -4%)
Prefix3 13.24 (1.2%) 12.34 (1.9%)
-6.8% ( -9% - -3%)
MedTerm 12.85 (0.8%) 12.01 (1.8%)
-6.6% ( -9% - -3%)
OrNotHighHigh 9.79 (1.5%) 9.29 (2.2%)
-5.1% ( -8% - -1%)
LowPhrase 9.50 (5.5%) 9.05 (4.9%)
-4.7% ( -14% - 6%)
HighTerm 8.65 (1.2%) 8.28 (1.7%)
-4.2% ( -7% - -1%)
LowSpanNear 7.40 (3.8%) 7.13 (4.6%)
-3.7% ( -11% - 4%)
OrHighNotMed 7.19 (1.4%) 6.96 (1.7%)
-3.2% ( -6% - 0%)
OrHighMed 5.64 (1.3%) 5.51 (1.4%)
-2.3% ( -4% - 0%)
OrHighNotHigh 4.80 (1.3%) 4.71 (2.1%)
-1.9% ( -5% - 1%)
Wildcard 4.60 (1.8%) 4.51 (1.5%)
-1.8% ( -5% - 1%)
OrHighNotLow 4.12 (1.1%) 4.06 (1.5%)
-1.5% ( -3% - 1%)
OrHighLow 2.81 (1.1%) 2.78 (1.2%)
-1.0% ( -3% - 1%)
HighSpanNear 3.24 (2.9%) 3.22 (3.0%)
-0.8% ( -6% - 5%)
OrHighHigh 2.09 (1.1%) 2.08 (1.4%)
-0.6% ( -3% - 1%)
IntNRQ 1.50 (1.1%) 1.50 (0.8%)
-0.6% ( -2% - 1%)
MedSloppyPhrase 3.20 (5.2%) 3.19 (7.0%)
-0.5% ( -11% - 12%)
HighPhrase 2.73 (6.3%) 2.72 (5.7%)
-0.3% ( -11% - 12%)
Respell 52.93 (2.4%) 52.76 (3.2%)
-0.3% ( -5% - 5%)
HighSloppyPhrase 3.28 (6.9%) 3.32 (9.3%)
1.2% ( -14% - 18%)
{noformat}
Maybe ... we should allow .getChildren to return null when that ord
has no children? I.e., I think the cost might be because are visiting
a great many ords for the "flat" fields.
Separately, perhaps taxo index could tell us when a dim is flat and
then we can avoid doing rollup. Really for such dims, I should index
them as ALL_BUT_DIM, but it'd be nice if I do NO_PARENTS if the facets
code detected that the dim is actually flat and optimized
accordingly...
> Taxonomy tree traversing improvement
> ------------------------------------
>
> Key: LUCENE-5316
> URL: https://issues.apache.org/jira/browse/LUCENE-5316
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Gilad Barkai
> Priority: Minor
> Attachments: LUCENE-5316.patch
>
>
> The taxonomy traversing is done today utilizing the
> {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays
> which hold for each ordinal it's (array #1) youngest child and (array #2)
> older sibling.
> This is a compact way of holding the tree information in memory, but it's not
> perfect:
> * Large (8 bytes per ordinal in memory)
> * Exposes internal implementation
> * Utilizing these arrays for tree traversing is not straight forward
> * Lose reference locality while traversing (the array is accessed in
> increasing only entries, but they may be distant from one another)
> * In NRT, a reopen is always (not worst case) done at O(Taxonomy-size)
> This issue is about making the traversing more easy, the code more readable,
> and open it for future improvements (i.e memory footprint and NRT cost) -
> without changing any of the internals.
> A later issue(s?) could be opened to address the gaps once this one is done.
--
This message was sent by Atlassian JIRA
(v6.1#6144)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]