[
https://issues.apache.org/jira/browse/LUCENE-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812757#comment-13812757
]
Michael McCandless commented on LUCENE-5316:
--------------------------------------------
bq. How about if you make up a hierarchical category, e.g.
charCount/0-100K/0-10K/0-1K/0-100/0-10?
Oh, right, I forgot I already have this hierarchical dim ... but I ran
PrintFacetStats and it yields only 1086 ords (fewer than the date dim's
3279 ords).
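For illustration, here is a minimal sketch of how such a nested charCount path could be derived from a raw character count. The class name, method names, and the bucket widths are assumptions chosen to reproduce the example path above; this is not the code luceneutil uses, and it assumes a non-negative count well below Integer.MAX_VALUE.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds a nested-bucket facet path for a char count.
class CharCountPath {
    // Assumed bucket widths, coarsest first (100K, 10K, 1K, 100, 10).
    private static final int[] WIDTHS = {100_000, 10_000, 1_000, 100, 10};

    // Formats round thousands as "nK" (e.g. 10000 -> "10K"); other values as-is.
    static String label(int n) {
        return (n != 0 && n % 1000 == 0) ? (n / 1000) + "K" : Integer.toString(n);
    }

    // Returns the path components, starting with the dimension name.
    static List<String> path(int charCount) {
        List<String> components = new ArrayList<>();
        components.add("charCount");
        for (int width : WIDTHS) {
            int lo = (charCount / width) * width; // lower bound of the bucket at this level
            components.add(label(lo) + "-" + label(lo + width));
        }
        return components;
    }

    public static void main(String[] args) {
        // A 5-character document falls into the lowest bucket at every level.
        System.out.println(String.join("/", path(5)));
    }
}
```

With decade-sized buckets like these each node has at most 10 children, so the number of distinct ords stays small unless the documents spread over many of the fine-grained buckets.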
I tested Gilad's latest patch still using NO_PARENTS (change one thing
at a time):
{noformat}
            Task  QPS base   StdDev  QPS comp   StdDev  Pct diff
      AndHighLow     60.84   (3.5%)     45.81   (0.9%)  -24.7%  ( -28% - -21%)
       MedPhrase     43.63   (2.9%)     35.16   (1.0%)  -19.4%  ( -22% - -15%)
         LowTerm     40.50   (2.0%)     33.31   (0.9%)  -17.8%  ( -20% - -15%)
    OrNotHighLow     33.70   (4.9%)     28.34   (2.9%)  -15.9%  ( -22% - -8%)
          Fuzzy1     30.86   (1.9%)     26.61   (0.8%)  -13.8%  ( -16% - -11%)
 LowSloppyPhrase     24.22   (1.7%)     21.51   (0.6%)  -11.2%  ( -13% - -9%)
          Fuzzy2     25.05   (1.7%)     22.29   (1.0%)  -11.0%  ( -13% - -8%)
    OrNotHighMed     20.09   (3.9%)     18.07   (2.5%)  -10.1%  ( -15% - -3%)
     MedSpanNear     18.16   (2.7%)     16.49   (1.7%)   -9.2%  ( -13% - -4%)
      AndHighMed     16.39   (1.4%)     15.03   (0.9%)   -8.3%  ( -10% - -6%)
     AndHighHigh     14.05   (1.2%)     13.11   (0.7%)   -6.7%  ( -8% - -4%)
       LowPhrase      9.71   (4.6%)      9.06   (4.8%)   -6.6%  ( -15% - 2%)
         Prefix3     13.39   (1.5%)     12.57   (0.8%)   -6.2%  ( -8% - -3%)
         MedTerm     12.86   (1.1%)     12.11   (1.3%)   -5.8%  ( -8% - -3%)
     LowSpanNear      7.54   (3.8%)      7.19   (3.3%)   -4.7%  ( -11% - 2%)
   OrNotHighHigh      9.94   (2.4%)      9.48   (1.5%)   -4.6%  ( -8% - 0%)
        HighTerm      8.65   (1.5%)      8.34   (1.4%)   -3.6%  ( -6% - 0%)
      HighPhrase      2.81   (5.1%)      2.72   (5.8%)   -3.3%  ( -13% - 8%)
    OrHighNotMed      7.23   (1.6%)      7.01   (1.2%)   -3.1%  ( -5% - 0%)
       OrHighMed      5.67   (1.9%)      5.52   (0.8%)   -2.7%  ( -5% - 0%)
    HighSpanNear      3.26   (3.1%)      3.18   (2.2%)   -2.5%  ( -7% - 2%)
   OrHighNotHigh      4.84   (2.1%)      4.73   (1.3%)   -2.4%  ( -5% - 1%)
    OrHighNotLow      4.14   (1.3%)      4.05   (1.0%)   -2.1%  ( -4% - 0%)
        Wildcard      4.61   (1.4%)      4.52   (1.1%)   -1.8%  ( -4% - 0%)
       OrHighLow      2.82   (1.5%)      2.78   (0.7%)   -1.4%  ( -3% - 0%)
 MedSloppyPhrase      3.27   (6.2%)      3.23   (7.2%)   -1.2%  ( -13% - 13%)
HighSloppyPhrase      3.39   (6.9%)      3.36   (8.9%)   -0.9%  ( -15% - 15%)
      OrHighHigh      2.10   (1.5%)      2.08   (1.0%)   -0.9%  ( -3% - 1%)
          IntNRQ      1.51   (1.3%)      1.50   (0.4%)   -0.5%  ( -2% - 1%)
         Respell     52.58   (2.8%)     53.34   (1.9%)    1.4%  ( -3% - 6%)
{noformat}
Curiously, returning null when there are no children didn't seem to
help?
Then I switched all dims to ALL_BUT_DIM (I haven't added per-dim ord
policy control to luceneutil yet):
{noformat}
            Task  QPS base   StdDev  QPS comp   StdDev  Pct diff
      AndHighLow    103.19   (4.5%)     83.12   (1.2%)  -19.4%  ( -24% - -14%)
       MedPhrase     60.47   (3.8%)     53.05   (1.7%)  -12.3%  ( -17% - -7%)
         LowTerm     54.28   (3.0%)     48.32   (1.2%)  -11.0%  ( -14% - -7%)
    OrNotHighLow     42.72   (5.3%)     38.99   (4.0%)   -8.7%  ( -17% - 0%)
          Fuzzy1     38.65   (1.9%)     35.70   (1.0%)   -7.6%  ( -10% - -4%)
 LowSloppyPhrase     28.79   (1.8%)     27.02   (1.0%)   -6.2%  ( -8% - -3%)
          Fuzzy2     29.02   (1.6%)     27.31   (1.2%)   -5.9%  ( -8% - -3%)
     MedSpanNear     20.48   (2.2%)     19.55   (2.3%)   -4.5%  ( -8% - 0%)
    OrNotHighMed     22.49   (4.0%)     21.52   (3.3%)   -4.3%  ( -11% - 3%)
      AndHighMed     17.94   (1.2%)     17.24   (0.7%)   -3.9%  ( -5% - -2%)
     AndHighHigh     15.04   (1.0%)     14.55   (0.5%)   -3.2%  ( -4% - -1%)
         Prefix3     14.02   (1.2%)     13.64   (1.0%)   -2.7%  ( -4% - 0%)
         MedTerm     13.47   (1.5%)     13.12   (1.0%)   -2.6%  ( -5% - 0%)
       LowPhrase     10.15   (5.5%)      9.94   (5.3%)   -2.1%  ( -12% - 9%)
     LowSpanNear      7.82   (3.8%)      7.67   (3.7%)   -1.9%  ( -9% - 5%)
        HighTerm      8.83   (1.3%)      8.68   (1.1%)   -1.7%  ( -3% - 0%)
   OrNotHighHigh      9.79   (2.0%)      9.63   (1.7%)   -1.6%  ( -5% - 2%)
    OrHighNotMed      7.30   (1.7%)      7.22   (1.2%)   -1.1%  ( -3% - 1%)
    HighSpanNear      3.27   (1.9%)      3.24   (2.8%)   -0.9%  ( -5% - 3%)
       OrHighMed      5.64   (1.6%)      5.59   (1.3%)   -0.9%  ( -3% - 2%)
   OrHighNotHigh      4.78   (1.9%)      4.75   (1.5%)   -0.6%  ( -3% - 2%)
        Wildcard      4.59   (1.5%)      4.57   (1.4%)   -0.6%  ( -3% - 2%)
    OrHighNotLow      4.08   (1.6%)      4.07   (1.5%)   -0.3%  ( -3% - 2%)
HighSloppyPhrase      3.42   (8.7%)      3.42   (8.3%)   -0.2%  ( -15% - 18%)
       OrHighLow      2.72   (1.7%)      2.72   (1.5%)   -0.1%  ( -3% - 3%)
      OrHighHigh      2.03   (1.8%)      2.03   (1.4%)   -0.0%  ( -3% - 3%)
 MedSloppyPhrase      3.30   (6.2%)      3.30   (6.1%)   -0.0%  ( -11% - 13%)
          IntNRQ      1.43   (1.4%)      1.43   (1.2%)   -0.0%  ( -2% - 2%)
      HighPhrase      2.74   (5.7%)      2.74   (6.1%)    0.1%  ( -10% - 12%)
         Respell     52.73   (2.5%)     53.18   (3.0%)    0.8%  ( -4% - 6%)
{noformat}
The cost is definitely lower ... but still not "trivial".
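As a reminder of what differs between the ordinal policies compared above, here is a rough sketch of which categories get an ord written per document under each one. The enum and method names are made up for illustration and are not the facet module's API. As I understand it, NO_PARENTS indexes only the leaf and rolls counts up to the parents at search time by walking the taxonomy tree, which is the traversal this patch touches, while the other policies index the ancestors up front.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of ordinal policies; names are assumptions.
class OrdinalPolicySketch {
    enum Policy { NO_PARENTS, ALL_PARENTS, ALL_BUT_DIM }

    // Returns the category paths whose ords would be indexed for one document.
    static List<String> indexedCategories(Policy policy, String... path) {
        int first; // length of the shortest path prefix that gets indexed
        switch (policy) {
            case NO_PARENTS:
                first = path.length;              // leaf only
                break;
            case ALL_BUT_DIM:
                first = Math.min(2, path.length); // skip the dimension itself
                break;                            // (a bare dimension is kept in this sketch)
            default:
                first = 1;                        // ALL_PARENTS: dimension and every ancestor
                break;
        }
        List<String> result = new ArrayList<>();
        for (int len = first; len <= path.length; len++) {
            result.add(String.join("/", Arrays.copyOf(path, len)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> " + indexedCategories(p, "date", "2013", "11", "04"));
        }
    }
}
```

Under this reading, ALL_BUT_DIM only needs the search-time traversal to count the dimension's direct children, which would fit the smaller (but non-zero) slowdown seen in the second table.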
Then, I did ALL_BUT_DIM, but only facet on the 7 "easy" dims
(i.e. exclude categories and username):
{noformat}
            Task  QPS base   StdDev  QPS comp   StdDev  Pct diff
       LowPhrase     12.23   (6.4%)     12.07   (6.0%)   -1.4%  ( -12% - 11%)
         Respell     55.11   (3.1%)     54.44   (2.8%)   -1.2%  ( -6% - 4%)
          Fuzzy1     62.58   (1.6%)     61.95   (1.8%)   -1.0%  ( -4% - 2%)
          Fuzzy2     46.58   (1.5%)     46.17   (1.7%)   -0.9%  ( -3% - 2%)
      AndHighLow    402.55   (1.9%)    399.72   (3.2%)   -0.7%  ( -5% - 4%)
      HighPhrase      4.01   (8.0%)      3.99   (7.9%)   -0.5%  ( -15% - 16%)
       MedPhrase    126.62   (4.6%)    126.18   (4.5%)   -0.3%  ( -9% - 9%)
        HighTerm     20.11   (1.5%)     20.06   (1.6%)   -0.3%  ( -3% - 2%)
 LowSloppyPhrase     39.10   (1.7%)     39.03   (1.7%)   -0.2%  ( -3% - 3%)
         LowTerm    127.98   (1.9%)    127.81   (2.0%)   -0.1%  ( -3% - 3%)
         Prefix3     28.04   (1.1%)     28.02   (1.3%)   -0.1%  ( -2% - 2%)
      AndHighMed     27.39   (0.9%)     27.36   (0.7%)   -0.1%  ( -1% - 1%)
     AndHighHigh     25.00   (1.0%)     24.98   (0.7%)   -0.1%  ( -1% - 1%)
         MedTerm     26.14   (1.5%)     26.13   (1.6%)   -0.0%  ( -3% - 3%)
          IntNRQ      2.39   (1.4%)      2.39   (1.5%)    0.0%  ( -2% - 2%)
        Wildcard      8.90   (1.7%)      8.90   (1.1%)    0.1%  ( -2% - 2%)
    OrHighNotLow      8.52   (1.5%)      8.57   (2.5%)    0.5%  ( -3% - 4%)
HighSloppyPhrase      3.95   (8.5%)      3.98  (12.7%)    0.8%  ( -18% - 24%)
      OrHighHigh      3.48   (1.6%)      3.51   (1.6%)    0.8%  ( -2% - 4%)
    OrHighNotMed     14.44   (1.6%)     14.57   (1.7%)    0.9%  ( -2% - 4%)
       OrHighLow      4.85   (1.6%)      4.90   (1.8%)    0.9%  ( -2% - 4%)
       OrHighMed     11.37   (1.4%)     11.48   (1.4%)    1.0%  ( -1% - 3%)
     MedSpanNear     25.99   (3.5%)     26.25   (2.8%)    1.0%  ( -5% - 7%)
     LowSpanNear      8.84   (4.7%)      8.93   (4.4%)    1.1%  ( -7% - 10%)
 MedSloppyPhrase      3.62   (7.0%)      3.67   (8.1%)    1.3%  ( -12% - 17%)
    HighSpanNear      4.98   (5.2%)      5.05   (2.9%)    1.3%  ( -6% - 9%)
   OrNotHighHigh     14.23   (1.9%)     14.42   (2.0%)    1.4%  ( -2% - 5%)
   OrHighNotHigh      8.52   (1.9%)      8.65   (2.0%)    1.5%  ( -2% - 5%)
    OrNotHighMed     30.98   (3.8%)     32.05   (4.9%)    3.5%  ( -5% - 12%)
    OrNotHighLow     60.16   (5.6%)     62.95   (7.0%)    4.6%  ( -7% - 18%)
{noformat}
Which looks like basically noise ... so it's only the high-cardinality
dims that are affected.
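For anyone re-checking the tables: the Pct diff column appears consistent with the plain relative change in QPS from base to comp. A minimal sketch of that arithmetic follows (the class and method names are hypothetical, and it deliberately does not try to reproduce the bracketed range, whose exact formula I'm not assuming here):

```java
// Hypothetical helper for re-deriving the "Pct diff" column from the two QPS columns.
class PctDiffSketch {
    // Relative change of comp vs. base, in percent; negative means comp is slower.
    static double pctDiff(double qpsBase, double qpsComp) {
        return 100.0 * (qpsComp - qpsBase) / qpsBase;
    }

    public static void main(String[] args) {
        // AndHighLow row from the NO_PARENTS run: 60.84 QPS base, 45.81 QPS comp.
        System.out.printf("%.1f%%%n", pctDiff(60.84, 45.81));
    }
}
```

A row reads as noise when the bracketed range straddles zero, which is the case for every task in the last table.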
> Taxonomy tree traversing improvement
> ------------------------------------
>
> Key: LUCENE-5316
> URL: https://issues.apache.org/jira/browse/LUCENE-5316
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Gilad Barkai
> Priority: Minor
> Attachments: LUCENE-5316.patch, LUCENE-5316.patch
>
>
> Taxonomy traversal today relies on the
> {{ParallelTaxonomyArrays}}. In particular, two taxonomy-size {{int}} arrays
> which hold, for each ordinal, its (array #1) youngest child and (array #2)
> older sibling.
> This is a compact way of holding the tree information in memory, but it's not
> perfect:
> * Large (8 bytes per ordinal in memory)
> * Exposes internal implementation
> * Using these arrays for tree traversal is not straightforward
> * Loses reference locality while traversing (the array entries are accessed
> in increasing order only, but they may be distant from one another)
> * In NRT, a reopen always (not just in the worst case) costs O(Taxonomy-size)
> This issue is about making the traversal easier and the code more readable,
> and opening it up to future improvements (e.g. memory footprint and NRT cost),
> without changing any of the internals.
> A later issue(s?) could be opened to address the gaps once this one is done.
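For readers unfamiliar with the layout the issue description refers to, here is a self-contained sketch of the youngest-child / older-sibling encoding using plain int arrays. It is not the actual {{ParallelTaxonomyArrays}} code; the class and method names are hypothetical, and it assumes ordinal 0 is the root and that every parent has a smaller ordinal than its children.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the two taxonomy-size int arrays (8 bytes per ordinal).
class TaxonomyArraysSketch {
    static final int INVALID_ORDINAL = -1;

    final int[] children; // children[ord] = youngest (highest-ordinal) child, or -1
    final int[] siblings; // siblings[ord] = next older sibling, or -1

    // parents[ord] is the parent ordinal; parents[0] (the root) is ignored.
    TaxonomyArraysSketch(int[] parents) {
        children = new int[parents.length];
        siblings = new int[parents.length];
        Arrays.fill(children, INVALID_ORDINAL);
        Arrays.fill(siblings, INVALID_ORDINAL);
        // Visiting ordinals in increasing order makes the last child seen the youngest,
        // and links each child to the previously seen (older) child of the same parent.
        for (int ord = 1; ord < parents.length; ord++) {
            int parent = parents[ord];
            siblings[ord] = children[parent];
            children[parent] = ord;
        }
    }

    // Enumerates the direct children of ord, youngest first.
    List<Integer> childrenOf(int ord) {
        List<Integer> result = new ArrayList<>();
        for (int child = children[ord]; child != INVALID_ORDINAL; child = siblings[child]) {
            result.add(child);
        }
        return result;
    }

    public static void main(String[] args) {
        // root(0) -> {1, 2}; 1 -> {3, 4}; 2 -> {5}
        TaxonomyArraysSketch tax = new TaxonomyArraysSketch(new int[] {-1, 0, 0, 1, 1, 2});
        System.out.println(tax.childrenOf(0));
        System.out.println(tax.childrenOf(3));
    }
}
```

The loop in childrenOf is the pattern callers currently have to write by hand against the raw arrays; hiding it behind an iterator-style API is roughly what "making the traversing more easy" amounts to.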