[ https://issues.apache.org/jira/browse/LUCENE-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637983#comment-13637983 ]
Han Jiang commented on LUCENE-2962:
-----------------------------------
Here is a full summary of skip frequency for wikimedium.10M.nostopwords.tasks,
and for part of crazyRandomMinShouldMatch.tasks. The latter is really crazy :)
The 'skip_len' is counted as (newDocUpto - docUpto) in
Lucene41PostingsReader.*Enum.advance(target): it is 0 when the skip doesn't
move, otherwise the number of docs skipped. I changed the code in luceneutil
so that every query line is taken into account:
#query_category     #num_query #num_called  #max_skip_len  #tot_skip_len  #avg_skip_len  #std_dev_skip_len
and_high_high              500    18021935          14633      110997027       6.158996          25.283280
and_high_med               500     9145928          22730      233710779      25.553534          61.885853
and_high_low               500     1385125         215533     1073755035     775.204429        1606.345686
high_phrase                 42      253569           3284        5113544      20.166282          56.904256
high_sloppy_phrase          42     2441007           3284       11993572       4.913371          23.253660
high_span_near              42     2362258           3284       11846707       5.014993          23.604965
low_phrase                 500     6936508          21180      247018573      35.611373         103.751734
low_sloppy_phrase          500    18170618          21180      298025713      16.401518          66.808480
low_span_near              500    18100056          21180      296733920      16.394089          66.895263
med_phrase                 500     4513849          26367      144556764      32.025166          83.814376
med_sloppy_phrase          500    17683175          26367      197756027      11.183287          45.764898
med_span_near              500    17503372          26367      196409780      11.221254          45.958612
10terms_0high_2msm          22       10875          32768        2502731     230.136184        1319.894640
10terms_0high_3msm          17       17127          15743         440149      25.699130         209.870841
10terms_0high_4msm          27       27144          24192        2156919      79.462091         640.948445
10terms_0high_5msm          21       19564          26479        1829846      93.531282         773.820054
10terms_0high_6msm          27       17555          31232        1615071      92.000627         745.978516
10terms_0high_7msm          27       16618          18688        1208893      72.745998         505.996915
10terms_0high_8msm          25       10722          17024         817872      76.279799         451.833907
10terms_0high_9msm          16        5371          11008         411776      76.666543         353.098379
10terms_0high_10msm         21       10403          32768        7325395     704.161780        2504.260576
10terms_5high_2msm          24      650096           2163        1832245       2.818422          18.513591
10terms_5high_5msm          24     1123877         276224      128339073     114.193166         936.887693
10terms_5high_10msm         24       14211        1663232      322730000   22709.872634      115150.031194
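
For readers who want to reproduce the columns above, here is a minimal sketch
of how such per-category statistics could be accumulated. It is not the actual
luceneutil patch: the class SkipStats and its method names are hypothetical,
and the use of a population (rather than sample) standard deviation is an
assumption.

```python
import math


class SkipStats:
    """Accumulates skip lengths for one query category (hypothetical helper)."""

    def __init__(self):
        self.num_called = 0    # number of advance(target) calls observed
        self.max_skip_len = 0  # largest single skip
        self.tot_skip_len = 0  # sum of all skip lengths
        self._sum_sq = 0       # sum of squared skip lengths, for the std dev

    def record(self, doc_upto, new_doc_upto):
        # skip_len is 0 when the skip list did not move the cursor,
        # otherwise the number of docs skipped.
        skip_len = new_doc_upto - doc_upto
        self.num_called += 1
        self.tot_skip_len += skip_len
        self._sum_sq += skip_len * skip_len
        self.max_skip_len = max(self.max_skip_len, skip_len)

    def avg(self):
        if self.num_called == 0:
            return 0.0
        return self.tot_skip_len / self.num_called

    def std_dev(self):
        # Population standard deviation (an assumption, see above).
        if self.num_called == 0:
            return 0.0
        mean = self.avg()
        variance = self._sum_sq / self.num_called - mean * mean
        return math.sqrt(max(variance, 0.0))


stats = SkipStats()
for doc_upto, new_doc_upto in [(0, 0), (0, 128), (128, 384)]:
    stats.record(doc_upto, new_doc_upto)
```

The avg column is simply tot_skip_len / num_called; for and_high_high that is
110997027 / 18021935, or about 6.159, which agrees with the table.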
This prompted me to test whether a multi-level skip structure is really
necessary for simpler queries like AndQuery & PhraseQuery.
So I set skipMultiplier=8000000 to make sure that Lucene41SkipWriter never
creates a skip list with more than one level; this run is marked as 'comp'.
A clean trunk (skipMultiplier=8) is used as 'base':
            Task  QPS base   StdDev  QPS comp   StdDev  Pct diff
       LowPhrase     34.86   (2.7%)     32.23   (1.6%)   -7.5% ( -11% -  -3%)
         LowTerm    335.88   (8.8%)    326.09   (7.8%)   -2.9% ( -17% -  14%)
    HighSpanNear      7.05   (2.2%)      6.97   (0.7%)   -1.1% (  -3% -   1%)
      AndHighMed     52.22   (1.3%)     51.72   (1.0%)   -1.0% (  -3% -   1%)
     MedSpanNear      4.30   (2.1%)      4.26   (0.7%)   -0.8% (  -3% -   2%)
     LowSpanNear     42.46   (1.7%)     42.28   (0.6%)   -0.4% (  -2% -   1%)
          Fuzzy2     59.56   (5.4%)     59.31   (4.7%)   -0.4% (  -9% -  10%)
 LowSloppyPhrase     10.33   (2.6%)     10.30   (2.5%)   -0.3% (  -5% -   4%)
     AndHighHigh     18.37   (0.6%)     18.33   (0.3%)   -0.2% (  -1% -   0%)
          Fuzzy1     53.70   (5.3%)     53.59   (5.2%)   -0.2% ( -10% -  10%)
      HighPhrase      2.56   (6.5%)      2.56   (5.6%)   -0.2% ( -11% -  12%)
        HighTerm     57.36  (15.2%)     57.34  (15.0%)   -0.0% ( -26% -  35%)
         MedTerm     90.08  (13.9%)     90.30  (13.6%)    0.2% ( -23% -  32%)
          IntNRQ      2.82  (13.3%)      2.83  (11.7%)    0.3% ( -21% -  29%)
       MedPhrase     15.18   (8.6%)     15.23   (8.4%)    0.3% ( -15% -  19%)
 MedSloppyPhrase      2.17   (4.2%)      2.18   (3.7%)    0.6% (  -6% -   8%)
       OrHighMed     20.30  (14.5%)     20.47  (14.4%)    0.8% ( -24% -  34%)
        Wildcard     21.53   (5.6%)     21.71   (4.9%)    0.8% (  -9% -  12%)
       OrHighLow     17.26  (15.0%)     17.43  (15.0%)    1.0% ( -25% -  36%)
HighSloppyPhrase      8.31   (4.2%)      8.39   (4.5%)    1.0% (  -7% -  10%)
         Prefix3     22.70   (5.8%)     22.93   (5.1%)    1.0% (  -9% -  12%)
      OrHighHigh     15.51  (14.7%)     15.69  (14.8%)    1.2% ( -24% -  35%)
         Respell     41.39   (3.4%)     42.01   (3.3%)    1.5% (  -5% -   8%)
      AndHighLow    459.48   (2.3%)    468.22   (2.1%)    1.9% (  -2% -   6%)
        PKLookup    251.05   (4.4%)    259.80   (2.8%)    3.5% (  -3% -  11%)
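
To illustrate why such a large skipMultiplier collapses the skip data to a
single level, here is a rough sketch of how the number of levels could be
derived from a term's document frequency. The function name num_skip_levels is
hypothetical. The formula (one level, plus one more for each time the number
of skip entries can be divided by skipMultiplier), the skip interval of 128
docs, and the cap of 10 levels are assumptions for illustration, not values
taken from this comment.

```python
def num_skip_levels(doc_freq, skip_interval=128, skip_multiplier=8, max_levels=10):
    """Estimate how many skip levels a term with doc_freq docs would get.

    Assumed model: level 0 has one entry per skip_interval docs, and each
    higher level keeps one entry per skip_multiplier entries of the level below.
    """
    if doc_freq <= skip_interval:
        return 1
    levels = 1
    entries = doc_freq // skip_interval  # entries on the lowest level
    while entries >= skip_multiplier:
        entries //= skip_multiplier      # entries surviving to the next level
        levels += 1
    return min(levels, max_levels)


base_levels = num_skip_levels(10_000_000, skip_multiplier=8)
comp_levels = num_skip_levels(10_000_000, skip_multiplier=8_000_000)
```

Under these assumptions, a term that occurs in all 10M docs gets 6 levels with
skipMultiplier=8 but only 1 level with skipMultiplier=8000000, since no term
in this index has enough skip entries to fill a second level.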
> Skip data should be inlined into the postings lists
> ---------------------------------------------------
>
> Key: LUCENE-2962
> URL: https://issues.apache.org/jira/browse/LUCENE-2962
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Labels: gsoc2013
> Attachments: proposal.txt
>
>
> Today, we store all skip data as a separate blob at the end of a given term's
> postings (if that term occurs in enough docs to warrant skip data).
> But this adds overhead during decoding -- we have to seek to a different
> place for the initial load, we have to init separate readers, we have to seek
> again while using the lower levels of the skip data, etc. Also, we have to
> fully decode all skip information even if we are not going to use it (eg if I
> only want docIDs, I still must decode position offset and lastPayloadLength).
> If instead we interleaved skip data into the postings file, we could keep it
> local, and "private" to each file that needs skipping. This should make it
> less costly to init and then use the skip data, which'd be a good perf gain
> for eg PhraseQuery, AndQuery.