[
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Han Jiang updated LUCENE-3069:
------------------------------
Attachment: example.png
LUCENE-3069.patch
Uploaded patch; it is the main part of the changes I committed to branch3069.
The picture shows the current implementation of outputs (fetched from one field
in wikimedium5k):
* long[] (sortable metadata)
* byte[] (unsortable, generic metadata)
* df, ttf (term stats)
A single-byte flag indicates which of these fields the current outputs
maintain; for PBF with a short byte[], this should be enough. Also, for
long-tail terms, the totalTermFreq can safely be inlined into docFreq (for the
body field in wikimedium1m, 85.8% of terms have df == ttf).
Since the TermsEnum is entirely based on FSTEnum, term dictionary performance
should be similar to MemoryPF. However, for PK tasks we have to pull the
docsEnum from MMap, which hurts.
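A minimal sketch (assumed, not the branch's actual class) of how such a TermsEnum walks the dictionary via BytesRefFSTEnum:
{code:java}
import java.io.IOException;

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.fst.BytesRefFSTEnum;
import org.apache.lucene.util.fst.FST;

// Assumed sketch: iteration/seek goes straight through BytesRefFSTEnum, so the
// term dict behaves like MemoryPF; only the postings still come from MMap'd files.
final class FSTTermsEnumSketch<T> {
  private final BytesRefFSTEnum<T> fstEnum;

  FSTTermsEnumSketch(FST<T> fst) {
    this.fstEnum = new BytesRefFSTEnum<T>(fst);
  }

  /** Next term in the dictionary, or null when exhausted. */
  BytesRef next() throws IOException {
    BytesRefFSTEnum.InputOutput<T> io = fstEnum.next();
    return io == null ? null : io.input;
  }

  /** Exact seek; the returned output carries the per-term metadata. */
  BytesRefFSTEnum.InputOutput<T> seekExact(BytesRef term) throws IOException {
    return fstEnum.seekExact(term);
  }
}
{code}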
Here is the performance comparison:
{noformat}
pure TempFST vs. Lucene41 + Memory(on idField), on wikimediumall
            Task    QPS base    StdDev    QPS comp    StdDev   Pct diff
         Respell       48.13    (4.4%)       15.38    (1.0%)   -68.0% ( -70% - -65%)
          Fuzzy2       51.30    (5.3%)       17.47    (1.3%)   -65.9% ( -68% - -62%)
          Fuzzy1       52.24    (4.0%)       18.50    (1.2%)   -64.6% ( -67% - -61%)
        Wildcard        9.31    (1.7%)        6.16    (2.2%)   -33.8% ( -37% - -30%)
         Prefix3       23.25    (1.8%)       19.00    (2.2%)   -18.3% ( -21% - -14%)
        PKLookup      244.92    (3.6%)      225.42    (2.3%)    -8.0% ( -13% -  -2%)
         LowTerm      295.88    (5.5%)      293.27    (4.8%)    -0.9% ( -10% -   9%)
      HighPhrase       13.62    (6.5%)       13.54    (7.4%)    -0.6% ( -13% -  14%)
         MedTerm       99.51    (7.8%)       99.19    (7.7%)    -0.3% ( -14% -  16%)
       MedPhrase      154.63    (9.4%)      154.38   (10.1%)    -0.2% ( -17% -  21%)
        HighTerm       28.25   (10.7%)       28.25   (10.0%)    -0.0% ( -18% -  23%)
      OrHighHigh       16.83   (13.3%)       16.86   (13.1%)     0.2% ( -23% -  30%)
HighSloppyPhrase        9.02    (4.4%)        9.03    (4.5%)     0.2% (  -8% -   9%)
       LowPhrase        6.26    (3.4%)        6.27    (4.1%)     0.2% (  -7% -   8%)
       OrHighMed       13.73   (13.2%)       13.77   (12.8%)     0.3% ( -22% -  30%)
       OrHighLow       25.65   (13.2%)       25.73   (13.0%)     0.3% ( -22% -  30%)
 MedSloppyPhrase        6.63    (2.7%)        6.66    (2.7%)     0.5% (  -4% -   6%)
      AndHighMed       42.77    (1.8%)       43.13    (1.5%)     0.8% (  -2% -   4%)
 LowSloppyPhrase       32.68    (3.0%)       32.96    (2.8%)     0.8% (  -4% -   6%)
     AndHighHigh       22.90    (1.2%)       23.18    (0.7%)     1.2% (   0% -   3%)
     LowSpanNear       29.30    (2.0%)       29.83    (2.2%)     1.8% (  -2% -   6%)
     MedSpanNear        8.39    (2.7%)        8.56    (2.9%)     2.0% (  -3% -   7%)
          IntNRQ        3.12    (1.9%)        3.18    (6.7%)     2.1% (  -6% -  10%)
      AndHighLow      507.01    (2.4%)      522.10    (2.8%)     3.0% (  -2% -   8%)
    HighSpanNear        5.43    (1.8%)        5.60    (2.6%)     3.1% (  -1% -   7%)
{noformat}
{noformat}
pure TempFST vs. pure Lucene41, on wikimediumall
            Task    QPS base    StdDev    QPS comp    StdDev   Pct diff
         Respell       49.24    (2.7%)       15.51    (1.0%)   -68.5% ( -70% - -66%)
          Fuzzy2       52.01    (4.8%)       17.61    (1.4%)   -66.1% ( -68% - -63%)
          Fuzzy1       53.00    (4.0%)       18.62    (1.3%)   -64.9% ( -67% - -62%)
        Wildcard        9.37    (1.3%)        6.15    (2.1%)   -34.4% ( -37% - -31%)
         Prefix3       23.36    (0.8%)       18.96    (2.1%)   -18.8% ( -21% - -16%)
       MedPhrase      155.86    (9.8%)      152.34    (9.7%)    -2.3% ( -19% -  19%)
       LowPhrase        6.33    (3.7%)        6.23    (4.0%)    -1.6% (  -8% -   6%)
      HighPhrase       13.68    (7.2%)       13.49    (6.8%)    -1.4% ( -14% -  13%)
       OrHighMed       13.78   (13.0%)       13.68   (12.7%)    -0.8% ( -23% -  28%)
HighSloppyPhrase        9.14    (5.2%)        9.07    (3.7%)    -0.7% (  -9% -   8%)
      OrHighHigh       16.87   (13.3%)       16.76   (12.9%)    -0.6% ( -23% -  29%)
       OrHighLow       25.71   (13.1%)       25.58   (12.8%)    -0.5% ( -23% -  29%)
 MedSloppyPhrase        6.69    (2.7%)        6.67    (2.4%)    -0.3% (  -5% -   4%)
 LowSloppyPhrase       33.01    (3.2%)       32.99    (2.6%)    -0.1% (  -5% -   5%)
         MedTerm       99.64    (8.0%)       99.67   (10.9%)     0.0% ( -17% -  20%)
         LowTerm      294.52    (5.5%)      295.72    (7.2%)     0.4% ( -11% -  13%)
     LowSpanNear       29.61    (2.6%)       29.76    (2.7%)     0.5% (  -4% -   5%)
          IntNRQ        3.13    (1.8%)        3.16    (7.8%)     0.8% (  -8% -  10%)
     MedSpanNear        8.49    (3.0%)        8.57    (3.4%)     0.9% (  -5% -   7%)
      AndHighMed       42.86    (1.4%)       43.35    (1.4%)     1.1% (  -1% -   3%)
     AndHighHigh       22.98    (0.6%)       23.26    (0.5%)     1.2% (   0% -   2%)
    HighSpanNear        5.51    (3.4%)        5.58    (3.4%)     1.3% (  -5% -   8%)
        HighTerm       28.32   (10.5%)       28.76   (15.0%)     1.6% ( -21% -  30%)
      AndHighLow      509.60    (2.2%)      526.17    (1.9%)     3.3% (   0% -   7%)
        PKLookup      156.59    (2.2%)      225.47    (2.8%)    44.0% (  38% -  50%)
{noformat}
To recover the performance on automaton queries, the intersect methods still
need to be implemented.
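This is the hook involved; a placeholder sketch (assumed, not from the patch) of what would be overridden:
{code:java}
import java.io.IOException;

import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.automaton.CompiledAutomaton;

// Assumed sketch: the automaton queries above (Respell/Fuzzy/Wildcard/Prefix3)
// go through Terms.intersect; until it walks the term FST and the query
// automaton together, they fall back to the generic implementation.
abstract class TempFSTTermsSketch extends Terms {
  @Override
  public TermsEnum intersect(CompiledAutomaton compiled, BytesRef startTerm) throws IOException {
    // TODO: intersect the query automaton with the term FST directly
    return super.intersect(compiled, startTerm);
  }
}
{code}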
And the index size comparison (actually, after LUCENE-5029, TempBlock has a
slightly larger (5%) index size than Lucene41):
{noformat}
            wikimedium1m    wikimediumall
Memory         2,212,352                /
Lucene41         448,164       12,104,520
TempFST          525,888       12,770,700
{noformat}
As for the term dict size:
{noformat}
                       wikimedium1m    wikimediumall
Lucene41(.tim+.tip)          157776          2059744
TempFST(.tmp)                233636          2779784
increase                        48%              35%
{noformat}
Some unresolved problems:
* Currently, TempFST uses the default option to build the FST (i.e. doPacked =
false). When this option is switched on, the index size on wikimedium1m becomes
smaller, but on wikimediumall it becomes larger; why? (See the sketch below.)
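For reference, a sketch of the two build paths being compared; this assumes the Lucene 4.x FST Builder API, and the expanded constructor's parameter list (in particular doPackFST) is an assumption to double-check against trunk:
{code:java}
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.NoOutputs;
import org.apache.lucene.util.fst.Outputs;
import org.apache.lucene.util.packed.PackedInts;

// Assumed sketch of the two build paths; parameter names/order of the expanded
// Builder constructor are recalled from the 4.x API, not verified against the branch.
final class PackedFSTExperiment {
  static Builder<Object> newBuilder(boolean doPackFST) {
    Outputs<Object> outputs = NoOutputs.getSingleton();
    if (!doPackFST) {
      // current default in TempFST: no packing
      return new Builder<Object>(FST.INPUT_TYPE.BYTE1, outputs);
    }
    // packing switched on via the expanded constructor
    return new Builder<Object>(FST.INPUT_TYPE.BYTE1, 0, 0, true, true,
        Integer.MAX_VALUE, outputs, null, true /* doPackFST */,
        PackedInts.COMPACT, true, 15);
  }
}
{code}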
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index, core/search
> Affects Versions: 4.0-ALPHA
> Reporter: Simon Willnauer
> Assignee: Han Jiang
> Labels: gsoc2013
> Fix For: 4.4
>
> Attachments: example.png, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a
> delta codec file for scanning to terms. Some environments have enough memory
> available to keep the entire FST based term dict in memory. We should add a
> TermDictionary implementation that encodes all needed information for each
> term into the FST (custom fst.Output) and builds a FST from the entire term
> not just the delta.