[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Han Jiang updated LUCENE-3069:
------------------------------
Attachment: LUCENE-3069.patch
Uploaded patch.
It is optimized for WildcardQuery, and I did a quick test on 1M wiki data:
{noformat}
Task        QPS base   StdDev   QPS comp   StdDev   Pct diff
PKLookup      314.63   (1.5%)     314.64   (1.2%)     0.0% (  -2% -    2%)
Fuzzy1         91.32   (3.7%)      92.50   (1.6%)     1.3% (  -3% -    6%)
Respell       104.54   (3.9%)     106.97   (1.6%)     2.3% (  -2% -    8%)
Fuzzy2         38.22   (4.1%)      39.16   (1.2%)     2.5% (  -2% -    8%)
Wildcard      109.56   (3.1%)     273.42   (5.0%)   149.6% ( 137% -  162%)
{noformat}
and TempFSTOrd vs. Lucene41, on 1M data:
{noformat}
Task        QPS base   StdDev   QPS comp   StdDev   Pct diff
Respell       134.85   (3.7%)     106.30   (0.6%)   -21.2% ( -24% -  -17%)
Fuzzy2         47.78   (4.1%)      39.03   (0.9%)   -18.3% ( -22% -  -13%)
Fuzzy1        112.02   (3.0%)      91.95   (0.6%)   -17.9% ( -20% -  -14%)
Wildcard      326.68   (3.5%)     273.41   (1.9%)   -16.3% ( -20% -  -11%)
PKLookup      194.61   (1.8%)     314.24   (0.7%)    61.5% (  57% -   65%)
{noformat}
But I'm not happy with it :(. The hack I did here is to spend another big
block storing the last byte of each term. So for a wildcard query like ab*c, we
have external information that tells us the ord of the nearest term matching *c.
Knowing the ord, we can use an approach similar to getByOutput to jump to the
next target term. Previously, we had to walk the FST to the stop node to find
out whether the last byte is 'c', so skipping that walk accounts for a big chunk
of the gain.
However, I don't really like this patch :(. We have to increase the index size
(521M => 530M, about 1.7%), and the code becomes messy, since we always have to
look ahead to the next arc on the current stack.
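
To make the last-byte idea concrete, here is a minimal sketch in Java. It is
not the code from the patch: the class and method names are hypothetical, the
terms are assumed to be non-empty and already in ord (sorted) order, and a plain
linear scan over the side array stands in for whatever lookup the patch really
does. It only shows how a per-ord last-byte array lets us find the next
candidate ord for a pattern like ab*c without walking each term to its stop
node.

```java
// Sketch only: a side array holding the last byte of each term, indexed by ord.
// All names here are hypothetical and are not taken from the LUCENE-3069 patch.
final class LastByteSkipSketch {

    // Build the side array: lastBytes[ord] is the final UTF-8 byte of that term.
    // Assumes sortedTerms is in ord order and contains no empty term.
    static byte[] buildLastBytes(String[] sortedTerms) {
        byte[] lastBytes = new byte[sortedTerms.length];
        for (int ord = 0; ord < sortedTerms.length; ord++) {
            byte[] utf8 = sortedTerms[ord]
                .getBytes(java.nio.charset.StandardCharsets.UTF_8);
            lastBytes[ord] = utf8[utf8.length - 1];
        }
        return lastBytes;
    }

    // Return the smallest ord >= fromOrd whose term ends with target,
    // or -1 if no such term exists. Linear scan, chosen for clarity only.
    static int nextOrdEndingWith(byte[] lastBytes, int fromOrd, byte target) {
        for (int ord = Math.max(fromOrd, 0); ord < lastBytes.length; ord++) {
            if (lastBytes[ord] == target) {
                return ord;
            }
        }
        return -1;
    }
}
```

The returned ord would then be the target for an ord-based seek (the
getByOutput-like step mentioned above). The term still has to be checked
against the rest of the pattern, because the side array only answers the
"ends with c" part. The cost is one extra byte per term, which is where the
index size increase comes from.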
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index, core/search
> Affects Versions: 4.0-ALPHA
> Reporter: Simon Willnauer
> Assignee: Han Jiang
> Labels: gsoc2013
> Fix For: 5.0, 4.5
>
> Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch,
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a
> delta codec file for scanning to terms. Some environments have enough memory
> available to keep the entire FST based term dict in memory. We should add a
> TermDictionary implementation that encodes all needed information for each
> term into the FST (custom fst.Output) and builds a FST from the entire term
> not just the delta.