[
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-4198:
---------------------------------
Attachment: LUCENE-4198.patch
New patch. This time it has tests, does basic testing in CheckIndex, and avoids
excessive cloning.
Results are very good, almost too good, on queries that score on a single term.
I'm currently thinking about how we could change the API to have something that
is easier to propagate through boolean queries, even if it means term queries
can't be as fast.
{noformat}
Task                  QPS baseline   StdDev  QPS patch   StdDev  Pct diff
AndHighLow                 2050.37   (4.2%)    1745.54   (2.0%)    -14.9% ( -20% -   -9%)
OrHighLow                   922.62   (3.7%)     844.54   (2.4%)     -8.5% ( -14% -   -2%)
AndHighMed                  277.85   (1.8%)     258.11   (2.6%)     -7.1% ( -11% -   -2%)
OrNotHighLow               1105.41   (3.6%)    1044.69   (2.0%)     -5.5% ( -10% -    0%)
AndHighHigh                 128.97   (1.1%)     121.89   (2.7%)     -5.5% (  -9% -   -1%)
Fuzzy2                      166.62   (6.2%)     158.38   (6.3%)     -4.9% ( -16% -    8%)
OrHighMed                   177.56   (2.3%)     170.05   (1.9%)     -4.2% (  -8% -    0%)
Fuzzy1                      199.16   (4.4%)     193.05   (5.5%)     -3.1% ( -12% -    7%)
MedSloppyPhrase              53.92   (2.2%)      52.40   (2.3%)     -2.8% (  -7% -    1%)
LowPhrase                   201.13   (1.7%)     195.87   (1.0%)     -2.6% (  -5% -    0%)
LowSpanNear                 363.85   (3.0%)     355.07   (2.5%)     -2.4% (  -7% -    3%)
HighPhrase                   62.68   (1.6%)      61.32   (1.2%)     -2.2% (  -4% -    0%)
HighTermMonthSort           218.42   (9.8%)     214.35   (8.3%)     -1.9% ( -18% -   18%)
MedSpanNear                  46.65   (1.4%)      45.89   (1.5%)     -1.6% (  -4% -    1%)
MedPhrase                   178.02   (1.5%)     175.24   (1.2%)     -1.6% (  -4% -    1%)
HighSpanNear                 10.21   (3.4%)      10.11   (3.4%)     -1.0% (  -7% -    6%)
HighSloppyPhrase             32.32   (7.3%)      32.01   (7.1%)     -1.0% ( -14% -   14%)
LowSloppyPhrase              18.01   (2.7%)      17.85   (2.7%)     -0.9% (  -6% -    4%)
Respell                     320.99   (2.1%)     321.02   (2.4%)      0.0% (  -4% -    4%)
IntNRQ                       29.29  (11.6%)      29.42  (12.5%)      0.4% ( -21% -   27%)
Wildcard                    189.97   (4.6%)     191.87   (3.9%)      1.0% (  -7% -    9%)
Prefix3                     166.43   (6.2%)     169.95   (5.4%)      2.1% (  -8% -   14%)
OrHighHigh                   48.00   (3.7%)      49.09   (3.9%)      2.3% (  -5% -   10%)
HighTermDayOfYearSort       146.88   (7.4%)     150.76   (8.0%)      2.6% ( -11% -   19%)
LowTerm                     830.79   (2.6%)    2246.40   (9.9%)    170.4% ( 153% -  187%)
OrNotHighMed                180.11   (1.5%)    1454.55  (15.7%)    707.6% ( 680% -  735%)
MedTerm                     216.16   (1.7%)    3834.73  (37.0%)   1674.0% (1608% - 1742%)
HighTerm                    109.49   (2.0%)    1944.44  (45.3%)   1675.9% (1597% - 1757%)
OrHighNotMed                 57.55   (1.1%)    1292.66  (57.7%)   2146.2% (2064% - 2229%)
OrHighNotLow                 84.00   (1.1%)    1996.82  (75.4%)   2277.2% (2176% - 2379%)
OrNotHighHigh                58.22   (1.3%)    1479.53  (53.5%)   2441.4% (2356% - 2528%)
OrHighNotHigh                66.91   (1.2%)    2042.54  (55.1%)   2952.6% (2862% - 3045%)
{noformat}
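As a rough illustration of why single-term queries can benefit this much from
indexed impacts, here is a minimal Java sketch of skipping whole blocks of
postings when a stored per-block maximum cannot beat the current k-th best
score. Everything in it is an assumption made for illustration: the Block
class, the toy score function (which ignores norms) and countScored are
hypothetical names, not Lucene APIs and not the code in the attached patch.

```java
import java.util.PriorityQueue;

/** Sketch only: none of these names are Lucene APIs. */
final class ImpactSkippingSketch {

    /** A block of postings plus the largest term frequency seen in it. */
    static final class Block {
        final int[] docs;
        final int[] freqs;
        final int maxFreq; // hypothetical per-block "impact"

        Block(int[] docs, int[] freqs) {
            this.docs = docs;
            this.freqs = freqs;
            int max = 0;
            for (int f : freqs) {
                max = Math.max(max, f);
            }
            this.maxFreq = max;
        }
    }

    /** Toy score that grows with freq; a stand-in for a real Similarity. */
    static float score(int freq) {
        return freq / (freq + 1.0f);
    }

    /** Collects the top k scores and returns how many postings were scored. */
    static int countScored(Block[] blocks, int k) {
        PriorityQueue<Float> topK = new PriorityQueue<>(); // min-heap of the best k scores
        int scored = 0;
        for (Block b : blocks) {
            // Skip the block if even its best possible score cannot enter the top k.
            if (topK.size() == k && score(b.maxFreq) <= topK.peek()) {
                continue;
            }
            for (int i = 0; i < b.docs.length; i++) {
                scored++;
                float s = score(b.freqs[i]);
                if (topK.size() < k) {
                    topK.add(s);
                } else if (s > topK.peek()) {
                    topK.poll();
                    topK.add(s);
                }
            }
        }
        return scored;
    }

    public static void main(String[] args) {
        Block[] blocks = {
            new Block(new int[] {0, 1}, new int[] {5, 1}),
            new Block(new int[] {2, 3}, new int[] {1, 2}),
            new Block(new int[] {4, 5}, new int[] {9, 1}),
        };
        System.out.println(countScored(blocks, 1)); // prints 4
    }
}
```

With k = 1, the second block is never scored because its best possible score
cannot beat the one already collected, so only 4 of the 6 postings are visited.
A boolean query would have to combine such bounds across several terms.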
> Allow codecs to index term impacts
> ----------------------------------
>
> Key: LUCENE-4198
> URL: https://issues.apache.org/jira/browse/LUCENE-4198
> Project: Lucene - Core
> Issue Type: Sub-task
> Components: core/index
> Reporter: Robert Muir
> Attachments: LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch,
> LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his
> implementation currently stores a max for the entire term, the problem is the
> same).
> We can imagine other similar algorithms too: I think the codec API should be
> able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a
> tool to 'rewrite' your index, he passes the IndexReader and Similarity to it.
> But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the
> Similarity. Another problem is that it needs access to the term and
> collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking of experimenting
> in a branch with these changes to see if we can make it work well.
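To make the two requirements in the description concrete, here is a hedged
Java sketch of a per-term writer that is handed a scorer built from
statistics known before the first posting arrives, so it can track the
term's maximum score while postings stream by. The Scorer, Stats, scorerFor
and TermWriter names and the toy scoring formula are all hypothetical; they
do not mirror Lucene's codec or Similarity APIs.

```java
/** Sketch only: these types are hypothetical, not Lucene's codec API. */
final class ImpactWriterSketch {

    /** Hypothetical stand-in for the scoring side of a Similarity. */
    interface Scorer {
        float score(int freq, int docLength);
    }

    /** Statistics that must be known before the first posting is written. */
    static final class Stats {
        final long docCount;       // documents in the collection
        final long docFreq;        // documents containing the term
        final double avgDocLength; // average field length

        Stats(long docCount, long docFreq, double avgDocLength) {
            this.docCount = docCount;
            this.docFreq = docFreq;
            this.avgDocLength = avgDocLength;
        }
    }

    /** Toy tf-idf style scorer built from up-front stats (not a real Similarity). */
    static Scorer scorerFor(Stats stats) {
        final float idf = (float) Math.log(1.0 + (double) stats.docCount / stats.docFreq);
        return (freq, docLength) ->
            (float) (idf * freq / (freq + docLength / stats.avgDocLength));
    }

    /** Hypothetical per-term writer that tracks the best score seen so far. */
    static final class TermWriter {
        private final Scorer scorer;
        private float maxScore = 0f;

        TermWriter(Scorer scorer) {
            this.scorer = scorer;
        }

        void addPosting(int docId, int freq, int docLength) {
            // A real writer would also encode docId and freq; only the impact is tracked here.
            maxScore = Math.max(maxScore, scorer.score(freq, docLength));
        }

        /** Returns the maximum score over all postings added for this term. */
        float finishTerm() {
            return maxScore;
        }
    }
}
```

The scorer depends on docFreq, which a writer normally learns only after it
has seen every posting for the term. That is why the sketch assumes the
statistics are supplied up front.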