[
https://issues.apache.org/jira/browse/LUCENE-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-4198:
---------------------------------
Attachment: LUCENE-4198.patch
I have taken another approach. The issue with {{setMinCompetitiveScore}} is that
it usually cannot be efficiently leveraged to speed up e.g. conjunctions. So I
went with implementing ideas from the block-max WAND (BMW) paper
(http://engineering.nyu.edu/~suel/papers/bmw.pdf): the patch introduces a new
{{ImpactsEnum}}, which extends {{PostingsEnum}} and introduces two APIs instead
of {{setMinCompetitiveScore}}:
- {{int advanceShallow(int target)}} to get scoring information for documents
starting at {{target}}. The benefit compared to {{advance}} is that it only
advances the skip-list reader, which is much cheaper: no postings decoding happens.
- {{float getMaxScore(int upTo)}} which gives an upper bound on scores for doc
ids between the last {{advanceShallow}} target and {{upTo}}, both inclusive.
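To make the interplay of the two APIs concrete, here is a small self-contained sketch (not Lucene code; the class, toy data, and the per-block array of max scores are all hypothetical) of how a scorer could use {{advanceShallow}}/{{getMaxScore}}-style per-block bounds to skip non-competitive blocks of 128 docs without decoding postings:

```java
// Illustrative sketch only: per-block score upper bounds let a scorer
// skip whole blocks whose best possible score is not competitive,
// mirroring the advanceShallow/getMaxScore contract described above.
public class BlockMaxSkipDemo {
    static final int BLOCK_SIZE = 128;

    // Returns the first doc id >= target inside a block whose max score
    // is competitive, or -1 if no remaining block can be competitive.
    static int advanceToCompetitiveBlock(float[] blockMaxScores,
                                         int target,
                                         float minCompetitiveScore) {
        int block = target / BLOCK_SIZE;
        while (block < blockMaxScores.length) {
            // getMaxScore(upTo) analogue: upper bound for this block.
            if (blockMaxScores[block] >= minCompetitiveScore) {
                return Math.max(target, block * BLOCK_SIZE);
            }
            // advanceShallow analogue: move on without decoding postings.
            block++;
        }
        return -1;
    }

    public static void main(String[] args) {
        float[] maxScores = {0.5f, 2.0f, 0.3f, 3.1f}; // toy bounds
        System.out.println(advanceToCompetitiveBlock(maxScores, 0, 1.5f));   // 128
        System.out.println(advanceToCompetitiveBlock(maxScores, 200, 2.5f)); // 384
        System.out.println(advanceToCompetitiveBlock(maxScores, 0, 5.0f));   // -1
    }
}
```

The real enum would additionally position the postings reader, but the skipping decision itself reduces to this comparison against the block bound.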
Currently only TermScorer leverages this, but the benefit is that we could add
these APIs to Scorer as well in a follow-up issue so that WANDScorer and
ConjunctionScorer could leverage them too. I already built a prototype to make
sure that there is an actual speedup for some queries, but I'm leaving it to a
follow-up issue since indexing impacts is already challenging on its own. One
thing it made me change though is that the new patch also stores all impacts on
the first skip level, which is written every 128 documents. This seemed
important for conjunctions: unlike term queries, which match all documents in a
block, conjunctions do not always reach the maximum score of a given block, so
having tight score bounds matters more for them than for term queries. The disk
overhead is still acceptable to me: the wikimedium10 index is only 1.4% larger
overall, and postings alone (the .doc file) are only 3.1% larger.
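The score bound for a block comes from the impacts stored for it, i.e. (freq, norm) pairs. As a sketch (the class and the toy similarity formula below are my own for illustration, not the patch's scoring function), the bound is just the best score any stored pair can produce:

```java
import java.util.List;

// Sketch: an "impact" pairs a term frequency with a length norm; the
// score upper bound for a block is the best score any stored (freq,
// norm) pair in that block can yield. The similarity here is a toy
// freq / (freq + norm) formula, standing in for the real one.
public class ImpactDemo {
    record Impact(int freq, long norm) {}

    static float toyScore(int freq, long norm) {
        return (float) freq / (freq + norm);
    }

    // Maximizing over the block's impacts gives a score bound without
    // decoding any postings.
    static float maxScore(List<Impact> impacts) {
        float max = 0f;
        for (Impact i : impacts) {
            max = Math.max(max, toyScore(i.freq(), i.norm()));
        }
        return max;
    }

    public static void main(String[] args) {
        List<Impact> block =
            List.of(new Impact(3, 10), new Impact(7, 20), new Impact(1, 4));
        System.out.println(maxScore(block));
    }
}
```

Storing these pairs for every 128-document block is what drives the modest size increase quoted above.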
Here are the benchmark results:
{noformat}
                 Task  QPS baseline  StdDev  QPS patch  StdDev               Pct diff
           AndHighLow       1128.91  (3.5%)     875.48  (2.3%)   -22.4% ( -27% -  -17%)
           AndHighMed        409.67  (2.0%)     331.98  (1.7%)   -19.0% ( -22% -  -15%)
            OrHighMed        264.99  (3.5%)     229.15  (3.0%)   -13.5% ( -19% -   -7%)
            OrHighLow        111.47  (4.5%)      98.00  (3.2%)   -12.1% ( -18% -   -4%)
           OrHighHigh         34.88  (4.2%)      31.69  (4.0%)    -9.1% ( -16% -   -1%)
         OrNotHighLow       1373.74  (5.2%)    1291.72  (4.1%)    -6.0% ( -14% -    3%)
            LowPhrase         78.14  (1.6%)      75.28  (1.2%)    -3.7% (  -6% -    0%)
            MedPhrase         47.49  (1.6%)      45.92  (1.2%)    -3.3% (  -5% -    0%)
      LowSloppyPhrase        208.43  (2.8%)     202.37  (2.7%)    -2.9% (  -8% -    2%)
               Fuzzy1        300.99  (7.7%)     292.78  (8.0%)    -2.7% ( -17% -   13%)
          LowSpanNear         62.73  (1.4%)      61.09  (1.3%)    -2.6% (  -5% -    0%)
               Fuzzy2        188.37  (7.9%)     184.16  (6.7%)    -2.2% ( -15% -   13%)
          MedSpanNear         57.41  (1.8%)      56.17  (1.5%)    -2.2% (  -5% -    1%)
      MedSloppyPhrase         23.21  (2.3%)      22.73  (2.3%)    -2.1% (  -6% -    2%)
           HighPhrase         48.75  (3.2%)      47.80  (3.6%)    -1.9% (  -8% -    4%)
         HighSpanNear         40.04  (2.9%)      39.35  (2.7%)    -1.7% (  -7% -    4%)
    HighTermMonthSort        228.21  (8.4%)     224.66  (7.9%)    -1.6% ( -16% -   16%)
     HighSloppyPhrase         25.96  (2.8%)      25.61  (3.0%)    -1.4% (  -6% -    4%)
              Respell        284.85  (3.7%)     282.42  (4.0%)    -0.9% (  -8% -    7%)
               IntNRQ         18.87  (5.3%)      18.86  (6.8%)    -0.1% ( -11% -   12%)
             Wildcard         85.50  (5.0%)      86.79  (4.0%)     1.5% (  -7% -   11%)
              Prefix3        137.41  (6.5%)     141.61  (4.9%)     3.1% (  -7% -   15%)
HighTermDayOfYearSort        116.58  (6.3%)     121.38  (7.2%)     4.1% (  -8% -   18%)
          AndHighHigh         37.64  (1.5%)     118.12  (6.7%)   213.8% ( 202% -  225%)
              LowTerm        909.13  (2.2%)    3379.38 (11.2%)   271.7% ( 252% -  291%)
         OrNotHighMed        196.21  (1.7%)    1509.92 (28.9%)   669.6% ( 627% -  712%)
              MedTerm        305.82  (1.7%)    2897.01 (42.5%)   847.3% ( 789% -  907%)
             HighTerm        108.94  (1.7%)    1191.54 (61.3%)   993.8% ( 915% - 1075%)
         OrHighNotMed         81.83  (0.5%)    1082.94 (63.2%)  1223.5% (1153% - 1294%)
        OrHighNotHigh         63.16  (0.7%)     995.35 (72.0%)  1475.8% (1393% - 1559%)
         OrHighNotLow         34.53  (0.9%)     557.35 (94.8%)  1514.1% (1406% - 1623%)
        OrNotHighHigh         51.95  (0.9%)     943.81 (73.7%)  1716.6% (1626% - 1808%)
{noformat}
Results are somewhat worse than previously for two reasons:
- The new API doesn't allow term queries to skip more than one block (128
docs) at a time. But I think the prospect of speeding up boolean queries too
makes it a good trade-off, especially given that this is already quite a
significant speedup for term queries.
- The new implementation is more complex since it also (optionally) gives
access to positions, offsets and payloads. I did not specialize the case where
only frequencies are requested (hopefully this won't be necessary). Information
about positions could be useful in the future to speed up phrase queries.
I wouldn't worry too much about the slowdown for some boolean queries. First,
the most affected queries are ones that are already quite fast. Plus, I hope
that follow-ups will allow us to get some performance back.
> Allow codecs to index term impacts
> ----------------------------------
>
> Key: LUCENE-4198
> URL: https://issues.apache.org/jira/browse/LUCENE-4198
> Project: Lucene - Core
> Issue Type: Sub-task
> Components: core/index
> Reporter: Robert Muir
> Attachments: LUCENE-4198.patch, LUCENE-4198.patch, LUCENE-4198.patch,
> LUCENE-4198.patch, LUCENE-4198_flush.patch
>
>
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his
> implementation currently stores a max for the entire term, the problem is the
> same).
> We can imagine other similar algorithms too: I think the codec API should be
> able to support these.
> Currently it really doesn't: Stefan worked around the problem by providing a
> tool to 'rewrite' your index, passing the IndexReader and Similarity to it.
> But it would be better if we fixed the codec API.
> One problem is that the Postings writer needs to have access to the
> Similarity. Another problem is that it needs access to the term and
> collection statistics up front, rather than after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment
> in a branch with these changes and see if we can make it work well.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)