[ https://issues.apache.org/jira/browse/LUCENE-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872142#comment-16872142 ]
Jim Ferenczi commented on LUCENE-8806:
--------------------------------------
I ran luceneutil with some disjunctions of phrase and term queries:
{noformat}
Task                  QPS baseline  StdDev  QPS patch  StdDev  Pct diff
HighPhraseHighTerm            8.47  (1.6%)       4.78  (2.6%)  -43.6% ( -47% - -40%)
MedPhraseHighTerm            15.54  (1.2%)       9.41  (2.5%)  -39.5% ( -42% - -36%)
HighPhraseHighPhrase          5.99  (1.4%)       3.65  (3.0%)  -39.0% ( -42% - -35%)
HighPhraseLowPhrase          15.57  (1.2%)      14.26  (3.6%)   -8.4% ( -13% - -3%)
LowPhraseLowPhrase           27.25  (2.0%)      31.75  (4.5%)   16.5% ( 9% - 23%)
HighPhraseLowTerm            26.31  (0.9%)      31.42  (3.4%)   19.4% ( 14% - 24%)
HighPhraseMedTerm            12.95  (1.0%)      15.74  (3.8%)   21.6% ( 16% - 26%)
MedPhraseMedPhrase            9.21  (2.4%)      11.50  (8.3%)   24.9% ( 13% - 36%)
MedPhraseLowTerm             24.85  (1.6%)      31.52  (5.5%)   26.8% ( 19% - 34%)
MedPhraseLowPhrase           11.64  (2.3%)      15.06  (7.1%)   29.3% ( 19% - 39%)
HighPhraseMedPhrase           8.27  (2.0%)      10.77  (7.2%)   30.2% ( 20% - 40%)
MedPhraseMedTerm             14.53  (1.7%)      19.33  (5.6%)   33.0% ( 25% - 40%)
{noformat}
While the change speeds up some cases, it also shows a non-negligible regression
(39% to 44% slower) on HighPhraseHighTerm, MedPhraseHighTerm and
HighPhraseHighPhrase, and a smaller one (-8.4%) on HighPhraseLowPhrase.
Currently the phrase scorer doesn't check impacts to compute the max score per
block, so I hacked together a simple patch that merges the impacts of the terms
that appear in the phrase query. The patch keeps the minimum frequency per norm
value in order to compute an upper bound on the score of the phrase query. I
ran luceneutil again with the modified patch and the results are much better:
{noformat}
Task                  QPS baseline  StdDev  QPS patch  StdDev  Pct diff
HighPhraseHighTerm            8.22  (3.3%)       8.83  (1.9%)    7.4% ( 2% - 12%)
LowPhraseLowPhrase           26.57  (0.7%)      28.55  (5.5%)    7.4% ( 1% - 13%)
HighPhraseMedPhrase           7.98  (0.8%)       9.01  (5.0%)   12.9% ( 7% - 18%)
MedPhraseMedPhrase            8.95  (1.4%)      10.11  (6.6%)   12.9% ( 4% - 21%)
MedPhraseHighTerm            15.10  (1.1%)      17.69  (4.6%)   17.2% ( 11% - 23%)
MedPhraseLowPhrase           11.17  (1.1%)      13.11  (4.9%)   17.4% ( 11% - 23%)
HighPhraseLowPhrase          15.09  (1.5%)      18.85  (7.4%)   24.9% ( 15% - 34%)
HighPhraseHighPhrase          5.75  (2.3%)       7.26  (4.5%)   26.2% ( 18% - 33%)
HighPhraseLowTerm            25.68  (0.7%)      34.46  (2.4%)   34.2% ( 30% - 37%)
MedPhraseMedTerm             14.23  (0.1%)      20.71  (2.3%)   45.5% ( 43% - 47%)
MedPhraseLowTerm             24.30  (0.6%)      38.47  (2.4%)   58.3% ( 55% - 61%)
HighPhraseMedTerm            12.77  (0.6%)      22.21  (3.1%)   73.9% ( 69% - 77%)
{noformat}
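To make the idea of merging impacts concrete, here is a minimal standalone
sketch of "keep the minimum frequency per norm value". It is not the patch and
it does not use Lucene's Impacts API: the class name, the plain
Map<norm, maxFreq> representation and the toy scoring function are all
assumptions made for illustration. The reasoning is that a phrase cannot occur
in a document more often than its rarest term, so the minimum of the per-term
frequencies bounds the phrase frequency. How to treat a norm value reported by
only some of the terms is not covered in this comment; the sketch simply keeps
whatever value is available.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PhraseImpactsSketch {

    // Each input map is a hypothetical per-term view of one block:
    // norm value -> maximum term frequency reported for that norm.
    // For every norm, keep the minimum frequency seen across the terms.
    static Map<Long, Integer> mergeImpacts(List<Map<Long, Integer>> perTermImpacts) {
        Map<Long, Integer> merged = new TreeMap<>();
        for (Map<Long, Integer> impacts : perTermImpacts) {
            for (Map.Entry<Long, Integer> e : impacts.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Math::min);
            }
        }
        return merged;
    }

    // Toy stand-in for a similarity (NOT BM25 as implemented in Lucene):
    // it grows with freq and shrinks as the norm (document length) grows.
    static double toyScore(int freq, long norm) {
        return freq / (freq + (double) norm);
    }

    // Upper bound of the toy score over the merged (freq, norm) pairs.
    static double maxScore(Map<Long, Integer> impacts) {
        double max = 0;
        for (Map.Entry<Long, Integer> e : impacts.entrySet()) {
            max = Math.max(max, toyScore(e.getValue(), e.getKey()));
        }
        return max;
    }

    public static void main(String[] args) {
        // Made-up impacts for the two terms of a phrase.
        Map<Long, Integer> termA = Map.of(1L, 3, 4L, 10);
        Map<Long, Integer> termB = Map.of(1L, 2, 4L, 12);
        Map<Long, Integer> merged = mergeImpacts(List.of(termA, termB));
        System.out.println(merged); // prints {1=2, 4=10}
        System.out.println(maxScore(merged));
    }
}
```

The merged map can be consulted per block instead of a global maximum, which
is what would let the WAND scorer skip blocks that cannot be competitive.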
However, simple phrase queries (without disjunctions) seem to be slower with
the merging of impacts:
{noformat}
Task                  QPS baseline  StdDev  QPS patch  StdDev  Pct diff
HighPhrase                   10.48  (0.0%)       9.74  (0.0%)   -7.1% ( -7% - -7%)
MedPhrase                    20.92  (0.0%)      20.25  (0.0%)   -3.2% ( -3% - -3%)
LowPhrase                    24.07  (0.0%)      23.33  (0.0%)   -3.1% ( -3% - -3%)
{noformat}
I am not sure yet that the merging of impacts is correct, so I'll add some
tests. It's also unrelated to this change (even if it helps performance), so
I'll open a separate issue to discuss the merging of impacts for phrase
queries.
Considering the results of this change alone (two-phase iterator for the
WANDScorer), I will not merge it yet since it doesn't improve queries with
lots of matches, but we can revisit when/if the merging of impacts for phrase
queries is implemented. WDYT?
> WANDScorer should support two-phase iterator
> --------------------------------------------
>
> Key: LUCENE-8806
> URL: https://issues.apache.org/jira/browse/LUCENE-8806
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Major
> Attachments: LUCENE-8806.patch, LUCENE-8806.patch
>
>
> Following https://issues.apache.org/jira/browse/LUCENE-8770 the WANDScorer
> should leverage two-phase iterators in order to be faster when used in
> conjunctions.