[
https://issues.apache.org/jira/browse/LUCENE-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319789#comment-14319789
]
Adrien Grand edited comment on LUCENE-6198 at 2/13/15 9:57 AM:
---------------------------------------------------------------
I did some more benchmarking, and one thing that helped was flattening clauses
in ConjunctionDISI. This typically means that {{+"A B" +C}} is now
approximated as {{+A +B +C}} instead of {{+(+A +B) +C}} (see attached patch).
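To illustrate the flattening, here is a minimal sketch in Java. It is not the
attached patch and uses no Lucene classes: {{FlattenSketch}}, {{Clause}},
{{leaf}}, {{and}} and {{flatten}} are hypothetical names. It only covers the
approximation side; a phrase would still need its positions verified once the
flat conjunction has found a candidate document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a tree of MUST clauses; not a Lucene API.
final class FlattenSketch {
    /** Either a leaf term (children == null) or a nested conjunction. */
    static final class Clause {
        final String term;           // set for a leaf such as A
        final List<Clause> children; // set for a nested conjunction such as +(+A +B)

        private Clause(String term, List<Clause> children) {
            this.term = term;
            this.children = children;
        }

        static Clause leaf(String term) {
            return new Clause(term, null);
        }

        static Clause and(Clause... clauses) {
            return new Clause(null, Arrays.asList(clauses));
        }
    }

    /** Collects the leaf terms of nested conjunctions into one flat list. */
    static List<String> flatten(Clause root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Clause clause, List<String> out) {
        if (clause.children == null) {
            out.add(clause.term);
            return;
        }
        for (Clause child : clause.children) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        // +(+A +B) +C, where +(+A +B) approximates the phrase "A B"
        Clause nested = Clause.and(
            Clause.and(Clause.leaf("A"), Clause.leaf("B")),
            Clause.leaf("C"));
        System.out.println(flatten(nested)); // [A, B, C], i.e. +A +B +C
    }
}
```

With a flat list, the conjunction is free to lead with whichever clause is
rarest, instead of treating the phrase's terms as one opaque sub-conjunction.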
Here are results on wikibig:
{noformat}
Task                   QPS baseline   StdDev  QPS patch   StdDev  Pct diff
AndMedPhraseHighTerm          21.19   (6.1%)      19.98   (2.6%)   -5.7% (-13% -   3%)
PKLookup                     334.11   (2.1%)     334.82   (2.2%)    0.2% ( -4% -   4%)
AndHighPhraseHighTerm         11.64   (4.1%)      11.83   (2.4%)    1.6% ( -4% -   8%)
AndHighPhraseMedTerm          19.19   (2.5%)      21.99   (2.1%)   14.6% (  9% -  19%)
AndMedPhraseMedTerm           58.27   (6.3%)      67.53   (6.6%)   15.9% (  2% -  30%)
AndHighPhraseLowTerm          35.07   (5.6%)      42.46   (6.1%)   21.1% (  8% -  34%)
AndMedPhraseLowTerm           93.39   (8.0%)     128.24  (13.3%)   37.3% ( 14% -  63%)
{noformat}
I was curious about the slowdown on AndMedPhraseHighTerm. For instance we have
{{+"los angeles" +title}}. {{title}} has a high doc frequency, so {{"los
angeles"}} leads the iteration on trunk, meaning that we check positions on
38591 documents (the number of matches of {{+los +angeles}}). With the patch,
we intersect with {{title}} before checking positions, meaning that we only
check positions on 30711 documents (about 20% fewer). That does not seem to be
low enough compared to 38591 to make the query faster.
However, if we take a query from AndMedPhraseLowTerm like {{+"los angeles"
+rivers}}, this time we only check positions on 1238 documents instead of
38591, hence the speedup.
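The effect on the number of position checks can be sketched on toy data. The
posting lists below are made up for illustration (they are not the wikibig
numbers), {{PositionCheckSketch}} and {{intersect}} are hypothetical names,
and the intersection uses a plain binary search rather than Lucene's
advance()-based leapfrogging.

```java
import java.util.Arrays;

// Toy model: the number of candidates left after intersecting the cheap
// doc-ID lists is the number of documents whose positions must be read.
final class PositionCheckSketch {
    /** Doc IDs present in every list; each list must be sorted ascending. */
    static int[] intersect(int[]... lists) {
        return Arrays.stream(lists[0])
            .filter(doc -> Arrays.stream(lists)
                .allMatch(list -> Arrays.binarySearch(list, doc) >= 0))
            .toArray();
    }

    public static void main(String[] args) {
        int[] los = {1, 2, 3, 5, 8, 9, 12};
        int[] angeles = {2, 3, 5, 7, 8, 12};
        int[] title = {2, 3, 5, 8, 11}; // frequent term, filters out little
        int[] rivers = {3, 20};         // rare term, filters out a lot

        // Phrase leads: positions are checked on every doc in +los +angeles.
        System.out.println(intersect(los, angeles).length);         // 5
        // Flattened: the extra term removes candidates before any position check.
        System.out.println(intersect(los, angeles, title).length);  // 4
        System.out.println(intersect(los, angeles, rivers).length); // 1
    }
}
```

In this toy setup the frequent term barely reduces the candidate set, while
the rare term removes most of it, which mirrors the difference between the
{{title}} and {{rivers}} queries above.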
Edit: fixed the explanation which was backwards :)
was (Author: jpountz):
I did some more benchmarking, and one thing that helped was flattening clauses
in ConjunctionDISI. This typically means that {{+"A B" +C}} is now
approximated as {{+A +B +C}} instead of {{+(+A +B) +C}} (see attached patch).
Here are results on wikibig:
{noformat}
Task                   QPS baseline   StdDev  QPS patch   StdDev  Pct diff
AndMedPhraseHighTerm          21.19   (6.1%)      19.98   (2.6%)   -5.7% (-13% -   3%)
PKLookup                     334.11   (2.1%)     334.82   (2.2%)    0.2% ( -4% -   4%)
AndHighPhraseHighTerm         11.64   (4.1%)      11.83   (2.4%)    1.6% ( -4% -   8%)
AndHighPhraseMedTerm          19.19   (2.5%)      21.99   (2.1%)   14.6% (  9% -  19%)
AndMedPhraseMedTerm           58.27   (6.3%)      67.53   (6.6%)   15.9% (  2% -  30%)
AndHighPhraseLowTerm          35.07   (5.6%)      42.46   (6.1%)   21.1% (  8% -  34%)
AndMedPhraseLowTerm           93.39   (8.0%)     128.24  (13.3%)   37.3% ( 14% -  63%)
{noformat}
I was curious about the slowdown on AndMedPhraseHighTerm. It actually seems to
be tied to the fact that terms are not random. For instance, one query of this
task is {{+"los angeles" +title}}, which matches 30669 documents. However, the
approximation is {{+los +angeles +title}} and matches 30711 documents, so the
approximation in this case only adds overhead.
> two phase intersection
> ----------------------
>
> Key: LUCENE-6198
> URL: https://issues.apache.org/jira/browse/LUCENE-6198
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Robert Muir
> Attachments: LUCENE-6198.patch, LUCENE-6198.patch, LUCENE-6198.patch,
> LUCENE-6198.patch, LUCENE-6198.patch, phrase_intersections.tasks
>
>
> Currently some scorers have to do a lot of per-document work to determine if
> a document is a match. The simplest example is a phrase scorer, but there are
> others (spans, sloppy phrase, geospatial, etc).
> Imagine a conjunction with two MUST clauses, one that is a term that matches
> all odd documents, another that is a phrase matching all even documents.
> Today this conjunction will be very expensive, because the zig-zag
> intersection is reading a ton of useless positions.
> The same problem happens with filteredQuery and anything else that acts like
> a conjunction.