[jira] Commented: (LUCENE-2879) MultiPhraseQuery sums its own idf instead of Similarity.

Robert Muir (JIRA) Sun, 23 Jan 2011 06:31:13 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985332#action_12985332
 ]


Robert Muir commented on LUCENE-2879:
-------------------------------------

{quote}
A small thing that bothered me was that an explanation is created although the 
user did not call explain(), and in general explain() is considered slower, but 
it is called once per query, so it should not be a perf issue, and that's the 
case already for two other queries so anyhow this one (MPQ) should first be 
made consistent, which is done by this patch.
{quote}

Well, this IDFExplanation is confusing/tricky... it's an abstract class, so 
with a good implementation, creating the "Explanation" does nothing really.

Instead the explanation is calculated "lazily", only if you ask for it:
{noformat}
    /**
     * This should be calculated lazily if possible.
     * 
     * @return the explanation for the idf factor.
     */
    public abstract String explain();
{noformat}
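
To make the "lazy" point concrete, here is a minimal sketch. It does not use the real Lucene classes: LazyIdfExplanation is a hypothetical stand-in with the same shape as the abstract class quoted above, the explainCalls counter exists only for illustration, and the idf formula is assumed to be the DefaultSimilarity-style log(numDocs/(docFreq+1)) + 1. The idf value is computed up front, but the explanation string is only built if somebody calls explain():

```java
// Hypothetical stand-in for the abstract class quoted above (not the Lucene class itself).
abstract class LazyIdfExplanation {
    abstract float getIdf();
    abstract String explain();
}

class LazyIdfDemo {
    // Counts how often the explanation string is actually built (illustration only).
    static int explainCalls = 0;

    // Assumed DefaultSimilarity-style idf: log(numDocs / (docFreq + 1)) + 1.
    static LazyIdfExplanation idfExplain(final int docFreq, final int numDocs) {
        final float idf = (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        return new LazyIdfExplanation() {
            @Override
            float getIdf() {
                return idf; // cheap: the value was computed once, up front
            }

            @Override
            String explain() {
                explainCalls++; // the string is only built when someone asks for it
                return "idf(docFreq=" + docFreq + ", maxDocs=" + numDocs + ")";
            }
        };
    }

    public static void main(String[] args) {
        LazyIdfExplanation e = idfExplain(9, 100);
        System.out.println("idf = " + e.getIdf());
        System.out.println("explain() calls so far: " + explainCalls);
        System.out.println(e.explain());
    }
}
```

So creating the object costs little more than an allocation; the string concatenation only happens on the explain() path.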

{quote}
Not saying that the patch should change, just pointing out the difference in 
sum-of-square-weights computation between SpanWeight and MPQ.
{quote}

I saw this and it bothered me a bit as well. But I suppose it's OK, given 
that the whole thing is only an approximation anyway, right?
(In a lot of more "ordinary" short queries, the # of unique terms will be 
similar to the # of terms.)

Additionally, if this really bothered someone, they could work around it by 
putting all the terms into a HashSet in their IDF implementation, to make
PhraseQuery and MultiPhraseQuery work like SpanQuery.
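
A rough sketch of that workaround, outside of Lucene: terms are plain Strings here, the docFreqs map and the UniqueTermIdf class are hypothetical, and the idf formula is again assumed to be the DefaultSimilarity-style one. The only point is that copying the terms into a HashSet before summing makes a repeated term count once:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class UniqueTermIdf {
    // Assumed DefaultSimilarity-style idf: log(numDocs / (docFreq + 1)) + 1.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Sums idf over every term as given, so a repeated term is counted repeatedly.
    static float sumIdf(Collection<String> terms, Map<String, Integer> docFreqs, int numDocs) {
        float sum = 0f;
        for (String term : terms) {
            sum += idf(docFreqs.get(term), numDocs);
        }
        return sum;
    }

    // The workaround: copy the terms into a HashSet first, so each unique term counts once.
    static float sumIdfUnique(Collection<String> terms, Map<String, Integer> docFreqs, int numDocs) {
        return sumIdf(new HashSet<String>(terms), docFreqs, numDocs);
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new HashMap<String, Integer>();
        docFreqs.put("new", 9);
        docFreqs.put("york", 4);
        List<String> phrase = Arrays.asList("new", "york", "new");
        System.out.println("all terms:    " + sumIdf(phrase, docFreqs, 100));
        System.out.println("unique terms: " + sumIdfUnique(phrase, docFreqs, 100));
    }
}
```

In a real Similarity the same de-duplication would go into the idf-over-a-collection method, before the per-term loop.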

In general, when I look at the SpanQueries I am frustrated with other scoring 
problems.
For example, I think that SpanScorer by default should be consistent with our 
other Queries.
But imagine a simple SpanTermQuery: its tf() calculation is done like this:
{noformat}
   while (matches) {
      int matchLength = spans.end() - spans.start();
      freq += similarity.sloppyFreq(matchLength);
   }
   ...
   similarity.tf(freq);
{noformat}

In my opinion this is an off-by-one :)
In the current implementation, spans.end() - spans.start() is 1 for an exact 
SpanTermQuery match, so this produces a slop of 1.
If instead it were spans.end() - spans.start() - 1, it would produce a slop of 
0, yielding a sloppyFreq of 1 for each match, and would equate exactly with 
TermQuery.
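
To put numbers on it, here is a small standalone sketch (not the actual SpanScorer code). It assumes the DefaultSimilarity formulas sloppyFreq(distance) = 1 / (distance + 1) and tf(freq) = sqrt(freq), and it models each single-term match as a {start, end} pair with end = start + 1; SpanSlopDemo and its "adjust" parameter are hypothetical names used only here:

```java
class SpanSlopDemo {
    // Assumed DefaultSimilarity behaviour: 1 / (distance + 1).
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // Assumed DefaultSimilarity behaviour: sqrt(freq).
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Mirrors the loop quoted above; "adjust" is 0 for the current code, 1 for the proposed fix.
    static float spanTf(int[][] spans, int adjust) {
        float freq = 0f;
        for (int[] span : spans) {
            int matchLength = span[1] - span[0] - adjust; // end - start (- 1 with the fix)
            freq += sloppyFreq(matchLength);
        }
        return tf(freq);
    }

    public static void main(String[] args) {
        // Two matches of a single term, at positions 3 and 10: end is start + 1.
        int[][] spans = { {3, 4}, {10, 11} };
        System.out.println("current  span tf: " + spanTf(spans, 0)); // freq = 0.5 + 0.5
        System.out.println("proposed span tf: " + spanTf(spans, 1)); // freq = 1 + 1
        System.out.println("TermQuery tf:     " + tf(2f));           // two occurrences
    }
}
```

Under these assumptions the current calculation gives each exact match a sloppyFreq of 0.5, so two occurrences score like tf(1), while the adjusted version gives tf(2), the same as TermQuery would for two occurrences.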


> MultiPhraseQuery sums its own idf instead of Similarity.
> --------------------------------------------------------
>
>                 Key: LUCENE-2879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2879
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Query/Scoring
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 2.9.5, 3.0.4, 3.1, 4.0
>
>         Attachments: LUCENE-2879.patch
>
>
> MultiPhraseQuery is a generalized version of PhraseQuery, and computes IDF 
> the same way by default (by summing across the terms).
> The problem is it doesn't let the Similarity do this: PhraseQuery calls 
> Similarity.idfExplain(Collection<Term> terms, IndexSearcher searcher),
> but MultiPhraseQuery just sums itself, calling Similarity.idf(int, int) for 
> each term.


