The reason I asked about Span scoring is that the behavior changed when I
switched from TermQuery to BoostingTermQuery to take advantage of payloads.

It seems to me that a SpanTermQuery and BoostingTermQuery should behave the
same as TermQuery with respect to term frequency. The 'edit distance' isn't
really relevant for these queries, is it?

For a SpanNearQuery that contains SpanTermQueries, the score for a match on
"the quick brown fox" would be lower than a match on "brown fox" because of
the edit distance (4 vs 2). This seems counterintuitive, too.
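
To make the numbers concrete, here is a minimal standalone sketch. It does not use Lucene itself: it copies the sloppyFreq formula quoted below and assumes that the span scorer adds sloppyFreq(matchLength) once per match in a document. The class name SloppyFreqDemo and the helper spanFreq are hypothetical, made up for illustration only.

```java
public class SloppyFreqDemo {

    // Same formula as the DefaultSimilarity.sloppyFreq quoted in this thread.
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // Assumption: the span scorer accumulates one sloppyFreq(matchLength)
    // contribution per match; this helper is hypothetical, not a Lucene API.
    static float spanFreq(int matches, int matchLength) {
        float freq = 0.0f;
        for (int i = 0; i < matches; i++) {
            freq += sloppyFreq(matchLength);
        }
        return freq;
    }

    public static void main(String[] args) {
        // One match each, using the distances from the example (4 vs 2).
        System.out.println("the quick brown fox: " + sloppyFreq(4));
        System.out.println("brown fox:           " + sloppyFreq(2));

        // A single term occurring three times, treated as spans of length 1,
        // versus the raw occurrence count a plain TermQuery would see.
        System.out.println("span freq, 3 matches: " + spanFreq(3, 1));
        System.out.println("raw count, 3 matches: " + 3);
    }
}
```

Under those assumptions the longer phrase contributes less per match than the shorter one, and a term that occurs three times ends up with half the frequency a plain occurrence count would give.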

Any comments?

Thanks,
Peter


On Tue, Mar 3, 2009 at 2:42 PM, Peter Keegan <peterlkee...@gmail.com> wrote:

> The DefaultSimilarity class defines sloppyFreq as:
>
> public float sloppyFreq(int distance) {
>   return 1.0f / (distance + 1);
> }
>
> For a 'SpanNearQuery', this reduces the effect of the term frequency on the
> score as the number of terms in the span increases. So, for a simple phrase
> query (using spans), the longer the phrase, the lower the TF. For a simple
> SpanTermQuery, the TF is cut in half (1.0f / (1 + 1)).
>
> I'm just wondering why this is the default behavior. For 'SpanTermQuery',
> I'd expect the TF to reflect the actual number of occurrences of the term.
> For a SpanNearQuery, wouldn't it still be the number of occurrences of the
> whole span, not the number of terms in the span?
>
> Thanks,
> Peter
>
