[
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212813#comment-16212813
]
Robert Muir commented on LUCENE-8000:
-------------------------------------
{quote}
What benchmarks have you used for measuring performance?
{quote}
I use TREC-like IR collections in different languages. The Lucene benchmark
module has some support for running the queries and creating output that you
can send to trec_eval. I just use its query-running support (QueryDriver); I
don't use its indexing/parsing support, although it has that too. Instead I
index the test collections myself, because the
collections/queries/judgements are always annoyingly in a slightly different,
non-standard format. I only look at measures which are generally the most
stable, like MAP and bpref.
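Roughly, the query-running path looks like this (a minimal sketch; the paths, field
names ("title", "body", "docname") and run tag are placeholders, and the exact
benchmark-module signatures may differ slightly between versions):
{code}
import java.io.BufferedReader;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.benchmark.quality.Judge;
import org.apache.lucene.benchmark.quality.QualityBenchmark;
import org.apache.lucene.benchmark.quality.QualityQuery;
import org.apache.lucene.benchmark.quality.QualityStats;
import org.apache.lucene.benchmark.quality.trec.TrecJudge;
import org.apache.lucene.benchmark.quality.trec.TrecTopicsReader;
import org.apache.lucene.benchmark.quality.utils.SimpleQQParser;
import org.apache.lucene.benchmark.quality.utils.SubmissionReport;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class RunTrecQueries {
  public static void main(String[] args) throws Exception {
    // Index built separately from the test collection (path is a placeholder).
    IndexSearcher searcher = new IndexSearcher(
        DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index"))));

    // TREC topics and relevance judgements, converted beforehand into the standard format.
    QualityQuery[] topics;
    try (BufferedReader r = Files.newBufferedReader(Paths.get("topics.txt"), StandardCharsets.UTF_8)) {
      topics = new TrecTopicsReader().readQueries(r);
    }
    Judge judge;
    try (BufferedReader r = Files.newBufferedReader(Paths.get("qrels.txt"), StandardCharsets.UTF_8)) {
      judge = new TrecJudge(r);
    }

    // Run the topic titles against the "body" field and write a submission
    // file that trec_eval can consume together with qrels.txt.
    try (PrintWriter submission = new PrintWriter(
        Files.newBufferedWriter(Paths.get("submission.txt"), StandardCharsets.UTF_8))) {
      QualityBenchmark bench =
          new QualityBenchmark(topics, new SimpleQQParser("title", "body"), searcher, "docname");
      QualityStats[] stats =
          bench.execute(judge, new SubmissionReport(submission, "lucene"), new PrintWriter(System.out, true));
      QualityStats.average(stats).log("SUMMARY", 2, new PrintWriter(System.out, true), "  ");
    }
  }
}
{code}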
{quote}
Is your opinion based on tests with Lucene Classic Similarity (it also uses
discountOverlaps = true), or also on tests with BM25?
{quote}
I can't remember which scoring systems I tested at the time we flipped the
default, but I think we should keep the same default for all scoring functions.
It is fairly easy, once you have everything set up, to test with a ton of
similarities at once (or different parameters) by modifying the code to loop
across a big list. That's one reason why it's valuable to keep any
index-time logic consistent across all of them (such as the formula for encoding
the norm): otherwise it makes testing unnecessarily difficult, and it's already
painful enough. This is important for real users too; they shouldn't have to
reindex to do parameter tuning.
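A minimal sketch of that kind of loop (the similarity choices and parameter values
below are just hypothetical examples; the actual query run and evaluation are elided):
{code}
import java.nio.file.Paths;
import java.util.List;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.search.similarities.LMDirichletSimilarity;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.store.FSDirectory;

public class SimilaritySweep {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
      // One index, many scoring functions: because the index-time norm encoding
      // is the same for all of them, only the search-time similarity changes.
      List<Similarity> candidates = List.of(
          new BM25Similarity(1.2f, 0.75f),
          new BM25Similarity(0.9f, 0.4f),    // parameter variant, no reindexing needed
          new ClassicSimilarity(),
          new LMDirichletSimilarity());
      for (Similarity sim : candidates) {
        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.setSimilarity(sim);
        // ... run the benchmark queries with this searcher and collect MAP / bpref ...
      }
    }
  }
}
{code}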
> Document Length Normalization in BM25Similarity correct?
> --------------------------------------------------------
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Christoph Goller
> Priority: Minor
>
> The length of an individual document only counts its number of positions,
> since discountOverlaps defaults to true.
> {code}
> @Override
> public final long computeNorm(FieldInvertState state) {
>   final int numTerms = discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
>   int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
>   if (indexCreatedVersionMajor >= 7) {
>     return SmallFloat.intToByte4(numTerms);
>   } else {
>     return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
>   }
> }
> {code}
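> For illustration, with hypothetical token counts (a field where one synonym token
> is stacked on an existing position, i.e. indexed with position increment 0):
> {code}
> // Hypothetical field "fast quick car", where "quick" is a stacked synonym of "fast".
> int length = 3;                      // state.getLength(): every indexed token, including the stacked one
> int numOverlap = 1;                  // state.getNumOverlap(): tokens with position increment 0
> int numTerms = length - numOverlap;  // 2 -> the document length that ends up in the norm
> {code}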
> Measuring document length this way seems perfectly fine to me. What bothers
> me is that the average document length is based on sumTotalTermFreq for the
> field. As far as I understand, that sums up the totalTermFreq of all terms of
> the field, and therefore counts all positions of terms, including those that
> overlap.
> {code}
> protected float avgFieldLength(CollectionStatistics collectionStats) {
>   final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
>   if (sumTotalTermFreq <= 0) {
>     return 1f; // field does not exist, or stat is unsupported
>   } else {
>     final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
>     return (float) (sumTotalTermFreq / (double) docCount);
>   }
> }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious
> effect. It just means that documents that have synonyms (or, in my use case,
> different normal forms of tokens on the same position) are counted as shorter
> and therefore get higher scores than they should, and that we do not use the
> whole spectrum of relative document length in BM25.
> I think for BM25, discountOverlaps should default to false.
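> A hypothetical worked example of the mismatch: if every document in a collection
> has 100 positions and each position carries one stacked synonym, the norm stores
> a length of 100 per document, while sumTotalTermFreq counts 200 tokens per
> document, so avgFieldLength becomes 200 and every document looks only half as
> long as average:
> {code}
> // Hypothetical collection: 1000 docs, 100 positions each, one stacked synonym per position.
> long sumTotalTermFreq = 1000L * 200;                            // overlapping tokens are counted
> double avgFieldLength = sumTotalTermFreq / (double) 1000;       // 200.0
> int docLen = 100;                                               // positions only (discountOverlaps = true)
> double k1 = 1.2, b = 0.75;
> double lengthNorm = k1 * (1 - b + b * docLen / avgFieldLength); // uses ratio 0.5 instead of 1.0
> {code}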
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)