[ https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898859#comment-16898859 ]
Christoph Goller edited comment on LUCENE-8943 at 8/2/19 12:39 PM:
-------------------------------------------------------------------

Why is this an issue? Because the IDFs of SpanOrQuery and MultiPhraseQuery can become gigantic, meaning that such queries have an unexpectedly high impact on the final score.

was (Author: gol...@detego-software.de):
    Why is this an issue? Because IDFs of SpanOrQuery and MultiPhraseQuery can get gigantic meaning that such queries get an unexpectedly high impact on the final score.

> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -------------------------------------------------
>
>                 Key: LUCENE-8943
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8943
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring
>    Affects Versions: 8.0
>            Reporter: Christoph Goller
>            Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery, where it makes sense. If we assume
> that for the phrase "New York" the occurrences of both words are independent,
> we can multiply their probabilities, and since IDFs are logarithmic we add them
> up. This seems to be a reasonable approximation. However, this method is also
> used to add up the IDFs of all terms in a MultiPhraseQuery, as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual
> positions. IDFs of alternative terms for one position should not be added up.
> Instead we should use the minimum value as an approximation, because this
> corresponds to the docFreq of the most frequent term, and we know that this is
> a lower bound for the docFreq for this position.
> In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float boost)
> A SynonymQuery is also a kind of OR-query, and it uses the maximum of the
> docFreq of all its alternative terms. I think this is how it should be.
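
To make the size of the effect concrete, here is a minimal, self-contained sketch that compares the two ways of combining IDFs for the alternatives at a single position. It does not use any Lucene classes. The class name, method names, and the document frequencies are made up for illustration, and the IDF formula is assumed to be the BM25-style log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).

```java
public class IdfCombinationSketch {

    // BM25-style IDF (assumed formula, see the note above).
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Behaviour described in the issue: add up the IDFs of all terms.
    static double sumOfIdfs(long[] docFreqs, long docCount) {
        double sum = 0.0;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, docCount);
        }
        return sum;
    }

    // Proposed approximation for alternatives at one position: the minimum IDF.
    // IDF decreases as docFreq grows, so the minimum IDF is the IDF of the
    // most frequent alternative (the maximum docFreq).
    static double minIdf(long[] docFreqs, long docCount) {
        long maxDocFreq = 0;
        for (long docFreq : docFreqs) {
            maxDocFreq = Math.max(maxDocFreq, docFreq);
        }
        return idf(maxDocFreq, docCount);
    }

    public static void main(String[] args) {
        long docCount = 1_000_000L;                          // hypothetical index size
        long[] alternatives = {50_000L, 1_000L, 10L, 5L, 2L}; // hypothetical docFreqs

        System.out.printf("sum of IDFs: %.3f%n", sumOfIdfs(alternatives, docCount));
        System.out.printf("minimum IDF: %.3f%n", minIdf(alternatives, docCount));
    }
}
```

With the summed variant, every additional rare alternative increases the combined IDF, although adding alternatives can only make the position match more documents. With the minimum, the combined value never exceeds the IDF of the most frequent alternative.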