Hi Markus,
It is a known property of BM25: it produces negative scores for common terms. Most term-weighting models were developed for indices from which stop words have been removed, so most of them have problems scoring common terms.

By the way, the DFI model does a decent job of handling common terms.

Ahmet

On Tuesday, April 19, 2016 4:48 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

Hello,

I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using BM25 similarity, and I have a very simple unit test to see whether it works at all. To my surprise, one of the results has a negative score, caused by a negative IDF because docFreq is higher than docCount for that term on that field.

Here are the test documents:

    assertU(adoc("id", "1", "text", "rare term"));
    assertU(adoc("id", "2", "text_nl", "less rare term"));
    assertU(adoc("id", "3", "text_nl", "rarest term"));
    assertU(commit());

My query parser creates the following Lucene query:

    BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term))

which looks fine to me. But this is what I get back when issuing that query against the above set of documents; the third document in the result list is the one with a negative score.

<result name="response" numFound="3" start="0" maxScore="0.1805489">
  <doc>
    <str name="id">3</str>
    <float name="score">0.1805489</float></doc>
  <doc>
    <str name="id">2</str>
    <float name="score">0.14785346</float></doc>
  <doc>
    <str name="id">1</str>
    <float name="score">-0.004004207</float></doc>
</result>
<lst name="debug">
  <str name="rawquerystring">{!blended fl=text,text_nl}rare term</str>
  <str name="querystring">{!blended fl=text,text_nl}rare term</str>
  <str name="parsedquery">BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term))</str>
  <str name="parsedquery_toString">Blended(text:rare text:term text_nl:rare text_nl:term)</str>
  <lst name="explain">
    <str name="3">
0.1805489 = max plus 0.01 times others of:
  0.1805489 = weight(text_nl:term in 2) [], result of:
    0.1805489 = score(doc=2,freq=1.0 = termFreq=1.0 ), product of:
      0.18232156 = idf(docFreq=2, docCount=2)
      0.9902773 = tfNorm, computed from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        2.5 = avgFieldLength
        2.56 = fieldLength
</str>
    <str name="2">
0.14785345 = max plus 0.01 times others of:
  0.14638956 = weight(text_nl:rare in 1) [], result of:
    0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0 ), product of:
      0.18232156 = idf(docFreq=2, docCount=2)
      0.8029196 = tfNorm, computed from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        2.5 = avgFieldLength
        4.0 = fieldLength
  0.14638956 = weight(text_nl:term in 1) [], result of:
    0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0 ), product of:
      0.18232156 = idf(docFreq=2, docCount=2)
      0.8029196 = tfNorm, computed from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        2.5 = avgFieldLength
        4.0 = fieldLength
</str>
    <str name="1">
-0.004004207 = max plus 0.01 times others of:
  -0.20021036 = weight(text:rare in 0) [], result of:
    -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
      -0.22314355 = idf(docFreq=2, docCount=1)
      0.89722675 = tfNorm, computed from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        2.0 = avgFieldLength
        2.56 = fieldLength
  -0.20021036 = weight(text:term in 0) [], result of:
    -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of:
      -0.22314355 = idf(docFreq=2, docCount=1)
      0.89722675 = tfNorm, computed from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        2.0 = avgFieldLength
        2.56 = fieldLength
</str>

What am I doing wrong? Or did I catch a bug?

Thanks,
Markus

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
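
For reference, the negative IDF in the explain output above can be reproduced by hand. What follows is only a minimal sketch: it assumes the idf and tfNorm formulas written in the comments, which do reproduce the numbers in the explain output, and the class and method names (Bm25IdfSketch, idf, tfNorm) are made up for illustration and are not Lucene API. With these formulas the IDF turns negative as soon as docFreq exceeds docCount, which is the case for the "text" field here (docFreq=2, docCount=1), presumably because the blended query shares term statistics across the two fields.

```java
// Hypothetical helper, not part of Lucene: recomputes the explain values by hand.
final class Bm25IdfSketch {

    // Assumed IDF formula: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).
    // The argument of the logarithm drops below 1 once docFreq > docCount,
    // so the result becomes negative.
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Assumed tfNorm formula, using the k1 and b parameters from the explain output.
    static double tfNorm(double freq, double k1, double b,
                         double avgFieldLength, double fieldLength) {
        double lengthNorm = 1.0 - b + b * fieldLength / avgFieldLength;
        return freq * (k1 + 1.0) / (freq + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        double b = 0.75;

        // text_nl:term in doc 2: docFreq=2, docCount=2 (IDF is about 0.1823)
        double consistentIdf = idf(2, 2);

        // text:rare in doc 0: docFreq=2, docCount=1 (IDF is about -0.2231)
        double inconsistentIdf = idf(2, 1);

        // Per-term score is idf * tfNorm, as in the "product of" lines.
        double negativeScore = inconsistentIdf * tfNorm(1.0, k1, b, 2.0, 2.56);

        System.out.printf("idf(2, 2) = %.8f%n", consistentIdf);
        System.out.printf("idf(2, 1) = %.8f%n", inconsistentIdf);
        System.out.printf("score     = %.8f%n", negativeScore);
    }
}
```

Under these assumptions, any docFreq that does not exceed docCount gives a non-negative IDF, so the negative score points at the mismatched statistics fed into the formula rather than at the term being common in the index.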