RE: BlendedTermQuery causing negative IDF?

Markus Jelsma Tue, 19 Apr 2016 07:30:25 -0700

Hello Ahmet,

Before the unit test with the BlendedTermQuery, I also run a sanity check 
using a simple Boolean query via LuceneQParser. That query is analogous to the 
BlendedTermQuery, (text_nl:rare text_nl:term) (text:rare text:term), and does 
not produce negative scores because the docFreq doesn't exceed docCount.
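
For reference, the idf values in the explain output can be reproduced with a 
small sketch. It assumes the BM25 idf is computed as 
ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), which matches the numbers 
in the debug output. The class and method names are made up for illustration 
and are not Lucene API.

```java
public class Bm25IdfCheck {

    // Assumed idf formula: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).
    // It only goes negative when docFreq is larger than docCount.
    public static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // text_nl: docFreq=2, docCount=2, roughly 0.1823 as in the explain output
        System.out.println("text_nl idf = " + idf(2, 2));
        // text: docFreq=2, docCount=1, roughly -0.2231 as in the explain output
        System.out.println("text idf    = " + idf(2, 1));
    }
}
```

With docFreq=2 and docCount=1 the argument of the logarithm drops to 0.8, so 
the idf is negative. Whenever docFreq stays at or below docCount the argument 
is above 1 and the idf is positive.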


I'd like to try DFISimilarity and ClassicSimilarity as well, but for some 
reason the unit tests do not accept the similarity defined in the test's 
schema.xml.

Thanks!
Markus

 
 
-----Original message-----
> From:Ahmet Arslan <iori...@yahoo.com.INVALID>
> Sent: Tuesday 19th April 2016 16:17
> To: java-user@lucene.apache.org
> Subject: Re: BlendedTermQuery causing negative IDF?
> 
> 
> 
> Hi Markus,
> 
> It is a known property of BM25. It produces negative scores for common terms.
> Most of the term-weighting models are developed for indices in which stop 
> words are eliminated.
> Therefore, most of the term-weighting models have problems scoring common 
> terms.
> By the way, the DFI model does a decent job of handling common terms.
> 
> Ahmet
> 
> 
> 
> On Tuesday, April 19, 2016 4:48 PM, Markus Jelsma 
> <markus.jel...@openindex.io> wrote:
> Hello,
> 
> I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using BM25 
> similarity, and I have a very simple unit test to see if anything is working 
> at all. To my surprise, one of the results has a negative score, caused 
> by a negative IDF because docFreq is higher than docCount for that term on 
> that field. Here are the test documents:
> 
>     assertU(adoc("id", "1", "text", "rare term"));
>     assertU(adoc("id", "2", "text_nl", "less rare term"));
>     assertU(adoc("id", "3", "text_nl", "rarest term"));
>     assertU(commit());
> 
> My query parser creates the following Lucene query: 
> BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term)) 
> which looks fine to me. But this is what I am getting back when issuing that 
> query on the above set of documents. The third result (id 1) is the one with 
> a negative score.
> 
> <result name="response" numFound="3" start="0" maxScore="0.1805489">
>   <doc>
>     <str name="id">3</str>
>     <float name="score">0.1805489</float></doc>
>   <doc>
>     <str name="id">2</str>
>     <float name="score">0.14785346</float></doc>
>   <doc>
>     <str name="id">1</str>
>     <float name="score">-0.004004207</float></doc>
> </result>
> <lst name="debug">
>   <str name="rawquerystring">{!blended fl=text,text_nl}rare term</str>
>   <str name="querystring">{!blended fl=text,text_nl}rare term</str>
>   <str name="parsedquery">BlendedTermQuery(Blended(text:rare text:term 
> text_nl:rare text_nl:term))</str>
>   <str name="parsedquery_toString">Blended(text:rare text:term text_nl:rare 
> text_nl:term)</str>
>   <lst name="explain">
>     <str name="3">
> 0.1805489 = max plus 0.01 times others of:
>   0.1805489 = weight(text_nl:term in 2) [], result of:
>     0.1805489 = score(doc=2,freq=1.0 = termFreq=1.0
> ), product of:
>       0.18232156 = idf(docFreq=2, docCount=2)
>       0.9902773 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.5 = avgFieldLength
>         2.56 = fieldLength
> </str>
>     <str name="2">
> 0.14785345 = max plus 0.01 times others of:
>   0.14638956 = weight(text_nl:rare in 1) [], result of:
>     0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
> ), product of:
>       0.18232156 = idf(docFreq=2, docCount=2)
>       0.8029196 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.5 = avgFieldLength
>         4.0 = fieldLength
>   0.14638956 = weight(text_nl:term in 1) [], result of:
>     0.14638956 = score(doc=1,freq=1.0 = termFreq=1.0
> ), product of:
>       0.18232156 = idf(docFreq=2, docCount=2)
>       0.8029196 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.5 = avgFieldLength
>         4.0 = fieldLength
> </str>
>     <str name="1">
> -0.004004207 = max plus 0.01 times others of:
>   -0.20021036 = weight(text:rare in 0) [], result of:
>     -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
> ), product of:
>       -0.22314355 = idf(docFreq=2, docCount=1)
>       0.89722675 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.0 = avgFieldLength
>         2.56 = fieldLength
>   -0.20021036 = weight(text:term in 0) [], result of:
>     -0.20021036 = score(doc=0,freq=1.0 = termFreq=1.0
> ), product of:
>       -0.22314355 = idf(docFreq=2, docCount=1)
>       0.89722675 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         2.0 = avgFieldLength
>         2.56 = fieldLength
> </str>
> 
> What am i doing wrong? Or did i catch a bug?
> 
> Thanks,
> Markus
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 

