Re: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Robert Muir
The scoring algorithm can't be expected to deal with totally bogus (e.g. mathematically impossible) statistics, such as docFreq > docCount. Many of them may fall apart. We should try to improve that about BlendedTermQuery! SynonymQuery should not really exist. It exists because of problems like

Re: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Ahmet Arslan
Thanks, Doug, for letting us know that Lucene's BM25 avoids negative IDF values. I didn't know that. Markus, out of curiosity, why do you need BlendedTermQuery? I know SynonymQuery is now part of the query parser base; I think they do similar things? Ahmet On Tuesday, April 19, 2016 5:33 PM,

Re: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Doug Turnbull
Lucene's BM25 avoids negative scores for this by adding 1 inside the log term of BM25's IDF. Compare this: https://github.com/apache/lucene-solr/blob/5e5fd662575105de88d8514b426bccdcb4c76948/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L71 to the Wikipedia
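
A minimal sketch of the difference discussed in this thread, written in Python rather than taken from Lucene's source. The function names idf_classic and idf_plus_one are hypothetical; the first follows the textbook BM25 IDF, the second adds 1 inside the log as described above. The sample statistics are made up for illustration.

```python
import math

def idf_classic(doc_freq, doc_count):
    # Textbook BM25 IDF: goes negative once a term occurs in more
    # than half of the documents. Only defined for doc_freq <= doc_count.
    return math.log((doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def idf_plus_one(doc_freq, doc_count):
    # Variant with 1 added inside the log (the approach described above).
    # Algebraically this is log((doc_count + 1) / (doc_freq + 0.5)).
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# A common term: 8 of 10 documents contain it.
common_classic = idf_classic(8, 10)    # negative
common_plus_one = idf_plus_one(8, 10)  # positive

# Impossible statistics (docFreq > docCount), as in the original report.
bogus_plus_one = idf_plus_one(12, 10)  # negative again
```

With valid statistics the +1 variant stays positive, but it still drops below zero once docFreq exceeds docCount by more than 0.5, which would be consistent with the negative IDF reported at the start of the thread.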

RE: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Markus Jelsma
Hello Ahmet, Before the unit test with the BlendedTermQuery I am also doing a sanity check using a simple Boolean query via LuceneQParser. The query is analogous to the BlendedTermQuery (text_nl:rare text_nl:term) (text:rare text:term) and does not produce negative scores because the docFreq

Re: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Ahmet Arslan
Hi Again, For those who are interested, I uploaded BM25's Term Frequency graph [0] for some common and content-bearing words. [0] http://2.1m.yt/PgUEcZ.png Ahmet On Tuesday, April 19, 2016 5:16 PM, Ahmet Arslan wrote: Hi Markus, It is a known property of

Re: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Ahmet Arslan
Hi Markus, It is a known property of BM25. It produces negative scores for common terms. Most term-weighting models were developed for indices from which stop words have been removed. Therefore, most of them have problems scoring common terms. By the way, the DFI model does a

BlendedTermQuery causing negative IDF?

2016-04-19 Thread Markus Jelsma
Hello, I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using BM25 similarity and I have a very simple unit test to see if something is working at all. But to my surprise, one of the results has a negative score, caused by a negative IDF because docFreq is higher than

RE: Custom indexing

2016-04-19 Thread Uwe Schindler
Hi, > The main use case is searching in file names. For example, lucene.txt, > lucene_new.txt, lucene_1_new.txt. If I use 'lucene', I need to get all 3 > files. with 'new' I need to get last two files. Please note that Standard > analyzer/tokenizer of lucene 3.6 is not giving us the results with >
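
To illustrate the use case quoted above, here is a small sketch in plain Python (not a Lucene analyzer). It assumes file names are split on every non-alphanumeric character; filename_tokens and search are hypothetical helper names.

```python
import re

def filename_tokens(name):
    # Lowercase, then split on any run of non-alphanumeric characters,
    # so "lucene_1_new.txt" becomes ["lucene", "1", "new", "txt"].
    return [t for t in re.split(r"[^0-9A-Za-z]+", name.lower()) if t]

def search(term, names):
    # Return the names that contain the term as a whole token.
    term = term.lower()
    return [n for n in names if term in filename_tokens(n)]

files = ["lucene.txt", "lucene_new.txt", "lucene_1_new.txt"]
hits_lucene = search("lucene", files)  # all three files
hits_new = search("new", files)        # the last two files
```

This only shows the token boundaries the use case needs; reproducing them in Lucene would require an analyzer that splits on the same characters.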

Re: Problem with NGramAnalyzer, PhraseQuery and Highlighter

2016-04-19 Thread Eva Popenda
Hi Alan, thank you, a JIRA ticket has been opened. Cheers, Eva On 18.04.2016 19:01, Alan Woodward wrote: > Hi Eva, > > This looks like a bug in WeightedSpanTermExtractor, which is rewriting your > PhraseQuery into a SpanNearQuery without checking how many terms there are. > Could you open a JIRA