The scoring algorithm can't be expected to deal with totally bogus
(e.g. mathematically impossible) statistics, such as docFreq >
docCount. Many of them may fall apart. We should try to improve this
in BlendedTermQuery!
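
To see why docFreq > docCount is a problem for BM25 in particular, here is a short derivation. It assumes the IDF form that Lucene's BM25Similarity is generally described as using (natural log, with 1 added inside the log), writing N for docCount and n for docFreq. Treat it as a sketch under that assumption, not as a statement about every Lucene version.

```latex
\begin{align*}
\mathrm{idf}(n, N) &= \ln\!\left(1 + \frac{N - n + 0.5}{n + 0.5}\right)
                    = \ln\!\left(\frac{N + 1}{n + 0.5}\right) \\
\mathrm{idf}(n, N) < 0 &\iff N + 1 < n + 0.5 \iff n > N + 0.5
\end{align*}
```

For integer counts this means the IDF stays positive whenever n <= N and turns negative as soon as docFreq exceeds docCount. With the made-up values N = 100 and n = 101, idf = ln(101/101.5), roughly -0.005.
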
SynonymQuery should not really exist. It exists because of problems
like
Thanks, Doug, for letting us know that Lucene's BM25 avoids negative IDF values.
I didn't know that.
Markus, out of curiosity, why do you need BlendedTermQuery?
I know SynonymQuery is now part of the query parser base; I think they do
similar things?
Ahmet
On Tuesday, April 19, 2016 5:33 PM,
Lucene's BM25 avoids negative scores for this by adding 1 inside the log
term of BM25's IDF.
Compare this:
https://github.com/apache/lucene-solr/blob/5e5fd662575105de88d8514b426bccdcb4c76948/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L71
to the Wikipedia
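
The comparison above can be sketched numerically. The snippet below is an illustration only: `classic_idf` follows the textbook BM25 IDF, `lucene_style_idf` adds 1 inside the log as described in this message, and both function names as well as the sample counts are hypothetical, not taken from Lucene.

```python
import math


def classic_idf(doc_freq, doc_count):
    """Textbook BM25 IDF: goes negative once a term is in more than half the docs."""
    return math.log((doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def lucene_style_idf(doc_freq, doc_count):
    """BM25 IDF with 1 added inside the log (assumed to mirror Lucene's variant)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical index of 100 documents; doc_freq never exceeds doc_count here.
doc_count = 100
for doc_freq in (1, 50, 90):
    print(
        doc_freq,
        round(classic_idf(doc_freq, doc_count), 4),
        round(lucene_style_idf(doc_freq, doc_count), 4),
    )
```

For a term that occurs in 90 of 100 documents, the textbook form is negative while the variant with the added 1 stays positive. This only holds while docFreq <= docCount.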
Hello Ahmet,
Before the unit test with the BlendedTermQuery, I am also doing a sanity check
using a simple Boolean query via LuceneQParser. The query is analogous to the
BlendedTermQuery (text_nl:rare text_nl:term) (text:rare text:term) and does
not produce negative scores because the docFreq
Hi Again,
For those who are interested, I uploaded BM25's Term Frequency graph [0] for
some common and content-bearing words.
[0] http://2.1m.yt/PgUEcZ.png
Ahmet
On Tuesday, April 19, 2016 5:16 PM, Ahmet Arslan
wrote:
Hi Markus,
It is a known property of
Hi Markus,
It is a known property of BM25. It produces negative scores for common terms.
Most of the term-weighting models are developed for indices in which stop words
are eliminated.
Therefore, most of the term-weighting models have problems scoring common terms.
By the way, DFI model does a
Hello,
I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using BM25
similarity, and I have a very simple unit test to see if anything is working at
all. But to my surprise, one of the results has a negative score, caused by a
negative IDF because docFreq is higher than
Hi,
> The main use case is searching in file names. For example, lucene.txt,
> lucene_new.txt, lucene_1_new.txt. If I use 'lucene', I need to get all 3
> files. With 'new', I need to get the last two files. Please note that the
> Standard analyzer/tokenizer of Lucene 3.6 is not giving us the results with
>
Hi Alan,
thank you, a jira ticket is opened.
Cheers,
Eva
On 18.04.2016 19:01, Alan Woodward wrote:
> Hi Eva,
>
> This looks like a bug in WeightedSpanTermExtractor, which is rewriting your
> PhraseQuery into a SpanNearQuery without checking how many terms there are.
> Could you open a JIRA