Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread Doug Turnbull
It is challenging as the performance of different use cases and domains will be very dependent on the use case (there's no one globally perfect relevance solution). But a good set of metrics to see *generally* how stock Solr performs across a reasonable set of verticals would be nice. My

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread alessandro.benedetti
Thanks Yonik and thanks Doug. I agree with Doug on adding a few generic test corpora that Jenkins automatically runs some metrics on, to check that Apache Lucene/Solr changes don't affect a golden truth too much. This of course can be very complex, but I think it is a direction the Apache Lucene/Solr

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread Doug Turnbull
Just a piece of feedback from clients on the original docCount change. I have seen several cases with clients where the switch to docCount surprised and harmed relevance. More broadly, I’m concerned that when we make these changes there’s not a testing process against test corpora with judgments

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread Yonik Seeley
On Tue, Dec 5, 2017 at 5:15 AM, alessandro.benedetti wrote:
> "Lucene/Solr doesn't actually delete documents when you delete them, it
> just marks them as deleted. I'm pretty sure that the difference between
> docCount and maxDoc is deleted documents. Maybe I don't

Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread alessandro.benedetti
"Lucene/Solr doesn't actually delete documents when you delete them, it just marks them as deleted. I'm pretty sure that the difference between docCount and maxDoc is deleted documents. Maybe I don't understand what I'm talking about, but that is the best I can come up with. " Thanks Shawn,

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread Yonik Seeley
On Mon, Dec 4, 2017 at 1:35 PM, Shawn Heisey wrote:
> I'm pretty sure that the difference between docCount and maxDoc is deleted
> documents.
docCount (not the best name) here is the number of documents with the field being searched. docFreq (df) is the number of documents
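To make the distinction concrete, here is a minimal sketch of a BM25-style idf, log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), evaluated once with an index-wide count and once with a per-field count. The class name `IdfSketch`, the field name text_fr, and all the numbers are invented for illustration (assume a 1,000,000-document index in which only 10,000 documents have that field); this is a standalone illustration, not the Lucene source.

```java
public final class IdfSketch {

    /** BM25-style idf: log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). */
    public static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long maxDoc = 1_000_000L;     // hypothetical: every document in the index
        long fieldDocCount = 10_000L; // hypothetical: documents that have the field text_fr
        long docFreq = 5_000L;        // hypothetical: text_fr documents containing the term

        // Index-wide denominator: the term looks rare, so idf comes out large.
        System.out.println("idf with maxDoc   = " + idf(docFreq, maxDoc));

        // Per-field denominator: the term is in half of the field's documents.
        System.out.println("idf with docCount = " + idf(docFreq, fieldDocCount));
    }
}
```

With these numbers the term occurs in half of the documents that have the field, so the per-field idf is log(2), about 0.69, while the index-wide count gives a much larger value for the same term. That gap is the kind of skew a small language field can pick up in a mixed index.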

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread Shawn Heisey
On 12/4/2017 7:21 AM, alessandro.benedetti wrote:
> the reason docCount was improving things is because it was using a docCount relative to a specific field while maxDoc is global all over the index ?
Lucene/Solr doesn't actually delete documents when you delete them, it just marks them as

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread alessandro.benedetti
Furthermore, taking a look at the code for BM25 similarity, it seems to me it is currently working right:
- docCount is used per field if != -1
/**
 * Computes a score factor for a simple term and returns an explanation
 * for that score factor.
 *
 * The default implementation

Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread alessandro.benedetti
Hi Markus, just out of interest, why did "It was solved back then by using docCount instead of maxDoc when calculating idf, it worked really well!" solve the problem? I assume you are using different fields, one per language. Each field appears in a different number of docs, I guess. e.g.

Re: Skewed IDF in multi lingual index, again

2017-11-30 Thread Walter Underwood
on top. Except for very relevant documents in foreign languages,
> hence the deboost is not too low.
>
> Thanks,
> Markus
>
> -Original message-
>> From:Walter Underwood <wun...@wunderwood.org>
>> Sent: Thursday 30th November 2017 17:29
>>

RE: Skewed IDF in multi lingual index, again

2017-11-30 Thread Markus Jelsma
, hence the deboost is not too low.
Thanks,
Markus
-Original message-
> From:Walter Underwood <wun...@wunderwood.org>
> Sent: Thursday 30th November 2017 17:29
> To: solr-user@lucene.apache.org
> Subject: Re: Skewed IDF in multi lingual index, again
>
> I’ve occas

Re: Skewed IDF in multi lingual index, again

2017-11-30 Thread Walter Underwood
I’ve occasionally considered using Unicode language tags (U+E001 and friends) on each term. That would make a term specific to a language, so we would get [en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a pretty big hammer, because it restricts matches to the same language. If
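A rough sketch of what language-specific terms would do to the statistics, using the bracketed prefix notation from the message above in place of real Unicode tag characters. `LanguageTaggedTerms`, `tag`, and `docFreq` are hypothetical names, and the three tiny documents are invented; in a real index this would presumably happen inside the analysis chain and not in application code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public final class LanguageTaggedTerms {

    /** Prefixes a term with its language, e.g. tag("en", "LaserJet") gives "[en]LaserJet". */
    public static String tag(String language, String term) {
        return "[" + language + "]" + term;
    }

    /**
     * Counts, per tagged term, how many documents contain it.
     * Each document is an array of the form {language, term, term, ...}.
     */
    public static Map<String, Integer> docFreq(List<String[]> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String[] doc : docs) {
            // Collect distinct tagged terms so a term counts once per document.
            Set<String> seen = new HashSet<>();
            for (int i = 1; i < doc.length; i++) {
                seen.add(tag(doc[0], doc[i]));
            }
            for (String taggedTerm : seen) {
                df.merge(taggedTerm, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        // Invented sample documents: two English, one French.
        List<String[]> docs = Arrays.asList(
            new String[] {"en", "LaserJet", "printer"},
            new String[] {"en", "LaserJet"},
            new String[] {"fr", "LaserJet", "imprimante"});
        System.out.println(docFreq(docs));
    }
}
```

Each language ends up with its own document frequency for LaserJet, which is the point of the idea. The cost, as noted above, is that a query term tagged [en] no longer matches the [fr] or [de] form.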

Skewed IDF in multi lingual index, again

2017-11-30 Thread Markus Jelsma
Hello, We already discussed this problem five years ago [1]. In short: documents in foreign languages are scored higher for some terms. It was solved back then by using docCount instead of maxDoc when calculating idf, and it worked really well! But, probably due to index changes, the problem is