Hi Maria,
this is actually a great catch!
I have worked a lot on More Like This and this mistake never caught my
attention.

I agree with you, feel free to open a Jira Issue.

First of all, what you say makes sense.
Secondly, it is the standard approach used in the Lucene similarity
calculations, which take the document count from the field statistics:








    public Explanation idfExplain(CollectionStatistics collectionStats,
                                  TermStatistics termStats) {
      final long df = termStats.docFreq();
      final long docCount = collectionStats.docCount();
      final float idf = idf(df, docCount);
      return Explanation.match(idf,
          "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:",
          Explanation.match(df, "docFreq, number of documents containing term"),
          Explanation.match(docCount, "docCount, total number of documents with field"));
    }


Indeed, int numDocs = ir.numDocs(); should actually be assigned per term
inside the for loop, using the field stats, something like:

    numDocs = ir.getDocCount(fieldName);
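
To make the effect concrete, here is a minimal, self-contained sketch (not the actual MoreLikeThis code). It applies the idf formula from the explanation string above and assumes an interesting term is ranked by tf * idf. The class name IdfSketch, the score helper, and all the numbers are made up for illustration.

```java
public class IdfSketch {

    // Same formula as in the explanation string quoted above:
    // log((docCount+1)/(docFreq+1)) + 1
    public static float idf(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    // Assumption for this sketch: a term is ranked by tf * idf.
    public static float score(int tf, long docFreq, long docCount) {
        return tf * idf(docFreq, docCount);
    }

    public static void main(String[] args) {
        long numDocs = 1000;       // hypothetical: all documents in the index
        long fieldDocCount = 100;  // hypothetical: documents with "description"

        for (long n : new long[] {numDocs, fieldDocCount}) {
            float common = score(3, 90, n); // tf=3, appears in 90 documents
            float rare = score(1, 1, n);    // tf=1, appears in 1 document
            System.out.printf("docCount=%d common=%.3f rare=%.3f%n", n, common, rare);
        }
    }
}
```

With these made-up numbers the frequent term outranks the rare one when the index-wide count is used, and the order flips once the per-field count is used. A larger docCount adds the same constant to every idf, so term frequency dominates the ranking, which matches what Maria observed.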

Feel free to open the Jira issue and attach a patch with at least one
test case that demonstrates the bug fix.

I will be available for doing the review.


Cheers

--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Tue, Jan 29, 2019 at 11:41 AM Matt Pearce <m...@flax.co.uk> wrote:

> Hi Maria,
>
> Would it help to add a filter to your query to restrict the results to
> just those where the description field is populated? Eg. add
>
> fq=description:[* TO *]
>
> to your query parameters.
>
> Apologies if I'm misunderstanding the problem!
>
> Best,
>
> Matt
>
>
> On 28/01/2019 16:29, Maria Mestre wrote:
> > Hi all,
> >
> > First of all, I’m not a Java developer, and a SolR newbie. I have worked
> with Elasticsearch for some years (not contributing, just as a user), so I
> think I have the basics of text search engines covered. I am always
> learning new things though!
> >
> > I created an index in SolR and used more-like-this on it, by passing a
> document_id. My data has a special feature, which is that one of the fields
> is called “description” but is only populated about 10% of the time. Most
> of the time it is empty. I am using that field to query similar documents.
> >
> > So I query the /mlt endpoint using these parameters (for example):
> >
> > {q=id:"0c7c4d74-0f37-44ea-8933-cd2ee7964457”,
> > mlt=true,
> > mlt.fl=description,
> > mlt.mindf=1,
> > mlt.mintf=1,
> > mlt.maxqt=5,
> > wt=json,
> > mlt.interestingTerms=details}
> >
> > The issue I have is that when retrieving the key scored terms
> (interestingTerms), the code uses the total number of documents in the
> index, not the total number of documents with populated “description”
> field. This is where it’s done in the code:
> https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651
> >
> > The effect of this choice is that the “idf” does not vary much, given
> that numDocs >> number of documents with “description”, so the key terms
> end up being just the terms with the highest term frequencies.
> >
> > It is inconsistent because the MLT-search then uses these extracted key
> terms and scores all documents using an idf which is computed only on the
> subset of documents with “description”. So one part of the MLT uses a
> different numDocs than another part. This sounds like an odd choice, and
> not expected at all, and I wonder if I’m missing something.
> >
> > Best,
> > Maria
> >
> >
> >
> >
> >
> >
>
> --
> Matt Pearce
> Flax - Open Source Enterprise Search
> www.flax.co.uk
>
