Hi Maria, this is actually a great catch! I have worked a lot on More Like This, and this mistake never caught my attention.
I agree with you, feel free to open a Jira issue.
First of all, what you say makes sense.
Secondly, it is the standard way used in the Lucene similarity calculations:

    public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
      final long df = termStats.docFreq();
      final long docCount = collectionStats.docCount();
      final float idf = idf(df, docCount);
      return Explanation.match(idf, "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:",
          Explanation.match(df, "docFreq, number of documents containing term"),
          Explanation.match(docCount, "docCount, total number of documents with field"));
    }

Indeed, the

    int numDocs = ir.numDocs();

should actually be set per term in the for loop, using the field stats, something like:

    numDocs = ir.getDocCount(fieldName)

Feel free to open the Jira issue and attach a patch with at least a test case that shows the bug fix.
I will be available for doing the review.

Cheers
--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io

On Tue, Jan 29, 2019 at 11:41 AM Matt Pearce <m...@flax.co.uk> wrote:

> Hi Maria,
>
> Would it help to add a filter to your query to restrict the results to
> just those where the description field is populated? E.g. add
>
> fq=description:[* TO *]
>
> to your query parameters.
>
> Apologies if I'm misunderstanding the problem!
>
> Best,
>
> Matt
>
> On 28/01/2019 16:29, Maria Mestre wrote:
> > Hi all,
> >
> > First of all, I’m not a Java developer, and I am a Solr newbie. I have
> > worked with Elasticsearch for some years (not contributing, just as a
> > user), so I think I have the basics of text search engines covered. I am
> > always learning new things though!
> >
> > I created an index in Solr and used more-like-this on it, by passing a
> > document_id. My data has a special feature, which is that one of the
> > fields is called “description” but is only populated about 10% of the
> > time. Most of the time it is empty. I am using that field to query
> > similar documents.
> >
> > So I query the /mlt endpoint using these parameters (for example):
> >
> > {q=id:"0c7c4d74-0f37-44ea-8933-cd2ee7964457",
> > mlt=true,
> > mlt.fl=description,
> > mlt.mindf=1,
> > mlt.mintf=1,
> > mlt.maxqt=5,
> > wt=json,
> > mlt.interestingTerms=details}
> >
> > The issue I have is that when retrieving the key scored terms
> > (interestingTerms), the code uses the total number of documents in the
> > index, not the total number of documents with a populated “description”
> > field. This is where it’s done in the code:
> > https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651
> >
> > The effect of this choice is that the “idf” does not vary much, given
> > that numDocs >> number of documents with “description”, so the key terms
> > end up being just the terms with the highest term frequencies.
> >
> > It is inconsistent because the MLT search then uses these extracted key
> > terms and scores all documents using an idf which is computed only on the
> > subset of documents with “description”. So one part of the MLT uses a
> > different numDocs than another part. This sounds like an odd choice, and
> > not expected at all, and I wonder if I’m missing something.
> >
> > Best,
> > Maria
>
> --
> Matt Pearce
> Flax - Open Source Enterprise Search
> www.flax.co.uk
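
As a rough illustration of the effect discussed in this thread, here is a minimal standalone Java sketch. It does not call Lucene: it only re-implements the formula quoted in the explanation string, log((docCount+1)/(docFreq+1)) + 1. The class name IdfSketch and all document counts and frequencies are made-up assumptions, chosen so that about 10% of the index has the field populated.

```java
public class IdfSketch {

    // Formula as quoted in the Explanation string in this thread:
    // log((docCount+1)/(docFreq+1)) + 1. This is a re-implementation for
    // illustration only, not a call into Lucene.
    public static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1.0) / (docFreq + 1.0)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a field populated in about 10% of documents.
        long numDocs = 1_000_000;       // all documents in the index
        long fieldDocCount = 100_000;   // documents that have the field
        long rareDf = 50;               // docFreq of a rare term
        long commonDf = 50_000;         // docFreq of a common term

        // How strongly idf favours the rare term over the common one,
        // first with the whole-index count, then with the per-field count.
        double ratioWithNumDocs = idf(rareDf, numDocs) / idf(commonDf, numDocs);
        double ratioWithFieldCount = idf(rareDf, fieldDocCount) / idf(commonDf, fieldDocCount);

        System.out.println("rare/common idf ratio using numDocs:       " + ratioWithNumDocs);
        System.out.println("rare/common idf ratio using field docCount: " + ratioWithFieldCount);
    }
}
```

With these assumed numbers, the ratio computed from the per-field count is the larger of the two. This matches the observation in the thread: when the whole-index numDocs is used, idf separates rare and common terms less, so term frequency dominates the choice of interesting terms.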