Hi Alessandro and Matt,

Thanks so much for your help!
@Alessandro: I will do so, thank you :-)

> On 29 Jan 2019, at 12:26, Alessandro Benedetti <a.benede...@sease.io> wrote:
>
> Hi Maria,
> this is actually a great catch!
> I have been working a lot on the More Like This and this mistake never
> caught my attention.
>
> I agree with you, feel free to open a Jira Issue.
>
> First of all, what you say makes sense.
> Secondly, it is the standard way used in the Lucene similarity
> calculations:
>
>     public Explanation idfExplain(CollectionStatistics collectionStats,
>                                   TermStatistics termStats) {
>       final long df = termStats.docFreq();
>       final long docCount = collectionStats.docCount();
>       final float idf = idf(df, docCount);
>       return Explanation.match(idf,
>           "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:",
>           Explanation.match(df, "docFreq, number of documents containing term"),
>           Explanation.match(docCount, "docCount, total number of documents with field"));
>     }
>
> Indeed, the int numDocs = ir.numDocs(); should actually be assigned
> per term in the for loop, using the field stats, something like:
>
>     numDocs = ir.getDocCount(fieldName)
>
> Feel free to open the Jira issue and attach a patch with at least a
> test case that shows the bugfix.
>
> I will be available for doing the review.
>
> Cheers
>
> --------------------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> http://www.sease.io
>
> On Tue, Jan 29, 2019 at 11:41 AM Matt Pearce <m...@flax.co.uk> wrote:
>
>> Hi Maria,
>>
>> Would it help to add a filter to your query to restrict the results to
>> just those where the description field is populated? E.g. add
>>
>>     fq=description:[* TO *]
>>
>> to your query parameters.
>>
>> Apologies if I'm misunderstanding the problem!
>>
>> Best,
>>
>> Matt
>>
>> On 28/01/2019 16:29, Maria Mestre wrote:
>>> Hi all,
>>>
>>> First of all, I’m not a Java developer, and I’m a Solr newbie. I have
>>> worked with Elasticsearch for some years (not contributing, just as a
>>> user), so I think I have the basics of text search engines covered. I
>>> am always learning new things though!
>>>
>>> I created an index in Solr and used more-like-this on it, by passing a
>>> document_id. My data has a special feature, which is that one of the
>>> fields is called “description” but is only populated about 10% of the
>>> time. Most of the time it is empty. I am using that field to query
>>> similar documents.
>>>
>>> So I query the /mlt endpoint using these parameters (for example):
>>>
>>> {q=id:"0c7c4d74-0f37-44ea-8933-cd2ee7964457",
>>> mlt=true,
>>> mlt.fl=description,
>>> mlt.mindf=1,
>>> mlt.mintf=1,
>>> mlt.maxqt=5,
>>> wt=json,
>>> mlt.interestingTerms=details}
>>>
>>> The issue I have is that when retrieving the key scored terms
>>> (interestingTerms), the code uses the total number of documents in the
>>> index, not the total number of documents with populated “description”
>>> field. This is where it’s done in the code:
>>> https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651
>>>
>>> The effect of this choice is that the “idf” does not vary much, given
>>> that numDocs >> number of documents with “description”, so the key terms
>>> end up being just the terms with the highest term frequencies.
>>>
>>> It is inconsistent because the MLT-search then uses these extracted key
>>> terms and scores all documents using an idf which is computed only on the
>>> subset of documents with “description”. So one part of the MLT uses a
>>> different numDocs than another part. This sounds like an odd choice, and
>>> not expected at all, and I wonder if I’m missing something.
>>>
>>> Best,
>>> Maria
>>
>> --
>> Matt Pearce
>> Flax - Open Source Enterprise Search
>> http://www.flax.co.uk
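
To make the effect discussed in this thread concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes the idf formula from the quoted idfExplain text, log((docCount+1)/(docFreq+1)) + 1, and assumes a term is scored as tf * idf when picking interesting terms. The class name IdfDemo, the index sizes, and the term statistics are all made up for illustration; this is not the actual MoreLikeThis code.

```java
/** Hypothetical illustration only; not the actual MoreLikeThis implementation. */
final class IdfDemo {

    /** idf as given in the quoted explanation: log((docCount+1)/(docFreq+1)) + 1. */
    static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1.0) / (docFreq + 1.0)) + 1.0;
    }

    /** Assumed term score: term frequency in the source document times idf. */
    static double score(int tf, long docFreq, long docCount) {
        return tf * idf(docFreq, docCount);
    }

    public static void main(String[] args) {
        long numDocs = 1000;       // made-up: every document in the index
        long fieldDocCount = 100;  // made-up: documents with "description" (about 10%)

        // Made-up terms: "common" occurs twice in the source document but is
        // widespread in the field; "rare" occurs once but in few documents.
        int commonTf = 2;
        long commonDf = 50;
        int rareTf = 1;
        long rareDf = 5;

        System.out.printf("using numDocs:       common=%.3f rare=%.3f%n",
                score(commonTf, commonDf, numDocs),
                score(rareTf, rareDf, numDocs));
        System.out.printf("using fieldDocCount: common=%.3f rare=%.3f%n",
                score(commonTf, commonDf, fieldDocCount),
                score(rareTf, rareDf, fieldDocCount));
    }
}
```

With these made-up numbers, the common term outranks the rare one when the whole-index count is used, and the order flips when the per-field count is used. The absolute gap between the two idf values is the same in both cases; what changes is their ratio, which is what matters once idf is multiplied by term frequency.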