Re: MLT - unexpected design choice

Maria Mestre Tue, 29 Jan 2019 08:29:41 -0800

Hi Alessandro and Matt,

Thanks so much for your help!


@Alessandro: I will do so, thank you :-)



> On 29 Jan 2019, at 12:26, Alessandro Benedetti <a.benede...@sease.io> wrote:
> 
> Hi Maria,
> this is actually a great catch!
> I have been working a lot on the More Like This and this mistake never
> caught my attention.
> 
> I agree with you, feel free to open a Jira Issue.
> 
> First of all, what you say makes sense.
> Secondly, it is the standard approach used in the Lucene similarity
> calculations:
> 
> public Explanation idfExplain(CollectionStatistics collectionStats,
>                               TermStatistics termStats) {
>   final long df = termStats.docFreq();
>   final long docCount = collectionStats.docCount();
>   final float idf = idf(df, docCount);
>   return Explanation.match(idf,
>       "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:",
>       Explanation.match(df, "docFreq, number of documents containing term"),
>       Explanation.match(docCount, "docCount, total number of documents with field"));
> }
> 
> 
> Indeed, the int numDocs = ir.numDocs(); should actually be allocated
> per term in the for loop, using the field stats, something like:
> 
> numDocs = ir.getDocCount(fieldName)
> 
> Feel free to open the Jira issue and attach a patch with at least a
> testCase that shows the bugfix.
> 
> I will be available for doing the review.
> 
> 
> Cheers
> 
> --------------------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.sease.io&d=DwIFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=dDMVizYQyLCVjZtUiuOfTiDX-PplOc_mxo-mESuppfQ&m=cyGFtTUeNu0Xk1tpUujTkyQm4S-13HewPzKKYnSmeX4&s=KEO05uAuRQl8rAIP9s17NGMjXRfT6hiTPrY4lqZgdu4&e=
> 
> 
> On Tue, Jan 29, 2019 at 11:41 AM Matt Pearce <m...@flax.co.uk> wrote:
> 
>> Hi Maria,
>> 
>> Would it help to add a filter to your query to restrict the results to
>> just those where the description field is populated? Eg. add
>> 
>> fq=description:[* TO *]
>> 
>> to your query parameters.
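
A filter like the one suggested above contains characters (colon, brackets, spaces) that need URL-encoding when the parameters are sent as a query string. The following is a minimal sketch in plain Java, assuming the parameters are assembled by hand; the class and method names are hypothetical, and the parameter values are taken from the messages in this thread.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Solr or SolrJ.
class MltQuerySketch {
    // URL-encodes one key or value.
    // URLEncoder.encode(String, Charset) requires Java 10 or later.
    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Joins the parameters into key=value pairs separated by '&'.
    static String toQueryString(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> enc(e.getKey()) + "=" + enc(e.getValue()))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "id:\"0c7c4d74-0f37-44ea-8933-cd2ee7964457\"");
        params.put("mlt.fl", "description");
        // Range query matching documents where "description" has any value.
        params.put("fq", "description:[* TO *]");
        System.out.println("/mlt?" + toQueryString(params));
    }
}
```

A client library would normally handle the encoding; the sketch only shows what the filter looks like once encoded.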
>> 
>> Apologies if I'm misunderstanding the problem!
>> 
>> Best,
>> 
>> Matt
>> 
>> 
>> On 28/01/2019 16:29, Maria Mestre wrote:
>>> Hi all,
>>> 
>>> First of all, I’m not a Java developer, and I’m a Solr newbie. I have
>>> worked with Elasticsearch for some years (not contributing, just as a
>>> user), so I think I have the basics of text search engines covered. I am
>>> always learning new things though!
>>> 
>>> I created an index in Solr and used more-like-this on it, by passing a
>>> document_id. My data has a special feature, which is that one of the
>>> fields is called “description” but is only populated about 10% of the
>>> time. Most of the time it is empty. I am using that field to query
>>> similar documents.
>>> 
>>> So I query the /mlt endpoint using these parameters (for example):
>>> 
>>> {q=id:"0c7c4d74-0f37-44ea-8933-cd2ee7964457",
>>> mlt=true,
>>> mlt.fl=description,
>>> mlt.mindf=1,
>>> mlt.mintf=1,
>>> mlt.maxqt=5,
>>> wt=json,
>>> mlt.interestingTerms=details}
>>> 
>>> The issue I have is that when retrieving the key scored terms
>>> (interestingTerms), the code uses the total number of documents in the
>>> index, not the total number of documents with a populated “description”
>>> field. This is where it’s done in the code:
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_lucene-2Dsolr_blob_master_lucene_queries_src_java_org_apache_lucene_queries_mlt_MoreLikeThis.java-23L651&d=DwIFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=dDMVizYQyLCVjZtUiuOfTiDX-PplOc_mxo-mESuppfQ&m=cyGFtTUeNu0Xk1tpUujTkyQm4S-13HewPzKKYnSmeX4&s=KtS7zF20Gy-Ij9SeQ-XfafUkPqn8C8855G6KbnNVR6I&e=
>>> 
>>> The effect of this choice is that the “idf” does not vary much, given
>>> that numDocs >> number of documents with “description”, so the key terms
>>> end up being just the terms with the highest term frequencies.
>>> 
>>> It is inconsistent because the MLT search then uses these extracted key
>>> terms and scores all documents using an idf which is computed only on the
>>> subset of documents with “description”. So one part of the MLT uses a
>>> different numDocs than another part. This sounds like an odd choice, and
>>> not at all what I expected, and I wonder if I’m missing something.
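
To make the effect described above concrete, here is a minimal sketch in plain Java; it is not the MoreLikeThis code itself. It uses the idf formula from the explanation string quoted in Alessandro's reply, log((docCount+1)/(docFreq+1)) + 1, and compares a rare term with a common term under two document counts. The class name, the method names, and all the counts are assumptions chosen for illustration (a field populated in roughly 10% of documents).

```java
// Hypothetical illustration; names and numbers are not taken from Lucene.
class IdfSketch {
    // idf as described in the quoted explanation string:
    // log((docCount + 1) / (docFreq + 1)) + 1
    static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1.0) / (docFreq + 1.0)) + 1.0;
    }

    // How many times larger the idf of a rare term is than that of a common term.
    static double idfRatio(long rareDf, long commonDf, long docCount) {
        return idf(rareDf, docCount) / idf(commonDf, docCount);
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000;      // assumed: every document in the index
        long fieldDocCount = 100_000;  // assumed: documents with "description"
        long rareDf = 10;              // assumed document frequency of a rare term
        long commonDf = 1_000;         // assumed document frequency of a common term

        System.out.printf("whole index: rare/common idf ratio = %.3f%n",
                idfRatio(rareDf, commonDf, numDocs));
        System.out.printf("field only:  rare/common idf ratio = %.3f%n",
                idfRatio(rareDf, commonDf, fieldDocCount));
    }
}
```

With this formula the absolute gap between the two idf values is the same under either count, but the ratio between them is smaller when the whole-index count is used, so term frequency weighs relatively more when terms are ranked.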
>>> 
>>> Best,
>>> Maria
>>> 
>> 
>> --
>> Matt Pearce
>> Flax - Open Source Enterprise Search
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.flax.co.uk&d=DwIFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=dDMVizYQyLCVjZtUiuOfTiDX-PplOc_mxo-mESuppfQ&m=cyGFtTUeNu0Xk1tpUujTkyQm4S-13HewPzKKYnSmeX4&s=yD20MeMqL431tJ4y2F6SRz4DgvYVjiJ4N1ovHwt9m2g&e=
