Hi Alessandro and Matt,
Thanks so much for your help!
@Alessandro: I will do so, thank you :-)
> On 29 Jan 2019, at 12:26, Alessandro Benedetti wrote:
>
> Hi Maria,
> this is actually a great catch!
> I have been working a lot on the More Like This and this mistake never
> caught my attention.
>
> I agree with you, feel free to open a Jira Issue.
>
> First of all, what you say makes sense.
> Secondly, it is the standard way the Lucene similarity calculations
> work:
>
> public Explanation idfExplain(CollectionStatistics collectionStats,
>                               TermStatistics termStats) {
>   final long df = termStats.docFreq();
>   final long docCount = collectionStats.docCount();
>   final float idf = idf(df, docCount);
>   return Explanation.match(idf,
>       "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:",
>       Explanation.match(df, "docFreq, number of documents containing term"),
>       Explanation.match(docCount, "docCount, total number of documents with field"));
> }
>
>
> Indeed, the int numDocs = ir.numDocs(); should actually be assigned
> per term inside the for loop, using the field statistics, something like:
>
>   numDocs = ir.getDocCount(fieldName);
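
For anyone following along, here is a minimal sketch of why the choice of count matters. It uses the formula from the explanation string quoted above, log((docCount+1)/(docFreq+1)) + 1, in plain Java with no Lucene dependency. The class name and all the numbers (index size, a field populated in roughly 10% of documents, term document frequencies) are made up for illustration, so please read it as a rough sketch rather than what MoreLikeThis actually does internally.

```java
public class IdfComparison {

    // idf as described in the quoted explanation string:
    // log((docCount + 1) / (docFreq + 1)) + 1
    public static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1.0) / (docFreq + 1.0)) + 1.0;
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000L;      // hypothetical: every document in the index
        long fieldDocCount = 100_000L;  // hypothetical: documents with a "description"
        long[] docFreqs = {10L, 1_000L, 50_000L};  // hypothetical term document frequencies

        for (long df : docFreqs) {
            // Compare the index-wide count with the per-field count for the same term.
            System.out.printf("df=%d  idf(numDocs)=%.3f  idf(fieldDocCount)=%.3f%n",
                    df, idf(df, numDocs), idf(df, fieldDocCount));
        }
    }
}
```

Under this formula the two variants differ by the same offset, log((numDocs+1)/(fieldDocCount+1)), for every term. So the order of terms by idf alone does not change, but the ratio between the idf of a rare term and that of a common term is smaller with the index-wide count. That can matter if, as I understand it, the term score is the idf multiplied by the term frequency.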
>
> Feel free to open the Jira issue and attach a patch with at least a
> testCase that shows the bugfix.
>
> I will be available for doing the review.
>
>
> Cheers
>
> --
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> http://www.sease.io
>
>
> On Tue, Jan 29, 2019 at 11:41 AM Matt Pearce <m...@flax.co.uk> wrote:
>
>> Hi Maria,
>>
>> Would it help to add a filter to your query to restrict the results to
>> just those where the description field is populated? Eg. add
>>
>> fq=description:[* TO *]
>>
>> to your query parameters.
>>
>> Apologies if I'm misunderstanding the problem!
>>
>> Best,
>>
>> Matt
>>
>>
>> On 28/01/2019 16:29, Maria Mestre wrote:
>>> Hi all,
>>>
>>> First of all, I’m not a Java developer, and I’m a Solr newbie. I have worked
>>> with Elasticsearch for some years (not contributing, just as a user), so I
>>> think I have the basics of text search engines covered. I am always
>>> learning new things though!
>>>
>>> I created an index in Solr and used more-like-this on it, by passing a
>>> document_id. My data has a special feature, which is that one of the fields
>>> is called “description” but is only populated about 10% of the time. Most
>>> of the time it is empty. I am using that field to find similar documents.
>>>
>>> So I query the /mlt endpoint using these parameters (for example):
>>>
>>> {q=id:"0c7c4d74-0f37-44ea-8933-cd2ee7964457",
>>> mlt=true,
>>> mlt.fl=description,
>>> mlt.mindf=1,
>>> mlt.mintf=1,
>>> mlt.maxqt=5,
>>> wt=json,
>>> mlt.interestingTerms=details}
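
In case it helps anyone reproduce this, below is a rough sketch of turning the parameters above into a URL-encoded query string using only the Java standard library. The class and method names, the host and the collection name ("mycollection") are placeholders of mine, not anything from the actual setup; only the /mlt endpoint and the parameter values come from this thread. The fq line is the optional filter suggested in this thread.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class MltRequestSketch {

    // The MLT parameters from the example; LinkedHashMap keeps insertion order.
    public static Map<String, String> mltParams(String docId) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "id:\"" + docId + "\"");
        params.put("mlt", "true");
        params.put("mlt.fl", "description");
        params.put("mlt.mindf", "1");
        params.put("mlt.mintf", "1");
        params.put("mlt.maxqt", "5");
        params.put("wt", "json");
        params.put("mlt.interestingTerms", "details");
        return params;
    }

    // Joins the parameters as key=value pairs, URL-encoding both sides.
    public static String toQueryString(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> params = mltParams("0c7c4d74-0f37-44ea-8933-cd2ee7964457");
        // Optional: restrict results to documents with a populated description.
        params.put("fq", "description:[* TO *]");

        // Hypothetical host and collection name; adjust to the real installation.
        String baseUrl = "http://localhost:8983/solr/mycollection/mlt";
        System.out.println(baseUrl + "?" + toQueryString(params));
    }
}
```

Note that the filter only restricts which documents come back. It does not change the document count used when the interesting terms are scored.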
>>>
>>> The issue I have is that when retrieving the key scored terms
>>> (interestingTerms), the code uses the total number of documents in the
>>> index, not the total number of documents with a populated “description”
>>> field. This is where it’s done in the code:
>>> https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651
>>>
>>> The effect of this choice is that the “idf” does not vary much, given
>>> that numDocs >> number of documents with “description”,