[ https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tommaso Teofili updated LUCENE-6687:
------------------------------------
    Fix Version/s: master (9.0)

> MLT term frequency calculation bug
> ----------------------------------
>
>                 Key: LUCENE-6687
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6687
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/query/scoring, core/queryparser
>    Affects Versions: 5.2.1, 6.0
>         Environment: OS X v10.10.4; Solr 5.2.1
>            Reporter: Marko Bonaci
>            Assignee: Tommaso Teofili
>            Priority: Major
>             Fix For: 5.2.2, master (9.0)
>
>         Attachments: LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch,
> LUCENE-6687.patch, buggy-method-usage.png,
> solr-mlt-tf-doubling-bug-results.png,
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png,
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png,
> solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png,
> terms-glass.png, terms-how.png
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In {{org.apache.lucene.queries.mlt.MoreLikeThis}}, there's a method
> {{retrieveTerms}} that receives a {{Map}} of fields, i.e. basically a
> document, but it doesn't have to be an existing doc.
> !solr-mlt-tf-doubling-bug.png|height=500!
> There are 2 for loops, one inside the other, which both loop through the same
> set of fields.
> That effectively doubles the term frequency for all the terms from the fields
> that we provide in the MLT QP {{qf}} parameter.
> It basically goes over the list of fields twice and accumulates the term
> frequencies from all fields into {{termFreqMap}}.
> The private method {{retrieveTerms}} is only called from one public method,
> the overload of {{like}} that receives a {{Map}}, so the private class member
> {{fieldNames}} is always derived from {{retrieveTerms}}'s argument {{fields}}.
>
> In other words, by the time the {{retrieveTerms}} method gets called, its
> parameter {{fields}} and the private member {{fieldNames}} always contain the
> same list of fields.
> Here's the proof:
> These are the final results of the calculation:
> !solr-mlt-tf-doubling-bug-results.png|height=700!
> And this is the actual {{thread_id:TID0009}} document from which those values
> were derived (from fields {{title_mlt}} and {{pagetext_mlt}}):
> !terms-glass.png|height=100!
> !terms-angry.png|height=100!
> !terms-how.png|height=100!
> !terms-accumulator.png|height=100!
> Now, let's further test this hypothesis by seeing MLT QP in action from the
> AdminUI.
> Let's try to find docs that are More Like doc {{TID0009}}.
> Here's the interesting part, the query:
> {code}
> q={!mlt qf=pagetext_mlt,title_mlt mintf=14 mindf=2 minwl=3 maxwl=15}TID0009
> {code}
> We just saw, in the last image above, that the term {{accumulator}} appears
> {{7}} times in the {{TID0009}} doc, but its TF was calculated as {{14}}.
> By using {{mintf=14}}, we say that, when calculating similarity, we don't
> want to consider terms that appear fewer than 14 times in {{TID0009}} (with
> the terms from fields {{title_mlt}} and {{pagetext_mlt}} merged together).
> I added the term {{accumulator}} to only one other document ({{TID0004}}),
> where it appears only once, in the field {{title_mlt}}.
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png|height=500!
> Let's see what happens when we use {{mintf=15}}:
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png|height=500!
> I should probably mention that multiple fields ({{qf}}) work because I
> applied the patch:
> [SOLR-7143|https://issues.apache.org/jira/browse/SOLR-7143].
> Bug, no?
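
For readers without the screenshots, the effect described in the report can be reproduced with a small standalone sketch. This is not the Lucene source: the class {{MltTfSketch}}, its methods, and the sample field contents are hypothetical, and the 2 + 5 split of {{accumulator}} between {{title_mlt}} and {{pagetext_mlt}} is made up so that the real count is 7. It only models the pattern the report describes, namely two nested loops over the same set of fields feeding one {{termFreqMap}}, followed by a {{mintf}}-style cutoff.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the accumulation described in the report.
final class MltTfSketch {

    // Illustrative document: "accumulator" occurs 2 + 5 = 7 times across two fields.
    static Map<String, List<String>> sampleDoc() {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("title_mlt", Arrays.asList("accumulator accumulator"));
        fields.put("pagetext_mlt",
                Arrays.asList(String.join(" ", Collections.nCopies(5, "accumulator"))));
        return fields;
    }

    // Adds every whitespace-separated term of every field value to termFreqMap.
    private static void addAllFields(Map<String, List<String>> fields,
                                     Map<String, Integer> termFreqMap) {
        for (List<String> values : fields.values()) {
            for (String value : values) {
                for (String term : value.split("\\s+")) {
                    if (!term.isEmpty()) {
                        termFreqMap.merge(term, 1, Integer::sum);
                    }
                }
            }
        }
    }

    // Buggy shape: the outer loop walks the field names, and the inner pass
    // walks the same fields again, so each value is counted once per field name.
    static Map<String, Integer> nestedLoopCounts(Map<String, List<String>> fields) {
        Map<String, Integer> termFreqMap = new HashMap<>();
        for (String ignoredFieldName : fields.keySet()) {
            addAllFields(fields, termFreqMap);
        }
        return termFreqMap;
    }

    // Intended shape: each field value is counted exactly once.
    static Map<String, Integer> singleLoopCounts(Map<String, List<String>> fields) {
        Map<String, Integer> termFreqMap = new HashMap<>();
        addAllFields(fields, termFreqMap);
        return termFreqMap;
    }

    // mintf-style cutoff: a term is kept only if its frequency reaches the minimum.
    static boolean passesMinTf(int tf, int minTf) {
        return tf >= minTf;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = sampleDoc();
        int real = singleLoopCounts(doc).get("accumulator");     // 7
        int inflated = nestedLoopCounts(doc).get("accumulator"); // 14
        System.out.println("single pass TF = " + real);
        System.out.println("nested loops TF = " + inflated);
        System.out.println("inflated TF passes mintf=14: " + passesMinTf(inflated, 14));
        System.out.println("inflated TF passes mintf=15: " + passesMinTf(inflated, 15));
        System.out.println("real TF passes mintf=14: " + passesMinTf(real, 14));
    }
}
```

In this model the inflation factor equals the number of fields, which is 2 for {{qf=pagetext_mlt,title_mlt}}, so a term with a real frequency of 7 is counted as 14. It therefore survives {{mintf=14}} and is dropped at {{mintf=15}}, matching the numbers quoted in the report.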