[ https://issues.apache.org/jira/browse/LUCENE-8020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16223682#comment-16223682 ]
Robert Muir commented on LUCENE-8020:
-------------------------------------

I noticed this when trying to debug AfterEffectB in LUCENE-8015. The formula should be {{(F + 1) / (n * (tfn + 1))}}, but we currently use {{(F + 1 + 1) / ((n + 1) * (tfn + 1))}}, and I couldn't remember why we had this mess everywhere.

> Don't force sim to score bogus terms (e.g. docfreq=0)
> -----------------------------------------------------
>
>                 Key: LUCENE-8020
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8020
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>
> Today all sim formulas have to be "hacked" to deal with the fact that they
> may be passed stats such as docFreq=0, totalTermFreq=0. This happens easily
> with spans and there is even a dedicated test for it. All formulas have hacks
> such as what you see in https://issues.apache.org/jira/browse/LUCENE-6818:
> Instead of:
> {code}
> expected = stats.getTotalTermFreq() * docLen / stats.getNumberOfFieldTokens();
> {code}
> they must do tricks such as:
> {code}
> expected = (1 + stats.getTotalTermFreq()) * docLen / (1 + stats.getNumberOfFieldTokens());
> {code}
> There is no good reason for this, it is just sloppiness in the
> Query/Weight/Scorer api. I think formulas should work unmodified, we
> shouldn't pass terms that don't exist or bogus statistics.
> It adds a lot of complexity to the scoring api and makes it difficult to have
> meaningful/useful explanations, to debug problems, etc. It also makes it
> really hard to add a new sim.
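To make the effect of the "+1" hacks concrete, here is a minimal standalone sketch that compares the unmodified and hacked forms of the two formulas quoted above. It does not use the Lucene API: the class name BogusStatsDemo and its four helper methods are hypothetical, the statistics are passed as plain numbers, and it assumes that n is the statistic that can be 0 for a bogus term (as the issue title suggests for docFreq).

```java
public class BogusStatsDemo {

    // Unmodified formula from the issue description (hypothetical helper, not Lucene code).
    static double expectedUnmodified(long totalTermFreq, double docLen, long numberOfFieldTokens) {
        return totalTermFreq * docLen / numberOfFieldTokens;
    }

    // "Hacked" variant: +1 on both statistics so that all-zero stats do not divide by zero.
    static double expectedHacked(long totalTermFreq, double docLen, long numberOfFieldTokens) {
        return (1 + totalTermFreq) * docLen / (1 + numberOfFieldTokens);
    }

    // After-effect formula as the comment says it should be written.
    static double afterEffectB(double F, double n, double tfn) {
        return (F + 1) / (n * (tfn + 1));
    }

    // After-effect formula as the comment says it is currently written.
    static double afterEffectBHacked(double F, double n, double tfn) {
        return (F + 1 + 1) / ((n + 1) * (tfn + 1));
    }

    public static void main(String[] args) {
        // Bogus term: every statistic is zero, so the unmodified formulas break down.
        System.out.println("bogus, unmodified expected = " + expectedUnmodified(0, 10.0, 0)); // NaN (0.0 / 0.0)
        System.out.println("bogus, hacked expected     = " + expectedHacked(0, 10.0, 0));
        System.out.println("bogus, unmodified B        = " + afterEffectB(0, 0, 1.0));        // Infinity (1.0 / 0.0)
        System.out.println("bogus, hacked B            = " + afterEffectBHacked(0, 0, 1.0));

        // Real term: the hack is no longer needed, but it still shifts the result.
        System.out.println("real, unmodified expected  = " + expectedUnmodified(50, 10.0, 1000));
        System.out.println("real, hacked expected      = " + expectedHacked(50, 10.0, 1000));
        System.out.println("real, unmodified B         = " + afterEffectB(50, 10, 2.0));
        System.out.println("real, hacked B             = " + afterEffectBHacked(50, 10, 2.0));
    }
}
```

With all-zero statistics the unmodified formulas yield NaN or Infinity, which is presumably why the hacks exist. For a term that does exist, the hacked forms return slightly different values from the formulas as written, which fits the complaint that explanations become harder to follow.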