[ https://issues.apache.org/jira/browse/LUCENE-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861960#comment-16861960 ]
Mark Harwood commented on LUCENE-8840:
--------------------------------------

{quote}we shouldn't favor documents that contain multiple variations of the same fuzzy term.
{quote}
For fuzzy I agree that rewarding more variations in a doc is probably undesirable - a doc will normally pick one spelling for a word and use it consistently, so any variations are more likely to be false positives (your baz/bad example).

Plurals and other suffixed forms would be a notable exception, but I don't think that's too much of a problem because:
# we can assume that stemming is taking care of normalizing these tokens.
# a lot of fuzzy querying is for things like people's names, which aren't expressed as plurals or with other common suffixes.

I think all forms of automatic expansion (synonym, fuzzy, wildcard) need a form of score blending for the expansions they create.

Wildcards are perhaps unlike fuzzy in that finding multiple variations in a doc _is_ desirable - we _are_ looking for multiple forms, and a document that contains many is better than one that contains few.

> TopTermsBlendedFreqScoringRewrite should use SynonymQuery
> ---------------------------------------------------------
>
>                 Key: LUCENE-8840
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8840
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Major
>         Attachments: LUCENE-8840.patch
>
>
> Today the TopTermsBlendedFreqScoringRewrite, which is the default rewrite
> method for Fuzzy queries, uses the BlendedTermQuery to score documents that
> match the fuzzy terms. This query blends the frequencies used for scoring
> across the terms and creates a disjunction of all the blended terms. This
> means that each fuzzy term that matches in a document adds its own BM25 score
> contribution. We already have a query that can blend the statistics of
> multiple terms in a single scorer that sums the term frequencies within a
> document rather than the entire BM25 score: the SynonymQuery.
> Since https://issues.apache.org/jira/browse/LUCENE-8652 this query also
> handles boosts between 0 and 1, so it should be easy to change the default
> rewrite method for Fuzzy queries to use it instead of the BlendedTermQuery.
> This would bound the contribution of each term to the final score, which
> seems a better alternative in terms of relevancy than the current solution.
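
To make the difference between the two scoring styles concrete, here is a minimal, self-contained sketch in plain Java. It does not use Lucene's classes: the class name, the method names, the single shared idf value and the k1/b defaults are all assumptions made for illustration, and it assumes that a boost scales a term's frequency before the frequencies are summed. It compares adding one BM25 score per matching variant (the disjunction behaviour described in the issue) with scoring the summed frequencies once (the SynonymQuery-style behaviour).

```java
// Illustrative only: a simplified BM25, not Lucene's BM25Similarity.
final class FuzzyBlendingSketch {
    static final double K1 = 1.2;   // assumed BM25 k1
    static final double B = 0.75;   // assumed BM25 b

    // Saturating term-frequency component of BM25; always below 1.
    static double tfPart(double freq, double docLen, double avgDocLen) {
        double norm = K1 * (1 - B + B * docLen / avgDocLen);
        return freq / (freq + norm);
    }

    // A common BM25 idf form; one value is shared by all variants here,
    // standing in for the blended statistics (an assumption).
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Disjunction style: every matching variant adds its own full score.
    static double sumOfScores(double[] freqs, double[] boosts, double idf,
                              double docLen, double avgDocLen) {
        double score = 0;
        for (int i = 0; i < freqs.length; i++) {
            if (freqs[i] > 0) {
                score += boosts[i] * idf * tfPart(freqs[i], docLen, avgDocLen);
            }
        }
        return score;
    }

    // Synonym style: boosted frequencies are summed, then scored once.
    static double scoreOfSum(double[] freqs, double[] boosts, double idf,
                             double docLen, double avgDocLen) {
        double freq = 0;
        for (int i = 0; i < freqs.length; i++) {
            freq += boosts[i] * freqs[i];
        }
        return freq > 0 ? idf * tfPart(freq, docLen, avgDocLen) : 0;
    }

    public static void main(String[] args) {
        double idf = idf(1000, 10);
        double[] freqs = {1, 1, 1};      // three fuzzy variants, one hit each
        double[] boosts = {1, 1, 1};
        System.out.println("idf          = " + idf);
        System.out.println("sum of scores = " + sumOfScores(freqs, boosts, idf, 100, 100));
        System.out.println("score of sum  = " + scoreOfSum(freqs, boosts, idf, 100, 100));
    }
}
```

Under these assumptions the summed-frequency score can never exceed the idf, however many variants match, whereas the per-variant sum grows with each additional variant. That is the bounding effect the issue description is after for fuzzy queries, and also why it may be less suitable for wildcards, where matching many forms is meant to be rewarded.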