[ 
https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-124:
-------------------------------

    Attachment: LUCENE-124.patch

Here is a patch, with a test for the issue.

This patch adds TOP_TERMS_CONSTANT_BOOLEAN_REWRITE to complement 
TOP_TERMS_SCORING_BOOLEAN_REWRITE.

Note: this solution differs from the one in LUCENE-329, but I think this rewrite 
method could be useful for other queries as well.

example usage:
{code}
FuzzyQuery query = new FuzzyQuery(new Term("field", "Lucene"));
query.setRewriteMethod(MultiTermQuery.TOP_TERMS_CONSTANT_BOOLEAN_REWRITE);
ScoreDoc[] hits = searcher.search(query, ...).scoreDocs;
 ...
{code}

> Fuzzy Searches do not get a boost of 0.2 as stated in "Query Syntax" doc
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-124
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 1.2
>         Environment: Operating System: All
> Platform: All
>            Reporter: Cormac Twomey
>            Priority: Minor
>         Attachments: LUCENE-124.patch
>
>
> According to the website's "Query Syntax" page, fuzzy searches are given a
> boost of 0.2. I've found this not to be the case, and have seen situations
> where exact matches have lower relevance scores than fuzzy matches.
> Rather than getting a boost of 0.2, it appears that all variations on the term
> are first found in the model, where dist* > 0.5.
> * dist = levenshteinDistance / length of min(termlength, variantlength)
> This then leads to a boolean OR search of all the variant terms, each of whose
> boost is set to (dist - 0.5)*2 for that variant.
> The upshot of all of this is that there are many cases where a fuzzy match
> will get a higher relevance score than an exact match.
> See this email for a test case to reproduce this anomalous behaviour.
> http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html
> Here is a candidate patch to address the issue -
> *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java   Sun Jun 09 13:47:54 2002
> --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java   Fri Mar 14 11:37:20 2003
> ***************
> *** 99,105 ****
>       }
>       
>       final protected float difference() {
> !         return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
>       }
>       
>       final public boolean endEnum() {
> --- 99,109 ----
>       }
>       
>       final protected float difference() {
> !             if (distance == 1.0) {
> !                     return 1.0f;
> !             }
> !             else
> !                     return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
>       }
>       
>       final public boolean endEnum() {
> ***************
> *** 111,117 ****
>        ******************************/
>       
>       public static final double FUZZY_THRESHOLD = 0.5;
> !     public static final double SCALE_FACTOR = 1.0f / (1.0f - FUZZY_THRESHOLD);
>       
>       /**
>        Finds and returns the smallest of three integers 
> --- 115,121 ----
>        ******************************/
>       
>       public static final double FUZZY_THRESHOLD = 0.5;
> !     public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD));
>       
>       /**
>        Finds and returns the smallest of three integers
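
To make the effect of the quoted candidate patch concrete, here is a minimal standalone sketch that compares the two difference() formulas. It does not use Lucene: the class and method names (FuzzyBoostSketch, originalDifference, patchedDifference) are hypothetical, and it assumes that "distance" is a similarity in (0.5, 1.0] with 1.0 meaning an exact match, which is what the patch's "distance == 1.0" check implies.

```java
public class FuzzyBoostSketch {
    // Same threshold as in the quoted FuzzyTermEnum diff.
    static final double FUZZY_THRESHOLD = 0.5;

    // Formula before the patch: maps (threshold, 1.0] linearly onto (0, 1.0].
    static float originalDifference(double distance) {
        double scaleFactor = 1.0 / (1.0 - FUZZY_THRESHOLD);
        return (float) ((distance - FUZZY_THRESHOLD) * scaleFactor);
    }

    // Formula after the patch: an exact match keeps 1.0, while every
    // other variant is scaled down so that its boost stays below 0.2.
    static float patchedDifference(double distance) {
        if (distance == 1.0) {
            return 1.0f;
        }
        double scaleFactor = 0.2 * (1.0 / (1.0 - FUZZY_THRESHOLD));
        return (float) ((distance - FUZZY_THRESHOLD) * scaleFactor);
    }

    public static void main(String[] args) {
        // Hypothetical similarity values, chosen only for illustration.
        double[] samples = {1.0, 0.9, 0.75, 0.6};
        for (double d : samples) {
            System.out.printf("distance=%.2f original=%.3f patched=%.3f%n",
                    d, originalDifference(d), patchedDifference(d));
        }
    }
}
```

Under these assumptions a variant with distance 0.75 goes from a boost of 0.5 to a boost of 0.1, while the exact match stays at 1.0, so the gap between exact and fuzzy terms widens. This only shows the per-term boost; the final relevance score still depends on the other scoring factors.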


