[jira] Commented: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

Hoss Man (JIRA) Mon, 25 Aug 2008 12:28:36 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625476#action_12625476
 ]


Hoss Man commented on LUCENE-1124:
----------------------------------

I don't deal with FuzzyQueries much, but skimming this issue it seems to touch 
on a lot of the same things that spawned the creation of the "mm" syntax for 
specifying the "minNumberShouldMatch" value on BooleanQueries in the Solr 
dismax query parser...

http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html

...perhaps something similar could be used to allow people to specify simple 
expressions for dictating the "fuzziness" of short input vs medium length 
input vs long input.

> short circuit FuzzyQuery.rewrite when input token length is small compared to 
> minSimilarity
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1124
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1124
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Hoss Man
>         Attachments: LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from 
> during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <[EMAIL PROTECTED]>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>            if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... 
> FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in 
> the event that the input token is shorter than some simple math on the 
> minSimilarity.  (I'm not smart enough to be certain that the math above is 
> right however ... it's been a while since I looked at Levenshtein distances 
> ... tests needed)
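
The length check from the quoted email can be sketched as a standalone 
helper. This is only an illustration under stated assumptions, not the 
attached patch and not Lucene API: the class and method names are 
hypothetical, and it assumes similarity is computed as 1 - (edits / token 
length) and that a term matches only when its similarity is strictly greater 
than minSimilarity.

```java
public class FuzzyShortCircuit {

    // Hypothetical helper (not part of Lucene): true when a token is long
    // enough that a term differing by one edit could still exceed
    // minSimilarity. This is the condition from the quoted email:
    //     token.length() > 1f / (1f - minSimilarity)
    static boolean fuzzyCanMatch(int tokenLength, float minSimilarity) {
        return tokenLength > 1f / (1f - minSimilarity);
    }

    // Best similarity reachable after exactly one edit, ASSUMING
    // similarity = 1 - (edits / tokenLength).
    static float bestSimilarityAfterOneEdit(int tokenLength) {
        return 1f - 1f / tokenLength;
    }

    public static void main(String[] args) {
        float minSimilarity = 0.5f; // the default mentioned in the email
        for (int len = 1; len <= 5; len++) {
            System.out.println("length=" + len
                + " bestSimilarity=" + bestSimilarityAfterOneEdit(len)
                + " fuzzyCanMatch=" + fuzzyCanMatch(len, minSimilarity));
        }
    }
}
```

Under these assumptions the two methods agree: whenever fuzzyCanMatch is 
false, even a single edit already drops the similarity to minSimilarity or 
below. An exact match still has similarity 1, so a short circuit would 
presumably need to keep matching the literal term rather than match nothing.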

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
