[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-10-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1124:
---

Fix Version/s: (was: 2.9)
               3.0
               2.9.1

> short circuit FuzzyQuery.rewrite when input token length is small compared to 
> minSimilarity
> ---
>
>         Key: LUCENE-1124
>         URL: https://issues.apache.org/jira/browse/LUCENE-1124
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>    Reporter: Hoss Man
>    Assignee: Mark Miller
>    Priority: Trivial
>     Fix For: 2.9.1, 3.0
>
> Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, 
> LUCENE-1124.patch
>
>
> I found this (unanswered) email from the holidays floating around in my 
> Lucene folder...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <200712311601.12255.luc...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>    if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ... 
> FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in 
> the event that the input token is shorter than some simple math on the 
> minSimilarity.  (I'm not smart enough to be certain that the math above is 
> right however ... it's been a while since I looked at Levenshtein distances 
> ... tests needed)
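
A minimal sketch of the length check proposed in the quoted email, for anyone
who wants to try the arithmetic. It assumes similarity is computed as
1 - edits / length; that definition is an assumption here, not something the
issue text confirms. Under it, one edit in a 3-letter token gives roughly 0.67
(the email's "0.6", which it calls an edit distance). The class and method
names are hypothetical and are not part of Lucene.

```java
// Illustration only: ShortTokenCheck and its methods are hypothetical names,
// not Lucene API.
final class ShortTokenCheck {

    // True when the token is longer than 1 / (1 - minSimilarity), the
    // condition quoted in the email for fuzzy matching to be worthwhile.
    static boolean fuzzyCanMatch(String token, float minSimilarity) {
        return token.length() > 1f / (1f - minSimilarity);
    }

    // Similarity after a number of edits, ASSUMING
    // similarity = 1 - edits / length.
    static float similarity(int edits, int length) {
        return 1f - (float) edits / length;
    }

    public static void main(String[] args) {
        System.out.println(fuzzyCanMatch("foo", 0.5f)); // true:  3 > 2.0
        System.out.println(fuzzyCanMatch("foo", 0.7f)); // false: 3 < 3.33...
        System.out.println(similarity(1, 3));           // roughly 0.67
    }
}
```

With the default minSimilarity of 0.5 the threshold is 2, so only tokens of
one or two characters would be short-circuited.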

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-10-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1124:
---

Attachment: LUCENE-1124.patch

Attached a patch (based on 2.9) showing the bug, along with the fix.  Instead 
of rewriting to an empty BooleanQuery when the prefix term is not long enough, 
I rewrite to a TermQuery with that prefix.  This way the exact term still 
matches.

I'll commit shortly to trunk & 2.9.x.
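
A rough sketch of the behaviour described above, not the contents of the
attached patch. The types are plain Java stand-ins rather than Lucene classes:
the returned list plays the role of the rewritten query's terms, and the
Supplier stands in for the expensive term enumeration. All names here are
hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Illustration only: FuzzyRewriteSketch is a hypothetical stand-in, not
// Lucene's FuzzyQuery.
final class FuzzyRewriteSketch {

    // Returns the terms the rewritten query would match.
    static List<String> rewrite(String term, float minSimilarity,
                                Supplier<List<String>> expensiveEnumeration) {
        boolean termLongEnough = term.length() > 1f / (1f - minSimilarity);
        if (!termLongEnough) {
            // Too short for fuzzy matching: skip the enumeration, but keep
            // the exact term (the TermQuery case) instead of matching nothing.
            return Collections.singletonList(term);
        }
        // Long enough: fall through to the normal, expensive enumeration.
        return expensiveEnumeration.get();
    }

    public static void main(String[] args) {
        Supplier<List<String>> enumeration = () -> {
            System.out.println("enumerating terms...");
            return Arrays.asList("foobar", "foobaz");
        };
        System.out.println(rewrite("fo", 0.5f, enumeration));     // [fo]
        System.out.println(rewrite("foobar", 0.5f, enumeration)); // enumerates first
    }
}
```

Returning an empty list in the short-term branch would correspond to the
empty BooleanQuery case, where even an exact match on the term is lost.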

> short circuit FuzzyQuery.rewrite when input token length is small compared to 
> minSimilarity
> ---
>
>         Key: LUCENE-1124
>         URL: https://issues.apache.org/jira/browse/LUCENE-1124
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>    Reporter: Hoss Man
>    Assignee: Mark Miller
>    Priority: Trivial
>     Fix For: 2.9
>
> Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, 
> LUCENE-1124.patch
>
>



[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-01-04 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1124:


Priority: Trivial  (was: Major)

> short circuit FuzzyQuery.rewrite when input token length is small compared to 
> minSimilarity
> ---
>
>         Key: LUCENE-1124
>         URL: https://issues.apache.org/jira/browse/LUCENE-1124
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>    Reporter: Hoss Man
>    Assignee: Mark Miller
>    Priority: Trivial
>     Fix For: 2.9
>
> Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch
>
>



[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-01-04 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1124:


Fix Version/s: 2.9
     Assignee: Mark Miller

> short circuit FuzzyQuery.rewrite when input token length is small compared to 
> minSimilarity
> ---
>
>         Key: LUCENE-1124
>         URL: https://issues.apache.org/jira/browse/LUCENE-1124
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>    Reporter: Hoss Man
>    Assignee: Mark Miller
>     Fix For: 2.9
>
> Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch
>
>



[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-01-04 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1124:


Attachment: LUCENE-1124.patch

Updated to trunk.

I'm going to commit in a few days if no one objects.

> short circuit FuzzyQuery.rewrite when input token length is small compared to 
> minSimilarity
> ---
>
>         Key: LUCENE-1124
>         URL: https://issues.apache.org/jira/browse/LUCENE-1124
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>    Reporter: Hoss Man
> Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch
>
>



[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2008-08-18 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-1124:
-

Summary: short circuit FuzzyQuery.rewrite when input token length is small 
compared to minSimilarity  (was: short circuit FuzzyQuery.rewrite when input 
okenlengh is small compared to minSimilarity)

> short circuit FuzzyQuery.rewrite when input token length is small compared to 
> minSimilarity
> ---
>
>         Key: LUCENE-1124
>         URL: https://issues.apache.org/jira/browse/LUCENE-1124
>     Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>    Reporter: Hoss Man
> Attachments: LUCENE-1124.patch, LUCENE-1124.patch
>
>