[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-10-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1124:
---

Attachment: LUCENE-1124.patch

Attached a patch (based on 2.9) that demonstrates the bug, along with the fix.  Instead of 
rewriting to an empty BooleanQuery when the prefix term is not long enough, I rewrite 
to a TermQuery with that prefix.  This way the exact term still matches.

I'll commit shortly to trunk and 2.9.x.
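
The difference between the two rewrite strategies can be sketched in plain Java, without any Lucene dependency. Everything here is an assumption made for illustration, not the code in the attached patch: the class and method names are hypothetical, the "long enough" test is the condition quoted in the issue description, similarity is taken to be 1 - editDistance / length of the shorter string, and a list of strings stands in for the term dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not Lucene's FuzzyQuery and not the attached patch.
final class ShortTokenRewriteSketch {

    // Condition quoted in the issue description (assumes 0 <= minSimilarity < 1).
    static boolean termLongEnough(String token, float minSimilarity) {
        return token.length() > 1f / (1f - minSimilarity);
    }

    // Plain Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed similarity measure: 1 - distance / length of the shorter string.
    static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1f : 0f;
        }
        return 1f - (float) editDistance(a, b) / shorter;
    }

    // exactFallback=false models rewriting to an empty query for short tokens;
    // exactFallback=true models rewriting to a single exact-term query instead.
    static List<String> expand(String token, float minSimilarity,
                               List<String> indexedTerms, boolean exactFallback) {
        List<String> matches = new ArrayList<>();
        if (!termLongEnough(token, minSimilarity)) {
            if (exactFallback && indexedTerms.contains(token)) {
                matches.add(token);
            }
            return matches; // the term enumeration is skipped entirely
        }
        for (String term : indexedTerms) {
            if (similarity(token, term) > minSimilarity) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("ab", "abc", "foo", "fob");
        System.out.println("empty rewrite: " + expand("ab", 0.5f, terms, false));
        System.out.println("exact fallback: " + expand("ab", 0.5f, terms, true));
        System.out.println("long token:     " + expand("foo", 0.5f, terms, true));
    }
}
```

With the empty rewrite, a short token such as "ab" matches nothing even though "ab" itself is indexed; with the exact-term fallback it still matches itself, and the enumeration is skipped in both cases.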

 short circuit FuzzyQuery.rewrite when input token length is small compared to 
 minSimilarity
 ---

 Key: LUCENE-1124
 URL: https://issues.apache.org/jira/browse/LUCENE-1124
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Hoss Man
Assignee: Mark Miller
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, 
 LUCENE-1124.patch


 I found this (unreplied to) email floating around in my Lucene folder from 
 during the holidays...
 {noformat}
 From: Timo Nentwig
 To: java-dev
 Subject: Fuzzy makes no sense for short tokens
 Date: Mon, 31 Dec 2007 16:01:11 +0100
 Message-Id: 200712311601.12255.luc...@nitwit.de
 Hi!
 it generally makes no sense to search fuzzy for short tokens because changing
 even only a single character of course already results in a high edit
 distance. So it actually only makes sense in this case:
if( token.length() > 1f / (1f - minSimilarity) )
 E.g. changing one character in a 3-letter token (foo) results in an edit
 distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
 we can save all the expensive rewrite() logic.
 {noformat}
 I don't know much about FuzzyQueries, but this reasoning seems sound ... 
 FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in 
 the event that the input token is shorter than some simple math on the 
 minSimilarity.  (i'm not smart enough to be certain that the math above is 
 right however ... it's been a while since i looked at Levenshtein distances 
 ... tests needed)
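
 As a rough check of the arithmetic in the quoted email, here is a minimal Java sketch. It assumes that the similarity after one edit on a token of length n is 1 - 1/n, so a 3-letter token scores about 0.67 (which appears to be the figure the email gives as 0.6). Requiring 1 - 1/n > minSimilarity rearranges to n > 1 / (1 - minSimilarity), the condition in the email. The class and method names are hypothetical and are not part of Lucene.

```java
// Hypothetical sketch of the threshold arithmetic; not Lucene code.
final class FuzzyThresholdSketch {

    // Assumed similarity after a single edit on a token of the given length.
    static float similarityAfterOneEdit(int tokenLength) {
        return 1f - 1f / tokenLength;
    }

    // Condition quoted in the email (assumes 0 <= minSimilarity < 1):
    // fuzzy expansion can only find non-exact matches for tokens longer
    // than 1 / (1 - minSimilarity).
    static boolean fuzzyWorthwhile(int tokenLength, float minSimilarity) {
        return tokenLength > 1f / (1f - minSimilarity);
    }

    public static void main(String[] args) {
        float minSimilarity = 0.5f; // the default mentioned in the email
        for (int len = 1; len <= 5; len++) {
            System.out.printf("length=%d oneEditSimilarity=%.3f worthwhile=%b%n",
                    len, similarityAfterOneEdit(len), fuzzyWorthwhile(len, minSimilarity));
        }
    }
}
```

 Under these assumptions, with minSimilarity at 0.5 the threshold is 2, so tokens of one or two characters could skip the term enumeration, while a 3-letter token such as "foo" still needs it.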

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-10-16 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1124:
---

Fix Version/s: (was: 2.9)
   3.0
   2.9.1

 short circuit FuzzyQuery.rewrite when input token length is small compared to 
 minSimilarity
 ---

 Key: LUCENE-1124
 URL: https://issues.apache.org/jira/browse/LUCENE-1124
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Hoss Man
Assignee: Mark Miller
Priority: Trivial
 Fix For: 2.9.1, 3.0

 Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, 
 LUCENE-1124.patch





[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-01-04 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1124:


Attachment: LUCENE-1124.patch

Updated to trunk.

I'm going to commit in a few days if no one objects.

 short circuit FuzzyQuery.rewrite when input token length is small compared to 
 minSimilarity
 ---

 Key: LUCENE-1124
 URL: https://issues.apache.org/jira/browse/LUCENE-1124
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Hoss Man
 Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch





[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-01-04 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1124:


Fix Version/s: 2.9
 Assignee: Mark Miller

 short circuit FuzzyQuery.rewrite when input token length is small compared to 
 minSimilarity
 ---

 Key: LUCENE-1124
 URL: https://issues.apache.org/jira/browse/LUCENE-1124
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch





[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2009-01-04 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1124:


Priority: Trivial  (was: Major)

 short circuit FuzzyQuery.rewrite when input token length is small compared to 
 minSimilarity
 ---

 Key: LUCENE-1124
 URL: https://issues.apache.org/jira/browse/LUCENE-1124
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Hoss Man
Assignee: Mark Miller
Priority: Trivial
 Fix For: 2.9

 Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch





[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity

2008-08-18 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-1124:
-

Summary: short circuit FuzzyQuery.rewrite when input token length is small 
compared to minSimilarity  (was: short circuit FuzzyQuery.rewrite when input 
okenlengh is small compared to minSimilarity)

 short circuit FuzzyQuery.rewrite when input token length is small compared to 
 minSimilarity
 ---

 Key: LUCENE-1124
 URL: https://issues.apache.org/jira/browse/LUCENE-1124
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Hoss Man
 Attachments: LUCENE-1124.patch, LUCENE-1124.patch

