[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1124:
---
    Fix Version/s: (was: 2.9)
                   3.0
                   2.9.1

> short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
> ---
>
> Key: LUCENE-1124
> URL: https://issues.apache.org/jira/browse/LUCENE-1124
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Query/Scoring
> Reporter: Hoss Man
> Assignee: Mark Miller
> Priority: Trivial
> Fix For: 2.9.1, 3.0
>
> Attachments: LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch, LUCENE-1124.patch
>
>
> I found this (unreplied to) email floating around in my Lucene folder from
> during the holidays...
> {noformat}
> From: Timo Nentwig
> To: java-dev
> Subject: Fuzzy makes no sense for short tokens
> Date: Mon, 31 Dec 2007 16:01:11 +0100
> Message-Id: <200712311601.12255.luc...@nitwit.de>
> Hi!
> it generally makes no sense to search fuzzy for short tokens because changing
> even only a single character of course already results in a high edit
> distance. So it actually only makes sense in this case:
>    if( token.length() > 1f / (1f - minSimilarity) )
> E.g. changing one character in a 3-letter token (foo) results in an edit
> distance of 0.6. And if minSimilarity (which is by default: 0.5 :-) is higher
> we can save all the expensive rewrite() logic.
> {noformat}
> I don't know much about FuzzyQueries, but this reasoning seems sound ...
> FuzzyQuery.rewrite should be able to completely skip all TermEnumeration in
> the event that the input token is shorter than some simple math on the
> minSimilarity. (i'm not smart enough to be certain that the math above is
> right however ... it's been a while since i looked at Levenshtein distances
> ... tests needed)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
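The length condition quoted in the issue can be checked with a few lines of arithmetic. The following is a minimal sketch, not Lucene code: it assumes that similarity is computed as 1 - editDistance / min(len(a), len(b)) and that a term matches only when its similarity is strictly greater than minSimilarity. Neither assumption is stated in the thread, and the class and method names (`FuzzyShortCircuit`, `bestNonExactSimilarity`, `fuzzyCanMatch`) are hypothetical.

```java
// Sketch only: plain Java, no Lucene classes involved.
class FuzzyShortCircuit {

    // Best similarity any non-identical term can reach for a token of the
    // given length, ASSUMING similarity = 1 - editDistance / min(len(a), len(b)).
    // A single edit is the smallest possible non-zero distance.
    static float bestNonExactSimilarity(int tokenLength) {
        return 1f - 1f / tokenLength;
    }

    // The condition quoted in the email: fuzzy expansion can only find
    // non-exact matches when the token is longer than 1 / (1 - minSimilarity).
    static boolean fuzzyCanMatch(int tokenLength, float minSimilarity) {
        return tokenLength > 1f / (1f - minSimilarity);
    }

    public static void main(String[] args) {
        float[] thresholds = {0.5f, 0.7f};
        for (float minSimilarity : thresholds) {
            for (int len = 1; len <= 5; len++) {
                System.out.printf("len=%d minSimilarity=%.1f best=%.3f fuzzyCanMatch=%b%n",
                        len, minSimilarity, bestNonExactSimilarity(len),
                        fuzzyCanMatch(len, minSimilarity));
            }
        }
    }
}
```

Under these assumptions the two methods agree: `fuzzyCanMatch(n, s)` is true exactly when `bestNonExactSimilarity(n)` exceeds `s`, up to floating-point rounding, which is why the enumeration can be skipped when the condition is false.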
[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1124:
---
    Attachment: LUCENE-1124.patch

Attach patch (based on 2.9) showing the bug, along with the fix.

Instead of rewriting to empty BooleanQuery when prefix term is not long enough, I rewrite to TermQuery with that prefix. This way the exact term matches.

I'll commit shortly to trunk & 2.9.x.
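The difference between the two short-circuit behaviours described in this comment can be illustrated without Lucene. The sketch below is a plain-Java stand-in under stated assumptions: an in-memory list plays the role of the term dictionary, similarity is assumed to be 1 - editDistance / min(len(a), len(b)) with a strict greater-than comparison, and every name (`FuzzyRewriteSketch`, `expand`, `similarity`, `editDistance`) is hypothetical. It does not reproduce the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a list of strings stands in for the indexed terms.
class FuzzyRewriteSketch {

    // Classic two-row Levenshtein edit distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // ASSUMED normalization: 1 - distance / length of the shorter string.
    static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) return a.length() == b.length() ? 1f : 0f;
        return 1f - (float) editDistance(a, b) / shorter;
    }

    // Returns the indexed terms the fuzzy query would expand to.
    static List<String> expand(String token, float minSimilarity, List<String> indexedTerms) {
        List<String> matches = new ArrayList<>();
        if (!(token.length() > 1f / (1f - minSimilarity))) {
            // Short circuit: no non-exact term can pass the threshold, so skip
            // the enumeration. Returning an empty result here would also drop
            // the exact term; keeping it mirrors the "rewrite to TermQuery" fix.
            if (indexedTerms.contains(token)) matches.add(token);
            return matches;
        }
        for (String term : indexedTerms) {
            if (similarity(token, term) > minSimilarity) matches.add(term);
        }
        return matches;
    }
}
```

With the terms `foo`, `fop`, `food`, and `bar`, a call such as `expand("foo", 0.7f, terms)` takes the short-circuit branch and still returns `foo`, whereas an empty result would silently lose the exact match.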
[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1124:
    Priority: Trivial (was: Major)
[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1124:
    Fix Version/s: 2.9
    Assignee: Mark Miller
[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1124:
    Attachment: LUCENE-1124.patch

Updated to trunk. I'm going to commit in a few days if no one objects.
[jira] Updated: (LUCENE-1124) short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity
[ https://issues.apache.org/jira/browse/LUCENE-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated LUCENE-1124:
---
    Summary: short circuit FuzzyQuery.rewrite when input token length is small compared to minSimilarity (was: short circuit FuzzyQuery.rewrite when input okenlengh is small compared to minSimilarity)