[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2089: -- Affects Version/s: Flex Branch Fix Version/s: Flex Branch explore using automaton for fuzzyquery -- Key: LUCENE-2089 URL: https://issues.apache.org/jira/browse/LUCENE-2089 Project: Lucene - Java Issue Type: Wish Components: Search Affects Versions: Flex Branch Reporter: Robert Muir Assignee: Mark Miller Priority: Minor Fix For: Flex Branch Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea * up front, calculate the maximum required K edits needed to match the users supplied float threshold. * for at least small common E up to some max K (1,2,3, etc) we should create a DFA for each E. if the required E is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on avg, at high E, you will typically fill the pq very quickly since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode. i modified my wildcard benchmark to generate random fuzzy queries. * Pattern: 7N stands for NNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA-DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833812#action_12833812 ] Robert Muir commented on LUCENE-329: in my opinion this is now fixable in a very nice way. we can make an alternative rewrite method for fuzzy that does just like TopTermsRewrite, except it creates a BooleanQuery of ConstantScore queries instead. this way the score will be equal to the boost. then users could choose which one they want to use. Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Assignee: Lucene Developers Priority: Minor Attachments: patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833812#action_12833812 ] Robert Muir commented on LUCENE-329: in my opinion this is now fixable in a very nice way. we can make an alternative rewrite method for fuzzy that does just like TopTermsRewrite, except it creates a BooleanQuery of ConstantScore queries instead. this way the score will be equal to the boost. then users could choose which one they want to use. Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Assignee: Lucene Developers Priority: Minor Attachments: patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833822#action_12833822 ] Mark Harwood commented on LUCENE-329: - The problem with ignoring IDF completely is that it doesn't help balance partial matches where there is 1 fuzzy element in the query e.g.in a query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery) Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Assignee: Lucene Developers Priority: Minor Attachments: patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833822#action_12833822 ] Mark Harwood commented on LUCENE-329: - The problem with ignoring IDF completely is that it doesn't help balance partial matches where there is 1 fuzzy element in the query e.g.in a query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery) Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Assignee: Lucene Developers Priority: Minor Attachments: patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833827#action_12833827 ] Shai Erera commented on LUCENE-1720: Ok, it looks like ConcurrentHashMap cannot be used by ATM because of its TimeoutThread. When the thread checks the map and iterates on its elements, we need to lock the map completely because otherwise we can hit a case where the thread pulled an entry for inspection, and then the thread which is monitored calls stop(), but the timeout thread won't know that any might wrongly mark that thread as a timeout activity thread ... Since checkForTimeout does not need to obtain any lock, and the synchronization happens on start/stop/isProjected and TimeoutThread, I'm not sure how important is it to use CHM. Another thing, CHM allows you to specify the concurrency level, which is essentially the number of threads you know are going to 'change' the map (put or remove). If you don't know that, and default to clevel=1, it means only one thread is expected to change the map, which in ATM's case would yield to 0 benefit, since all the threads (which their number we don't know up front) change the map ... of course we can guess, or try to, but ... Anyway, I'm putting that aside for now, and moving no to adding more tests to TestTimeLimitingReader. bq. getWrappedReader Mark, the exception in the tests is thrown from SegmentReader.getOnlySegmentReader. I think if we add this method to FilterIndexReader only and in getOnlySegmentReader we add this code: {code} if (reader instanceof FilterIndexReader) { return getOnlySegmentReader(((FilterIndexReader) reader).getWrappedReader()); } {code} It should work? This can be done in this issue I think as it doesn't break back-compat and exposes a reasonable method where it should be. What do you think? TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833828#action_12833828 ] Robert Muir commented on LUCENE-329: bq. The problem with ignoring IDF completely is that it doesn't help balance partial matches where there is 1 fuzzy element in the query e.g.in a query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery) Mark, it wouldn't lose any features. we simply provide another option, just like we do for other MultiTermQuery rewrites for other queries, so users can choose what they want to use. its just an additional choice. Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Assignee: Lucene Developers Priority: Minor Attachments: patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833828#action_12833828 ] Robert Muir commented on LUCENE-329: bq. The problem with ignoring IDF completely is that it doesn't help balance partial matches where there is 1 fuzzy element in the query e.g.in a query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery) Mark, it wouldn't lose any features. we simply provide another option, just like we do for other MultiTermQuery rewrites for other queries, so users can choose what they want to use. its just an additional choice. Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Assignee: Lucene Developers Priority: Minor Attachments: patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-329: --- Attachment: LUCENE-329.patch here is a rough patch Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Assignee: Lucene Developers Priority: Minor Attachments: LUCENE-329.patch, patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-329: --- Attachment: LUCENE-329.patch here is a rough patch Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Assignee: Lucene Developers Priority: Minor Attachments: LUCENE-329.patch, patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833833#action_12833833 ] Mark Harwood commented on LUCENE-1720: -- bq. Anyway, I'm putting that aside for now, and moving no to adding more tests to TestTimeLimitingReader. OK. I always shudder when I see lists of if instanceof... logic. My suggestion of getWrappedReader was intended for broader use - there are other reasons to wrap a reader e.g. security. I was thinking of putting it on IndexReader but maybe the convenience wrapper base class FilterIndexReader would be a better home - most reader-wrappers would use this as a base class? TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833835#action_12833835 ] Robert Muir commented on LUCENE-329: bq. I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery) oh I see, this issue is really different from LUCENE-124 (they arent technically duplicates). I can move my patch there. You can still create a 'smarter' method here, it won't get in the way as now FuzzyQuery does not have a hardcoded rewrite method. Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Assignee: Lucene Developers Priority: Minor Attachments: LUCENE-329.patch, patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833835#action_12833835 ] Robert Muir commented on LUCENE-329: bq. I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. Obliterating IDF completely as a factor would lose this feature (available in FuzzyLikeThisQuery) oh I see, this issue is really different from LUCENE-124 (they arent technically duplicates). I can move my patch there. You can still create a 'smarter' method here, it won't get in the way as now FuzzyQuery does not have a hardcoded rewrite method. Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Assignee: Lucene Developers Priority: Minor Attachments: LUCENE-329.patch, patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-329: Assignee: (was: Lucene Developers) Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Priority: Minor Attachments: LUCENE-329.patch, patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-329: Assignee: (was: Lucene Developers) Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Priority: Minor Attachments: LUCENE-329.patch, patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
[ https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-124: Assignee: (was: Lucene Developers) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc Key: LUCENE-124 URL: https://issues.apache.org/jira/browse/LUCENE-124 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2 Environment: Operating System: All Platform: All Reporter: Cormac Twomey Priority: Minor According to the website's Query Syntax page, fuzzy searches are given a boost of 0.2. I've found this not to be the case, and have seen situations where exact matches have lower relevance scores than fuzzy matches. Rather than getting a boost of 0.2, it appears that all variations on the term are first found in the model, where dist* 0.5. * dist = levenshteinDistance / length of min(termlength, variantlength) This then leads to a boolean OR search of all the variant terms, each of whose boost is set to (dist - 0.5)*2 for that variant. The upshot of all of this is that there are many cases where a fuzzy match will get a higher relevance score than an exact match. See this email for a test case to reproduce this anomalous behaviour. http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html Here is a candidate patch to address the issue - *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java Sun Jun 09 13:47:54 2002 --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java Fri Mar 14 11:37:20 2003 *** *** 99,105 } final protected float difference() { ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { --- 99,109 } final protected float difference() { ! if (distance == 1.0) { ! return 1.0f; ! } ! else ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { *** *** 111,117 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 1.0f / (1.0f - FUZZY_THRESHOLD); /** Finds and returns the smallest of three integers --- 115,121 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD)); /** Finds and returns the smallest of three integers -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833838#action_12833838 ] Shai Erera commented on LUCENE-1720: Hmm ... I've added the method above, and it solved the issue in one of the tests, but revealed other issues with FilterIndexReader, like for example it doesn't override getUniqueTermCount() and reopen() ... so maybe we need to change the current TestFilterIndexReader to extend TestIndexReader and override getReader() ... this looks like a separate issue, though it will block this issue ... A change in TestIndexReader is also required in one test which asserts that the instance of the reader is DirectoryReader (or ReadOnlySegmentReader), but that's easy to do by for example adding a isInstanceOf(IndexReader) method to TestIndexReader which will be overridden in TestFilterIndexReader and TestTimeLimitingIndexReader. TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
[ https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-124: Assignee: (was: Lucene Developers) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc Key: LUCENE-124 URL: https://issues.apache.org/jira/browse/LUCENE-124 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2 Environment: Operating System: All Platform: All Reporter: Cormac Twomey Priority: Minor According to the website's Query Syntax page, fuzzy searches are given a boost of 0.2. I've found this not to be the case, and have seen situations where exact matches have lower relevance scores than fuzzy matches. Rather than getting a boost of 0.2, it appears that all variations on the term are first found in the model, where dist* 0.5. * dist = levenshteinDistance / length of min(termlength, variantlength) This then leads to a boolean OR search of all the variant terms, each of whose boost is set to (dist - 0.5)*2 for that variant. The upshot of all of this is that there are many cases where a fuzzy match will get a higher relevance score than an exact match. See this email for a test case to reproduce this anomalous behaviour. http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html Here is a candidate patch to address the issue - *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java Sun Jun 09 13:47:54 2002 --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java Fri Mar 14 11:37:20 2003 *** *** 99,105 } final protected float difference() { ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { --- 99,109 } final protected float difference() { ! if (distance == 1.0) { ! return 1.0f; ! } ! else ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { *** *** 111,117 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 1.0f / (1.0f - FUZZY_THRESHOLD); /** Finds and returns the smallest of three integers --- 115,121 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD)); /** Finds and returns the smallest of three integers -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833840#action_12833840 ] Mark Harwood commented on LUCENE-329: - My best-practice suggestion isn't as simple as offering a choice between preserving IDF for all terms or not. Instead, it is a proposal that we should use the *input* term's IDF for scoring all variants of the same root term (or taking an average of variants where the root term does not exist). This I feel preserves the benefits of keeping IDF as a factor (as in my John~ Patitucci~ balancing example) but also eliminating the side effects we see where a rare mis-spelling beats exact matches. Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Priority: Minor Attachments: LUCENE-329.patch, patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833841#action_12833841 ] Robert Muir commented on LUCENE-329: bq. My best-practice suggestion isn't as simple as offering a choice between preserving IDF for all terms or not. Mark, right, my mistake. I will move this patch to LUCENE-124 so there is a simple alternative, you can proceed here with a smarter method... sorry i got confused amongst the different issues :) Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Priority: Minor Attachments: patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-329: --- Attachment: (was: LUCENE-329.patch) Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Priority: Minor Attachments: patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833845#action_12833845 ] Shai Erera commented on LUCENE-1720: bq. most reader-wrappers would use this as a base class? In my understanding it should be. I've started making this change, and TimeLimitingIndexReader does not need for example to override getUniqueTermCount or other such methods which are passed blindly to the wrapped reader. I think it makes sense to request that a wrapping reader should sub-class FilterIndexReader. Of course, because of the semantics of the various reopen methods, it is not enough to provide one impl in FilterIndexReader and rely on that in sub-classes, because they need to wrap the reader with themselves as a new instance (I BTW found and fixed a bug in TimeLimitingIndexReader.reopen which returned the wrapped reopened instance if it wasn't changed, instead of itself). We can get over that by offering a protected getNewInstance(IndexReader) which will be overridden by sub-classes ... TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
[ https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-124: --- Attachment: LUCENE-124.patch Here is a patch, with a test for the issue. This patch adds TOP_TERMS_CONSTANT_BOOLEAN_REWRITE to complement TOP_TERMS_SCORING_BOOLEAN_REWRITE. Note: this solution is different than LUCENE-329, but I think this rewrite method could be useful for other queries as well. example usage: {code} FuzzyQuery query = new FuzzyQuery(new Term(field, Lucene)); query.setRewriteMethod(MultiTermQuery.TOP_TERMS_CONSTANT_BOOLEAN_REWRITE); ScoreDoc[] hits = searcher.search(query, ...) ... {code} Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc Key: LUCENE-124 URL: https://issues.apache.org/jira/browse/LUCENE-124 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2 Environment: Operating System: All Platform: All Reporter: Cormac Twomey Priority: Minor Attachments: LUCENE-124.patch According to the website's Query Syntax page, fuzzy searches are given a boost of 0.2. I've found this not to be the case, and have seen situations where exact matches have lower relevance scores than fuzzy matches. Rather than getting a boost of 0.2, it appears that all variations on the term are first found in the model, where dist* 0.5. * dist = levenshteinDistance / length of min(termlength, variantlength) This then leads to a boolean OR search of all the variant terms, each of whose boost is set to (dist - 0.5)*2 for that variant. The upshot of all of this is that there are many cases where a fuzzy match will get a higher relevance score than an exact match. See this email for a test case to reproduce this anomalous behaviour. http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html Here is a candidate patch to address the issue - *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java Sun Jun 09 13:47:54 2002 --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java Fri Mar 14 11:37:20 2003 *** *** 99,105 } final protected float difference() { ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { --- 99,109 } final protected float difference() { ! if (distance == 1.0) { ! return 1.0f; ! } ! else ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { *** *** 111,117 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 1.0f / (1.0f - FUZZY_THRESHOLD); /** Finds and returns the smallest of three integers --- 115,121 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD)); /** Finds and returns the smallest of three integers -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833860#action_12833860 ] Eks Dev commented on LUCENE-329: {quote} query for John~ Patitucci~ I'm probably more interested in a partial match on the rarer surname than a partial match on the common forename. {quote} as a matter of fact, we have not only one frequency to consider, rather two Term frequencies! consider simpler case Query term: Johan //would be High frequency term gives: Fuzzy Expanded term1 Johana // High frequency Fuzzy Expanded term2 Joahn // Low Freq I guess you would like to score the second term higher, meaning Lower frequency (higher IDF)... So far so good. Now turn it upside down and search for LF typo Joahn... in that case you would preffer HF Term Johan from expanded list to score higher... Point being, this situation here is just not complete without taking both frequencies into consideration (Query Term and Expanded term). In my experience, some simple nonlinear hints based on these two freqs bring some easy precision points (HF-LF Pairs are much more likely to be typos that two HF-HF... ). Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Priority: Minor Attachments: patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833863#action_12833863 ] Mark Harwood commented on LUCENE-1720: -- bq. BTW found and fixed a bug in TimeLimitingIndexReader.reopen which returned the wrapped reopened instance if it wasn't changed, instead of itself Good catch. bq. We can get over that by offering a protected getNewInstance(IndexReader) which will be overridden by sub-classes Would that be abstract? That would effectively help force subclasses to do the right thing when reopening but introduce a back-compatibility issue. If we don't make it abstract what would be the default implementation of this method? Maybe it's all best handled by simply adding a note saying you really should think about overriding reopen in FilterIndexReader's javadocs? TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
[ https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned LUCENE-124: -- Assignee: Robert Muir Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc Key: LUCENE-124 URL: https://issues.apache.org/jira/browse/LUCENE-124 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2 Environment: Operating System: All Platform: All Reporter: Cormac Twomey Assignee: Robert Muir Priority: Minor Attachments: LUCENE-124.patch According to the website's Query Syntax page, fuzzy searches are given a boost of 0.2. I've found this not to be the case, and have seen situations where exact matches have lower relevance scores than fuzzy matches. Rather than getting a boost of 0.2, it appears that all variations on the term are first found in the model, where dist* 0.5. * dist = levenshteinDistance / length of min(termlength, variantlength) This then leads to a boolean OR search of all the variant terms, each of whose boost is set to (dist - 0.5)*2 for that variant. The upshot of all of this is that there are many cases where a fuzzy match will get a higher relevance score than an exact match. See this email for a test case to reproduce this anomalous behaviour. http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html Here is a candidate patch to address the issue - *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java Sun Jun 09 13:47:54 2002 --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java Fri Mar 14 11:37:20 2003 *** *** 99,105 } final protected float difference() { ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { --- 99,109 } final protected float difference() { ! if (distance == 1.0) { ! return 1.0f; ! } ! else ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { *** *** 111,117 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 1.0f / (1.0f - FUZZY_THRESHOLD); /** Finds and returns the smallest of three integers --- 115,121 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD)); /** Finds and returns the smallest of three integers -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
[ https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833865#action_12833865 ] Robert Muir commented on LUCENE-124: I will wait till after the code freeze and commit this in a few days if no one objects. I don't claim its a 'best-practice' fix for fuzzy (see LUCENE-329 for ideas on that), I just think TOP_TERMS_CONSTANT_BOOLEAN_REWRITE is a useful complement to TOP_TERMS_SCORING_BOOLEAN_REWRITE, for MultiTermQueries that want the Top-N terms expansion, but the constant score behavior of CONSTANT_BOOLEAN_REWRITE. this patch doesnt change any defaults for fuzzy either. in fact its not specific to fuzzy at all. Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc Key: LUCENE-124 URL: https://issues.apache.org/jira/browse/LUCENE-124 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2 Environment: Operating System: All Platform: All Reporter: Cormac Twomey Assignee: Robert Muir Priority: Minor Attachments: LUCENE-124.patch According to the website's Query Syntax page, fuzzy searches are given a boost of 0.2. I've found this not to be the case, and have seen situations where exact matches have lower relevance scores than fuzzy matches. Rather than getting a boost of 0.2, it appears that all variations on the term are first found in the model, where dist* 0.5. * dist = levenshteinDistance / length of min(termlength, variantlength) This then leads to a boolean OR search of all the variant terms, each of whose boost is set to (dist - 0.5)*2 for that variant. The upshot of all of this is that there are many cases where a fuzzy match will get a higher relevance score than an exact match. See this email for a test case to reproduce this anomalous behaviour. http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html Here is a candidate patch to address the issue - *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java Sun Jun 09 13:47:54 2002 --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java Fri Mar 14 11:37:20 2003 *** *** 99,105 } final protected float difference() { ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { --- 99,109 } final protected float difference() { ! if (distance == 1.0) { ! return 1.0f; ! } ! else ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { *** *** 111,117 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 1.0f / (1.0f - FUZZY_THRESHOLD); /** Finds and returns the smallest of three integers --- 115,121 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD)); /** Finds and returns the smallest of three integers -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833867#action_12833867 ] Shai Erera commented on LUCENE-1720: bq. Would that be abstract? ... I've actually went ahead and implemented a wrap(IndexReader) method. The default impl returns a new FilterIndexReader, since it's a concrete class. We should add it to its javadocs, saying that you'd better override that method as well. Since it's not just for reopen (on all its 3 variants), but for getSequentialSubReaders as well, I think it's useful ... BTW, you seem to implemented correctly getSequentialSubReaders on TimeLimitingIndexReader, while FilterIndexReader doesn't ... so I pulled your implementation up to FIR and by calling wrap(IndexReader) for every sub reader, both FIR and TLIR implement it correctly. I'm now working on adding more tests to TestTLIR. Hope to post a patch really soon. TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833872#action_12833872 ] Renaud Delbru commented on LUCENE-1410: --- When performing some tests on PFOR, I have noticed that your algorithm was favoring very small numFrameBits for a very large number of exceptions. For example, on block of size 512, I have noticed that many of the blocks was with bestFrameBits = 1, and the number of exceptions reaching 450. I found that this was due to the seeting of the allowedNumExceptions variable (in the last part of the PFOR#frameBitsForCompression() method) which was set to the number of current exceptions + the maximum allowed (which at the end is generally extremely large).Is it a bug, or is it something I don't understand in the current PFOR algorithm ? P.S.: btw, the previous benchmark results I have posted are wrong due to a bug which was due to the hardcoded byte buffer size (1024) in PForIndexInput/Output. I'll post soon updated results, with a comparison with GroupVarInt (from WSDM09 - Jeff Dean talk). PFOR implementation --- Key: LUCENE-1410 URL: https://issues.apache.org/jira/browse/LUCENE-1410 Project: Lucene - Java Issue Type: New Feature Components: Other Reporter: Paul Elschot Priority: Minor Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, TestPFor2.java Original Estimate: 21840h Remaining Estimate: 21840h Implementation of Patched Frame of Reference. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833876#action_12833876 ] Mark Harwood commented on LUCENE-329: - bq. consider simpler case OK - but we need to remember that it is important to achieve balance _across_ different fuzzy queries as well as terms _within_ the same fuzzy query. Let's stick to the terms within a single fuzzy query for now: bq. I guess you would like to score the second term higher, meaning Lower frequency No, variant's frequency is not a deciding factor - only edit distance. Johana is similarity 0.6 while Johana is 0.2 so I would favour result one (although the this difference seems a little off in this case) The basic assumption is that user's input is valid and not a typo (deriving spelling suggestions etc are a different topic and one we shouldnt try cover here). Fuzzy matching can drag in all sorts of unqualified variants with massively different frequencies. Because we cannot control these discrepancies we should reward all these alternatives using the known factors we have to hand - the IDF of the user's supposedly valid input and the similarity measure of each variant compared to the input. We could get fancy about probability of variants given the other input terms in the query but that feels like its straying into spell checker territory and ngrams etc. Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Priority: Minor Attachments: patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
[ https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833885#action_12833885 ] Robert Muir commented on LUCENE-124: uwe pointed out to me, i think there is a naming problem with TOP_TERMS_CONSTANT_BOOLEAN_REWRITE, as the entire booleanquery will not produce the same score like CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE. I think the behavior makes sense though, as it wouldnt make sense to use TOP_TERMS without per-term boosting, but we need to fix the naming... and TOP_TERMS_BOOST_BOOLEAN_REWRITE sounds confusing. Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc Key: LUCENE-124 URL: https://issues.apache.org/jira/browse/LUCENE-124 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2 Environment: Operating System: All Platform: All Reporter: Cormac Twomey Assignee: Robert Muir Priority: Minor Attachments: LUCENE-124.patch According to the website's Query Syntax page, fuzzy searches are given a boost of 0.2. I've found this not to be the case, and have seen situations where exact matches have lower relevance scores than fuzzy matches. Rather than getting a boost of 0.2, it appears that all variations on the term are first found in the model, where dist* 0.5. * dist = levenshteinDistance / length of min(termlength, variantlength) This then leads to a boolean OR search of all the variant terms, each of whose boost is set to (dist - 0.5)*2 for that variant. The upshot of all of this is that there are many cases where a fuzzy match will get a higher relevance score than an exact match. See this email for a test case to reproduce this anomalous behaviour. http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html Here is a candidate patch to address the issue - *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java Sun Jun 09 13:47:54 2002 --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java Fri Mar 14 11:37:20 2003 *** *** 99,105 } final protected float difference() { ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { --- 99,109 } final protected float difference() { ! if (distance == 1.0) { ! return 1.0f; ! } ! else ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR); } final public boolean endEnum() { *** *** 111,117 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 1.0f / (1.0f - FUZZY_THRESHOLD); /** Finds and returns the smallest of three integers --- 115,121 **/ public static final double FUZZY_THRESHOLD = 0.5; ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f - FUZZY_THRESHOLD)); /** Finds and returns the smallest of three integers -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1720: --- Attachment: LUCENE-1720.patch Added patch contains, above all previous things: * Changes to FitlerIndexReader to better support extensions. * Changes in TimeLimitingIndexReader based on the previous bullet. * Changes to TestIndexReader to better support extensions. * TestFilterIndexReader now extends TestIndexReader. * TestTimeLimitingIndexReader was added few test methods. Although some of the changes can be viewed as belonging to a separate issue, I think they are not very big and fit nicely here. It allows testing the new TimeLimitingIndexReader which is important. Mark, the only thing that remains is to convert TimeLimitingIndexReaderBenchmark to a benchmark algorithm/task. Would you mind taking a stab at this? If not, I'll try to get to it later this week (perhaps in a day or two). TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, LUCENE-1720.patch, Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-329) Fuzzy query scoring issues
[ https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833876#action_12833876 ] Mark Harwood edited comment on LUCENE-329 at 2/15/10 5:05 PM: -- bq. consider simpler case OK - but we need to remember that it is important to achieve balance _across_ different fuzzy queries as well as terms _within_ the same fuzzy query. Let's stick to the terms within a single fuzzy query for now: bq. I guess you would like to score the second term higher, meaning Lower frequency No, variant's frequency is not a deciding factor - only edit distance. Johana is similarity 0.6 while Joahn is 0.2 so I would favour result one (although the this difference seems a little off in this case) The basic assumption is that user's input is valid and not a typo (deriving spelling suggestions etc are a different topic and one we shouldnt try cover here). Fuzzy matching can drag in all sorts of unqualified variants with massively different frequencies. Because we cannot control these discrepancies we should reward all these alternatives using the known factors we have to hand - the IDF of the user's supposedly valid input and the similarity measure of each variant compared to the input. We could get fancy about probability of variants given the other input terms in the query but that feels like its straying into spell checker territory and ngrams etc. was (Author: markh): bq. consider simpler case OK - but we need to remember that it is important to achieve balance _across_ different fuzzy queries as well as terms _within_ the same fuzzy query. Let's stick to the terms within a single fuzzy query for now: bq. I guess you would like to score the second term higher, meaning Lower frequency No, variant's frequency is not a deciding factor - only edit distance. Johana is similarity 0.6 while Johana is 0.2 so I would favour result one (although the this difference seems a little off in this case) The basic assumption is that user's input is valid and not a typo (deriving spelling suggestions etc are a different topic and one we shouldnt try cover here). Fuzzy matching can drag in all sorts of unqualified variants with massively different frequencies. Because we cannot control these discrepancies we should reward all these alternatives using the known factors we have to hand - the IDF of the user's supposedly valid input and the similarity measure of each variant compared to the input. We could get fancy about probability of variants given the other input terms in the query but that feels like its straying into spell checker territory and ngrams etc. Fuzzy query scoring issues -- Key: LUCENE-329 URL: https://issues.apache.org/jira/browse/LUCENE-329 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.2rc5 Environment: Operating System: All Platform: All Reporter: Mark Harwood Priority: Minor Attachments: patch.txt Queries which automatically produce multiple terms (wildcard, range, prefix, fuzzy etc)currently suffer from two problems: 1) Scores for matching documents are significantly smaller than term queries because of the volume of terms introduced (A match on query Foo~ is 0.1 whereas a match on query Foo is 1). 2) The rarer forms of expanded terms are favoured over those of more common forms because of the IDF. When using Fuzzy queries for example, rare mis- spellings typically appear in results before the more common correct spellings. I will attach a patch that corrects the issues identified above by 1) Overriding Similarity.coord to counteract the downplaying of scores introduced by expanding terms. 2) Taking the IDF factor of the most common form of expanded terms as the basis of scoring all other expanded terms. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833895#action_12833895 ] Paul Elschot commented on LUCENE-1410: -- That might indeed be a bug. In general the frameBitsForCompression() method should try and find the number that has the quickest decompression. The trade off is between exceptions taking extra time, and a smaller number of frame bits taking less time. I have not timed the exception decoding yet, so I would not expect the current implementation of frameBitsForCompression() to be right on the spot. Please note that I'd still like to change the exception encoding, see the comment of 12 May 2009: For an exception, we store its lower b bits instead of the offset to the next exception in its corresponding slot, while we store the higher overflow bits and the offset in two separate arrays. These two arrays are compressed using the Simple16 method. PFOR implementation --- Key: LUCENE-1410 URL: https://issues.apache.org/jira/browse/LUCENE-1410 Project: Lucene - Java Issue Type: New Feature Components: Other Reporter: Paul Elschot Priority: Minor Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, TestPFor2.java Original Estimate: 21840h Remaining Estimate: 21840h Implementation of Patched Frame of Reference. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: [VOTE] Lucene Java 2.9.2 and 3.0.1 release artifacts
As people.apache.org is down, here is an alternate location with the same artifacts: http://alpha.thetaphi.de/lucene-292-301-take1-rev910082/ Happy testing! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Monday, February 15, 2010 12:46 AM To: gene...@lucene.apache.org; java-dev@lucene.apache.org Subject: [VOTE] Lucene Java 2.9.2 and 3.0.1 release artifacts Hallo Folks, I have posted a release candidate for both Lucene Java 2.9.2 and 3.0.1 (which both have the same bug fix level, functionality and release announcement), build from revision 910082 of the corresponding branches. Thanks for all your help! Please test them and give your votes until Thursday morning, as the scheduled release date for both versions is Friday, Feb 19th, 2010. Only votes from Lucene PMC are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. We planned the parallel release with one announcement because of their parallel development / bug fix level to emphasize that they are equal except deprecation removal and Java 5 since major version 3. Please also read the attached release announcement (Open Document) and send it corrected back if you miss anything or want to improve my bad English :-) You find the artifacts here: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1- rev910082/ Maven repo: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1- rev910082/maven/ The changes are here: http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1- rev910082/changes-2.9.2/Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1- rev910082/changes-2.9.2/Contrib-Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1- rev910082/changes-3.0.1/Changes.html http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1- rev910082/changes-3.0.1/Contrib-Changes.html Uwe === Proposed Release Announcement === Hello Lucene users, On behalf of the Lucene development community I would like to announce the release of Lucene Java versions 3.0.1 and 2.9.2: Both releases fix bugs in the previous versions, where 2.9.2 is the last release working with Java 1.4, still providing all deprecated APIs of the Lucene Java 2.x series. 3.0.1 has the same bug fix level, but requires Java 5 and is no longer compatible with code using deprecated APIs. The API was cleaned up to make use of Java 5's generics, varargs, enums, and autoboxing. New users of Lucene are advised to use version 3.0.1 for new developments, because it has a clean, type safe new API. Users upgrading from 2.9.x can now remove unnecessary casts and add generics to their code, too. Important improvements in these releases are a increased maximum number of unique terms in each index segment. They also add fixes in IndexWriter’s commit and lost document deletes in near real-time indexing. Also lots of bugs in Contrib’s Analyzers package were fixed. Additionally, the 3.0.1 release restored some public methods, that get lost during deprecation removal. If you are using Lucene in a web application environment, you will notice that the new Attribute-based TokenStream API now works correct with different class loaders. Both releases are fully compatible with the corresponding previous versions. We strongly recommend upgrading to 2.9.2 if you are using 2.9.1 or 2.9.0; and to 3.0.1 if you are using 3.0.0. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833902#action_12833902 ] Mark Harwood commented on LUCENE-1720: -- bq. Mark, the only thing that remains is to convert TimeLimitingIndexReaderBenchmark to a benchmark algorithm/task. Would you mind taking a stab at this? Will need to look at existing benchmark tasks for guidance. I may get some time later. TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, LUCENE-1720.patch, Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1720: --- Attachment: LUCENE-1720.patch * Added some Javadocs in FilterIndexReader * Changed the exception text thrown from SegmentReader.getOnlySegmentReader(). * Added a CHANGES entry under new API TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, LUCENE-1720.patch, LUCENE-1720.patch, Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833922#action_12833922 ] Renaud Delbru commented on LUCENE-1410: --- {quote} In general the frameBitsForCompression() method should try and find the number that has the quickest decompression. {quote} In fact, the method is trying to find the configuration that has the smaller size in term of bytes (with the test totalBytes = bestBytes). But it seems that it is not always the best case for quick decompression. I haven't test the decompression speed of PFOR yet, but I imagine that with 89% of exceptions (while in the original algorithm exception should occurs only a few times), it should not be the best performance. Why not just computing the index (in the sorted copy array) that will represent the percentage of values that should be packed, and the remaining of the array encoded as exceptions ? PFOR implementation --- Key: LUCENE-1410 URL: https://issues.apache.org/jira/browse/LUCENE-1410 Project: Lucene - Java Issue Type: New Feature Components: Other Reporter: Paul Elschot Priority: Minor Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, TestPFor2.java Original Estimate: 21840h Remaining Estimate: 21840h Implementation of Patched Frame of Reference. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833952#action_12833952 ] Paul Elschot commented on LUCENE-1410: -- bq. Why not just computing the index (in the sorted copy array) that will represent the percentage of values that should be packed, and the remaining of the array encoded as exceptions ? Because there are cases in which no exceptions are needed, for example 90% of the values using the same max nr of bits. Anyway, the articles on PFor do not completely specify how this should be done, and they are based on C, not on Java. That means there will be room for improvement. PFOR implementation --- Key: LUCENE-1410 URL: https://issues.apache.org/jira/browse/LUCENE-1410 Project: Lucene - Java Issue Type: New Feature Components: Other Reporter: Paul Elschot Priority: Minor Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, TestPFor2.java Original Estimate: 21840h Remaining Estimate: 21840h Implementation of Patched Frame of Reference. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2089: Attachment: LUCENE-2089.patch here is a more fleshed out patch (rough but all tests pass): * FuzzyTermsEnum is a proxy enum, it delegates behavior to either LinearFuzzyTermsEnum (the old code) or AutomatonFuzzyTermsEnum, and can switch during enumeration. * TODO is to handle minBoostChanged, this is the event where the priority queue is full and the minimum score to be competitive has changed (the opportunity to switch to a more efficient enum). * code is generalized for all N, but i only include the table for N=0,1 for now. explore using automaton for fuzzyquery -- Key: LUCENE-2089 URL: https://issues.apache.org/jira/browse/LUCENE-2089 Project: Lucene - Java Issue Type: Wish Components: Search Affects Versions: Flex Branch Reporter: Robert Muir Assignee: Mark Miller Priority: Minor Fix For: Flex Branch Attachments: LUCENE-2089.patch, LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm) we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea * up front, calculate the maximum required K edits needed to match the users supplied float threshold. * for at least small common E up to some max K (1,2,3, etc) we should create a DFA for each E. if the required E is above our supported max, we use dumb mode at first (no seeking, no DFA, just brute force like now). As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq. This should work well on avg, at high E, you will typically fill the pq very quickly since you will match many terms. This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from dumb mode to smart mode. i modified my wildcard benchmark to generate random fuzzy queries. * Pattern: 7N stands for NNN, etc. * AvgMS_DFA: this is the time spent creating the automaton (constructor) ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| |7N|10|64.0|4155.9|38.6|20.3| |14N|10|0.0|2511.6|46.0|37.9| |28N|10|0.0|2506.3|93.0|86.6| |56N|10|0.0|2524.5|304.4|298.5| as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA-DFA conversion. So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there. instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org