[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-15 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2089:
--

Affects Version/s: Flex Branch
Fix Version/s: Flex Branch

 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: Flex Branch
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Fix For: Flex Branch

 Attachments: LUCENE-2089.patch, Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for NNN, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
 |7N|10|64.0|4155.9|38.6|20.3|
 |14N|10|0.0|2511.6|46.0|37.9| 
 |28N|10|0.0|2506.3|93.0|86.6|
 |56N|10|0.0|2524.5|304.4|298.5|
 as you can see, this prototype is no good yet, because it creates the DFA in 
 a slow way. right now it creates an NFA, and all this wasted time is in 
 NFA-DFA conversion.
 So, for a very long string, it just gets worse and worse. This has nothing to 
 do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
 AvgMS_DFA), there is no problem there.
 instead we should just build a DFA to begin with, maybe with this paper: 
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
 we can precompute the tables with that algorithm up to some reasonable K, and 
 then I think we are ok.
 the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
 linear minimization, if someone wants to implement this they should not worry 
 about minimization.
 in fact, we need to at some point determine if AutomatonQuery should even 
 minimize FSM's at all, or if it is simply enough for them to be deterministic 
 with no transitions to dead states. (The only code that actually assumes 
 minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a 
 summation easily). we need to benchmark really complex DFAs (i.e. write a 
 regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833812#action_12833812
 ] 

Robert Muir commented on LUCENE-329:


in my opinion this is now fixable in a very nice way.

we can make an alternative rewrite method for fuzzy that does just like 
TopTermsRewrite, except it creates a BooleanQuery of ConstantScore queries 
instead. this way the score will be equal to the boost.

then users could choose which one they want to use.

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Lucene Developers
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833812#action_12833812
 ] 

Robert Muir commented on LUCENE-329:


in my opinion this is now fixable in a very nice way.

we can make an alternative rewrite method for fuzzy that does just like 
TopTermsRewrite, except it creates a BooleanQuery of ConstantScore queries 
instead. this way the score will be equal to the boost.

then users could choose which one they want to use.

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Lucene Developers
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833822#action_12833822
 ] 

Mark Harwood commented on LUCENE-329:
-

The problem with ignoring IDF completely is that it doesn't help balance 
partial matches where there is 1 fuzzy element in the query e.g.in a query  
for John~ Patitucci~ I'm probably more interested in a partial match on the 
rarer surname than a partial match on the common forename. Obliterating IDF 
completely as a factor would lose this feature (available in FuzzyLikeThisQuery)


 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Lucene Developers
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833822#action_12833822
 ] 

Mark Harwood commented on LUCENE-329:
-

The problem with ignoring IDF completely is that it doesn't help balance 
partial matches where there is 1 fuzzy element in the query e.g.in a query  
for John~ Patitucci~ I'm probably more interested in a partial match on the 
rarer surname than a partial match on the common forename. Obliterating IDF 
completely as a factor would lose this feature (available in FuzzyLikeThisQuery)


 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Lucene Developers
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-15 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833827#action_12833827
 ] 

Shai Erera commented on LUCENE-1720:


Ok, it looks like ConcurrentHashMap cannot be used by ATM because of its 
TimeoutThread. When the thread checks the map and iterates on its elements, we 
need to lock the map completely because otherwise we can hit a case where the 
thread pulled an entry for inspection, and then the thread which is monitored 
calls stop(), but the timeout thread won't know that any might wrongly mark 
that thread as a timeout activity thread ...
Since checkForTimeout does not need to obtain any lock, and the synchronization 
happens on start/stop/isProjected and TimeoutThread, I'm not sure how important 
is it to use CHM.

Another thing, CHM allows you to specify the concurrency level, which is 
essentially the number of threads you know are going to 'change' the map (put 
or remove). If you don't know that, and default to clevel=1, it means only one 
thread is expected to change the map, which in ATM's case would yield to 0 
benefit, since all the threads (which their number we don't know up front) 
change the map ... of course we can guess, or try to, but ...

Anyway, I'm putting that aside for now, and moving no to adding more tests to 
TestTimeLimitingReader.

bq. getWrappedReader
Mark, the exception in the tests is thrown from 
SegmentReader.getOnlySegmentReader. I think if we add this method to 
FilterIndexReader only and in getOnlySegmentReader we add this code:
{code}
if (reader instanceof FilterIndexReader) {
  return getOnlySegmentReader(((FilterIndexReader) reader).getWrappedReader());
}
{code}
It should work? This can be done in this issue I think as it doesn't break 
back-compat and exposes a reasonable method where it should be. What do you 
think?

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833828#action_12833828
 ] 

Robert Muir commented on LUCENE-329:


bq. The problem with ignoring IDF completely is that it doesn't help balance 
partial matches where there is 1 fuzzy element in the query e.g.in a query for 
John~ Patitucci~ I'm probably more interested in a partial match on the rarer 
surname than a partial match on the common forename. Obliterating IDF 
completely as a factor would lose this feature (available in FuzzyLikeThisQuery)

Mark, it wouldn't lose any features. we simply provide another option, just 
like we do for other MultiTermQuery rewrites for other queries, so users can 
choose what they want to use. its just an additional choice.

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Lucene Developers
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833828#action_12833828
 ] 

Robert Muir commented on LUCENE-329:


bq. The problem with ignoring IDF completely is that it doesn't help balance 
partial matches where there is 1 fuzzy element in the query e.g.in a query for 
John~ Patitucci~ I'm probably more interested in a partial match on the rarer 
surname than a partial match on the common forename. Obliterating IDF 
completely as a factor would lose this feature (available in FuzzyLikeThisQuery)

Mark, it wouldn't lose any features. we simply provide another option, just 
like we do for other MultiTermQuery rewrites for other queries, so users can 
choose what they want to use. its just an additional choice.

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Lucene Developers
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-329:
---

Attachment: LUCENE-329.patch

here is a rough patch

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Lucene Developers
Priority: Minor
 Attachments: LUCENE-329.patch, patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-329:
---

Attachment: LUCENE-329.patch

here is a rough patch

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Lucene Developers
Priority: Minor
 Attachments: LUCENE-329.patch, patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833833#action_12833833
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. Anyway, I'm putting that aside for now, and moving no to adding more tests 
to TestTimeLimitingReader.

OK.

I always shudder when I see lists of if instanceof... logic.

My suggestion of getWrappedReader was intended for broader use - there are 
other reasons to wrap a reader e.g. security.
I was thinking of putting it on IndexReader but maybe the convenience wrapper 
base class FilterIndexReader would be a better home - most reader-wrappers 
would use this as a base class?


 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833835#action_12833835
 ] 

Robert Muir commented on LUCENE-329:


bq. I'm probably more interested in a partial match on the rarer surname than a 
partial match on the common forename. Obliterating IDF completely as a factor 
would lose this feature (available in FuzzyLikeThisQuery)

oh I see, this issue is really different from LUCENE-124 (they arent 
technically duplicates). I can move my patch there. You can still create a 
'smarter' method here, it won't get in the way as now FuzzyQuery does not have 
a hardcoded rewrite method.

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Lucene Developers
Priority: Minor
 Attachments: LUCENE-329.patch, patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833835#action_12833835
 ] 

Robert Muir commented on LUCENE-329:


bq. I'm probably more interested in a partial match on the rarer surname than a 
partial match on the common forename. Obliterating IDF completely as a factor 
would lose this feature (available in FuzzyLikeThisQuery)

oh I see, this issue is really different from LUCENE-124 (they arent 
technically duplicates). I can move my patch there. You can still create a 
'smarter' method here, it won't get in the way as now FuzzyQuery does not have 
a hardcoded rewrite method.

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Assignee: Lucene Developers
Priority: Minor
 Attachments: LUCENE-329.patch, patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned LUCENE-329:


Assignee: (was: Lucene Developers)

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Priority: Minor
 Attachments: LUCENE-329.patch, patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned LUCENE-329:


Assignee: (was: Lucene Developers)

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Priority: Minor
 Attachments: LUCENE-329.patch, patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc

2010-02-15 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned LUCENE-124:


Assignee: (was: Lucene Developers)

 Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
 

 Key: LUCENE-124
 URL: https://issues.apache.org/jira/browse/LUCENE-124
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Cormac Twomey
Priority: Minor

 According to the website's Query Syntax page, fuzzy searches are given a
 boost of 0.2. I've found this not to be the case, and have seen situations 
 where
 exact matches have lower relevance scores than fuzzy matches.
 Rather than getting a boost of 0.2, it appears that all variations on the term
 are first found in the model, where dist*  0.5.
 * dist = levenshteinDistance / length of min(termlength, variantlength)
 This then leads to a boolean OR search of all the variant terms, each of whose
 boost is set to (dist - 0.5)*2 for that variant.
 The upshot of all of this is that there are many cases where a fuzzy match 
 will
 get a higher relevance score than an exact match.
 See this email for a test case to reproduce this anomalous behaviour.
 http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html
 Here is a candidate patch to address the issue -
 *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java   Sun Jun 
 09
 13:47:54 2002
 --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java  
 Fri
 Mar 14 11:37:20 2003
 ***
 *** 99,105 
   }
   
   final protected float difference() {
 ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
   }
   
   final public boolean endEnum() {
 --- 99,109 
   }
   
   final protected float difference() {
 ! if (distance == 1.0) {
 ! return 1.0f;
 ! }
 ! else
 ! return (float)((distance - FUZZY_THRESHOLD) * 
 SCALE_FACTOR);
   }
   
   final public boolean endEnum() {
 ***
 *** 111,117 
**/
   
   public static final double FUZZY_THRESHOLD = 0.5;
 ! public static final double SCALE_FACTOR = 1.0f / (1.0f - 
 FUZZY_THRESHOLD);
   
   /**
Finds and returns the smallest of three integers 
 --- 115,121 
**/
   
   public static final double FUZZY_THRESHOLD = 0.5;
 ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f -
 FUZZY_THRESHOLD));
   
   /**
Finds and returns the smallest of three integers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-15 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833838#action_12833838
 ] 

Shai Erera commented on LUCENE-1720:


Hmm ... I've added the method above, and it solved the issue in one of the 
tests, but revealed other issues with FilterIndexReader, like for example it 
doesn't override getUniqueTermCount() and reopen() ... so maybe we need to 
change the current TestFilterIndexReader  to extend TestIndexReader and 
override getReader() ... this looks like a separate issue, though it will block 
this issue ...
A change in TestIndexReader is also required in one test which asserts that the 
instance of the reader is DirectoryReader (or ReadOnlySegmentReader), but 
that's easy to do by for example adding a isInstanceOf(IndexReader) method to 
TestIndexReader which will be overridden in TestFilterIndexReader and 
TestTimeLimitingIndexReader.

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc

2010-02-15 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned LUCENE-124:


Assignee: (was: Lucene Developers)

 Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
 

 Key: LUCENE-124
 URL: https://issues.apache.org/jira/browse/LUCENE-124
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Cormac Twomey
Priority: Minor

 According to the website's Query Syntax page, fuzzy searches are given a
 boost of 0.2. I've found this not to be the case, and have seen situations 
 where
 exact matches have lower relevance scores than fuzzy matches.
 Rather than getting a boost of 0.2, it appears that all variations on the term
 are first found in the model, where dist*  0.5.
 * dist = levenshteinDistance / length of min(termlength, variantlength)
 This then leads to a boolean OR search of all the variant terms, each of whose
 boost is set to (dist - 0.5)*2 for that variant.
 The upshot of all of this is that there are many cases where a fuzzy match 
 will
 get a higher relevance score than an exact match.
 See this email for a test case to reproduce this anomalous behaviour.
 http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html
 Here is a candidate patch to address the issue -
 *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java   Sun Jun 
 09
 13:47:54 2002
 --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java  
 Fri
 Mar 14 11:37:20 2003
 ***
 *** 99,105 
   }
   
   final protected float difference() {
 ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
   }
   
   final public boolean endEnum() {
 --- 99,109 
   }
   
   final protected float difference() {
 ! if (distance == 1.0) {
 ! return 1.0f;
 ! }
 ! else
 ! return (float)((distance - FUZZY_THRESHOLD) * 
 SCALE_FACTOR);
   }
   
   final public boolean endEnum() {
 ***
 *** 111,117 
**/
   
   public static final double FUZZY_THRESHOLD = 0.5;
 ! public static final double SCALE_FACTOR = 1.0f / (1.0f - 
 FUZZY_THRESHOLD);
   
   /**
Finds and returns the smallest of three integers 
 --- 115,121 
**/
   
   public static final double FUZZY_THRESHOLD = 0.5;
 ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f -
 FUZZY_THRESHOLD));
   
   /**
Finds and returns the smallest of three integers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833840#action_12833840
 ] 

Mark Harwood commented on LUCENE-329:
-

My best-practice suggestion isn't as simple as offering a choice between 
preserving IDF for all terms or not.

Instead, it is a proposal that we should use the *input* term's IDF for scoring 
all variants of the same root term (or taking an average of variants where the 
root term does not exist).

This I feel preserves the benefits of keeping IDF as a factor (as in my John~ 
Patitucci~ balancing example) but also eliminating the side effects we see 
where a rare mis-spelling beats exact matches.


 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Priority: Minor
 Attachments: LUCENE-329.patch, patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833841#action_12833841
 ] 

Robert Muir commented on LUCENE-329:


bq. My best-practice suggestion isn't as simple as offering a choice between 
preserving IDF for all terms or not. 

Mark, right, my mistake. I will move this patch to LUCENE-124 so there is a 
simple alternative, you can proceed here with a smarter method... sorry i got 
confused amongst the different issues :)

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-329:
---

Attachment: (was: LUCENE-329.patch)

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-15 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833845#action_12833845
 ] 

Shai Erera commented on LUCENE-1720:


bq. most reader-wrappers would use this as a base class?

In my understanding it should be. I've started making this change, and 
TimeLimitingIndexReader does not need for example to override 
getUniqueTermCount or other such methods which are passed blindly to the 
wrapped reader. I think it makes sense to request that a wrapping reader should 
sub-class FilterIndexReader.

Of course, because of the semantics of the various reopen methods, it is not 
enough to provide one impl in FilterIndexReader and rely on that in 
sub-classes, because they need to wrap the reader with themselves as a new 
instance (I BTW found and fixed a bug in TimeLimitingIndexReader.reopen which 
returned the wrapped reopened instance if it wasn't changed, instead of 
itself). We can get over that by offering a protected 
getNewInstance(IndexReader) which will be overridden by sub-classes ...

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc

2010-02-15 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-124:
---

Attachment: LUCENE-124.patch

Here is a patch, with a test for the issue.

This patch adds TOP_TERMS_CONSTANT_BOOLEAN_REWRITE to complement 
TOP_TERMS_SCORING_BOOLEAN_REWRITE.

Note: this solution is different than LUCENE-329, but I think this rewrite 
method could be useful for other queries as well.

example usage:
{code}
FuzzyQuery query = new FuzzyQuery(new Term(field, Lucene));
query.setRewriteMethod(MultiTermQuery.TOP_TERMS_CONSTANT_BOOLEAN_REWRITE);
ScoreDoc[] hits = searcher.search(query, ...)
 ...
{code}

 Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
 

 Key: LUCENE-124
 URL: https://issues.apache.org/jira/browse/LUCENE-124
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Cormac Twomey
Priority: Minor
 Attachments: LUCENE-124.patch


 According to the website's Query Syntax page, fuzzy searches are given a
 boost of 0.2. I've found this not to be the case, and have seen situations 
 where
 exact matches have lower relevance scores than fuzzy matches.
 Rather than getting a boost of 0.2, it appears that all variations on the term
 are first found in the model, where dist*  0.5.
 * dist = levenshteinDistance / length of min(termlength, variantlength)
 This then leads to a boolean OR search of all the variant terms, each of whose
 boost is set to (dist - 0.5)*2 for that variant.
 The upshot of all of this is that there are many cases where a fuzzy match 
 will
 get a higher relevance score than an exact match.
 See this email for a test case to reproduce this anomalous behaviour.
 http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html
 Here is a candidate patch to address the issue -
 *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java   Sun Jun 
 09
 13:47:54 2002
 --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java  
 Fri
 Mar 14 11:37:20 2003
 ***
 *** 99,105 
   }
   
   final protected float difference() {
 ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
   }
   
   final public boolean endEnum() {
 --- 99,109 
   }
   
   final protected float difference() {
 ! if (distance == 1.0) {
 ! return 1.0f;
 ! }
 ! else
 ! return (float)((distance - FUZZY_THRESHOLD) * 
 SCALE_FACTOR);
   }
   
   final public boolean endEnum() {
 ***
 *** 111,117 
**/
   
   public static final double FUZZY_THRESHOLD = 0.5;
 ! public static final double SCALE_FACTOR = 1.0f / (1.0f - 
 FUZZY_THRESHOLD);
   
   /**
Finds and returns the smallest of three integers 
 --- 115,121 
**/
   
   public static final double FUZZY_THRESHOLD = 0.5;
 ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f -
 FUZZY_THRESHOLD));
   
   /**
Finds and returns the smallest of three integers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833860#action_12833860
 ] 

Eks Dev commented on LUCENE-329:


{quote}
query for John~ Patitucci~ I'm probably more interested in a partial match on 
the rarer surname than a partial match on the common forename. 
{quote}


as a matter of fact, we have not only one frequency  to consider, rather two 
Term frequencies!

consider simpler case
Query term: Johan //would be High frequency term
gives:
Fuzzy Expanded term1 Johana // High frequency
Fuzzy Expanded term2 Joahn // Low Freq

I guess you would like to score the second term higher, meaning Lower frequency 
(higher IDF)... So far so good. 

Now turn it upside down and search for LF typo Joahn... in that case you 
would preffer HF Term Johan from expanded list to score higher...

Point being, this situation here is just not complete without taking both 
frequencies into consideration (Query Term and Expanded term). In my 
experience, some simple nonlinear hints based on these two freqs bring some 
easy precision points (HF-LF Pairs are much more likely to be typos that two 
HF-HF...  ). 


 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833863#action_12833863
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. BTW found and fixed a bug in TimeLimitingIndexReader.reopen which returned 
the wrapped reopened instance if it wasn't changed, instead of itself

Good catch.

bq. We can get over that by offering a protected getNewInstance(IndexReader) 
which will be overridden by sub-classes

Would that be abstract? That would effectively help force subclasses to do the 
right thing when reopening but introduce a back-compatibility issue.
If we don't make it abstract what would be the default implementation of this 
method?
Maybe it's all best handled by simply adding a note saying you really should 
think about overriding reopen in FilterIndexReader's javadocs?



 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc

2010-02-15 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned LUCENE-124:
--

Assignee: Robert Muir

 Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
 

 Key: LUCENE-124
 URL: https://issues.apache.org/jira/browse/LUCENE-124
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Cormac Twomey
Assignee: Robert Muir
Priority: Minor
 Attachments: LUCENE-124.patch


 According to the website's Query Syntax page, fuzzy searches are given a
 boost of 0.2. I've found this not to be the case, and have seen situations 
 where
 exact matches have lower relevance scores than fuzzy matches.
 Rather than getting a boost of 0.2, it appears that all variations on the term
 are first found in the model, where dist*  0.5.
 * dist = levenshteinDistance / length of min(termlength, variantlength)
 This then leads to a boolean OR search of all the variant terms, each of whose
 boost is set to (dist - 0.5)*2 for that variant.
 The upshot of all of this is that there are many cases where a fuzzy match 
 will
 get a higher relevance score than an exact match.
 See this email for a test case to reproduce this anomalous behaviour.
 http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html
 Here is a candidate patch to address the issue -
 *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java   Sun Jun 
 09
 13:47:54 2002
 --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java  
 Fri
 Mar 14 11:37:20 2003
 ***
 *** 99,105 
   }
   
   final protected float difference() {
 ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
   }
   
   final public boolean endEnum() {
 --- 99,109 
   }
   
   final protected float difference() {
 ! if (distance == 1.0) {
 ! return 1.0f;
 ! }
 ! else
 ! return (float)((distance - FUZZY_THRESHOLD) * 
 SCALE_FACTOR);
   }
   
   final public boolean endEnum() {
 ***
 *** 111,117 
**/
   
   public static final double FUZZY_THRESHOLD = 0.5;
 ! public static final double SCALE_FACTOR = 1.0f / (1.0f - 
 FUZZY_THRESHOLD);
   
   /**
Finds and returns the smallest of three integers 
 --- 115,121 
**/
   
   public static final double FUZZY_THRESHOLD = 0.5;
 ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f -
 FUZZY_THRESHOLD));
   
   /**
Finds and returns the smallest of three integers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc

2010-02-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833865#action_12833865
 ] 

Robert Muir commented on LUCENE-124:


I will wait till after the code freeze and commit this in a few days if no one 
objects.

I don't claim its a 'best-practice' fix for fuzzy (see LUCENE-329 for ideas on 
that), I just think TOP_TERMS_CONSTANT_BOOLEAN_REWRITE is a useful complement 
to TOP_TERMS_SCORING_BOOLEAN_REWRITE, for MultiTermQueries that want the Top-N 
terms expansion, but the constant score behavior of CONSTANT_BOOLEAN_REWRITE.

this patch doesnt change any defaults for fuzzy either. in fact its not 
specific to fuzzy at all.

 Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
 

 Key: LUCENE-124
 URL: https://issues.apache.org/jira/browse/LUCENE-124
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Cormac Twomey
Assignee: Robert Muir
Priority: Minor
 Attachments: LUCENE-124.patch


 According to the website's Query Syntax page, fuzzy searches are given a
 boost of 0.2. I've found this not to be the case, and have seen situations 
 where
 exact matches have lower relevance scores than fuzzy matches.
 Rather than getting a boost of 0.2, it appears that all variations on the term
 are first found in the model, where dist*  0.5.
 * dist = levenshteinDistance / length of min(termlength, variantlength)
 This then leads to a boolean OR search of all the variant terms, each of whose
 boost is set to (dist - 0.5)*2 for that variant.
 The upshot of all of this is that there are many cases where a fuzzy match 
 will
 get a higher relevance score than an exact match.
 See this email for a test case to reproduce this anomalous behaviour.
 http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html
 Here is a candidate patch to address the issue -
 *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java   Sun Jun 
 09
 13:47:54 2002
 --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java  
 Fri
 Mar 14 11:37:20 2003
 ***
 *** 99,105 
   }
   
   final protected float difference() {
 ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
   }
   
   final public boolean endEnum() {
 --- 99,109 
   }
   
   final protected float difference() {
 ! if (distance == 1.0) {
 ! return 1.0f;
 ! }
 ! else
 ! return (float)((distance - FUZZY_THRESHOLD) * 
 SCALE_FACTOR);
   }
   
   final public boolean endEnum() {
 ***
 *** 111,117 
**/
   
   public static final double FUZZY_THRESHOLD = 0.5;
 ! public static final double SCALE_FACTOR = 1.0f / (1.0f - 
 FUZZY_THRESHOLD);
   
   /**
Finds and returns the smallest of three integers 
 --- 115,121 
**/
   
   public static final double FUZZY_THRESHOLD = 0.5;
 ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f -
 FUZZY_THRESHOLD));
   
   /**
Finds and returns the smallest of three integers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-15 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833867#action_12833867
 ] 

Shai Erera commented on LUCENE-1720:


bq. Would that be abstract?  ...

I've actually went ahead and implemented a wrap(IndexReader) method. The 
default impl returns a new FilterIndexReader, since it's a concrete class. We 
should add it to its javadocs, saying that you'd better override that method as 
well. Since it's not just for reopen (on all its 3 variants), but for 
getSequentialSubReaders as well, I think it's useful ...

BTW, you seem to implemented correctly getSequentialSubReaders on 
TimeLimitingIndexReader, while FilterIndexReader doesn't ... so I pulled your 
implementation up to FIR and by calling wrap(IndexReader) for every sub reader, 
both FIR and TLIR implement it correctly.

I'm now working on adding more tests to TestTLIR. Hope to post a patch really 
soon.

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1410) PFOR implementation

2010-02-15 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833872#action_12833872
 ] 

Renaud Delbru commented on LUCENE-1410:
---

When performing some tests on PFOR, I have noticed that your algorithm was 
favoring very small numFrameBits for a very large number of exceptions. For 
example, on block of size 512, I have noticed that many of the blocks was with 
bestFrameBits = 1, and the number of exceptions reaching  450.

I found that this was due to the seeting of the allowedNumExceptions variable 
(in the last part of the PFOR#frameBitsForCompression() method) which was set 
to the number of current exceptions + the maximum allowed (which at the end is 
generally extremely large).Is it a bug, or is it something I don't understand 
in the current PFOR algorithm ?

P.S.: btw, the previous benchmark results I have posted are wrong due to a bug 
which was due to the hardcoded byte buffer size (1024) in 
PForIndexInput/Output. I'll post soon updated results, with a comparison with 
GroupVarInt (from WSDM09 - Jeff Dean talk).

 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Reporter: Paul Elschot
Priority: Minor
 Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, 
 LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, 
 LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, 
 TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833876#action_12833876
 ] 

Mark Harwood commented on LUCENE-329:
-

bq. consider simpler case

OK - but we need to remember that it is important to achieve balance _across_ 
different fuzzy queries as well as terms _within_ the same fuzzy query.
Let's stick to the terms within a single fuzzy query for now:

bq. I guess you would like to score the second term higher, meaning Lower 
frequency

No, variant's frequency is not a deciding factor - only edit distance. Johana 
is similarity 0.6 while Johana is 0.2 so I would favour result one  (although 
the this difference seems a little off in this case)
The basic assumption is that user's input is valid and not a typo (deriving 
spelling suggestions etc are a different topic and one we shouldnt try cover 
here). 
Fuzzy matching can drag in all sorts of unqualified variants with massively 
different frequencies. Because we cannot control these discrepancies we should 
reward all these alternatives using the known factors we have to hand - the IDF 
of the user's supposedly valid input and the similarity measure of each variant 
compared to the input.
We could get fancy about probability of variants given the other input terms in 
the query but that feels like its straying into spell checker territory and 
ngrams etc.

 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-124) Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc

2010-02-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833885#action_12833885
 ] 

Robert Muir commented on LUCENE-124:


uwe pointed out to me, i think there is a naming problem with 
TOP_TERMS_CONSTANT_BOOLEAN_REWRITE, as the entire booleanquery will not produce 
the same score like CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE. 

I think the behavior makes sense though, as it wouldnt make sense to use 
TOP_TERMS without per-term boosting, but we need to fix the naming... and 
TOP_TERMS_BOOST_BOOLEAN_REWRITE sounds confusing.


 Fuzzy Searches do not get a boost of 0.2 as stated in Query Syntax doc
 

 Key: LUCENE-124
 URL: https://issues.apache.org/jira/browse/LUCENE-124
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2
 Environment: Operating System: All
 Platform: All
Reporter: Cormac Twomey
Assignee: Robert Muir
Priority: Minor
 Attachments: LUCENE-124.patch


 According to the website's Query Syntax page, fuzzy searches are given a
 boost of 0.2. I've found this not to be the case, and have seen situations 
 where
 exact matches have lower relevance scores than fuzzy matches.
 Rather than getting a boost of 0.2, it appears that all variations on the term
 are first found in the model, where dist*  0.5.
 * dist = levenshteinDistance / length of min(termlength, variantlength)
 This then leads to a boolean OR search of all the variant terms, each of whose
 boost is set to (dist - 0.5)*2 for that variant.
 The upshot of all of this is that there are many cases where a fuzzy match 
 will
 get a higher relevance score than an exact match.
 See this email for a test case to reproduce this anomalous behaviour.
 http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg02819.html
 Here is a candidate patch to address the issue -
 *** lucene-1.2\src\java\org\apache\lucene\search\FuzzyTermEnum.java   Sun Jun 
 09
 13:47:54 2002
 --- lucene-1.2-modified\src\java\org\apache\lucene\search\FuzzyTermEnum.java  
 Fri
 Mar 14 11:37:20 2003
 ***
 *** 99,105 
   }
   
   final protected float difference() {
 ! return (float)((distance - FUZZY_THRESHOLD) * SCALE_FACTOR);
   }
   
   final public boolean endEnum() {
 --- 99,109 
   }
   
   final protected float difference() {
 ! if (distance == 1.0) {
 ! return 1.0f;
 ! }
 ! else
 ! return (float)((distance - FUZZY_THRESHOLD) * 
 SCALE_FACTOR);
   }
   
   final public boolean endEnum() {
 ***
 *** 111,117 
**/
   
   public static final double FUZZY_THRESHOLD = 0.5;
 ! public static final double SCALE_FACTOR = 1.0f / (1.0f - 
 FUZZY_THRESHOLD);
   
   /**
Finds and returns the smallest of three integers 
 --- 115,121 
**/
   
   public static final double FUZZY_THRESHOLD = 0.5;
 ! public static final double SCALE_FACTOR = 0.2f * (1.0f / (1.0f -
 FUZZY_THRESHOLD));
   
   /**
Finds and returns the smallest of three integers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-15 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1720:
---

Attachment: LUCENE-1720.patch

Added patch contains, above all previous things:
* Changes to FitlerIndexReader to better support extensions.
* Changes in TimeLimitingIndexReader based on the previous bullet.
* Changes to TestIndexReader to better support extensions.
* TestFilterIndexReader now extends TestIndexReader.
* TestTimeLimitingIndexReader was added few test methods.

Although some of the changes can be viewed as belonging to a separate issue, I 
think they are not very big and fit nicely here. It allows testing the new 
TimeLimitingIndexReader which is important.

Mark, the only thing that remains is to convert 
TimeLimitingIndexReaderBenchmark to a benchmark algorithm/task. Would you mind 
taking a stab at this? If not, I'll try to get to it later this week (perhaps 
in a day or two).

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-329) Fuzzy query scoring issues

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833876#action_12833876
 ] 

Mark Harwood edited comment on LUCENE-329 at 2/15/10 5:05 PM:
--

bq. consider simpler case

OK - but we need to remember that it is important to achieve balance _across_ 
different fuzzy queries as well as terms _within_ the same fuzzy query.
Let's stick to the terms within a single fuzzy query for now:

bq. I guess you would like to score the second term higher, meaning Lower 
frequency

No, variant's frequency is not a deciding factor - only edit distance. Johana 
is similarity 0.6 while Joahn is 0.2 so I would favour result one  (although 
the this difference seems a little off in this case)
The basic assumption is that user's input is valid and not a typo (deriving 
spelling suggestions etc are a different topic and one we shouldnt try cover 
here). 
Fuzzy matching can drag in all sorts of unqualified variants with massively 
different frequencies. Because we cannot control these discrepancies we should 
reward all these alternatives using the known factors we have to hand - the IDF 
of the user's supposedly valid input and the similarity measure of each variant 
compared to the input.
We could get fancy about probability of variants given the other input terms in 
the query but that feels like its straying into spell checker territory and 
ngrams etc.

  was (Author: markh):
bq. consider simpler case

OK - but we need to remember that it is important to achieve balance _across_ 
different fuzzy queries as well as terms _within_ the same fuzzy query.
Let's stick to the terms within a single fuzzy query for now:

bq. I guess you would like to score the second term higher, meaning Lower 
frequency

No, variant's frequency is not a deciding factor - only edit distance. Johana 
is similarity 0.6 while Johana is 0.2 so I would favour result one  (although 
the this difference seems a little off in this case)
The basic assumption is that user's input is valid and not a typo (deriving 
spelling suggestions etc are a different topic and one we shouldnt try cover 
here). 
Fuzzy matching can drag in all sorts of unqualified variants with massively 
different frequencies. Because we cannot control these discrepancies we should 
reward all these alternatives using the known factors we have to hand - the IDF 
of the user's supposedly valid input and the similarity measure of each variant 
compared to the input.
We could get fancy about probability of variants given the other input terms in 
the query but that feels like its straying into spell checker territory and 
ngrams etc.
  
 Fuzzy query scoring issues
 --

 Key: LUCENE-329
 URL: https://issues.apache.org/jira/browse/LUCENE-329
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.2rc5
 Environment: Operating System: All
 Platform: All
Reporter: Mark Harwood
Priority: Minor
 Attachments: patch.txt


 Queries which automatically produce multiple terms (wildcard, range, prefix, 
 fuzzy etc)currently suffer from two problems:
 1) Scores for matching documents are significantly smaller than term queries 
 because of the volume of terms introduced (A match on query Foo~ is 0.1 
 whereas a match on query Foo is 1).
 2) The rarer forms of expanded terms are favoured over those of more common 
 forms because of the IDF. When using Fuzzy queries for example, rare mis-
 spellings typically appear in results before the more common correct 
 spellings.
 I will attach a patch that corrects the issues identified above by 
 1) Overriding Similarity.coord to counteract the downplaying of scores 
 introduced by expanding terms.
 2) Taking the IDF factor of the most common form of expanded terms as the 
 basis of scoring all other expanded terms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1410) PFOR implementation

2010-02-15 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833895#action_12833895
 ] 

Paul Elschot commented on LUCENE-1410:
--

That might indeed be a bug.
In general the frameBitsForCompression() method should try and find the number 
that has the quickest decompression.
The trade off is between exceptions taking extra time, and a smaller number of 
frame bits taking less time.
I have not timed the exception decoding yet, so I would not expect the current 
implementation of frameBitsForCompression() to be right on the spot.

Please note that I'd still like to change the exception encoding, see the 
comment of 12 May 2009:
For an exception, we store its lower b bits instead of the offset to the next 
exception in its corresponding slot, while we store the higher overflow bits 
and the offset in two separate arrays. These two arrays are compressed using 
the Simple16 method.


 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Reporter: Paul Elschot
Priority: Minor
 Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, 
 LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, 
 LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, 
 TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: [VOTE] Lucene Java 2.9.2 and 3.0.1 release artifacts

2010-02-15 Thread Uwe Schindler
As people.apache.org is down, here is an alternate location with the same 
artifacts:

http://alpha.thetaphi.de/lucene-292-301-take1-rev910082/

Happy testing!

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
 Sent: Monday, February 15, 2010 12:46 AM
 To: gene...@lucene.apache.org; java-dev@lucene.apache.org
 Subject: [VOTE] Lucene Java 2.9.2 and 3.0.1 release artifacts
 
 Hallo Folks,
 
 I have posted a release candidate for both Lucene Java 2.9.2 and 3.0.1
 (which both have the same bug fix level, functionality and release
 announcement), build from revision 910082 of the corresponding
 branches. Thanks for all your help! Please test them and give your
 votes until Thursday morning, as the scheduled release date for both
 versions is Friday, Feb 19th, 2010. Only votes from Lucene PMC are
 binding, but everyone
 is welcome to check the release candidate and voice their approval or
 disapproval. The vote passes if at least three binding +1 votes are
 cast.
 
 We planned the parallel release with one announcement because of their
 parallel development / bug fix level to emphasize that they are equal
 except deprecation removal and Java 5 since major version 3.
 
 Please also read the attached release announcement (Open Document) and
 send it corrected back if you miss anything or want to improve my bad
 English :-)
 
 You find the artifacts here:
 http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-
 rev910082/
 
 Maven repo:
 http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-
 rev910082/maven/
 
 The changes are here:
 http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-
 rev910082/changes-2.9.2/Changes.html
 http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-
 rev910082/changes-2.9.2/Contrib-Changes.html
 
 http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-
 rev910082/changes-3.0.1/Changes.html
 http://people.apache.org/~uschindler/staging-area/lucene-292-301-take1-
 rev910082/changes-3.0.1/Contrib-Changes.html
 
 Uwe
 
 === Proposed Release Announcement ===
 
 Hello Lucene users,
 
 On behalf of the Lucene development community I would like to announce
 the release of Lucene Java versions 3.0.1 and 2.9.2:
 
 Both releases fix bugs in the previous versions, where 2.9.2 is the
 last release working with Java 1.4, still providing all deprecated APIs
 of the Lucene Java 2.x series. 3.0.1 has the same bug fix level, but
 requires Java 5 and is no longer compatible with code using deprecated
 APIs. The API was cleaned up to make use of Java 5's generics, varargs,
 enums, and autoboxing. New users of Lucene are advised to use version
 3.0.1 for new developments, because it has a clean, type safe new API.
 Users upgrading from 2.9.x can now remove unnecessary casts and add
 generics to their code, too.
 
 Important improvements in these releases are a increased maximum number
 of unique terms in each index segment. They also add fixes in
 IndexWriter’s commit and lost document deletes in near real-time
 indexing. Also lots of bugs in Contrib’s Analyzers package were fixed.
 Additionally, the 3.0.1 release restored some public methods, that get
 lost during deprecation removal. If you are using Lucene in a web
 application environment, you will notice that the new Attribute-based
 TokenStream API now works correct with different class loaders.
 Both releases are fully compatible with the corresponding previous
 versions. We strongly recommend upgrading to 2.9.2 if you are using
 2.9.1 or 2.9.0; and to 3.0.1 if you are using 3.0.0.
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833902#action_12833902
 ] 

Mark Harwood commented on LUCENE-1720:
--

bq. Mark, the only thing that remains is to convert 
TimeLimitingIndexReaderBenchmark to a benchmark algorithm/task. Would you mind 
taking a stab at this?

Will need to look at existing benchmark tasks for guidance. I may get some time 
later.

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, Lucene-1720.patch, Lucene-1720.patch, LUCENE-1720.patch, 
 TestTimeLimitedIndexReader.java, TestTimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java, TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1720) TimeLimitedIndexReader and associated utility class

2010-02-15 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1720:
---

Attachment: LUCENE-1720.patch

* Added some Javadocs in FilterIndexReader
* Changed the exception text thrown from SegmentReader.getOnlySegmentReader().
* Added a CHANGES entry under new API

 TimeLimitedIndexReader and associated utility class
 ---

 Key: LUCENE-1720
 URL: https://issues.apache.org/jira/browse/LUCENE-1720
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Attachments: ActivityTimedOutException.java, 
 ActivityTimeMonitor.java, ActivityTimeMonitor.java, ActivityTimeMonitor.java, 
 LUCENE-1720.patch, LUCENE-1720.patch, Lucene-1720.patch, Lucene-1720.patch, 
 LUCENE-1720.patch, TestTimeLimitedIndexReader.java, 
 TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java, 
 TimeLimitedIndexReader.java


 An alternative to TimeLimitedCollector that has the following advantages:
 1) Any reader activity can be time-limited rather than just single searches 
 e.g. the document retrieve phase.
 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly 
 before last collect stage of query processing)
 Uses new utility timeout class that is independent of IndexReader.
 Initial contribution includes a performance test class but not had time as 
 yet to work up a formal Junit test.
 TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1410) PFOR implementation

2010-02-15 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833922#action_12833922
 ] 

Renaud Delbru commented on LUCENE-1410:
---

{quote}
In general the frameBitsForCompression() method should try and find the number 
that has the quickest decompression.
{quote}

In fact,  the method is trying to find the configuration that has the smaller 
size in term of bytes (with the test totalBytes = bestBytes). But it seems 
that it is not always the best case for quick decompression. I haven't test the 
decompression speed of PFOR yet, but I imagine that with 89% of exceptions 
(while in the original algorithm exception should occurs only a few times), it 
should not be the best performance.

Why not just computing the index (in the sorted copy array) that will represent 
the percentage of values that should be packed, and the remaining of the array 
encoded as exceptions ?

 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Reporter: Paul Elschot
Priority: Minor
 Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, 
 LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, 
 LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, 
 TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1410) PFOR implementation

2010-02-15 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833952#action_12833952
 ] 

Paul Elschot commented on LUCENE-1410:
--

bq. Why not just computing the index (in the sorted copy array) that will 
represent the percentage of values that should be packed, and the remaining of 
the array encoded as exceptions ?

Because there are cases in which no exceptions are needed, for example 90% of 
the values using the same max nr of bits.

Anyway, the articles on PFor do not completely specify how this should be done, 
and they are based on C, not on Java.
That means there will be room for improvement.

 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Other
Reporter: Paul Elschot
Priority: Minor
 Attachments: autogen.tgz, LUCENE-1410-codecs.tar.bz2, 
 LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, 
 LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, 
 TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-15 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2089:


Attachment: LUCENE-2089.patch

here is a more fleshed out patch (rough but all tests pass):
* FuzzyTermsEnum is a proxy enum, it delegates behavior to either 
LinearFuzzyTermsEnum (the old code) or AutomatonFuzzyTermsEnum, and can switch 
during enumeration.
* TODO is to handle minBoostChanged, this is the event where the priority queue 
is full and the minimum score to be competitive has changed (the opportunity to 
switch to a more efficient enum).
* code is generalized for all N, but i only include the table for N=0,1 for now.


 explore using automaton for fuzzyquery
 --

 Key: LUCENE-2089
 URL: https://issues.apache.org/jira/browse/LUCENE-2089
 Project: Lucene - Java
  Issue Type: Wish
  Components: Search
Affects Versions: Flex Branch
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor
 Fix For: Flex Branch

 Attachments: LUCENE-2089.patch, LUCENE-2089.patch, 
 Moman-0.2.1.tar.gz, TestFuzzy.java


 Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
 itching to write that nasty algorithm)
 we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
 * up front, calculate the maximum required K edits needed to match the users 
 supplied float threshold.
 * for at least small common E up to some max K (1,2,3, etc) we should create 
 a DFA for each E. 
 if the required E is above our supported max, we use dumb mode at first (no 
 seeking, no DFA, just brute force like now).
 As the pq fills, we swap progressively lower DFAs into the enum, based upon 
 the lowest score in the pq.
 This should work well on avg, at high E, you will typically fill the pq very 
 quickly since you will match many terms. 
 This not only provides a mechanism to switch to more efficient DFAs during 
 enumeration, but also to switch from dumb mode to smart mode.
 i modified my wildcard benchmark to generate random fuzzy queries.
 * Pattern: 7N stands for NNN, etc.
 * AvgMS_DFA: this is the time spent creating the automaton (constructor)
 ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
 |7N|10|64.0|4155.9|38.6|20.3|
 |14N|10|0.0|2511.6|46.0|37.9| 
 |28N|10|0.0|2506.3|93.0|86.6|
 |56N|10|0.0|2524.5|304.4|298.5|
 as you can see, this prototype is no good yet, because it creates the DFA in 
 a slow way. right now it creates an NFA, and all this wasted time is in 
 NFA-DFA conversion.
 So, for a very long string, it just gets worse and worse. This has nothing to 
 do with lucene, and here you can see, the TermEnum is fast (AvgMS - 
 AvgMS_DFA), there is no problem there.
 instead we should just build a DFA to begin with, maybe with this paper: 
 http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
 we can precompute the tables with that algorithm up to some reasonable K, and 
 then I think we are ok.
 the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
 linear minimization, if someone wants to implement this they should not worry 
 about minimization.
 in fact, we need to at some point determine if AutomatonQuery should even 
 minimize FSM's at all, or if it is simply enough for them to be deterministic 
 with no transitions to dead states. (The only code that actually assumes 
 minimal DFA is the Dumb vs Smart heuristic and this can be rewritten as a 
 summation easily). we need to benchmark really complex DFAs (i.e. write a 
 regex benchmark) to figure out if minimization is even helping right now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org