[jira] Commented: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.

Dirk Goldhan (JIRA) Fri, 05 Feb 2010 08:48:02 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830160#action_12830160
 ]


Dirk Goldhan commented on LUCENE-2230:
--------------------------------------

I have tried your patch on lucene 3.0.0. I had to make a small change to get it 
work. 
In the current implementation the enumerator is before the first element. This 
is no problem in lucene 2.9.1 as it simply does one more iteration in 
FuzzyQuery (rewrite). In lucene 3.0.0 FuzzyQuery directly accesses the first 
element of the enumeration. As this is null it simply stops further processing 
of terms (if (t == null) break;). 

I have made a small change in the FuzzyTermEnumNEW class to assign the first 
element to the enumerator during creation. I simply appended the following 
lines within the constructor:

                if (this.termIterator.hasNext()) {
                        Term firstTerm = new Term(field, 
termMap.keySet().iterator().next());
                        this.currentTerm = firstTerm;
                } 

Thank you for the patch.


> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2230
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2230
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0
>         Environment: Lucene currently uses brute force full-terms scanner and 
> calculates distance for each term. New BKTree structure improves performance 
> in average 20 times when distance is 1, and 3 times when distance is 3. I 
> tested with index size several millions docs, and 250,000 terms. 
> New algo uses integer distances between objects.
>            Reporter: Fuad Efendi
>         Attachments: BKTree.java, Distance.java, DistanceImpl.java, 
> FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
>   Original Estimate: 0.02h
>  Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, 
> CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by 
> http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick 
> Johnson, Google).
> Additionally, simplified algorythm at 
> http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more 
> logically correct than Levenstein distance, and it is 3-5 times faster 
> (isolated tests).
> Big list od distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.

Reply via email to