[ 
https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804001#action_12804001
 ] 

Fuad Efendi edited comment on LUCENE-2230 at 1/24/10 12:46 AM:
---------------------------------------------------------------

Minor bug fixed (with cache warm-up)...

Don't forget to disable QueryResultsCache and to enable large DocumentCache (if 
you are using SOLR); otherwise you won't see the difference. (SOLR users: there 
are some other tricks!)

With Lucene 2.9.1: 
800ms

With BKTree and Levenstein algo: 
200ms

With BKTree and http://www.catalysoft.com/articles/StrikeAMatch.html
70ms

Average timing after many hours of tests. We may consider "integer" distance 
instead of "float" for Lucene Query if we accept this algo; I tried the best to 
have it close to "float" distance.

BKTree is cached at FuzzyTermEnumNEW. It needs warm-up if index changed; 
simplest way - to recalc it at night (separate thread will do it).

======
P.S.

FuzzyQuery constructs instance of FuzzyTermEnum and passes instance of 
IndexReader to constructor. This is the way... If IndexReader changed (new 
instance) we can simply repopulate BKTree (or, for instance, we can compare 
list of terms and simply add terms missed in BKTree)...

      was (Author: funtick):
    Minor bug fixed (with cache warm-up)...

Don't forget to disable QueryResultsCache and to enable large DocumentCache (if 
you are using SOLR); otherwise you won't see the difference. (SOLR users: there 
are some other tricks!)

With Lucene 2.9.1: 
800ms

With BKTree and Levenstein algo: 
200ms

With BKTree and http://www.catalysoft.com/articles/StrikeAMatch.html
70ms

Average timing after many hours of tests. We may consider "integer" distance 
instead of "float" for Lucene Query if we accept this algo; I tried the best to 
have it close to "float" distance.

BKTree is cached at FuzzyTermEnumNEW. It needs warm-up if index changed; 
simplest way - to recalc it at night (separate thread will do it).

Thanks,
Fuad
+1 416-993-2060(cell)

  
> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2230
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2230
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0
>         Environment: Lucene currently uses brute force full-terms scanner and 
> calculates distance for each term. New BKTree structure improves performance 
> in average 20 times when distance is 1, and 3 times when distance is 3. I 
> tested with index size several millions docs, and 250,000 terms. 
> New algo uses integer distances between objects.
>            Reporter: Fuad Efendi
>         Attachments: BKTree.java, Distance.java, DistanceImpl.java, 
> FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
>   Original Estimate: 0.02h
>  Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, 
> CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by 
> http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick 
> Johnson, Google).
> Additionally, simplified algorythm at 
> http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more 
> logically correct than Levenstein distance, and it is 3-5 times faster 
> (isolated tests).
> Big list od distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to