Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
----------------------------------------------------------------
Key: LUCENE-2230
URL: https://issues.apache.org/jira/browse/LUCENE-2230
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 3.0
Environment: Lucene currently uses a brute-force scanner over all terms and
calculates the distance for each term. The new BKTree structure improves
performance on average 20 times when the distance is 1, and 3 times when the
distance is 3. I tested with an index of several million docs and 250,000 terms.
The new algorithm uses integer distances between objects.
Reporter: Fuad Efendi
W. Burkhard and R. Keller. Some approaches to best-match file searching, CACM,
1973
http://portal.acm.org/citation.cfm?doid=362003.362025
I was inspired by
http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick
Johnson, Google).
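To illustrate the idea, here is a minimal sketch of a BK-tree keyed by an
integer edit distance. It is not the code proposed in this issue; the class
name BKTree, its methods, and the use of plain Levenshtein distance are
assumptions made for the example. Each child edge is labelled with the distance
between child and parent, so by the triangle inequality a search only has to
descend into edges labelled between d - maxDistance and d + maxDistance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Lucene.
class BKTree {
    private static final class Node {
        final String term;
        // Edge label (integer distance to this node's term) -> child node.
        final Map<Integer, Node> children = new HashMap<>();
        Node(String term) { this.term = term; }
    }

    private Node root;

    /** Dynamic-programming Levenshtein distance; any integer-valued metric would do. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Inserts a term by following the edge labelled with its distance to each node. */
    public void add(String term) {
        if (root == null) { root = new Node(term); return; }
        Node node = root;
        while (true) {
            int d = distance(term, node.term);
            if (d == 0) return; // term already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(term)); return; }
            node = child;
        }
    }

    /** Returns all stored terms within maxDistance of the query. */
    public List<String> search(String query, int maxDistance) {
        List<String> matches = new ArrayList<>();
        if (root != null) search(root, query, maxDistance, matches);
        return matches;
    }

    private void search(Node node, String query, int maxDistance, List<String> matches) {
        int d = distance(query, node.term);
        if (d <= maxDistance) matches.add(node.term);
        // Triangle inequality: matches can only live under edges in [d - max, d + max].
        for (int k = Math.max(1, d - maxDistance); k <= d + maxDistance; k++) {
            Node child = node.children.get(k);
            if (child != null) search(child, query, maxDistance, matches);
        }
    }
}
```

With a small maxDistance most edges are skipped, which fits the larger speedup
reported for distance 1 than for distance 3.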
Additionally, the simplified algorithm at
http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more
logically correct than Levenshtein distance, and it is 3-5 times faster
(isolated tests).
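For reference, a minimal sketch of the letter-pair similarity described in that
article follows. The class and method names are assumptions made for the
example. Note that it returns a similarity between 0 and 1, not an integer
distance, so using it with a BK-tree would need an additional mapping that is
not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene.
class StrikeAMatch {
    /** Adjacent letter pairs of every whitespace-separated word in the upper-cased input. */
    public static List<String> letterPairs(String s) {
        List<String> pairs = new ArrayList<>();
        for (String word : s.toUpperCase(Locale.ROOT).trim().split("\\s+")) {
            for (int i = 0; i + 1 < word.length(); i++) {
                pairs.add(word.substring(i, i + 2));
            }
        }
        return pairs;
    }

    /** Similarity in [0, 1]: 2 * |shared pairs| / (|pairs of a| + |pairs of b|). */
    public static double similarity(String a, String b) {
        List<String> pairsA = letterPairs(a);
        List<String> pairsB = letterPairs(b);
        int total = pairsA.size() + pairsB.size();
        if (total == 0) return 0.0; // neither input has a letter pair
        int shared = 0;
        for (String pair : pairsA) {
            // remove() consumes the match, so each pair in b is counted at most once.
            if (pairsB.remove(pair)) shared++;
        }
        return 2.0 * shared / total;
    }
}
```

For example, "France" and "French" share the pairs FR and NC out of five pairs
each, giving 2 * 2 / 10 = 0.4.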
Big list of distance implementations:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm