Fuzzy search tends to be super heavy on CPU because of the Levenstein distance algo. We use it for a small index 60MB for spell correcting and our QPS suffers as a result.
There was recently a discussion of a new fuzzy algorithm: https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661286#action_12661286 M On Wed, Feb 18, 2009 at 5:59 AM, Varun Dhussa <va...@mapmyindia.com> wrote: > The method suggested would make the speed faster, but I doubt whether it > would be substantial on processors with slower clock speed. Keeping in mind > that most processors are going multi-core, it would make sense to > multi-thread the scan. > > Any remarks are welcome! > > Varun Dhussa > Product Architect > CE InfoSystems (P) Ltd > http://www.mapmyindia.com > > > > mark harwood wrote: > >> I was having some thoughts recently about speeding up fuzzy search. >> >> The current system does edit-distance on all terms A-Z, single threaded. >> Prefix length can reduce the search space and there is a "minimum >> similarity" threshold but that's roughly where we are. Multithreading this >> to make use of multiple CPUs is one option to look at but I was mainly >> thinking about smarter ways to do the fuzzy scan: >> >> I had the notion that we could move to a solution where a priority queue >> keeps the "best matches so far" and as you progress through the termEnum you >> could bail out of edit distance calculations quickly using a rough(cheap) >> assessment of if the current term is likely to make the cut (i.e. beat the >> current lowest score in the priority queue). It would make sense to populate >> the priority queue ASAP with terms that are most likely to be the best >> matches and these will be the ones that share a reasonable length prefix. >> As an example - searching for Obama~ >> >> 1) Create "best matches" priority queue >> 2) Scan all terms from oba to obz populating priority queue >> 3) Scan all terms from "a" to "oba" and "obz" to "z", exiting quickly if >> the term fails to meet lowest score in the priority queue. >> >> How we "exit quickly" and how we determine what prefix to use in 2) are to >> be determined but the principle seems reasonable >> >> Thoughts? >> >> >> >> >> ----- Original Message ---- >> From: Varun Dhussa <va...@mapmyindia.com> >> To: java-user@lucene.apache.org >> Sent: Wednesday, 18 February, 2009 10:36:07 >> Subject: Lucene search performance on Sun UltraSparc T2 (T5120) servers >> >> Hi, >> >> I have had a bad experience when migrating my application from Intel Xeon >> based servers to Sun UltraSparc T2 T5120 servers. Lucene fuzzy search just >> does not perform. A search which took approximately 500 ms takes more than 6 >> seconds to execute. >> >> The index has about 100,000,000 records. So, I tried to split it into 10 >> indices and used the ParallelSearcher on it, but still got similar results. >> >> I am guessing that this is because the distance implementation used by >> Lucene requires higher clock speed and can't be parallelized much. >> >> Please advice >> >> -- Varun Dhussa >> Product Architect >> CE InfoSystems (P) Ltd >> http://www.mapmyindia.com >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >