Have you tried NGram SpellChecker + query expansion? It is quite similar to your proposal, and SpellChecker already has the priority queue you describe.
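
For reference, here is a rough sketch of the "priority queue plus quick exit" idea from the quoted message below, in plain Java. It is not Lucene code and does not claim to show how SpellChecker or the fuzzy query work internally. The names FuzzyTopK, boundedDistance and topK are made up for illustration, and plain Levenshtein distance stands in for Lucene's similarity score. The "quick exit" here is only one possible choice: give up on a term once every cell in the current row of the edit-distance table is already above the bound.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lucene.
final class FuzzyTopK {

    // A candidate term together with its edit distance to the query.
    static final class Match {
        final String term;
        final int distance;
        Match(String term, int distance) { this.term = term; this.distance = distance; }
    }

    // Levenshtein distance between a and b, or -1 as soon as it is
    // certain that the distance exceeds maxDist.
    static int boundedDistance(String a, String b, int maxDist) {
        // Cheap pre-check: the distance is at least the length difference.
        if (Math.abs(a.length() - b.length()) > maxDist) {
            return -1;
        }
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // The final distance can never be smaller than the row minimum.
            if (rowMin > maxDist) {
                return -1;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()] <= maxDist ? prev[b.length()] : -1;
    }

    // Returns up to k terms within maxDist edits of the query, best first.
    static List<String> topK(String query, Iterable<String> terms, int k, int maxDist) {
        // Max-heap on distance: the head is the worst match currently kept.
        PriorityQueue<Match> queue =
                new PriorityQueue<>(Comparator.comparingInt((Match m) -> m.distance).reversed());
        int bound = maxDist;
        for (String term : terms) {
            int d = boundedDistance(query, term, bound);
            if (d < 0) {
                continue; // cannot make the cut
            }
            queue.add(new Match(term, d));
            if (queue.size() > k) {
                queue.poll(); // drop the current worst match
            }
            if (queue.size() == k) {
                // From now on a term must strictly beat the worst kept match.
                bound = queue.peek().distance - 1;
            }
        }
        List<Match> kept = new ArrayList<>(queue);
        kept.sort(Comparator.comparingInt((Match m) -> m.distance).thenComparing(m -> m.term));
        List<String> result = new ArrayList<>();
        for (Match m : kept) {
            result.add(m.term);
        }
        return result;
    }
}
```

Calling FuzzyTopK.topK("obama", terms, 3, 2) keeps at most three terms within two edits of the query. The bound tightens as the queue fills, so feeding the shared-prefix terms first (step 2 in the quoted message) should let the later scan reject most terms on the length check or after a few rows.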

----- Original Message ----
> From: mark harwood <markharw...@yahoo.co.uk>
> To: java-user@lucene.apache.org
> Sent: Wednesday, 18 February, 2009 11:54:18
> Subject: Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers
>
> I was having some thoughts recently about speeding up fuzzy search.
>
> The current system does edit-distance on all terms A-Z, single threaded. Prefix length can reduce the search space and there is a "minimum similarity" threshold but that's roughly where we are. Multithreading this to make use of multiple CPUs is one option to look at but I was mainly thinking about smarter ways to do the fuzzy scan:
>
> I had the notion that we could move to a solution where a priority queue keeps the "best matches so far" and as you progress through the termEnum you could bail out of edit distance calculations quickly using a rough(cheap) assessment of if the current term is likely to make the cut (i.e. beat the current lowest score in the priority queue). It would make sense to populate the priority queue ASAP with terms that are most likely to be the best matches and these will be the ones that share a reasonable length prefix.
>
> As an example - searching for Obama~
>
> 1) Create "best matches" priority queue
> 2) Scan all terms from oba to obz populating priority queue
> 3) Scan all terms from "a" to "oba" and "obz" to "z", exiting quickly if the term fails to meet lowest score in the priority queue.
>
> How we "exit quickly" and how we determine what prefix to use in 2) are to be determined but the principle seems reasonable
>
> Thoughts?
>
> ----- Original Message ----
> From: Varun Dhussa
> To: java-user@lucene.apache.org
> Sent: Wednesday, 18 February, 2009 10:36:07
> Subject: Lucene search performance on Sun UltraSparc T2 (T5120) servers
>
> Hi,
>
> I have had a bad experience when migrating my application from Intel Xeon based servers to Sun UltraSparc T2 T5120 servers. Lucene fuzzy search just does not perform. A search which took approximately 500 ms takes more than 6 seconds to execute.
>
> The index has about 100,000,000 records. So, I tried to split it into 10 indices and used the ParallelSearcher on it, but still got similar results.
>
> I am guessing that this is because the distance implementation used by Lucene requires higher clock speed and can't be parallelized much.
>
> Please advice
>
> --
> Varun Dhussa
> Product Architect
> CE InfoSystems (P) Ltd
> http://www.mapmyindia.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org