Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

eks dev Wed, 18 Feb 2009 07:59:21 -0800

Have you tried NGram SpellChecker + query expansion? This is quite similar to
your proposal: SpellChecker already keeps a priority queue of the best
candidates.
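
To make the n-gram idea concrete, here is a minimal sketch in plain Java. It is not the SpellChecker API: the class and method names (NGramCandidateSketch, grams, candidates) and the minShared parameter are made up for illustration. It only shows the principle of looking up candidate terms by shared character n-grams, so that the expensive edit-distance step runs on a short candidate list instead of on every term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
final class NGramCandidateSketch {
    private final int n;
    // Inverted index: n-gram -> terms that contain it.
    private final Map<String, Set<String>> termsByGram = new HashMap<>();

    NGramCandidateSketch(int n) {
        this.n = n;
    }

    /** Splits a term into overlapping character n-grams. */
    static List<String> grams(String term, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            out.add(term.substring(i, i + n));
        }
        return out;
    }

    /** Indexes one term under each of its n-grams. */
    void add(String term) {
        for (String g : grams(term, n)) {
            termsByGram.computeIfAbsent(g, k -> new HashSet<>()).add(term);
        }
    }

    /**
     * Returns every indexed term that shares at least minShared distinct
     * n-grams with the query, mapped to the number of shared n-grams.
     */
    Map<String, Integer> candidates(String query, int minShared) {
        Map<String, Integer> shared = new HashMap<>();
        for (String g : new HashSet<>(grams(query, n))) {
            Set<String> terms = termsByGram.getOrDefault(g, Collections.emptySet());
            for (String t : terms) {
                shared.merge(t, 1, Integer::sum);
            }
        }
        shared.values().removeIf(count -> count < minShared);
        return shared;
    }
}
```

The surviving candidates would then be re-ranked by edit distance, and the best few expanded into an OR query against the main index.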

----- Original Message ----
> From: mark harwood <markharw...@yahoo.co.uk>
> To: java-user@lucene.apache.org
> Sent: Wednesday, 18 February, 2009 11:54:18
> Subject: Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers
> 
> 
> I was having some thoughts recently about speeding up fuzzy search.
> 
> The current system computes edit distance against all terms A-Z in a single
> thread. Prefix length can reduce the search space, and there is a "minimum
> similarity" threshold, but that's roughly where we are. Multithreading this
> to make use of multiple CPUs is one option to look at, but I was mainly
> thinking about smarter ways to do the fuzzy scan:
> 
> I had the notion that we could move to a solution where a priority queue
> keeps the "best matches so far". As you progress through the termEnum, you
> could bail out of edit-distance calculations quickly using a rough (cheap)
> assessment of whether the current term is likely to make the cut (i.e. beat
> the current lowest score in the priority queue). It would make sense to
> populate the priority queue ASAP with the terms most likely to be the best
> matches, and these will be the ones that share a reasonably long prefix.
> As an example - searching for Obama~
> 
> 1) Create "best matches" priority queue
> 2) Scan all terms from "oba" to "obz", populating the priority queue
> 3) Scan all terms from "a" to "oba" and from "obz" to "z", exiting quickly
> if the term cannot beat the lowest score in the priority queue.
> 
> How we "exit quickly" and how we determine what prefix to use in 2) are to be 
> determined but the principle seems reasonable
> 
> Thoughts?
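
The three steps above could be sketched roughly as follows. This is plain Java with no Lucene dependency, and every name in it (FuzzyScanSketch, Match, boundedDistance, offer, search) is hypothetical. It makes two assumptions that the proposal leaves open: the prefix is simply the first prefixLen characters of the query, and the "exit quickly" test is the length-difference lower bound plus a Levenshtein computation that stops as soon as a whole row exceeds the bound. The sketch ranks by raw edit distance (lower is better) rather than by a similarity score, so the queue's head is the largest distance kept so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; not Lucene's FuzzyQuery implementation.
final class FuzzyScanSketch {

    /** A kept term and its edit distance to the query (smaller is better). */
    static final class Match {
        final String term;
        final int distance;
        Match(String term, int distance) { this.term = term; this.distance = distance; }
    }

    /** Levenshtein distance, or maxDist + 1 as soon as it must exceed maxDist. */
    static int boundedDistance(String a, String b, int maxDist) {
        // Cheap lower bound: the distance is at least the length difference.
        if (Math.abs(a.length() - b.length()) > maxDist) return maxDist + 1;
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Row minima never decrease, so later rows cannot get back under the bound.
            if (rowMin > maxDist) return maxDist + 1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return Math.min(prev[b.length()], maxDist + 1);
    }

    /** Adds the term only if it beats the worst entry of a full queue. */
    static void offer(PriorityQueue<Match> queue, int k, String query, String term, int maxDist) {
        int bound = queue.size() < k ? maxDist : Math.min(maxDist, queue.peek().distance - 1);
        if (bound < 0) return;
        int d = boundedDistance(query, term, bound);
        if (d > bound) return;
        if (queue.size() == k) queue.poll();
        queue.add(new Match(term, d));
    }

    static List<Match> search(SortedSet<String> terms, String query, int prefixLen, int k, int maxDist) {
        List<Match> result = new ArrayList<>();
        if (k <= 0) return result;
        // Step 1: max-heap on distance, so peek() is the worst match kept so far.
        PriorityQueue<Match> queue =
                new PriorityQueue<>(Comparator.comparingInt((Match m) -> m.distance).reversed());
        String prefix = query.substring(0, Math.min(prefixLen, query.length()));
        String prefixEnd = prefix + Character.MAX_VALUE;
        // Step 2: terms sharing the prefix fill the queue first.
        for (String t : terms.subSet(prefix, prefixEnd)) offer(queue, k, query, t, maxDist);
        // Step 3: all remaining terms, now with a tight bound for early exit.
        for (String t : terms.headSet(prefix)) offer(queue, k, query, t, maxDist);
        for (String t : terms.tailSet(prefixEnd)) offer(queue, k, query, t, maxDist);
        result.addAll(queue);
        result.sort(Comparator.comparingInt((Match m) -> m.distance).thenComparing(m -> m.term));
        return result;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(Arrays.asList(
                "alabama", "bama", "banana", "obama", "obana", "oboe", "osama", "zebra"));
        for (Match m : search(terms, "obama", 3, 3, 2)) {
            System.out.println(m.term + " " + m.distance);
        }
    }
}
```

Because the prefix range is scanned first, the bound passed to boundedDistance is usually already small by the time the other ranges are scanned, so most of those terms are rejected by the length check or after a row or two of the distance matrix.
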
> 
> 
> 
> 
> ----- Original Message ----
> From: Varun Dhussa 
> To: java-user@lucene.apache.org
> Sent: Wednesday, 18 February, 2009 10:36:07
> Subject: Lucene search performance on Sun UltraSparc T2 (T5120) servers
> 
> Hi,
> 
> I have had a bad experience migrating my application from Intel Xeon based
> servers to Sun UltraSparc T2 (T5120) servers. Lucene fuzzy search just does
> not perform: a search that took approximately 500 ms now takes more than
> 6 seconds to execute.
> 
> The index has about 100,000,000 records. So, I tried to split it into
> 10 indices and used the ParallelSearcher on it, but still got similar
> results.
> 
> I am guessing that this is because the distance implementation used by Lucene 
> requires higher clock speed and can't be parallelized much.
> 
> Please advise.
> 
> -- Varun Dhussa
> Product Architect
> CE InfoSystems (P) Ltd
> http://www.mapmyindia.com
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org




