I was having some thoughts recently about speeding up fuzzy search.

The current system does edit-distance on all terms A-Z, single threaded. Prefix 
length can reduce the search space and there is a "minimum similarity" 
threshold but that's roughly where we are. Multithreading this to make use of 
multiple CPUs is one option to look at but I was mainly thinking about smarter 
ways to do the fuzzy scan:

I had the notion that we could move to a solution where a priority queue keeps 
the "best matches so far" and as you progress through the termEnum you could 
bail out of edit distance calculations quickly using a rough(cheap) assessment 
of if the current term is likely to make the cut (i.e. beat the current lowest 
score in the priority queue). It would make sense to populate the priority 
queue ASAP with terms that are most likely to be the best matches and these 
will be the ones that share a reasonable length prefix.
As an example - searching for Obama~

1) Create "best matches" priority queue
2) Scan all terms from oba to obz populating priority queue
3) Scan all terms from "a" to "oba" and "obz" to "z", exiting quickly if the 
term fails to meet lowest score in the priority queue.

How we "exit quickly" and how we determine what prefix to use in 2) are to be 
determined but the principle seems reasonable

Thoughts?




----- Original Message ----
From: Varun Dhussa <va...@mapmyindia.com>
To: java-user@lucene.apache.org
Sent: Wednesday, 18 February, 2009 10:36:07
Subject: Lucene search performance on Sun UltraSparc T2 (T5120) servers

Hi,

I have had a bad experience when migrating my application from Intel Xeon based 
servers to Sun UltraSparc T2 T5120 servers. Lucene fuzzy search just does not 
perform. A search which took approximately 500 ms takes more than 6 seconds to 
execute.

The index has about 100,000,000 records. So, I tried to split it into 10 
indices and used the ParallelSearcher on it, but still got similar results.

I am guessing that this is because the distance implementation used by Lucene 
requires higher clock speed and can't be parallelized much.

Please advice

-- Varun Dhussa
Product Architect
CE InfoSystems (P) Ltd
http://www.mapmyindia.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to