Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

Michael Stoppelman Wed, 18 Feb 2009 08:59:55 -0800

Fuzzy search tends to be super heavy on CPU because of the Levenstein
distance algo. We use it for a small index 60MB for spell correcting and our
QPS suffers as a result.


There was recently a discussion of a new fuzzy algorithm:
https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661286#action_12661286

M

On Wed, Feb 18, 2009 at 5:59 AM, Varun Dhussa <[email protected]> wrote:

> The method suggested would make the speed faster, but I doubt whether it
> would be substantial on processors with slower clock speed. Keeping in mind
> that most processors are going multi-core, it would make sense to
> multi-thread the scan.
>
> Any remarks are welcome!
>
> Varun Dhussa
> Product Architect
> CE InfoSystems (P) Ltd
> http://www.mapmyindia.com
>
>
>
> mark harwood wrote:
>
>> I was having some thoughts recently about speeding up fuzzy search.
>>
>> The current system does edit-distance on all terms A-Z, single threaded.
>> Prefix length can reduce the search space and there is a "minimum
>> similarity" threshold but that's roughly where we are. Multithreading this
>> to make use of multiple CPUs is one option to look at but I was mainly
>> thinking about smarter ways to do the fuzzy scan:
>>
>> I had the notion that we could move to a solution where a priority queue
>> keeps the "best matches so far" and as you progress through the termEnum you
>> could bail out of edit distance calculations quickly using a rough(cheap)
>> assessment of if the current term is likely to make the cut (i.e. beat the
>> current lowest score in the priority queue). It would make sense to populate
>> the priority queue ASAP with terms that are most likely to be the best
>> matches and these will be the ones that share a reasonable length prefix.
>> As an example - searching for Obama~
>>
>> 1) Create "best matches" priority queue
>> 2) Scan all terms from oba to obz populating priority queue
>> 3) Scan all terms from "a" to "oba" and "obz" to "z", exiting quickly if
>> the term fails to meet lowest score in the priority queue.
>>
>> How we "exit quickly" and how we determine what prefix to use in 2) are to
>> be determined but the principle seems reasonable
>>
>> Thoughts?
>>
>>
>>
>>
>> ----- Original Message ----
>> From: Varun Dhussa <[email protected]>
>> To: [email protected]
>> Sent: Wednesday, 18 February, 2009 10:36:07
>> Subject: Lucene search performance on Sun UltraSparc T2 (T5120) servers
>>
>> Hi,
>>
>> I have had a bad experience when migrating my application from Intel Xeon
>> based servers to Sun UltraSparc T2 T5120 servers. Lucene fuzzy search just
>> does not perform. A search which took approximately 500 ms takes more than 6
>> seconds to execute.
>>
>> The index has about 100,000,000 records. So, I tried to split it into 10
>> indices and used the ParallelSearcher on it, but still got similar results.
>>
>> I am guessing that this is because the distance implementation used by
>> Lucene requires higher clock speed and can't be parallelized much.
>>
>> Please advice
>>
>> -- Varun Dhussa
>> Product Architect
>> CE InfoSystems (P) Ltd
>> http://www.mapmyindia.com
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

Reply via email to