[jira] Issue Comment Edited: (LUCENE-2230) Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.

Fuad Efendi (JIRA) Tue, 09 Feb 2010 13:36:00 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829163#action_12829163
 ]


Fuad Efendi edited comment on LUCENE-2230 at 2/9/10 9:35 PM:
-------------------------------------------------------------

After long-run load-stress tests...

I used 2 boxes, one with SOLR, the other with a simple multithreaded stress 
simulator (with randomly generated fuzzy query samples); each box is 2x AMD 
Opteron 2350 (8 cores per box); 64-bit.

I disabled all SOLR caches except the Document Cache (I wanted isolated tests 
that ignore the time taken by disk I/O to load documents).

Performance rose with the number of load-stress threads (on the "client" 
computer), then dropped: 

9 Threads:
==========
TPS: 200 - 210
Response: 45 - 50 (ms)

10 Threads:
===========
TPS: 200 - 215
Response: 45 - 55 (ms)

12 Threads:
===========
TPS: 180 - 220
Response: 50 - 90 (ms)
 
16 Threads:
===========
TPS: 60 - 65
Response: 230 - 260 (ms)
 

This can be explained by CPU-bound processing with 8 cores available: the "top" 
command on the SOLR instance showed 750% - 790% CPU time (8 cores) at the 3rd 
step (12 stress threads), and 200% at the 4th step (16 stress threads), 
probably due to network I/O, Tomcat internals, etc.
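
As a rough cross-check of the figures above, Little's law for a closed-loop 
test says throughput is approximately the number of concurrent clients divided 
by the mean response time. The sketch below assumes (this is not stated in the 
comment) that each stress thread sends its next query as soon as the previous 
one returns; the class and method names are hypothetical.

```java
// Sanity check of the reported figures using Little's law for a closed-loop
// load test: throughput = concurrent clients / mean response time.
// Assumption: no think time between requests on each stress thread.
final class ClosedLoopThroughput {

    /** Expected transactions per second for a thread count and a latency in ms. */
    static double expectedTps(int threads, double responseMillis) {
        if (threads <= 0 || responseMillis <= 0) {
            throw new IllegalArgumentException("threads and responseMillis must be positive");
        }
        return threads * 1000.0 / responseMillis;
    }

    public static void main(String[] args) {
        // {threads, fastest response (ms), slowest response (ms)} from the runs above.
        int[][] runs = { {9, 45, 50}, {10, 45, 55}, {12, 50, 90}, {16, 230, 260} };
        for (int[] r : runs) {
            // The slowest response gives the lower TPS bound, the fastest the upper.
            System.out.printf("%2d threads: expected TPS %.0f - %.0f%n",
                    r[0], expectedTps(r[0], r[2]), expectedTps(r[0], r[1]));
        }
    }
}
```

Under that assumption the 16-thread run works out to roughly 62 - 70 TPS, close 
to the measured 60 - 65, so the drop in throughput is consistent with the jump 
in response time.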

It's better to have Apache HTTPD in front of SOLR in production, with proxy_ajp 
(persistent connections) and HTTP caching enabled; and fine-tune Tomcat threads 
according to use case.

BTW, my best counters for default SOLR/Lucene were:
TPS: 12
Response: 750ms

"Fuzzy" queries were tuned such a way that distance threshold was less than or 
equal to two. I used the "StrikeAMatch" distance...
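
For readers unfamiliar with it, the "Strike a Match" measure described in the 
catalysoft.com article linked in the issue description compares the adjacent 
letter pairs of two strings. Below is a minimal sketch of that similarity score 
in Java. It is not the attached DistanceImpl.java, the class and method names 
are hypothetical, and how the score is mapped to the integer distance used with 
the threshold above is left out because the comment does not say.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Letter-pair similarity in the style of the "Strike a Match" article.
// Hypothetical sketch; returns a score in [0, 1], not an integer distance.
final class StrikeAMatchSketch {

    /** Adjacent character pairs of every whitespace-separated word, upper-cased. */
    static List<String> letterPairs(String text) {
        List<String> pairs = new ArrayList<>();
        for (String word : text.toUpperCase(Locale.ROOT).trim().split("\\s+")) {
            for (int i = 0; i + 1 < word.length(); i++) {
                pairs.add(word.substring(i, i + 2));
            }
        }
        return pairs;
    }

    /** 2 * |shared pairs| / (|pairs of a| + |pairs of b|). */
    static double similarity(String a, String b) {
        List<String> pairsA = letterPairs(a);
        List<String> pairsB = letterPairs(b);
        int total = pairsA.size() + pairsB.size();
        if (total == 0) {
            // Neither string has a letter pair; fall back to plain equality.
            return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
        }
        int shared = 0;
        for (String pair : pairsA) {
            // Remove each matched pair so repeated pairs are counted only once.
            if (pairsB.remove(pair)) {
                shared++;
            }
        }
        return 2.0 * shared / total;
    }

    public static void main(String[] args) {
        System.out.println(similarity("France", "French"));
        System.out.println(similarity("Healed", "Sealed"));
    }
}
```

For example, "France" and "French" share the pairs FR and NC out of 5 + 5 
pairs, giving 2 * 2 / 10 = 0.4.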

Thanks,
http://www.tokenizer.ca
+1 416-993-2060(cell)

P.S.
Before performing load-stress tests, I established the baseline in my 
environment: 1500 TPS by pinging http://x.x.x.x:8080/apache-solr-1.4/admin/ 
(static JSP).
And I reached 220 TPS for fuzzy search, starting from 12-15 TPS (default 
Lucene/SOLR)...

P.P.S.
Distance function must follow 3 'axioms':
{code}
D(a,a) = 0
D(a,b) = D(b,a)
D(a,b) + D(b,c) >= D(a,c)
{code}

The function must also return an integer value.

Otherwise, BKTree will produce wrong results. 
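
To illustrate why the axioms matter, here is a minimal BK-tree sketch in Java. 
It is not the attached BKTree.java; the names are hypothetical, and plain 
Levenshtein distance is used as the integer metric for simplicity. The search 
prunes child edges using the triangle inequality, which is the step that can 
silently skip matches if the distance function violates the axioms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical BK-tree sketch; children are keyed by their integer distance
// to the parent term.
final class BKTreeSketch {

    /** Levenshtein edit distance: an integer metric satisfying the three axioms. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    private static final class Node {
        final String term;
        final Map<Integer, Node> children = new HashMap<>();
        Node(String term) { this.term = term; }
    }

    private Node root;

    /** Inserts a term by walking the edge labelled with its distance at each node. */
    void add(String term) {
        if (root == null) { root = new Node(term); return; }
        Node node = root;
        while (true) {
            int d = levenshtein(term, node.term);
            if (d == 0) return; // term already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(term)); return; }
            node = child;
        }
    }

    /** Returns all stored terms within maxDistance of the query. */
    List<String> search(String query, int maxDistance) {
        List<String> hits = new ArrayList<>();
        if (root != null) collect(root, query, maxDistance, hits);
        return hits;
    }

    private static void collect(Node node, String query, int max, List<String> hits) {
        int d = levenshtein(query, node.term);
        if (d <= max) hits.add(node.term);
        // Triangle inequality: a match under edge k requires d - max <= k <= d + max,
        // so every other subtree can be skipped without computing any distance.
        for (int k = Math.max(1, d - max); k <= d + max; k++) {
            Node child = node.children.get(k);
            if (child != null) collect(child, query, max, hits);
        }
    }

    public static void main(String[] args) {
        BKTreeSketch tree = new BKTreeSketch();
        for (String t : new String[] {"book", "books", "cake", "boo", "cape", "cart", "boon", "cook"}) {
            tree.add(t);
        }
        System.out.println(tree.search("book", 1));
    }
}
```

With the eight sample terms in main, a search for "book" with threshold 1 finds 
book, books, boo, boon and cook, and never descends into the subtree holding 
cake, cape and cart.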


Also, it's mentioned somewhere in the Levenshtein algorithm Javadocs (in the 
contrib folder, I believe) that an instance method runs faster than a static 
method; this needs testing with Java 6... most probably 'yes', depending on the 
JVM implementation; I can only guess that CPU internals are better optimized 
for instance methods...

  
> Lucene Fuzzy Search: BK-Tree can improve performance 3-20 times.
> ----------------------------------------------------------------
>
>                 Key: LUCENE-2230
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2230
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0
>         Environment: Lucene currently uses brute force full-terms scanner and 
> calculates distance for each term. New BKTree structure improves performance 
> on average 20 times when distance is 1, and 3 times when distance is 3. I 
> tested with an index of several million docs and 250,000 terms. 
> New algo uses integer distances between objects.
>            Reporter: Fuad Efendi
>         Attachments: BKTree.java, Distance.java, DistanceImpl.java, 
> FuzzyTermEnumNEW.java, FuzzyTermEnumNEW.java
>
>   Original Estimate: 0.02h
>  Remaining Estimate: 0.02h
>
> W. Burkhard and R. Keller. Some approaches to best-match file searching, 
> CACM, 1973
> http://portal.acm.org/citation.cfm?doid=362003.362025
> I was inspired by 
> http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees (Nick 
> Johnson, Google).
> Additionally, the simplified algorithm at 
> http://www.catalysoft.com/articles/StrikeAMatch.html seems to be much more 
> logically correct than Levenshtein distance, and it is 3-5 times faster 
> (isolated tests).
> A big list of distance implementations:
> http://www.dcs.shef.ac.uk/~sam/stringmetrics.htm
