[ 
https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-1719:
--------------------------------

    Description: 
contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is 
faster than CollationKeyFilter, the JVM-provided java.text.Collator 
implementation in the same package.  The javadocs of these classes should be 
modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's 
comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300]
 on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x 
faster than CollationKeyFilter.

I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 
1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages 
(taken from the corresponding Debian wordlist packages and truncated to the 
first 90k words after a fixed random shuffling), using Collators at the default 
strength, on a Windows Vista 64-bit machine.  I used an analysis pipeline 
consisting of WhitespaceTokenizer chained to the collation key filter.  To 
isolate the time taken by the collation key filters, I also timed 
WhitespaceTokenizer operating alone for each combination.  The rightmost column 
represents the performance advantage of the ICU4J implementation (ICU) over the 
java.text.Collator implementation (JVM), after discounting the 
WhitespaceTokenizer time (WST): (JVM-ICU) / (ICU-WST).  The best times out of 5 
runs for each combination, in milliseconds, are as follows:

||Sun JVM||Language||java.text (ms)||ICU4J (ms)||WhitespaceTokenizer (ms)||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|156%|
|1.4.2_17 (32 bit)|French|716|243|14|207%|
|1.4.2_17 (32 bit)|German|669|264|16|163%|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|102%|
|1.5.0_15 (32 bit)|English|604|176|16|268%|
|1.5.0_15 (32 bit)|French|817|209|17|317%|
|1.5.0_15 (32 bit)|German|799|225|20|280%|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|145%|
|1.5.0_15 (64 bit)|English|431|89|10|433%|
|1.5.0_15 (64 bit)|French|562|112|11|446%|
|1.5.0_15 (64 bit)|German|567|116|13|438%|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|174%|
|1.6.0_13 (64 bit)|English|162|81|9|113%|
|1.6.0_13 (64 bit)|French|192|92|10|122%|
|1.6.0_13 (64 bit)|German|204|99|14|124%|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|39%|
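
The best-of-5 measurement described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark that produced the numbers in the table: it uses only the JDK's java.text.Collator (no Lucene classes and no ICU4J), it needs a modern Java version rather than the 1.4.2 and 1.5.0 JVMs measured here, and the class name CollationTiming, its methods, and the four-word list are all hypothetical placeholders. Subtracting a tokenizer-only baseline is not shown.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for the benchmark; JDK only, no Lucene or ICU4J. */
final class CollationTiming {

    /** Runs the task the given number of times and returns the smallest elapsed time in ms. */
    static long bestOfMillis(int runs, Runnable task) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            best = Math.min(best, (System.nanoTime() - start) / 1_000_000);
        }
        return best;
    }

    /** Generates a collation key for every word and returns the total key size in bytes. */
    static long keyAll(Collator collator, List<String> words) {
        long totalBytes = 0;
        for (String word : words) {
            totalBytes += collator.getCollationKey(word).toByteArray().length;
        }
        return totalBytes;
    }

    public static void main(String[] args) {
        // Placeholder input; the measurements above used 90k-word lists.
        List<String> words = Arrays.asList("cote", "côte", "coté", "côté");
        // Collator.getInstance returns a collator at its default strength.
        Collator collator = Collator.getInstance(Locale.FRENCH);
        long ms = bestOfMillis(5, () -> keyAll(collator, words));
        System.out.println("best of 5: " + ms + " ms");
    }
}
```

Taking the minimum over several runs, as the description does, reduces the influence of JIT warm-up and other one-off delays on the reported time.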


  was:
contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is 
faster than CollationKeyFilter, the JVM-provided java.text.Collator 
implementation in the same package.  The javadocs of these classes should be 
modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's 
comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300]
 on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x 
faster than CollationKeyFilter.

I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 
1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages 
(taken from the corresponding Debian wordlist packages and truncated to the 
first 90k words after a fixed random shuffling), using Collators at the default 
strength, on a Windows Vista 64-bit machine.  I used an analysis pipeline 
consisting of WhitespaceTokenizer chained to the collation key filter.  To 
isolate the time taken by the collation key filters, I also timed 
WhitespaceTokenizer operating alone for each combination.  The rightmost column 
represents the performance advantage of the ICU4J implementation (ICU) over the 
java.text.Collator implementation (JVM), after discounting the 
WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST). The best times out of 5 
runs for each combination, in milliseconds, are as follows:

||Sun JVM||Language||java.text (ms)||ICU4J (ms)||WhitespaceTokenizer (ms)||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|2.6x|
|1.4.2_17 (32 bit)|French|716|243|14|3.1x|
|1.4.2_17 (32 bit)|German|669|264|16|2.6x|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
|1.5.0_15 (32 bit)|English|604|176|16|3.7x|
|1.5.0_15 (32 bit)|French|817|209|17|4.2x|
|1.5.0_15 (32 bit)|German|799|225|20|3.8x|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
|1.5.0_15 (64 bit)|English|431|89|10|5.3x|
|1.5.0_15 (64 bit)|French|562|112|11|5.5x|
|1.5.0_15 (64 bit)|German|567|116|13|5.4x|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
|1.6.0_13 (64 bit)|English|162|81|9|2.1x|
|1.6.0_13 (64 bit)|French|192|92|10|2.2x|
|1.6.0_13 (64 bit)|German|204|99|14|2.2x|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|


        Summary: Add javadoc notes about ICUCollationKeyFilter's advantages 
over CollationKeyFilter  (was: Add javadoc notes about ICUCollationKeyFilter's 
speed advantage over CollationKeyFilter)

Edited title to reflect addition of key length concerns, and switched 
performance improvement column to be percentage improvements rather than 
multipliers.
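
As a worked check of that switch, the sketch below recomputes both forms of the column from the raw times. The multiplier values in the earlier table are consistent with (JVM-WST) / (ICU-WST), and the percentages with (JVM-ICU) / (ICU-WST), so the percentage is simply the multiplier minus one, times 100. The class and method names are hypothetical and not part of Lucene; rounding to a whole percent is an assumption that happens to reproduce the table.

```java
/** Hypothetical helper, not part of Lucene: recomputes the "ICU4J Improvement" column. */
final class ImprovementColumn {

    /** Multiplier form: (JVM - WST) / (ICU - WST), shown as e.g. "2.6x" after rounding. */
    static double multiplier(long jvmMs, long icuMs, long wstMs) {
        return (double) (jvmMs - wstMs) / (icuMs - wstMs);
    }

    /** Percentage form: (JVM - ICU) / (ICU - WST), rounded to a whole percent. */
    static long percent(long jvmMs, long icuMs, long wstMs) {
        return Math.round(100.0 * (jvmMs - icuMs) / (icuMs - wstMs));
    }

    public static void main(String[] args) {
        // Row "1.4.2_17 (32 bit), English": java.text = 522, ICU4J = 212, WST = 13.
        long jvm = 522, icu = 212, wst = 13;
        System.out.println("multiplier: " + multiplier(jvm, icu, wst));
        System.out.println("percent:    " + percent(jvm, icu, wst) + "%");
    }
}
```

For the first row this gives (522-212) / (212-13) = 310/199, which rounds to 156%, matching the table.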

> Add javadoc notes about ICUCollationKeyFilter's advantages over 
> CollationKeyFilter
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-1719
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1719
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.4.1
>            Reporter: Steven Rowe
>            Priority: Trivial
>             Fix For: 2.9
>
>         Attachments: LUCENE-1719.patch
>
>
