Add javadoc notes about ICUCollationKeyFilter's speed advantage over
CollationKeyFilter
---------------------------------------------------------------------------------------
Key: LUCENE-1719
URL: https://issues.apache.org/jira/browse/LUCENE-1719
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/*
Affects Versions: 2.4.1
Reporter: Steven Rowe
Priority: Trivial
Fix For: 2.9
contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is
faster than CollationKeyFilter, the JVM-provided java.text.Collator
implementation in the same package. The javadocs of these classes should be
modified to add a note to this effect.
My curiosity was piqued by [Robert Muir's
comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300]
on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x
faster than CollationKeyFilter.
I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit,
1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages
(taken from the corresponding Debian wordlist packages and truncated to the
first 90k words after a fixed random shuffling), using Collators at the default
strength, on a Windows Vista 64-bit machine. I used an analysis pipeline
consisting of WhitespaceTokenizer chained to the collation key filter. To
isolate the time taken by the collation key filters, I also timed
WhitespaceTokenizer operating alone for each combination, and then subtracted
that time from the time of each collation key analysis chain. The rightmost
column represents the performance advantage of the ICU4J implementation (ICU)
over the java.text.Collator implementation (JVM), after discounting the
WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST). The best times out of 5
runs for each combination, in milliseconds, are as follows:
||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|2.6x|
|1.4.2_17 (32 bit)|French|716|243|14|3.1x|
|1.4.2_17 (32 bit)|German|669|264|16|2.6x|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
|1.5.0_15 (32 bit)|English|604|176|16|3.7x|
|1.5.0_15 (32 bit)|French|817|209|17|4.2x|
|1.5.0_15 (32 bit)|German|799|225|20|3.8x|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
|1.5.0_15 (64 bit)|English|431|89|10|5.3x|
|1.5.0_15 (64 bit)|French|562|112|11|5.5x|
|1.5.0_15 (64 bit)|German|567|116|13|5.4x|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
|1.6.0_13 (64 bit)|English|162|81|9|2.1x|
|1.6.0_13 (64 bit)|French|192|92|10|2.2x|
|1.6.0_13 (64 bit)|German|204|99|14|2.2x|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|
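
For anyone who wants to check the rightmost column, the figures in the table are reproduced by dividing the filter-only times, (JVM - WST) / (ICU - WST). The following is only a minimal sketch of that arithmetic. The class name CollationSpeedup and its methods are hypothetical helpers written for this note, not part of Lucene or contrib/collation, and the sketch does not rerun the benchmark itself.

```java
/** Hypothetical helper (not part of Lucene) that recomputes the "ICU4J Improvement" column. */
class CollationSpeedup {

    /**
     * Ratio of filter-only times: (JVM - WST) / (ICU - WST).
     * All arguments are best-of-5 wall-clock times in milliseconds.
     */
    static double improvement(long jvmMs, long icuMs, long wstMs) {
        long jvmOnly = jvmMs - wstMs; // java.text.Collator filter time without tokenizing
        long icuOnly = icuMs - wstMs; // ICU4J filter time without tokenizing
        if (icuOnly <= 0) {
            throw new IllegalArgumentException("ICU time must exceed the tokenizer-only time");
        }
        return (double) jvmOnly / icuOnly;
    }

    /** Formats a ratio with one decimal place, in the style used by the table. */
    static String format(double ratio) {
        return String.format(java.util.Locale.ROOT, "%.1fx", ratio);
    }

    public static void main(String[] args) {
        // {java.text, ICU4J, WhitespaceTokenizer} for the four English rows of the table
        long[][] english = { {522, 212, 13}, {604, 176, 16}, {431, 89, 10}, {162, 81, 9} };
        for (long[] row : english) {
            System.out.println(format(improvement(row[0], row[1], row[2])));
        }
    }
}
```

For the first English row this gives 509 / 199, which rounds to the 2.6x shown in the table.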