[
https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724941#action_12724941
]
Robert Muir commented on LUCENE-1719:
-------------------------------------
Steven, you are correct.
I should have clarified: the gain is not as large when generating keys, but
there are still huge gains for runtime comparison. See recent numbers here for a
few languages:
http://site.icu-project.org/charts/collation-icu4j-sun
You should also mention that the key size is smaller too (smaller term
dictionary)!
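To make the distinction between key generation and runtime comparison concrete, here is a minimal sketch using only the JDK's java.text.Collator, so that it runs without extra dependencies. The class name, method names, and sample words are hypothetical. ICU4J's com.ibm.icu.text.Collator offers analogous compare and getCollationKey methods, but it is not used here, and nothing in this sketch measures speed.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; names and sample words are illustrative only.
final class CollationKeyDemo {

    // Runtime comparison: the collator examines both strings on every call.
    static int compareAtRuntime(Collator collator, String a, String b) {
        return Integer.signum(collator.compare(a, b));
    }

    // Key-based comparison: each string is converted once into a binary
    // sort key, and the keys are then compared directly.
    static int compareByKey(Collator collator, String a, String b) {
        CollationKey keyA = collator.getCollationKey(a);
        CollationKey keyB = collator.getCollationKey(b);
        return Integer.signum(keyA.compareTo(keyB));
    }

    // Size in bytes of the sort key generated for a string. Keys like this
    // are what a collation key filter would index as terms.
    static int keySize(Collator collator, String s) {
        return collator.getCollationKey(s).toByteArray().length;
    }

    public static void main(String[] args) {
        // Default strength, as in the timings reported on this issue.
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        String a = "apple";
        String b = "Banana";
        System.out.println("runtime compare:  " + compareAtRuntime(collator, a, b));
        System.out.println("key compare:      " + compareByKey(collator, a, b));
        System.out.println("key size (bytes): " + keySize(collator, b));
    }
}
```

Both comparison paths should agree on ordering. They differ in where the cost is paid: once per string for keys, or once per comparison at runtime.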
> Add javadoc notes about ICUCollationKeyFilter's speed advantage over
> CollationKeyFilter
> ---------------------------------------------------------------------------------------
>
> Key: LUCENE-1719
> URL: https://issues.apache.org/jira/browse/LUCENE-1719
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/*
> Affects Versions: 2.4.1
> Reporter: Steven Rowe
> Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1719.patch
>
>
> contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is
> faster than CollationKeyFilter, the JVM-provided java.text.Collator
> implementation in the same package. The javadocs of these classes should be
> modified to add a note to this effect.
> My curiosity was piqued by [Robert Muir's
> comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300]
> on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x
> faster than CollationKeyFilter.
> I timed the operation of these two classes, with Sun JVM versions
> 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of
> 4 languages (taken from the corresponding Debian wordlist packages and
> truncated to the first 90k words after a fixed random shuffling), using
> Collators at the default strength, on a Windows Vista 64-bit machine. I used
> an analysis pipeline consisting of WhitespaceTokenizer chained to the
> collation key filter, so to isolate the time taken by the collation key
> filters, I also timed WhitespaceTokenizer operating alone for each
> combination. The rightmost column represents the performance advantage of
> the ICU4J implementation (ICU) over the java.text.Collator implementation
> (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) /
> (ICU-WST). The best times out of 5 runs for each combination, in
> milliseconds, are as follows:
> ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
> |1.4.2_17 (32 bit)|English|522|212|13|2.6x|
> |1.4.2_17 (32 bit)|French|716|243|14|3.1x|
> |1.4.2_17 (32 bit)|German|669|264|16|2.6x|
> |1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
> |1.5.0_15 (32 bit)|English|604|176|16|3.7x|
> |1.5.0_15 (32 bit)|French|817|209|17|4.2x|
> |1.5.0_15 (32 bit)|German|799|225|20|3.8x|
> |1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
> |1.5.0_15 (64 bit)|English|431|89|10|5.3x|
> |1.5.0_15 (64 bit)|French|562|112|11|5.5x|
> |1.5.0_15 (64 bit)|German|567|116|13|5.4x|
> |1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
> |1.6.0_13 (64 bit)|English|162|81|9|2.1x|
> |1.6.0_13 (64 bit)|French|192|92|10|2.2x|
> |1.6.0_13 (64 bit)|German|204|99|14|2.2x|
> |1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|
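
As a check on the rightmost column, the values in the table are consistent with dividing the java.text time by the ICU4J time after subtracting the WhitespaceTokenizer time from each. The sketch below recomputes that ratio for two rows. The class and method names are hypothetical, and the inputs are copied from the table rather than measured.

```java
import java.util.Locale;

// Hypothetical helper; it recomputes the "ICU4J Improvement" column from
// the millisecond timings listed in the table above.
final class IcuImprovement {

    // Ratio of JVM collation time to ICU4J collation time, each with the
    // tokenizer-only time (WST) subtracted to isolate the key filter cost.
    static double improvement(long jvmMs, long icuMs, long wstMs) {
        return (double) (jvmMs - wstMs) / (double) (icuMs - wstMs);
    }

    // Formats the ratio the way the table does, e.g. "2.6x".
    static String format(double ratio) {
        return String.format(Locale.ROOT, "%.1fx", ratio);
    }

    public static void main(String[] args) {
        // 1.4.2_17 (32 bit), English: java.text=522, ICU4J=212, WST=13
        System.out.println(format(improvement(522, 212, 13))); // prints 2.6x
        // 1.6.0_13 (64 bit), last row: java.text=273, ICU4J=202, WST=21
        System.out.println(format(improvement(273, 202, 21))); // prints 1.4x
    }
}
```

Subtracting the tokenizer time matters most where the filter times are small. On the 1.6.0_13 rows, for example, the tokenizer accounts for roughly a tenth of the ICU4J total.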