On Fri, Mar 19, 2010 at 5:42 PM, Toke Eskildsen <t...@statsbiblioteket.dk> 
wrote:

> It sounds like I'm missing something here... A quick check of running 20000 
> random Strings of 30 characters from a-zA-Z0-1 + 20 different national 
> characters through Java's Collator returned an average collatorKey-length of 
> 175 bytes. On http://wiki.apache.org/solr/UnicodeCollation it is stated that 
> a standard sort is used, which - to my knowledge - loads the Strings into 
> memory. For my quick test, this means a tripling of memory usage for the sort 
> field when indexing collatorKeys?
>

Right, JDK collation sucks. Use ICU for collation keys too:
http://site.icu-project.org/charts/collation-icu4j-sun
At 1.59 bytes/char, ICU keys are smaller than UTF-16.
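
For anyone who wants to reproduce the kind of measurement quoted above, here
is a minimal sketch using only the JDK's java.text.Collator. The class name,
the method names, the seed, the Danish locale, and the exact alphabet (a-z,
A-Z, 0-9 plus 20 Latin-1 letters standing in for the "national characters")
are assumptions for illustration, not the original test, so the numbers it
prints will not necessarily match the 175 bytes reported.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

public class CollationKeySize {

    // Assumed alphabet: a-z, A-Z, 0-9 plus 20 accented Latin-1 letters
    // (U+00C0..U+00D3) as a stand-in for "national characters".
    static String buildAlphabet() {
        StringBuilder sb = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) sb.append(c);
        for (char c = 'A'; c <= 'Z'; c++) sb.append(c);
        for (char c = '0'; c <= '9'; c++) sb.append(c);
        for (char c = '\u00C0'; c <= '\u00D3'; c++) sb.append(c);
        return sb.toString();
    }

    // Generates `count` pseudo-random strings of `length` chars each.
    public static List<String> randomStrings(int count, int length, long seed) {
        String alphabet = buildAlphabet();
        Random random = new Random(seed);
        List<String> result = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder(length);
            for (int j = 0; j < length; j++) {
                sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
            }
            result.add(sb.toString());
        }
        return result;
    }

    // Average size in bytes of the binary sort key the collator produces.
    public static double averageKeyBytes(Collator collator, List<String> strings) {
        long total = 0;
        for (String s : strings) {
            CollationKey key = collator.getCollationKey(s);
            total += key.toByteArray().length;
        }
        return (double) total / strings.size();
    }

    public static void main(String[] args) {
        int length = 30;
        List<String> strings = randomStrings(20000, length, 42L);
        // Locale is an assumption; pick whichever one you actually sort by.
        Collator collator = Collator.getInstance(Locale.forLanguageTag("da"));
        double avg = averageKeyBytes(collator, strings);
        System.out.printf("average key: %.1f bytes (%.2f bytes/char)%n",
                avg, avg / length);
    }
}
```

To compare against ICU, the same loop should work with
com.ibm.icu.text.Collator and com.ibm.icu.text.CollationKey from ICU4J in
place of the java.text classes, since they expose getCollationKey() and
toByteArray() as well. That needs the ICU4J jar on the classpath, so it is
left out of the sketch.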


-- 
Robert Muir
rcm...@gmail.com
