From: Robert Muir [rcm...@gmail.com]: > Toke, only partially-on-topic here, is it possible to describe your > use-case a little more where its preferable to use this Locale-based > sort instead of indexing collation keys (e.g. you have to support so > many locales this would be too much indexing overhead?)
My original use case was to avoid the memory overhead: Looking at our current index, we have ~7.5M documents with ~7M unique titles. They take up about 362MB as UTF-8 bytes, which translates to a neat 1GB of RAM as Java Strings. That's 1GB less heap for other stuff for us, plus a sort is fairly slow. Indexing collation keys only helps with the speed problem. Using Exposed, the permanent overhead would be 40MB with a temporary overhead of 60-80MB for building and with far greater sort speed. However, it is not set in stone that we will shift to using Exposed or similar: As many others we're pursuing real-time indexing and while Exposed sits at the segment-level and thus works well for re-open, big segment changes still occur from time to time. It is hard to see a way around this, unless the initial build time can be optimized further (a co-worker gave a great suggestion today and I think I know another way to shave 30-40% of). Still, for Exposed to be the default choice, we'd need to get down to a few seconds for million-document segments. It would be possible to generate the Exposed structure at indexing time, but since it works on the whole term array (okay, the part to sort) of the segment, a small incremental update would require rebuilding the whole structure for the changes segments. In reality it would just move the same penalty from open to index. > For reference, I tried to give some various tricks for using the > Collation package here (admittedly for Solr, but the same principles > apply): http://wiki.apache.org/solr/UnicodeCollation I can see why this will give faster sorts over just indexing the Strings at the cost of memory footprint. > Separately, if there is some reason you really need to use the > Locale-based runtime sort, perhaps we should rethink ways to allow the > ICU collator to be used instead of the JDK one, somehow. I see no reason not to support the ICU collator. As long as it can compare two Strings, I am not at all religious about which comparator to use. Currently the seatch API works both ways, providing methods both for simple Locale and for custom sorting, so I guess it would be the same for Exposed. Regards, Toke Eskildsen --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org