RE: Sorting with little memory: A suggestion

Toke Eskildsen Fri, 19 Mar 2010 10:46:38 -0700

From: Robert Muir [rcm...@gmail.com]:
> Toke, only partially-on-topic here, is it possible to describe your
> use-case a little more where its preferable to use this Locale-based
> sort instead of indexing collation keys (e.g. you have to support so
> many locales this would be too much indexing overhead?)


My original use case was to avoid the memory overhead: Looking at our current 
index, we have ~7.5M documents with ~7M unique titles. They take up about 362MB 
as UTF-8 bytes, which translates to a neat 1GB of RAM as Java Strings. That's 
1GB less heap for other stuff for us, plus a sort is fairly slow. Indexing 
collation keys only helps with the speed problem. Using Exposed, the permanent 
overhead would be 40MB with a temporary overhead of 60-80MB for building and 
with far greater sort speed.

However, it is not set in stone that we will shift to using Exposed or similar: 
As many others we're pursuing real-time indexing and while Exposed sits at the 
segment-level and thus works well for re-open, big segment changes still occur 
from time to time. It is hard to see a way around this, unless the initial 
build time can be optimized further (a co-worker gave a great suggestion today 
and I think I know another way to shave 30-40% of). Still, for Exposed to be 
the default choice, we'd need to get down to a few seconds for million-document 
segments.

It would be possible to generate the Exposed structure at indexing time, but 
since it works on the whole term array (okay, the part to sort) of the segment, 
a small incremental update would require rebuilding the whole structure for the 
changes segments. In reality it would just move the same penalty from open to 
index.

> For reference, I tried to give some various tricks for using the
> Collation package here (admittedly for Solr, but the same principles
> apply): http://wiki.apache.org/solr/UnicodeCollation

I can see why this will give faster sorts over just indexing the Strings at the 
cost of memory footprint.

> Separately, if there is some reason you really need to use the
> Locale-based runtime sort, perhaps we should rethink ways to allow the
> ICU collator to be used instead of the JDK one, somehow.

I see no reason not to support the ICU collator. As long as it can compare two 
Strings, I am not at all religious about which comparator to use. Currently the 
seatch API works both ways, providing methods both for simple Locale and for 
custom sorting, so I guess it would be the same for Exposed.

Regards,
Toke Eskildsen
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

RE: Sorting with little memory: A suggestion

Reply via email to