On Mon, 2010-10-25 at 09:41 +0200, Mathias Walter wrote:
> [...] I enabled the field cache for my ID field and another
> single char field (PAS type) to get the benefit of accessing
> the fields with an array. Unfortunately, the IDs are too
> large to fit in memory. I gave 12 GB of RAM to each node and
> also tried to use the MMapDirectory and/or CompressedOops.
> Lucene always runs out of memory.

That is a known problem with Lucene 3.x and earlier. The field cache
holds the terms as Strings, which carry a lot of per-object overhead.
As you discovered, shortening the IDs does not help much.

[Encoding ID as 11 stored bytes]

> Recently I upgraded to trunk (4.0) and tried to use the ByteRefs
> from FieldCache.DEFAULT.getTerms directly. But the bytes are
> encoded in an unknown form (unknown to me) and cannot be decoded
> with IndexableBinaryStringTools.decode.

It depends on what you put into it, but if you represent your IDs as
normal Strings at index time, they will be stored in UTF-8 encoding.
Since you're using 11 ASCII characters for an ID, this means 11 bytes.
You can get your Strings back by calling myBytesRef.utf8ToString().
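
To illustrate the byte count, here is a minimal sketch that uses only the
JDK, with no Lucene on the classpath. The class name, the helper names and
the sample ID are made up for the example. The decode helper only mirrors
what I would expect utf8ToString() to do with the bytes, offset and length
of a BytesRef; it is not Lucene's own code.

```java
import java.nio.charset.StandardCharsets;

public class Utf8IdDemo {
    // Encode an ID as UTF-8 bytes, the encoding used for String terms.
    static byte[] encode(String id) {
        return id.getBytes(StandardCharsets.UTF_8);
    }

    // Decode a slice of a byte array back into a String. This is an
    // assumed stand-in for BytesRef.utf8ToString(), not the Lucene code.
    static String decode(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String id = "A1B2C3D4E5F"; // hypothetical 11-character ASCII ID
        byte[] utf8 = encode(id);
        System.out.println(utf8.length);                  // prints 11
        System.out.println(decode(utf8, 0, utf8.length)); // prints A1B2C3D4E5F
    }
}
```

Each ASCII character takes exactly one byte in UTF-8, so the round trip is
lossless and the byte length equals the character count. IDs containing
non-ASCII characters would take more bytes.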

The overhead for BytesRefs is a lot lower than for Strings, so simply
indexing your IDs and using the field cache might solve your problem
when you're using trunk.

- Toke
