On Mon, 2010-10-25 at 09:41 +0200, Mathias Walter wrote:
> [...] I enabled the field cache for my ID field and another
> single char field (PAS type) to get the benefit of accessing
> the fields with an array. Unfortunately, the IDs are too
> large to fit in memory. I gave 12 GB of RAM to each node and
> also tried to use the MMapDirectory and/or CompressedOops.
> Lucene always runs out of memory.

That is a known problem with Lucene 3.x and earlier. The cache uses
Strings for the terms, which carry a lot of overhead. As you discovered,
reducing the length of the IDs does not help much.

[Encoding ID as 11 stored bytes]

> Recently I upgraded to trunk (4.0) and tried to use the ByteRefs
> from FieldCache.DEFAULT.getTerms directly. But the bytes are
> encoded in an unknown form (unknown to me) and cannot be decoded
> with IndexableBinaryStringTools.decode.

It depends on what you put into it, but if you represent your IDs as
normal Strings at index time, they will be stored in UTF-8 encoding.
Since you're using 11 ASCII characters for an ID, this means 11 bytes.
You can get your Strings back by calling myBytesRef.utf8ToString().

The overhead for BytesRefs is a lot lower than for Strings, so simply
indexing your IDs and using the field cache might solve your problem
when you're using trunk.

- Toke
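
A minimal sketch of the byte-count point above, using only the JDK and
no Lucene calls. The class name, helper names and the ID value are
made up for illustration; the decode helper merely mirrors what a UTF-8
decode of the term bytes (such as BytesRef.utf8ToString()) would give
back, assuming the ID was indexed as a plain String.

```java
import java.nio.charset.StandardCharsets;

class Utf8IdDemo {
    // Encode an ID as UTF-8 bytes, the form described above for
    // String terms in trunk.
    static byte[] encode(String id) {
        return id.getBytes(StandardCharsets.UTF_8);
    }

    // Decode a slice of UTF-8 bytes back into a String.
    static String decode(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String id = "AB123456789"; // hypothetical 11-character ASCII ID
        byte[] utf8 = encode(id);
        // ASCII characters take one byte each in UTF-8.
        System.out.println(utf8.length);
        System.out.println(decode(utf8, 0, utf8.length));
    }
}
```

Non-ASCII characters take more than one byte each in UTF-8, so the
one-byte-per-character figure only holds while the IDs stay pure ASCII.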