Hey! I am working on a Lucene-based service for reverse geocoding. We have a large index with many unique terms (550 million), and it appears we're running into memory issues on our leaf servers because the term dictionary for the entire index is being loaded into heap space. If we allocate > 65 GB of heap, our queries return relatively quickly (10s to 100s of ms), but if we drop below ~65 GB of heap on the leaf nodes, query latency increases dramatically, quickly hitting 20+ seconds (our test harness times out at 20s).
I did some research and found that in past versions of Lucene, one could split the loading of the term dictionary using the 'termInfosIndexDivisor' option on the DirectoryReader class. That option was deprecated in Lucene 5.0.0 <https://abi-laboratory.pro/java/tracker/changelog/lucene/5.0.0/log.html> in favor of using codecs to achieve similar functionality. Looking at the available experimental codecs, I see BlockTreeTermsWriter <https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.html#BlockTreeTermsWriter(org.apache.lucene.index.SegmentWriteState,%20org.apache.lucene.codecs.PostingsWriterBase,%20int,%20int)>, which seems like it could serve a similar purpose: breaking the term dictionary into blocks so that we don't load the whole thing into heap space. Has anyone run into this problem before and found an effective solution? Does changing the codec seem appropriate for this issue? If so, how do I go about loading an alternative codec and configuring it to my needs? I'm having trouble finding docs/examples of how this is used in the real world, so even a pointer to a repo or docs somewhere would be appreciated. Thanks! Best, Tom Hirschfeld
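For context, here is roughly what I'm imagining after reading the 5.3 javadocs: subclass the default codec and return a postings format with larger term-dictionary block sizes for every field. This is untested, and the block-size values (100/200) are guesses on my part, not recommendations:

```java
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat;
import org.apache.lucene.codecs.lucene53.Lucene53Codec;

// Untested sketch: larger blocks should mean a smaller on-heap terms
// index (fewer index entries), at the cost of more scanning per lookup.
// The Lucene50PostingsFormat(min, max) constructor passes these through
// to BlockTreeTermsWriter; defaults are 25/48.
public class BigBlockCodec extends Lucene53Codec {

  private final PostingsFormat postings = new Lucene50PostingsFormat(100, 200);

  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    return postings;
  }
}

// At write time I'd presumably wire it in with something like:
//   IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
//   iwc.setCodec(new BigBlockCodec());
// and then reindex (or force-merge) so segments get rewritten.
```

My understanding from the docs is that the block structure is recorded in the segment files at write time, so readers wouldn't need anything special to open the index, but I'm not certain about that, nor whether this actually moves enough of the terms index off-heap to help at our scale. Corrections welcome.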