Hey! I am working on a Lucene-based service for reverse geocoding. We have a large index with many unique terms (550 million), and it appears we're running into memory issues on our leaf servers because the term dictionary for the entire index is being loaded into heap space. If we allocate > 65 GB of heap, our queries return relatively quickly (10s to 100s of ms), but if we drop below ~65 GB of heap on the leaf nodes, query latency increases dramatically, quickly hitting 20+ seconds (our test harness times out at 20s).
I did some research and found that in past versions of Lucene, one could split the loading of the term dictionary using the 'termInfosIndexDivisor' option on the DirectoryReader class. That option was deprecated in Lucene 5.0.0 <https://abi-laboratory.pro/java/tracker/changelog/lucene/5.0.0/log.html> in favor of using codecs to achieve similar functionality. Looking at the available experimental codecs, I see BlockTreeTermsWriter <https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.html#BlockTreeTermsWriter(org.apache.lucene.index.SegmentWriteState,%20org.apache.lucene.codecs.PostingsWriterBase,%20int,%20int)>, which seems like it could serve a similar purpose: breaking the term dictionary into blocks so that we don't load the whole thing into heap space. Has anyone run into this problem before and found an effective solution? Does changing the codec seem appropriate for this issue? If so, how do I go about loading an alternative codec and configuring it to my needs? I'm having trouble finding docs/examples of how this is used in the real world, so even a pointer to a repo or docs somewhere would be appreciated. Thanks! Best, Tom Hirschfeld
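For context, here is roughly what I'm imagining after reading the 5.3 javadocs: subclass the default codec and return a postings format with larger term-dictionary block sizes for every field. This is untested, and the block-size values (100/200) are guesses on my part, not recommendations:

```java
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat;
import org.apache.lucene.codecs.lucene53.Lucene53Codec;

// Untested sketch: larger blocks should mean a smaller on-heap terms
// index (fewer index entries), at the cost of more scanning per lookup.
// The Lucene50PostingsFormat(min, max) constructor passes these through
// to BlockTreeTermsWriter; defaults are 25/48.
public class BigBlockCodec extends Lucene53Codec {

  private final PostingsFormat postings = new Lucene50PostingsFormat(100, 200);

  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    return postings;
  }
}

// At write time I'd presumably wire it in with something like:
//   IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
//   iwc.setCodec(new BigBlockCodec());
// and then reindex (or force-merge) so segments get rewritten.
```

My understanding from the docs is that the block structure is recorded in the segment files at write time, so readers wouldn't need anything special to open the index, but I'm not certain about that, nor whether this actually moves enough of the terms index off-heap to help at our scale. Corrections welcome.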