Hi Tom: I already enhanced the javadocs about this for Lucene, putting warnings everywhere in bold:

NOTE: This parameter does not apply to all PostingsFormat implementations, including the default one in this release. It only makes sense for term indexes that are implemented as a fixed gap between terms.

NOTE: divisor settings > 1 do not apply to all PostingsFormat implementations, including the default one in this release. It only makes sense for terms indexes that can efficiently re-sample terms at load time.

etc.

http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/DirectoryReader.html#open%28org.apache.lucene.store.Directory,%20int%29

In the future I expect these parameters will be removed completely: anything like this is specific to the codec/implementation. In Lucene 4.0 the terms index works completely differently: these parameters don't make sense for it.

On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West <tburt...@umich.edu> wrote:
> Hello all,
>
> Due to multiple languages and dirty OCR, our indexes have over 2 billion
> unique terms (
> http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).
> In Solr 3.6 and previous we needed to reduce the memory used for storing
> the in-memory representation of the tii file. We originally used the
> termInfosIndexDivisor which affects the sampling of the tii file when read
> into memory. While this solved our problem for searching, unfortunately
> the termInfosIndexDivisor was not read during indexing and caused OOM
> problems once our indexes grew beyond a certain size. See:
> https://issues.apache.org/jira/browse/SOLR-2290.
>
> Has this been changed in Solr 4.0?
>
> The advantage of using the termInfosIndexDivisor is that it can be changed
> without re-indexing, so we were able to experiment with different settings
> to determine a good setting without re-indexing several terabytes of data.
>
> When we ran into problems with the memory use for the in-memory
> representation of the tii file during indexing, we changed the
> termIndexInterval. The termIndexInterval is an indexing-time setting
> which affects the size of the tii file. It sets the sampling of the tis
> file that gets written to the tii file.
>
> In Solr 4.0 termInfosIndexDivisor has been replaced with
> termIndexDivisor. The documentation for these two features, the
> index-time termIndexInterval and the run-time termIndexDivisor, no longer
> seems to be on the solr config page of the wiki and the documentation in the
> example file does not explain what the termIndexDivisor does.
>
> Would it be appropriate to add these back to the wiki page? If not, could
> someone add a line or two to the comments in the Solr 4.0 example file
> explaining what the termIndexDivisor does?
>
>
> Tom

--
lucidworks.com
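
P.S. For anyone following along, here is a rough sketch of what the two 3.x settings described in the quoted message do on a fixed-gap terms index. It is a toy model in plain Java, not Lucene code: the class TermIndexSampling, the method sample, and the synthetic term list are all made up for illustration. In this model, an index-time interval of 128 followed by a load-time divisor of 4 keeps the same entries as an interval of 512 would, which is why the divisor could be tuned without re-indexing.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: illustrates fixed-gap sampling, not Lucene's file formats or API.
final class TermIndexSampling {

    // Keep every `step`-th entry of a sorted list, starting with the first one.
    static List<String> sample(List<String> sorted, int step) {
        if (step < 1) {
            throw new IllegalArgumentException("step must be >= 1");
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i += step) {
            kept.add(sorted.get(i));
        }
        return kept;
    }

    public static void main(String[] args) {
        // Synthetic, already-sorted term dictionary (stands in for the tis file).
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            terms.add(String.format("term%04d", i));
        }

        // Index time: the interval decides which terms reach the on-disk index (tii).
        List<String> onDisk = sample(terms, 128);

        // Load time: the divisor subsamples the on-disk index again before it is held in memory.
        List<String> inMemory = sample(onDisk, 4);

        System.out.println("on-disk index entries:   " + onDisk.size());
        System.out.println("in-memory index entries: " + inMemory.size());
        System.out.println("same as interval 128*4:  " + inMemory.equals(sample(terms, 128 * 4)));
    }
}
```

Fewer in-memory entries means less heap but a longer scan through the term dictionary per lookup. As noted above, none of this applies to the default terms index in 4.0.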