Hi Tom: I already enhanced the javadocs about this for Lucene, putting
warnings everywhere in bold:

NOTE: This parameter does not apply to all PostingsFormat
implementations, including the default one in this release. It only
makes sense for term indexes that are implemented as a fixed gap
between terms.

NOTE: divisor settings > 1 do not apply to all PostingsFormat
implementations, including the default one in this release. It only
makes sense for terms indexes that can efficiently re-sample terms at
load time.

etc.

http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/DirectoryReader.html#open%28org.apache.lucene.store.Directory,%20int%29

In the future I expect these parameters will be removed completely:
anything like this is specific to the codec/implementation.

In Lucene 4.0 the terms index works completely differently: these
parameters don't make sense for it.

On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West <tburt...@umich.edu> wrote:
> Hello all,
>
> Due to multiple languages and dirty OCR, our indexes have over 2 billion
> unique terms (
> http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).
> In Solr 3.6 and previous we needed to reduce the memory used for storing
> the in-memory representation of the tii file.   We originally used the
> termInfosIndexDivisor which affects the sampling of the tii file when read
> into memory.   While this solved our problem for searching, unfortunately
> the termInfosIndexDivisor was not read during indexing and caused OOM
> problems once our indexes grew beyond a certain size.  See:
> https://issues.apache.org/jira/browse/SOLR-2290.
>
> Has this been changed in Solr 4.0?
>
> The advantage of using the termInfosIndexDivisor is that it can be changed
> without re-indexing, so we were able to experiment with different settings
> to determine a good setting without re-indexing several terabytes of data.
>
> When we ran into problems with the memory use for the in-memory
> representation of the tii file during indexing, we changed the
> termIndexInterval.  The termIndexInterval is an indexing-time setting
> which affects the size of the tii file.  It sets the sampling of the tis
> file that gets written to the tii file.
>
> In Solr 4.0 termInfosIndexDivisor has been replaced with
> termIndexDivisor.  The documentation for these two features, the
> index-time termIndexInterval and the run-time termIndexDivisor, no longer
> seems to be on the solr config page of the wiki, and the documentation in the
> example file does not explain what the termIndexDivisor does.
>
> Would it be appropriate to add these back to the wiki page?  If not, could
> someone add a line or two to the comments in the Solr 4.0 example file
> explaining what the termIndexDivisor does?
>
>
> Tom
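
For anyone reading along, here is a rough sketch of the 3.x-style arithmetic
the quoted message describes: termIndexInterval samples the tis file into the
tii file at index time, and the divisor then samples the tii file again when
it is loaded into memory. This is only an illustration, not Lucene code. The
function name is made up, the 2 billion figure is the round number from the
message above, and 128 is assumed as the 3.x default interval. As noted
earlier in this reply, none of this applies to the default terms index in 4.0.

```python
def in_memory_index_entries(unique_terms, term_index_interval=128, divisor=1):
    """Approximate count of term-index entries held in memory.

    Hypothetical helper modelling a fixed-gap (3.x-style) terms index only.
    """
    # Index time: every term_index_interval-th term of the tis file
    # is written to the tii file (ceiling division on integers).
    tii_entries = -(-unique_terms // term_index_interval)
    # Load time: only every divisor-th tii entry is kept in memory.
    return -(-tii_entries // divisor)


if __name__ == "__main__":
    terms = 2_000_000_000  # "over 2 billion unique terms", rounded down
    for interval, divisor in [(128, 1), (128, 4), (512, 1)]:
        entries = in_memory_index_entries(terms, interval, divisor)
        print(f"interval={interval} divisor={divisor} -> {entries:,} entries")
```

Under this model, interval=128 with divisor=4 and interval=512 with divisor=1
keep the same number of entries in memory. The difference is that the divisor
can be changed without re-indexing, while the interval is fixed when the index
is written.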



-- 
lucidworks.com