Hello,

The current documentation for the Lucene 4.3 file formats says:

When referring to term numbers, Lucene's current implementation uses a Java
int to hold the term index, which means the maximum number of unique terms
in any single index segment is ~2.1 billion times the term index interval
(default 128) = ~274 billion. This is technically not a limitation of the
index file format, just of Lucene's current implementation.

(http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#Limitations)
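
As a sanity check on the quoted figure, here is a minimal sketch of the
arithmetic. The class and method names are hypothetical (not part of Lucene),
and it assumes only what the documentation states: a Java int term index and
one indexed term per termIndexInterval terms.

```java
// Hypothetical helper, not part of Lucene: reproduces the arithmetic in the
// quoted documentation under its stated assumptions.
public class TermLimitEstimate {
    // Upper bound on unique terms per segment when the term index position
    // is held in a Java int and every termIndexInterval-th term is indexed.
    static long maxUniqueTerms(int termIndexInterval) {
        // Widen to long before multiplying to avoid int overflow.
        return (long) Integer.MAX_VALUE * termIndexInterval;
    }

    public static void main(String[] args) {
        // The documentation quotes a default interval of 128.
        System.out.println(maxUniqueTerms(128)); // 274877906816, i.e. ~274 billion
    }
}
```

This only restates the documented limit; it says nothing about the FST-based
terms index.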

I believe that termIndexInterval is not used by the default codec, and that
the terms index is now stored in an FST instead:
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29

If so, the limit above does not apply to the default codec.
What is the current limit?

I suspect it may be related to the maximum number of nodes in the FST, but
I don't know what that maximum is or how it would translate to a number of
unique terms, since prefix sharing among terms probably affects the number
of nodes in the FST.

Tom.
