Hello,

The current documentation for Lucene 4.3 file formats says:

"When referring to term numbers, Lucene's current implementation uses a Java int to hold the term index, which means the maximum number of unique terms in any single index segment is ~2.1 billion times the term index interval (default 128) = ~274 billion. This is technically not a limitation of the index file format, just of Lucene's current implementation."
( http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#Limitations )

I believe that the termIndexInterval is not used in the default codec:
http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
and that the terms index is now held in an FST instead. So the above limit does not apply to the default codec.

What is the current limit? I suspect it may be related to the maximum number of nodes in the FST, but I don't know what that is or how it would translate to a number of unique terms, since prefix sharing among terms probably affects the number of nodes in the FST.

Tom.
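
P.S. For reference, the arithmetic behind the documented ~274 billion figure can be checked with a small sketch. This only reproduces the old bound (maximum Java int times the term index interval); it says nothing about whatever limit the FST-based terms index imposes. The class and method names are made up for illustration and are not part of Lucene.

```java
// Hypothetical helper, not a Lucene API: reproduces the documented bound
// "max Java int * term index interval" for the older terms index scheme.
public class TermLimitEstimate {

    // Widen to long before multiplying so the product does not overflow int.
    static long maxTerms(int termIndexInterval) {
        return (long) Integer.MAX_VALUE * termIndexInterval;
    }

    public static void main(String[] args) {
        // Default interval quoted in the documentation is 128.
        System.out.println(maxTerms(128)); // 274877906816, i.e. ~274 billion
    }
}
```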
