By "Index size almost never grows linearly with the number of
documents" are you saying it increases more slowly than the number of
documents, i.e. sub-linearly, or more rapidly?
With dirty OCR the number of unique terms keeps increasing because of
the garbage "words" that recognition errors produce.
-Phil
Chris Hostetter wrote:
: > I'm surprised, as you are, by the non-linearity. Out of curiosity, what is
Unless the data in "stored" fields is significantly greater than in
"indexed" fields, the index size almost never grows linearly with the
number of documents -- it's the number of unique terms that tends to
primarily influence the size of the index.
At some point someone on the java-user list who really understood the file
formats wrote a really great formula for estimating the size of the index
assuming some ratios of unique terms per doc, but I can't find it now.
-Hoss
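
The unique-terms point in this thread can be illustrated with a rough simulation. This is only a sketch under made-up assumptions, not the index-size formula mentioned above and not anything Lucene computes: the function name unique_term_growth, the vocabulary size, the terms-per-document count, and the garbage rate are all hypothetical. It counts cumulative unique terms as documents are added, once with clean text and once with a fraction of never-repeating "garbage" tokens standing in for OCR errors.

```python
import random


def unique_term_growth(num_docs, terms_per_doc=200, vocab_size=5000,
                       garbage_rate=0.0, seed=0):
    """Return the cumulative unique-term count after each added document.

    All parameters are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    seen = set()
    growth = []
    for doc_id in range(num_docs):
        for pos in range(terms_per_doc):
            if rng.random() < garbage_rate:
                # Simulated OCR error: a token that never occurs again.
                term = f"garbage-{doc_id}-{pos}"
            else:
                # Clean token from a fixed vocabulary; squaring the random
                # value skews the draw so that some words are more common.
                term = f"word-{int(vocab_size * rng.random() ** 2)}"
            seen.add(term)
        growth.append(len(seen))
    return growth


if __name__ == "__main__":
    clean = unique_term_growth(200, garbage_rate=0.0)
    dirty = unique_term_growth(200, garbage_rate=0.05)
    for n in (50, 100, 150, 200):
        print(n, "docs: clean =", clean[n - 1], "dirty =", dirty[n - 1])
```

In this toy model the clean count flattens out as the fixed vocabulary is exhausted (sub-linear growth), while the dirty count keeps climbing with every document, which is the situation described for dirty OCR.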