Re: Index size vs. number of documents

Phillip Farber Fri, 15 Aug 2008 09:23:03 -0700

By "Index size almost never grows linearly with the number of documents" are you saying it increases more slowly than the number of documents, i.e. sub-linearly, or more rapidly?

With dirty OCR the number of unique terms is always increasing due to the garbage "words".
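To make the question concrete, here is a rough toy sketch (plain Python, not Lucene code) that counts cumulative unique terms as documents are added. Everything in it is an assumption for illustration: the helper names `unique_term_growth` and `make_docs`, the fixed 1000-word vocabulary, and the 5% `garbage_rate` standing in for OCR noise. It says nothing about actual index file sizes, only about how the unique-term count behaves.

```python
import random


def unique_term_growth(docs):
    """Return the cumulative number of distinct terms after each document."""
    seen = set()
    growth = []
    for doc in docs:
        seen.update(doc.split())
        growth.append(len(seen))
    return growth


def make_docs(n_docs, words_per_doc, vocab_size, garbage_rate, seed=0):
    """Generate toy documents from a fixed vocabulary.

    garbage_rate is a hypothetical share of tokens replaced by a
    never-before-seen string, imitating dirty-OCR garbage "words".
    """
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(vocab_size)]
    docs = []
    garbage_id = 0
    for _ in range(n_docs):
        tokens = []
        for _ in range(words_per_doc):
            if rng.random() < garbage_rate:
                # Each garbage token is unique, so it always adds a new term.
                tokens.append(f"g{garbage_id}")
                garbage_id += 1
            else:
                tokens.append(rng.choice(vocab))
        docs.append(" ".join(tokens))
    return docs


clean = unique_term_growth(make_docs(200, 100, 1000, 0.0))
dirty = unique_term_growth(make_docs(200, 100, 1000, 0.05))

# Unique terms after 50 documents and after all 200 documents.
print("clean:", clean[49], clean[-1])
print("dirty:", dirty[49], dirty[-1])
```

In this toy setup the clean count is capped by the vocabulary size, so it flattens out, while the dirty count keeps climbing with every batch of documents because each garbage token is a new term.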


-Phil

Chris Hostetter wrote:

: > I'm surprised, as you are, by the non-linearity. Out of curiosity, what is
Unless the data in "stored" fields is significantly greater than "indexed" fields the Index size almost never grows linearly with the number of documents -- it's the number of unique terms that tends to primarily influence the size of the index.

At some point someone on the java-user list who really understood the file formats wrote a really great formula for estimating the size of the index assuming some ratios of unique terms per doc, but I can't find it now.
-Hoss
