By "Index size almost never grows linearly with the number of
documents", are you saying it increases more slowly than the number of documents (i.e. sub-linearly), or more rapidly?

With dirty OCR, the number of unique terms keeps increasing because of the garbage "words".

-Phil

Chris Hostetter wrote:
: > I'm surprised, as you are, by the non-linearity. Out of curiosity, what is

Unless the data in "stored" fields is significantly greater than the data in "indexed" fields, the index size almost never grows linearly with the number of documents; it's the number of unique terms that tends to primarily influence the size of the index.
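
To make the unique-terms point concrete, here is a rough sketch that counts distinct terms as documents are added. It is a toy simulation, not a model of Lucene's file formats: the function name, the fixed vocabulary size, the uniform word sampling, and the garbage rate standing in for dirty OCR are all assumptions chosen for illustration.

```python
import random

def unique_term_counts(num_docs, words_per_doc=50, vocab_size=2000,
                       garbage_rate=0.0, seed=0):
    """Return the running count of unique terms after each document is added.

    Hypothetical toy model: clean words are drawn uniformly from a fixed
    vocabulary (real text is more skewed), and garbage_rate is the assumed
    fraction of tokens that come out of OCR as junk.
    """
    rng = random.Random(seed)
    seen = set()
    counts = []
    for _ in range(num_docs):
        for _ in range(words_per_doc):
            if rng.random() < garbage_rate:
                # Simulated OCR garbage: a random 8-letter string,
                # which is almost always a term nobody has seen before.
                term = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz")
                               for _ in range(8))
            else:
                # Clean text: a word from the fixed vocabulary.
                term = "w%d" % rng.randrange(vocab_size)
            seen.add(term)
        counts.append(len(seen))
    return counts

clean = unique_term_counts(1000)
dirty = unique_term_counts(1000, garbage_rate=0.05)

# Unique terms after 100 documents and after 1000 documents.
print("clean:", clean[99], clean[999])
print("dirty:", dirty[99], dirty[999])
```

In this toy model the clean corpus can never exceed the fixed vocabulary, so its unique-term count flattens out as documents are added, while the corpus with simulated garbage keeps picking up new terms with every document.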

At some point someone on the java-user list who really understood the file formats wrote a really great formula for estimating the size of the index assuming some ratios of unique terms per doc, but I can't find it now.


-Hoss
