Here's an example. Consider 2 docs with terms:
doc1: term1, term2, term3
doc2: term4, term5, term6
vs.
doc1: term1, term2, term3
doc2: term1, term1, term6

All other things being equal, the former will make the index grow faster because it has more unique terms (six versus four). Even if your OCR produces garbage that adds noise in the form of new unique terms, there will still be some overlap (like that term1 in the second case above).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Phillip Farber <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, August 15, 2008 12:22:30 PM
> Subject: Re: Index size vs. number of documents
>
> By "Index size almost never grows linearly with the number of
> documents" are you saying it increases more slowly than the number of
> documents, i.e. sub-linearly, or more rapidly?
>
> With dirty OCR the number of unique terms is always increasing due to
> the garbage "words"
>
> -Phil
>
> Chris Hostetter wrote:
> > : > I'm surprised, as you are, by the non-linearity. Out of curiosity, what is
> >
> > Unless the data in "stored" fields is significantly greater than "indexed"
> > fields the Index size almost never grows linearly with the number of
> > documents -- it's the number of unique terms that tends to primarily
> > influence the size of the index.
> >
> > At some point someone on the java-user list who really understood the file
> > formats wrote a really great formula for estimating the size of the index
> > assuming some ratios of unique terms per doc, but I can't find it now.
> >
> > -Hoss
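
P.S. If it helps to see the two-doc comparison from the top of this message as numbers, here is a rough Python sketch. It is only a toy: the function name term_stats and the dict layout are made up for illustration, and counting distinct terms and (term, doc) pairs is a crude proxy, not how Lucene actually lays out or measures its index on disk.

```python
# Toy illustration only: a crude proxy for index growth, NOT Lucene's file format.
# term_stats and the doc_id -> terms layout are hypothetical names for this sketch.

def term_stats(docs):
    """Return (unique_terms, postings) for a dict of doc_id -> list of terms."""
    dictionary = set()  # distinct terms across all docs (rough "term dictionary" size)
    postings = 0        # distinct (term, doc) pairs (rough "postings" count)
    for terms in docs.values():
        distinct_in_doc = set(terms)
        dictionary |= distinct_in_doc
        postings += len(distinct_in_doc)
    return len(dictionary), postings


# First case: no term is shared between the two docs.
case_a = {
    "doc1": ["term1", "term2", "term3"],
    "doc2": ["term4", "term5", "term6"],
}

# Second case: doc2 mostly repeats term1, which doc1 already introduced.
case_b = {
    "doc1": ["term1", "term2", "term3"],
    "doc2": ["term1", "term1", "term6"],
}

print(term_stats(case_a))  # (6, 6)
print(term_stats(case_b))  # (4, 5)
```

Both cases hold the same number of docs and tokens, but the second one adds fewer new terms, which is the overlap effect described above.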