Here's an example.
Consider 2 docs with terms:

doc1: term1, term2, term3
doc2: term4, term5, term6

vs.

doc1: term1, term2, term3
doc2: term1, term1, term6

All other things being equal, the former will make the index grow faster
because it has more unique terms (six versus four).  Even if your OCR produces
garbage that shows up as new unique terms, there will still be some overlap
(like term1 in the second case above).
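
To make the comparison concrete, here is a toy sketch in Python that just
counts the distinct terms in each case.  It is not how Lucene builds its term
dictionary, and the helper name unique_terms is made up for illustration; it
only shows why the first pair of docs adds more to the term list.

```python
def unique_terms(docs):
    """Return the set of distinct terms across all documents (toy helper)."""
    terms = set()
    for doc in docs:
        terms.update(doc)
    return terms


# First case: no term is shared between the two docs.
first_case = [
    ["term1", "term2", "term3"],
    ["term4", "term5", "term6"],
]

# Second case: doc2 repeats term1, so fewer new terms are introduced.
second_case = [
    ["term1", "term2", "term3"],
    ["term1", "term1", "term6"],
]

first_count = len(unique_terms(first_case))
second_count = len(unique_terms(second_case))

print("unique terms, first case: ", first_count)
print("unique terms, second case:", second_count)
```

Both cases have the same number of docs and the same number of term
occurrences; only the unique-term count differs.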

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Phillip Farber <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, August 15, 2008 12:22:30 PM
> Subject: Re: Index size vs. number of documents
> 
> By "Index size almost never grows linearly with the number of
> documents" are you saying it increases more slowly than the number of 
> documents, i.e. sub-linearly, or more rapidly?
> 
> With dirty OCR the number of unique terms is always increasing due to 
> the garbage "words"
> 
> -Phil
> 
> Chris Hostetter wrote:
> > : > I'm surprised, as you are, by the non-linearity. Out of curiosity, what is
> > 
> > Unless the data in "stored" fields is significantly greater than "indexed" 
> > fields, the index size almost never grows linearly with the number of 
> > documents -- it's the number of unique terms that tends to primarily 
> > influence the size of the index.
> > 
> > At some point someone on the java-user list who really understood the file 
> > formats wrote a really great formula for estimating the size of the index 
> > assuming some ratios of unique terms per doc, but I can't find it now.
> > 
> > 
> > -Hoss
> > 
