Thanks Mike,

Yes, we have many unique terms due to dirty OCR and 400 languages, and probably 
lots of low-doc-freq terms as well (although with the ICUTokenizer and 
ICUFoldingFilter we should get fewer terms caused by bad tokenization and 
normalization).

Is the additional overhead because each unique term takes a certain amount of 
space of its own, whereas another occurrence of an existing term only adds an 
entry to that term's list?

Does turning on IndexWriter's infoStream have a significant impact on memory use 
or indexing speed?

If it does, I'll reproduce this on our test server rather than turning it on 
for a bit on the production indexer.  If it doesn't, I'll turn it on in 
production and post the output here.

Tom

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, December 01, 2010 2:43 PM
To: solr-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index

The ram efficiency (= size of segment once flushed divided by size of
RAM buffer) can vary drastically.

Because the in-RAM data structures must be "growable" (to append new
docs to the postings as they are encountered), the efficiency is never
100%.  I think 50% is actually a "good" ram efficiency, and lower than
that (even down to 27%) I think is still normal.
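
As a rough illustration of that ratio, here is a minimal Python sketch using
the numbers reported in this thread (flushed segments of about 80 MB from a
320 MB RAM buffer). The function name is hypothetical and is not part of
Lucene or Solr; it only restates the definition above.

```python
def ram_efficiency(flushed_segment_mb: float, ram_buffer_mb: float) -> float:
    """Size of the flushed segment divided by the size of the RAM buffer."""
    if ram_buffer_mb <= 0:
        raise ValueError("ram_buffer_mb must be positive")
    return flushed_segment_mb / ram_buffer_mb


# Numbers from this thread: ~80 MB segments, ramBufferSizeMB = 320.
efficiency = ram_efficiency(80, 320)
print(f"RAM efficiency: {efficiency:.0%}")
```

With these inputs the ratio is 0.25, which is close to the low end of the
range described above.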

Do you have many unique or low-doc-freq terms?  That brings the efficiency down.

If you turn on IndexWriter's infoStream and post the output we can see
if anything odd is going on...

80 MB * 20 = ~1.6 GB, so I'm not sure why you're getting 1 GB segments.
Do you do any deletions in this run?  A merged segment will often be
smaller than the sum of its parts, especially if there are many terms
that are shared across the segments being merged.  The infoStream will
also show what merges are taking place.
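
The merge arithmetic can be sketched in a few lines of Python. This assumes,
as a simplification, that a merge simply concatenates mergeFactor equally
sized segments, which gives only an upper bound; real merged segments are
usually smaller. The helper name is hypothetical and not a Lucene or Solr API.

```python
def merged_size_upper_bound_mb(segment_mb: float, merge_factor: int) -> float:
    """Upper bound on a merged segment: merge_factor segments, nothing shared."""
    return segment_mb * merge_factor


# What was expected if each flush had filled the 320 MB buffer.
expected_from_buffer_mb = merged_size_upper_bound_mb(320, 20)
# What the observed ~80 MB flushed segments would give at most.
expected_from_flush_mb = merged_size_upper_bound_mb(80, 20)
# "Several around 1GB", approximated here as 1000 MB.
observed_mb = 1000

print(expected_from_buffer_mb, expected_from_flush_mb)
print(observed_mb / expected_from_flush_mb)
```

The last line gives the fraction of the upper bound that the observed merged
segments actually occupy.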

Mike

On Wed, Dec 1, 2010 at 2:13 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> We are using a recent Solr 3.x (See below for exact version).
>
> We have set the ramBufferSizeMB to 320 in both the indexDefaults and the 
> mainIndex sections of our solrconfig.xml:
>
> <ramBufferSizeMB>320</ramBufferSizeMB>
> <mergeFactor>20</mergeFactor>
>
> We expected that this would mean that the index would not write to disk until 
> the buffer reached roughly 300MB or more in size.
> However, we see many small segments that look to be around 80MB in size.
>
> We have not yet issued a single commit so nothing else should force a write 
> to disk.
>
> With a merge factor of 20 we also expected to see larger segments somewhere 
> around 320MB * 20 = ~6.4GB in size; however, we see several around 1GB.
>
> We understand that the sizes are approximate, but these seem nowhere near 
> what we expected.
>
> Can anyone explain what is going on?
>
> BTW
> maxBufferedDocs is commented out, so this should not be affecting the buffer 
> flushes
> <!--<maxBufferedDocs>1000</maxBufferedDocs>-->
>
>
> Solr Specification Version: 3.0.0.2010.11.19.16.00.54
> Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
> Lucene Specification Version: 3.1-SNAPSHOT
> Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10
>
> Tom Burton-West
>
>
