Thanks Mike,

Yes, we have many unique terms due to dirty OCR and 400 languages, and probably lots of low-doc-freq terms as well (although with the ICUTokenizer and ICUFoldingFilter we should get fewer spurious terms caused by bad tokenization and normalization).
Is this additional overhead because each unique term takes a certain amount of space, compared to just adding entries to a list for an existing term?

Does turning on IndexWriter's infoStream have a significant impact on memory use or indexing speed?

If it does, I'll reproduce this on our test server rather than turning it on for a bit on the production indexer. If it doesn't, I'll turn it on and post here.

Tom

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Wednesday, December 01, 2010 2:43 PM
To: solr-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index

The ram efficiency (= size of segment once flushed divided by size of RAM buffer) can vary drastically. Because the in-RAM data structures must be "growable" (to append new docs to the postings as they are encountered), the efficiency is never 100%. I think 50% is actually a "good" ram efficiency, and lower than that (even down to 27%) I think is still normal.

Do you have many unique or low-doc-freq terms? That brings the efficiency down.

If you turn on IndexWriter's infoStream and post the output we can see if anything odd is going on...

80 * 20 = ~1.6 GB so I'm not sure why you're getting 1 GB segments. Do you do any deletions in this run? A merged segment size will often be less than the sum of the parts, especially if there are many terms but across segments these terms are shared.... but the infoStream will also show what merges are taking place.

Mike

On Wed, Dec 1, 2010 at 2:13 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> We are using a recent Solr 3.x (See below for exact version).
>
> We have set the ramBufferSizeMB to 320 in both the indexDefaults and the
> mainIndex sections of our solrconfig.xml:
>
> <ramBufferSizeMB>320</ramBufferSizeMB>
> <mergeFactor>20</mergeFactor>
>
> We expected that this would mean that the index would not write to disk until
> it reached somewhere approximately over 300MB in size.
> However, we see many small segments that look to be around 80MB in size.
>
> We have not yet issued a single commit so nothing else should force a write
> to disk.
>
> With a merge factor of 20 we also expected to see larger segments somewhere
> around 320 * 20 = 6GB in size, however we see several around 1GB.
>
> We understand that the sizes are approximate, but these seem nowhere near
> what we expected.
>
> Can anyone explain what is going on?
>
> BTW
> maxBufferedDocs is commented out, so this should not be affecting the buffer
> flushes
> <!--<maxBufferedDocs>1000</maxBufferedDocs>-->
>
>
> Solr Specification Version: 3.0.0.2010.11.19.16.00.54
> Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
> Lucene Specification Version: 3.1-SNAPSHOT
> Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10
>
> Tom Burton-West
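
For anyone following the numbers in this thread, here is a rough sketch of the arithmetic, using only the figures quoted above (320 MB buffer, roughly 80 MB flushed segments, mergeFactor 20). The function and variable names are made up for illustration and are not part of Solr or Lucene; the calculation also ignores the shrinkage from shared terms that Mike mentions, so the merged size is an upper bound rather than a prediction.

```python
# Back-of-the-envelope check of the segment sizes discussed in this thread.
# All names here are hypothetical helpers, not Solr/Lucene APIs.

def ram_efficiency(flushed_segment_mb: float, ram_buffer_mb: float) -> float:
    """Size of a segment once flushed, divided by the size of the RAM buffer."""
    return flushed_segment_mb / ram_buffer_mb


def merged_upper_bound_mb(flushed_segment_mb: float, merge_factor: int) -> float:
    """Sum of the parts for one merge; the real merged segment is often smaller."""
    return flushed_segment_mb * merge_factor


RAM_BUFFER_MB = 320        # <ramBufferSizeMB> from solrconfig.xml
FLUSHED_SEGMENT_MB = 80    # approximate observed size of the small segments
MERGE_FACTOR = 20          # <mergeFactor> from solrconfig.xml

efficiency = ram_efficiency(FLUSHED_SEGMENT_MB, RAM_BUFFER_MB)
from_observed = merged_upper_bound_mb(FLUSHED_SEGMENT_MB, MERGE_FACTOR)
from_buffer = merged_upper_bound_mb(RAM_BUFFER_MB, MERGE_FACTOR)

print(f"RAM efficiency: {efficiency:.0%}")
print(f"Merged size from observed flushes: ~{from_observed / 1000:.1f} GB")
print(f"Merged size if flushes matched the buffer: ~{from_buffer / 1000:.1f} GB")
```

The point of the comparison is that the expected merged size should be estimated from the flushed segment size (80 * 20), not from the buffer size (320 * 20), which is why the 6GB expectation was too high.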