Hi Mike,

We turned on infostream. Is there documentation about how to interpret it, or should I just grep through the codebase?
Is the excerpt below what I am looking for as far as understanding the relationship between ramBufferSize and size on disk? Is newFlushedSize the size on disk in bytes?

----
DW: ramUsed=329.782 MB newFlushedSize=74520060 docs/MB=0.943 new/old=21.55%
....
RAM: now balance allocations: usedMB=325.997 vs trigger=320 deletesMB=0.048 byteBlockFree=0.125 perDocFree=0.006 charBlockFree=0
...
DW: after free: freedMB=0.225 usedMB=325.82
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; http-8091-Processor12]: flush: now pause all indexing threads
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; http-8091-Processor12]: flush: segment=_5h docStoreSegment=_5e docStoreOffset=266 flushDocs=true flushDeletes=false flushDocStores=false numDocs=40 numBufDelTerms=40
...
Dec 1, 2010 5:40:22 PM purge field=geographic
Dec 1, 2010 5:40:22 PM purge field=serialTitle_ab
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; http-8091-Processor12]: DW: ramUsed=325.772 MB newFlushedSize=69848046 docs/MB=0.6 new/old=20.447%
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; http-8091-Processor12]: flushedFiles=[_5h.frq, _5h.tis, _5h.prx, _5h.nrm, _5h.fnm, _5h.tii]

Tom

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Wednesday, December 01, 2010 3:43 PM
To: solr-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index

On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Thanks Mike,
>
> Yes we have many unique terms due to dirty OCR and 400 languages and probably
> lots of low doc freq terms as well (although with the ICUTokenizer and
> ICUFoldingFilter we should get fewer terms due to bad tokenization and
> normalization.)

OK likely this explains the lowish RAM efficiency.

> Is this additional overhead because each unique term takes a certain amount
> of space compared to adding entries to a list for an existing term?

Exactly. There's a highish "startup cost" for each term.... but then appending docs/positions to that term is more efficient especially for higher frequency terms. In the limit, a single unique term across all docs will have very high RAM efficiency...

> Does turning on IndexWriters infostream have a significant impact on memory
> use or indexing speed?

I don't believe so....

Mike
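
A note on reading the two DW lines in the excerpt: the logged numbers are consistent with new/old being newFlushedSize (in bytes) divided by ramUsed (converted to bytes), which would make newFlushedSize the flushed segment size in bytes. This is inferred from the arithmetic only, not from Lucene documentation. The sketch below is a minimal check of that reading; the regex, the function name parse_flushes, and the 1 MB = 1024*1024 bytes conversion are assumptions made for illustration.

```python
import re

# The two DW flush lines from the excerpt above (timestamps trimmed).
LOG = """\
DW: ramUsed=329.782 MB newFlushedSize=74520060 docs/MB=0.943 new/old=21.55%
DW: ramUsed=325.772 MB newFlushedSize=69848046 docs/MB=0.6 new/old=20.447%
"""

# Hypothetical pattern written for this excerpt; other Lucene versions
# may format the line differently.
DW_LINE = re.compile(
    r"ramUsed=(?P<ram_mb>[\d.]+) MB\s+"
    r"newFlushedSize=(?P<flushed>\d+)\s+"
    r"docs/MB=(?P<docs_per_mb>[\d.]+)\s+"
    r"new/old=(?P<pct>[\d.]+)%"
)

MB = 1024 * 1024  # assumption: the log uses binary megabytes


def parse_flushes(text):
    """Return one dict per DW flush line, with the recomputed new/old ratio."""
    rows = []
    for m in DW_LINE.finditer(text):
        ram_mb = float(m["ram_mb"])
        flushed_bytes = int(m["flushed"])
        rows.append({
            "ram_mb": ram_mb,
            "flushed_mb": flushed_bytes / MB,
            "logged_pct": float(m["pct"]),
            # Flushed bytes as a percentage of the RAM buffer in bytes.
            "computed_pct": 100.0 * flushed_bytes / (ram_mb * MB),
        })
    return rows


rows = parse_flushes(LOG)
for r in rows:
    print(
        f"ram={r['ram_mb']:.1f} MB  flushed={r['flushed_mb']:.1f} MB  "
        f"logged={r['logged_pct']}%  computed={r['computed_pct']:.3f}%"
    )
```

Under the same assumption, docs/MB in the second DW line matches numDocs=40 from the preceding flush line divided by the flushed size in MB, so in this excerpt roughly 326 MB of RAM buffer ends up as about 67 MB on disk.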