RE: ramBufferSizeMB not reflected in segment sizes in index

Burton-West, Tom Thu, 02 Dec 2010 13:32:26 -0800

Hi Mike,

We turned on infoStream. Is there documentation on how to interpret it, or 
should I just grep through the codebase?


Is the excerpt below what I should be looking at to understand the 
relationship between ramBufferSizeMB and size on disk?
Is newFlushedSize the size on disk in bytes?

----
DW:   ramUsed=329.782 MB newFlushedSize=74520060 docs/MB=0.943 new/old=21.55%
....
RAM: now balance allocations: usedMB=325.997 vs trigger=320 deletesMB=0.048 
byteBlockFree=0.125 perDocFree=0.006 charBlockFree=0
...
DW:     after free: freedMB=0.225 usedMB=325.82
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; 
http-8091-Processor12]: flush: now pause all indexing threads
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; 
http-8091-Processor12]:   flush: segment=_5h docStoreSegment=_5e 
docStoreOffset=266 flushDocs=true flushDeletes=false 
flushDocStores=false numDocs=40 numBufDelTerms=40
...
Dec 1, 2010 5:40:22 PM   purge field=geographic
Dec 1, 2010 5:40:22 PM   purge field=serialTitle_ab
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; 
http-8091-Processor12]: DW:   ramUsed=325.772 MB newFlushedSize=69848046 
docs/MB=0.6 new/old=20.447%
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; 
http-8091-Processor12]: flushedFiles=[_5h.frq, _5h.tis, _5h.prx, _5h.nrm, 
_5h.fnm, _5h.tii]
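
For what it's worth, a quick arithmetic check on the numbers above is consistent with newFlushedSize being bytes on disk and new/old being flushed bytes divided by RAM used. The sketch below parses a "DW:" summary line (re-joined here from its wrapped form) and recomputes both ratios. It assumes "MB" means 2**20 bytes; the regex and the function name check_flush_line are my own, not anything from Lucene, and the interpretation is inferred from the arithmetic rather than from documentation.

```python
import re

# Assumption: "MB" in the infoStream output means 2**20 bytes.
BYTES_PER_MB = 1024 * 1024

# Hypothetical pattern for the DW flush summary line shown in the excerpt.
DW_LINE = re.compile(
    r"ramUsed=(?P<ram_mb>[\d.]+) MB\s+newFlushedSize=(?P<flushed>\d+)"
    r"\s+docs/MB=(?P<docs_per_mb>[\d.]+)\s+new/old=(?P<ratio>[\d.]+)%"
)


def check_flush_line(line, num_docs=None):
    """Recompute new/old (and docs/MB if num_docs is given) from one DW line."""
    m = DW_LINE.search(line)
    if m is None:
        raise ValueError("not a DW flush summary line")
    ram_bytes = float(m.group("ram_mb")) * BYTES_PER_MB
    flushed_bytes = int(m.group("flushed"))
    result = {
        "flushed_mb": flushed_bytes / BYTES_PER_MB,
        # Flushed size as a percentage of the RAM the buffer was using.
        "computed_ratio_pct": 100.0 * flushed_bytes / ram_bytes,
        "logged_ratio_pct": float(m.group("ratio")),
    }
    if num_docs is not None:
        # Documents per MB of flushed segment, not per MB of RAM.
        result["computed_docs_per_mb"] = num_docs / result["flushed_mb"]
        result["logged_docs_per_mb"] = float(m.group("docs_per_mb"))
    return result


# The second DW line from the excerpt; numDocs=40 comes from its flush line.
line = ("DW:   ramUsed=325.772 MB newFlushedSize=69848046 "
        "docs/MB=0.6 new/old=20.447%")
for key, value in check_flush_line(line, num_docs=40).items():
    print(f"{key}: {value:.3f}")
```

Both recomputed values agree with the logged ones to within rounding, for this line and for the first DW line in the excerpt (ratio only, since its numDocs is not shown).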



Tom


-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, December 01, 2010 3:43 PM
To: solr-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index

On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Thanks Mike,
>
> Yes we have many unique terms due to dirty OCR and 400 languages and probably 
> lots of low doc freq terms as well (although with the ICUTokenizer and 
> ICUFoldingFilter we should get fewer terms due to bad tokenization and 
> normalization.)

OK, that likely explains the lowish RAM efficiency.

> Is this additional overhead because each unique term takes a certain amount 
> of space compared to adding entries to a list for an existing term?

Exactly.  There's a highish "startup cost" for each term, but then
appending docs/positions to that term is more efficient, especially for
higher-frequency terms.  In the limit, a single unique term across
all docs will have very high RAM efficiency.
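
A toy model may make the amortization concrete. The two byte costs below are made-up placeholders, not Lucene's actual numbers, and the function name is hypothetical; the only point is that a fixed per-term cost is spread over more postings as terms repeat.

```python
# Hypothetical costs, for illustration only; NOT measured from Lucene.
TERM_STARTUP_BYTES = 100   # assumed fixed cost paid once per unique term
POSTING_BYTES = 4          # assumed cost of appending one doc/position entry


def postings_per_byte(unique_terms, total_postings):
    """Postings stored per byte of RAM under the toy cost model above."""
    ram = unique_terms * TERM_STARTUP_BYTES + total_postings * POSTING_BYTES
    return total_postings / ram


total = 1_000_000
for unique in (1, 1_000, 100_000, 1_000_000):
    # More unique terms for the same number of postings means lower efficiency.
    print(unique, round(postings_per_byte(unique, total), 4))
```

With the number of postings held fixed, efficiency falls steadily as the number of unique terms grows, which is the situation dirty OCR across many languages would produce.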

> Does turning on IndexWriter's infoStream have a significant impact on memory 
> use or indexing speed?

I don't believe so....

Mike
