Re: ramBufferSizeMB not reflected in segment sizes in index

2010-12-02 Thread Michael McCandless
On Thu, Dec 2, 2010 at 4:31 PM, Burton-West, Tom  wrote:

> We turned on infoStream.  Is there documentation about how to interpret it,
> or should I just grep through the codebase?

There isn't any documentation... and it changes over time as we add
new diagnostics.

> Is the excerpt below what I am looking for as far as understanding the 
> relationship between ramBufferSize and size on disk?
> is newFlushedSize the size on disk in bytes?

Yes -- so IW's buffer was using 325.772 MB RAM, and was flushed to a
69,848,046 byte segment.

Mike


Re: ramBufferSizeMB not reflected in segment sizes in index

2010-12-02 Thread Yonik Seeley
On Wed, Dec 1, 2010 at 3:01 PM, Shawn Heisey  wrote:
> I have seen this.  In Solr 1.4.1, the .fdt, .fdx, and the .tv* files do not
> segment, but all the other files do.  I can't remember whether it behaves
> the same under 3.1, or whether it also creates these files in each segment.

Yep, that's the shared doc store (where stored fields go.. the
non-inverted part of the index), and it works like that in 3.x and
trunk too.
It's nice because when you merge segments, you don't have to re-copy
the docs (provided you're within a single indexing session).
There have been discussions about removing it in trunk though... we'll see.

-Yonik
http://www.lucidimagination.com


RE: ramBufferSizeMB not reflected in segment sizes in index

2010-12-02 Thread Burton-West, Tom
Hi Mike,

We turned on infoStream.  Is there documentation about how to interpret it, or
should I just grep through the codebase?

Is the excerpt below what I am looking for as far as understanding the 
relationship between ramBufferSize and size on disk?
is newFlushedSize the size on disk in bytes?


DW:   ramUsed=329.782 MB newFlushedSize=74520060 docs/MB=0.943 new/old=21.55%

RAM: now balance allocations: usedMB=325.997 vs trigger=320 deletesMB=0.048 
byteBlockFree=0.125 perDocFree=0.006 charBlockFree=0
...
DW: after free: freedMB=0.225 usedMB=325.82
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; 
http-8091-Processor12]: flush: now pause all indexing threads
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; 
http-8091-Processor12]:   flush: segment=_5h docStoreSegment=_5e 
docStoreOffset=266 flushDocs=true flushDeletes=false 
flushDocStores=false numDocs=40 numBufDelTerms=40
... Dec 1, 2010 5:40:22 PM   purge field=geographic
Dec 1, 2010 5:40:22 PM   purge field=serialTitle_ab
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; 
http-8091-Processor12]: DW:   ramUsed=325.772 MB newFlushedSize=69848046 
docs/MB=0.6 new/old=20.447%
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; 
http-8091-Processor12]: flushedFiles=[_5h.frq, _5h.tis, _5h.prx, _5h.nrm, 
_5h.fnm, _5h.tii]
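
For anyone reading these lines later: the new/old figure appears to be just newFlushedSize divided by ramUsed, taking 1 MB as 1024*1024 bytes. The short Python sketch below pulls the two numbers out of a DW summary line and recomputes the ratio. The regular expression and the function name are my own and not part of Lucene, and the line format may differ in other versions.

```python
import re

# Matches the DW flush summary as it appears in the excerpt above.
# The pattern is an assumption based on these two lines only.
DW_LINE = re.compile(r"ramUsed=([\d.]+) MB newFlushedSize=(\d+)")

def flush_ratio(line):
    """Return (ram_mb, flushed_bytes, ratio) for one DW summary line."""
    m = DW_LINE.search(line)
    if m is None:
        raise ValueError("not a DW flush summary line")
    ram_mb = float(m.group(1))
    flushed_bytes = int(m.group(2))
    # Ratio of on-disk bytes to RAM bytes (1 MB taken as 1024 * 1024 bytes).
    ratio = flushed_bytes / (ram_mb * 1024 * 1024)
    return ram_mb, flushed_bytes, ratio

lines = [
    "DW:   ramUsed=329.782 MB newFlushedSize=74520060 docs/MB=0.943 new/old=21.55%",
    "DW:   ramUsed=325.772 MB newFlushedSize=69848046 docs/MB=0.6 new/old=20.447%",
]
for line in lines:
    ram_mb, flushed, ratio = flush_ratio(line)
    print(f"{ram_mb} MB in RAM -> {flushed} bytes on disk ({ratio:.2%})")
```

Both recomputed ratios agree with the new/old values printed in the log, which suggests newFlushedSize is indeed the flushed segment size in bytes.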



Tom


-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, December 01, 2010 3:43 PM
To: solr-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index

On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom  wrote:
> Thanks Mike,
>
> Yes we have many unique terms due to dirty OCR and 400 languages and probably 
> lots of low doc freq terms as well (although with the ICUTokenizer and 
> ICUFoldingFilter we should get fewer terms due to bad tokenization and 
> normalization.)

OK likely this explains the lowish RAM efficiency.

> Is this additional overhead because each unique term takes a certain amount 
> of space compared to adding entries to a list for an existing term?

Exactly.  There's a highish "startup cost" for each term but then
appending docs/positions to that term is more efficient especially for
higher frequency terms.  In the limit, a single unique term  across
all docs will have very high RAM efficiency...

> Does turning on IndexWriter's infoStream have a significant impact on memory 
> use or indexing speed?

I don't believe so

Mike


Re: ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Michael McCandless
On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom  wrote:
> Thanks Mike,
>
> Yes we have many unique terms due to dirty OCR and 400 languages and probably 
> lots of low doc freq terms as well (although with the ICUTokenizer and 
> ICUFoldingFilter we should get fewer terms due to bad tokenization and 
> normalization.)

OK likely this explains the lowish RAM efficiency.

> Is this additional overhead because each unique term takes a certain amount 
> of space compared to adding entries to a list for an existing term?

Exactly.  There's a highish "startup cost" for each term but then
appending docs/positions to that term is more efficient especially for
higher frequency terms.  In the limit, a single unique term  across
all docs will have very high RAM efficiency...
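
A toy model may make the "startup cost" point concrete. It is only a sketch: every byte cost below is a made-up placeholder, not one of Lucene's real numbers, and the function name is hypothetical. The only thing it is meant to show is the shape of the curve: a fixed per-term cost in RAM dominates when each term has few postings, and is amortized away when terms repeat often.

```python
def ram_efficiency(unique_terms, postings_per_term,
                   term_startup_bytes=100,     # hypothetical fixed RAM cost per new term
                   ram_bytes_per_posting=4,    # hypothetical RAM cost per appended posting
                   disk_bytes_per_term=10,     # hypothetical on-disk cost per term
                   disk_bytes_per_posting=2):  # hypothetical on-disk cost per posting
    """Flushed size divided by RAM used, under the placeholder costs above."""
    postings = unique_terms * postings_per_term
    ram = unique_terms * term_startup_bytes + postings * ram_bytes_per_posting
    disk = unique_terms * disk_bytes_per_term + postings * disk_bytes_per_posting
    return disk / ram

# Many rare terms (e.g. dirty OCR) versus terms that repeat often.
for per_term in (1, 10, 1000):
    print(per_term, round(ram_efficiency(1_000_000, per_term), 3))
```

With these placeholder costs the ratio climbs as postings per term grow, which is the qualitative behaviour described above; the absolute values mean nothing.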

> Does turning on IndexWriter's infoStream have a significant impact on memory 
> use or indexing speed?

I don't believe so

Mike


RE: ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Burton-West, Tom
Thanks Mike,

Yes we have many unique terms due to dirty OCR and 400 languages and probably 
lots of low doc freq terms as well (although with the ICUTokenizer and 
ICUFoldingFilter we should get fewer terms due to bad tokenization and 
normalization.)

Is this additional overhead because each unique term takes a certain amount of 
space compared to adding entries to a list for an existing term?

Does turning on IndexWriter's infoStream have a significant impact on memory use 
or indexing speed?

If it does, I'll reproduce this on our test server rather than turning it on 
for a bit on the production indexer.  If it doesn't I'll turn it on and post 
here.

Tom

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com] 
Sent: Wednesday, December 01, 2010 2:43 PM
To: solr-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index

The ram efficiency (= size of segment once flushed divided by size of
RAM buffer) can vary drastically.

Because the in-RAM data structures must be "growable" (to append new
docs to the postings as they are encountered), the efficiency is never
100%.  I think 50% is actually a "good" ram efficiency, and lower than
that (even down to 27%) I think is still normal.

Do you have many unique or low-doc-freq terms?  That brings the efficiency down.

If you turn on IndexWriter's infoStream and post the output we can see
if anything odd is going on...

80 * 20 = ~1.6 GB so I'm not sure why you're getting 1 GB segments.
Do you do any deletions in this run?  A merged segment's size will often
be less than the sum of its parts, especially if there are many terms
that are shared across segments.  The infoStream will also show what
merges are taking place.

Mike

On Wed, Dec 1, 2010 at 2:13 PM, Burton-West, Tom  wrote:
> We are using a recent Solr 3.x (See below for exact version).
>
> We have set the ramBufferSizeMB to 320 in both the indexDefaults and the 
> mainIndex sections of our solrconfig.xml:
>
> <ramBufferSizeMB>320</ramBufferSizeMB>
> <mergeFactor>20</mergeFactor>
>
> We expected that this would mean that the index would not write to disk until 
> it reached somewhere approximately over 300MB in size.
> However, we see many small segments that look to be around 80MB in size.
>
> We have not yet issued a single commit so nothing else should force a write 
> to disk.
>
> With a merge factor of 20 we also expected to see larger segments somewhere 
> around 320 * 20 = 6GB in size, however we see several around 1GB.
>
> We understand that the sizes are approximate, but these seem nowhere near 
> what we expected.
>
> Can anyone explain what is going on?
>
> BTW
> maxBufferedDocs is commented out, so this should not be affecting the buffer 
> flushes
> 
>
>
> Solr Specification Version: 3.0.0.2010.11.19.16.00.54
> Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
> Lucene Specification Version: 3.1-SNAPSHOT
> Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10
>
> Tom Burton-West
>
>


Re: ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Shawn Heisey

On 12/1/2010 12:13 PM, Burton-West, Tom wrote:

> We have set the ramBufferSizeMB to 320 in both the indexDefaults and the 
> mainIndex sections of our solrconfig.xml:
>
> <ramBufferSizeMB>320</ramBufferSizeMB>
> <mergeFactor>20</mergeFactor>
>
> We expected that this would mean that the index would not write to disk until 
> it reached somewhere approximately over 300MB in size.
> However, we see many small segments that look to be around 80MB in size.
>
> We have not yet issued a single commit so nothing else should force a write to 
> disk.
>
> With a merge factor of 20 we also expected to see larger segments somewhere 
> around 320 * 20 = 6GB in size, however we see several around 1GB.
>
> We understand that the sizes are approximate, but these seem nowhere near what 
> we expected.


I have seen this.  In Solr 1.4.1, the .fdt, .fdx, and the .tv* files do 
not segment, but all the other files do.  I can't remember whether it 
behaves the same under 3.1, or whether it also creates these files in 
each segment.


Here's the first segment created during a test reindex I just started, 
excluding the previously mentioned files, which will be prefixed by _57 
until I choose to optimize the index:


-rw-r--r-- 1 ncindex ncindex      315 Dec  1 12:40 _58.fnm
-rw-r--r-- 1 ncindex ncindex 26000115 Dec  1 12:40 _58.frq
-rw-r--r-- 1 ncindex ncindex   399124 Dec  1 12:40 _58.nrm
-rw-r--r-- 1 ncindex ncindex 23879227 Dec  1 12:40 _58.prx
-rw-r--r-- 1 ncindex ncindex   205874 Dec  1 12:40 _58.tii
-rw-r--r-- 1 ncindex ncindex 16000953 Dec  1 12:40 _58.tis

My ramBufferSize is 256MB, and those files add up to about 66MB.  My 
guess is that it takes  256MB of RAM to represent what condenses down to 
66MB on the disk.


When it had accumulated 16 segments, it merged them down to this, all 
the while continuing to index.  This is about 870MB:


-rw-r--r-- 1 ncindex ncindex       338 Dec  1 12:56 _5n.fnm
-rw-r--r-- 1 ncindex ncindex 376423659 Dec  1 12:58 _5n.frq
-rw-r--r-- 1 ncindex ncindex   5726860 Dec  1 12:58 _5n.nrm
-rw-r--r-- 1 ncindex ncindex 331890058 Dec  1 12:58 _5n.prx
-rw-r--r-- 1 ncindex ncindex   2037072 Dec  1 12:58 _5n.tii
-rw-r--r-- 1 ncindex ncindex 154470775 Dec  1 12:58 _5n.tis
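
To check the "about 66MB" and "about 870MB" figures, the byte counts from the two listings can simply be added up. The sketch below does that in Python and also compares the merged segment with 16 times the first flushed segment. That last step assumes all 16 flushed segments were roughly the size of _58, which the listings do not actually show, and the shared stored-field and term-vector files are left out as before.

```python
# File sizes in bytes, copied from the two listings above.
flushed_58 = [315, 26000115, 399124, 23879227, 205874, 16000953]
merged_5n = [338, 376423659, 5726860, 331890058, 2037072, 154470775]

flushed_total = sum(flushed_58)
merged_total = sum(merged_5n)

print(f"_58 total: {flushed_total / 1e6:.1f} MB")
print(f"_5n total: {merged_total / 1e6:.1f} MB")

# Assumption: all 16 flushed segments were about the size of _58.
naive_sum = 16 * flushed_total
print(f"16 x _58:  {naive_sum / 1e6:.1f} MB")
print(f"merged / naive sum: {merged_total / naive_sum:.0%}")
```

Under that assumption the merged segment comes out smaller than the plain sum of its inputs.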

If this merge were to happen 16 more times (256 segments created), it 
would then do a super-merge down to one very large segment.  In your 
case, with a mergeFactor of 20, that would take 400 segments.  I only 
ever saw this happen once - when I built a single index with all 49 
million documents in it.
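
The segment counts above follow from the merge factor: a first-level merge needs mergeFactor flushed segments, and the next level needs mergeFactor of those, so mergeFactor squared flushes in total. Here is a minimal simulation of that cascade in Python. It assumes a simplified, purely count-based policy, not Lucene's actual merge policy (which also looks at segment sizes), and the function name is made up for illustration.

```python
def simulate(flushes, merge_factor):
    """Count segments per level after a number of flushes.

    Simplified model: whenever a level holds merge_factor segments,
    they are merged into one segment on the next level.
    """
    levels = [0]  # levels[i] = number of segments currently at level i
    for _ in range(flushes):
        levels[0] += 1
        i = 0
        while levels[i] == merge_factor:
            levels[i] = 0
            if i + 1 == len(levels):
                levels.append(0)
            levels[i + 1] += 1
            i += 1
    return levels

print(simulate(16, 16))    # mergeFactor 16: first merge after 16 flushes
print(simulate(256, 16))   # mergeFactor 16: second-level merge after 256 flushes
print(simulate(400, 20))   # mergeFactor 20: second-level merge after 400 flushes
```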


Shawn



Re: ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Michael McCandless
The ram efficiency (= size of segment once flushed divided by size of
RAM buffer) can vary drastically.

Because the in-RAM data structures must be "growable" (to append new
docs to the postings as they are encountered), the efficiency is never
100%.  I think 50% is actually a "good" ram efficiency, and lower than
that (even down to 27%) I think is still normal.

Do you have many unique or low-doc-freq terms?  That brings the efficiency down.

If you turn on IndexWriter's infoStream and post the output we can see
if anything odd is going on...

80 * 20 = ~1.6 GB so I'm not sure why you're getting 1 GB segments.
Do you do any deletions in this run?  A merged segment's size will often
be less than the sum of its parts, especially if there are many terms
that are shared across segments.  The infoStream will also show what
merges are taking place.
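
To spell out the arithmetic: the expected size of a first-level merged segment is roughly the flushed segment size times the merge factor, and the flushed size is the RAM buffer times the RAM efficiency. A small Python sketch follows; the function name is mine, and the result is only an upper bound since merging can shrink the total.

```python
def expected_merged_mb(ram_buffer_mb, ram_efficiency, merge_factor):
    """Rough upper bound for a first-level merged segment, in MB."""
    flushed_mb = ram_buffer_mb * ram_efficiency
    return flushed_mb * merge_factor

# A 320 MB buffer flushing to ~80 MB segments implies ~25% efficiency.
efficiency = 80 / 320
print(expected_merged_mb(320, efficiency, 20))  # 1600.0 MB, i.e. ~1.6 GB
print(expected_merged_mb(320, 1.0, 20))         # 6400.0 MB if flushes were buffer-sized
```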

Mike

On Wed, Dec 1, 2010 at 2:13 PM, Burton-West, Tom  wrote:
> We are using a recent Solr 3.x (See below for exact version).
>
> We have set the ramBufferSizeMB to 320 in both the indexDefaults and the 
> mainIndex sections of our solrconfig.xml:
>
> <ramBufferSizeMB>320</ramBufferSizeMB>
> <mergeFactor>20</mergeFactor>
>
> We expected that this would mean that the index would not write to disk until 
> it reached somewhere approximately over 300MB in size.
> However, we see many small segments that look to be around 80MB in size.
>
> We have not yet issued a single commit so nothing else should force a write 
> to disk.
>
> With a merge factor of 20 we also expected to see larger segments somewhere 
> around 320 * 20 = 6GB in size, however we see several around 1GB.
>
> We understand that the sizes are approximate, but these seem nowhere near 
> what we expected.
>
> Can anyone explain what is going on?
>
> BTW
> maxBufferedDocs is commented out, so this should not be affecting the buffer 
> flushes
> 
>
>
> Solr Specification Version: 3.0.0.2010.11.19.16.00.54
> Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
> Lucene Specification Version: 3.1-SNAPSHOT
> Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10
>
> Tom Burton-West
>
>


ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Burton-West, Tom
We are using a recent Solr 3.x (See below for exact version).

We have set the ramBufferSizeMB to 320 in both the indexDefaults and the 
mainIndex sections of our solrconfig.xml:

<ramBufferSizeMB>320</ramBufferSizeMB>
<mergeFactor>20</mergeFactor>

We expected that this would mean that the index would not write to disk until 
it reached somewhere approximately over 300MB in size.
However, we see many small segments that look to be around 80MB in size.

We have not yet issued a single commit so nothing else should force a write to 
disk.

With a merge factor of 20 we also expected to see larger segments somewhere 
around 320 * 20 = 6GB in size, however we see several around 1GB.

We understand that the sizes are approximate, but these seem nowhere near what 
we expected.

Can anyone explain what is going on?

BTW
maxBufferedDocs is commented out, so this should not be affecting the buffer 
flushes



Solr Specification Version: 3.0.0.2010.11.19.16.00.54
Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
Lucene Specification Version: 3.1-SNAPSHOT
Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10

Tom Burton-West