Re: ramBufferSizeMB not reflected in segment sizes in index
On Thu, Dec 2, 2010 at 4:31 PM, Burton-West, Tom wrote:
> We turned on infoStream. Is there documentation about how to interpret
> it, or should I just grep through the codebase?

There isn't any documentation... and it changes over time as we add new
diagnostics.

> Is the excerpt below what I am looking for as far as understanding the
> relationship between ramBufferSizeMB and size on disk?
> Is newFlushedSize the size on disk in bytes?

Yes -- so, taking the two DW lines in your excerpt: in the first flush,
IW's buffer was using 329.782 MB RAM and was flushed to a 74,520,060 byte
segment; in the second, 325.772 MB of buffer was flushed to a 69,848,046
byte segment.

Mike
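(To spell out the arithmetic behind that answer, using the second flush
from the excerpt:

  69,848,046 bytes / 1,048,576 ≈ 66.6 MB on disk
  66.6 MB / 325.772 MB of buffered RAM ≈ 20.4%

which lines up with the new/old=20.447% that the infoStream prints -- a RAM
efficiency of roughly 20%.)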
Re: ramBufferSizeMB not reflected in segment sizes in index
On Wed, Dec 1, 2010 at 3:01 PM, Shawn Heisey wrote:
> I have seen this. In Solr 1.4.1, the .fdt, .fdx, and the .tv* files do
> not segment, but all the other files do. I can't remember whether it
> behaves the same under 3.1, or whether it also creates these files in
> each segment.

Yep, that's the shared doc store (where stored fields go... the
non-inverted part of the index), and it works like that in 3.x and trunk
too. It's nice because when you merge segments, you don't have to re-copy
the docs (provided you're within a single indexing session). There have
been discussions about removing it in trunk though... we'll see.

-Yonik
http://www.lucidimagination.com
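For illustration only (hypothetical segment names, not taken from anyone's
actual index), a directory holding two segments flushed within a single
indexing session would look roughly like this -- one set of inverted-index
files per segment, plus a single doc store named after the first segment
of the session:

  _0.fnm _0.frq _0.prx _0.nrm _0.tii _0.tis   (per-segment inverted files)
  _1.fnm _1.frq _1.prx _1.nrm _1.tii _1.tis
  _0.fdt _0.fdx                               (shared stored fields)
  _0.tvx _0.tvd _0.tvf                        (shared term vectors, if enabled)

This matches what Tom's infoStream shows elsewhere in this thread: the new
segment _5h flushes with docStoreSegment=_5e.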
RE: ramBufferSizeMB not reflected in segment sizes in index
Hi Mike,

We turned on infoStream. Is there documentation about how to interpret it,
or should I just grep through the codebase?

Is the excerpt below what I am looking for as far as understanding the
relationship between ramBufferSizeMB and size on disk? Is newFlushedSize
the size on disk in bytes?

DW: ramUsed=329.782 MB newFlushedSize=74520060 docs/MB=0.943 new/old=21.55%
RAM: now balance allocations: usedMB=325.997 vs trigger=320 deletesMB=0.048 byteBlockFree=0.125 perDocFree=0.006 charBlockFree=0
...
DW: after free: freedMB=0.225 usedMB=325.82
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; http-8091-Processor12]: flush: now pause all indexing threads
Dec 1, 2010 5:40:22 PM IW 0 [Wed Dec 01 17:40:22 EST 2010; http-8091-Processor12]: flush: segment=_5h docStoreSegment=_5e docStoreOffset=266 flushDocs=true flushDeletes=false flushDocStores=false numDocs=40 numBufDelTerms=40
...
Dec 1, 2010 5:40:22 PM purge field=geographic
Dec 1, 2010 5:40:22 PM purge field=serialTitle_ab
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; http-8091-Processor12]: DW: ramUsed=325.772 MB newFlushedSize=69848046 docs/MB=0.6 new/old=20.447%
Dec 1, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; http-8091-Processor12]: flushedFiles=[_5h.frq, _5h.tis, _5h.prx, _5h.nrm, _5h.fnm, _5h.tii]

Tom

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Wednesday, December 01, 2010 3:43 PM
To: solr-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index

On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom wrote:
> Thanks Mike,
>
> Yes, we have many unique terms due to dirty OCR and 400 languages, and
> probably lots of low-doc-freq terms as well (although with the
> ICUTokenizer and ICUFoldingFilter we should get fewer of the terms caused
> by bad tokenization and normalization.)

OK, likely this explains the lowish RAM efficiency.

> Is this additional overhead because each unique term takes a certain
> amount of space, compared to adding entries to a list for an existing
> term?

Exactly. There's a highish "startup cost" for each term, but then
appending docs/positions to that term is more efficient, especially for
higher-frequency terms. In the limit, a single unique term across all docs
will have very high RAM efficiency...

> Does turning on IndexWriter's infoStream have a significant impact on
> memory use or indexing speed?

I don't believe so.

Mike
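(Two details worth noting in that excerpt, if I'm reading it correctly:
flushedFiles lists only the inverted-index files -- .frq, .prx, .tis, .tii,
.fnm, .nrm -- with no .fdt/.fdx entries, because flushDocStores=false and
the stored fields are accumulating in the shared doc store segment _5e.
And docs/MB appears to be documents per MB of flushed segment: 40 docs /
~66.6 MB ≈ 0.6, matching the logged docs/MB=0.6.)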
Re: ramBufferSizeMB not reflected in segment sizes in index
On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom wrote:
> Thanks Mike,
>
> Yes, we have many unique terms due to dirty OCR and 400 languages, and
> probably lots of low-doc-freq terms as well (although with the
> ICUTokenizer and ICUFoldingFilter we should get fewer of the terms caused
> by bad tokenization and normalization.)

OK, likely this explains the lowish RAM efficiency.

> Is this additional overhead because each unique term takes a certain
> amount of space, compared to adding entries to a list for an existing
> term?

Exactly. There's a highish "startup cost" for each term, but then
appending docs/positions to that term is more efficient, especially for
higher-frequency terms. In the limit, a single unique term across all docs
will have very high RAM efficiency...

> Does turning on IndexWriter's infoStream have a significant impact on
> memory use or indexing speed?

I don't believe so.

Mike
RE: ramBufferSizeMB not reflected in segment sizes in index
Thanks Mike,

Yes, we have many unique terms due to dirty OCR and 400 languages, and
probably lots of low-doc-freq terms as well (although with the
ICUTokenizer and ICUFoldingFilter we should get fewer of the terms caused
by bad tokenization and normalization.)

Is this additional overhead because each unique term takes a certain
amount of space, compared to adding entries to a list for an existing
term?

Does turning on IndexWriter's infoStream have a significant impact on
memory use or indexing speed? If it does, I'll reproduce this on our test
server rather than turning it on for a bit on the production indexer. If
it doesn't, I'll turn it on and post here.

Tom

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Wednesday, December 01, 2010 2:43 PM
To: solr-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index

The RAM efficiency (= size of segment once flushed divided by size of the
RAM buffer) can vary drastically. Because the in-RAM data structures must
be "growable" (to append new docs to the postings as they are
encountered), the efficiency is never 100%. I think 50% is actually a
"good" RAM efficiency, and lower than that (even down to 27%) I think is
still normal.

Do you have many unique or low-doc-freq terms? That brings the efficiency
down.

If you turn on IndexWriter's infoStream and post the output, we can see if
anything odd is going on...

80MB * 20 = ~1.6 GB, so I'm not sure why you're getting 1 GB segments. Do
you do any deletions in this run? A merged segment's size will often be
less than the sum of its parts, especially if there are many terms that
are shared across the segments -- but the infoStream will also show what
merges are taking place.

Mike

On Wed, Dec 1, 2010 at 2:13 PM, Burton-West, Tom wrote:
> We are using a recent Solr 3.x (see below for the exact version).
>
> We have set ramBufferSizeMB to 320 in both the indexDefaults and the
> mainIndex sections of our solrconfig.xml:
>
>   <ramBufferSizeMB>320</ramBufferSizeMB>
>   <mergeFactor>20</mergeFactor>
>
> We expected that this would mean that the index would not write to disk
> until it reached somewhere over 300MB in size. However, we see many
> small segments that look to be around 80MB in size.
>
> We have not yet issued a single commit, so nothing else should force a
> write to disk.
>
> With a merge factor of 20 we also expected to see larger segments
> somewhere around 320MB * 20 = ~6GB in size; however, we see several
> around 1GB.
>
> We understand that the sizes are approximate, but these seem nowhere
> near what we expected.
>
> Can anyone explain what is going on?
>
> BTW, maxBufferedDocs is commented out, so this should not be affecting
> the buffer flushes.
>
> Solr Specification Version: 3.0.0.2010.11.19.16.00.54
> Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
> Lucene Specification Version: 3.1-SNAPSHOT
> Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10
>
> Tom Burton-West
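For anyone who wants to reproduce this outside Solr, here is a minimal
sketch of turning infoStream on at the Lucene 3.x level. The index path,
analyzer, and log file name are placeholders, not anything from this
thread:

  import java.io.File;
  import java.io.PrintStream;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class InfoStreamDemo {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter(
          FSDirectory.open(new File("/tmp/testindex")),  // placeholder path
          new StandardAnalyzer(Version.LUCENE_30),
          IndexWriter.MaxFieldLength.UNLIMITED);
      // Route IndexWriter's diagnostics (flush events, ramUsed,
      // newFlushedSize, merges, ...) to a file instead of discarding them.
      writer.setInfoStream(new PrintStream("infostream.log"));
      // ... writer.addDocument(...) calls go here ...
      writer.close();
    }
  }

Within Solr the same switch is exposed through solrconfig.xml (recent
example configs ship with an <infoStream> element under indexDefaults, if
memory serves).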
Re: ramBufferSizeMB not reflected in segment sizes in index
On 12/1/2010 12:13 PM, Burton-West, Tom wrote:
> We have set ramBufferSizeMB to 320 in both the indexDefaults and the
> mainIndex sections of our solrconfig.xml:
>
>   <ramBufferSizeMB>320</ramBufferSizeMB>
>   <mergeFactor>20</mergeFactor>
>
> We expected that this would mean that the index would not write to disk
> until it reached somewhere over 300MB in size. However, we see many
> small segments that look to be around 80MB in size.
>
> We have not yet issued a single commit, so nothing else should force a
> write to disk.
>
> With a merge factor of 20 we also expected to see larger segments
> somewhere around 320MB * 20 = ~6GB in size; however, we see several
> around 1GB.
>
> We understand that the sizes are approximate, but these seem nowhere
> near what we expected.

I have seen this. In Solr 1.4.1, the .fdt, .fdx, and the .tv* files do not
segment, but all the other files do. I can't remember whether it behaves
the same under 3.1, or whether it also creates these files in each
segment.

Here's the first segment created during a test reindex I just started,
excluding the previously mentioned files, which will be prefixed by _57
until I choose to optimize the index:

-rw-r--r-- 1 ncindex ncindex       315 Dec  1 12:40 _58.fnm
-rw-r--r-- 1 ncindex ncindex  26000115 Dec  1 12:40 _58.frq
-rw-r--r-- 1 ncindex ncindex    399124 Dec  1 12:40 _58.nrm
-rw-r--r-- 1 ncindex ncindex  23879227 Dec  1 12:40 _58.prx
-rw-r--r-- 1 ncindex ncindex    205874 Dec  1 12:40 _58.tii
-rw-r--r-- 1 ncindex ncindex  16000953 Dec  1 12:40 _58.tis

My ramBufferSizeMB is 256, and those files add up to about 66MB. My guess
is that it takes 256MB of RAM to represent what condenses down to 66MB on
the disk.

When it had accumulated 16 segments, it merged them down to this, all the
while continuing to index. This is about 870MB:

-rw-r--r-- 1 ncindex ncindex       338 Dec  1 12:56 _5n.fnm
-rw-r--r-- 1 ncindex ncindex 376423659 Dec  1 12:58 _5n.frq
-rw-r--r-- 1 ncindex ncindex   5726860 Dec  1 12:58 _5n.nrm
-rw-r--r-- 1 ncindex ncindex 331890058 Dec  1 12:58 _5n.prx
-rw-r--r-- 1 ncindex ncindex   2037072 Dec  1 12:58 _5n.tii
-rw-r--r-- 1 ncindex ncindex 154470775 Dec  1 12:58 _5n.tis

If this merge were to happen 16 more times (256 segments created), it
would then do a super-merge down to one very large segment. In your case,
with a mergeFactor of 20, that would take 400 segments. I only ever saw
this happen once -- when I built a single index with all 49 million
documents in it.

Shawn
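(Shawn's numbers check out: the six files in the first listing sum to
66,485,608 bytes, about 66.5 MB, so 66.5 / 256 ≈ 26% RAM efficiency --
right at the low-but-normal end Mike describes. The merge is instructive
too: assuming the sixteen flushed segments were each roughly the size of
the first, they totaled about 16 * 66.5 ≈ 1,064 MB, yet the merged segment
is ~870 MB, about 18% smaller, presumably because terms shared across the
sixteen segments are stored only once after the merge.)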
Re: ramBufferSizeMB not reflected in segment sizes in index
The RAM efficiency (= size of segment once flushed divided by size of the
RAM buffer) can vary drastically. Because the in-RAM data structures must
be "growable" (to append new docs to the postings as they are
encountered), the efficiency is never 100%. I think 50% is actually a
"good" RAM efficiency, and lower than that (even down to 27%) I think is
still normal.

Do you have many unique or low-doc-freq terms? That brings the efficiency
down.

If you turn on IndexWriter's infoStream and post the output, we can see if
anything odd is going on...

80MB * 20 = ~1.6 GB, so I'm not sure why you're getting 1 GB segments. Do
you do any deletions in this run? A merged segment's size will often be
less than the sum of its parts, especially if there are many terms that
are shared across the segments -- but the infoStream will also show what
merges are taking place.

Mike

On Wed, Dec 1, 2010 at 2:13 PM, Burton-West, Tom wrote:
> We are using a recent Solr 3.x (see below for the exact version).
>
> We have set ramBufferSizeMB to 320 in both the indexDefaults and the
> mainIndex sections of our solrconfig.xml:
>
>   <ramBufferSizeMB>320</ramBufferSizeMB>
>   <mergeFactor>20</mergeFactor>
>
> We expected that this would mean that the index would not write to disk
> until it reached somewhere over 300MB in size. However, we see many
> small segments that look to be around 80MB in size.
>
> We have not yet issued a single commit, so nothing else should force a
> write to disk.
>
> With a merge factor of 20 we also expected to see larger segments
> somewhere around 320MB * 20 = ~6GB in size; however, we see several
> around 1GB.
>
> We understand that the sizes are approximate, but these seem nowhere
> near what we expected.
>
> Can anyone explain what is going on?
>
> BTW, maxBufferedDocs is commented out, so this should not be affecting
> the buffer flushes.
>
> Solr Specification Version: 3.0.0.2010.11.19.16.00.54
> Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
> Lucene Specification Version: 3.1-SNAPSHOT
> Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10
>
> Tom Burton-West
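(Plugging Tom's numbers into that definition: RAM efficiency = flushed
segment size / RAM buffer ≈ 80 MB / 320 MB = 25%, so the small segments
are consistent with the low-but-normal end of the range. The merge
arithmetic is the real puzzle: 20 segments * 80 MB = 1.6 GB before any
sharing of terms, so ~1 GB merged segments would mean the merged result is
only about 62% of the sum of its parts.)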
ramBufferSizeMB not reflected in segment sizes in index
We are using a recent Solr 3.x (see below for the exact version).

We have set ramBufferSizeMB to 320 in both the indexDefaults and the
mainIndex sections of our solrconfig.xml:

  <ramBufferSizeMB>320</ramBufferSizeMB>
  <mergeFactor>20</mergeFactor>

We expected that this would mean that the index would not write to disk
until it reached somewhere over 300MB in size. However, we see many small
segments that look to be around 80MB in size.

We have not yet issued a single commit, so nothing else should force a
write to disk.

With a merge factor of 20 we also expected to see larger segments
somewhere around 320MB * 20 = ~6GB in size; however, we see several around
1GB.

We understand that the sizes are approximate, but these seem nowhere near
what we expected.

Can anyone explain what is going on?

BTW, maxBufferedDocs is commented out, so this should not be affecting the
buffer flushes.

Solr Specification Version: 3.0.0.2010.11.19.16.00.54
Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
Lucene Specification Version: 3.1-SNAPSHOT
Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10

Tom Burton-West