Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor
Thanks Robert,

>> if not, just customize blocktree's params with a CodecFactory in solr,
>> or even pick another implementation (FixedGap, VariableGap, whatever).

I'm still trying to get my head around 4.0 and flexible indexing. I'll take another look at Mike's and your presentations. I'm trying to figure out how to get from the Lucene JavaDocs you pointed out to how to specify things in Solr and its config files.

Is there an example CodecFactory somewhere I could look at? Also, is there an example somewhere of how to specify a CodecFactory/Codec in Solr using schema.xml or solrconfig.xml? Is there some simple way to specify minBlockSize and maxBlockSize in schema.xml?

Once I get this all working and understand it, I'll be happy to draft some documentation.

I'm really looking forward to experimenting with 4.0!

Tom

On Fri, Sep 7, 2012 at 2:58 PM, Robert Muir wrote:

> On Fri, Sep 7, 2012 at 2:19 PM, Tom Burton-West wrote:
> > Thanks Robert,
> >
> > I'll have to spend some time understanding the default codec for Solr 4.0.
> > Did I miss something in the changes file?
>
> http://lucene.apache.org/core/4_0_0-BETA/
>
> see the file formats section, especially
>
> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termdictionary
>
> (since blocktree "covers" term dictionary and terms index)
>
> > I'll be digging into the default codec docs and testing sometime in the next week or two (with a 2 billion term index). If I understand it well enough, I'll be happy to draft some changes up for either the wiki or the Solr example solrconfig.xml file.
>
> right i think we should remove these parameters.
>
> > Does this mean that the default codec will reduce memory use for the terms index enough so I don't need to use either of these settings to deal with my > 2 billion term indexes?
>
> probably. i don't know enough about your terms or how much RAM you have to say for sure.
>
> if not, just customize blocktree's params with a CodecFactory in solr, or even pick another implementation (FixedGap, VariableGap, whatever).
>
> the interval/divisor stuff is mostly only useful if you are not reindexing from scratch: e.g. if you are gonna plop your 3.x index into 4.x then you should set those to whatever you were using before, since it will be using PreflexCodec to read those.
>
> --
> lucidworks.com
Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor
On Fri, Sep 7, 2012 at 2:19 PM, Tom Burton-West wrote:

> Thanks Robert,
>
> I'll have to spend some time understanding the default codec for Solr 4.0.
> Did I miss something in the changes file?

http://lucene.apache.org/core/4_0_0-BETA/

see the file formats section, especially

http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termdictionary

(since blocktree "covers" term dictionary and terms index)

> I'll be digging into the default codec docs and testing sometime in the next week or two (with a 2 billion term index). If I understand it well enough, I'll be happy to draft some changes up for either the wiki or the Solr example solrconfig.xml file.

right i think we should remove these parameters.

> Does this mean that the default codec will reduce memory use for the terms index enough so I don't need to use either of these settings to deal with my > 2 billion term indexes?

probably. i don't know enough about your terms or how much RAM you have to say for sure.

if not, just customize blocktree's params with a CodecFactory in solr, or even pick another implementation (FixedGap, VariableGap, whatever).

the interval/divisor stuff is mostly only useful if you are not reindexing from scratch: e.g. if you are gonna plop your 3.x index into 4.x then you should set those to whatever you were using before, since it will be using PreflexCodec to read those.

--
lucidworks.com
Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor
Thanks Robert,

I'll have to spend some time understanding the default codec for Solr 4.0. Did I miss something in the changes file?

I'll be digging into the default codec docs and testing sometime in the next week or two (with a 2 billion term index). If I understand it well enough, I'll be happy to draft some changes up for either the wiki or the Solr example solrconfig.xml file.

Does this mean that the default codec will reduce memory use for the terms index enough so I don't need to use either of these settings to deal with my > 2 billion term indexes?

If neither of these parameters makes sense for the default codec, then maybe they need to be commented out or removed from the Solr example solrconfig.xml.

Tom

On Fri, Sep 7, 2012 at 1:33 PM, Robert Muir wrote:

> Hi Tom: I already enhanced the javadocs about this for Lucene, putting warnings everywhere in bold:
>
> NOTE: This parameter does not apply to all PostingsFormat implementations, including the default one in this release. It only makes sense for term indexes that are implemented as a fixed gap between terms.
>
> NOTE: divisor settings > 1 do not apply to all PostingsFormat implementations, including the default one in this release. It only makes sense for terms indexes that can efficiently re-sample terms at load time.
>
> etc.
>
> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
>
> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/DirectoryReader.html#open%28org.apache.lucene.store.Directory,%20int%29
>
> In the future I expect these parameters will be removed completely: anything like this is specific to the codec/implementation.
>
> In Lucene 4.0 the terms index works completely differently: these parameters don't make sense for it.
>
> On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West wrote:
> > Hello all,
> >
> > Due to multiple languages and dirty OCR, our indexes have over 2 billion unique terms (http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again). In Solr 3.6 and earlier we needed to reduce the memory used for storing the in-memory representation of the tii file. We originally used the termInfosIndexDivisor, which affects the sampling of the tii file when it is read into memory. While this solved our problem for searching, unfortunately the termInfosIndexDivisor was not read during indexing and caused OOM problems once our indexes grew beyond a certain size. See: https://issues.apache.org/jira/browse/SOLR-2290.
> >
> > Has this been changed in Solr 4.0?
> >
> > The advantage of using the termInfosIndexDivisor is that it can be changed without re-indexing, so we were able to experiment with different settings to determine a good setting without re-indexing several terabytes of data.
> >
> > When we ran into problems with the memory use for the in-memory representation of the tii file during indexing, we changed the termIndexInterval. The termIndexInterval is an indexing-time setting which affects the size of the tii file. It sets the sampling of the tis file that gets written to the tii file.
> >
> > In Solr 4.0 termInfosIndexDivisor has been replaced with termIndexDivisor. The documentation for these two features, the index-time termIndexInterval and the run-time termIndexDivisor, no longer seems to be on the Solr config page of the wiki, and the documentation in the example file does not explain what the termIndexDivisor does.
> >
> > Would it be appropriate to add these back to the wiki page? If not, could someone add a line or two to the comments in the Solr 4.0 example file explaining what the termIndexDivisor does?
> >
> > Tom
>
> --
> lucidworks.com
Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor
Hi Tom: I already enhanced the javadocs about this for Lucene, putting warnings everywhere in bold:

NOTE: This parameter does not apply to all PostingsFormat implementations, including the default one in this release. It only makes sense for term indexes that are implemented as a fixed gap between terms.

NOTE: divisor settings > 1 do not apply to all PostingsFormat implementations, including the default one in this release. It only makes sense for terms indexes that can efficiently re-sample terms at load time.

etc.

http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29

http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/DirectoryReader.html#open%28org.apache.lucene.store.Directory,%20int%29

In the future I expect these parameters will be removed completely: anything like this is specific to the codec/implementation.

In Lucene 4.0 the terms index works completely differently: these parameters don't make sense for it.

On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West wrote:

> Hello all,
>
> Due to multiple languages and dirty OCR, our indexes have over 2 billion unique terms (http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again). In Solr 3.6 and earlier we needed to reduce the memory used for storing the in-memory representation of the tii file. We originally used the termInfosIndexDivisor, which affects the sampling of the tii file when it is read into memory. While this solved our problem for searching, unfortunately the termInfosIndexDivisor was not read during indexing and caused OOM problems once our indexes grew beyond a certain size. See: https://issues.apache.org/jira/browse/SOLR-2290.
>
> Has this been changed in Solr 4.0?
>
> The advantage of using the termInfosIndexDivisor is that it can be changed without re-indexing, so we were able to experiment with different settings to determine a good setting without re-indexing several terabytes of data.
>
> When we ran into problems with the memory use for the in-memory representation of the tii file during indexing, we changed the termIndexInterval. The termIndexInterval is an indexing-time setting which affects the size of the tii file. It sets the sampling of the tis file that gets written to the tii file.
>
> In Solr 4.0 termInfosIndexDivisor has been replaced with termIndexDivisor. The documentation for these two features, the index-time termIndexInterval and the run-time termIndexDivisor, no longer seems to be on the Solr config page of the wiki, and the documentation in the example file does not explain what the termIndexDivisor does.
>
> Would it be appropriate to add these back to the wiki page? If not, could someone add a line or two to the comments in the Solr 4.0 example file explaining what the termIndexDivisor does?
>
> Tom

--
lucidworks.com
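
For readers following the interval/divisor discussion above, the two settings can be modelled with simple arithmetic. The following is only a rough sketch, not Lucene code: the class and method names are hypothetical, the model assumes a 3.x-style fixed-gap terms index in which every `termIndexInterval`-th term of the tis file is written to the tii file and every `divisor`-th tii entry is kept in memory, and the interval of 128 is an assumed 3.x default rather than a value taken from this thread. As noted earlier in the thread, none of this applies to the default Lucene 4.0 terms index.

```java
// Back-of-the-envelope model of fixed-gap term index sampling (3.x style).
// All names are illustrative assumptions; this does not call any Lucene API.
class TermIndexSampling {

    // Ceiling division for positive values.
    static long ceilDiv(long n, long d) {
        return (n + d - 1) / d;
    }

    // Entries written to the on-disk terms index (tii) when every
    // `termIndexInterval`-th term of the dictionary (tis) is sampled at index time.
    static long indexEntriesOnDisk(long uniqueTerms, int termIndexInterval) {
        return ceilDiv(uniqueTerms, termIndexInterval);
    }

    // Entries held in memory when only every `divisor`-th tii entry
    // is kept at reader-open time.
    static long indexEntriesInMemory(long entriesOnDisk, int divisor) {
        return ceilDiv(entriesOnDisk, divisor);
    }

    public static void main(String[] args) {
        long uniqueTerms = 2_000_000_000L; // "over 2 billion unique terms"
        int interval = 128;                // assumed 3.x default, not from the thread
        long onDisk = indexEntriesOnDisk(uniqueTerms, interval);
        for (int divisor : new int[] {1, 2, 4, 8}) {
            System.out.printf("interval=%d divisor=%d -> %d entries in memory%n",
                    interval, divisor, indexEntriesInMemory(onDisk, divisor));
        }
    }
}
```

Under these assumptions, 2 billion terms at an interval of 128 yield 15,625,000 tii entries, and a divisor of 4 keeps 3,906,250 of them in memory. The sketch also shows why the divisor can be tuned without re-indexing (it only changes what is loaded) while the interval cannot (it changes what is written).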