Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
Thanks Robert,

>>if not, just customize blocktree's params with a CodecFactory in solr,
>>or even pick another implementation (FixedGap, VariableGap, whatever).

Still trying to get my head around 4.0 and flexible indexing.  I'll take
another look at Mike's and your presentations.  I'm trying to figure out
how to get from the Lucene JavaDocs you pointed out to how to specify
things in Solr and its config files.

Is there an example CodecFactory somewhere I could look at?  Also, is
there an example somewhere of how to specify a CodecFactory/Codec in
Solr using the schema.xml or solrconfig.xml?

Is there some simple way to specify minBlockSize and maxBlockSize in
schema.xml?

Once I get this all working and understand it, I'll be happy to draft some
documentation.

I'm really looking forward to experimenting with 4.0!

Tom



On Fri, Sep 7, 2012 at 2:58 PM, Robert Muir wrote:

> On Fri, Sep 7, 2012 at 2:19 PM, Tom Burton-West wrote:
> > Thanks Robert,
> >
> > I'll have to spend some time understanding the default codec for
> > Solr 4.0.  Did I miss something in the changes file?
>
> http://lucene.apache.org/core/4_0_0-BETA/
>
> see the file formats section, especially
>
> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termdictionary
>
> (since blocktree "covers" term dictionary and terms index)
>
> >
> > I'll be digging into the default codec docs and testing sometime in the
> > next week or two (with a 2 billion term index).  If I understand it well
> > enough, I'll be happy to draft some changes for either the wiki or the
> > Solr example solrconfig.xml file.
>
> right i think we should remove these parameters.
>
> >
> > Does this mean that the default codec will reduce memory use for the
> > terms index enough so I don't need to use either of these settings to
> > deal with my > 2 billion term indexes?
>
> probably. i don't know enough about your terms or how much RAM you have
> to say for sure.
>
> if not, just customize blocktree's params with a CodecFactory in solr,
> or even pick another implementation (FixedGap, VariableGap, whatever).
>
> the interval/divisor stuff is mostly only useful if you are not
> reindexing from scratch: e.g. if you are gonna plop your 3.x index
> into 4.x then you should set those to whatever you were using before,
> since it will be using PreflexCodec to read those.
>
> --
> lucidworks.com
>


Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Robert Muir
On Fri, Sep 7, 2012 at 2:19 PM, Tom Burton-West wrote:
> Thanks Robert,
>
> I'll have to spend some time understanding the default codec for Solr 4.0.
> Did I miss something in the changes file?

http://lucene.apache.org/core/4_0_0-BETA/

see the file formats section, especially
http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termdictionary

(since blocktree "covers" term dictionary and terms index)

>
> I'll be digging into the default codec docs and testing sometime in the
> next week or two (with a 2 billion term index).  If I understand it well
> enough, I'll be happy to draft some changes for either the wiki or the
> Solr example solrconfig.xml file.

right i think we should remove these parameters.

>
> Does this mean that the default codec will reduce memory use for the terms
> index enough so I don't need to use either of these settings to deal with
> my > 2 billion term indexes?

probably. i don't know enough about your terms or how much RAM you have
to say for sure.

if not, just customize blocktree's params with a CodecFactory in solr,
or even pick another implementation (FixedGap, VariableGap, whatever).

the interval/divisor stuff is mostly only useful if you are not
reindexing from scratch: e.g. if you are gonna plop your 3.x index
into 4.x then you should set those to whatever you were using before,
since it will be using PreflexCodec to read those.

-- 
lucidworks.com


Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
Thanks Robert,

I'll have to spend some time understanding the default codec for Solr 4.0.
Did I miss something in the changes file?

I'll be digging into the default codec docs and testing sometime in the
next week or two (with a 2 billion term index).  If I understand it well
enough, I'll be happy to draft some changes for either the wiki or the
Solr example solrconfig.xml file.

Does this mean that the default codec will reduce memory use for the terms
index enough so I don't need to use either of these settings to deal with
my > 2 billion term indexes?

If both of these parameters don't make sense for the default codec, then
maybe they need to be commented out or removed from the solr example
solrconfig.xml.

Tom

On Fri, Sep 7, 2012 at 1:33 PM, Robert Muir wrote:

> Hi Tom: I already enhanced the javadocs about this for Lucene, putting
> warnings everywhere in bold:
>
> NOTE: This parameter does not apply to all PostingsFormat
> implementations, including the default one in this release. It only
> makes sense for term indexes that are implemented as a fixed gap
> between terms.
> NOTE: divisor settings > 1 do not apply to all PostingsFormat
> implementations, including the default one in this release. It only
> makes sense for terms indexes that can efficiently re-sample terms at
> load time.
> etc
>
>
> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
>
> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/DirectoryReader.html#open%28org.apache.lucene.store.Directory,%20int%29
>
> In the future I expect these parameters will be removed completely:
> anything like this is specific to the codec/implementation.
>
> In Lucene 4.0 the terms index works completely differently: these
> parameters don't make sense for it.
>
> On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West wrote:
> > Hello all,
> >
> > Due to multiple languages and dirty OCR, our indexes have over 2 billion
> > unique terms (
> > http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).
> > In Solr 3.6 and earlier we needed to reduce the memory used for storing
> > the in-memory representation of the tii file.  We originally used the
> > termInfosIndexDivisor, which affects the sampling of the tii file when
> > read into memory.  While this solved our problem for searching,
> > unfortunately the termInfosIndexDivisor was not read during indexing,
> > and we hit OOM problems once our indexes grew beyond a certain size.
> > See: https://issues.apache.org/jira/browse/SOLR-2290.
> >
> > Has this been changed in Solr 4.0?
> >
> > The advantage of using the termInfosIndexDivisor is that it can be
> > changed without re-indexing, so we were able to experiment with
> > different settings to determine a good setting without re-indexing
> > several terabytes of data.
> >
> > When we ran into problems with the memory use for the in-memory
> > representation of the tii file during indexing, we changed the
> > termIndexInterval.  The termIndexInterval is an indexing-time setting
> >  which affects the size of the tii file.  It sets the sampling of the tis
> > file that gets written to the tii file.
> >
> > In Solr 4.0 termInfosIndexDivisor has been replaced with
> > termIndexDivisor.  The documentation for these two features, the
> > index-time termIndexInterval and the run-time termIndexDivisor, no longer
> > seems to be on the Solr config page of the wiki, and the documentation in
> > the example file does not explain what the termIndexDivisor does.
> >
> > Would it be appropriate to add these back to the wiki page?  If not,
> > could someone add a line or two to the comments in the Solr 4.0 example
> > file explaining what the termIndexDivisor does?
> >
> >
> > Tom
>
>
>
> --
> lucidworks.com
>


Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Robert Muir
Hi Tom: I already enhanced the javadocs about this for Lucene, putting
warnings everywhere in bold:

NOTE: This parameter does not apply to all PostingsFormat
implementations, including the default one in this release. It only
makes sense for term indexes that are implemented as a fixed gap
between terms.
NOTE: divisor settings > 1 do not apply to all PostingsFormat
implementations, including the default one in this release. It only
makes sense for terms indexes that can efficiently re-sample terms at
load time.
etc

http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/DirectoryReader.html#open%28org.apache.lucene.store.Directory,%20int%29

In the future I expect these parameters will be removed completely:
anything like this is specific to the codec/implementation.

In Lucene 4.0 the terms index works completely differently: these
parameters don't make sense for it.

On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West wrote:
> Hello all,
>
> Due to multiple languages and dirty OCR, our indexes have over 2 billion
> unique terms (
> http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).
> In Solr 3.6 and earlier we needed to reduce the memory used for storing
> the in-memory representation of the tii file.  We originally used the
> termInfosIndexDivisor, which affects the sampling of the tii file when read
> into memory.  While this solved our problem for searching, unfortunately
> the termInfosIndexDivisor was not read during indexing, and we hit OOM
> problems once our indexes grew beyond a certain size.  See:
> https://issues.apache.org/jira/browse/SOLR-2290.
>
> Has this been changed in Solr 4.0?
>
> The advantage of using the termInfosIndexDivisor is that it can be changed
> without re-indexing, so we were able to experiment with different settings
> to determine a good setting without re-indexing several terabytes of data.
>
> When we ran into problems with the memory use for the in-memory
> representation of the tii file during indexing, we changed the
> termIndexInterval.  The termIndexInterval is an indexing-time setting
>  which affects the size of the tii file.  It sets the sampling of the tis
> file that gets written to the tii file.
>
> In Solr 4.0 termInfosIndexDivisor has been replaced with
> termIndexDivisor.  The documentation for these two features, the
> index-time termIndexInterval and the run-time termIndexDivisor, no longer
> seems to be on the Solr config page of the wiki, and the documentation in
> the example file does not explain what the termIndexDivisor does.
>
> Would it be appropriate to add these back to the wiki page?  If not, could
> someone add a line or two to the comments in the Solr 4.0 example file
> explaining what the termIndexDivisor does?
>
>
> Tom
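
For anyone reading this thread later, the two 3.x-style settings Tom
describes above can be pictured with a toy fixed-gap terms index.  This is
only a rough sketch in Python, not Lucene code: the function names
(build_index, load_index, lookup) are made up for illustration, and the
values 128 and 4 are arbitrary examples.  As noted earlier in this message,
it does not describe the default 4.0 terms index.

```python
import bisect

def build_index(sorted_terms, term_index_interval):
    """Index-time sampling: keep every Nth term and its position.

    Plays the role of the tii file sampled from the tis file (toy model).
    """
    return [(term, pos) for pos, term in enumerate(sorted_terms)
            if pos % term_index_interval == 0]

def load_index(index_entries, divisor):
    """Load-time subsampling: keep only every Dth index entry in memory."""
    return index_entries[::divisor]

def lookup(sorted_terms, loaded, term):
    """Binary-search the in-memory sample, then scan the full term list.

    Returns (position or None, number of terms scanned).
    """
    keys = [t for t, _ in loaded]
    i = bisect.bisect_right(keys, term) - 1
    if i < 0:
        return None, 0  # term sorts before the first indexed term
    start = loaded[i][1]
    end = loaded[i + 1][1] if i + 1 < len(loaded) else len(sorted_terms)
    scanned = 0
    for pos in range(start, end):
        scanned += 1
        if sorted_terms[pos] == term:
            return pos, scanned
    return None, scanned

# Hypothetical data: 100,000 sorted terms standing in for a term dictionary.
terms = ["term%06d" % n for n in range(100_000)]
on_disk = build_index(terms, 128)   # fixed at indexing time
in_memory = load_index(on_disk, 4)  # can be changed at load time
print(len(on_disk), len(in_memory))
print(lookup(terms, in_memory, "term000511"))
```

In this model the interval is baked in when the index is written, while the
divisor only changes how much of the written index is held in memory, which
is why it can be tuned without re-indexing.  The cost is a longer scan: the
worst-case number of terms scanned per lookup grows to roughly
interval * divisor.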



-- 
lucidworks.com