OK, my stemming statement caused more confusion than it cleared up. The only reason I mentioned it was to illustrate that performance is not linearly related to index size.
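To make the scheme explained in the rest of this message concrete, here is a minimal, self-contained Java sketch. It does not use Lucene at all: the class ExactPhraseIndexSketch, its methods, and toyStem (a crude stand-in for a real stemmer such as PorterStemmer) are hypothetical names made up for illustration, and the sketch handles a single document only. The "$" suffix marks the unstemmed form of a word, as in the term list further down.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ExactPhraseIndexSketch {
    // term -> positions at which the term occurs (one document, for brevity)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Toy stand-in for a real stemmer: strips a trailing "ing" and
    // collapses a doubled final consonant ("running" -> "runn" -> "run").
    static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            word = word.substring(0, word.length() - 3);
            int n = word.length();
            if (n >= 2 && word.charAt(n - 1) == word.charAt(n - 2)) {
                word = word.substring(0, n - 1);
            }
        }
        return word;
    }

    // Index the stemmed form and the exact form ("word$") at the SAME position.
    void index(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            add(toyStem(words[pos]), pos);
            add(words[pos] + "$", pos);
        }
    }

    private void add(String term, int pos) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(pos);
    }

    // Unquoted search: stem the query word and look up the stemmed term.
    boolean matchesTerm(String word) {
        return postings.containsKey(toyStem(word.toLowerCase()));
    }

    // Quoted search: only the exact ("$") forms count, at consecutive positions.
    boolean matchesPhrase(String phrase) {
        String[] words = phrase.toLowerCase().split("\\s+");
        Set<Integer> starts =
            postings.getOrDefault(words[0] + "$", Collections.<Integer>emptySet());
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < words.length && ok; i++) {
                ok = postings
                    .getOrDefault(words[i] + "$", Collections.<Integer>emptySet())
                    .contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    // Number of distinct terms in the index.
    int termCount() {
        return postings.size();
    }

    public static void main(String[] args) {
        ExactPhraseIndexSketch idx = new ExactPhraseIndexSketch();
        idx.index("running watching");
        System.out.println(idx.matchesTerm("run"));                  // true
        System.out.println(idx.matchesTerm("running"));              // true
        System.out.println(idx.matchesPhrase("run watch"));          // false
        System.out.println(idx.matchesPhrase("running watching"));   // true
        System.out.println(idx.termCount());                         // 4
    }
}
```

Note that two input words produce four indexed terms here, which is why the exact-phrase requirement makes the stemmed index larger rather than smaller.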
It took some effort to put stemming into the index (see PorterStemmer etc.). This is NOT the default, so I took it out to see what the effect would be.

Why not stemming made things smaller: we also have the requirement that phrases (i.e. words in double quotes) do NOT match the stemmed version. Thus, if we index

running watching

the following searches have the indicated results:

run - hits
watch - hits
running - hits
"run watch" - does NOT hit
"running watching" - hits

So I indexed the following terms...

run running$ watch watching$

with the two forms of run indexed in the same position (0) and the two forms of watch in the same position (1).

I agree that if we didn't have the exact-phrase-match requirement, the stemmed version of the index should be smaller....

Sorry for the confusion
Erick

On 3/14/07, jm <[EMAIL PROTECTED]> wrote:
hi Erick,

Well, typically my application will start with some hundreds of indexes... and then grow at a rate of several per day, forever. At some point I know I can do some merging etc. if needed. Size is dependent on the customer, and could be up to 1G per index. That is why I would like to minimize them. I am not worried about search performance.

I don't understand how not stemming can reduce the size of an index... I would think it happens the other way: doesn't stemming make the words shorter? (I don't stem, so I never looked into it.)

thanks

On 3/14/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> Store as little as possible, index as little as possible <G>.....
>
> How big is your index, and how much do you expect it to grow?
> I ask this because it's probably not worth your time to try to
> reduce the index size below some threshold... I found that
> reducing my index from 8G to 4G (through not stemming) gave
> me about a 10% performance improvement, so at some point
> it's just not worth the effort. Also, if you posted the index size,
> it would give folks a chance to say "there's not much you can
> gain by reducing things more". As it is, I don't have a clue
> whether your index is 100M or 100T. The former is in the
> "don't waste your time" class, and the latter is...er...
> different....
>
> I wouldn't bother compressing for 1%....
>
> Question for "the guys" so I can check an assumption....
> Is there any difference between these two?
> Field(Name, Value, Store, index)
> Field(Name, Value, Store, index, Field.TermVector.NO)
>
> Best
> Erick
>
> On 3/14/07, jm <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > I want to make my index as small as possible. I noticed
> > field.setOmitNorms(true); I read on the list that the difference is
> > 1 byte per field per doc, not huge but hey... Is the only effect that
> > the score is different? I hardly care about the score, so that would
> > be OK.
> >
> > And can I add to an index without norms when it has previous docs
> > with norms?
> >
> > Any other way to minimize the size of the index? Most of my fields but
> > one are Field.Store.NO, Field.Index.TOKENIZED and Field.TermVector.NO;
> > one is Field.Store.YES, Field.Index.UN_TOKENIZED and
> > Field.TermVector.NO. I tried compressing that one and the size is
> > reduced by around 1% (it's a small field), but I guess compression
> > means worse performance, so I am not sure about applying that.
> >
> > thanks
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]