On Wednesday 17 August 2005 22:36, Tony Schwartz wrote:
> Has anyone done any work on getting these types of terms, or all terms for
> that matter, into a collection that spills onto disk if necessary to avoid
> this problem?  I'm just wondering if anyone has had any luck without
> crippling the search speed.  This is definitely a problem that has burned
> me in the past.  I am going to start working on and testing a solution to
> this, but was wondering if anyone had already messed with it or had any
> ideas up front?
> 
> Thanks,
> 
> Tony Schwartz
> [EMAIL PROTECTED]
> 
> From: John Wang <[EMAIL PROTECTED]>
> Subject: Re: OutOfMemoryError on addIndexes()
> 
> 
> --------------------------------------------------------------------------------
> 
> In many use cases a date field is indexed.  If the granularity of the date
> value is in milliseconds, the number of unique terms in the index could
> potentially be huge.
> 
> So if this is indeed the case, it is a potential scalability bottleneck in
> Lucene index size.

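The blow-up John describes comes from indexing the raw timestamp as a single
untokenized term per document, roughly like this (hypothetical field name,
Lucene 1.4 Field.Keyword style API):

    // Millisecond precision gives essentially one unique "date" term per
    // document, so the term dictionary grows linearly with the collection.
    doc.add(Field.Keyword("date", String.valueOf(new Date().getTime())));
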
Splitting the date field into century, year in century, month, day, hour,
minutes, seconds, and milliseconds will reduce the total number of indexed
terms to 2300 or so.
That's probably overdoing it a bit, but even a cruder split can help reduce
the number of indexed terms quite drastically.
The downside is that you'll need to adapt searching for dates to your
indexed format.
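
For illustration only, a rough sketch of that kind of split at indexing time
could look like the following (the "date.*" field names are made up, and this
assumes the Field.Keyword style API from Lucene 1.4; adapt as needed):

    import java.util.Calendar;
    import java.util.Date;
    import java.util.GregorianCalendar;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Index one small-vocabulary keyword field per date component instead
    // of a single millisecond-precision term.
    public static void addDateFields(Document doc, Date date) {
        Calendar cal = new GregorianCalendar();
        cal.setTime(date);
        doc.add(Field.Keyword("date.year",   String.valueOf(cal.get(Calendar.YEAR))));
        doc.add(Field.Keyword("date.month",  String.valueOf(cal.get(Calendar.MONTH) + 1)));
        doc.add(Field.Keyword("date.day",    String.valueOf(cal.get(Calendar.DAY_OF_MONTH))));
        doc.add(Field.Keyword("date.hour",   String.valueOf(cal.get(Calendar.HOUR_OF_DAY))));
        doc.add(Field.Keyword("date.minute", String.valueOf(cal.get(Calendar.MINUTE))));
        doc.add(Field.Keyword("date.second", String.valueOf(cal.get(Calendar.SECOND))));
        doc.add(Field.Keyword("date.millis", String.valueOf(cal.get(Calendar.MILLISECOND))));
    }

Searching then becomes a BooleanQuery of TermQuery clauses on the component
fields (or range queries on zero-padded values) instead of a single term or
range query on one date field.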

Regards,
Paul Elschot

> 
> Thanks
> 
> -John
> 
> On 8/12/05, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> > Okay, just for the record, I'm currently on vacation, and I don't have
> > access to any of my indexes at work in order to make a comparison, but the
> > number of unique terms in your index (which I'm 99% sure is what
> > indexEnum.size represents in the code you cited) seems HUGE!!!
> >
> > You haven't given us a lot of details about what your index contains (i.e.
> > the nature of the documents) ... in fact, for the number of terms you cite
> > (811806819), the only info we have is that the index containing that number
> > of terms is 29MB in size -- no idea how many documents are in that index.
> > But if we look at your previous email, you mentioned having another index
> > that causes the same problem, which is 120MB and which you built from 11359
> > files.  If we assume that index has no more than the same number of unique
> > terms indexed (which seems unlikely, but let's give it the benefit of the
> > doubt and assume the added size is all stored fields), and assume that you
> > made one document per file, and that those files are 100% unique from each
> > other and contain no terms in common -- that means that each file
> > contains roughly 71,500 unique terms.
> >
> > that seems like a lot.
> >
> > A quick Google search tells me that the English language contains
> > somewhere between 500,000 and 1,000,000 words -- your index has 800 times
> > that many terms.  Even assuming you index a lot of numerical or date-based
> > data -- that seems like a lot.
> >
> 
> ...
> 
