Has anyone done any work on getting these types of terms (or all terms, for
that matter) into a collection that spills onto disk when necessary, to avoid
this problem?  I'm just wondering if anyone has had any luck with this without
crippling search speed.  This is definitely a problem that has burned me in the
past.  I am going to start working on and testing a solution, but I wanted to
ask whether anyone has already experimented with it or has any ideas up front.
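
Roughly what I have in mind is a term buffer that spills sorted runs to temp
files once it passes some in-memory threshold, along the lines of the sketch
below.  None of this exists in Lucene today -- the class name, threshold, and
file handling are just placeholders to show the shape of the idea:

    import java.io.*;
    import java.util.*;

    public class SpillingTermBuffer {
        private final int maxInMemory;                  // spill threshold (placeholder value)
        private final TreeSet<String> buffer = new TreeSet<String>();
        private final List<File> runs = new ArrayList<File>();

        public SpillingTermBuffer(int maxInMemory) {
            this.maxInMemory = maxInMemory;
        }

        // Add a term; spill the in-memory buffer to a sorted run file once it grows too large.
        public void add(String term) throws IOException {
            buffer.add(term);
            if (buffer.size() >= maxInMemory) {
                spill();
            }
        }

        // Write the current buffer (already sorted by the TreeSet) to a temp file and clear it.
        private void spill() throws IOException {
            File run = File.createTempFile("terms", ".run");
            run.deleteOnExit();
            PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(run)));
            try {
                for (String t : buffer) {
                    out.println(t);
                }
            } finally {
                out.close();
            }
            runs.add(run);
            buffer.clear();
        }

        // A complete version would k-way merge the run files plus the remaining
        // buffer, so every term can be enumerated in sorted order without ever
        // holding the whole set in RAM.
    }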

Thanks,

Tony Schwartz
[EMAIL PROTECTED]





From: John Wang <[EMAIL PROTECTED]>
Subject: Re: OutOfMemoryError on addIndexes()



In many use cases a date field is indexed. If the granularity of the
date value is milliseconds, the number of unique terms in the index can
be huge.

So if this is indeed the case, it is a potential scalability bottleneck
for Lucene index size.
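
One mitigation, just as a sketch, is to index the date at day (or hour)
granularity so that all documents from the same day share a single term.  The
Field.Keyword call below is the Lucene 1.4-era factory for an untokenized,
indexed field (newer releases spell this differently), and the field name is
only an example:

    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class CoarseDateIndexing {
        // Day granularity: at most one unique term per calendar day per field.
        // (SimpleDateFormat is not thread-safe; use one instance per thread in real code.)
        private static final SimpleDateFormat DAY = new SimpleDateFormat("yyyyMMdd");

        public static void addCreatedDate(Document doc, Date created) {
            String dayTerm = DAY.format(created);        // e.g. "20050812"
            doc.add(Field.Keyword("created", dayTerm));  // stored, indexed, untokenized
        }
    }

Range queries over such a field still work, since yyyyMMdd strings sort the
same way the dates themselves do.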

Thanks

-John

On 8/12/05, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> Okay, just for the record, I'm currently on vacation, and I don't have
> access to any of my indexes at work in order to make a comparison, but
> the number of unique terms in your index (which I'm 99% sure is what
> indexEnum.size represents in the code you cited) seems HUGE!!!
>
> You haven't given us a lot of details about what your index contains
> (i.e. the nature of the documents) ... in fact, for the number of terms
> you cite (811806819) the only info we have is that the index containing
> that number of terms is 29MB in size -- no idea how many documents are
> in that index.  But if we look at your previous email, you mentioned
> having another index that causes the same problem, which is 120MB and
> which you built from 11359 files.  If we assume that index has no more
> than the same number of unique terms indexed (which seems unlikely, but
> let's give it the benefit of the doubt and assume the added size is all
> stored fields), and assume that you made one document per file, and
> that those files are 100% unique from each other and contain no terms
> in common -- that means that each file contains roughly 71,500 unique
> terms.
>
> That seems like a lot.
>
> A quick Google search tells me that the English language contains
> somewhere from 500,000 to 1,000,000 words -- your index has 800 times
> that many terms.  Even assuming you index a lot of numerical or
> date-based data -- that seems like a lot.
>

...
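
For anyone who wants to double-check a number like this, the count Chris is
describing can be reproduced by walking the term dictionary directly.  A quick
sketch using the IndexReader/TermEnum API of that era (the index path is a
placeholder argument):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    public class CountUniqueTerms {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(args[0]); // path to the index directory
            TermEnum terms = reader.terms();                // positioned before the first term
            long count = 0;
            while (terms.next()) {                          // visits every unique term once
                count++;
            }
            terms.close();
            reader.close();
            System.out.println("unique terms: " + count);
        }
    }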
