maximum index size

2004-09-08 Thread Chris Fraschetti
I know the index size is very dependent on the content being indexed...

but running on a Unix-based machine w/o a file size limit, best case
scenario... what is the largest number of documents that can be
indexed?

I've seen mentions of millions of documents throughout the list... 8
million, 20 million, etc... but can Lucene potentially handle
billions of documents and still search through them efficiently?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: maximum index size

2004-09-08 Thread Otis Gospodnetic
Given adequate hardware, it can.  Take a look at nutch.org.  Nutch uses
Lucene at its core.

Otis






Re: maximum index size

2004-09-08 Thread Doug Cutting
Chris Fraschetti wrote:
> I've seen throughout the list mentions of millions of documents.. 8
> million, 20 million, etc etc.. but can lucene potentially handle
> billions of documents and still efficiently search through them?

Lucene can currently handle up to 2^31 documents in a single index.  To
a large degree this is limited by Java ints and arrays (which are
accessed by ints).  There are also a few places where the file format
limits things to 2^32.
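
To make the int limit concrete, here is a minimal Java sketch. It is not Lucene code: the class and method names are hypothetical, and it only assumes that document numbers are plain Java ints and that per-document data sits in int-indexed arrays, as described above.

```java
// Illustrative only: DocIdLimit and its methods are hypothetical names,
// not part of Lucene.
class DocIdLimit {
    // Largest value a Java int can hold: 2^31 - 1.
    static int largestInt() {
        return Integer.MAX_VALUE;
    }

    // Conservative check: an int-indexed array can hold at most
    // Integer.MAX_VALUE elements, so one slot per document caps the
    // document count at that value.
    static boolean fitsInSingleIndex(long docCount) {
        return docCount >= 0 && docCount <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(largestInt());                      // 2147483647
        System.out.println(fitsInSingleIndex(20_000_000L));    // true
        System.out.println(fitsInSingleIndex(3_000_000_000L)); // false
    }
}
```

So 20 million documents is far below the ceiling, while 3 billion would have to be split across more than one index.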

On typical PC hardware, 2-3 word searches of an index with 10M
documents, each with around 10k of text, require around 1 second,
including index i/o time.  Search time grows more-or-less linearly with
index size, so that a 100M document index might require nearly 10
seconds per search.  Thus, as indexes grow folks tend to distribute
searches in parallel to many smaller indexes.  That's what Nutch and
Google (http://www.computer.org/micro/mi2003/m2022.pdf) do.
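
A rough Java sketch of both points: the linear estimate and the fan-out to smaller indexes. Everything here is an assumption for illustration. `Shard` is a hypothetical stand-in for one small index, not a Lucene type, and the merge step simply concatenates hits, where a real system would re-rank them by score.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ShardedSearchSketch {
    // Hypothetical stand-in for one small index.
    interface Shard {
        List<String> search(String query) throws Exception;
    }

    // Linear extrapolation from the figure above: about 1 s per 10M documents.
    static double estimatedSeconds(long docCount) {
        return docCount / 10_000_000.0;
    }

    // Send the query to every shard in parallel, then concatenate the hits
    // in shard order.
    static List<String> searchAll(List<Shard> shards, String query) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (Shard shard : shards) {
                tasks.add(() -> shard.search(query));
            }
            List<String> merged = new ArrayList<>();
            // invokeAll returns futures in the same order as the tasks.
            for (Future<List<String>> result : pool.invokeAll(tasks)) {
                merged.addAll(result.get());
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Shard> shards = new ArrayList<>();
        shards.add(q -> Collections.singletonList("shard0:" + q));
        shards.add(q -> Collections.singletonList("shard1:" + q));
        System.out.println(estimatedSeconds(100_000_000L)); // 10.0
        System.out.println(searchAll(shards, "lucene"));
    }
}
```

With ten shards of 10M documents each, every shard would take roughly the 1 second quoted above, and the shards run side by side instead of one 100M index taking about 10 seconds.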

Doug