From my experience, you shouldn't have any problems indexing that amount of content, even into a single index. I've successfully indexed 450 GB of data with Lucene, and I believe it can scale much higher, especially when rich text documents are indexed (the index ends up far smaller than the raw files). Though I haven't tried it yet, I believe it can scale into the 1-5 TB range on a modern CPU and hard disk with enough RAM.
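For reference, this is roughly how I set up the writer for large bulk indexing. It's an untested sketch against the 2.4 API; the index path, field name, buffer size and "extractedText" variable are just placeholders you'd replace with your own:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Sketch only - path and sizes are made up, tune them for your machine.
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(),
    true, IndexWriter.MaxFieldLength.UNLIMITED);
writer.setRAMBufferSizeMB(64);  // flush by RAM usage instead of doc count, keeps memory bounded
writer.setMergeFactor(10);      // the default; higher = faster indexing but more segments/open files

// extractedText = whatever plain text your per-format parser produced
Document doc = new Document();
doc.add(new Field("contents", extractedText, Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc);
// ... add the rest of the documents ...
writer.optimize();  // optional, and expensive on a 100GB-scale index
writer.close();

A reasonable RAM buffer and the default mergeFactor go a long way toward avoiding OOM during indexing.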
Usually, when rich text documents are involved, considerable time is spent converting them into plain text. The extracted text of a rich document (PDF, DOC, HTML) is usually, based on my measurements, 15-20% of its original size, and it is compressed even further when added to Lucene.

I hope this helps. BTW, you can always just try to index that amount of content into one index on your machine and see whether the machine can handle it.

Shai

On Wed, Jul 22, 2009 at 9:07 AM, m.harig <m.ha...@gmail.com> wrote:
>
> hello all
>
>      We've got 100GB of data in doc, txt, pdf, ppt, etc. formats, and we have a
> separate parser for each file format, so we're going to index that data with
> Lucene. (Since we were scared of the Nutch setup, we didn't use it.) My
> doubt is: will it be scalable when I index those documents? We planned to
> build a separate index for each file format and to use a multi index reader
> for searching. Please suggest:
>
> 1. Are we going about it the right way?
> 2. Please advise on mergeFactors & segments.
> 3. How much index size can Lucene handle?
> 4. Will it cause a Java OOM?
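P.S. Regarding the multi index reader question in the quoted message - a minimal sketch of what searching the per-format indexes together can look like (untested, and the index paths are obviously made up):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;

// Open one reader per per-format index and search them as a single logical index.
IndexReader[] readers = new IndexReader[] {
    IndexReader.open("/indexes/pdf"),
    IndexReader.open("/indexes/doc"),
    IndexReader.open("/indexes/txt")
};
IndexSearcher searcher = new IndexSearcher(new MultiReader(readers));
// searcher.search(query, 10) now hits all of the sub-indexes at once.

That said, as I wrote above, a single index should handle this amount of content just fine too.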