From my experience, you shouldn't have any problems indexing that amount of content, even into a single index. I've successfully indexed 450 GB of data with Lucene, and I believe it can scale much higher, especially when rich text documents are indexed (the index ends up far smaller than the raw files). Though I haven't tried it yet, I believe it can scale into the 1-5 TB range on a modern CPU and hard disk with enough RAM.
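For reference, this is roughly how I set up the writer for large bulk indexing. It's an untested sketch against the 2.4 API; the index path, field name, buffer size and "extractedText" variable are just placeholders you'd replace with your own:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Sketch only - path and sizes are made up, tune them for your machine.
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(),
    true, IndexWriter.MaxFieldLength.UNLIMITED);
writer.setRAMBufferSizeMB(64);  // flush by RAM usage instead of doc count, keeps memory bounded
writer.setMergeFactor(10);      // the default; higher = faster indexing but more segments/open files

// extractedText = whatever plain text your per-format parser produced
Document doc = new Document();
doc.add(new Field("contents", extractedText, Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc);
// ... add the rest of the documents ...
writer.optimize();  // optional, and expensive on a 100GB-scale index
writer.close();

A reasonable RAM buffer and the default mergeFactor go a long way toward avoiding OOM during indexing.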
Usually, when rich text documents are involved, considerable time is spent converting them into plain text. The extracted text of a rich document (PDF, DOC, HTML) is usually, based on my measurements, 15-20% of its original size, and it is compressed even further when added to Lucene.

I hope this helps. BTW, you can always just try to index that amount of content into one index on your machine and see whether the machine can handle it.

Shai

On Wed, Jul 22, 2009 at 9:07 AM, m.harig <m.ha...@gmail.com> wrote:
>
> hello all
>
>      We've got 100GB of data in doc, txt, pdf, ppt, etc. formats, and we have a
> separate parser for each file format, so we're going to index that data with
> Lucene. (Since we were scared of the Nutch setup, we didn't use it.) My
> doubt is: will it be scalable when I index those documents? We planned to
> build a separate index for each file format and to use a multi index reader
> for searching. Please suggest:
>
> 1. Are we going about it the right way?
> 2. Please advise on mergeFactors & segments.
> 3. How much index size can Lucene handle?
> 4. Will it cause a Java OOM?
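P.S. Regarding the multi index reader question in the quoted message - a minimal sketch of what searching the per-format indexes together can look like (untested, and the index paths are obviously made up):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;

// Open one reader per per-format index and search them as a single logical index.
IndexReader[] readers = new IndexReader[] {
    IndexReader.open("/indexes/pdf"),
    IndexReader.open("/indexes/doc"),
    IndexReader.open("/indexes/txt")
};
IndexSearcher searcher = new IndexSearcher(new MultiReader(readers));
// searcher.search(query, 10) now hits all of the sub-indexes at once.

That said, as I wrote above, a single index should handle this amount of content just fine too.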