Rossini{}, I think what you might have read might have been that searching a Lucene index that lives in a HDFS would be slow. As far as I understand things, the thing to do is to copy the index to a local disk, out of HDFS, and then search it with Lucene from there.
Otis() ----- Original Message ---- From: Rafael Rossini <[EMAIL PROTECTED]> To: java-user@lucene.apache.org; Otis Gospodnetic <[EMAIL PROTECTED]> Sent: Thursday, July 27, 2006 4:23:56 PM Subject: Re: Indexing large sets of documents? Oits, You mentioned the hadoop project. I check it out not a long time ago and I read someting about it did not support the lucene index. Is it possible to index and then search in a HDFS? []s Rossini On 7/27/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > > Michael, > > Certainly parallelizing on a set of servers would work (hmm... hadoop?), > but if you want to do this on a single machine you should tune some of the > IndexWriter params. You didn't mention them, so I assume you didn't tune > anything yet. If you have Lucene in Action, check out > 2.7.1 : Tuning indexing performance starts on page 42 under > section 2.7 (Controlling the indexing process) in chapter 2 (Indexing) > (found via: http://lucenebook.com/search?query=index+tuning ) > > If not, check maxBufferedDocs and mergeFactor in IndexWriter > javadocs. This is likely in the FAQ, too, but I didn't check. > > Otis > > ----- Original Message ---- > From: Michael J. Prichard > To: java-user@lucene.apache.org > Sent: Thursday, July 27, 2006 12:29:31 PM > Subject: Indexing large sets of documents? > > I built an indexer that runs through email and its attachments, rips out > content and what not and then creates a Document and adds it to an > index. It works w/ no problem. The issue is that it takes around 3-5 > seconds per email and I have seen up to 10-15 seconds for email w/ > attachments. I need to index 750k emails and at those times it will > take FOREVER! I am trying to find places to cut a second or two here or > there but are there any suggestions as to what I can do? Should I look > into parallelizing indexing? Help?! > > Thanks, > Michael > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]