Michael, Certainly parallelizing on a set of servers would work (hmm... hadoop?), but if you want to do this on a single machine you should tune some of the IndexWriter params. You didn't mention them, so I assume you didn't tune anything yet. If you have Lucene in Action, check out 2.7.1 : Tuning indexing performance starts on page 42 under section 2.7 (Controlling the indexing process) in chapter 2 (Indexing) (found via: http://lucenebook.com/search?query=index+tuning )
If not, check maxBufferedDocs and mergeFactor in IndexWriter javadocs. This is likely in the FAQ, too, but I didn't check. Otis ----- Original Message ---- From: Michael J. Prichard To: java-user@lucene.apache.org Sent: Thursday, July 27, 2006 12:29:31 PM Subject: Indexing large sets of documents? I built an indexer that runs through email and its attachments, rips out content and what not and then creates a Document and adds it to an index. It works w/ no problem. The issue is that it takes around 3-5 seconds per email and I have seen up to 10-15 seconds for email w/ attachments. I need to index 750k emails and at those times it will take FOREVER! I am trying to find places to cut a second or two here or there but are there any suggestions as to what I can do? Should I look into parallelizing indexing? Help?! Thanks, Michael --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]