Yes, parallelizing works great. We built a shared-nothing, JavaSpaces-based system at X1, and on an 11-way cluster we were able to index 350 office documents per second, including the binary-to-text conversion using the Stellent INSO libraries. The trick is to have each worker build its own separate index and then, if you do not have a federated search setup, merge the indexes into one big index after they are all completed. A rough sketch of the pattern is below.
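This is just a minimal sketch against a recent Lucene API (the era-appropriate API spelled the merge call addIndexes(Directory[]), but the idea is the same): each worker thread fills its own index directory, and the partial indexes are merged with IndexWriter.addIndexes at the end. fetchBatchForWorker is a hypothetical stand-in for your email/attachment extraction step.

import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        int workers = 4;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Directory> partials = new ArrayList<>();
        List<Future<?>> jobs = new ArrayList<>();

        // Each worker builds its own independent index -- no shared state.
        for (int i = 0; i < workers; i++) {
            Directory dir = FSDirectory.open(Paths.get("index-part-" + i));
            partials.add(dir);
            final int id = i;
            jobs.add(pool.submit(() -> {
                try (IndexWriter w = new IndexWriter(dir,
                        new IndexWriterConfig(new StandardAnalyzer()))) {
                    for (String text : fetchBatchForWorker(id)) {
                        Document doc = new Document();
                        doc.add(new TextField("body", text, Field.Store.NO));
                        w.addDocument(doc);
                    }
                }
                return null;
            }));
        }
        for (Future<?> job : jobs) job.get(); // wait for all workers, surface failures
        pool.shutdown();

        // Merge the completed partial indexes into one big searchable index.
        try (Directory merged = FSDirectory.open(Paths.get("index-merged"));
             IndexWriter w = new IndexWriter(merged,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            w.addIndexes(partials.toArray(new Directory[0]));
        }
    }

    // Hypothetical stand-in for the email/attachment extraction step,
    // returning this worker's share of the extracted document text.
    static List<String> fetchBatchForWorker(int id) {
        return new ArrayList<>();
    }
}

On a single machine you can run the workers as threads like this; across a cluster, each node does the same thing locally and you copy the finished partial indexes to one node for the merge.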
Dejan

-----Original Message-----
From: Michael J. Prichard [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 27, 2006 9:30 AM
To: java-user@lucene.apache.org
Subject: Indexing large sets of documents?

I built an indexer that runs through email and its attachments, extracts the content, and then creates a Document and adds it to an index. It works with no problem. The issue is that it takes around 3-5 seconds per email, and I have seen up to 10-15 seconds for email with attachments. I need to index 750k emails, and at those rates it will take FOREVER! I am trying to find places to cut a second or two here or there, but are there any other suggestions as to what I can do? Should I look into parallelizing indexing? Help?!

Thanks,
Michael