Yes, parallelizing works great. We built a shared-nothing, JavaSpaces-based
system at X1, and on an 11-way cluster we were able to index 350 office
documents per second, including the binary-to-text conversion using the
Stellent INSO libraries. The trick is to create separate indexes and, if you
do not have a federated search setup, merge them into one big index after
they are completed.
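
For the merge step, Lucene's IndexWriter.addIndexes() folds several complete
indexes into one. A rough sketch against the current 1.9/2.0 API (the
directory paths are made up, and in practice you would loop over however many
partial indexes your workers produced):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Target index that will hold the merged result (created fresh).
IndexWriter writer = new IndexWriter(
    FSDirectory.getDirectory("/indexes/merged", true),
    new StandardAnalyzer(),
    true);

// Partial indexes that the worker nodes built independently.
Directory[] parts = new Directory[] {
    FSDirectory.getDirectory("/indexes/part1", false),
    FSDirectory.getDirectory("/indexes/part2", false),
    FSDirectory.getDirectory("/indexes/part3", false)
};

// addIndexes() copies the segments of each partial index into the
// target and merges them; optimize() then collapses the result into
// a single segment for faster searching.
writer.addIndexes(parts);
writer.optimize();
writer.close();

Note the merge is single-threaded, so it becomes the serial tail of the job;
if you can live with a federated (multi-index) search instead, you skip this
step entirely.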

Dejan

-----Original Message-----
From: Michael J. Prichard [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 27, 2006 9:30 AM
To: java-user@lucene.apache.org
Subject: Indexing large sets of documents?

I built an indexer that runs through email and its attachments, rips out 
content and whatnot, and then creates a Document and adds it to an 
index.  It works w/ no problem.  The issue is that it takes around 3-5 
seconds per email, and I have seen up to 10-15 seconds for email w/ 
attachments.  I need to index 750k emails, and at those rates it will 
take FOREVER!  I am trying to find places to cut a second or two here 
and there, but are there any suggestions as to what else I can do?  
Should I look into parallelizing the indexing?  Help?!

Thanks,
Michael
