Hi,

I have seen the post at http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and I am implementing a similar application in a distributed environment: a small cluster of only 5 nodes. The operating system is Linux (CentOS), so I am also using NFS to access the home directory where the documents to be indexed reside. I would like to know how long an application should take to index a large amount of documents, such as 10 GB.

My setup: Lucene 2.2.0; dual Xeon 2.4 GHz CPUs and 512 MB of RAM in every node; 1 Gbit/s LAN.
The problem is that my application takes a very long time to index all the documents: indexing 10 GB of PDF documents takes about 2 days (I use PDFBox to convert PDF to text). That is of course far too long; other Lucene-based applications, for instance IBM OmniFind, take only 5 hours to index the same amount of PDF documents. I would like to find out why my application is so slow, and any help is welcome.

Do you know of other distributed applications that use Lucene to index large amounts of documents? How long do they take to index?

I hope you can help me.

Greetings
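
P.S. To put the two figures side by side, here is a rough back-of-the-envelope calculation in Java. It is only a sketch: it assumes 1 GB = 1024 MB and that "about 2 days" means 48 hours, and the class and method names are made up for illustration.

```java
public class Throughput {
    // Average throughput in MB/s, assuming 1 GB = 1024 MB (an assumption).
    public static double mbPerSecond(double gigabytes, double hours) {
        return (gigabytes * 1024.0) / (hours * 3600.0);
    }

    public static void main(String[] args) {
        double mine = mbPerSecond(10, 48);     // my application: 10 GB in about 2 days
        double omnifind = mbPerSecond(10, 5);  // figure quoted above for OmniFind: 10 GB in 5 hours
        System.out.printf("mine: %.3f MB/s%n", mine);
        System.out.printf("omnifind: %.3f MB/s%n", omnifind);
        System.out.printf("ratio: %.1fx%n", omnifind / mine);
    }
}
```

Under those assumptions my application averages roughly 0.06 MB/s against roughly 0.57 MB/s for the other figure, a factor of about 9.6.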