Ariel,

I believe PDFBox is not the fastest thing and was built more to handle all possible PDFs than for speed (just my impression; Ben, PDFBox's author, might still be on this list and might comment). Pulling data from NFS to index seems like a bad idea. I hope at least the indices are local and not on a remote NFS mount...
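
A rough way to check whether raw NFS reads are part of the problem is to time a plain read of the document directory from the NFS mount and from a local copy, with no PDFBox or Lucene involved. Something like the sketch below would do; it uses only the standard Java library, and the class name, method names, and buffer size are made up for illustration, not anything from Lucene or PDFBox.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: measures how fast a directory tree can be read.
class ReadThroughput {

    // Reads every regular file under dir and returns the total bytes read.
    static long readAll(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        byte[] buf = new byte[64 * 1024]; // arbitrary buffer size
        long total = 0;
        for (Path f : files) {
            try (InputStream in = Files.newInputStream(f)) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    total += n;
                }
            }
        }
        return total;
    }

    // Converts a byte count and elapsed nanoseconds to megabytes per second.
    static double mbPerSecond(long bytes, long elapsedNanos) {
        double seconds = elapsedNanos / 1_000_000_000.0;
        return (bytes / (1024.0 * 1024.0)) / seconds;
    }

    // Usage: java ReadThroughput <nfs-dir> <local-copy-dir>
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path dir = Paths.get(arg);
            long start = System.nanoTime();
            long bytes = readAll(dir);
            long elapsed = System.nanoTime() - start;
            System.out.printf("%s: %d bytes, %.1f MB/s%n",
                    dir, bytes, mbPerSecond(bytes, elapsed));
        }
    }
}
```

Run it against a directory that has not been read recently, otherwise the OS file cache will make the second pass look faster than the disk or network really is. If the two numbers are close, the time is more likely going into PDF-to-text conversion or into the index writes than into reading the source files.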

We benchmarked local disk vs. NFS vs. an FC SAN (I don't recall which one), and indexing over NFS was slooooooow.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Ariel <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, January 9, 2008 2:50:41 PM
Subject: Why is lucene so slow indexing in nfs file system ?

Hi:

I have seen the post at http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and I am implementing a similar application in a distributed environment, a cluster of only 5 nodes. The operating system I use is Linux (CentOS), so I am also using an NFS file system to access the home directory where the documents to be indexed reside. I would like to know how long an application should take to index a large amount of documents, such as 10 GB.

I use Lucene version 2.2.0. CPU: dual Xeon 2.4 GHz, 512 MB in every node; LAN: 1 Gbit/s.

The problem I have is that my application spends a lot of time indexing all the documents: indexing 10 GB of PDF documents takes about 2 days (I am using PDFBox to convert PDF to text). That is of course a lot of time; other applications based on Lucene, for instance IBM OmniFind, take only 5 hours to index the same amount of PDF documents.

I would like to find out why my application takes so long to index; any help is welcome.

Do you know of other distributed-architecture applications that use Lucene to index large amounts of documents? How long do they take to index?

I hope you can help me.

Greetings