Ariel,

I believe PDFBox is not the fastest thing and was built more to handle all 
possible PDFs than for speed (just my impression - Ben, PDFBox's author, might 
still be on this list and might comment).  Pulling data from NFS to index seems 
like a bad idea.  I hope at least the indices are local and not on a remote 
NFS...

We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one) and 
indexing over NFS was slooooooow.
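
One way to check whether NFS itself is the bottleneck is to time a plain read
of the document directory, with no PDF parsing and no indexing, once on the
NFS mount and once on a local copy. Below is a minimal sketch using only the
JDK; the class name ReadThroughput and its methods are made up for
illustration and are not part of Lucene or PDFBox. For scale, the 10 GB in
2 days quoted below works out to roughly 0.06 MB/s, and 10 GB in 5 hours to
roughly 0.57 MB/s.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Lucene or PDFBox: reads every regular file
// under a directory and reports raw read throughput (no parsing, no indexing).
class ReadThroughput {

    /** Reads all regular files under root and returns the total bytes read. */
    static long readAll(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        for (Path file : files) {
            try (InputStream in = Files.newInputStream(file)) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    total += n;
                }
            }
        }
        return total;
    }

    /** Converts a byte count and an elapsed time in nanoseconds to MB/s. */
    static double megabytesPerSecond(long bytes, long nanos) {
        if (nanos <= 0) {
            return 0.0;
        }
        return (bytes / (1024.0 * 1024.0)) / (nanos / 1_000_000_000.0);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ReadThroughput <directory>");
            return;
        }
        Path root = Paths.get(args[0]);
        long start = System.nanoTime();
        long bytes = readAll(root);
        long elapsed = System.nanoTime() - start;
        System.out.printf("%d bytes in %.1f s = %.2f MB/s%n",
                bytes, elapsed / 1e9, megabytesPerSecond(bytes, elapsed));
    }
}
```

Run it against both locations and compare. Repeated runs may be served from
the OS page cache, so the first run after a remount or reboot is the more
honest number. If the NFS read rate comes out far above the overall indexing
rate, the time is probably going elsewhere (PDF text extraction, for example).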

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Ariel <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, January 9, 2008 2:50:41 PM
Subject: Why is lucene so slow indexing in nfs file system ?

Hi:
I have seen the post at
http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
and I am implementing a similar application in a distributed environment: a
cluster of only 5 nodes. The operating system is Linux (CentOS), so I am also
using an NFS file system to access the home directory where the documents to
be indexed reside. I would like to know how long an application should take
to index a large amount of documents, around 10 GB.
I use Lucene version 2.2.0. Hardware: dual Xeon 2.4 GHz CPU, 512 MB in every
node; LAN: 1 Gbit/s.

The problem is that my application takes a long time to index all the
documents: indexing 10 GB of PDF documents takes about 2 days (I am using
PDFBox to convert PDF to text). That is of course a lot of time; other
Lucene-based applications, for instance IBM OmniFind, take only 5 hours to
index the same amount of PDF documents. I would like to find out why my
application is so slow at indexing; any help is welcome.
Do you know of other distributed applications that use Lucene to index large
amounts of documents? How long do they take to index?
I hope you can help me.
Greetings



