In a distributed environment the application has to make heavy use of the network, and there is no way to access the documents in the remote repository other than through the NFS file system.

One thing I must clarify: I index the documents in memory, using RAMDirectory. When the RAMDirectory reaches the limit (I have set it to about 10 MB), I write the index to disk (NFS) and merge it into the central index (the central index is also on the NFS file system). Is that correct?

I hope you can help me. I have taken into consideration the suggestions you made before, and I am going to try a few things to test them.

Ariel
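
For readers following the thread, the buffer-then-merge flow described above can be sketched roughly as follows. This is plain Java with no Lucene dependency, so it is only an illustration of the pattern and not the poster's actual code: the in-memory list stands in for the RAMDirectory, the hypothetical `Sink` callback stands in for the merge into the central index (in Lucene that step would typically go through `IndexWriter.addIndexes(Directory[])`), and the class name and byte-counting rule are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of "buffer in memory, flush when a size limit is reached".
 * The buffer stands in for a RAMDirectory; the Sink stands in for the
 * merge into the central index. All names here are hypothetical.
 */
final class BatchedFlushSketch {

    /** Receives one full batch at a time (the "merge into central index" step). */
    interface Sink {
        void accept(List<String> batch);
    }

    private final long limitBytes;
    private final Sink sink;
    private final List<String> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int flushCount = 0;

    BatchedFlushSketch(long limitBytes, Sink sink) {
        this.limitBytes = limitBytes;
        this.sink = sink;
    }

    /** Adds one document's text; flushes once the buffered size reaches the limit. */
    void add(String text) {
        buffer.add(text);
        bufferedBytes += text.getBytes(StandardCharsets.UTF_8).length;
        if (bufferedBytes >= limitBytes) {
            flush();
        }
    }

    /** Hands the current batch to the sink and resets the buffer. No-op when empty. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
        bufferedBytes = 0;
        flushCount++;
    }

    /** Number of batches handed to the sink so far. */
    int flushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        // A tiny 10-byte limit so the flush is visible; the post uses about 10 MB.
        BatchedFlushSketch b = new BatchedFlushSketch(10, batch -> batchSizes.add(batch.size()));
        for (String doc : new String[] {"aaaa", "bbbb", "cccc", "dddd"}) {
            b.add(doc);
        }
        b.flush(); // flush whatever is left at the end of the run
        System.out.println(batchSizes);
    }
}
```

The point of the pattern is that the slow destination is touched once per batch instead of once per document. A final `flush()` at the end of the run is needed so that a partially filled buffer is not lost.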
On Jan 10, 2008 8:45 AM, Ariel <[EMAIL PROTECTED]> wrote:

> Thanks to all of you for your answers. I am going to change a few things in
> my application and run some tests.
> One thing: I haven't found another good PDF-to-text converter like PDFBox.
> Do you know of a faster one?
> Greetings, and thanks for your answers
> Ariel
>
> On Jan 9, 2008 11:08 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> > Ariel,
> >
> > I believe PDFBox is not the fastest thing and was built more to handle
> > all possible PDFs than for speed (just my impression - Ben, PDFBox's
> > author, might still be on this list and might comment). Pulling data
> > from NFS to index seems like a bad idea. I hope at least the indices are
> > local and not on a remote NFS...
> >
> > We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one)
> > and indexing over NFS was slooooooow.
> >
> > Otis
> >
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > From: Ariel <[EMAIL PROTECTED]>
> > To: java-user@lucene.apache.org
> > Sent: Wednesday, January 9, 2008 2:50:41 PM
> > Subject: Why is lucene so slow indexing in nfs file system ?
> >
> > Hi:
> > I have seen the post in
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
> > and I am implementing a similar application in a distributed environment,
> > a cluster of only 5 nodes. The operating system I use is Linux (CentOS),
> > so I am also using the NFS file system to access the home directory where
> > the documents to be indexed reside. I would like to know how much time an
> > application should spend indexing a large amount of documents, like 10 GB.
> > I use Lucene version 2.2.0; CPU: dual Xeon 2.4 GHz, 512 MB in every node;
> > LAN: 1 Gbit/s.
> >
> > The problem I have is that my application spends a lot of time indexing
> > all the documents: indexing 10 GB of PDF documents takes about 2 days (to
> > convert PDF to text I am using PDFBox). That is of course a lot of time;
> > other applications based on Lucene, for instance IBM OmniFind, take only
> > 5 hours to index the same amount of PDF documents. I would like to find
> > out why my application takes so long to index; any help is welcome.
> > Do you know of other distributed-architecture applications that use
> > Lucene to index large amounts of documents? How long do they take to index?
> > I hope you can help me
> > Greetings