Ariel, Comments inline.
----- Original Message ----
From: Ariel <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, January 10, 2008 10:05:28 AM
Subject: Re: Why is lucene so slow indexing in nfs file system ?

In a distributed environment the application should make extensive use of the network, and there is no other way to access the documents in a remote repository than through the NFS file system.

OG: What about a SAN connected over FC, for example?

One thing I must clarify: I index the documents in memory, using RAMDirectory. When the RAMDirectory reaches its limit (I have set it to about 10 MB), I serialize the index to disk (NFS) and merge it with the central index (the central index is on the NFS file system). Is that correct?

OG: Nah, don't bother with RAMDirectory, just use FSDirectory and it will do the in-memory thing for you. Make good use of your RAM and use 2.3, which gives you more control over RAM use during indexing. Parallelizing indexing over multiple machines and merging at the end is faster, so that's a good approach. Also, if your boxes have multiple CPUs, write your code so that it has multiple worker threads that do indexing and feed docs to IndexWriter.addDocument(Document) to keep the CPUs fully utilized.

OG: Oh, something faster than PDFBox? There is (can't remember the name now... itextstream or something like that?), though it may not be free like PDFBox.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

On Jan 10, 2008 8:45 AM, Ariel <[EMAIL PROTECTED]> wrote:
> Thanks to all of you for your answers. I am going to change a few things in my
> application and run some tests.
> One thing: I haven't found another good PDF-to-text converter like PDFBox. Do
> you know of a faster one?
> Greetings
> Thanks for your answers
> Ariel
>
> On Jan 9, 2008 11:08 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> > Ariel,
> >
> > I believe PDFBox is not the fastest thing and was built more to handle
> > all possible PDFs than for speed (just my impression - Ben, PDFBox's author
> > might still be on this list and might comment). Pulling data from NFS to
> > index seems like a bad idea. I hope at least the indices are local and not
> > on a remote NFS...
> >
> > We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one)
> > and indexing over NFS was slooooooow.
> >
> > Otis
> >
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > From: Ariel <[EMAIL PROTECTED]>
> > To: java-user@lucene.apache.org
> > Sent: Wednesday, January 9, 2008 2:50:41 PM
> > Subject: Why is lucene so slow indexing in nfs file system ?
> >
> > Hi:
> > I have seen the post at
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
> > and I am implementing a similar application in a distributed environment, a
> > cluster of only 5 nodes. The operating system I use is Linux (CentOS),
> > so I am also using the NFS file system to access the home directory where
> > the documents to be indexed reside. I would like to know how much time an
> > application should take to index a large amount of documents, like 10 GB.
> > I use Lucene version 2.2.0, with a dual Xeon 2.4 GHz CPU and 512 MB in
> > every node, LAN: 1 Gbit/s.
> >
> > The problem I have is that my application takes a long time to index all
> > the documents: indexing 10 GB of PDF documents takes about 2 days (I am
> > using PDFBox to convert PDF to text). That is of course a lot of time;
> > other applications based on Lucene, for instance IBM OmniFind, take only 5
> > hours to index the same amount of PDF documents.
> > I would like to find out
> > why my application takes so long to index; any help is welcome.
> > Do you know of other distributed applications that use
> > Lucene to index large amounts of documents? How long do they take to index?
> > I hope you can help me.
> > Greetings
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
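
For what it's worth, here is a rough sketch of the "multiple worker threads feeding one writer" idea from the reply in this thread. It is only an illustration under some assumptions: the class name ParallelIndexer and the DocSink interface are made up for this example and are not part of Lucene. DocSink stands in for whatever does the per-document work; in a real application it would presumably run the PDF-to-text conversion and then call IndexWriter.addDocument(Document) on a single shared IndexWriter, which can be called from several threads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelIndexer {

    /**
     * Hypothetical stand-in for the shared writer (not a Lucene type).
     * With Lucene this would wrap text extraction plus
     * IndexWriter.addDocument(Document).
     */
    public interface DocSink {
        void add(String text) throws Exception;
    }

    /** Feeds every item to the sink from a fixed pool of worker threads. */
    public static int indexAll(List<String> texts, int workers, DocSink sink) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger added = new AtomicInteger();
        for (String text : texts) {
            pool.execute(() -> {
                try {
                    sink.add(text);            // the CPU-heavy work runs here
                    added.incrementAndGet();
                } catch (Exception e) {
                    // One bad document should not stop the whole run.
                    System.err.println("skipped document: " + e);
                }
            });
        }
        pool.shutdown();                       // accept no new tasks, finish queued ones
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return added.get();                    // number of documents the sink accepted
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("first doc", "second doc", "third doc");
        // Placeholder sink; a real one would build a Document and add it to the index.
        int n = indexAll(docs, 2, text -> { });
        System.out.println("indexed " + n + " documents");
    }
}
```

A reasonable starting point for the number of workers is probably one per CPU core, tuned from there, since the PDF conversion is likely where most of the time goes.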