This seems really clunky, especially if your merge step also optimizes. There's not much point in indexing into RAM and then merging explicitly. Just use an FSDirectory rather than a RAMDirectory. There is *already* buffering built into FSDirectory, and your merge factor etc. control how much RAM is used before flushing to disk. There's considerable discussion of this on the Wiki I believe, but in the mail archive for sure. And I believe there's a RAM-usage-based flushing policy somewhere.
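For what it's worth, here's a rough, untested sketch of the simpler setup against the 2.x API (the index path is just a placeholder; note that setRAMBufferSizeMB only showed up in 2.3, so on 2.2.0 you'd tune setMaxBufferedDocs instead):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class LocalDiskIndexer {
    public static void main(String[] args) throws Exception {
        // FSDirectory already buffers in RAM and flushes on its own;
        // no explicit RAMDirectory/merge dance needed.
        FSDirectory dir = FSDirectory.getDirectory("/local/disk/index"); // placeholder path
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        writer.setMergeFactor(10);        // segments accumulated per merge level
        writer.setMaxBufferedDocs(1000);  // docs held in RAM before a flush (2.2)
        // writer.setRAMBufferSizeMB(32.0); // 2.3+: flush by RAM usage instead

        // ... call writer.addDocument(doc) for each document here ...

        writer.optimize(); // optional; do it once, at the very end
        writer.close();
    }
}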
You're adding complexity where it's probably not necessary. Did you adopt this scheme because you *thought* it would be faster, or because you were addressing a *known* problem? Don't *ever* write complex code to support a theoretical case unless you have considerable certainty that it really is a problem. "It would be faster" is a weak argument when you don't know whether you're talking about saving 1% or 95%; the added maintenance is just not worth it. There's a famous quote about that from Donald Knuth (paraphrasing Hoare): "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." It's true. So the very *first* measurement I'd take is to get rid of the in-RAM stuff and just write the index to local disk. I suspect you'll be *far* better off doing this and then just copying your index to the NFS mount.
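Once the writer is closed, that copy step is nothing Lucene-specific, just a plain file copy of the finished segment files. A rough, untested sketch (both paths are placeholders):

import java.io.*;

public class CopyIndexToNfs {
    public static void main(String[] args) throws IOException {
        File local = new File("/local/disk/index");       // placeholder
        File nfs   = new File("/mnt/nfs/central-index");  // placeholder
        nfs.mkdirs();

        File[] files = local.listFiles();
        if (files == null) throw new IOException("no index found at " + local);
        for (File f : files) {
            copyFile(f, new File(nfs, f.getName()));
        }
    }

    private static void copyFile(File src, File dst) throws IOException {
        InputStream in = new FileInputStream(src);
        OutputStream out = new FileOutputStream(dst);
        try {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            in.close();
            out.close();
        }
    }
}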
Best
Erick

On Jan 10, 2008 10:05 AM, Ariel <[EMAIL PROTECTED]> wrote:

> In a distributed environment the application should make exhaustive use of
> the network, and there is no other way to access the documents in a remote
> repository but through the NFS file system.
> One thing I must clarify: I index the documents in memory; I use
> RAMDirectory to do that, and when the RAMDirectory reaches its limit (I
> have set it to about 10 MB), I serialize the index to disk (NFS) to merge
> it with the central index (the central index is on the NFS file system).
> Is that correct?
> I hope you can help me.
> I have taken into consideration the suggestions you made before; I am
> going to do some things to test them.
> Ariel
>
> On Jan 10, 2008 8:45 AM, Ariel <[EMAIL PROTECTED]> wrote:
>
> > Thanks to all of you for your answers; I am going to change a few
> > things in my application and run tests.
> > One thing: I haven't found another good PDF-to-text converter like
> > PDFBox. Do you know any other, faster one?
> > Greetings
> > Thanks for your answers
> > Ariel
> >
> > On Jan 9, 2008 11:08 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> >
> > > Ariel,
> > >
> > > I believe PDFBox is not the fastest thing and was built more to
> > > handle all possible PDFs than for speed (just my impression; Ben,
> > > PDFBox's author, might still be on this list and might comment).
> > > Pulling data from NFS to index seems like a bad idea. I hope at least
> > > the indices are local and not on a remote NFS...
> > >
> > > We benchmarked local disk vs. NFS vs. an FC SAN (don't recall which
> > > one) and indexing over NFS was slooooooow.
> > >
> > > Otis
> > >
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
> > > ----- Original Message ----
> > > From: Ariel <[EMAIL PROTECTED]>
> > > To: java-user@lucene.apache.org
> > > Sent: Wednesday, January 9, 2008 2:50:41 PM
> > > Subject: Why is lucene so slow indexing in nfs file system ?
> > >
> > > Hi:
> > > I have seen the post at
> > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
> > > and I am implementing a similar application in a distributed
> > > environment, a cluster of only 5 nodes. The operating system I use is
> > > Linux (CentOS), so I am using the NFS file system too to access the
> > > home directory where the documents to be indexed reside, and I would
> > > like to know how much time an application spends to index a big
> > > amount of documents, say 10 GB.
> > > I use Lucene version 2.2.0; CPU: dual Xeon 2.4 GHz with 512 MB in
> > > every node; LAN: 1 Gbit/s.
> > >
> > > The problem I have is that my application spends a lot of time to
> > > index all the documents: the delay to index 10 GB of PDF documents is
> > > about 2 days (to convert PDF to text I am using PDFBox), which is of
> > > course a lot of time; other applications based on Lucene, for
> > > instance IBM OmniFind, take only 5 hours to index the same amount of
> > > PDF documents. I would like to find out why my application has this
> > > big delay to index; any help is welcome.
> > > Do you know of other distributed applications that use Lucene to
> > > index big amounts of documents? How long do they take to index?
> > > I hope you can help me.
> > > Greetings