I am indexing into RAM and then merging explicitly because my application demands it: I have designed it as a distributed environment, so many threads or workers on different machines index into RAM and serialize to disk, and another thread on another machine accesses the segment index to merge it with the principal one. That is faster than if I had just one thread indexing the documents, isn't it?
Your suggestions are very useful. I hope you can help me.
Greetings
Ariel
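A minimal sketch of the RAM-then-merge scheme described above, assuming the Lucene 2.2 API used elsewhere in this thread; the paths, analyzer choice, and threshold handling are illustrative, not taken from Ariel's actual code:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamThenMergeIndexer {
        public static void main(String[] args) throws Exception {
            // Worker side: build a small segment index in RAM.
            RAMDirectory ramDir = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(new Field("body", "text extracted from a PDF ...",
                    Field.Store.NO, Field.Index.TOKENIZED));
            ramWriter.addDocument(doc);
            // In the real application, documents are added until the RAM index
            // reaches roughly 10 MB (e.g. by checking ramWriter.ramSizeInBytes()),
            // then the segment is shipped to the shared NFS mount.
            ramWriter.close();

            // Merger side (another machine): fold the segment into the
            // principal index, which lives on NFS.
            Directory centralDir = FSDirectory.getDirectory("/nfs/central-index"); // hypothetical path
            IndexWriter central = new IndexWriter(centralDir, new StandardAnalyzer(), false);
            central.addIndexes(new Directory[] { ramDir }); // the explicit merge step
            central.close();
        }
    }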
On Jan 10, 2008 10:21 AM, Erick Erickson <[EMAIL PROTECTED]> wrote:

> This seems really clunky, especially if your merge step also optimizes.
>
> There's not much point in indexing into RAM and then merging explicitly.
> Just use an FSDirectory rather than a RAMDirectory. There is *already*
> buffering built into FSDirectory, and your merge factor etc. control
> how much RAM is used before flushing to disk. There's considerable
> discussion of this on the Wiki I believe, but in the mail archive for
> sure. And I believe there's a RAM-usage-based flushing policy somewhere.
>
> You're adding complexity where it's probably not necessary. Did you
> adopt this scheme because you *thought* it would be faster or because
> you were addressing a *known* problem? Don't *ever* write complex code
> to support a theoretical case unless you have considerable certainty
> that it really is a problem. "It would be faster" is a weak argument when
> you don't know whether you're talking about saving 1% or 95%. The
> added maintenance is just not worth it.
>
> There's a famous quote about that from Donald Knuth
> (paraphrasing Hoare): "We should forget about small efficiencies,
> say about 97% of the time: premature optimization is the root of
> all evil." It's true.
>
> So the very *first* measurement I'd take is to get rid of the in-RAM
> stuff and just write the index to local disk. I suspect you'll be *far*
> better off doing this and then just copying your index to the NFS mount.
>
> Best
> Erick
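A minimal sketch of the simpler setup Erick recommends: a single IndexWriter over an FSDirectory, letting the writer's own buffering decide when to flush. Again this assumes the Lucene 2.2 API; the path and tuning values are illustrative:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class LocalDiskIndexer {
        public static void main(String[] args) throws Exception {
            // Index straight to local disk; copy to the NFS mount afterwards.
            IndexWriter writer = new IndexWriter(
                    FSDirectory.getDirectory("/local/index"), // hypothetical local path
                    new StandardAnalyzer(),
                    true); // create a new index

            // Let Lucene buffer in RAM and flush on its own schedule instead
            // of managing a separate RAMDirectory by hand.
            writer.setMaxBufferedDocs(1000); // docs buffered before a flush (illustrative value)
            writer.setMergeFactor(10);       // controls how aggressively segments are merged

            // ... writer.addDocument(...) for each converted document ...

            writer.optimize(); // optional final merge into a single segment
            writer.close();
        }
    }

If I remember correctly, the RAM-usage-based flushing policy Erick mentions is IndexWriter.setRAMBufferSizeMB(), added in Lucene 2.3.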
> On Jan 10, 2008 10:05 AM, Ariel <[EMAIL PROTECTED]> wrote:
>
> > In a distributed environment the application has to make exhaustive use
> > of the network, and there is no way to access the documents in a remote
> > repository other than through the NFS file system.
> > One thing I must clarify: I index the documents in memory, using a
> > RAMDirectory; then, when the RAMDirectory reaches its limit (I have set
> > it to about 10 MB), I serialize the index to disk (NFS) to merge it
> > with the central index (the central index is on the NFS file system).
> > Is that correct?
> > I hope you can help me.
> > I have taken into consideration the suggestions you made me before, and
> > I am going to do some things to test them.
> > Ariel
> >
> > On Jan 10, 2008 8:45 AM, Ariel <[EMAIL PROTECTED]> wrote:
> >
> > > Thanks to all of you for your answers; I am going to change a few
> > > things in my application and run tests.
> > > One thing: I haven't found another PDF-to-text converter as good as
> > > PDFBox. Do you know any faster one?
> > > Greetings, and thanks for your answers
> > > Ariel
> > >
> > > On Jan 9, 2008 11:08 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > Ariel,
> > > >
> > > > I believe PDFBox is not the fastest thing and was built more to
> > > > handle all possible PDFs than for speed (just my impression - Ben,
> > > > PDFBox's author, might still be on this list and might comment).
> > > > Pulling data from NFS to index seems like a bad idea. I hope at
> > > > least the indices are local and not on a remote NFS...
> > > >
> > > > We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which
> > > > one), and indexing over NFS was slooooooow.
> > > >
> > > > Otis
> > > >
> > > > --
> > > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > > >
> > > > ----- Original Message ----
> > > > From: Ariel <[EMAIL PROTECTED]>
> > > > To: java-user@lucene.apache.org
> > > > Sent: Wednesday, January 9, 2008 2:50:41 PM
> > > > Subject: Why is lucene so slow indexing in nfs file system ?
> > > >
> > > > Hi:
> > > > I have seen the post at
> > > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
> > > > and I am implementing a similar application in a distributed
> > > > environment, a cluster of only 5 nodes. The operating system I use
> > > > is Linux (CentOS), so I am also using the NFS file system to access
> > > > the home directory where the documents to be indexed reside, and I
> > > > would like to know how much time an application should take to
> > > > index a big amount of documents, like 10 GB.
> > > > I use Lucene version 2.2.0; every node has a dual Xeon 2.4 GHz CPU
> > > > and 512 MB of RAM; LAN: 1 Gbit/s.
> > > >
> > > > The problem I have is that my application takes a lot of time to
> > > > index all the documents: the delay to index 10 GB of PDF documents
> > > > is about 2 days (to convert PDF to text I am using PDFBox), which
> > > > is of course a lot of time. Other applications based on Lucene,
> > > > for instance IBM OmniFind, take only 5 hours to index the same
> > > > amount of PDF documents. I would like to find out why my
> > > > application has this big delay to index; any help is welcome.
> > > > Do you know of other distributed-architecture applications that use
> > > > Lucene to index big amounts of documents? How long do they take to
> > > > index?
> > > > I hope you can help me.
> > > > Greetings
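Since PDFBox conversion speed is the other suspect in this thread, here is a minimal sketch of the extraction step, useful for timing conversion separately from indexing. It uses the Apache PDFBox 2.x API; the pre-Apache PDFBox that Ariel would have used in 2008 had different package names, so treat the exact imports as an assumption:

    import java.io.File;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PdfToText {
        // Extract the plain text of one PDF. Timing this call per document
        // shows whether the two-day total is dominated by PDF conversion or
        // by Lucene/NFS writes.
        public static String extract(File pdf) throws Exception {
            PDDocument document = PDDocument.load(pdf);
            try {
                return new PDFTextStripper().getText(document);
            } finally {
                document.close();
            }
        }
    }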