This seems really clunky, especially if your merge step also optimizes.

There's not much point in indexing into RAM and then merging explicitly.
Just use an FSDirectory rather than a RAMDirectory. There is *already*
buffering built into FSDirectory, and your IndexWriter settings (e.g.
maxBufferedDocs) control how much RAM is used before flushing to disk.
There's considerable discussion of this on the Wiki I believe, but in
the mail archive for sure. And I believe there's a RAM-usage-based
flushing policy somewhere (setRAMBufferSizeMB in the upcoming 2.3).
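Roughly, a minimal sketch of what I mean (the path and buffer size are
made up, and setRAMBufferSizeMB assumes the 2.3 API; on 2.2 use
setMaxBufferedDocs instead):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class LocalIndexing {
    public static void main(String[] args) throws Exception {
        // Write straight to local disk; FSDirectory buffers on its own.
        FSDirectory dir = FSDirectory.getDirectory(new File("/local/index"));
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        // Flush segments to disk whenever ~32 MB of RAM is in use.
        writer.setRAMBufferSizeMB(32);

        // ... writer.addDocument(doc) for each document here ...

        writer.optimize(); // optional: merge down to a single segment
        writer.close();
    }
}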

You're adding complexity where it's probably not necessary. Did you
adopt this scheme because you *thought* it would be faster or because
you were addressing a *known* problem? Don't *ever* write complex code
to support a theoretical case unless you have considerable certainty
that it really is a problem. "It would be faster" is a weak argument when
you don't know whether you're talking about saving 1% or 95%. The
added maintenance is just not worth it.

There's a famous quote about that from Donald Knuth
(paraphrasing Hoare) "We should forget about small efficiencies,
say about 97% of the time: premature optimization is the root of
all evil." It's true.

So the very *first* measurement I'd take is to get rid of the in-RAM
stuff and just write the index to local disk. I suspect you'll be *far*
better off doing this and then just copying your index to the NFS mount.
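
Something like this, once the local index is done (the paths here are
placeholders; Directory.copy is the stock Lucene helper):

import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PushToNfs {
    public static void main(String[] args) throws Exception {
        // Build locally first, then copy the finished index to the
        // share in a single pass.
        Directory local = FSDirectory.getDirectory(new File("/local/index"));
        Directory nfs = FSDirectory.getDirectory(new File("/mnt/nfs/central"));
        Directory.copy(local, nfs, true); // true also closes the source
    }
}

Measure that end-to-end against your RAMDirectory scheme before adding
anything more elaborate.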

Best
Erick

On Jan 10, 2008 10:05 AM, Ariel <[EMAIL PROTECTED]> wrote:

> In a distributed environment the application should make exhaustive use
> of the network, and there is no other way to access the documents in a
> remote repository than through the NFS file system.
> One thing I must clarify: I index the documents in memory, using a
> RAMDirectory; when the RAMDirectory reaches its limit (I have set about
> 10 MB), I serialize the index to disk (NFS) to merge it with the
> central index (the central index is on the NFS file system). Is that
> correct?
> I hope you can help me.
> I have taken into consideration the suggestions you made before; I am
> going to do some things to test it.
> Ariel
>
>
> On Jan 10, 2008 8:45 AM, Ariel <[EMAIL PROTECTED]> wrote:
>
> > Thanks to all of you for your answers; I am going to change a few
> > things in my application and run tests.
> > One thing: I haven't found another good PDF-to-text converter like
> > PDFBox. Do you know of any faster one?
> > Greetings
> > Thanks for your answers
> > Ariel
> >
> >
> > On Jan 9, 2008 11:08 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Ariel,
> > >
> > > I believe PDFBox is not the fastest thing and was built more to
> > > handle all possible PDFs than for speed (just my impression - Ben,
> > > PDFBox's author might still be on this list and might comment).
> > > Pulling data from NFS to index seems like a bad idea.  I hope at
> > > least the indices are local and not on a remote NFS...
> > >
> > > We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which
> > > one) and indexing over NFS was slooooooow.
> > >
> > > Otis
> > >
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
> > > ----- Original Message ----
> > > From: Ariel <[EMAIL PROTECTED]>
> > > To: java-user@lucene.apache.org
> > > Sent: Wednesday, January 9, 2008 2:50:41 PM
> > > Subject: Why is lucene so slow indexing in nfs file system ?
> > >
> > > Hi:
> > > I have seen the post in
> > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
> > > and I am implementing a similar application in a distributed
> > > environment: a cluster of only 5 nodes. The operating system I use
> > > is Linux (CentOS), so I am also using the NFS file system to access
> > > the home directory where the documents to be indexed reside, and I
> > > would like to know how much time an application spends indexing a
> > > big amount of documents, like 10 GB.
> > > I use Lucene version 2.2.0; every node has a dual Xeon 2.4 GHz CPU
> > > with 512 MB of RAM; LAN: 1 Gbit/s.
> > >
> > > The problem I have is that my application spends a lot of time
> > > indexing all the documents: the delay to index 10 GB of PDF
> > > documents is about 2 days (to convert PDF to text I am using
> > > PDFBox), which is of course a lot of time. Other applications based
> > > on Lucene, for instance IBM OmniFind, take only 5 hours to index
> > > the same amount of PDF documents. I would like to find out why my
> > > application has this big delay when indexing; any help is welcome.
> > > Do you know of other distributed-architecture applications that use
> > > Lucene to index big amounts of documents? How long do they take to
> > > index?
> > > I hope you can help me.
> > > Greetings
> > >
> >
>
