In a distributed environment the application has to make heavy use of the
network, and there is no way to access the documents in a remote repository
other than through the NFS file system.
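
For scale, here is a rough back-of-the-envelope calculation using only the
figures from the original message quoted below (10 GB of PDFs, about 2 days
for this application, about 5 hours for OmniFind, a 1 Gbit/s LAN). The class
and method names are made up for illustration, and 1 GB is taken as 1000 MB.

```java
// Rough effective-throughput estimate from the figures quoted in this thread.
// Names are illustrative only; 1 GB is treated as 1000 MB for simplicity.
class IndexingThroughput {

    // Average megabytes processed per second for a job of the given size and duration.
    static double megabytesPerSecond(double gigabytes, double hours) {
        return (gigabytes * 1000.0) / (hours * 3600.0);
    }

    // Nominal capacity of a link in MB/s (ignores protocol overhead).
    static double linkMegabytesPerSecond(double gigabitsPerSecond) {
        return gigabitsPerSecond * 1000.0 / 8.0;
    }

    public static void main(String[] args) {
        double mine = megabytesPerSecond(10, 48);     // 10 GB in about 2 days
        double omnifind = megabytesPerSecond(10, 5);  // 10 GB in about 5 hours
        double lan = linkMegabytesPerSecond(1);       // 1 Gbit/s LAN
        System.out.printf("this application: %.3f MB/s%n", mine);
        System.out.printf("OmniFind figure:  %.3f MB/s%n", omnifind);
        System.out.printf("nominal LAN rate: %.1f MB/s%n", lan);
    }
}
```

Both rates come out well under 1 MB/s (roughly 0.06 and 0.56 MB/s) against a
nominal 125 MB/s for the link, so raw bandwidth alone does not seem to
explain the delay.
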
One thing I should clarify: I index the documents in memory, using a
RAMDirectory. When the RAMDirectory reaches its limit (I have set it to
about 10 MB), I write the index to disk (NFS) and merge it with the central
index, which is also on the NFS file system. Is that the correct approach?
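
The buffer-then-flush pattern described above can be modelled in plain Java
as follows. This is only a sketch and not Lucene code: BatchBuffer and all of
its members are hypothetical names, and the "central index" is reduced to a
directory that receives one file per flush. In Lucene 2.2 terms the size
check would correspond roughly to RAMDirectory.sizeInBytes() and the flush to
IndexWriter.addIndexes(Directory[]) on the central index; neither is called
here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative model of "buffer in memory, flush when a size limit is reached".
// Hypothetical class; it does not use or imitate the Lucene API.
class BatchBuffer {
    private final Path targetDir;   // stands in for the central location (e.g. an NFS mount)
    private final long limitBytes;  // flush threshold, e.g. 10L * 1024 * 1024
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private int flushCount = 0;

    BatchBuffer(Path targetDir, long limitBytes) {
        this.targetDir = targetDir;
        this.limitBytes = limitBytes;
    }

    // Buffer one document's text; flush once the buffered size reaches the limit.
    void add(String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        pending.add(bytes);
        pendingBytes += bytes.length;
        if (pendingBytes >= limitBytes) {
            flush();
        }
    }

    // Write everything buffered so far as one batch file, then reset the buffer.
    void flush() throws IOException {
        if (pending.isEmpty()) {
            return;
        }
        Path batch = targetDir.resolve("batch-" + flushCount + ".dat");
        try (OutputStream out = Files.newOutputStream(batch)) {
            for (byte[] doc : pending) {
                out.write(doc);
            }
        }
        pending.clear();
        pendingBytes = 0;
        flushCount++;
    }

    int flushCount() { return flushCount; }

    long pendingBytes() { return pendingBytes; }
}
```

A caller would construct it with the target directory and a limit, call
add() once per converted document, and call flush() one last time at the end
so that the final partial batch is not lost.
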
I hope you can help me.
I have taken into consideration the suggestions you made before, and I am
going to try a few things to test them.
Ariel


On Jan 10, 2008 8:45 AM, Ariel <[EMAIL PROTECTED]> wrote:

> Thanks to all of you for your answers. I am going to change a few things
> in my application and run some tests.
> One thing: I haven't found another good PDF-to-text converter like PDFBox.
> Do you know of a faster one?
> Greetings
> Thanks for your answers
> Ariel
>
>
> On Jan 9, 2008 11:08 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
> wrote:
>
> > Ariel,
> >
> > I believe PDFBox is not the fastest thing and was built more to handle
> > all possible PDFs than for speed (just my impression - Ben, PDFBox's author
> > might still be on this list and might comment).  Pulling data from NFS to
> > index seems like a bad idea.  I hope at least the indices are local and not
> > on a remote NFS...
> >
> > We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one)
> > and indexing over NFS was slooooooow.
> >
> > Otis
> >
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > From: Ariel <[EMAIL PROTECTED]>
> > To: java-user@lucene.apache.org
> > Sent: Wednesday, January 9, 2008 2:50:41 PM
> > Subject: Why is lucene so slow indexing in nfs file system ?
> >
> > Hi:
> > I have seen the post at
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
> > and I am implementing a similar application in a distributed environment,
> > a cluster of only 5 nodes. The operating system I use is Linux (CentOS),
> > so I am also using the NFS file system to access the home directory where
> > the documents to be indexed reside. I would like to know how long an
> > application takes to index a large amount of documents, such as 10 GB.
> > I use Lucene version 2.2.0, with a dual Xeon 2.4 GHz CPU and 512 MB on
> > every node; LAN: 1 Gbit/s.
> >
> > The problem is that my application takes a very long time to index all
> > the documents: indexing 10 GB of PDF documents takes about 2 days (I am
> > using PDFBox to convert PDF to text). That is of course a lot of time;
> > other Lucene-based applications, for instance IBM OmniFind, take only 5
> > hours to index the same amount of PDF documents. I would like to find out
> > why my application is so slow at indexing; any help is welcome.
> > Do you know of other distributed applications that use Lucene to index
> > large amounts of documents? How long do they take to index?
> > I hope you can help me.
> > Greetings
