Thanks for yours suggestions.

I'm sorry I didn't know but I would want to know what Do you mean with "SAN"
and "FC"?

Another thing, I have visited the lucene home page and there is not released
the 2.3 version, could you tell me where is the download link ?

Thanks in advance.
Ariel

On Jan 10, 2008 2:59 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> Ariel,
>
> Comments inline.
>
>
> ----- Original Message ----
> From: Ariel <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Thursday, January 10, 2008 10:05:28 AM
> Subject: Re: Why is lucene so slow indexing in nfs file system ?
>
> In a distributed enviroment the application should make an exhaustive
>  use of
> the network and there is not another way to access to the documents in
>  a
> remote repository but accessing in nfs file system.
>
> OG: What about SAN connected over FC for example?
>
> One thing I must clarify: I index the documents in memory, I use
> RAMDirectory to do that, then when the RAMDirectory reach the limit(I
>  have
> put about 10 Mb) then I serialize to disk(nfs) the index to merge it
>  with
> the central index(the central index is in nfs file system), is that
>  correct?
>
> OG: Nah, don't bother with RAMDirectory, just use FSDirectory and it will
> do in-memory thing for you.  Make good use of your RAM and use 2.3 which
> gives you more control over RAM use during indexing.  Parallelizing indexing
> over multiple machines and merging at the end is faster, so that's a good
> approach.  Also, if your boxes have multiple CPUs write your code so that it
> has multiple worker threads that do indexing and feed docs to
> IndexWriter.addDocument(Document) to keep the CPUs fully utilized.
>
> OG: Oh, something faster than PDFBox?  There is (can't remember the name
> now... itextstream or something like that?), though it may not be free like
> PDFBox.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> On Jan 10, 2008 8:45 AM, Ariel <[EMAIL PROTECTED]> wrote:
>
> > Thanks all you for yours answers, I going to change a few things in
>  my
> > application and make tests.
> > One thing I haven't find another good pdfToText converter like pdfBox
>  Do
> > you know any other faster ?
> > Greetings
> > Thanks for yours answers
> > Ariel
> >
> >
> > On Jan 9, 2008 11:08 PM, Otis Gospodnetic
>  <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Ariel,
> > >
> > > I believe PDFBox is not the fastest thing and was built more to
>  handle
> > > all possible PDFs than for speed (just my impression - Ben,
>  PDFBox's author
> > > might still be on this list and might comment).  Pulling data from
>  NFS to
> > > index seems like a bad idea.  I hope at least the indices are local
>  and not
> > > on a remote NFS...
> > >
> > > We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which
>  one)
> > > and indexing overNFS was slooooooow.
> > >
> > > Otis
> > >
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
> > > ----- Original Message ----
> > > From: Ariel <[EMAIL PROTECTED]>
> > > To: java-user@lucene.apache.org
> > > Sent: Wednesday, January 9, 2008 2:50:41 PM
> > > Subject: Why is lucene so slow indexing in nfs file system ?
> > >
> > > Hi:
> > > I have seen the post in
> > >
>  http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
> > >  and
> > > I am implementing a similar application in a distributed
>  enviroment, a
> > > cluster of nodes only 5 nodes. The operating system I use is
> > >  Linux(Centos)
> > > so I am using nfs file system too to access the home directory
>  where
> > >  the
> > > documents to be indexed reside and I would like to know how much
>  time
> > >  an
> > > application spends to index a big amount of documents like 10 Gb ?
> > > I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb
>  in
> > >  every
> > > nodes, LAN: 1Gbits/s.
> > >
> > > The problem I have is that my application spends a lot of time to
>  index
> > >  all
> > > the documents, the delay to index 10 gb of pdf documents is about 2
> > >  days (to
> > > convert pdf to text I am using pdfbox) that is of course a lot of
>  time,
> > > others applications based in lucene, for instance ibm omnifind only
> > >  takes 5
> > > hours to index the same amount of pdfs documents. I would like to
>  find
> > >  out
> > > why my application has this big delay to index, any help is
>  welcome.
> > > Dou you know others distributed architecture application that uses
> > >  lucene to
> > > index big amounts of documents ? How long time it takes to index ?
> > > I hope yo can help me
> > > Greetings
> > >
> > >
> > >
> > >
> > >
>  ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

Reply via email to