Re: Why is lucene so slow indexing in nfs file system ?

Chris Lu Thu, 10 Jan 2008 20:33:06 -0800

SAN is Storage Area Network. FC is Fibre Channel.

I can confirm from one customer's experience that a SAN scales pretty
well and is pretty simple to use. It does cost some money, though.


-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request)
got 2.6 Million Euro funding!


On Jan 10, 2008 3:26 PM, Ariel <[EMAIL PROTECTED]> wrote:
> Thanks for your suggestions.
>
> I'm sorry, I didn't know these terms: what do you mean by "SAN" and "FC"?
>
> Another thing: I have visited the Lucene home page and the 2.3 version has
> not been released there. Could you tell me where the download link is?
>
> Thanks in advance.
> Ariel
>
> On Jan 10, 2008 2:59 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> > Ariel,
> >
> > Comments inline.
> >
> >
> > ----- Original Message ----
> > From: Ariel <[EMAIL PROTECTED]>
> > To: [email protected]
> > Sent: Thursday, January 10, 2008 10:05:28 AM
> > Subject: Re: Why is lucene so slow indexing in nfs file system ?
> >
> > In a distributed environment the application has to make extensive use of
> > the network, and there is no other way to access the documents in a remote
> > repository than through the NFS file system.
> >
> > OG: What about SAN connected over FC for example?
> >
> > One thing I must clarify: I index the documents in memory using a
> > RAMDirectory. When the RAMDirectory reaches its limit (I have set it to
> > about 10 MB), I write that index to disk (NFS) and merge it with the
> > central index (the central index is on the NFS file system). Is that
> > correct?
> >
> > OG: Nah, don't bother with RAMDirectory, just use FSDirectory and it will
> > do the in-memory thing for you.  Make good use of your RAM and use 2.3, which
> > gives you more control over RAM use during indexing.  Parallelizing indexing
> > over multiple machines and merging at the end is faster, so that's a good
> > approach.  Also, if your boxes have multiple CPUs, write your code so that it
> > has multiple worker threads that do indexing and feed docs to
> > IndexWriter.addDocument(Document) to keep the CPUs fully utilized.
> >
> > OG: Oh, something faster than PDFBox?  There is (can't remember the name
> > now... itextstream or something like that?), though it may not be free like
> > PDFBox.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> > On Jan 10, 2008 8:45 AM, Ariel <[EMAIL PROTECTED]> wrote:
> >
> > > Thanks to all of you for your answers. I am going to change a few things
> > > in my application and run some tests.
> > > One thing: I haven't found another good PDF-to-text converter like PDFBox.
> > > Do you know of a faster one?
> > > Greetings
> > > Thanks for your answers
> > > Ariel
> > >
> > >
> > > On Jan 9, 2008 11:08 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> > >
> > > > Ariel,
> > > >
> > > > I believe PDFBox is not the fastest thing and was built more to handle
> > > > all possible PDFs than for speed (just my impression - Ben, PDFBox's
> > > > author, might still be on this list and might comment).  Pulling data
> > > > from NFS to index seems like a bad idea.  I hope at least the indices
> > > > are local and not on a remote NFS...
> > > >
> > > > We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one)
> > > > and indexing over NFS was slooooooow.
> > > >
> > > > Otis
> > > >
> > > > --
> > > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > > >
> > > > ----- Original Message ----
> > > > From: Ariel <[EMAIL PROTECTED]>
> > > > To: [email protected]
> > > > Sent: Wednesday, January 9, 2008 2:50:41 PM
> > > > Subject: Why is lucene so slow indexing in nfs file system ?
> > > >
> > > > Hi:
> > > > I have seen the post at
> > > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
> > > > and I am implementing a similar application in a distributed
> > > > environment, a cluster of only 5 nodes. The operating system I use is
> > > > Linux (CentOS), so I am also using the NFS file system to access the
> > > > home directory where the documents to be indexed reside. I would like
> > > > to know how long an application takes to index a large amount of
> > > > documents, such as 10 GB.
> > > > I use Lucene version 2.2.0. CPU: dual Xeon 2.4 GHz, 512 MB in every
> > > > node, LAN: 1 Gbit/s.
> > > >
> > > > The problem I have is that my application spends a lot of time indexing
> > > > all the documents: indexing 10 GB of PDF documents takes about 2 days
> > > > (I am using PDFBox to convert PDF to text). That is of course a lot of
> > > > time; other applications based on Lucene, for instance IBM OmniFind,
> > > > take only 5 hours to index the same amount of PDF documents. I would
> > > > like to find out why my application takes so long to index; any help
> > > > is welcome.
> > > > Do you know of other distributed applications that use Lucene to index
> > > > large amounts of documents? How long do they take to index?
> > > > I hope you can help me
> > > > Greetings
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > >
> > > >
> > >
> >
>
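
Otis's suggestion above, several worker threads that do the expensive
per-document work and feed one shared IndexWriter.addDocument(Document),
could be sketched roughly as follows. This is only an illustration using
the standard library: DocSink is a hypothetical stand-in for the
IndexWriter (it is not a Lucene type), extractText is a hypothetical
placeholder for the PDF-to-text step, and the sketch assumes the sink can
safely be called from several threads at once.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the index writer; only the addDocument
// call mentioned in the thread is modeled. Assumed to be thread-safe.
interface DocSink {
    void addDocument(String text) throws Exception;
}

class ParallelIndexingSketch {

    // Hypothetical placeholder for the expensive per-file step
    // (for example PDF-to-text extraction).
    static String extractText(String path) {
        return "text of " + path;
    }

    // Runs extraction on `threads` workers and feeds every result to
    // one shared sink. Returns the number of documents added.
    static int indexAll(List<String> paths, DocSink sink, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger added = new AtomicInteger();
        for (String path : paths) {
            pool.submit(() -> {
                try {
                    sink.addDocument(extractText(path));
                    added.incrementAndGet();
                } catch (Exception e) {
                    // A real indexer would log or retry; the sketch skips the file.
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return added.get();
    }
}
```

A pool sized to the number of CPUs in each box is a reasonable starting
point under these assumptions; the right number depends on how much of
the time goes to extraction versus I/O.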

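For a sense of scale, the figures in the original question (10 GB in
about 2 days, versus 5 hours for the other application, on a 1 Gbit/s
LAN) can be turned into average throughput. The sketch below is plain
arithmetic, assuming decimal units (1 GB = 1000 MB) and 2 days = 48
hours; the class and method names are made up for illustration.

```java
class ThroughputSketch {

    // Average throughput in MB/s, assuming 1 GB = 1000 MB.
    static double mbPerSecond(double gigabytes, double hours) {
        return gigabytes * 1000.0 / (hours * 3600.0);
    }

    public static void main(String[] args) {
        double reported = mbPerSecond(10, 48);   // 10 GB in about 2 days
        double faster = mbPerSecond(10, 5);      // 10 GB in 5 hours
        double lanCeiling = 1000.0 / 8.0;        // 1 Gbit/s is at most 125 MB/s
        System.out.printf("reported:    %.3f MB/s%n", reported);
        System.out.printf("5-hour run:  %.3f MB/s%n", faster);
        System.out.printf("LAN ceiling: %.1f MB/s%n", lanCeiling);
    }
}
```

Under these assumptions the reported run averages roughly 0.06 MB/s and
the 5-hour run roughly 0.56 MB/s, both far below the 125 MB/s the link
could carry in theory. That suggests raw network bandwidth alone is
probably not the limit, which fits the points made in the thread about
PDF extraction speed and the cost of indexing over NFS.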