Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
Thanks to all of you for your answers; I am going to change a few things in my
application and run tests.
One thing: I haven't found another good PDF-to-text converter like PDFBox. Do you
know of any faster one?
Greetings
Ariel

On Jan 9, 2008 11:08 PM, Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 Ariel,

 I believe PDFBox is not the fastest thing and was built more to handle all
 possible PDFs than for speed (just my impression - Ben, PDFBox's author
 might still be on this list and might comment).  Pulling data from NFS to
 index seems like a bad idea.  I hope at least the indices are local and not
 on a remote NFS...

 We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one)
 and indexing over NFS was slooow.

 Otis

 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Ariel [EMAIL PROTECTED]
 To: java-user@lucene.apache.org
 Sent: Wednesday, January 9, 2008 2:50:41 PM
 Subject: Why is lucene so slow indexing in nfs file system ?

 Hi:
 I have seen the post at
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
 and I am implementing a similar application in a distributed environment: a
 cluster of only 5 nodes. The operating system I use is Linux (CentOS), so I
 am using the NFS file system too, to access the home directory where the
 documents to be indexed reside, and I would like to know how much time an
 application should spend to index a big amount of documents, say 10 GB.
 I use Lucene version 2.2.0; every node has a dual Xeon 2.4 GHz CPU and 512
 MB of RAM; LAN: 1 Gbit/s.

 The problem I have is that my application spends a lot of time indexing all
 the documents: the delay to index 10 GB of PDF documents is about 2 days
 (to convert PDF to text I am using PDFBox), which is of course a lot of
 time. Other applications based on Lucene, for instance IBM OmniFind, take
 only 5 hours to index the same amount of PDF documents. I would like to
 find out why my application is so slow to index; any help is welcome.
 Do you know of other distributed applications that use Lucene to index big
 amounts of documents? How long do they take to index?
 I hope you can help me.
 Greetings




 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]




Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
In a distributed environment the application has to make heavy use of the
network, and there is no other way to access the documents in a remote
repository than over the NFS file system.
One thing I must clarify: I index the documents in memory, using a
RAMDirectory; when the RAMDirectory reaches its limit (I have set it to
about 10 MB), I serialize the index to disk (NFS) to merge it with the
central index (the central index is on the NFS file system). Is that correct?
I hope you can help me.
I have taken your earlier suggestions into consideration, and I am going to
run some tests.
Ariel





Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Erick Erickson
This seems really clunky. Especially if your merge step also optimizes.

There's not much point in indexing into RAM then merging explicitly.
Just use an FSDirectory rather than a RAMDirectory. There is *already*
buffering built into FSDirectory, and your merge factor etc. control
how much RAM is used before flushing to disk. There's considerable
discussion of this on the Wiki, I believe, but in the mail archive for sure.
And I believe there's a RAM-usage-based flushing policy somewhere.

You're adding complexity where it's probably not necessary. Did you
adopt this scheme because you *thought* it would be faster or because
you were addressing a *known* problem? Don't *ever* write complex code
to support a theoretical case unless you have considerable certainty
that it really is a problem. "It would be faster" is a weak argument when
you don't know whether you're talking about saving 1% or 95%. The
added maintenance is just not worth it.

There's a famous quote about that from Donald Knuth
(paraphrasing Hoare): "We should forget about small efficiencies,
say about 97% of the time: premature optimization is the root of
all evil." It's true.

So the very *first* measurement I'd take is to get rid of the in-RAM
stuff and just write the index to local disk. I suspect you'll be *far*
better off doing this and then just copying your index to the NFS mount.

Best
Erick
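Erick's advice boils down to a few lines of Lucene code. A minimal sketch, assuming the Lucene 2.2 API and lucene-core on the classpath (the index path and field names are illustrative, not from the thread):

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class LocalDiskIndexer {
    public static void main(String[] args) throws Exception {
        // Index on local disk, not the NFS mount; path is a placeholder.
        IndexWriter writer = new IndexWriter(new File("/local/index"),
                new StandardAnalyzer(), true);
        writer.setMaxBufferedDocs(1000);   // 2.2: flush after N buffered docs
        // writer.setRAMBufferSizeMB(32);  // 2.3+: flush by RAM usage instead

        // FSDirectory buffers in RAM internally; no explicit RAMDirectory needed.
        Document doc = new Document();
        doc.add(new Field("body", "extracted pdf text here",
                Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.optimize();  // optional, and costly if the index lived on NFS
        writer.close();
        // Afterwards, copy the finished index to the NFS mount if it must live there.
    }
}
```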




Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Michael McCandless


If possible you should also test the soon-to-be-released version 2.3,  
which has a number of speedups to indexing.


Also try the steps here:

  http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

You should also try an A/B test: A) writing your index to the NFS  
directory and then B) to a local IO system, to see how much NFS is  
really slowing you down.
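That A/B test doesn't even need Lucene to get a first read: timing raw sequential file writes against both mounts already shows the gap. A rough harness, using only the standard library (the target paths passed on the command line are placeholders for a local-disk directory and the NFS mount):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class WriteBench {
    // Writes `count` files of `sizeBytes` bytes each under `dir`; returns elapsed ms.
    // Lucene writes index files sequentially, so this is a crude but telling proxy.
    public static long bench(File dir, int count, int sizeBytes) throws IOException {
        dir.mkdirs();
        byte[] payload = new byte[sizeBytes];
        long start = System.currentTimeMillis();
        for (int i = 0; i < count; i++) {
            FileOutputStream out = new FileOutputStream(new File(dir, "seg" + i + ".bin"));
            out.write(payload);
            out.getFD().sync();  // force to the filesystem, like a flush/commit
            out.close();
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws IOException {
        // e.g. java WriteBench /local/scratch /mnt/nfs/scratch
        for (String path : args) {
            long ms = bench(new File(path), 100, 256 * 1024);
            System.out.println(path + ": " + ms + " ms for 100 x 256 KB files");
        }
    }
}
```

If the NFS path is an order of magnitude slower here, the index (and ideally the source documents) should move off NFS before any other tuning.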


Mike






Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
I am indexing into RAM and then merging explicitly because my application's
design demands it: I have designed it for a distributed environment, so many
worker threads on different machines index into RAM and serialize to disk,
and another thread on another machine picks up each segment index and merges
it with the principal one. That is faster than having just one thread index
the documents, isn't it?
Your suggestions are very useful.
I hope you can help me.
Greetings
Ariel


Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Otis Gospodnetic
Ariel,
 
Comments inline.


- Original Message 
From: Ariel [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thursday, January 10, 2008 10:05:28 AM
Subject: Re: Why is lucene so slow indexing in nfs file system ?

In a distributed enviroment the application should make an exhaustive
 use of
the network and there is not another way to access to the documents in
 a
remote repository but accessing in nfs file system.

OG: What about SAN connected over FC for example?

One thing I must clarify: I index the documents in memory, I use
RAMDirectory to do that, then when the RAMDirectory reach the limit(I
 have
put about 10 Mb) then I serialize to disk(nfs) the index to merge it
 with
the central index(the central index is in nfs file system), is that
 correct?

OG: Nah, don't bother with RAMDirectory; just use FSDirectory, and it will do the
in-memory buffering for you.  Make good use of your RAM and use 2.3, which gives you
more control over RAM use during indexing.  Parallelizing indexing over
multiple machines and merging at the end is faster, so that's a good approach.
Also, if your boxes have multiple CPUs, write your code so that it has multiple
worker threads that do the indexing and feed docs to
IndexWriter.addDocument(Document) to keep the CPUs fully utilized.
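That worker-thread layout can be sketched with just java.util.concurrent; `addDocument` below is a stand-in for `IndexWriter.addDocument(Document)`, which is safe to call from multiple threads sharing a single writer:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelFeeder {
    private final AtomicInteger indexed = new AtomicInteger();

    // Stand-in for writer.addDocument(doc); Lucene's IndexWriter is thread-safe,
    // so all workers would share one instance instead of merging indexes later.
    void addDocument(String text) {
        indexed.incrementAndGet();
    }

    // Feed every doc to the shared writer from a fixed pool of worker threads,
    // so text extraction and indexing keep all CPUs busy.
    public int indexAll(List<String> docs, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final String doc : docs) {
            pool.execute(new Runnable() {
                public void run() { addDocument(doc); }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        return indexed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> docs = Arrays.asList("doc one", "doc two", "doc three", "doc four");
        int n = new ParallelFeeder().indexAll(docs, 2);
        System.out.println("indexed " + n + " docs");  // indexed 4 docs
    }
}
```

In the real pipeline each worker would extract text (e.g. via PDFBox) and add the document, while the single writer handles flushing and merging internally.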

OG: Oh, something faster than PDFBox?  There is one (can't remember the name now...
itextstream or something like that?), though it may not be free like PDFBox.
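For reference, the PDFBox extraction step being discussed is typically just a few lines. A sketch assuming the 0.7.x API of the era (later Apache releases moved the package to org.apache.pdfbox):

```java
import java.io.File;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PdfToText {
    public static String extract(File pdf) throws Exception {
        PDDocument doc = PDDocument.load(pdf);
        try {
            // One stripper call per document; this parse is the slow part.
            return new PDFTextStripper().getText(doc);
        } finally {
            doc.close();  // always close, or handles leak across 10 GB of PDFs
        }
    }
}
```

When profiling the 2-day indexing run, it is worth timing this call separately from the Lucene work; extraction, not indexing, may dominate.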

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Ariel
Thanks for your suggestions.

I'm sorry, I didn't know; could you tell me what you mean by SAN and FC?

Another thing: I have visited the Lucene home page and the 2.3 version has not
been released; could you tell me where the download link is?

Thanks in advance.
Ariel





Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Chris Lu
SAN is Storage Area Network; FC is Fibre Channel.

I can confirm from one customer's experience that using a SAN does scale
pretty well, and is pretty simple. Well, it costs some money.

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request)
got 2.6 Million Euro funding!



Re: Why is lucene so slow indexing in nfs file system ?

2008-01-10 Thread Otis Gospodnetic
2.3 is in the process of being released.  Give it another week to 10 days and 
it will be out.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Ariel [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thursday, January 10, 2008 6:26:44 PM
Subject: Re: Why is lucene so slow indexing in nfs file system ?

Thanks for your suggestions.

I'm sorry, I didn't know; could you tell me what you mean by SAN and FC?

Another thing: I have visited the Lucene home page and the 2.3 version has
not been released there; could you tell me where the download link is?

Thanks in advance.
Ariel

On Jan 10, 2008 2:59 PM, Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 Ariel,

 Comments inline.


 - Original Message 
 From: Ariel [EMAIL PROTECTED]
 To: java-user@lucene.apache.org
 Sent: Thursday, January 10, 2008 10:05:28 AM
 Subject: Re: Why is lucene so slow indexing in nfs file system ?

 In a distributed environment the application has to make heavy use of
 the network, and there is no other way to access the documents in a
 remote repository than through the NFS file system.

 OG: What about SAN connected over FC for example?

 One thing I must clarify: I index the documents in memory using a
 RAMDirectory; when the RAMDirectory reaches its limit (I have set it to
 about 10 MB), I serialize the index to disk (NFS) and merge it with the
 central index (which also lives on the NFS file system). Is that
 correct?

 OG: Nah, don't bother with RAMDirectory, just use FSDirectory and it
 will do the in-memory buffering for you.  Make good use of your RAM and
 use 2.3, which gives you more control over RAM use during indexing.
 Parallelizing indexing over multiple machines and merging at the end is
 faster, so that's a good approach.  Also, if your boxes have multiple
 CPUs, write your code so that it has multiple worker threads that do
 indexing and feed docs to IndexWriter.addDocument(Document) to keep the
 CPUs fully utilized.

 OG: Oh, something faster than PDFBox?  There is one (can't remember the
 name now... itextstream or something like that?), though it may not be
 free like PDFBox.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Erick Erickson
 would like to find out why my application has this big delay to index

Well, then you have to measure <G>. The first thing I'd do
is pinpoint where the time is being spent. Until you have
that answered, you simply cannot take any meaningful action.

1> Don't do any of the indexing. No new Documents, don't
add any fields, etc. This will just time the PDF parsing.
(I'd run this for a set number of documents rather than the
whole 10G). This'll tell you whether the issue is indexing or
PDFBox.

2> Perhaps try the above with local files rather than files
on the nfs mount.

3> Put back some of the indexing and measure each
step. For instance, create the new documents but don't
add them to the index.

4> Then go ahead and add them to the index.

The numbers you get for these measurements will tell
you a lot. At that point, perhaps folks will have more useful
suggestions.
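[The phased measurements above can be done with a tiny timing helper; the
phase bodies are placeholders for your own parsing/indexing code.]

```java
// Wrap each phase (PDF parsing only, parsing + Document creation, full
// indexing) in a Runnable and compare the wall-clock times to find the
// bottleneck.  The phase bodies are placeholders for your own code.
public class PhaseTimer {

    // Runs the phase once and returns elapsed wall-clock milliseconds.
    public static long timeMillis(Runnable phase) {
        long start = System.nanoTime();
        phase.run();
        return (System.nanoTime() - start) / 1000000L;
    }

    // e.g.  long parseOnlyMs = PhaseTimer.timeMillis(new Runnable() {
    //           public void run() { /* parse N PDFs, discard the text */ }
    //       });
}
```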

The reason I'm being so unhelpful is that without lots more
detail, there's really nothing we can help with since there
are so many variables that it's just impossible to say
which one is the problem. For instance, is it a single
10G document and you're swapping like crazy? Are you
CPU bound or IO bound? Have you tried profiling your
process at all to find the choke points?

Best
Erick


On Jan 9, 2008 8:50 AM, Ariel [EMAIL PROTECTED] wrote:

 Hi:
 I have seen the post in
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and
 I am implementing a similar application in a distributed environment, a
 cluster of only 5 nodes. The operating system I use is Linux (CentOS),
 so I am using the NFS file system too, to access the home directory where
 the documents to be indexed reside, and I would like to know how much
 time an application spends to index a big amount of documents, like 10 GB.
 I use Lucene version 2.2.0; every node has a dual Xeon 2.4 GHz CPU and
 512 MB of RAM; LAN: 1 Gbit/s.

 The problem I have is that my application spends a lot of time indexing
 all the documents; the delay to index 10 GB of PDF documents is about 2
 days (to convert PDF to text I am using PDFBox), which is of course a lot
 of time. Other applications based on Lucene, for instance IBM OmniFind,
 only take 5 hours to index the same amount of PDF documents. I would like
 to find out why my application has this big delay to index; any help is
 welcome.
 Do you know other distributed applications that use Lucene to index big
 amounts of documents? How long do they take to index?
 I hope you can help me.
 Greetings



RE: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Steven A Rowe
Hi Ariel,

On 01/09/2008 at 8:50 AM, Ariel wrote:
 Do you know other distributed applications that
 use Lucene to index big amounts of documents?

Apache Solr is an open source enterprise search server based on the Lucene Java 
search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, 
caching, replication, and a web administration interface. It runs in a Java 
servlet container such as Tomcat.

http://lucene.apache.org/solr/

Steve





Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Grant Ingersoll
There's also Nutch.  However, 10GB isn't that big...  Perhaps you can  
index where the docs/index lives, then just make the index available  
via NFS?  Or, better yet, use rsync to replicate it like Solr does.


-Grant




--
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Antony Bowesman

Ariel wrote:


The problem I have is that my application spends a lot of time to index all
the documents, the delay to index 10 gb of pdf documents is about 2 days (to
convert pdf to text I am using pdfbox) that is of course a lot of time,
others applications based in lucene, for instance ibm omnifind only takes 5
hours to index the same amount of pdfs documents. I would like to find out


If you are using log4j, make sure you have the PDFBox log4j categories set to
INFO or higher; otherwise this really slows it down (by a factor of 10). Or
make sure you are using the non-log4j version.  See
http://sourceforge.net/forum/message.php?msg_id=3947448
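[For example, in log4j.properties — a sketch; the logger name is a guess
based on PDFBox's pre-Apache package, so check which logger names your
PDFBox version actually uses:]

```properties
# Quiet PDFBox's very chatty debug logging, which can slow extraction ~10x
log4j.logger.org.pdfbox=INFO
```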


Antony






Re: Why is lucene so slow indexing in nfs file system ?

2008-01-09 Thread Otis Gospodnetic
Ariel,

I believe PDFBox is not the fastest thing and was built more to handle all 
possible PDFs than for speed (just my impression - Ben, PDFBox's author might 
still be on this list and might comment).  Pulling data from NFS to index seems 
like a bad idea.  I hope at least the indices are local and not on a remote 
NFS...

We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one) and
indexing over NFS was slooow.

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




