Re: Why is lucene so slow indexing in nfs file system ?
Thanks to all of you for your answers. I am going to change a few things in my application and run tests. One thing: I haven't found another PDF-to-text converter as good as PDFBox. Do you know of a faster one? Greetings, and thanks for your answers.

Ariel

On Jan 9, 2008 11:08 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote:

Ariel, I believe PDFBox is not the fastest thing; it was built more to handle all possible PDFs than for speed (just my impression - Ben, PDFBox's author, might still be on this list and might comment). Pulling data from NFS to index seems like a bad idea. I hope at least the indices are local and not on a remote NFS... We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one) and indexing over NFS was slooow.

Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Ariel [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Wednesday, January 9, 2008 2:50:41 PM
Subject: Why is lucene so slow indexing in nfs file system ?

Hi: I have seen the post at http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and I am implementing a similar application in a distributed environment: a cluster of only 5 nodes. The operating system I use is Linux (CentOS), so I am also using an NFS file system to access the home directory where the documents to be indexed reside. I would like to know how much time an application should take to index a large amount of documents, say 10 GB. I use Lucene version 2.2.0; every node has a dual Xeon 2.4 GHz CPU and 512 MB of RAM, on a 1 Gbit/s LAN. The problem is that my application takes a very long time to index all the documents: about 2 days for 10 GB of PDF documents (I use PDFBox to convert PDF to text). That is of course a lot of time; other Lucene-based applications, for instance IBM OmniFind, take only 5 hours to index the same amount of PDF documents. I would like to find out why my application is so slow to index; any help is welcome.

Do you know of other distributed applications that use Lucene to index large amounts of documents? How long do they take to index? I hope you can help me. Greetings

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
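Before swapping out PDFBox, it may be worth measuring where the two days actually go: PDF-to-text extraction or Lucene indexing. A minimal stage-timing harness is sketched below; the two `busyWork` calls are hypothetical stand-ins for the real PDFBox extraction call and the `IndexWriter.addDocument` loop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StageTimer {
    // Times each named stage of the pipeline so the bottleneck becomes visible.
    public static Map<String, Long> timeStages(Map<String, Runnable> stages) {
        Map<String, Long> nanos = new LinkedHashMap<>();
        for (Map.Entry<String, Runnable> e : stages.entrySet()) {
            long t0 = System.nanoTime();
            e.getValue().run();
            nanos.put(e.getKey(), System.nanoTime() - t0);
        }
        return nanos;
    }

    public static void main(String[] args) {
        Map<String, Runnable> stages = new LinkedHashMap<>();
        // Stand-ins: in the real application these would be the PDFBox
        // extraction of one batch of PDFs and the Lucene indexing of the
        // resulting text.
        stages.put("pdfToText", () -> busyWork(50_000));
        stages.put("index", () -> busyWork(10_000));
        timeStages(stages).forEach((k, v) -> System.out.println(k + ": " + v + " ns"));
    }

    private static void busyWork(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) acc += i;
        if (acc == 42) System.out.print(""); // keep the loop from being optimized away
    }
}
```

If extraction dominates, a faster PDF library helps; if indexing dominates, the NFS and RAMDirectory questions discussed later in this thread matter more.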
Re: Why is lucene so slow indexing in nfs file system ?
In a distributed environment the application will make heavy use of the network, and there is no other way to access the documents in a remote repository than over the NFS file system. One thing I must clarify: I index the documents in memory, using a RAMDirectory; when the RAMDirectory reaches a limit (I have set it to about 10 MB), I serialize the index to disk (NFS) and merge it with the central index (the central index is on the NFS file system). Is that correct? I hope you can help me. I have taken into consideration the suggestions you made before, and I am going to run some tests.

Ariel
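For clarity, the flush-and-merge cycle described above can be sketched schematically. Plain lists stand in here for the RAMDirectory segment and the central NFS index, and the size limit is a document count rather than 10 MB; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferedMerger {
    private final int flushLimit;                 // analogue of the ~10 MB RAMDirectory cap
    private final List<String> ramBuffer = new ArrayList<>();    // stands in for RAMDirectory
    private final List<String> centralIndex = new ArrayList<>(); // stands in for the NFS index
    private int flushes = 0;

    public BufferedMerger(int flushLimit) { this.flushLimit = flushLimit; }

    public void addDocument(String doc) {
        ramBuffer.add(doc);
        if (ramBuffer.size() >= flushLimit) flush();
    }

    // Serialize the in-memory segment and merge it into the central index.
    public void flush() {
        if (ramBuffer.isEmpty()) return;
        centralIndex.addAll(ramBuffer); // real code: an addIndexes-style merge over NFS
        ramBuffer.clear();
        flushes++;
    }

    public int flushCount() { return flushes; }
    public int indexedDocs() { return centralIndex.size(); }
}
```

Counting flushes this way makes the cost model visible: a small buffer forces one NFS round-trip per few documents' worth of data, which is exactly the overhead the replies below warn about.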
Re: Why is lucene so slow indexing in nfs file system ?
This seems really clunky, especially if your merge step also optimizes. There's not much point in indexing into RAM and then merging explicitly. Just use an FSDirectory rather than a RAMDirectory. There is *already* buffering built into FSDirectory, and your merge factor etc. control how much RAM is used before flushing to disk. There's considerable discussion of this on the Wiki I believe, and in the mail archive for sure. And I believe there's a RAM-usage-based flushing policy somewhere.

You're adding complexity where it's probably not necessary. Did you adopt this scheme because you *thought* it would be faster, or because you were addressing a *known* problem? Don't *ever* write complex code to support a theoretical case unless you have considerable certainty that it really is a problem. "It would be faster" is a weak argument when you don't know whether you're talking about saving 1% or 95%. The added maintenance is just not worth it. There's a famous quote about that from Donald Knuth (paraphrasing Hoare): "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." It's true.

So the very *first* measurement I'd take is to get rid of the in-RAM stuff and just write the index to local disk. I suspect you'll be *far* better off doing this and then just copying your index to the NFS mount.

Best,
Erick
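The built-in buffering Erick refers to is configured on IndexWriter itself. A sketch against the Lucene 2.2-era API is below (the index path is a placeholder, and this fragment is not compiled here; in 2.3, setRAMBufferSizeMB provides the RAM-based flushing policy he mentions):

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class LocalIndexer {
    public static void main(String[] args) throws IOException {
        // Index straight to local disk; IndexWriter buffers documents in RAM itself.
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/local/index"), // local disk, not the NFS mount
                new StandardAnalyzer(),
                true);                                    // create a new index
        writer.setMaxBufferedDocs(1000); // docs held in RAM before a flush to disk
        writer.setMergeFactor(10);       // how eagerly on-disk segments are merged
        // ... writer.addDocument(doc) for each document ...
        writer.close();
        // Only once indexing is finished, copy the completed index to the NFS mount.
    }
}
```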
Re: Why is lucene so slow indexing in nfs file system ?
If possible you should also test the soon-to-be-released version 2.3, which has a number of speedups to indexing. Also try the steps here: http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

You should also run an A/B test: A) write your index to the NFS directory, and B) write it to a local I/O system, to see how much NFS is really slowing you down.

Mike
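Mike's A/B test can be approximated without Lucene at all: time the same raw write load against a local directory and against the NFS mount. The harness below uses plain file writes as a rough proxy for segment writes; the `/mnt/nfs/bench` path is a placeholder for the real NFS directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WriteBench {
    // Writes `files` files of `bytesPerFile` bytes under dir and returns elapsed nanos.
    public static long timeWrites(Path dir, int files, int bytesPerFile) throws IOException {
        Files.createDirectories(dir);
        byte[] payload = new byte[bytesPerFile];
        long t0 = System.nanoTime();
        for (int i = 0; i < files; i++) {
            Files.write(dir.resolve("seg" + i + ".bin"), payload);
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) throws IOException {
        // A/B: same write load against local disk and against the NFS mount.
        long local = timeWrites(Files.createTempDirectory("bench-local"), 50, 64 * 1024);
        System.out.println("local: " + local / 1_000_000 + " ms");
        // Uncomment and point at your real NFS directory for the B side:
        // long nfs = timeWrites(Paths.get("/mnt/nfs/bench"), 50, 64 * 1024);
        // System.out.println("nfs:   " + nfs / 1_000_000 + " ms");
    }
}
```

If the NFS side is an order of magnitude slower at raw writes, indexing over NFS will be at least that much slower too.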
Re: Why is lucene so slow indexing in nfs file system ?
I am indexing into RAM and then merging explicitly because my application demands it: I have designed it for a distributed environment, so many threads or workers on different machines index into RAM and serialize to disk, and another thread on another machine accesses each segment index to merge it with the principal one. That is faster than having just one thread index the documents, isn't it? Your suggestions are very useful. I hope you can help me. Greetings

Ariel
Re: Why is lucene so slow indexing in nfs file system ?
Ariel, comments inline.

----- Original Message -----
From: Ariel [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thursday, January 10, 2008 10:05:28 AM
Subject: Re: Why is lucene so slow indexing in nfs file system ?

> In a distributed environment the application will make heavy use of the network, and there is no other way to access the documents in a remote repository than over the NFS file system.

OG: What about a SAN connected over FC, for example?

> One thing I must clarify: I index the documents in memory, using a RAMDirectory; when the RAMDirectory reaches a limit (I have set it to about 10 MB), I serialize the index to disk (NFS) and merge it with the central index (the central index is on the NFS file system). Is that correct?

OG: Nah, don't bother with RAMDirectory; just use FSDirectory and it will do the in-memory thing for you. Make good use of your RAM, and use 2.3, which gives you more control over RAM use during indexing. Parallelizing indexing over multiple machines and merging at the end is faster, so that's a good approach. Also, if your boxes have multiple CPUs, write your code so that it has multiple worker threads that do the indexing and feed docs to IndexWriter.addDocument(Document), to keep the CPUs fully utilized.

OG: Oh, something faster than PDFBox? There is one (can't remember the name now... itextstream or something like that?), though it may not be free like PDFBox.

Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
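Otis's multi-threaded advice can be sketched with java.util.concurrent. Here a counter-backed stub stands in for IndexWriter.addDocument (which accepts concurrent calls); in real code each task would also run the PDF extraction and Document construction, which is where the parallelism pays off.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelIndexer {
    // Stub for IndexWriter.addDocument(Document); just counts additions.
    static final AtomicInteger indexed = new AtomicInteger();

    static void addDocument(String doc) {
        indexed.incrementAndGet();
    }

    // Feed docs to a pool of worker threads so all CPUs stay busy.
    public static int indexAll(List<String> docs, int threads) throws InterruptedException {
        indexed.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : docs) {
            // Real code: extract text from the PDF, build a Document, addDocument it.
            pool.submit(() -> addDocument(doc));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return indexed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 100; i++) docs.add("doc" + i);
        System.out.println("indexed " + indexAll(docs, 4) + " docs");
    }
}
```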
Re: Why is lucene so slow indexing in nfs file system ?
Thanks for your suggestions. I'm sorry, I didn't know: what do you mean by SAN and FC? Another thing: I have visited the Lucene home page and the 2.3 version has not been released; could you tell me where the download link is? Thanks in advance.

Ariel
Re: Why is lucene so slow indexing in nfs file system ?
SAN is Storage Area Network. FC is Fibre Channel. I can confirm from one customer's experience that using a SAN does scale pretty well, and pretty simply. Well, it costs some money.

--
Chris Lu
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site (anonymous per request), got 2.6 million Euro funding!
Re: Why is lucene so slow indexing in nfs file system ?
2.3 is in the process of being released. Give it another week to 10 days and it will be out. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: Ariel [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, January 10, 2008 6:26:44 PM Subject: Re: Why is lucene so slow indexing in nfs file system ?

Thanks for your suggestions. I'm sorry, I didn't know: what do you mean by SAN and FC? Another thing: I have visited the Lucene home page and the 2.3 version is not released there. Could you tell me where the download link is? Thanks in advance. Ariel

On Jan 10, 2008 2:59 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Ariel, comments inline. - Original Message From: Ariel [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Thursday, January 10, 2008 10:05:28 AM Subject: Re: Why is lucene so slow indexing in nfs file system ?

In a distributed environment the application must make heavy use of the network, and there is no way to reach the documents in a remote repository other than through the NFS file system.

OG: What about a SAN connected over FC, for example?

One thing I must clarify: I index the documents in memory using a RAMDirectory; when the RAMDirectory reaches its limit (I have set about 10 MB), I serialize the index to disk (NFS) to merge it with the central index (the central index is on the NFS file system). Is that correct?

OG: Nah, don't bother with RAMDirectory, just use FSDirectory and it will do the in-memory thing for you. Make good use of your RAM and use 2.3, which gives you more control over RAM use during indexing. Parallelizing indexing over multiple machines and merging at the end is faster, so that's a good approach. Also, if your boxes have multiple CPUs, write your code so that it has multiple worker threads that do the indexing and feed docs to IndexWriter.addDocument(Document), to keep the CPUs fully utilized.

OG: Oh, something faster than PDFBox?
There is (can't remember the name now... itextstream or something like that?), though it may not be free like PDFBox. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
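Otis's suggestion above (several worker threads feeding one IndexWriter, whose addDocument is safe to call concurrently) can be sketched roughly as follows. This is a minimal skeleton, not the thread's actual code: the Lucene and PDFBox calls are replaced by a stand-in counter so the example is self-contained, and the thread count and file names are illustrative assumptions.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelIndexSketch {

    /** Index all paths using the given number of worker threads. */
    public static int indexAll(List<String> paths, int threads) throws InterruptedException {
        // Stand-in for one shared IndexWriter; in real code the workers
        // would share a single writer, since addDocument is thread-safe.
        final AtomicInteger indexed = new AtomicInteger();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final String path : paths) {
            pool.execute(new Runnable() {
                public void run() {
                    // Real code: extract text from the PDF at 'path',
                    // build a Document, call writer.addDocument(doc).
                    indexed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return indexed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int n = indexAll(java.util.Arrays.asList("a.pdf", "b.pdf", "c.pdf"), 2);
        System.out.println("indexed " + n + " docs");
    }
}
```

On the RAM-control point: Lucene 2.3's IndexWriter adds setRAMBufferSizeMB, which flushes by memory use rather than document count, so the writer itself does the buffering Ariel was hand-rolling with RAMDirectory.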
Re: Why is lucene so slow indexing in nfs file system ?
> would like to find out why my application has this big delay to index

Well, then you have to measure <g>. The first thing I'd do is pinpoint where the time is being spent; until you have that answered, you simply cannot take any meaningful action.

1. Don't do any of the indexing: no new Documents, no fields added, etc. This just times the PDF parsing. (I'd run this for a set number of documents rather than the whole 10 GB.) This will tell you whether the issue is indexing or PDFBox.
2. Perhaps try the above with local files rather than files on the NFS mount.
3. Put back some of the indexing and measure each step. For instance, create the new Documents but don't add them to the index.
4. Then go ahead and add them to the index.

The numbers you get for these measurements will tell you a lot. At that point, perhaps folks will have more useful suggestions. The reason I'm being so unhelpful is that without lots more detail there's really nothing we can help with; there are so many variables that it's impossible to say which one is the problem. For instance, is it a single 10 GB document and you're swapping like crazy? Are you CPU bound or IO bound? Have you tried profiling your process at all to find the choke points?

Best
Erick
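The step-by-step measurement described above needs nothing more than System.nanoTime around each phase. A minimal sketch, with the phase bodies stubbed out (the real versions would call PDFBox for step 1 and Lucene's Document/IndexWriter for steps 3 and 4):

```java
public class PhaseTimer {

    /** Runs one phase and returns elapsed wall-clock milliseconds. */
    public static long timeMillis(Runnable phase) {
        long start = System.nanoTime();
        phase.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Step 1: parse only -- PDFBox extraction over a fixed sample
        // of PDFs, with no Document creation at all.
        long parseMs = timeMillis(new Runnable() {
            public void run() { /* extract text from N sample PDFs */ }
        });

        // Step 3: build Documents and fields, but never call addDocument.
        long buildMs = timeMillis(new Runnable() {
            public void run() { /* new Document() + add fields */ }
        });

        // Step 4: actually add the Documents to the index.
        long addMs = timeMillis(new Runnable() {
            public void run() { /* writer.addDocument(doc) for each */ }
        });

        System.out.println("parse=" + parseMs + "ms build=" + buildMs
                + "ms add=" + addMs + "ms");
    }
}
```

Comparing the same phases against local disk vs. the NFS mount (step 2) then isolates the filesystem cost from the PDFBox cost.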
RE: Why is lucene so slow indexing in nfs file system ?
Hi Ariel,

On 01/09/2008 at 8:50 AM, Ariel wrote:
> Do you know other distributed-architecture applications that use Lucene to index big amounts of documents?

Apache Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat. http://lucene.apache.org/solr/

Steve
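For a sense of how feeding documents to Solr looks: documents are POSTed to its /update handler in a small XML format, then committed. A sketch; the field names here (id, text) are placeholders for whatever the schema actually defines:

```xml
<!-- POST to http://localhost:8983/solr/update, then send <commit/>. -->
<add>
  <doc>
    <field name="id">doc-001</field>
    <field name="text">extracted PDF text goes here</field>
  </doc>
</add>
```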
Re: Why is lucene so slow indexing in nfs file system ?
There's also Nutch. However, 10 GB isn't that big... Perhaps you can index on the machine where the docs/index live, then just make the index available via NFS? Or, better yet, use rsync to replicate it like Solr does.

-Grant

--
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Why is lucene so slow indexing in nfs file system ?
Ariel wrote:
> The problem I have is that my application spends a lot of time to index all the documents; the delay to index 10 GB of pdf documents is about 2 days (to convert pdf to text I am using PDFBox). That is of course a lot of time; other applications based on Lucene, for instance IBM OmniFind, take only 5 hours to index the same amount of pdf documents. I would like to find out

If you are using log4j, make sure you have the PDFBox log4j categories set to INFO or higher, otherwise this really slows it down (by a factor of 10), or make sure you are using the non-log4j version. See http://sourceforge.net/forum/message.php?msg_id=3947448

Antony
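The tweak Antony describes is a single line in log4j.properties. A sketch, assuming the org.pdfbox package name used by PDFBox 0.7.x; check the package your version actually logs under:

```properties
# Raise PDFBox's categories from DEBUG to INFO; its per-operator
# debug logging can slow extraction by an order of magnitude.
log4j.logger.org.pdfbox=INFO
```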
Re: Why is lucene so slow indexing in nfs file system ?
Ariel,

I believe PDFBox is not the fastest thing and was built more to handle all possible PDFs than for speed (just my impression - Ben, PDFBox's author, might still be on this list and might comment). Pulling data from NFS to index seems like a bad idea. I hope at least the indices are local and not on a remote NFS... We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which one) and indexing over NFS was slooow.

Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch