2.3 is in the process of being released. Give it another week to 10 days and it will be out.
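
On the worker-thread suggestion quoted below, here is a minimal Java sketch of the idea, not a definitive implementation. DocSink and extractText are hypothetical names, standing in for the one shared IndexWriter and for the PDF-to-text step, so the sketch compiles without Lucene on the classpath. In real code each worker would call IndexWriter.addDocument(Document) on the same writer instance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelIndexingSketch {

    /** Hypothetical stand-in for the single shared IndexWriter. */
    public interface DocSink {
        void addDocument(String text) throws Exception;
    }

    /** Hypothetical placeholder for the expensive step, e.g. PDF-to-text conversion. */
    static String extractText(String path) {
        return "text of " + path;
    }

    /** Converts and adds every path on a fixed pool of workers; returns how many were added. */
    public static int indexAll(List<String> paths, int threads, DocSink sink) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger added = new AtomicInteger();
        for (String path : paths) {
            pool.execute(() -> {
                try {
                    // The slow conversion runs in parallel; all workers feed one sink.
                    sink.addDocument(extractText(path));
                    added.incrementAndGet();
                } catch (Exception e) {
                    System.err.println("Skipping " + path + ": " + e);
                }
            });
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return added.get();
    }

    public static void main(String[] args) {
        // A synchronized list plays the role of the index in this sketch.
        List<String> index = Collections.synchronizedList(new ArrayList<String>());
        int workers = Runtime.getRuntime().availableProcessors();
        int n = indexAll(Arrays.asList("a.pdf", "b.pdf", "c.pdf"), workers, index::add);
        System.out.println("Indexed " + n + " documents");
    }
}
```

One worker per CPU is only a starting point. If the documents still come over NFS, the workers may spend most of their time waiting on I/O rather than on the CPU.
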

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Ariel <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, January 10, 2008 6:26:44 PM
Subject: Re: Why is lucene so slow indexing in nfs file system ?

Thanks for your suggestions.

I'm sorry, I didn't know, but I would like to know what you mean by "SAN" and "FC"?

Another thing: I have visited the Lucene home page and the 2.3 version is not released there. Could you tell me where the download link is?

Thanks in advance.
Ariel

On Jan 10, 2008 2:59 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> Ariel,
>
> Comments inline.
>
> ----- Original Message ----
> From: Ariel <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Thursday, January 10, 2008 10:05:28 AM
> Subject: Re: Why is lucene so slow indexing in nfs file system ?
>
> In a distributed environment the application should make exhaustive use
> of the network, and there is no other way to access the documents in a
> remote repository than through the NFS file system.
>
> OG: What about SAN connected over FC for example?
>
> One thing I must clarify: I index the documents in memory. I use
> RAMDirectory to do that, and when the RAMDirectory reaches the limit (I
> have set about 10 MB) I serialize the index to disk (NFS) to merge it
> with the central index (the central index is on the NFS file system). Is
> that correct?
>
> OG: Nah, don't bother with RAMDirectory, just use FSDirectory and it will
> do in-memory thing for you. Make good use of your RAM and use 2.3 which
> gives you more control over RAM use during indexing. Parallelizing indexing
> over multiple machines and merging at the end is faster, so that's a good
> approach. Also, if your boxes have multiple CPUs write your code so that it
> has multiple worker threads that do indexing and feed docs to
> IndexWriter.addDocument(Document) to keep the CPUs fully utilized.
>
> OG: Oh, something faster than PDFBox? There is (can't remember the name
> now... itextstream or something like that?), though it may not be free like
> PDFBox.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> On Jan 10, 2008 8:45 AM, Ariel <[EMAIL PROTECTED]> wrote:
>
> > Thanks to all of you for your answers. I am going to change a few
> > things in my application and run tests.
> > One thing: I haven't found another good PDF-to-text converter like
> > PDFBox. Do you know any faster one?
> > Greetings
> > Thanks for your answers
> > Ariel
> >
> >
> > On Jan 9, 2008 11:08 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> >
> > > Ariel,
> > >
> > > I believe PDFBox is not the fastest thing and was built more to handle
> > > all possible PDFs than for speed (just my impression - Ben, PDFBox's
> > > author might still be on this list and might comment). Pulling data
> > > from NFS to index seems like a bad idea. I hope at least the indices
> > > are local and not on a remote NFS...
> > >
> > > We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which
> > > one) and indexing over NFS was slooooooow.
> > >
> > > Otis
> > >
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
> > > ----- Original Message ----
> > > From: Ariel <[EMAIL PROTECTED]>
> > > To: java-user@lucene.apache.org
> > > Sent: Wednesday, January 9, 2008 2:50:41 PM
> > > Subject: Why is lucene so slow indexing in nfs file system ?
> > >
> > > Hi:
> > > I have seen the post in
> > > http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
> > > and I am implementing a similar application in a distributed
> > > environment, a cluster of only 5 nodes. The operating system I use is
> > > Linux (CentOS), so I am using the NFS file system too, to access the
> > > home directory where the documents to be indexed reside. I would like
> > > to know how much time an application spends indexing a large amount of
> > > documents, like 10 GB?
> > > I use Lucene version 2.2.0. CPU: dual Xeon 2.4 GHz, 512 MB in every
> > > node. LAN: 1 Gbit/s.
> > >
> > > The problem I have is that my application spends a lot of time
> > > indexing all the documents. Indexing 10 GB of PDF documents takes
> > > about 2 days (to convert PDF to text I am using PDFBox). That is of
> > > course a lot of time; other applications based on Lucene, for instance
> > > IBM OmniFind, take only 5 hours to index the same amount of PDF
> > > documents. I would like to find out why my application has this big
> > > delay when indexing; any help is welcome.
> > > Do you know other distributed-architecture applications that use
> > > Lucene to index large amounts of documents? How long do they take to
> > > index?
> > > I hope you can help me
> > > Greetings
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]