<<< would like to find out why my application has this big delay to index>>>
Well, then you have to measure <G>. The first thing I'd do is pinpoint
where the time is being spent. Until you have that answered, you simply
cannot take any meaningful action.

1> Don't do any of the indexing. No new Documents, don't add any fields,
   etc. This will just time the PDF parsing. (I'd run this for a set
   number of documents rather than the whole 10G.) This'll tell you
   whether the issue is indexing or PDFBox.

2> Perhaps try the above with local files rather than files on the NFS
   mount.

3> Put back some of the indexing and measure each step. For instance,
   create the new Documents but don't add them to the index.

4> Then go ahead and add them to the index.

The numbers you get for these measurements will tell you a lot. At that
point, perhaps folks will have more useful suggestions.

The reason I'm being so unhelpful is that without lots more detail,
there's really nothing we can help with, since there are so many
variables that it's just impossible to say which one is the problem. For
instance, is it a single 10G document and you're swapping like crazy?
Are you CPU bound or IO bound? Have you tried profiling your process at
all to find the choke points?

Best
Erick

On Jan 9, 2008 8:50 AM, Ariel <[EMAIL PROTECTED]> wrote:

> Hi:
> I have seen the post in
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html and
> I am implementing a similar application in a distributed environment, a
> cluster of nodes only 5 nodes. The operating system I use is Linux (CentOS)
> so I am using nfs file system too to access the home directory where the
> documents to be indexed reside and I would like to know how much time an
> application spends to index a big amount of documents like 10 Gb ?
> I use lucene version 2.2.0, CPU processor xeon dual 2.4 Ghz 512 Mb in every
> nodes, LAN: 1Gbits/s.
>
> The problem I have is that my application spends a lot of time to index all
> the documents, the delay to index 10 gb of pdf documents is about 2 days (to
> convert pdf to text I am using pdfbox) that is of course a lot of time,
> others applications based in lucene, for instance ibm omnifind only takes 5
> hours to index the same amount of pdfs documents. I would like to find out
> why my application has this big delay to index, any help is welcome.
> Do you know others distributed architecture application that uses lucene to
> index big amounts of documents ? How long time it takes to index ?
> I hope you can help me
> Greetings
>
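
For what it's worth, the staged measurements in steps 1> to 4> of the reply could be sketched roughly as below. This is only a minimal illustration: StageTimer is a hypothetical helper (not part of Lucene or PDFBox), the file names are placeholders, and the lambda bodies are stand-ins that you would replace with your own PDFBox extraction, Document construction, and IndexWriter.addDocument calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Accumulates wall-clock time per named stage (hypothetical helper). */
class StageTimer {
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    /** Runs the work, adds its elapsed time to the stage total, returns the result. */
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total nanoseconds recorded for a stage, or 0 if it never ran. */
    long totalNanos(String stage) {
        return nanosByStage.getOrDefault(stage, 0L);
    }

    /** One line per stage, in the order the stages were first seen. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanosByStage.entrySet()) {
            sb.append(String.format("%-10s %8d ms%n", e.getKey(), e.getValue() / 1_000_000));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        // Placeholder sample: time a fixed set of files, not the whole 10G.
        String[] sample = {"a.pdf", "b.pdf", "c.pdf"};
        for (String path : sample) {
            // Step 1: PDF parsing only (stand-in for the real text extraction).
            String text = timer.time("parse", () -> "text of " + path);
            // Step 3: build the document but do not add it to the index yet.
            Object doc = timer.time("build", () -> new Object[] {path, text});
            // Step 4: add to the index (stand-in for the real add call).
            timer.time("add", () -> Boolean.TRUE);
        }
        System.out.print(timer.report());
    }
}
```

Running the same loop once against local copies and once against the NFS mount (step 2>) gives two reports to compare. Real parsing and indexing calls throw checked exceptions, so the lambdas would need to catch and rethrow them, or the helper could take a Callable instead of a Supplier.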