I should have included the link, but I used PDFBox. Thanks,

Steve Betts
[EMAIL PROTECTED]
937-477-1797

-----Original Message-----
From: "Håvard W. Kongsgård" [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 25, 2006 10:34 AM
To: [email protected]
Subject: Re: Parsing PDF Nutch Achilles heel?

Where do I get the new version: http://www.pdfbox.org/ or
http://svn.apache.org/viewcvs.cgi/lucene/nutch/ ?

Steve Betts wrote:

>There is a bug in the PDF parser tool used with 0.7. You can get a newer
>version and replace the jars in the parse-pdf plugin, and the freeze will go
>away.
>
>Thanks,
>
>Steve Betts
>[EMAIL PROTECTED]
>937-477-1797
>
>-----Original Message-----
>From: "Håvard W. Kongsgård" [mailto:[EMAIL PROTECTED]]
>Sent: Wednesday, January 25, 2006 10:10 AM
>To: [email protected]
>Subject: Parsing PDF Nutch Achilles heel?
>
>I have been testing different Nutch configurations to see what slows
>down the fetching process on my servers (Nutch 0.7.1).
>My general experience is that the PDF parse process is Nutch's Achilles heel.
>
>Nutch works fine on older computers, but with the combination of
>parse-(text|html|pdf)
>and http.content.limit = -1 (needed to get PDF parsing to work), Nutch
>sometimes freezes completely.
>
>Is any improvement to the parsing of PDF files planned for the next
>version of Nutch (0.8)?
