On Sat, Oct 30, 2004 at 11:06:18AM -0400, Luke Baker wrote:
> Hey all,
>
> Does anyone else have the problem of the pdf parser taking up so many
> resources that it slows down the whole parsing process? I ran the fetch
> with the -noParsing option (thanks John!). I then ran the parser on the
> documents with the pdf parser enabled. The speed for parsing was quite
> slow. It was only parsing about 5 pages/second. When I disabled the
> pdf parser and ran the parser again on those documents, I was parsing
> over 30 pages/second. All this on the same machine which is a P4 2.66
> with 512MB of RAM. The iowait is 0%, so I don't think it is thrashing
> or using swap that much. Is the pdf parser just really CPU intensive?
> What does everyone else do? 5 pages/second is not really acceptable,
> but it'd be great to be able to parse pdfs.

What are the numbers for kb/s and bytes/page? Pages/second alone says
little when page sizes differ a lot. I have a collection of mostly
MS Word documents, PPTs, and some PDFs, and the numbers are:

041001 194517 10 status: 0.17712256 pages/s, 8246.524 kb/s, 5959461.5 bytes/page

Some files are very large: 100 - 300 Mbytes.

I'm not sure whether other PDF libraries would do better than PDFBox.
You can always crawl HTML separately from Word, PPT, PDF, etc.

John

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
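
To make the point about kb/s and bytes/page concrete, here is a minimal
Python sketch of how the three figures in the status line relate. The
function name `throughput` is made up for illustration. Treating kb/s as
kilobits per second with 1024 bits per kb is an assumption inferred from
the arithmetic (it reproduces the logged value); it has not been checked
against the Nutch source.

```python
def throughput(pages_per_s: float, bytes_per_page: float) -> dict:
    """Convert a page rate and average page size into data rates.

    Assumption: "kb/s" means kilobits per second, 1 kb = 1024 bits.
    """
    bytes_per_s = pages_per_s * bytes_per_page
    return {
        "bytes_per_s": bytes_per_s,
        "kbit_per_s": bytes_per_s * 8 / 1024,
    }


# Figures taken from the status line quoted in the reply.
john = throughput(pages_per_s=0.17712256, bytes_per_page=5959461.5)

# Roughly 1.06 million bytes/s, which works out to about the logged 8246.5 kb/s.
print(f"{john['bytes_per_s']:.0f} bytes/s, {john['kbit_per_s']:.1f} kb/s")
```

So a run at well under one page/second can still be moving about a
megabyte of document data per second. Comparing 5 pages/s with 30 pages/s
the same way needs the bytes/page figure for each run.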
