[Nutch-dev] PDF parsing speed

Luke Baker Sat, 30 Oct 2004 08:13:59 -0700

Hey all,

Does anyone else find that the PDF parser takes up so many resources that it slows down the whole parsing process? I ran the fetch with the -noParsing option (thanks John!) and then ran the parser on the fetched documents with the PDF parser enabled. Parsing was quite slow, only about 5 pages/second. When I disabled the PDF parser and ran the parser again on the same documents, I was parsing over 30 pages/second. Both runs were on the same machine, a P4 2.66 GHz with 512MB of RAM. The iowait is 0%, so I don't think it is thrashing or using much swap. Is the PDF parser just really CPU intensive? What does everyone else do? 5 pages/second is not really acceptable, but it'd be great to be able to parse PDFs.
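
To put the two throughput figures in perspective, here is a rough back-of-the-envelope sketch in Python. It is not part of Nutch. The function name and the `pdf_fraction` parameter are hypothetical: the share of PDFs in the crawl is not stated above, so it has to be assumed. The sketch also assumes that the 30 pages/second figure reflects the average cost of the non-PDF documents.

```python
def implied_pdf_seconds(rate_all, rate_without_pdf, pdf_fraction):
    """Estimate the average seconds spent parsing one PDF.

    rate_all:          pages/second with the PDF parser enabled (about 5 above)
    rate_without_pdf:  pages/second with it disabled (about 30 above)
    pdf_fraction:      ASSUMED share of documents that are PDFs (not given above)
    """
    if not 0 < pdf_fraction <= 1:
        raise ValueError("pdf_fraction must be in (0, 1]")
    avg_all = 1.0 / rate_all            # average seconds per document, PDFs included
    avg_other = 1.0 / rate_without_pdf  # average seconds per non-PDF document
    # avg_all = pdf_fraction * t_pdf + (1 - pdf_fraction) * avg_other, solved for t_pdf
    return (avg_all - (1 - pdf_fraction) * avg_other) / pdf_fraction


if __name__ == "__main__":
    # Hypothetical PDF shares; only the 5 and 30 pages/second figures come from the post.
    for share in (0.05, 0.10, 0.25):
        print(f"{share:.0%} PDFs -> ~{implied_pdf_seconds(5, 30, share):.2f} s per PDF")
```

Under these assumptions, if one document in ten were a PDF, each PDF would be costing roughly 1.7 seconds of parse time, which would point to the PDF parser being CPU bound rather than to an I/O problem.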

Thanks,

Luke


