Re: [Nutch-dev] PDF parsing speed

John X Sun, 31 Oct 2004 09:03:57 -0800

On Sat, Oct 30, 2004 at 11:06:18AM -0400, Luke Baker wrote:
> Hey all,
> 
> Does anyone else have the problem of the pdf parser taking up so many 
> resources that it slows down the whole parsing process?  I ran the fetch 
> with the -noParsing option (thanks John!).  I then ran the parser on the 
> documents with the pdf parser enabled.  The speed for parsing was quite 
> slow.  It was only parsing about 5 pages/second.  When I disabled the 
> pdf parser and ran the parser again on those documents, I was parsing 
> over 30 pages/second.  All this on the same machine which is a P4 2.66 
> with 512MB of RAM.  The iowait is 0%, so I don't think it is thrashing 
> or using swap that much.  Is the pdf parser just really CPU intensive? 
> What does everyone else do?  5 pages/second is not really acceptable, 
> but it'd be great to be able to parse pdfs.


What are the numbers for kb/s and bytes/page?
I have a collection of mostly MS Word and PowerPoint files plus some PDFs; the numbers are
041001 194517 10 status: 0.17712256 pages/s, 8246.524 kb/s, 5959461.5 bytes/page
Some files are very large: 100 - 300 Mbytes.
Not sure if other PDF libs would be better than PDFBox.
You can always crawl html separately from Word, PPT, PDF, etc.
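
To show why pages/s alone is hard to compare across collections, here is a
rough sketch of how the three numbers in the status line above relate. The
function name is made up for illustration, and reading "kb/s" as kilobits
per second with 1 kb = 1024 bits is an assumption inferred from the
arithmetic, not something stated in this thread.

```python
def throughput(pages_per_s, bytes_per_page):
    """Derive byte and bit throughput from a page rate and average page size.

    Assumption: "kb/s" means kilobits per second with 1 kb = 1024 bits.
    This is inferred from the status line quoted above, not confirmed.
    """
    bytes_per_s = pages_per_s * bytes_per_page
    kbits_per_s = bytes_per_s * 8 / 1024
    return bytes_per_s, kbits_per_s


# Values copied from the status line in this message.
bytes_per_s, kbits_per_s = throughput(0.17712256, 5959461.5)
print(f"{bytes_per_s:.0f} bytes/s, {kbits_per_s:.3f} kb/s")
```

Under that assumption the result agrees with the 8246.524 kb/s in the log,
so a low pages/s figure can still mean roughly 1 MB/s of documents parsed
when the average page is about 6 MB.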

John


