Richard Braman wrote:
That error is actually not from the http content limit, but I would
recommend setting the content limit to -1.  For some reason this error

I would recommend against it: without a content limit you may inadvertently fetch gigabyte-sized files. You can instead set the limit high enough that it still makes sense, e.g. 2-10 MB.
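
As a rough sketch of what that could look like, assuming the override goes in conf/nutch-site.xml inside the <configuration> element, and picking 10 MB purely as an illustrative value:

```xml
<!-- conf/nutch-site.xml: values here override those in nutch-default.xml -->
<property>
  <name>http.content.limit</name>
  <!-- Illustrative value: 10 MB expressed in bytes (10 * 1024 * 1024). -->
  <!-- A negative value such as -1 disables truncation entirely. -->
  <value>10485760</value>
</property>
```

Content longer than the limit is truncated, so a limit that is too low can cut a PDF short before it reaches the parser; pick a value that fits the documents you expect to fetch.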

seems to happen sometimes even after you add the PDF parsing plugin like
you did.  I think Nutch must cache the plugin properties in
nutch-default.  It will start to parse PDFs at some point.

Nutch doesn't cache plugin properties anywhere except in the currently running process. All properties are read anew from the config files whenever you start any Nutch processing.
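
For reference, enabling the PDF parser is normally a matter of adding it to plugin.includes. The fragment below is only a sketch: the surrounding plugin list is an assumption and differs between Nutch versions, so copy the value from your own nutch-default.xml and add "pdf" to the parse group rather than pasting this one as-is.

```xml
<!-- conf/nutch-site.xml: hypothetical plugin list, shown only to illustrate adding "pdf" -->
<property>
  <name>plugin.includes</name>
  <!-- The other entries are placeholders; keep whatever your installation already lists. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
```

Since the configuration is read at process start, the change applies to the next Nutch command you run, not to one that is already running.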

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

