Richard Braman wrote:
That error is actually not from the http content limit, but I would recommend setting the content limit to -1.
I would recommend against it: you may inadvertently fetch gigabyte-sized files if you skip content limits. You can, however, set the limit high enough that it still makes sense, e.g. 2-10 MB.
For some reason this error seems to happen sometimes even after you add the PDF parsing plugin like you did. I think Nutch must cache the plugin properties in nutch-default. It will start to parse PDFs at some point.
Nutch doesn't cache plugin properties in any place except the currently running process. All properties are read anew from the config files whenever you start any nutch processing.
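As a rough sketch of the settings discussed here: overrides normally go into conf/nutch-site.xml, which takes precedence over nutch-default.xml. The http.content.limit value below is 10 MB expressed in bytes. The plugin.includes value is only an illustrative assumption, since the exact plugin list depends on your Nutch version; start from the value in your own nutch-default.xml and add pdf to the parse group.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Truncate downloaded HTTP content at 10 MB (10 * 1024 * 1024 bytes).
       A negative value disables truncation entirely, which is the
       setting advised against above. -->
  <property>
    <name>http.content.limit</name>
    <value>10485760</value>
  </property>

  <!-- Illustrative plugin list (assumption): copy the real value from your
       nutch-default.xml and add "pdf" to the parse-(...) group. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>
```

Because the files are read when a process starts, an edit like this applies to the next fetch or parse run. Content that was already fetched under a lower limit presumably stays truncated in its segment until it is fetched again.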
-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
