It could be one of the following:

1. The parse-pdf plugin is not enabled in nutch-site.xml. You need to enable it.
2. The PDF file is over the content limit. You need to increase the content limit value in nutch-site.xml.
3. Something else that I don't know about.
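
For the first two causes, the overrides in conf/nutch-site.xml might look roughly like the sketch below. plugin.includes and http.content.limit are the usual Nutch property names, but the plugin list and the 10 MB limit shown here are only assumptions for illustration. Start from the plugin.includes value in your own nutch-default.xml and add pdf to its parse-(...) group rather than copying this one verbatim.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Illustrative plugin list (an assumption, not your actual default):
       the relevant change is adding "pdf" to the parse-(...) group
       so that the parse-pdf plugin gets loaded. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  </property>

  <!-- Maximum number of bytes fetched per document over HTTP.
       10485760 (10 MB) is an arbitrary example value; pick one
       larger than your biggest PDF. -->
  <property>
    <name>http.content.limit</name>
    <value>10485760</value>
  </property>
</configuration>
```

A truncated PDF generally cannot be parsed, which is why the content limit matters even when the plugin is enabled. Pages that were already fetched need to be re-fetched after changing these values.
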
Regards

On 4/6/07, Paul Liddelow <[EMAIL PROTECTED]> wrote:
> Hi
>
> Does anybody know what this means exactly:
>
> 8. NUTCH-338 - Remove the text parser as an option for parsing PDF files
> in parse-plugins.xml (Chris A. Mattmann via siren)
>
> In my crawl log file it says:
>
> Error parsing:
> http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf:
> failed(2,200): org.apache.nutch.parse.ParseException: parser not found
> for contentType=application/pdf
> url=http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf
>
> This maybe a stupid question, but does the Nutch crawler only retrieve
> and index links i.e. URL's and not pdf's? The .pdf isn't in the
> crawl-urlfilter.txt file either. And I can see it in the
> parse-plugins.xml file:
>
> <mimeType name="application/pdf">
>   <plugin id="parse-pdf" />
> </mimeType>
>
> Thanks
> Paul
