Hi

Thanks for that. I had completely missed that plugins have to be explicitly enabled in Nutch. I have now updated my nutch-site.xml to include the parse-pdf plugin, as well as several others I discovered I needed to add! :P
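
For anyone hitting the same "parser not found for contentType=application/pdf" error, here is a rough sketch of the kind of nutch-site.xml override being discussed. The plugin list and the limit value below are illustrative assumptions, not the exact settings used here; start from the plugin.includes value in your own nutch-default.xml and add pdf to the parse-(...) group.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: an illustrative plugin list. Copy the default value from
       your nutch-default.xml and add "pdf" to the parse-(...) alternatives. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  </property>

  <!-- Assumption: 10 MB is an arbitrary example. The stock limit is small
       enough to truncate larger PDFs fetched over HTTP, which can also make
       parsing fail. -->
  <property>
    <name>http.content.limit</name>
    <value>10485760</value>
  </property>
</configuration>
```

The first property covers point 1 in the reply quoted below (plugin not enabled) and the second covers point 2 (content limit).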
Paul

On 4/6/07, rubdabadub <[EMAIL PROTECTED]> wrote:
> Could be ..
>
> 1. The parse-pdf plugin is not enabled in nutch-site.xml .. you
> need to enable it..
> 2. The pdf file is over the content limit .. you need to increase the
> content limit value in nutch-site.xml.
> 3. Something else that I don't know..
>
> Regards
>
> On 4/6/07, Paul Liddelow <[EMAIL PROTECTED]> wrote:
> > Hi
> >
> > Does anybody know what this means exactly:
> >
> > 8. NUTCH-338 - Remove the text parser as an option for parsing PDF files
> > in parse-plugins.xml (Chris A. Mattmann via siren)
> >
> > In my crawl log file it says:
> >
> > Error parsing:
> > http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf:
> > failed(2,200): org.apache.nutch.parse.ParseException: parser not found
> > for contentType=application/pdf
> > url=http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf
> >
> > This may be a stupid question, but does the Nutch crawler only retrieve
> > and index links, i.e. URLs, and not PDFs? The .pdf extension isn't in the
> > crawl-urlfilter.txt file either. And I can see it in the
> > parse-plugins.xml file:
> >
> > <mimeType name="application/pdf">
> >   <plugin id="parse-pdf" />
> > </mimeType>
> >
> > Thanks
> > Paul

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
