Re: [Nutch-general] Nutch changes 0.9.txt

rubdabadub Fri, 06 Apr 2007 02:23:19 -0700

It could be one of these:

1. The parse-pdf plugin is not enabled in nutch-site.xml. You need to
enable it there (via the plugin.includes property).
2. The PDF file is over the content limit. You need to increase the
content limit value in nutch-site.xml.
3. Something else that I don't know about.
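
For 1 and 2, something like the following in conf/nutch-site.xml might
do it. This is only a sketch: the plugin.includes value below assumes the
stock 0.9 default from nutch-default.xml with "pdf" added to the parse-(...)
group, so copy the value from your own nutch-default.xml rather than
trusting mine. I am also assuming the PDFs are fetched over HTTP, hence
http.content.limit (file.content.limit would be the one for file: URLs).

```xml
<configuration>
  <!-- Assumed 0.9 default with "pdf" added to the parse-(...) group;
       check the value in your own nutch-default.xml first. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Content longer than this many bytes is truncated;
       a negative value means no truncation. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

Pages that failed to parse will need to be fetched and parsed again
after the change before they show up in the index.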


Regards

On 4/6/07, Paul Liddelow <[EMAIL PROTECTED]> wrote:
> Hi
>
> Does anybody know what this means exactly:
>
> 8. NUTCH-338 - Remove the text parser as an option for parsing PDF files
>     in parse-plugins.xml (Chris A. Mattmann via siren)
>
> In my crawl log file it says:
>
> Error parsing: 
> http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf:
> failed(2,200): org.apache.nutch.parse.ParseException: parser not found
> for contentType=application/pdf
> url=http://www.site.com/quick%20reference%20guide%202/$FILE/Law_v2.4_02122006.pdf
>
> This may be a stupid question, but does the Nutch crawler only retrieve
> and index links, i.e. URLs, and not PDFs? The .pdf isn't in the
> crawl-urlfilter.txt file either. And I can see it in the
> parse-plugins.xml file:
>
> <mimeType name="application/pdf">
>     <plugin id="parse-pdf" />
> </mimeType>
>
> Thanks
> Paul
>
