The nutch-default.xml file contains the configuration property
plugin.includes. Copy that property into nutch-site.xml and change
parse-(text|html|js) to parse-(text|html|js|pdf). This enables the
PDF parser plugin.
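
As a rough sketch, the override in nutch-site.xml might look like the
fragment below. Only the parse-(text|html|js|pdf) part comes from the
advice above; the other plugin names in the value are assumed
placeholders, so copy the actual value from your own nutch-default.xml
rather than from this example.

```xml
<!-- nutch-site.xml: properties here override nutch-default.xml -->
<configuration>
  <!-- Assumption: the plugin names around the parse-(...) group are
       illustrative only. Keep the list from your nutch-default.xml
       and add "|pdf" inside the parse group. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>
```

The change should take effect the next time the crawl or parse step
runs with this configuration.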

Dennis Kubes

Sævaldur Arnar Gunnarsson wrote:
> Hi, I'm evaluating Nutch as a search platform for a large Icelandic
> website.
> The website has quite a large collection of Adobe Acrobat documents
> (PDF) stored on a Lotus Domino server.
> 
> I run Nutch with:
> ./bin/nutch crawl example-domain/url-list.txt -dir example-domain/index/ -depth 9999 -topN 9999
> 
> Out of 3,072 PDF documents fetched by Nutch, 1,687 returned the
> following error:
> Error parsing:
> http://notes.example-domain.is/vefur2.nsf/Files/fr377nr20.pdf/$file/fr377nr20.pdf:
>  failed(2,200): org.apache.nutch.parse.ParseException: parser not found for 
> contentType=application/pdf 
> url=http://notes.example-domain.is/vefur2.nsf/Files/fr377nr20.pdf/$file/fr377nr20.pdf
> 
> With best regards,
