The nutch-default.xml file contains a configuration property named plugin.includes. Copy that property into nutch-site.xml and change parse-(text|html|js) to parse-(text|html|js|pdf). This enables the PDF parser plugin, which is what the "parser not found for contentType=application/pdf" error is complaining about.
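
As a rough sketch, the override in nutch-site.xml might look like the following. Only the parse-(text|html|js|pdf) part comes from the advice above; the other plugin names in the value are an assumption about what a typical nutch-default.xml of this era contains, so copy the actual value from your own nutch-default.xml and change just the parse-(...) group.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: properties here override those in nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- ASSUMPTION: every entry except the parse-(...) group is illustrative.
         Start from the value in your own nutch-default.xml and add "pdf"
         to the parse-(...) alternatives. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>
</configuration>
```

The value is a regular expression matched against plugin directory names, so adding pdf to the group selects the parse-pdf plugin. Documents that were already fetched with the old setting would presumably need to be parsed again (or the crawl rerun) for the change to take effect.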

Dennis Kubes

Sævaldur Arnar Gunnarsson wrote:
> Hi, I'm evaluating Nutch as a search platform for a large Icelandic
> website.
> The website has a quite large collection of Adobe Acrobat documents
> (PDF) stored on a Lotus Domino server.
>
> I run nutch with
> ./bin/nutch crawl example-domain/url-list.txt -dir example-domain/index/
> -depth 9999 -topN 9999
>
> Out of 3.072 PDF documents fetched by Nutch, 1.687 returned the
> following error:
> Error parsing:
> http://notes.example-domain.is/vefur2.nsf/Files/fr377nr20.pdf/$file/fr377nr20.pdf:
> failed(2,200): org.apache.nutch.parse.ParseException: parser not found for
> contentType=application/pdf
> url=http://notes.example-domain.is/vefur2.nsf/Files/fr377nr20.pdf/$file/fr377nr20.pdf
>
> With best regards,

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
