Hi, I'm evaluating Nutch as a search platform for a large Icelandic
website.
The website has quite a large collection of Adobe Acrobat (PDF)
documents stored on a Lotus Domino server.

I run Nutch with:
./bin/nutch crawl example-domain/url-list.txt -dir example-domain/index/ \
    -depth 9999 -topN 9999

Of the 3,072 PDF documents fetched by Nutch, 1,687 failed with the
following error:
Error parsing:
http://notes.example-domain.is/vefur2.nsf/Files/fr377nr20.pdf/$file/fr377nr20.pdf:
failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/pdf url=http://notes.example-domain.is/vefur2.nsf/Files/fr377nr20.pdf/$file/fr377nr20.pdf
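
For reference, the failure count above can be tallied from a saved copy of
the crawl output with something like the sketch below. The function name
and the log path are hypothetical, and it assumes each error appears on a
single line in the log (it is only wrapped here by the mail client).

```shell
# Hypothetical helper: tally "parser not found" failures per content type.
# $1 is a file holding the captured crawl output (path is an assumption).
count_parser_not_found() {
    # Keep only the matching fragment, strip it down to the content type,
    # then count how often each content type occurs.
    grep -o 'parser not found for contentType=[^ ]*' "$1" \
        | sed 's/.*contentType=//' \
        | sort \
        | uniq -c
}
```

Usage would be along the lines of `count_parser_not_found crawl.log`, where
crawl.log is whatever file the crawl output was redirected to.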

With best regards,
-- 
Sævaldur Arnar Gunnarsson
System Administrator | RHCE

Hugsmiðja ehf.
Snorrabraut 56 | 105 Reykjavík
S: 550 0900 | G: 659 0007


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
