Hi everyone,

In an intranet, I want Nutch to follow only links found in HTML (and maybe JavaScript or XHTML), but definitely not links found in office documents and PDFs.

- I removed parse-tika from plugin.includes.
- I removed everything related to Tika from parse-plugins.xml.
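
For illustration, a change like the first one would look roughly as follows in nutch-site.xml. This is only a sketch: the plugin list is an assumption modelled on a typical Nutch 1.x default, not the actual configuration used here.

```xml
<!-- nutch-site.xml (sketch): plugin.includes without parse-tika.
     The other plugin names are assumed from a typical default setup. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```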

But now I get

Error parsing: http:...docx: org.apache.nutch.parse.ParseException: parser not found for contentType=application/x-tika-ooxml url=http:....docx

I wonder what is wrong here. Do I need a catch-all in parse-plugins.xml? And what does the sneaky <plugin id="feed"/> listed for some <mimeType> elements mean?

Regards,
Harald.
