Hi everyone,
In an intranet, I want Nutch to follow only links found in HTML (and
maybe JavaScript and XHTML), but not links in office documents and PDFs.
- I removed parse-tika from plugin.includes.
- I removed everything related to Tika from parse-plugins.xml.
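
As a rough sketch of these two changes (the plugin list and the MIME
types shown are assumptions for illustration, not my exact configuration):

```xml
<!-- nutch-site.xml fragment: plugin.includes without parse-tika.
     The other plugins listed are an assumed, typical set. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- parse-plugins.xml fragment: only HTML-like types are mapped,
     and no entry refers to parse-tika anymore. -->
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>
</parse-plugins>
```

With a setup like this, nothing is mapped to application/x-tika-ooxml.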
But now I get
Error parsing: http:...docx: org.apache.nutch.parse.ParseException:
parser not found for contentType=application/x-tika-ooxml url=http:....docx
I wonder what is wrong here. Do I need a catch-all entry in parse-plugins.xml?
And what does the sneaky <plugin id="feed"/> listed under some <mimeType> elements mean?
Regards,
Harald.