Prevent parsing of office documents and PDFs

Hi everyone,

In an intranet, I want Nutch to follow only links found in HTML (and maybe JavaScript and XHTML), but definitely not links in office documents and PDFs.


- I removed parse-tika from plugin.includes.
- I removed everything related to Tika from parse-plugins.xml.

But now I get

Error parsing: http:...docx: org.apache.nutch.parse.ParseException: parser not found for contentType=application/x-tika-ooxml url=http:....docx

I wonder what is wrong here. Do I need a catch-all in parse-plugins.xml? And what does the sneaky <plugin id="feed"/> listed for some <mimeType> elements mean?


Regards,
Harald.
