Re: Prevent parsing of office documents and PDFs

Harald Kirsch Fri, 11 Jul 2014 06:52:19 -0700

Hi Julien.

The reason is that I want pdfs and such to be indexed.
But they should not be parsed to find outgoing URLs.

So I guess for indexing they need to be fetched. But Nutch should not try to parse them. The conversion to indexable text takes place somewhere else, so there is no need for Nutch to sweat over it.


Harald.


On 11.07.2014 15:27, Julien Nioche wrote:

You don't need to modify parse-plugins.xml, just remove parse-tika
from plugin.includes.
Your problem here is that you have an open office document in the segment
and no parser to deal with it.

Why don't you add a regular expression to the URL filters to remove all URLs
ending in .pdf, .docx, .doc? That would prevent such documents from being
fetched in the first place.

Julien
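
The URL-filter suggestion above can be sketched as follows. This is only an illustration in Python of what such a regular expression would match, not Nutch code: the pattern, the function name keep_url and the example.invalid URLs are all assumptions made for the example. In a Nutch setup that uses the urlfilter-regex plugin, the corresponding rule would presumably be an exclusion line such as -\.(pdf|docx?)$ in conf/regex-urlfilter.txt, placed before the final accept rule.

```python
import re

# Hypothetical exclusion pattern: URLs that end in .pdf, .doc or .docx
# (case-insensitive). Not taken from the thread; chosen for illustration.
SKIP_DOCS = re.compile(r"\.(pdf|docx?)$", re.IGNORECASE)


def keep_url(url: str) -> bool:
    """Return True if the URL passes the filter, False if it is excluded."""
    return SKIP_DOCS.search(url) is None


# Made-up sample URLs to show which ones survive the filter.
urls = [
    "http://intranet.example.invalid/index.html",
    "http://intranet.example.invalid/report.pdf",
    "http://intranet.example.invalid/notes.DOCX",
    "http://intranet.example.invalid/docs/",
]
kept = [u for u in urls if keep_url(u)]
```

Note that a pattern anchored at the end of the URL does not catch links with a query string, such as report.pdf?version=2. Also, as the reply at the top of the thread points out, filtering at this stage means the documents are never fetched, so they cannot be indexed either.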


On 11 July 2014 13:50, Harald Kirsch <[email protected]> wrote:

Hi everyone,

In an intranet, I want Nutch to follow only links found in HTML (and maybe
JavaScript, XHTML), but clearly not in office documents and PDFs.

- I took out parse-tika from the plugin.includes.
- I took out everything related to tika in parse-plugins.xml.

But now I get

Error parsing: http:...docx: org.apache.nutch.parse.ParseException:
parser not found for contentType=application/x-tika-ooxml
url=http:....docx

I wonder what is wrong here. Do I need a catchall in parse-plugins.xml?
What does the sneaky <plugin id="feed"/> for some <mimeType> elements mean?

Regards,
Harald.


--
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49 211 53883-216
Fax +49-211-550266-19
http://www.raytion.com
