Re: Prevent parsing of office documents and PDFs

Harald Kirsch Fri, 11 Jul 2014 07:41:34 -0700

Hi Julien,

Parsing is clearly necessary for indexing, but it does not have to be Nutch that does the parsing. I prefer to call that step "conversion to indexable text", to distinguish it from the parsing that Nutch has to do to find outgoing URLs.

But if you want to call it parsing, well, it is done by another piece of software. I only want Nutch to


a) find and follow outgoing URLs --- *in HTML only*

b) download all URLs found and send them to my indexing plugin, which takes care of converting whatever mime-type to something my indexer understands.


Harald.


On 11.07.2014 16:18, Julien Nioche wrote:

Hi Harald

The parsing step is necessary in order to index documents, as this is where
the text and metadata are extracted. A document which is not parsed won't
get indexed. It is not clear what you mean by "the conversion to indexable
text takes place somewhere else": it is done by the parse step.

Julien

On 11 July 2014 14:50, Harald Kirsch <[email protected]> wrote:

Hi Julien.

The reason is that I want pdfs and such to be indexed.
But they should not be parsed to find outgoing URLs.

So I guess for indexing they need to be fetched. But Nutch should not try
to parse them. The conversion to indexable text takes place somewhere else,
so there is no need for Nutch to sweat over it.

Harald.



On 11.07.2014 15:27, Julien Nioche wrote:

You don't need to modify parse-plugins.xml, just remove parse-tika
from plugin.includes.
Your problem here is that you have an Office Open XML document (.docx) in
the segment and no parser to deal with it.
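
To illustrate what removing parse-tika from plugin.includes does, here is a minimal Python sketch. It assumes the property is a regular expression matched against whole plugin ids; the two values below are shortened, assumed examples modelled on a typical Nutch 1.x default, so check nutch-default.xml for the real list in your version. The helper `is_enabled` is hypothetical and not part of Nutch.

```python
import re

# Assumed, shortened example values; the real default list is longer
# and differs between Nutch versions.
WITH_TIKA = "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
WITHOUT_TIKA = "protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)"


def is_enabled(plugin_id, includes):
    """Return True if the whole plugin id matches the plugin.includes regex."""
    return re.fullmatch(includes, plugin_id) is not None


for plugin_id in ("parse-html", "parse-tika"):
    print(plugin_id,
          is_enabled(plugin_id, WITH_TIKA),
          is_enabled(plugin_id, WITHOUT_TIKA))
```

With the second value parse-tika is no longer loaded, so a fetched .docx has no parser available, which is consistent with the "parser not found" error quoted below.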

Why don't you add a regular expression to the URL filters to remove all URLs
ending in .pdf, .docx, .doc? That would prevent such documents from being
fetched in the first place.
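
A rough sketch of how such a suffix rule would behave, written in Python rather than as an actual Nutch configuration. It assumes the regex URL filter applies its rules in order, that the first matching rule decides, and that a URL matching no rule is rejected. The rule list, the function `url_filter` and the example host are hypothetical.

```python
import re

# Rules in the style of regex-urlfilter.txt: '-' rejects, '+' accepts.
# The first rule whose pattern is found in the URL decides the outcome.
RULES = [
    ("-", re.compile(r"\.(pdf|docx|doc)$")),  # skip PDFs and Word documents
    ("+", re.compile(r".")),                  # accept everything else
]


def url_filter(url):
    """Return the URL if it is accepted, or None if it is filtered out."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return url if sign == "+" else None
    return None  # no rule matched: reject


for url in ("http://intranet.example/index.html",
            "http://intranet.example/report.pdf",
            "http://intranet.example/notes.docx"):
    print(url, "->", url_filter(url))
```

Note that the pattern is case-sensitive as written, so a URL ending in .PDF would still pass. A filter like this also keeps the documents from being indexed at all, since they are never fetched.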

Julien


On 11 July 2014 13:50, Harald Kirsch <[email protected]> wrote:

  Hi everyone,


In an intranet, I want Nutch to follow only links found in HTML (and
maybe Javascript, XHTML), but clearly not office documents and PDFs.

- I took out parse-tika from the plugin.includes.
- I took out everything related to tika in parse-plugins.xml.

But now I get

Error parsing: http:...docx: org.apache.nutch.parse.ParseException:
parser not found for contentType=application/x-tika-ooxml
url=http:....docx

I wonder what is wrong here. Do I need a catch-all in parse-plugins.xml?
What does the sneaky <plugin id="feed"/> for some <mimeType> elements
mean?

Regards,
Harald.

--
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49 211 53883-216
Fax +49-211-550266-19
http://www.raytion.com


