Exclude certain mime-types

2012-05-18 Thread Matthias Paul
How can I exclude certain mime-types from crawling, for example Word documents? If I have parse-tika in plugin.includes, it will parse them. Do I have to change parse-plugins.xml? I can't exclude them in regex-urlfilter because the .doc extension is not present in the URLs. Thanks, Matthias

RE: Exclude certain mime-types

2012-05-18 Thread Markus Jelsma
-----Original message-----
From: Matthias Paul magethle.nu...@gmail.com
Sent: Fri 18-May-2012 14:57
To: user@nutch.apache.org
Subject: Exclude certain mime-types

How can I exclude certain mime-types from crawling, for example Word documents? If I have parse-tika in plugin.includes