Hi folks,
I'm using Nutch 1.11. I'd like to use parse-tika for HTML and perhaps a few other select content types, but nothing else. This doesn't appear to be possible without making changes in places beyond parse-plugins.xml.

Implementation details: in ParserFactory, if no parser is found for the given contentType and parse-tika *is* being used, parse-tika is automatically used as a fallback, because its plugin.xml file says it works with all contentTypes. This seems a bit underhanded, since in parse-plugins.xml I'm explicitly disabling the glob -> parse-tika mapping.

I haven't tested this, but I imagine I can work around it by changing parse-tika's plugin.xml to map to a subset of contentTypes rather than '*'.

Is this a bug, or just something that should be documented?

Thanks,
Joe
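
P.S. A rough model of the fallback described above, in case it helps the discussion. This is not the actual ParserFactory code: the class name ParserFallbackSketch, the resolve method and both maps are hypothetical stand-ins made up for illustration. It also assumes that a narrowed contentType value in plugin.xml would be matched as a regular expression, which has not been verified against Nutch 1.11.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified model of the fallback described above. NOT the actual Nutch source.
public class ParserFallbackSketch {

    // mappings: hypothetical stand-in for parse-plugins.xml (content type -> plugin ids).
    // declared: hypothetical stand-in for the contentType value that each
    //           active plugin declares in its own plugin.xml.
    static List<String> resolve(String contentType,
                                Map<String, List<String>> mappings,
                                Map<String, String> declared) {
        List<String> mapped = mappings.get(contentType);
        if (mapped != null) {
            return mapped; // an explicit mapping wins
        }
        // No explicit mapping: fall back to any active plugin whose declared
        // contentType covers this type.
        List<String> fallback = new ArrayList<>();
        for (Map.Entry<String, String> plugin : declared.entrySet()) {
            String pattern = plugin.getValue();
            // Assumption: a narrowed value is matched as a regular expression.
            if ("*".equals(pattern) || contentType.matches(pattern)) {
                fallback.add(plugin.getKey());
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Only text/html is mapped to parse-tika; the glob mapping is removed.
        Map<String, List<String>> mappings = Map.of("text/html", List.of("parse-tika"));

        // parse-tika declares '*': it is still picked up for an unmapped type.
        System.out.println(resolve("application/zip", mappings,
                Map.of("parse-tika", "*")));

        // parse-tika declares a subset: nothing is picked up for the unmapped type.
        System.out.println(resolve("application/zip", mappings,
                Map.of("parse-tika", "text/html|application/pdf")));
    }
}
```

In this model the first call returns parse-tika even though parse-plugins.xml never maps application/zip to it, and the second returns an empty list, which is the effect the plugin.xml workaround is meant to have.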