Hi folks,

 

I'm using Nutch 1.11. What I'd like to do is use parse-tika for HTML and
maybe a select few other content types, but nothing else. This doesn't
appear to be possible without making changes in places beyond
parse-plugins.xml.

 

Implementation details: In ParserFactory, if no parser is mapped to the
given contentType and parse-tika *is* enabled, parse-tika is picked up
automatically as a fallback, since its plugin.xml file declares that it
works with all contentTypes.

 

This seems a bit underhanded, since in parse-plugins.xml I'm explicitly
disabling the glob -> parse-tika mapping. I haven't tested this, but I
imagine I can work around it by changing parse-tika's plugin.xml to map to
a subset of contentTypes rather than '*'.
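
For concreteness, the workaround would look roughly like the fragment
below. This is an untested sketch: the element and attribute names are
what I recall from the stock parse-tika plugin.xml, the content types
listed are only an example subset, and I'm assuming the contentType value
can be a '|'-separated list in the same way other parse plugins declare
theirs.

```xml
<!-- parse-tika's plugin.xml (excerpt; untested sketch) -->
<extension id="org.apache.nutch.parse.tika"
           name="TikaParser"
           point="org.apache.nutch.parse.Parser">
  <implementation id="org.apache.nutch.parse.tika.TikaParser"
                  class="org.apache.nutch.parse.tika.TikaParser">
    <!-- Stock value is "*". The list below is an assumed example subset. -->
    <parameter name="contentType"
               value="text/html|application/xhtml+xml|application/pdf"/>
  </implementation>
</extension>
```

If that works as I expect, any contentType outside the list should no
longer fall through to parse-tika.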

 

Is this a bug or just something that should be documented?

 

Thanks,

Joe
