Hi all,

I just created an issue
https://issues.apache.org/jira/browse/TIKA-1292

In short: the Tika Detector detects a JAR file (a valid ZIP file, with
proper magic bytes, etc.) as "text/html" instead of the expected
"application/java-archive".
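
To make "proper magic bytes" concrete, here is a minimal sketch of how one
could check that a file starts with the ZIP local file header signature
("PK" followed by 0x03 0x04). It is plain Java, not Tika's own detection
code, and the class and method names (ZipMagicCheck, startsWithZipMagic)
are hypothetical, made up for illustration only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika: checks only the leading ZIP signature.
final class ZipMagicCheck {
    // ZIP local file header signature: 'P' 'K' 0x03 0x04
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    private ZipMagicCheck() {
    }

    // Returns true if the given leading bytes begin with the ZIP signature.
    static boolean startsWithZipMagic(byte[] head) {
        if (head.length < ZIP_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < ZIP_MAGIC.length; i++) {
            if (head[i] != ZIP_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads the first four bytes of the file and checks them.
    static boolean startsWithZipMagic(Path file) throws IOException {
        byte[] head = new byte[ZIP_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                filled += n;
            }
        }
        return filled == head.length && startsWithZipMagic(head);
    }
}
```

A file that passes this check and is named *.jar is the kind of input the
issue is about.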

The reason is clear to me (we already created a PR in the Nexus project for
that), but the interesting thing that bothers me is _why_ the Detector
behaves correctly with tika-parsers on the classpath.

How does the presence of tika-parsers affect MIME magic detection and,
most interestingly, why does it affect it? (I am aware of the added
org.apache.tika.parser.pkg.ZipContainerDetector.)

Isn't MIME magic detection based on the bundled tika-mimetypes.xml, where
even the globs defined for text/html (*.htm and *.html) do not match the
JAR file above (*.jar)? Still, Tika selects the HTML MIME type.


Thanks,
~t~
