[ https://issues.apache.org/jira/browse/TIKA-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002457#comment-14002457 ]

Nick Burch commented on TIKA-1292:
----------------------------------

Thanks for that, I've used it to write a (currently disabled) failing unit
test in r1596068. Will aim to review your patches in a day or two, unless
someone else beats me to it!

> Inconsistent priorities in bundled tika-mimetypes.xml
> -----------------------------------------------------
>
>                 Key: TIKA-1292
>                 URL: https://issues.apache.org/jira/browse/TIKA-1292
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.5
>            Reporter: Cservenak, Tamas
>
> It seems that mime-type priorities are a bit inconsistent in the tika-core
> bundled tika-mimetypes.xml
> A few examples:
> *
> [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497]
> vs
> [application/x-7z-compressed|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3510]:
> both are similar "container" archive formats (structured, having entries),
> with distinct file extensions ("zip" vs "7z" globs), yet their priorities
> are 40 and 50 respectively.
> *
> [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497]
> vs
> [text/html|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L4713]:
> not quite related MIME types, having the same priority of 40. But ZIP files
> can be "uncompressed" (meaning entries are mostly "concatenated", and their
> content, if plaintext, is readable). Hence, an "uncompressed" ZIP (or any
> subclass like JAR) file that contains zipped-up HTML files might/will be
> detected as HTML, which is wrong.
> And this is what happens in Nexus, which uses Tika under the hood for
> "content" validation, basically using the MIME magic detection provided by
> the Tika Detector: the Java JAR {{com.intellij:annotations:7.0.3}}
> ([link|http://repo1.maven.org/maven2/com/intellij/annotations/7.0.3/]) is
> being detected as {{text/html}} instead of (expected)
> {{application/java-archive}}.
> The reason is the following: the JAR file is zipped up in "uncompressed"
> zip format, and among a few annotations it also contains one HTML file
> entry (the license I guess). Since both MIME types have the same priority
> (40), I guess Tika "randomly" chooses {{text/html}}.
> Original Nexus issue:
> https://issues.sonatype.org/browse/NEXUS-6560
> On the Nexus issue there is a GH Pull Request that solves the problem for
> us (by raising the {{application/zip}} priority to 41).
> But by inspecting the bundled tika-mimetypes.xml we spotted other --
> probably -- priority inconsistencies, like that of zip vs 7z mentioned
> above.
> Note: this happens when using tika-core solely on the classpath and using
> it for MIME magic detection. Interestingly, when tika-parsers (with all its
> dependencies) is added to the classpath, Tika will properly figure out that
> the artifact is {{application/java-archive}}. Still, our use case in Nexus
> requires the MIME magic detection only, so we do not use tika-parsers, nor
> would we like to do so.
> Sample project to reproduce:
> https://github.com/cstamas/tika-1292

--
This message was sent by Atlassian JIRA
(v6.2#6252)
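
To illustrate the "uncompressed ZIP" point in the issue description, here is a minimal sketch that uses only java.util.zip and does not call Tika at all. It builds an in-memory archive with a single STORED entry and checks two things on the raw bytes: the archive still starts with the ZIP local-header signature, and the HTML markup of the entry is readable verbatim inside the archive. That combination is presumably what lets an HTML magic pattern match alongside the ZIP one. The class name, the helper names, the entry name and the HTML content are all made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; not part of Tika or Nexus.
class StoredZipDemo {

    /** Builds an in-memory ZIP whose single entry is STORED (no compression). */
    public static byte[] buildStoredZip(String entryName, byte[] content) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            ZipEntry entry = new ZipEntry(entryName);
            entry.setMethod(ZipEntry.STORED);
            // STORED entries need size and CRC-32 set before putNextEntry().
            entry.setSize(content.length);
            entry.setCompressedSize(content.length);
            CRC32 crc = new CRC32();
            crc.update(content);
            entry.setCrc(crc.getValue());
            zip.putNextEntry(entry);
            zip.write(content);
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    /** True if the data begins with the ZIP local file header signature "PK\3\4". */
    public static boolean startsWithZipMagic(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    /** True if the ASCII needle occurs anywhere in the raw bytes. */
    public static boolean containsAscii(byte[] data, String needle) {
        // ISO-8859-1 maps every byte to exactly one char, so this is a byte search.
        return new String(data, StandardCharsets.ISO_8859_1).contains(needle);
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><body>License text</body></html>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] zip = buildStoredZip("license.html", html);

        System.out.println("starts with ZIP magic: " + startsWithZipMagic(zip));
        System.out.println("contains raw <html>:   " + containsAscii(zip, "<html>"));
    }
}
```

With a DEFLATED entry the markup would normally not appear verbatim in the archive bytes, which fits the report that only "uncompressed" archives are affected. How Tika actually breaks a tie between two matching types of equal priority is not shown here; the sketch only demonstrates why both patterns can match the same file.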