[
https://issues.apache.org/jira/browse/TIKA-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008652#comment-14008652
]
Hudson commented on TIKA-1292:
--
FAILURE: Integrated in tika-trunk-jdk1.7 #3 (See
[https://builds.apache.org/job/tika-trunk-jdk1.7/3/])
TIKA-1292 If there is more than one mime magic which matches at the highest
priority, keep track and then try to pick based on filename or type hint later
(nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596612)
* /tika/trunk/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java
*
/tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
*
/tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeTypesReaderTest.java
Set an explicit priority on the OLE2 match, remove two MS Word matches which
were OLE2 ones in disguise, and add an intermediate staroffice parent on the
staroffice types. Helps with TIKA-1292 testing (nick:
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596611)
*
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Add a disabled unit test for TIKA-1292, which when working will ensure that if
we have two matching magics at the same priority, the name is used to
specialise if possible, first defined if not (nick:
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596593)
*
/tika/trunk/tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
*
/tika/trunk/tika-core/src/test/resources/org/apache/tika/mime/custom-mimetypes.xml
Container formats with specific, low-false-positive magic matches need a
slightly higher priority, so that they don't accidentally end up being matched
based on the contents of the container near the start of the file. Partly
solves TIKA-1292. This closes #6 github pull request (nick:
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596590)
*
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
*
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
Add some notes on entries, to help people maintaining the file know what to do,
related to TIKA-1292 (nick:
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1596586)
*
/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Inconsistent priorities in bundled tika-mimetypes.xml
-
Key: TIKA-1292
URL: https://issues.apache.org/jira/browse/TIKA-1292
Project: Tika
Issue Type: Bug
Components: mime
Affects Versions: 1.5
Reporter: Cservenak, Tamas
Fix For: 1.6
It seems that mime-type priorities are a bit inconsistent in the tika-core
bundled tika-mimetypes.xml
A few examples:
*
[application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497]
vs
[application/x-7z-compressed|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3510]:
both are similar container archive formats (structured, having entries)
with distinct file extensions (zip vs 7z globs), yet their priorities are
40 and 50 respectively.
*
[application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497]
vs
[text/html|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L4713]:
not closely related MIME types, yet both have the same priority of 40. But
ZIP files can be uncompressed (meaning entries are mostly concatenated, and
their content, if plaintext, is readable). Hence, an uncompressed ZIP (or
any subclass like JAR) file that contains HTML files might be detected as
HTML, which is wrong.
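
The point about uncompressed archives can be checked with the JDK alone. The
sketch below is not Tika code; {{StoredZipDemo}}, {{storedZip}} and
{{indexOf}} are hypothetical names introduced here for illustration. It writes
a single HTML entry using the STORED method and shows that the markup bytes
appear verbatim in the archive shortly after the PK signature, which is
presumably where content-based magic for HTML can find them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class StoredZipDemo {

    /** Builds an in-memory ZIP whose single entry is STORED (not deflated). */
    static byte[] storedZip(String entryName, byte[] content) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            ZipEntry entry = new ZipEntry(entryName);
            entry.setMethod(ZipEntry.STORED);
            // STORED entries must carry size and CRC before putNextEntry.
            CRC32 crc = new CRC32();
            crc.update(content);
            entry.setSize(content.length);
            entry.setCompressedSize(content.length);
            entry.setCrc(crc.getValue());
            zip.putNextEntry(entry);
            zip.write(content);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Returns the index of the first occurrence of needle, or -1 if absent. */
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>License</body></html>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] zip = storedZip("license.html", html);

        // The archive still begins with the ZIP local file header signature.
        System.out.println("starts with PK: " + (zip[0] == 'P' && zip[1] == 'K'));
        // The HTML bytes are present unmodified inside the archive.
        System.out.println("offset of HTML content: " + indexOf(zip, html));
    }
}
```

With a deflated entry the same search would normally fail, which is why the
problem only shows up for archives written without compression.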
And this is what happens in Nexus, which uses Tika under the hood for content
validation, basically using the MIME magic detection provided by the Tika Detector:
the Java JAR {{com.intellij:annotations:7.0.3}}
([link|http://repo1.maven.org/maven2/com/intellij/annotations/7.0.3/]) is
being detected as {{text/html}} instead of (expected)
{{application/java-archive}}.
The reason is as follows: the JAR file is zipped up in uncompressed zip format,
and among a few annotations it also contains one HTML file entry (the license, I
guess). Since both MIME types have the same priority (40), I guess Tika
randomly chooses {{text/html}}.
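
To make the tie concrete, here is a minimal sketch of the rule described in the
commit notes earlier in this thread: among the magic matches at the highest
priority, prefer the one the file name points to, otherwise fall back to the
first one defined. This is not Tika's actual implementation;
{{MagicTieBreakDemo}}, {{Candidate}} and {{pick}} are hypothetical names, and
the extension lists are simplified stand-ins for the real glob definitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class MagicTieBreakDemo {

    /** Hypothetical stand-in for one magic match; not a Tika class. */
    static final class Candidate {
        final String type;
        final int priority;
        final List<String> extensions;

        Candidate(String type, int priority, String... extensions) {
            this.type = type;
            this.priority = priority;
            this.extensions = Arrays.asList(extensions);
        }
    }

    /**
     * Picks a type from magic matches listed in definition order.
     * The highest priority wins; a tie is resolved by file name when one
     * is available, otherwise the first defined candidate is used.
     */
    static String pick(List<Candidate> matches, String fileName) {
        if (matches.isEmpty()) {
            return "application/octet-stream";
        }
        int best = Integer.MIN_VALUE;
        for (Candidate c : matches) {
            best = Math.max(best, c.priority);
        }
        List<Candidate> top = new ArrayList<>();
        for (Candidate c : matches) {
            if (c.priority == best) {
                top.add(c);
            }
        }
        if (top.size() > 1 && fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Candidate c : top) {
                for (String ext : c.extensions) {
                    if (lower.endsWith(ext)) {
                        return c.type;
                    }
                }
            }
        }
        return top.get(0).type;
    }

    public static void main(String[] args) {
        List<Candidate> tie = Arrays.asList(
                new Candidate("text/html", 40, ".html", ".htm"),
                new Candidate("application/zip", 40, ".zip"));
        System.out.println(pick(tie, "bundle.zip")); // application/zip
        System.out.println(pick(tie, null));         // text/html

        // Raising the ZIP priority by one removes the tie altogether.
        List<Candidate> raised = Arrays.asList(
                new Candidate("text/html", 40, ".html", ".htm"),
                new Candidate("application/zip", 41, ".zip"));
        System.out.println(pick(raised, null));      // application/zip
    }
}
```

The last case shows why a one-point priority bump is enough to avoid the
misdetection: once the priorities differ, the name-based fallback is never
consulted.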
Original Nexus issue
https://issues.sonatype.org/browse/NEXUS-6560
In the Nexus issue there is a GitHub pull request that solves the problem for
us (by raising the {{application/zip}} priority to 41).
But by inspecting the bundled tika-mimetypes.xml we spotted other -- probably
-- priority inconsistencies, like that of zip vs 7z mentioned above.
Note: this happens when using