Tung Nguyen created TIKA-3034:
---------------------------------
Summary: Detector always returns text/plain when scanning
Mathematica files
Key: TIKA-3034
URL: https://issues.apache.org/jira/browse/TIKA-3034
Project: Tika
Issue Type: Bug
Components: detector
Affects Versions: 1.23
Reporter: Tung Nguyen
Fix For: 1.24
We are using Tika to implement our MIME type detection module. The library
does not appear to detect Mathematica files, although the documentation lists
the format as supported [1]. The Tika detector always returns `text/plain`
instead of `application/mathematica`, which is the type given in the
documentation and expected by the unit tests [2].
Doing the same check with the Python code below, we obtain the correct MIME
type for any Mathematica file downloaded from the Wolfram Library Archive [3].
{code:python}
#!/usr/bin/python3
import mimetypes
import sys

# Guess the MIME type from the file name given as the first argument.
test_file = sys.argv[1]
print(mimetypes.MimeTypes().guess_type(test_file)[0])
{code}
We therefore suspect there is a bug in the Tika detector when it tries to
guess the MIME type of Mathematica files.
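One caveat on the comparison: Python's `mimetypes` module guesses from the
file name alone and never reads the file, so the Python check exercises the
extension mapping rather than content-based detection. A minimal sketch of
that behaviour follows. The file name is hypothetical, and the `.nb` mapping
is registered explicitly (an assumption made here so the result does not
depend on the system's mime.types files):

```python
import mimetypes

# Assumption: register the mapping explicitly so the result does not
# depend on which mime.types files are installed on the system.
db = mimetypes.MimeTypes()
db.add_type("application/mathematica", ".nb")

# Hypothetical file name. The file does not need to exist, because
# guess_type() only inspects the name and never opens the file.
guessed_type, encoding = db.guess_type("does-not-exist.nb")
print(guessed_type)
```

If Tika is given only a stream without a file name, it has to rely on the
file contents, which may explain part of the difference between the two
results.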
References:
[1] [https://tika.apache.org/1.23/formats.html]
[2] [https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java#L64]
[3] [https://library.wolfram.com/infocenter/Courseware/4706/]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)