[
https://issues.apache.org/jira/browse/TIKA-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tung Nguyen updated TIKA-3034:
------------------------------
Description:
We are using Tika to implement our MIME type detection module. The library
does not seem to detect Mathematica files, although the documentation states
that it does [1]. The Tika detector always returns `text/plain` instead of
`application/mathematica`, which is the type given in the documentation and
expected by the unit tests [2].
Running the equivalent check with the Python code below returns the correct
MIME type for any Mathematica file downloaded from the Wolfram Library
Archive [3].
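{code:python}
#!/usr/bin/python3
import mimetypes
import sys

# Path of the file to check, passed as the first command-line argument.
test_file = sys.argv[1]
# guess_type() returns a (type, encoding) tuple; print only the MIME type.
print(mimetypes.MimeTypes().guess_type(test_file)[0])
{code}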
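As a note on what the script above measures: as far as we understand, `mimetypes.guess_type` looks only at the file name and never opens the file. The sketch below illustrates this under two assumptions: the `.nb` mapping is registered explicitly so the result does not depend on the host's mime.types files, and the file names are hypothetical and do not need to exist on disk.

```python
import mimetypes

# Private database, so the global mimetypes state is left untouched.
db = mimetypes.MimeTypes()
# Assumption for this sketch: register the mapping explicitly instead of
# relying on whatever the host's mime.types files happen to contain.
db.add_type("application/mathematica", ".nb")


def guess_by_name(filename):
    """Return the MIME type guessed from the file name alone."""
    return db.guess_type(filename)[0]


# Hypothetical names: neither file has to exist, because only the
# extension is inspected.
print(guess_by_name("lecture.nb"))
print(guess_by_name("lecture.txt"))
```

Because no file is opened, this shows only the name-to-type mapping and says nothing about the file contents.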
Therefore, we suspect there is a bug in how the Tika detector guesses MIME
types for Mathematica files.
There is also an existing ticket requesting a Mathematica file detector:
https://issues.apache.org/jira/browse/TIKA-1520
References:
[1] [https://tika.apache.org/1.23/formats.html]
[2] [https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java#L64]
[3] [https://library.wolfram.com/infocenter/Courseware/4706/]
Labels: math (was: )
> Detector always returns text/plain when scanning Mathematica files
> ------------------------------------------------------------------
>
> Key: TIKA-3034
> URL: https://issues.apache.org/jira/browse/TIKA-3034
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.23
> Reporter: Tung Nguyen
> Priority: Blocker
> Labels: math
> Fix For: 1.23
--
This message was sent by Atlassian Jira
(v8.3.4#803005)