[ 
https://issues.apache.org/jira/browse/TIKA-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tung Nguyen updated TIKA-3034:
------------------------------
    Description: 
We are using Tika to implement our MIME type detection module. The library 
does not seem to detect Mathematica files, although the documentation lists 
them as supported [1]. The Tika detector always returns `text/plain` instead of 
`application/mathematica`, which is the type given in the documentation and in 
the unit tests [2].

Running the same check with the Python code below returns the correct MIME 
type for every Mathematica file we downloaded from the Wolfram Library Archive [3]. 
{code:python}
#!/usr/bin/python3
# Guess the MIME type of the file named on the command line.
import mimetypes
import sys

test_file = sys.argv[1]
print(mimetypes.MimeTypes().guess_type(test_file)[0])
{code}
We therefore suspect a bug in the way the Tika detector guesses MIME types 
for Mathematica files.
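
For context on the Python check above: the standard `mimetypes` module decides from the file name extension alone and never opens the file. The following is a minimal sketch of that behavior, not a statement about how Tika's detector works. It assumes a hypothetical file name `example.nb` and registers the `.nb` mapping explicitly, so the result does not depend on which MIME tables are installed on the system.

```python
import mimetypes

# Assumption for illustration: register the mapping explicitly so the result
# does not depend on the MIME tables available on the local system.
registry = mimetypes.MimeTypes()
registry.add_type("application/mathematica", ".nb")

# "example.nb" is a hypothetical name; the file does not need to exist,
# because guess_type() only inspects the name, not the contents.
guessed_type, encoding = registry.guess_type("example.nb")
print(guessed_type)
```

Because the guess is based on the name only, the Python result says nothing about the file's contents, so the two tools may not be performing the same kind of check.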

There is also an existing ticket requesting a Mathematica file detector: 
https://issues.apache.org/jira/browse/TIKA-1520

References:

 [1] [https://tika.apache.org/1.23/formats.html]

 [2] 
[https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java#L64]

 [3] [https://library.wolfram.com/infocenter/Courseware/4706/]

 

  was:
We are working with Tika to implement our mime types detection module. The 
library seemingly cannot detect Mathematica files although the documentation 
confirmed it does [1]. The Tika detector always returns `text/plain` instead of 
`application/mathematica` as described in the documentation as well as unit 
tests [2].

By doing the same need with Python code as below, we can obtain the right mime 
types for any Mathematica file downloaded from the Wolfram Library Archive [3]. 
{code:java}
#!/usr/bin/python3
import mimetypes, os, sys
test_file = sys.argv[1]
print(mimetypes.MimeTypes().guess_type(test_file)[0])
{code}
 Therefore, we suspected there is a bug in Tika detector where it tries to 
guess mime types for Mathematica files.

References:

 [1] [https://tika.apache.org/1.23/formats.html]

 [2] 
[https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java#L64]

[3] [https://library.wolfram.com/infocenter/Courseware/4706/]

 

         Labels: math  (was: )

> Detector always returns text/plain when scanning Mathematica files
> ------------------------------------------------------------------
>
>                 Key: TIKA-3034
>                 URL: https://issues.apache.org/jira/browse/TIKA-3034
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.23
>            Reporter: Tung Nguyen
>            Priority: Blocker
>              Labels: math
>             Fix For: 1.23
>
>
> We are using Tika to implement our MIME type detection module. The library 
> does not seem to detect Mathematica files, although the documentation lists 
> them as supported [1]. The Tika detector always returns `text/plain` instead 
> of `application/mathematica`, which is the type given in the documentation 
> and in the unit tests [2].
> Running the same check with the Python code below returns the correct MIME 
> type for every Mathematica file we downloaded from the Wolfram Library 
> Archive [3]. 
> {code:python}
> #!/usr/bin/python3
> # Guess the MIME type of the file named on the command line.
> import mimetypes
> import sys
>
> test_file = sys.argv[1]
> print(mimetypes.MimeTypes().guess_type(test_file)[0])
> {code}
> We therefore suspect a bug in the way the Tika detector guesses MIME types 
> for Mathematica files.
> There is also an existing ticket requesting a Mathematica file detector: 
> https://issues.apache.org/jira/browse/TIKA-1520
> References:
>  [1] [https://tika.apache.org/1.23/formats.html]
>  [2] 
> [https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java#L64]
>  [3] [https://library.wolfram.com/infocenter/Courseware/4706/]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
