[ 
https://issues.apache.org/jira/browse/TIKA-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362049#comment-14362049
 ] 

Tyler Palsulich commented on TIKA-1116:
---------------------------------------

Any update on this? If any Office file can start with this magic, it doesn't 
seem correct for Tika to restrict it to just DOC.

> Wrong detection of XLS/Doc file
> -------------------------------
>
>                 Key: TIKA-1116
>                 URL: https://issues.apache.org/jira/browse/TIKA-1116
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.3, 1.4
>            Reporter: Petr Pytelka
>              Labels: DOC, XLS
>
> My issue:
> I have a valid XLS file, and it is detected as DOC.
> Cause:
> tika-mimetypes.xml contains the lines:
>   <mime-type type="application/msword">
> ..
>       <match value="\320\317\021\340\241\261\032\341" type="string" 
> offset="0"/>
> ..
>   </mime-type>
> According to the MS documentation, this prefix can appear in any Compound 
> Binary file (DOC, XLS, PPT and others).
> The documentation is here: 
> http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/WindowsCompoundBinaryFileFormatSpecification.pdf
>  (see section 2.1, Header)
> My proposal is to remove the line
>       <match value="\320\317\021\340\241\261\032\341" type="string" 
> offset="0"/>
> from tika-mimetypes.xml.
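
As a rough illustration of the point above (this is not Tika's actual detection
code; the class and method names are hypothetical), the octal escapes in the
match value decode to the eight bytes D0 CF 11 E0 A1 B1 1A E1. A check on those
bytes alone accepts any Compound Binary file, so it cannot tell DOC from XLS:

```java
import java.util.Arrays;

public class Ole2MagicSketch {
    // The match value \320\317\021\340\241\261\032\341, written as hex bytes.
    public static final byte[] OLE2_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // Hypothetical helper: true if the buffer starts with the signature.
    public static boolean hasOle2Signature(byte[] header) {
        if (header == null || header.length < OLE2_SIGNATURE.length) {
            return false;
        }
        byte[] prefix = Arrays.copyOf(header, OLE2_SIGNATURE.length);
        return Arrays.equals(prefix, OLE2_SIGNATURE);
    }

    public static void main(String[] args) {
        // Simulated headers (assumption: made-up trailing bytes stand in for
        // the rest of a DOC and an XLS file; only the first 8 bytes matter).
        byte[] fakeDoc = Arrays.copyOf(OLE2_SIGNATURE, 16);
        byte[] fakeXls = Arrays.copyOf(OLE2_SIGNATURE, 16);
        fakeDoc[8] = 1;
        fakeXls[8] = 2;

        // Both match, so the signature alone cannot pick a specific type.
        System.out.println(hasOle2Signature(fakeDoc));
        System.out.println(hasOle2Signature(fakeXls));
    }
}
```

Telling the formats apart needs a look at the streams inside the container
(for example, a WordDocument or Workbook stream), not just the first 8 bytes.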



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
