[ https://issues.apache.org/jira/browse/TIKA-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362049#comment-14362049 ]
Tyler Palsulich commented on TIKA-1116:
---------------------------------------

Any update on this? If any Office file can have this magic, it doesn't seem correct to restrict it to just DOC in Tika.

> Wrong detection of XLS/Doc fil
> ------------------------------
>
>                 Key: TIKA-1116
>                 URL: https://issues.apache.org/jira/browse/TIKA-1116
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.3, 1.4
>            Reporter: Petr Pytelka
>              Labels: DOC,, XLS
>
> My issue:
> I have a valid XLS file, and this file is detected as DOC.
> Cause:
> tika-mimetypes.xml contains these lines:
> <mime-type type="application/msword">
> ..
>   <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
> ..
> </mime-type>
> According to the MS documentation, this prefix can appear in any Compound Binary file (DOC, XLS, PPT and others).
> The documentation is here:
> http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/WindowsCompoundBinaryFileFormatSpecification.pdf
> (look at 2.1 Header)
> My proposal is to remove the line
> <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
> from tika-mimetypes.xml.
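
To illustrate why the 8-byte prefix cannot separate DOC from XLS, here is a minimal Java sketch. It is not Tika code: the class `Ole2Magic`, the method `hasOle2Signature`, and the two in-memory buffers are hypothetical stand-ins. The signature bytes are the octal values from the `<match>` rule above, written in hex. Both buffers start with the same compound-file header, so a check at offset 0 accepts both.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tika: checks only the 8-byte
// compound-file signature quoted in the issue.
final class Ole2Magic {

    // \320\317\021\340\241\261\032\341 (octal) == D0 CF 11 E0 A1 B1 1A E1 (hex)
    static final byte[] OLE2_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns true if the buffer starts with the signature at offset 0.
    static boolean hasOle2Signature(byte[] head) {
        if (head.length < OLE2_SIGNATURE.length) {
            return false;
        }
        byte[] prefix = Arrays.copyOf(head, OLE2_SIGNATURE.length);
        return Arrays.equals(prefix, OLE2_SIGNATURE);
    }

    // Builds a buffer that begins with the signature, followed by filler.
    // The filler stands in for whatever the real container holds.
    static byte[] fakeContainer(byte filler) {
        byte[] data = new byte[512];
        Arrays.fill(data, filler);
        System.arraycopy(OLE2_SIGNATURE, 0, data, 0, OLE2_SIGNATURE.length);
        return data;
    }

    public static void main(String[] args) {
        byte[] pretendDoc = fakeContainer((byte) 0x01); // stand-in for a DOC
        byte[] pretendXls = fakeContainer((byte) 0x02); // stand-in for an XLS

        // Both match, so the prefix alone says "compound file", not "Word".
        System.out.println("doc-like matches: " + hasOle2Signature(pretendDoc));
        System.out.println("xls-like matches: " + hasOle2Signature(pretendXls));
    }
}
```

Telling the formats apart would need a look inside the container rather than at its first 8 bytes, which is why binding this magic to application/msword alone misdetects XLS files.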