MS Word Detection magics are a bit overzealous
----------------------------------------------

                 Key: TIKA-806
                 URL: https://issues.apache.org/jira/browse/TIKA-806
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 1.1
            Reporter: Antoni Mylka


tika-mimetypes.xml contains the following magic for MS Word:

{noformat}
<match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
  <match value="W\x00o\x00r\x00d\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t"
         type="string" offset="1152:4096" />
</match>
{noformat}

So if a file is an MS Office document (parent Office magic) and has the 
WordDocument string within the given offsets, then it is detected as Word. I 
have a few (regrettably confidential) counterexamples: MS Excel files with 
embedded Word documents. For instance, one has "Workbook" (with 0x00 between 
characters) at offset 0x0480 and "WordDocument" (with 0x00 between characters) 
at offset 0x0B80. This is an Excel file, yet it meets the above-mentioned magic 
criterion and is therefore misdetected as Word. Returning x-tika-msoffice 
instead would dispatch the file to the POI detector, which would return the 
correct answer.
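
Since I cannot attach the real files, here is a rough sketch of the failure on 
a synthetic buffer. It is not Tika code: the helper names are made up, the 
buffer is not a valid OLE2 file, and I am assuming that offset="a:b" means the 
pattern may start at any offset from a to b inclusive.

```python
# Hypothetical re-implementation of the magic above, for illustration only.
OLE2_HEADER = bytes.fromhex("d0cf11e0a1b11ae1")


def magic_bytes(name):
    # "W\x00o\x00...n\x00t": UTF-16LE without the final 0x00, as in the XML.
    return name.encode("utf-16-le")[:-1]


def found_in_range(data, needle, start, end):
    # Assumption: the needle may begin at any offset in [start, end].
    return data.find(needle, start, end + len(needle)) != -1


def word_magic_matches(data):
    # Parent Office magic AND the nested WordDocument magic.
    return (found_in_range(data, OLE2_HEADER, 0, 8)
            and found_in_range(data, magic_bytes("WordDocument"), 1152, 4096))


def synthetic_excel_with_embedded_word():
    # Mimics only the offsets reported above; everything else is zero bytes.
    buf = bytearray(4096 + 64)
    buf[0:8] = OLE2_HEADER
    workbook = magic_bytes("Workbook")
    buf[0x0480:0x0480 + len(workbook)] = workbook
    word = magic_bytes("WordDocument")
    buf[0x0B80:0x0B80 + len(word)] = word
    return bytes(buf)


sample = synthetic_excel_with_embedded_word()
print("Word magic matches:", word_magic_matches(sample))
print("Workbook also in range:",
      found_in_range(sample, magic_bytes("Workbook"), 1152, 4096))
```

Both checks come out true for this buffer, so the string alone cannot tell an 
Excel file with an embedded Word document from a real Word file.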

I vote for removing that magic. I took a look at some of my files and it seems 
that "WordDocument" and "Workbook" strings do occur at various offsets. The 
presence of embedded documents makes detection by those strings unreliable.
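
For anyone who wants to check their own files, something along these lines 
lists where the two names occur. This is a throwaway sketch, not part of Tika; 
the function names are made up, and it does a raw byte search rather than 
parsing the OLE2 directory.

```python
def find_all(data, needle):
    # Every offset at which needle starts in data.
    offsets = []
    pos = data.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(needle, pos + 1)
    return offsets


def stream_name_offsets(data, names=("WordDocument", "Workbook")):
    # The names are stored with 0x00 between characters, i.e. UTF-16LE.
    return {name: find_all(data, name.encode("utf-16-le")) for name in names}


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        for name, offsets in stream_name_offsets(f.read()).items():
            print(name, [hex(o) for o in offsets])
```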


        
