[ https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka updated TIKA-806:
------------------------------

    Attachment: tika-806.patch

A patch which removes those magics from tika-mimetypes.xml.
                
> MS Word Detection magics are a bit overzealous
> ----------------------------------------------
>
>                 Key: TIKA-806
>                 URL: https://issues.apache.org/jira/browse/TIKA-806
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>         Attachments: tika-806.patch
>
>
> tika-mimetypes.xml contains the following magic for MS Word:
> {noformat}
> <match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
> <match value="W\x00o\x00r\x00d\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t" 
> type="string" offset="1152:4096" />
> </match>
> {noformat}
> So if a file is an MS Office document (parent Office magic) and contains the 
> WordDocument string within the given offsets, it is detected as Word. I have 
> a few (regrettably confidential) counterexamples: MS Excel files with embedded 
> Word documents. For instance, one has "Workbook" (with 0x00 between 
> characters) at offset 0x0480 and "WordDocument" (again with 0x00 between 
> characters) at offset 0x0B80. This is an Excel file, yet it meets the 
> above-mentioned magic criterion. Returning x-tika-msoffice instead would 
> dispatch the file to the POI detector, which would return the correct answer.
> I vote for removing that magic. I took a look at some of my files and it 
> seems that "WordDocument" and "Workbook" strings do occur at various offsets. 
> The presence of embedded documents makes detection by those strings 
> unreliable.
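
The false positive described above can be reproduced with a small Python sketch. This is not Tika's matcher, only an approximation of the rule quoted in the issue. The buffer is synthetic (it is not one of the confidential files), the helper names are made up for this illustration, and the offset range 1152:4096 is assumed to mean that the match may start at any position from 1152 to 4096 inclusive.

```python
# Sketch only: mimics the quoted magic rule, not Tika's real implementation.
OLE2_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")

# The magic value ends with "t" and no trailing 0x00, so drop the last byte
# of the UTF-16LE encoding to mirror it.
WORD_NEEDLE = "WordDocument".encode("utf-16-le")[:-1]
WORKBOOK_NEEDLE = "Workbook".encode("utf-16-le")[:-1]


def find_in_window(data: bytes, needle: bytes, first: int, last: int) -> int:
    """Return the first offset in [first, last] where needle starts, or -1."""
    # bytes.find requires the whole needle to fit before `end`,
    # so extend the window by the needle length.
    return data.find(needle, first, last + len(needle))


def matches_word_magic(data: bytes) -> bool:
    """Parent OLE2 signature at offset 0, plus WordDocument in 1152:4096."""
    if not data.startswith(OLE2_SIGNATURE):
        return False
    return find_in_window(data, WORD_NEEDLE, 1152, 4096) != -1


def build_sample() -> bytes:
    """Synthetic buffer with the layout reported for the Excel counterexample."""
    buf = bytearray(0x1000)
    buf[0:8] = OLE2_SIGNATURE
    buf[0x480:0x480 + len(WORKBOOK_NEEDLE)] = WORKBOOK_NEEDLE
    buf[0xB80:0xB80 + len(WORD_NEEDLE)] = WORD_NEEDLE
    return bytes(buf)


sample = build_sample()
print("Word magic matches:", matches_word_magic(sample))
print("Workbook found at:", hex(find_in_window(sample, WORKBOOK_NEEDLE, 0, len(sample))))
```

Under these assumptions the rule fires on a buffer that also carries "Workbook" at 0x0480, which is why the string alone cannot distinguish a Word file from an Excel file with an embedded Word document.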

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
