[ 
https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167043#comment-13167043
 ] 

Nick Burch commented on TIKA-806:
---------------------------------

The file format allows for the directory entries to occur at any point within 
the file, so you're correct that the only fully reliable way to detect the 
format is to open up the OLE2 container and see what the contents are.

However, the directory listing is often stored in the first couple of blocks, 
so certain files can be detected from that prefix alone, without needing to 
open up the whole file and process it.
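
As a rough illustration of that prefix-only check, here is a minimal sketch 
that lists the entry names in the first directory sector of an OLE2 file. It 
is not Tika or POI code: the class and method names are hypothetical, and the 
header offsets (sector shift at byte 30, first directory sector at byte 48, 
128-byte directory entries with the name length at byte 64 and the object 
type at byte 66) are taken from my reading of the compound file layout. It 
deliberately does not follow the FAT chain, which is exactly why it can miss 
entries or see only part of the picture.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Tika or POI. */
final class Ole2DirectoryPeek {
    private static final long OLE2_SIGNATURE = 0xD0CF11E0A1B11AE1L;
    private static final int ENTRY_SIZE = 128; // assumed directory entry size

    /**
     * Lists the names found in the first directory sector only.
     * Returns an empty list if the prefix is too short or is not OLE2.
     */
    static List<String> firstSectorNames(byte[] prefix) {
        List<String> names = new ArrayList<>();
        if (prefix.length < 512) {
            return names;
        }
        ByteBuffer buf = ByteBuffer.wrap(prefix);
        // ByteBuffer is big-endian by default, so this compares the raw
        // bytes d0 cf 11 e0 a1 b1 1a e1 in file order.
        if (buf.getLong(0) != OLE2_SIGNATURE) {
            return names;
        }
        buf.order(ByteOrder.LITTLE_ENDIAN);
        int sectorShift = buf.getShort(30) & 0xFFFF; // 9 means 512-byte sectors
        if (sectorShift != 9 && sectorShift != 12) {
            return names;
        }
        int sectorSize = 1 << sectorShift;
        long firstDirSector = buf.getInt(48) & 0xFFFFFFFFL;
        // Sector 0 starts right after the header, hence the +1.
        long start = (firstDirSector + 1) * sectorSize;
        long end = Math.min(start + sectorSize, (long) prefix.length);
        for (long off = start; off + ENTRY_SIZE <= end; off += ENTRY_SIZE) {
            int pos = (int) off;
            int nameBytes = buf.getShort(pos + 64) & 0xFFFF; // includes 2-byte terminator
            int type = prefix[pos + 66];
            if (type == 0 || nameBytes < 2 || nameBytes > 64) {
                continue; // unused or malformed entry
            }
            names.add(new String(prefix, pos, nameBytes - 2, StandardCharsets.UTF_16LE));
        }
        return names;
    }
}
```

With 512-byte sectors and the directory starting in sector 1, the second 
entry sits at offset 1152 (0x0480), which matches where the report below 
finds "Workbook". Names belonging to embedded documents can show up in the 
directory as well, so a list like this is a hint at best.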

We now prefer the container detectors over the mimetype ones by default, when 
using DefaultDetector, so this shouldn't be an issue on trunk. Is it?
                
> MS Word Detection magics are a bit overzealous
> ----------------------------------------------
>
>                 Key: TIKA-806
>                 URL: https://issues.apache.org/jira/browse/TIKA-806
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>         Attachments: tika-806-ver2.patch
>
>
> tika-mimetypes.xml contains the following magic for MS Word:
> {noformat}
> <match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
> <match value="W\x00o\x00r\x00d\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t" 
> type="string" offset="1152:4096" />
> </match>
> {noformat}
> So if a file is an MS Office document (parent Office magic) and has the 
> WordDocument string within the given offsets, then it's Word. I have a few 
> (regrettably confidential) counterexamples of MS Excel files with embedded 
> Word documents. For instance one has "Workbook" (with 0x00 between 
> characters) at offset 0x0480 and "WordDocument" (0x00's between characters) 
> at offset 0x0B80. This is an Excel file, which does meet the above-mentioned 
> magic criterion. Returning x-tika-msoffice would dispatch the file to the 
> POI detector, which would return the correct answer.
> I vote for removing that magic. I took a look at some of my files and it 
> seems that "WordDocument" and "Workbook" strings do occur at various offsets. 
> The presence of embedded documents makes detection by those strings 
> unreliable.
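
The counterexample quoted above can be reproduced with synthetic bytes. The 
sketch below is not Tika's matcher: the class and method names are 
hypothetical, and it assumes that an offset range such as 1152:4096 means 
the pattern may start at any offset within that range. It builds a buffer 
with the OLE2 signature, "Workbook" at 0x0480 and "WordDocument" at 0x0B80, 
and checks it against the two nested match rules.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical demo for illustration only; not Tika's magic matcher. */
final class WordMagicDemo {
    static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** True if pattern starts at some offset in [from, to] (assumed inclusive). */
    static boolean matchInRange(byte[] data, byte[] pattern, int from, int to) {
        for (int start = from; start <= to && start + pattern.length <= data.length; start++) {
            int i = 0;
            while (i < pattern.length && data[start + i] == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return true;
            }
        }
        return false;
    }

    /** Emulates the quoted rule: OLE2 signature at 0:8, then the name at 1152:4096. */
    static boolean wordMagicMatches(byte[] data) {
        byte[] utf16 = "WordDocument".getBytes(StandardCharsets.UTF_16LE);
        // The pattern in the XML ends at 't', without the trailing 0x00.
        byte[] name = Arrays.copyOf(utf16, utf16.length - 1);
        return matchInRange(data, OLE2, 0, 8) && matchInRange(data, name, 1152, 4096);
    }

    /** Synthetic stand-in for the confidential Excel file described in the report. */
    static byte[] syntheticExcelWithEmbeddedWord() {
        byte[] data = new byte[8192];
        System.arraycopy(OLE2, 0, data, 0, OLE2.length);
        byte[] workbook = "Workbook".getBytes(StandardCharsets.UTF_16LE);
        byte[] wordDoc = "WordDocument".getBytes(StandardCharsets.UTF_16LE);
        System.arraycopy(workbook, 0, data, 0x0480, workbook.length);
        System.arraycopy(wordDoc, 0, data, 0x0B80, wordDoc.length);
        return data;
    }
}
```

Calling WordMagicDemo.wordMagicMatches on the synthetic buffer returns true 
even though the first name in the buffer is "Workbook", which is the false 
positive the report describes.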

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
