[jira] [Updated] (TIKA-806) MS Word Detection magics are a bit overzealous

Antoni Mylka (Updated) (JIRA) Mon, 12 Dec 2011 07:34:06 -0800

     [ 
https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Antoni Mylka updated TIKA-806:
------------------------------

    Attachment: tika-806-ver3.zip

It turns out that XLR files are not detected by POIFSContainerDetector. 
With the third version of the patch they are. This should probably be reported 
as a separate issue, but the two problems are difficult to separate.

Both boil down to the same thing: MimeTypes should not "guess" the concrete 
type of an MS Office document, because there are two cases where it returns 
the wrong answer.

# A document with another document embedded in it. The result depends on 
the ordering of matchers, as in TIKA-391.
# A Works 7.0 Spreadsheet document is detected as Excel, when it should 
be passed to the container detector.
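
The first case can be reproduced without a real file. The sketch below is a 
stand-alone approximation of the magic rule quoted in the issue description, 
not Tika's actual matcher. The class and method names are hypothetical, the 
buffer is synthetic rather than a valid OLE2 file, and it assumes the offset 
range bounds the position where the pattern may start.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class WordMagicSketch {
    // OLE2 compound document signature (the parent Office magic).
    static final byte[] OLE2_HEADER = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if pattern starts at any offset in [first, last] (assumed semantics).
    static boolean foundInRange(byte[] data, byte[] pattern, int first, int last) {
        for (int start = first;
             start <= last && start + pattern.length <= data.length;
             start++) {
            byte[] window = Arrays.copyOfRange(data, start, start + pattern.length);
            if (Arrays.equals(window, pattern)) {
                return true;
            }
        }
        return false;
    }

    // Name with 0x00 between characters and no trailing 0x00, as in the XML value.
    static byte[] utf16Name(String name) {
        byte[] bytes = name.getBytes(StandardCharsets.UTF_16LE);
        return Arrays.copyOf(bytes, bytes.length - 1);
    }

    // Approximation of the nested <match> rule for MS Word.
    static boolean matchesWordMagic(byte[] data) {
        return foundInRange(data, OLE2_HEADER, 0, 8)
            && foundInRange(data, utf16Name("WordDocument"), 1152, 4096);
    }

    // Synthetic stand-in for the counterexample: "Workbook" at 0x0480 and
    // an embedded "WordDocument" at 0x0B80. Not a parseable OLE2 file.
    static byte[] excelWithEmbeddedWord() {
        byte[] data = new byte[8192];
        System.arraycopy(OLE2_HEADER, 0, data, 0, OLE2_HEADER.length);
        byte[] workbook = utf16Name("Workbook");
        System.arraycopy(workbook, 0, data, 0x0480, workbook.length);
        byte[] wordDocument = utf16Name("WordDocument");
        System.arraycopy(wordDocument, 0, data, 0x0B80, wordDocument.length);
        return data;
    }

    public static void main(String[] args) {
        // The rule fires although the buffer models an Excel file.
        System.out.println(matchesWordMagic(excelWithEmbeddedWord()));
    }
}
```

The rule only sees bytes at offsets, so it cannot tell a top-level stream 
name from one that belongs to an embedded document. That distinction needs 
the container structure, which is why the file should go to the container 
detector.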
                
> MS Word Detection magics are a bit overzealous
> ----------------------------------------------
>
>                 Key: TIKA-806
>                 URL: https://issues.apache.org/jira/browse/TIKA-806
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>         Attachments: tika-806-ver2.patch, tika-806-ver3.zip
>
>
> tika-mimetypes.xml contains the following magic for MS Word:
> {noformat}
> <match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
> <match value="W\x00o\x00r\x00d\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t" 
> type="string" offset="1152:4096" />
> </match>
> {noformat}
> So if a file is an MS Office document (parent Office magic) and has the 
> WordDocument string within the given offsets, then it's Word. I have a few 
> (regrettably confidential) counterexamples of MS Excel files with embedded 
> Word documents. For instance, one has "Workbook" (with 0x00 between 
> characters) at offset 0x0480 and "WordDocument" (0x00's between characters) 
> at offset 0x0B80. This is an Excel file, yet it meets the above-mentioned 
> magic criterion. Returning x-tika-msoffice would dispatch the file to the POI 
> detector, which would return the correct answer.
> I vote for removing that magic. I took a look at some of my files and it 
> seems that "WordDocument" and "Workbook" strings do occur at various offsets. 
> The presence of embedded documents makes detection by those strings 
> unreliable.

