[jira] [Commented] (TIKA-806) MS Word Detection magics are a bit overzealous

Nick Burch (Commented) (JIRA) Tue, 13 Dec 2011 05:24:03 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168369#comment-13168369
 ]


Nick Burch commented on TIKA-806:
---------------------------------

You can always get a false positive with mime magic, though. We can never be 
completely certain, so I tend to think the line should be drawn at "generally 
helpful and rarely harmful". (Whether this particular magic falls on the right 
side of that line may be a different matter!)

For the OLE2 and Zip cases, we do provide more accurate container detectors. 
These only run for files with the right initial mime magic, so people who care 
about greater accuracy (at the expense of a little more processing time) can 
make use of them if they choose.

For your specific case, you only need to check the first 4 bytes to know 
whether a file has the Zip or OLE2 mime magic. It may be best to have code 
that first checks those few bytes of your truncated stream. If they match, it 
can pass the whole file to the appropriate container detector; if not, it can 
pass the first few kb to the regular mimetypes code. That's likely to be less 
brittle, as well as easier to follow. It should also cope well with adding 
other container detectors (e.g. Ogg) later.
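
As a rough illustration of that "sniff the first few bytes, then dispatch" 
idea, here is a minimal sketch in plain Java with no Tika dependency. The 
names ContainerSniffer, Kind, classify and sniff are made up for this example 
and are not part of Tika. It compares the 4-byte Zip signature and, to be a 
little stricter than the 4 bytes mentioned above, the full 8-byte OLE2 
signature from the quoted magic. What you do with the result (container 
detector on the whole file, or regular mimetypes code on the first few kb) is 
left to the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not a Tika class.
final class ContainerSniffer {
    enum Kind { ZIP, OLE2, OTHER }

    // Zip local file header signature, "PK\003\004".
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 signature, as in the quoted tika-mimetypes.xml entry.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Decide which path a file should take, from its leading bytes only.
    static Kind classify(byte[] prefix) {
        if (startsWith(prefix, ZIP_MAGIC)) {
            return Kind.ZIP;    // caller would use a Zip container detector
        }
        if (startsWith(prefix, OLE2_MAGIC)) {
            return Kind.OLE2;   // caller would use an OLE2 container detector
        }
        return Kind.OTHER;      // caller would use the regular mimetypes code
    }

    // Peek at the start of a stream without consuming it.
    static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[OLE2_MAGIC.length];
        in.mark(head.length);
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break;          // stream shorter than the longest signature
            }
            n += r;
        }
        in.reset();
        return classify(Arrays.copyOf(head, n));
    }

    public static void main(String[] args) throws IOException {
        byte[] zipLike = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x00, 0x00};
        System.out.println(sniff(new ByteArrayInputStream(zipLike)));
    }
}
```

Adding another container type later (e.g. Ogg) would then mean one more 
signature constant and one more Kind value, with the dispatch code unchanged 
in shape.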

(Most people can simply pass the whole stream to DefaultDetector and have 
something like this done for them. Your case is only special because you want 
to detect most files from the initial few kb, and use the whole file only for 
certain types.)
                
> MS Word Detection magics are a bit overzealous
> ----------------------------------------------
>
>                 Key: TIKA-806
>                 URL: https://issues.apache.org/jira/browse/TIKA-806
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>         Attachments: tika-806-ver2.patch, tika-806-ver3.zip
>
>
> tika-mimetypes.xml contains the following magic for MS Word:
> {noformat}
> <match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
> <match value="W\x00o\x00r\x00d\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t" 
> type="string" offset="1152:4096" />
> </match>
> {noformat}
> So if a file is an MS Office document (parent Office magic) and has the 
> WordDocument string within the given offsets, then it's Word. I have a few 
> (regrettably confidential) counterexamples of MS Excel files with embedded 
> Word documents. For instance one has "Workbook" (with 0x00 between 
> characters) at offset 0x0480 and "WordDocument" (0x00's between characters) 
> at offset 0x0B80. This is an Excel file, which does meet the above-mentioned 
> magic criterion. Returning x-tika-msoffice would dispatch the file to the 
> POI detector, which would return the correct answer.
> I vote for removing that magic. I took a look at some of my files and it 
> seems that "WordDocument" and "Workbook" strings do occur at various offsets. 
> The presence of embedded documents makes detection by those strings 
> unreliable.
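
The false positive described in the report can be reproduced on a synthetic 
buffer. The sketch below is plain Java rather than Tika's actual matcher, and 
it assumes that offset="1152:4096" means the pattern may start at any position 
in that range. WordMagicDemo and its methods are names made up for this 
illustration. The buffer mimics the reported Excel file: OLE2 signature at 
offset 0, "Workbook" at 0x0480 and "WordDocument" at 0x0B80, both with 0x00 
between characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Tika.
final class WordMagicDemo {
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};

    // Encode text with 0x00 between characters, ending on the last character
    // (UTF-16LE without the final zero byte), like the pattern in the magic.
    static byte[] spaced(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16LE);
        return Arrays.copyOf(utf16, utf16.length - 1);
    }

    // True if needle starts at some offset in [from, to] within data.
    static boolean startsWithin(byte[] data, byte[] needle, int from, int to) {
        int last = Math.min(to, data.length - needle.length);
        for (int start = from; start <= last; start++) {
            int i = 0;
            while (i < needle.length && data[start + i] == needle[i]) {
                i++;
            }
            if (i == needle.length) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the nested magic: OLE2 header, then "WordDocument" in 1152:4096.
    static boolean matchesWordMagic(byte[] data) {
        return startsWithin(data, OLE2_MAGIC, 0, 0)
            && startsWithin(data, spaced("WordDocument"), 1152, 4096);
    }

    // Synthetic stand-in for the Excel file with an embedded Word document.
    static byte[] syntheticExcelWithEmbeddedWord() {
        byte[] data = new byte[8192];
        byte[] workbook = spaced("Workbook");
        byte[] word = spaced("WordDocument");
        System.arraycopy(OLE2_MAGIC, 0, data, 0, OLE2_MAGIC.length);
        System.arraycopy(workbook, 0, data, 0x0480, workbook.length);
        System.arraycopy(word, 0, data, 0x0B80, word.length);
        return data;
    }

    public static void main(String[] args) {
        System.out.println(matchesWordMagic(syntheticExcelWithEmbeddedWord()));
    }
}
```

Since 0x0B80 is 2944, the embedded "WordDocument" entry sits inside the 
1152:4096 window, so the Word magic matches even though "Workbook" appears 
earlier, at 0x0480 (1152).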
