[ https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168056#comment-13168056 ]
Nick Burch commented on TIKA-806:
---------------------------------

If you use DefaultDetector it isn't an issue, as the container ones get run first. For your case, can't you just say "if the type is x-tika-msoffice or the type's parent is x-tika-msoffice, use the container detector"?

I agree that we need container-aware detectors for true OLE2 detection (that's why I wrote the original POIFS detector!), but I'm not sure about removing mime magic that is commonly correct. For many people, having it in will give a better answer than not.

> MS Word Detection magics are a bit overzealous
> ----------------------------------------------
>
>                 Key: TIKA-806
>                 URL: https://issues.apache.org/jira/browse/TIKA-806
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>         Attachments: tika-806-ver2.patch, tika-806-ver3.zip
>
>
> tika-mimetypes.xml contains the following magic for MS Word:
> {noformat}
> <match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
>   <match value="W\x00o\x00r\x00d\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t"
>          type="string" offset="1152:4096" />
> </match>
> {noformat}
> So if a file is an MS Office document (parent Office magic) and has the
> WordDocument string within the given offsets, then it's Word. I have a few
> (regrettably confidential) counterexamples of MS Excel files with embedded
> Word documents. For instance, one has "Workbook" (with 0x00 between
> characters) at offset 0x0480 and "WordDocument" (0x00's between characters)
> at offset 0x0B80. This is an Excel file, yet it meets the above-mentioned
> magic criterion. Returning x-tika-msoffice would dispatch the file to the POI
> detector, which would return the correct answer.
> I vote for removing that magic. I took a look at some of my files and it
> seems that the "WordDocument" and "Workbook" strings do occur at various
> offsets. The presence of embedded documents makes detection by those strings
> unreliable.
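
To make the reported false positive concrete, here is a minimal stdlib-only Java sketch of the byte-level check that the quoted magic describes. It does not use Tika's own matcher. The class name `OleMagicSketch`, its helper methods, and the synthetic buffer are all hypothetical and exist only for illustration. The buffer places "Workbook" at 0x0480 and "WordDocument" at 0x0B80, as in the counterexample described in the issue, and the offset ranges are read as "the match may start anywhere from the first offset to the second", which is an assumption about how the magic is evaluated.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class OleMagicSketch {
    // OLE2 compound document signature from the parent magic (0xd0cf11e0a1b11ae1).
    static final byte[] OLE2_HEADER = {
        (byte) 0xd0, (byte) 0xcf, 0x11, (byte) 0xe0,
        (byte) 0xa1, (byte) 0xb1, 0x1a, (byte) 0xe1
    };

    // Encodes a stream name as in the magic: characters separated by 0x00,
    // without the trailing 0x00 that UTF-16LE would add after the last character.
    static byte[] interleaved(String name) {
        byte[] utf16 = name.getBytes(StandardCharsets.UTF_16LE);
        return Arrays.copyOf(utf16, utf16.length - 1);
    }

    // True if pattern occurs in data starting exactly at offset.
    static boolean matchesAt(byte[] data, byte[] pattern, int offset) {
        if (offset < 0 || offset + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // True if pattern starts at any offset in [from, to] (both inclusive).
    static boolean matchesInRange(byte[] data, byte[] pattern, int from, int to) {
        for (int start = from; start <= to; start++) {
            if (matchesAt(data, pattern, start)) {
                return true;
            }
        }
        return false;
    }

    // The nested magic from the issue: OLE2 header plus a stream name in 1152:4096.
    static boolean matchesNameMagic(byte[] data, String streamName) {
        return matchesInRange(data, OLE2_HEADER, 0, 8)
            && matchesInRange(data, interleaved(streamName), 1152, 4096);
    }

    static boolean matchesWordMagic(byte[] data) {
        return matchesNameMagic(data, "WordDocument");
    }

    // Synthetic stand-in for the confidential counterexample: not a valid
    // OLE2 file, just the header and the two names at the reported offsets.
    static byte[] syntheticExcelWithEmbeddedWord() {
        byte[] data = new byte[4608];
        System.arraycopy(OLE2_HEADER, 0, data, 0, OLE2_HEADER.length);
        byte[] workbook = interleaved("Workbook");
        byte[] word = interleaved("WordDocument");
        System.arraycopy(workbook, 0, data, 0x0480, workbook.length);
        System.arraycopy(word, 0, data, 0x0B80, word.length);
        return data;
    }

    public static void main(String[] args) {
        byte[] excel = syntheticExcelWithEmbeddedWord();
        System.out.println("Word magic matches: " + matchesWordMagic(excel));
        System.out.println("Workbook name also present: " + matchesNameMagic(excel, "Workbook"));
    }
}
```

Both checks succeed on the same buffer, so the byte pattern alone cannot tell an Excel file with an embedded Word document from a Word file. This is consistent with both positions in the thread: the magic is often right, but only a container-aware detector can decide such cases.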