[ https://issues.apache.org/jira/browse/TIKA-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoni Mylka resolved TIKA-806.
-------------------------------

       Resolution: Not A Problem
    Fix Version/s: 1.1
         Assignee: Antoni Mylka

You're right. No further comments. I guess I can just make use of my newly-found JIRA authority and close this issue as "Not a Problem". Then I'll add the hack to the app. If in doubt - reopen.

> MS Word Detection magics are a bit overzealous
> ----------------------------------------------
>
>                 Key: TIKA-806
>                 URL: https://issues.apache.org/jira/browse/TIKA-806
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>            Assignee: Antoni Mylka
>             Fix For: 1.1
>
>         Attachments: tika-806-ver2.patch, tika-806-ver3.zip
>
>
> tika-mimetypes.xml contains the following magic for MS Word:
> {noformat}
> <match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
>   <match value="W\x00o\x00r\x00d\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t" type="string" offset="1152:4096" />
> </match>
> {noformat}
> So if a file is an MS Office document (parent Office magic) and has the
> WordDocument string within the given offsets, then it's Word. I have a few
> (regrettably confidential) counterexamples of MS Excel files with embedded
> Word documents. For instance, one has "Workbook" (with 0x00 between
> characters) at offset 0x0480 and "WordDocument" (0x00's between characters)
> at offset 0x0B80. This is an Excel file, yet it meets the above-mentioned
> magic criterion. Returning x-tika-msoffice would dispatch the file to the POI
> detector, which would return the correct answer.
> I vote for removing that magic. I took a look at some of my files and it
> seems that the "WordDocument" and "Workbook" strings do occur at various offsets.
> The presence of embedded documents makes detection by those strings
> unreliable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
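
For illustration, the false positive described in the quoted issue can be reproduced with a minimal Python sketch. This is not Tika's matcher: the function names are hypothetical, the byte buffers are synthetic (the real counterexamples are confidential), and treating "1152:4096" as an inclusive range of start offsets is an assumption.

```python
# Sketch only: mimics the nested magic quoted from tika-mimetypes.xml.
# All names below are hypothetical and not part of Tika.

OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def utf16le(name: str) -> bytes:
    """Encode a name with 0x00 after each character, as it appears in the file."""
    return name.encode("utf-16-le")


def found_in_window(data: bytes, needle: bytes, start: int, end: int) -> bool:
    """True if needle begins at an offset in [start, end] (assumed inclusive)."""
    pos = data.find(needle, start)
    return pos != -1 and pos <= end


def naive_is_word(data: bytes) -> bool:
    """Parent OLE2 magic at offset 0, plus "WordDocument" within 1152:4096."""
    # The magic value in the XML stops at the final 't', so drop the last 0x00.
    needle = utf16le("WordDocument")[:-1]
    return data.startswith(OLE2_MAGIC) and found_in_window(data, needle, 1152, 4096)


def make_sample(entries: dict) -> bytes:
    """Build a synthetic buffer: OLE2 magic, then each name at its offset."""
    buf = bytearray(8192)
    buf[0:8] = OLE2_MAGIC
    for offset, name in entries.items():
        encoded = utf16le(name)
        buf[offset:offset + len(encoded)] = encoded
    return bytes(buf)


# Layout from the report: "Workbook" at 0x0480, "WordDocument" at 0x0B80.
excel_with_embedded_word = make_sample({0x0480: "Workbook", 0x0B80: "WordDocument"})
plain_excel = make_sample({0x0480: "Workbook"})

print(naive_is_word(excel_with_embedded_word))  # True: the Excel layout matches the Word magic
print(naive_is_word(plain_excel))               # False
```

The first buffer follows the Excel layout from the report and still satisfies the Word magic, which is the behaviour the issue calls overzealous.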