[ https://issues.apache.org/jira/browse/TIKA-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Burch resolved TIKA-779.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.1

> Detection of Microsoft Works 2000 Word Processor files
> ------------------------------------------------------
>
>                 Key: TIKA-779
>                 URL: https://issues.apache.org/jira/browse/TIKA-779
>             Project: Tika
>          Issue Type: Test
>    Affects Versions: 1.0
>         Environment: Windows 7, 64 bit
>            Reporter: Antoni Mylka
>             Fix For: 1.1
>
>         Attachments: microsoft-works-word-processor-2000.wps, tika-779.patch
>
>
> In older versions of Tika, our Microsoft Works 2000 Word Processor example
> file was recognized properly by the POIFSContainerDetector. Now it isn't.
> Some debugging revealed that the improvements from TIKA-704 broke the
> detection of that particular file. The detection is based on the top-level
> names obtained from the root DirectoryNode. In the case of this file there
> are two strings in that set: "CONTENTS" and "\u0001CompObj". In older
> versions "CONTENTS" alone was enough to recognize a file as
> "application/vnd.ms-works". Now the check looks like this:
> {noformat}
> if (names.contains("CONTENTS") && names.contains("SPELLING")) {
>     return WPS;
> } else if (names.contains("CONTENTS")) {
>     // CONTENTS without SPELLING normally means some sort of
>     // embedded non-office file inside an OLE2 document
>     // This is most commonly triggered on nested directories
>     return OLE;
> }
> {noformat}
> Now I have a file with CONTENTS but without SPELLING, and it's a normal WPS
> file. I did a workaround like this:
> {noformat}
> if (names.contains("CONTENTS") &&
>         (names.contains("SPELLING") || names.contains("\u0001CompObj"))) {
>     return WPS;
> } else if (names.contains("CONTENTS")) {
>     // CONTENTS without SPELLING normally means some sort of
>     // embedded non-office file inside an OLE2 document
>     // This is most commonly triggered on nested directories
>     return OLE;
> }
> {noformat}
> So "CONTENTS" has to be supplemented by "SPELLING" or "\u0001CompObj". I
> don't know the meaning of this, and I don't know whether that second string
> also occurs in the "embedded non-office file inside an OLE2 document"
> referred to in that comment. The workaround solves the problem for me: the
> Tika build tests pass, and the regression tests of my apps pass as well.
> Jukka, do you have more than one WPS file, and do all of them have both
> CONTENTS and SPELLING names in that collection? Is the "\u0001CompObj"
> string characteristic of this format, or is it a generic thing that also
> occurs in those "non-office files" or "nested directories"? If it is
> generic, just close this as wontfix.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
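
For readers who want to try the reporter's workaround outside of Tika, here is a minimal stand-alone sketch of the rule described in the issue. It is not the actual POIFSContainerDetector code: the class name WorksDetectionSketch, the Result enum and the UNDECIDED value are hypothetical names introduced only for illustration, and the sketch takes a plain set of top-level entry names instead of reading them from a POI DirectoryNode.

```java
import java.util.Set;

// Hypothetical illustration of the workaround quoted in TIKA-779.
// It only models the CONTENTS branch; every other check is out of scope.
final class WorksDetectionSketch {

    // WPS and OLE mirror the constants named in the issue; UNDECIDED is an
    // assumption meaning "this rule does not apply to the given names".
    enum Result { WPS, OLE, UNDECIDED }

    // Entry name reported for the example file, starting with U+0001.
    static final String COMP_OBJ = "\u0001CompObj";

    static Result detect(Set<String> names) {
        if (!names.contains("CONTENTS")) {
            return Result.UNDECIDED;
        }
        // Workaround: CONTENTS must be accompanied by SPELLING or CompObj.
        if (names.contains("SPELLING") || names.contains(COMP_OBJ)) {
            return Result.WPS;
        }
        // CONTENTS on its own is treated as a generic OLE2 container.
        return Result.OLE;
    }

    public static void main(String[] args) {
        // Names reported for the attached Works 2000 file.
        System.out.println(detect(Set.of("CONTENTS", COMP_OBJ)));   // WPS
        // The combination that was already recognized before the workaround.
        System.out.println(detect(Set.of("CONTENTS", "SPELLING"))); // WPS
        // CONTENTS alone falls back to the generic container type.
        System.out.println(detect(Set.of("CONTENTS")));             // OLE
    }
}
```

Whether "\u0001CompObj" is specific enough to Works files is exactly the open question raised in the issue, so the second condition should be read as the reporter's proposal rather than a confirmed rule.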