[ 
https://issues.apache.org/jira/browse/TIKA-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Burch resolved TIKA-779.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.1
    
> Detection of Microsoft Works 2000 Word Processor files
> ------------------------------------------------------
>
>                 Key: TIKA-779
>                 URL: https://issues.apache.org/jira/browse/TIKA-779
>             Project: Tika
>          Issue Type: Test
>    Affects Versions: 1.0
>         Environment: Windows 7, 64 bit
>            Reporter: Antoni Mylka
>             Fix For: 1.1
>
>         Attachments: microsoft-works-word-processor-2000.wps, tika-779.patch
>
>
> In older versions of Tika, our Microsoft Works 2000 Word Processor example 
> file was recognized properly by the POIFSContainerDetector. Now it 
> isn't. Some debugging revealed that the improvements from TIKA-704 broke the 
> detection of that particular file. The detection is based on the top-level 
> names obtained from the root DirectoryNode. In the case of this file there 
> are two strings in that set: "CONTENTS" and "\u0001CompObj". In older 
> versions "CONTENTS" alone was enough to recognize a file as 
> "application/vnd.ms-works". Now the check looks like this:
> {noformat}
> if (names.contains("CONTENTS") && names.contains("SPELLING")) {
>    return WPS;
> } else if (names.contains("CONTENTS")) {
>    // CONTENTS without SPELLING normally means some sort of
>    //  embedded non-office file inside an OLE2 document
>    // This is most commonly triggered on nested directories
>    return OLE;
> }
> {noformat}
> I have a file with CONTENTS but without SPELLING, and it is a normal WPS 
> file. I worked around this as follows:
> {noformat}
> if (names.contains("CONTENTS") &&
>     (names.contains("SPELLING") || names.contains("\u0001CompObj"))) {
>    return WPS;
> } else if (names.contains("CONTENTS")) {
>    // CONTENTS without SPELLING normally means some sort of
>    //  embedded non-office file inside an OLE2 document
>    // This is most commonly triggered on nested directories
>    return OLE;
> }
> {noformat}
> So "CONTENTS" has to be accompanied by "SPELLING" or "\u0001CompObj". I 
> don't know what the second string means, and I don't know whether it also 
> occurs in the "embedded non-office file inside an OLE2 document" cases 
> referred to in that comment. The workaround solves the problem for me: the 
> Tika build tests pass and the regression tests of my apps pass as well.
> Jukka, do you have more than one WPS file, and do all of them have both 
> CONTENTS and SPELLING names in that collection? Is the "\u0001CompObj" 
> string characteristic of this format, or is it a generic thing which also 
> occurs in those "non-office files" or "nested directories"? If it is 
> generic, just close this as wontfix. 
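
For illustration only, the two checks quoted in the description can be
compared side by side in a small standalone sketch. This is not the Tika
source or the attached patch: the class name WorksDetectionSketch and the
method names are hypothetical, the OLE value is an assumed placeholder for
the generic OLE2 container type, and returning null simply stands in for
"fall through to the other checks", which are not modelled here.

```java
import java.util.Set;

// Hypothetical sketch; it only mirrors the two snippets quoted in the issue.
public class WorksDetectionSketch {
    // Media type named in the issue description.
    static final String WPS = "application/vnd.ms-works";
    // Assumed placeholder for the generic OLE2 container type.
    static final String OLE = "application/x-tika-msoffice";

    // Check as quoted in the issue: CONTENTS must come with SPELLING.
    static String detectCurrent(Set<String> names) {
        if (names.contains("CONTENTS") && names.contains("SPELLING")) {
            return WPS;
        } else if (names.contains("CONTENTS")) {
            return OLE;
        }
        return null; // stands in for the remaining checks, not modelled here
    }

    // Reporter's workaround: CONTENTS plus either SPELLING or \u0001CompObj.
    static String detectWithWorkaround(Set<String> names) {
        if (names.contains("CONTENTS")
                && (names.contains("SPELLING") || names.contains("\u0001CompObj"))) {
            return WPS;
        } else if (names.contains("CONTENTS")) {
            return OLE;
        }
        return null; // stands in for the remaining checks, not modelled here
    }

    public static void main(String[] args) {
        // Top-level entry names reported for the Works 2000 example file.
        Set<String> works2000 = Set.of("CONTENTS", "\u0001CompObj");
        System.out.println(detectCurrent(works2000));        // application/x-tika-msoffice
        System.out.println(detectWithWorkaround(works2000)); // application/vnd.ms-works
    }
}
```

With the name set reported for the example file, the first method falls into
the generic OLE branch while the second returns the Works type. A set that
contains only "CONTENTS" is treated as generic OLE by both.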

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
