[jira] [Commented] (TIKA-791) Fix the detection of protected OOXML files

Nick Burch (Commented) (JIRA) Fri, 25 Nov 2011 07:58:02 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157212#comment-13157212
 ]


Nick Burch commented on TIKA-791:
---------------------------------

Maybe we should add a new Tika-specific mime type for encrypted OOXML documents?

This would extend x-tika-ooxml, but the (OLE2-based) OfficeParser could 
claim it. That would let OfficeParser decrypt and recurse as it does now, 
and would also let mime type detection refine the result to the specific 
kind based on the filename?

I'm not sure we want to duplicate the protected document logic between 
OfficeParser and OOXMLParser, though. I think we want to keep it all in one 
place. OfficeParser seems like the right place for it to me, since everything 
is still within the OLE2 layer at that point.
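
To make the filename-based correction concrete, here is a rough sketch of the
rule: keep the magic-based result unless it is the generic OOXML type and the
name-based result is a specialization of it. This is not Tika code. The class
OoxmlTypeResolver, its methods, and the hard-coded supertype table are all
hypothetical stand-ins; a real version would presumably consult Tika's own
type registry rather than a map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: illustrates the proposed rule only.
final class OoxmlTypeResolver {
    static final String OOXML = "application/x-tika-ooxml";

    // Illustrative child -> parent table standing in for a real type registry.
    private static final Map<String, String> SUPERTYPE = new HashMap<>();
    static {
        SUPERTYPE.put(
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            OOXML);
        SUPERTYPE.put(
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            OOXML);
    }

    // True if 'type' has 'ancestor' somewhere above it in the table.
    static boolean isSpecializationOf(String type, String ancestor) {
        String current = SUPERTYPE.get(type);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = SUPERTYPE.get(current);
        }
        return false;
    }

    // Magic wins, except when it is generic OOXML and the name is more specific.
    static String resolve(String magicType, String nameType) {
        if (OOXML.equals(magicType)
                && nameType != null
                && isSpecializationOf(nameType, OOXML)) {
            return nameType;
        }
        return magicType;
    }
}
```

With this rule, an encrypted file detected as x-tika-ooxml and named
report.docx would come out as the normal DOCX type, while any other magic
result is left untouched.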
                
> Fix the detection of protected OOXML files
> ------------------------------------------
>
>                 Key: TIKA-791
>                 URL: https://issues.apache.org/jira/browse/TIKA-791
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.1
>         Environment: Windows 7 64 bit
>            Reporter: Antoni Mylka
>         Attachments: tika-791.zip
>
>
> TIKA-437 patch allowed Tika to work with OOXML files protected with the 
> default VelvetSweatshop password. I feel there is room for improvement.
> # The POIFSContainerDetector lies when it sees such a file. It should be able 
> to mark it as x-tika-ooxml
> # The OOXMLParser can't work with such a file. It should:
> ## If it's protected with the default password - it should be decrypted and 
> processed normally.
> ## If it's protected with a non-default password - the file should be marked 
> as protected, no weird exceptions should appear.
> Therefore I'd like to add an 'if' to POIFSContainerDetector which returns 
> x-tika-ooxml, and some code to OOXMLParser, which would be similar to the 
> code currently residing in OfficeParser. After this improvement both the 
> OfficeParser and the OOXMLParser will treat such files in the same way.
> When I have that, I can add a hack in my application, which will say "If the 
> type is x-tika-ooxml and the name-based detection is a specialization of 
> ooxml, then use the name-based detection". This will be a workaround for the 
> fact that in MimeTypes, magic always trumps the name. With that, the 
> encrypted DOCX files will appear with the normal DOCX mimetype in my app.

