[ https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158444#comment-13158444 ]

Nick Burch commented on TIKA-791:
---------------------------------

One thing - I'm not sure that we should be returning the same mimetype for a regular .xlsx file and a password protected .xlsx file. One is zip based, one is encrypted ole2.

I'd say it's a similar situation to us not returning the same mimetype for .tar and .tar.gz - while they are both technically tar files, one is directly tar and the other is a wrapped tar that needs unpacking first. In this case, the protected ooxml files need special handling before they turn into normal ooxml files, so I don't believe we should be treating them interchangeably.

> Fix the detection of protected OOXML files
> ------------------------------------------
>
>                 Key: TIKA-791
>                 URL: https://issues.apache.org/jira/browse/TIKA-791
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.1
>         Environment: Windows 7 64 bit
>            Reporter: Antoni Mylka
>         Attachments: tika-791-ver2.zip, tika-791.zip
>
>
> TIKA-437 patch allowed Tika to work with OOXML files protected with the
> default VelvetSweatshop password. I feel there is room for improvement.
> # The POIFSContainerDetector lies when it sees such a file. It should be able
> to mark it as x-tika-ooxml
> # The OOXMLParser can't work with such a file. It should:
> ## If it's protected with the default password - it should be decrypted and
> processed normally.
> ## If it's protected with a non-default password - the file should be marked
> as protected, no weird exceptions should appear.
> Therefore I'd like to add an 'if' to POIFSContainerDetector which returns
> x-tika-ooxml, and some code to OOXMLParser, which would be similar to the
> code currently residing in OfficeParser. After this improvement both the
> OfficeParser and the OOXMLParser will treat such files in the same way.
> When I have that, I can add a hack in my application, which will say "If the
> type is x-tika-ooxml and the name-based detection is a specialization of
> ooxml, then use the name-based detection". This will be a workaround for the
> fact that in MimeTypes, magic always trumps the name. With that, the
> encrypted DOCX files will appear with the normal DOCX mimetype in my app.
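
For reference, the application-side rule quoted above ("if the type is x-tika-ooxml and the name-based detection is a specialization of ooxml, then use the name-based detection") could be sketched roughly as below. This is only an illustration and not code from either patch: the class OoxmlTypeChooser, its choose method and the hand-written PARENT table are hypothetical stand-ins, and a real application would more likely ask Tika's MediaTypeRegistry whether one type is a specialization of another.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: picks between the magic-based and
// the name-based detection result using the rule from the issue description.
final class OoxmlTypeChooser {
    static final String OOXML = "application/x-tika-ooxml";
    static final String DOCX =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String XLSX =
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    // Assumed child -> parent table standing in for a real type registry.
    private static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put(DOCX, OOXML);
        PARENT.put(XLSX, OOXML);
    }

    // True if 'type' has 'ancestor' somewhere above it in the parent chain.
    static boolean isSpecializationOf(String type, String ancestor) {
        String current = PARENT.get(type);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = PARENT.get(current);
        }
        return false;
    }

    // Prefer the name-based type only when magic detection gave the generic
    // OOXML type and the name-based type refines it; otherwise magic wins.
    static String choose(String magicType, String nameType) {
        if (OOXML.equals(magicType)
                && nameType != null
                && isSpecializationOf(nameType, OOXML)) {
            return nameType;
        }
        return magicType;
    }

    private OoxmlTypeChooser() { }
}
```

With this rule, a file detected by magic as x-tika-ooxml and named report.docx would be reported as DOCX, while any other combination keeps the magic-based result. Note that this deliberately hides the difference between encrypted and plain OOXML files inside the application, which is exactly the distinction the comment above argues the detector itself should preserve.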