[ https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka updated TIKA-791:
------------------------------

    Attachment: tika-791-ver2.zip

Attached an updated patch which uses a new media type "application/x-tika-ooxml-protected" for protected OOXML files with the OLE2 magic. This allowed me to implement my own detector which does:

{noformat}
if mimeTypes says ms-office and
   poi says ooxml-protected and
   name implies an ooxml subtype
then return the type implied by name
{noformat}

Right now it can't be done with any of the built-in Tika Detectors. If you think it would be a good idea then perhaps, this would warrant a new issue.

> Fix the detection of protected OOXML files
> ------------------------------------------
>
>                 Key: TIKA-791
>                 URL: https://issues.apache.org/jira/browse/TIKA-791
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.1
>         Environment: Windows 7 64 bit
>            Reporter: Antoni Mylka
>         Attachments: tika-791-ver2.zip, tika-791.zip
>
>
> TIKA-437 patch allowed Tika to work with OOXML files protected with the
> default VelvetSweatshop password. I feel there is room for improvement.
> # The POIFSContainerDetector lies when it sees such a file. It should be able
> to mark it as x-tika-ooxml
> # The OOXMLParser can't work with such a file. It should:
> ## If it's protected with the default password - it should be decrypted and
> processed normally.
> ## If it's protected with a non-default password - the file should be marked
> as protected, no weird exceptions should appear.
> Therefore I'd like to add an 'if' to POIFSContainerDetector which returns
> x-tika-ooxml, and some code to OOXMLParser, which would be similar to the
> code currently residing in OfficeParser. After this improvement both the
> OfficeParser and the OOXMLParser will treat such files in the same way.
> When I have that, I can add a hack in my application, which will say "If the
> type is x-tika-ooxml and the name-based detection is a specialization of
> ooxml, then use the name-based detection".
> This will be a workaround for the fact that in MimeTypes, magic always
> trumps the name. With that, the encrypted DOCX files will appear with the
> normal DOCX mimetype in my app.
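
For readers who want to see the detector rule from the update comment spelled out, here is a minimal sketch in plain Java. It deliberately does not use the Tika Detector API: the class ProtectedOoxmlTypeResolver, its methods, and the extension table are hypothetical names invented for illustration. The string "application/x-tika-msoffice" is assumed to be what "mimeTypes says ms-office" refers to, and the fallback (return the container-based type, else the magic-based type) is an assumption, since the comment only states the positive case.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: combines three detection results
// following the rule sketched in the comment above.
final class ProtectedOoxmlTypeResolver {

    // Assumed to be the type reported for OLE2 magic ("ms-office").
    static final String MS_OFFICE = "application/x-tika-msoffice";
    // Media type introduced by the tika-791-ver2 patch.
    static final String OOXML_PROTECTED = "application/x-tika-ooxml-protected";

    // Illustrative subset of OOXML subtypes, keyed by file extension.
    private static final Map<String, String> OOXML_BY_EXTENSION = Map.of(
        "docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");

    // Returns the OOXML subtype implied by the file name, or null if none.
    static String typeImpliedByName(String fileName) {
        if (fileName == null) {
            return null;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return OOXML_BY_EXTENSION.get(extension);
    }

    // If magic says ms-office, the container check says ooxml-protected, and
    // the name implies an OOXML subtype, return the type implied by the name.
    // Otherwise fall back to the container type, then the magic type
    // (the fallback order is an assumption of this sketch).
    static String resolve(String magicType, String containerType, String fileName) {
        String byName = typeImpliedByName(fileName);
        if (MS_OFFICE.equals(magicType)
                && OOXML_PROTECTED.equals(containerType)
                && byName != null) {
            return byName;
        }
        return containerType != null ? containerType : magicType;
    }

    public static void main(String[] args) {
        System.out.println(resolve(MS_OFFICE, OOXML_PROTECTED, "report.docx"));
    }
}
```

With this rule, a password-protected report.docx resolves to the regular DOCX media type, while a protected file whose name gives no OOXML hint keeps "application/x-tika-ooxml-protected".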