[ 
https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158444#comment-13158444
 ] 

Nick Burch commented on TIKA-791:
---------------------------------

One thing: I'm not sure that we should return the same mimetype for a 
regular .xlsx file and a password-protected .xlsx file. One is zip-based; the 
other is an encrypted OLE2 container. I'd say it's similar to our not returning 
the same mimetype for .tar and .tar.gz: while both are technically tar files, 
one is a plain tar and the other is a gzip-wrapped tar that needs unpacking 
first. In this case, protected OOXML files need special handling before they 
turn into normal OOXML files, so I don't believe we should treat them 
interchangeably.
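
To illustrate the container difference, here is a minimal sketch in Java that 
tells the two apart by their leading magic bytes. It is not Tika's detector 
code: the class name ContainerSniffer and the method sniff are made up for 
this example, and a real detector would go on to inspect the container's 
contents rather than stop at the signature.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Tika.
class ContainerSniffer {
    // ZIP local file header signature, "PK\003\004" (regular OOXML files)
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // OLE2 compound document signature (password-protected OOXML files)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns "zip", "ole2" or "unknown" based on the first bytes of a file.
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "ole2";
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "zip";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] regular = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        byte[] encrypted = {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1, 0x00
        };
        System.out.println(sniff(regular));   // prints "zip"
        System.out.println(sniff(encrypted)); // prints "ole2"
    }
}
```

The two files share an extension but not a single leading byte, which is why 
they need different handling before any OOXML parsing can start.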
                
> Fix the detection of protected OOXML files
> ------------------------------------------
>
>                 Key: TIKA-791
>                 URL: https://issues.apache.org/jira/browse/TIKA-791
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.1
>         Environment: Windows 7 64 bit
>            Reporter: Antoni Mylka
>         Attachments: tika-791-ver2.zip, tika-791.zip
>
>
> TIKA-437 patch allowed Tika to work with OOXML files protected with the 
> default VelvetSweatshop password. I feel there is room for improvement.
> # The POIFSContainerDetector lies when it sees such a file. It should be able 
> to mark it as x-tika-ooxml
> # The OOXMLParser can't work with such a file. It should:
> ## If it's protected with the default password - it should be decrypted and 
> processed normally.
> ## If it's protected with a non-default password - the file should be marked 
> as protected, no weird exceptions should appear.
> Therefore I'd like to add an 'if' to POIFSContainerDetector which returns 
> x-tika-ooxml, and some code to OOXMLParser, which would be similar to the 
> code currently residing in OfficeParser. After this improvement both the 
> OfficeParser and the OOXMLParser will treat such files in the same way.
> When I have that, I can add a hack in my application, which will say "If the 
> type is x-tika-ooxml and the name-based detection is a specialization of 
> ooxml, then use the name-based detection". This will be a workaround for the 
> fact that in MimeTypes, magic always trumps the name. With that, the 
> encrypted DOCX files will appear with the normal DOCX mimetype in my app.
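
The application-side workaround described in the quoted issue could be 
sketched roughly as follows. This is an assumption-laden illustration, not 
code from the attached patches: the class NameOverride, the method resolve, 
and the hard-coded, deliberately incomplete set of OOXML subtypes are all 
hypothetical. A real application would more likely ask Tika's media type 
registry whether the name-based type specializes x-tika-ooxml.

```java
import java.util.Set;

// Hypothetical application-side helper; not part of Tika.
class NameOverride {
    static final String GENERIC_OOXML = "application/x-tika-ooxml";

    // Assumed, incomplete list of types treated as specializations of OOXML.
    static final Set<String> OOXML_SPECIALIZATIONS = Set.of(
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "application/vnd.openxmlformats-officedocument.presentationml.presentation"
    );

    // Prefer the name-based type only when magic detection yielded the
    // generic OOXML type and the name-based type is a known specialization.
    static String resolve(String magicType, String nameType) {
        if (GENERIC_OOXML.equals(magicType)
                && nameType != null
                && OOXML_SPECIALIZATIONS.contains(nameType)) {
            return nameType;
        }
        return magicType;
    }

    public static void main(String[] args) {
        String docx =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
        // Generic magic result plus a .docx name: the name wins.
        System.out.println(resolve(GENERIC_OOXML, docx));
        // Any other magic result is left alone.
        System.out.println(resolve("application/pdf", docx));
    }
}
```

In every other case the magic-based result is returned unchanged, so the 
override applies only to the generic x-tika-ooxml type.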

