[ 
https://issues.apache.org/jira/browse/TIKA-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521063#comment-17521063
 ] 

Tim Allison edited comment on TIKA-3666 at 4/12/22 10:24 AM:
-------------------------------------------------------------

Is there any mime magic that we can use to identify these files?  Are they all 
OLE files?  This site 
(https://web-in-security.blogspot.com/2016/07/how-to-break-microsoft-rights.html)
and the codebase it references suggest an OLE container for both MSOffice and 
PDF, with an entry titled "EncryptedPackage".  

We already look for "EncryptedPackage" and then return "OLE", which would 
explain how the file is getting routed to the OfficeParser with no extracted 
contents.
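
For a quick look at a file without POI on the classpath, the two signals mentioned above (OLE container, "EncryptedPackage" entry) can be sketched roughly as follows. This is only an illustrative heuristic, not how Tika or POI detect these files: {{OleSniffer}} and its method names are made up for this example, and the name check is a raw byte search for the UTF-16LE string rather than a real walk of the OLE directory, so false positives are possible.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Tika or POI.
final class OleSniffer {

    // OLE2 compound file signature: the first 8 bytes of the file header.
    private static final byte[] OLE_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // OLE2 directory entry names are stored as UTF-16LE.
    private static final byte[] ENTRY_NAME =
        "EncryptedPackage".getBytes(StandardCharsets.UTF_16LE);

    // True if the data starts with the OLE2 signature.
    static boolean hasOleMagic(byte[] data) {
        if (data.length < OLE_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE_MAGIC.length; i++) {
            if (data[i] != OLE_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Crude heuristic: raw byte search for the entry name anywhere in the
    // data, instead of parsing the directory sectors properly.
    static boolean mentionsEncryptedPackage(byte[] data) {
        outer:
        for (int i = 0; i + ENTRY_NAME.length <= data.length; i++) {
            for (int j = 0; j < ENTRY_NAME.length; j++) {
                if (data[i + j] != ENTRY_NAME[j]) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("expected exactly one argument: the file to inspect");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("OLE2 magic: " + hasOleMagic(data));
        System.out.println("EncryptedPackage name found: " + mentionsEncryptedPackage(data));
    }
}
```

If both checks come back true for the RMS-protected samples, that would be consistent with the routing described above. The POIFSViewer output is still the more reliable way to see the actual entries.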

Can you run POI's POIFSViewer on your private files and tell us what the 
contents look like?

{code:java}
java -cp tika-app-2.3.0.jar org.apache.poi.poifs.dev.POIFSViewer myfile.docx
{code}




> Detect and indicate file encrypted with Rights Management Service RMS/IRM
> -------------------------------------------------------------------------
>
>                 Key: TIKA-3666
>                 URL: https://issues.apache.org/jira/browse/TIKA-3666
>             Project: Tika
>          Issue Type: Improvement
>          Components: metadata
>            Reporter: August Valera
>            Priority: Major
>
> Rights Management Service (RMS), implemented in MS Office as Information 
> Rights Management (IRM), allows organizations to set file permissions that 
> are stored within the file. In most cases, this will result in the file 
> getting a new extension (with a prefix p, such as {{.txt}} becoming 
> {{.ptxt}}), but in the case of MS Office and PDF files, which support 
> this natively, the implementation results in the file contents being 
> encrypted without any extension change. 
> h4. Current behavior
> Running such files through Tika produces the same results as an empty file 
> run through {{DefaultParser}} and {{OfficeParser}}.
> h4. Expected behavior
> Extract more metadata about the permissions necessary to view the file (if 
> possible), and throw {{EncryptedDocumentException}}, as is the case with 
> Office files encrypted in the more traditional manner.
> Reference: 
> [https://docs.microsoft.com/en-us/azure/information-protection/rms-client/clientv2-admin-guide-file-types#supported-file-types-for-classification-and-protection]



