[ https://issues.apache.org/jira/browse/TIKA-4082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732773#comment-17732773 ]
Tim Allison edited comment on TIKA-4082 at 6/14/23 10:23 PM:
-------------------------------------------------------------

Looks like we can be fairly general and throw an EncryptedDocumentException if the container file is a "Collection" and one of the embedded files has an Associated File Relationship (AFRelationship) of EncryptedPayload?

What I don't like about this solution is that there theoretically could be other interesting content that someone might want from a document including other un-encrypted embedded files.

Fellow devs (esp. [~tilman]), any thoughts?


was (Author: talli...@mitre.org):
Looks like we can be fairly general and throw an EncryptedDocumentException if the container file is a "Collection" and one of the embedded files has an Associated File Relationship (AFRelationship) of EncryptedPayload?

What I don't like about this solution is that there theoretically could be other interesting content that someone might want from a document including other un-encrypted embedded files.

Fellow devs, any thoughts?

> Extraction from Microsoft Sharepoint protected PDFs doesn't expose exception
> like other parsers.
> ------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4082
>                 URL: https://issues.apache.org/jira/browse/TIKA-4082
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.1
>            Reporter: Carey Halton
>            Priority: Minor
>         Attachments: MSFT Transcript FY23-Q3.pdf,
> PasswordProtectedWorkbook.xlsx, Screenshot from 2023-06-14 18-13-21.png,
> password protected pdf exception.txt, password protected xlsx exception.txt
>
>
> I have attached a PDF file (see "MSFT Transcript FY23-Q3.pdf") that we are
> currently attempting to extract content from using Tika 2.4.1 (via Tika
> server), but since the file has had password protection added to it via
> Microsoft Sharepoint service, instead of getting the actual file content, we
> get content that says this:
> "
> _This PDF file is protected_
> __
> _You'll need a different reader in order to view this content:_
> _Download a compatible PDF reader._
> __
> _This PDF Document has been protected._
> _The reader you are using does not support opening files protected by
> Microsoft Office_
> _http://go.microsoft.com/fwlink/?LinkID=231373_
> "
> Which is fine since the original content can obviously not be accessed
> without the password. It also throws an exception that we can see in
> "X-TIKA:EXCEPTION:embedded_warning" that is attached in the file "password
> protected pdf exception.txt".
> But we were surprised that we see any content at all as we have a similar
> document (see attached "PasswordProtectedWorkbook.xlsx") that we test with
> that is password protected in a similar way, albeit an XLSX instead of a PDF,
> that doesn't return any content and throws an exception in
> "X-TIKA:EXCEPTION:container_exception" (attached in "password protected
> xlsx.txt"), which we currently treat as a failure mode, whereas we don't
> currently treat "X-TIKA:EXCEPTION:embedded_warning" as a failure.
> I realize these are different parsers, but since it is a very similar
> scenario, should they not be treated in the same way, at least voiding all
> content and emitting a proper failing exception instead of just what appears
> to be considered a warning? We are hesitant to treat all instances of
> "X-TIKA:EXCEPTION:embedded_warning" as failures as we are unsure what other
> kinds of errors can be surfaced in that way. But it is clear to us that
> password protected files should be considered as failed to process. Thoughts?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
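
For discussion, the rule proposed in the comment (container is a "Collection" and at least one embedded file has an AFRelationship of EncryptedPayload) could be sketched roughly as below. This is only an illustration of the decision logic, not the Tika or PDFBox implementation: `EmbeddedFile`, `EncryptedPayloadException`, `hasEncryptedPayload` and `check` are hypothetical names, and a real patch would read these values from the PDF and throw Tika's own EncryptedDocumentException.

```java
import java.util.List;

// Sketch of the proposed detection rule. All types here are hypothetical
// stand-ins; nothing below is an actual Tika or PDFBox class.
public class EncryptedPayloadCheck {

    /** Hypothetical view of one embedded file and its AFRelationship value. */
    static final class EmbeddedFile {
        final String name;
        final String afRelationship; // may be null if the entry is absent

        EmbeddedFile(String name, String afRelationship) {
            this.name = name;
            this.afRelationship = afRelationship;
        }
    }

    /** Hypothetical stand-in for Tika's EncryptedDocumentException. */
    static final class EncryptedPayloadException extends Exception {
        EncryptedPayloadException(String message) {
            super(message);
        }
    }

    /**
     * True only when the container is a Collection AND at least one
     * embedded file is marked as an EncryptedPayload.
     */
    static boolean hasEncryptedPayload(boolean isCollection, List<EmbeddedFile> files) {
        if (!isCollection) {
            return false;
        }
        for (EmbeddedFile file : files) {
            if ("EncryptedPayload".equals(file.afRelationship)) {
                return true;
            }
        }
        return false;
    }

    /** Fails the whole document when the rule matches. */
    static void check(boolean isCollection, List<EmbeddedFile> files)
            throws EncryptedPayloadException {
        if (hasEncryptedPayload(isCollection, files)) {
            throw new EncryptedPayloadException(
                    "Collection contains an embedded file marked EncryptedPayload");
        }
    }

    public static void main(String[] args) {
        List<EmbeddedFile> files = List.of(
                new EmbeddedFile("wrapper-notice.txt", null),
                new EmbeddedFile("payload.bin", "EncryptedPayload"));
        try {
            check(true, files);
            System.out.println("no encrypted payload detected");
        } catch (EncryptedPayloadException e) {
            System.out.println("treated as encrypted: " + e.getMessage());
        }
    }
}
```

Note that failing as soon as one payload matches is exactly the trade-off raised in the comment: any other un-encrypted embedded files in the same container would be discarded along with it.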