[
https://issues.apache.org/jira/browse/TIKA-4082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733232#comment-17733232
]
Tim Allison edited comment on TIKA-4082 at 6/15/23 8:45 PM:
------------------------------------------------------------
A bit late, but I finally broke out the new spec, and 7.6.7 suggests that we
should check for a Collection dictionary with a view of {{H}} (hidden) in
addition to our current approach of looking at the AFRelationship.
As I look back at TIKA-3666, MS is doing to encrypted PDFs what they did to
encrypted OOXML: wrapping the encrypted document in an unencrypted package. So,
this is in the spec, and it is somewhat logically consistent with MS formats.
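A minimal sketch of that combined check, not the actual Tika implementation.
To stay self-contained it models the PDF catalog and the file specification
dictionaries as plain maps instead of PDFBox COS objects. The class and method
names are hypothetical, and the {{EncryptedPayload}} value for AFRelationship
is an assumption that should be verified against 7.6.7. Treating either signal
as sufficient is also an assumption.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: detects an unencrypted wrapper around an encrypted payload.
final class WrapperPdfCheck {

    // True if the catalog has a Collection dictionary whose View entry is H (hidden).
    static boolean hasHiddenCollection(Map<String, Object> catalog) {
        Object collection = catalog.get("Collection");
        if (!(collection instanceof Map)) {
            return false;
        }
        return "H".equals(((Map<?, ?>) collection).get("View"));
    }

    // True if any file specification marks its attachment as the encrypted payload.
    // Assumption: the relationship value is "EncryptedPayload".
    static boolean hasEncryptedPayload(List<Map<String, Object>> fileSpecs) {
        for (Map<String, Object> spec : fileSpecs) {
            if ("EncryptedPayload".equals(spec.get("AFRelationship"))) {
                return true;
            }
        }
        return false;
    }

    // Either signal alone flags the document as a wrapper.
    static boolean looksLikeWrapper(Map<String, Object> catalog,
                                    List<Map<String, Object>> fileSpecs) {
        return hasHiddenCollection(catalog) || hasEncryptedPayload(fileSpecs);
    }

    public static void main(String[] args) {
        Map<String, Object> catalog = Map.of("Collection", Map.of("View", "H"));
        List<Map<String, Object>> fileSpecs = List.of();
        System.out.println(looksLikeWrapper(catalog, fileSpecs));
    }
}
```

In a real parser the same two lookups would be made against the document
catalog and the embedded file specifications that PDFBox exposes.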
> Extraction from Microsoft Sharepoint protected PDFs doesn't expose exception
> like other parsers.
> ------------------------------------------------------------------------------------------------
>
> Key: TIKA-4082
> URL: https://issues.apache.org/jira/browse/TIKA-4082
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.1
> Reporter: Carey Halton
> Priority: Minor
> Attachments: MSFT Transcript FY23-Q3.pdf,
> PasswordProtectedWorkbook.xlsx, Screenshot from 2023-06-14 18-13-21.png,
> password protected pdf exception.txt, password protected xlsx exception.txt,
> screenshot-1.png
>
>
> I have attached a PDF file (see "MSFT Transcript FY23-Q3.pdf") that we are
> currently attempting to extract content from using Tika 2.4.1 (via Tika
> server), but since the file has had password protection added to it via
> Microsoft Sharepoint service, instead of getting the actual file content, we
> get content that says this:
> "
> _This PDF file is protected_
>
> _You'll need a different reader in order to view this content:_
> _Download a compatible PDF reader._
>
> _This PDF Document has been protected._
> _The reader you are using does not support opening files protected by
> Microsoft Office_
> _http://go.microsoft.com/fwlink/?LinkID=231373_
> "
> That is fine, since the original content obviously cannot be accessed
> without the password. Tika also reports an exception, visible in
> "X-TIKA:EXCEPTION:embedded_warning", which is attached in the file "password
> protected pdf exception.txt".
> But we were surprised to see any content at all, as we have a similar
> document (see attached "PasswordProtectedWorkbook.xlsx") that we test with
> that is password protected in a similar way, albeit an XLSX instead of a PDF.
> That document doesn't return any content and throws an exception in
> "X-TIKA:EXCEPTION:container_exception" (attached in "password protected xlsx
> exception.txt"), which we currently treat as a failure mode, whereas we don't
> currently treat "X-TIKA:EXCEPTION:embedded_warning" as a failure.
> I realize these are different parsers, but since it is a very similar
> scenario, should they not be treated in the same way, at least by voiding all
> content and emitting a proper failing exception instead of what appears
> to be considered just a warning? We are hesitant to treat all instances of
> "X-TIKA:EXCEPTION:embedded_warning" as failures, as we are unsure what other
> kinds of errors can be surfaced in that way. But it is clear to us that
> password protected files should be considered as failed to process. Thoughts?
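One client-side option, sketched under stated assumptions: always treat
"X-TIKA:EXCEPTION:container_exception" as a failure, and treat
"X-TIKA:EXCEPTION:embedded_warning" as a failure only when its text points at
encryption. The class name, the single-string metadata model, and the marker
string are all hypothetical. In particular, the marker assumes the warning's
stack trace names Tika's EncryptedDocumentException, which should be checked
against the attached exception text.

```java
import java.util.Map;

// Hypothetical client-side classifier for Tika server response metadata.
final class ExtractionStatus {

    static final String CONTAINER_EXCEPTION = "X-TIKA:EXCEPTION:container_exception";
    static final String EMBEDDED_WARNING = "X-TIKA:EXCEPTION:embedded_warning";

    // Assumption: an encryption-related warning mentions this exception class.
    static final String ENCRYPTION_MARKER = "EncryptedDocumentException";

    // A container exception always fails; an embedded warning fails only
    // when it contains the encryption marker.
    static boolean isFailure(Map<String, String> metadata) {
        if (metadata.containsKey(CONTAINER_EXCEPTION)) {
            return true;
        }
        String warning = metadata.get(EMBEDDED_WARNING);
        return warning != null && warning.contains(ENCRYPTION_MARKER);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = Map.of(
                EMBEDDED_WARNING,
                "org.apache.tika.exception.EncryptedDocumentException: (sample text)");
        System.out.println(isFailure(metadata));
    }
}
```

This keeps unrelated embedded warnings non-fatal while still failing on
password-protected content.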
--
This message was sent by Atlassian Jira
(v8.20.10#820010)