[
https://issues.apache.org/jira/browse/TIKA-4082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733212#comment-17733212
]
Tim Allison commented on TIKA-4082:
-----------------------------------
[~chalton] are you ok if we add the file you attached to our unit tests? Any
chance you could create a smaller file?
> Extraction from Microsoft Sharepoint protected PDFs doesn't expose exception
> like other parsers.
> ------------------------------------------------------------------------------------------------
>
> Key: TIKA-4082
> URL: https://issues.apache.org/jira/browse/TIKA-4082
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.1
> Reporter: Carey Halton
> Priority: Minor
> Attachments: MSFT Transcript FY23-Q3.pdf,
> PasswordProtectedWorkbook.xlsx, Screenshot from 2023-06-14 18-13-21.png,
> password protected pdf exception.txt, password protected xlsx exception.txt,
> screenshot-1.png
>
>
> I have attached a PDF file (see "MSFT Transcript FY23-Q3.pdf") that we are
> currently attempting to extract content from using Tika 2.4.1 (via Tika
> server), but since the file has had password protection added to it via
> Microsoft Sharepoint service, instead of getting the actual file content, we
> get content that says this:
> "
> _This PDF file is protected_
> __
> _You'll need a different reader in order to view this content:_
> _Download a compatible PDF reader._
> __
> _This PDF Document has been protected._
> _The reader you are using does not support opening files protected by
> Microsoft Office_
> _http://go.microsoft.com/fwlink/?LinkID=231373_
> "
> Which is fine since the original content can obviously not be accessed
> without the password. It also throws an exception that we can see in
> "X-TIKA:EXCEPTION:embedded_warning" that is attached in the file "password
> protected pdf exception.txt".
> But we were surprised that we see any content at all, as we have a similar
> document (see attached "PasswordProtectedWorkbook.xlsx") that we test with
> that is password protected in a similar way, albeit an XLSX instead of a PDF,
> that doesn't return any content and throws an exception in
> "X-TIKA:EXCEPTION:container_exception" (attached in "password protected
> xlsx exception.txt"), which we currently treat as a failure mode, whereas we
> don't currently treat "X-TIKA:EXCEPTION:embedded_warning" as a failure.
> I realize these are different parsers, but since it is a very similar
> scenario, should they not be treated in the same way, at least voiding all
> content and emitting a proper failing exception instead of just what appears
> to be considered a warning? We are hesitant to treat all instances of
> "X-TIKA:EXCEPTION:embedded_warning" as failures, as we are unsure what other
> kinds of errors can be surfaced in that way. But it is clear to us that
> password-protected files should be considered as having failed to process. Thoughts?
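For illustration only, the client-side distinction described in the report could be sketched roughly as below. It assumes the metadata for one document has already been parsed into a dict keyed by the metadata names quoted above, with values that are either a string or a list of strings. The names `is_failed` and `fatal_markers` are hypothetical, and the marker string in the usage lines is a placeholder, not text taken from the attached stack traces.

```python
from typing import Any, Iterable, List, Mapping

# Metadata keys quoted in the issue description.
CONTAINER_EXCEPTION = "X-TIKA:EXCEPTION:container_exception"
EMBEDDED_WARNING = "X-TIKA:EXCEPTION:embedded_warning"


def _as_list(value: Any) -> List[str]:
    """Normalize a metadata value (missing, string, or list) to non-empty strings."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value] if value else []
    return [str(v) for v in value if v]


def is_failed(metadata: Mapping[str, Any],
              fatal_markers: Iterable[str] = ()) -> bool:
    """Hypothetical helper: decide whether a document counts as failed.

    Any container exception is a failure. An embedded warning is a failure
    only if it contains one of the caller-supplied marker substrings, so
    other kinds of warnings are not swept up.
    """
    if _as_list(metadata.get(CONTAINER_EXCEPTION)):
        return True
    markers = list(fatal_markers)
    warnings = _as_list(metadata.get(EMBEDDED_WARNING))
    return any(marker in warning for warning in warnings for marker in markers)


# Usage with made-up values; "SomeEncryptionMarker" is a placeholder.
xlsx_meta = {CONTAINER_EXCEPTION: "stack trace text"}
pdf_meta = {EMBEDDED_WARNING: ["... SomeEncryptionMarker ..."]}
print(is_failed(xlsx_meta))
print(is_failed(pdf_meta))
print(is_failed(pdf_meta, fatal_markers=["SomeEncryptionMarker"]))
```

Keeping the markers as a caller-supplied list reflects the concern in the report: only known password-protection warnings are promoted to failures, while other embedded warnings stay warnings.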
--
This message was sent by Atlassian Jira
(v8.20.10#820010)