[ 
https://issues.apache.org/jira/browse/TIKA-4082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732777#comment-17732777
 ] 

Tim Allison commented on TIKA-4082:
-----------------------------------

[~chalton] given that there might be other useful info in the file, are you ok 
if we throw the encrypted document exception after processing the parts that 
aren't encrypted?
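
For illustration, a minimal sketch of the behaviour proposed above, in plain Java with no Tika dependency: handle every part that can be read, remember the first encryption failure, and throw it only once the readable parts have been processed. The names DeferredEncryptionSketch, Part and EncryptedPartException are hypothetical stand-ins and are not Tika's API.

```java
import java.util.List;

class DeferredEncryptionSketch {

    /** Hypothetical stand-in for an "encrypted document" exception. */
    static class EncryptedPartException extends Exception {
        EncryptedPartException(String message) {
            super(message);
        }
    }

    /** Hypothetical document part: a name, its text, and an encrypted flag. */
    static class Part {
        final String name;
        final String text;
        final boolean encrypted;

        Part(String name, String text, boolean encrypted) {
            this.name = name;
            this.text = text;
            this.encrypted = encrypted;
        }
    }

    /**
     * Appends the text of every readable part to out, then throws if any
     * part was encrypted. The caller keeps whatever was collected in out.
     */
    static void extractAll(List<Part> parts, List<String> out)
            throws EncryptedPartException {
        EncryptedPartException deferred = null;
        for (Part part : parts) {
            if (part.encrypted) {
                // Remember only the first failure and keep going.
                if (deferred == null) {
                    deferred = new EncryptedPartException(
                            "encrypted part: " + part.name);
                }
                continue;
            }
            out.add(part.text);
        }
        if (deferred != null) {
            throw deferred;
        }
    }
}
```

With this shape the caller still receives the unencrypted text, and the failure is reported as an exception rather than only as a warning.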

> Extraction from Microsoft Sharepoint protected PDFs doesn't expose exception 
> like other parsers.
> ------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-4082
>                 URL: https://issues.apache.org/jira/browse/TIKA-4082
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.1
>            Reporter: Carey Halton
>            Priority: Minor
>         Attachments: MSFT Transcript FY23-Q3.pdf, 
> PasswordProtectedWorkbook.xlsx, Screenshot from 2023-06-14 18-13-21.png, 
> password protected pdf exception.txt, password protected xlsx exception.txt
>
>
> I have attached a PDF file (see "MSFT Transcript FY23-Q3.pdf") that we are 
> currently attempting to extract content from using Tika 2.4.1 (via Tika 
> server), but since the file has had password protection added to it via 
> Microsoft Sharepoint service, instead of getting the actual file content, we 
> get content that says this:
> "
> _This PDF file is protected_ 
>   __  
> _You'll need a different reader in order to view this content:_ 
> _Download a compatible PDF reader._ 
>   __  
> _This PDF Document has been protected._ 
> _The reader you are using does not support opening files protected by 
> Microsoft Office_ 
> _http://go.microsoft.com/fwlink/?LinkID=231373_
> "
> Which is fine since the original content can obviously not be accessed 
> without the password. It also throws an exception that we can see in 
> "X-TIKA:EXCEPTION:embedded_warning" that is attached in the file "password 
> protected pdf exception.txt".
> But we were surprised to see any content at all, because we test with a 
> similar document (see attached "PasswordProtectedWorkbook.xlsx") that is 
> password protected in a similar way, albeit an XLSX instead of a PDF. It 
> doesn't return any content and throws an exception in 
> "X-TIKA:EXCEPTION:container_exception" (attached in "password protected 
> xlsx exception.txt"), which we currently treat as a failure mode, whereas we 
> don't currently treat "X-TIKA:EXCEPTION:embedded_warning" as a failure.
> I realize these are different parsers, but since it is a very similar 
> scenario, should they not be treated in the same way, at least voiding all 
> content and emitting a proper failing exception instead of just what appears 
> to be considered a warning? We are hesitant to treat all instances of 
> "X-TIKA:EXCEPTION:embedded_warning" as failures as we are unsure what other 
> kinds of errors can be surfaced in that way. But it is clear to us that 
> password protected files should be considered as failed to process. Thoughts?
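
As a rough sketch of the client-side handling described in the report, the check below treats any "X-TIKA:EXCEPTION:container_exception" as a failure and treats "X-TIKA:EXCEPTION:embedded_warning" as a failure only when its text contains a caller-supplied marker. The class name TikaFailureCheckSketch and the flat Map of metadata are hypothetical, and which marker string actually appears in the attached stack trace is an assumption that would need checking against the real output.

```java
import java.util.Map;

class TikaFailureCheckSketch {

    // Metadata keys as quoted in the issue description.
    static final String CONTAINER_EXCEPTION =
            "X-TIKA:EXCEPTION:container_exception";
    static final String EMBEDDED_WARNING =
            "X-TIKA:EXCEPTION:embedded_warning";

    /**
     * Returns true if the metadata should be treated as a failed extraction.
     * encryptionMarker is an assumed substring of the warning text, for
     * example an exception class name; it is not confirmed by the report.
     */
    static boolean isFailure(Map<String, String> metadata,
                             String encryptionMarker) {
        if (metadata.containsKey(CONTAINER_EXCEPTION)) {
            return true;
        }
        String warning = metadata.get(EMBEDDED_WARNING);
        // Other embedded warnings stay non-fatal.
        return warning != null && warning.contains(encryptionMarker);
    }
}
```

This keeps unrelated embedded warnings non-fatal, which addresses the hesitation about failing on every "X-TIKA:EXCEPTION:embedded_warning", at the cost of depending on the wording of the warning.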



--
This message was sent by Atlassian Jira
(v8.20.10#820010)