[
https://issues.apache.org/jira/browse/TIKA-4082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733606#comment-17733606
]
Hudson commented on TIKA-4082:
------------------------------
SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1121 (See
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1121/])
TIKA-4082 (#1196) (github:
[https://github.com/apache/tika/commit/ba77113802157ad81c93a231fae562fcadd1140f])
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) CHANGES.txt
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* (add)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/testMicrosoftIRMServices.pdf
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
TIKA-4082 -- add annotation (tallison:
[https://github.com/apache/tika/commit/1e9f360efacb406e70a20781414a43181e3326ed])
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
> Extraction from Microsoft Sharepoint protected PDFs doesn't expose exception
> like other parsers.
> ------------------------------------------------------------------------------------------------
>
> Key: TIKA-4082
> URL: https://issues.apache.org/jira/browse/TIKA-4082
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.1
> Reporter: Carey Halton
> Priority: Minor
> Fix For: 2.8.1
>
> Attachments: MSFT Transcript FY23-Q3.pdf,
> PasswordProtectedWorkbook.xlsx, Screenshot from 2023-06-14 18-13-21.png,
> password protected pdf exception.txt, password protected xlsx exception.txt,
> screenshot-1.png
>
>
> I have attached a PDF file (see "MSFT Transcript FY23-Q3.pdf") from which we
> are attempting to extract content using Tika 2.4.1 (via Tika server).
> Because the file has had password protection added to it via the Microsoft
> Sharepoint service, instead of getting the actual file content, we get
> content that says this:
> "
> _This PDF file is protected_
>
> _You'll need a different reader in order to view this content:_
> _Download a compatible PDF reader._
>
> _This PDF Document has been protected._
> _The reader you are using does not support opening files protected by
> Microsoft Office_
> _http://go.microsoft.com/fwlink/?LinkID=231373_
> "
> That is fine, since the original content obviously cannot be accessed
> without the password. The response also carries an exception under
> "X-TIKA:EXCEPTION:embedded_warning"; it is attached in the file "password
> protected pdf exception.txt".
> But we were surprised to see any content at all. We test with a similar
> document (see attached "PasswordProtectedWorkbook.xlsx") that is password
> protected in a similar way, albeit an XLSX instead of a PDF. It doesn't
> return any content and throws an exception in
> "X-TIKA:EXCEPTION:container_exception" (attached in "password protected
> xlsx exception.txt"), which we currently treat as a failure mode, whereas we
> don't currently treat "X-TIKA:EXCEPTION:embedded_warning" as a failure.
> I realize these are different parsers, but since the scenario is very
> similar, should they not be treated in the same way, at least by voiding all
> content and emitting a proper failing exception instead of what appears to
> be considered only a warning? We are hesitant to treat all instances of
> "X-TIKA:EXCEPTION:embedded_warning" as failures, as we are unsure what other
> kinds of errors can be surfaced in that way. But it is clear to us that
> password protected files should be considered as failed to process. Thoughts?
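
For readers following the report, the client-side distinction it describes
can be sketched roughly as below. This is only an illustration, not the change
made in the commits listed above: the two metadata keys and the placeholder
sentence are taken from the report, while the function name `classify`, the
status strings, and the idea of matching on the placeholder text are
assumptions made for this example.

```python
from typing import Mapping

# Metadata keys quoted in the report.
CONTAINER_EXCEPTION = "X-TIKA:EXCEPTION:container_exception"
EMBEDDED_WARNING = "X-TIKA:EXCEPTION:embedded_warning"

# First line of the placeholder text quoted in the report. Matching on it
# is an assumption for this sketch, not documented Tika behaviour.
PROTECTED_PLACEHOLDER = "This PDF file is protected"


def classify(metadata: Mapping[str, str], content: str = "") -> str:
    """Return "failed", "warning", or "ok" for one extraction result."""
    # A container exception is already treated as a failure by the reporter.
    if CONTAINER_EXCEPTION in metadata:
        return "failed"
    if EMBEDDED_WARNING in metadata:
        # Escalate only when the extracted text is the protection notice,
        # so that unrelated embedded warnings are not turned into failures.
        if PROTECTED_PLACEHOLDER in content:
            return "failed"
        return "warning"
    return "ok"
```

Escalating only on the placeholder text keeps other embedded warnings
non-fatal, which is the concern raised in the last paragraph of the report.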
--
This message was sent by Atlassian Jira
(v8.20.10#820010)