encrypted files aren't handled properly
---------------------------------------
Key: TIKA-267
URL: https://issues.apache.org/jira/browse/TIKA-267
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.4
Environment: Ubuntu Linux 8.10, JRE 1.5
Reporter: Sascha Szott
Priority: Critical
While extracting full text from a batch of PDF documents, I noticed odd
behaviour in Tika when it processes encrypted documents (documents that
restrict specific actions, e.g. editing or printing). To extract content from
an encrypted PDF document, you do not have to decrypt the document in every
case. For instance, when creating an encrypted PDF document, the author can
allow content extraction without requiring a password. Unfortunately, Tika's
PDF parser is not aware of this at the moment. I therefore suggest a minor
change to the parse method of class org.apache.tika.parser.pdf.PDFParser:
add a check ("is copying allowed") before trying to decrypt the document.
To be more precise, here is a code snippet:
public void parse(...) throws ... {
    PDDocument pdfDocument = PDDocument.load(stream);
    try {
        // decrypt document only if copying is not allowed
        if (!pdfDocument.getCurrentAccessPermission().canExtractContent()) {
            if (pdfDocument.isEncrypted()) {
                try {
                    pdfDocument.decrypt("");
                } catch (Exception e) {
                    // Ignore
                }
            }
        }
        ...
Another solution would be to remove the "isEncrypted" check entirely, since
PDFBox seems to handle content extraction from encrypted documents correctly
(and throws an IOException in case of failure).
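To make the effect of the first suggestion easier to check, here is a minimal, self-contained sketch of the decision it makes. It does not use PDFBox: PdfHandle, StubPdf and DecryptPolicy are hypothetical names introduced only for illustration, standing in for the isEncrypted(), canExtractContent() and decrypt("") calls from the snippet. It only demonstrates when a decrypt attempt would happen, not how Tika's parser is implemented.

```java
// Hypothetical stand-in for the three PDFBox calls used in the proposed patch.
// This is NOT a real Tika or PDFBox type; it only makes the control flow testable.
interface PdfHandle {
    boolean isEncrypted();
    boolean canExtractContent();
    void decrypt(String password) throws Exception;
}

// In-memory stub that counts decrypt attempts (illustration only).
class StubPdf implements PdfHandle {
    private final boolean encrypted;
    private final boolean extractable;
    int decryptCalls = 0;

    StubPdf(boolean encrypted, boolean extractable) {
        this.encrypted = encrypted;
        this.extractable = extractable;
    }

    public boolean isEncrypted() { return encrypted; }
    public boolean canExtractContent() { return extractable; }
    public void decrypt(String password) { decryptCalls++; }
}

class DecryptPolicy {
    // Mirrors the proposed check: try an empty-password decrypt only when
    // extraction is not permitted and the document is encrypted.
    // Returns true if a decrypt attempt was made.
    static boolean decryptIfNeeded(PdfHandle doc) {
        if (doc.canExtractContent() || !doc.isEncrypted()) {
            return false;
        }
        try {
            doc.decrypt("");
        } catch (Exception e) {
            // Ignored, as in the proposed patch; extraction may still fail later.
        }
        return true;
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        for (boolean encrypted : flags) {
            for (boolean extractable : flags) {
                StubPdf doc = new StubPdf(encrypted, extractable);
                boolean attempted = decryptIfNeeded(doc);
                System.out.println("encrypted=" + encrypted
                        + " canExtract=" + extractable
                        + " -> decrypt attempted: " + attempted);
            }
        }
    }
}
```

Of the four combinations, only an encrypted document that does not permit extraction triggers a decrypt attempt. Encrypted documents that already permit extraction are passed on untouched, which is the case this report is about.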