ClassCastException in PdfParser on encrypted PDF with empty password
--------------------------------------------------------------------

                 Key: NUTCH-643
                 URL: https://issues.apache.org/jira/browse/NUTCH-643
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0
         Environment: This problem affects the current trunk too.
            Reporter: Guillaume Smet


Hi,

If a PDF document is encrypted with an empty password, the PdfParser should 
decrypt it using the empty password.

This behaviour is currently implemented with the following code:

      if (pdf.isEncrypted()) {
        DocumentEncryption decryptor = new DocumentEncryption(pdf);
        //Just try using the default password and move on
        decryptor.decryptDocument("");
      }

This code uses a deprecated API. Moreover, there seems to be a bug in PDFBox 
in this deprecated API: decryptDocument() throws a ClassCastException, as 
shown in the following log:

2008-08-07 19:15:56,860 WARN  parse.pdf - General exception in PDF parser: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
2008-08-07 19:15:56,862 WARN  parse.pdf - java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
2008-08-07 19:15:56,862 WARN  parse.pdf - at org.pdfbox.encryption.DocumentEncryption.decryptDocument(DocumentEncryption.java:197)
2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:98)
2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
2008-08-07 19:15:56,862 WARN  parse.pdf - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
2008-08-07 19:15:56,874 WARN  fetcher.Fetcher - Error parsing: http://www2.culture.gouv.fr/deps/fr/stateurope071.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary cannot be cast to org.pdfbox.pdmodel.encryption.PDStandardEncryption
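
As an illustration only (this is not PDFBox code), the kind of failure 
reported in the log can be reproduced with stand-in classes. The sketch 
assumes that the standard-encryption type is a subtype of the general 
encryption dictionary and that the deprecated code downcasts unconditionally; 
the names EncryptionDictionary, StandardEncryption and CastDemo are 
hypothetical.

```java
// Stand-in types, NOT the real PDFBox classes. They only mirror the assumed
// relationship "standard encryption is a subtype of the encryption dictionary".
class EncryptionDictionary {}

class StandardEncryption extends EncryptionDictionary {}

class CastDemo {
    // Returns true when the downcast to the subtype fails at runtime.
    static boolean downcastFails(EncryptionDictionary dict) {
        try {
            // An unconditional downcast: valid only if dict really is a
            // StandardEncryption instance.
            StandardEncryption ignored = (StandardEncryption) dict;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A plain dictionary instance cannot be cast to the subtype.
        System.out.println(downcastFails(new EncryptionDictionary()));
        // An instance of the subtype casts without error.
        System.out.println(downcastFails(new StandardEncryption()));
    }
}
```

If the assumption holds, the object handed to the deprecated decryption code 
is of the more general type, so the cast fails before the password is used.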

Using the new security API, we don't have any error parsing this document and 
we can get its content:
      if (pdf.isEncrypted()) {
        // Just try using the default password and move on
        pdf.openProtection(new StandardDecryptionMaterial(""));
      }

I have attached a patch fixing this problem: it works perfectly with the above 
document and gets rid of the deprecated API.

Regards,

-- 
Guillaume

