[ 
https://issues.apache.org/jira/browse/PDFBOX-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572238#comment-13572238
 ] 

Maruan Sahyoun commented on PDFBOX-1510:
----------------------------------------

Try the NonSequentialParser via PDDocument.loadNonSeq. It works fine for me.

Background: the standard parser works sequentially from the start to the end 
of a PDF file, i.e. all objects are read regardless of whether they are 
referenced by the Xref or not. This can lead to objects being accessed that 
are no longer valid. The NonSequentialParser parses objects by following the 
Xref, so only referenced objects are parsed.
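
To make the difference concrete, here is a toy sketch of the two strategies. It is not PDFBox code: the class XrefParsingSketch, the RawObject type, the offsets and the payload strings are all made up for illustration, and the real parsers do far more work. It assumes a file in which object 2 was rewritten by an incremental update, so a stale copy is still physically present but the Xref points only at the newer one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: hypothetical names, not the PDFBox implementation. */
class XrefParsingSketch {

    /** One object as it sits in the file: object number, byte offset, payload. */
    static final class RawObject {
        final int number;
        final long offset;
        final String payload;

        RawObject(int number, long offset, String payload) {
            this.number = number;
            this.offset = offset;
            this.payload = payload;
        }
    }

    /** Sequential strategy: visit every object in file order, referenced or not. */
    static List<String> parseSequentially(List<RawObject> file) {
        List<String> parsed = new ArrayList<>();
        for (RawObject o : file) {
            parsed.add(o.payload);
        }
        return parsed;
    }

    /** Xref-driven strategy: visit only objects located at an offset the Xref lists. */
    static List<String> parseFollowingXref(List<RawObject> file, Map<Integer, Long> xref) {
        List<String> parsed = new ArrayList<>();
        for (RawObject o : file) {
            Long offset = xref.get(o.number);
            if (offset != null && offset == o.offset) {
                parsed.add(o.payload);
            }
        }
        return parsed;
    }

    /** Made-up file layout: object 2 appears twice, the first copy is stale. */
    static List<RawObject> sampleFile() {
        List<RawObject> file = new ArrayList<>();
        file.add(new RawObject(1, 15L, "catalog"));
        file.add(new RawObject(2, 80L, "embedded file (stale copy)"));
        file.add(new RawObject(2, 400L, "embedded file (current copy)"));
        return file;
    }

    /** Made-up Xref: it references only the current copy of object 2. */
    static Map<Integer, Long> sampleXref() {
        Map<Integer, Long> xref = new LinkedHashMap<>();
        xref.put(1, 15L);
        xref.put(2, 400L);
        return xref;
    }

    public static void main(String[] args) {
        System.out.println("sequential: " + parseSequentially(sampleFile()));
        System.out.println("via xref:   " + parseFollowingXref(sampleFile(), sampleXref()));
    }
}
```

In this model the sequential pass also returns the stale copy of object 2, while the Xref-driven pass skips it. Whether that is exactly what goes wrong with the attached file is an assumption on my part, not something verified here.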
                
> PDF gets corrupted when trying to extract it from the embedded files
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-1510
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1510
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.7.1
>            Reporter: Andriy
>            Priority: Critical
>         Attachments: doesnt_work.pdf, works2.pdf
>
>
> When a PDF is attached to another PDF, it gets corrupted when retrieved 
> through the PDEmbeddedFile.getByteArray() method call. For some reason the 
> returned array contains less data than the original file that was attached 
> to the PDF.
> This affects some documents but not others. Below is the test code 
> that replicates the issue.
> A PDF with an attachment that gets corrupted will be attached to the issue.
> public class PDFEmbeddedFiles {
>     private PDFEmbeddedFiles() {
>     }
>     public static void main(String[] args) throws Exception {
>         if (args.length != 1) {
>             usage();
>             System.exit(1);
>         } else {
>             PDDocument document = null;
>             try {
>                 File pdfFile = new File(args[0]);
>                 /*
>                 String filePath = pdfFile.getParent()
>                         + System.getProperty("file.separator");
>                 */
>                 document = PDDocument.load(pdfFile);
>                 if (document.isEncrypted()) {
>                     try {
>                         document.decrypt("");
>                     } catch (InvalidPasswordException e) {
>                         System.err.println("Error: The document is encrypted.");
>                     } catch (org.apache.pdfbox.exceptions.CryptographyException e) {
>                         e.printStackTrace();
>                     }
>                 }
>
>                 PDDocumentNameDictionary namesDictionary =
>                         document.getDocumentCatalog().getNames(); //new PDDocumentNameDictionary(document.getDocumentCatalog());
>                 PDEmbeddedFilesNameTreeNode efTree = namesDictionary.getEmbeddedFiles();
>                 if (efTree != null) {
>                     Map<String, Object> names = efTree.getNames();
>                     Iterator<String> namesKeys = names.keySet().iterator();
>                     while (namesKeys.hasNext()) {
>                         String filename = namesKeys.next();
>                         PDComplexFileSpecification fileSpec =
>                                 (PDComplexFileSpecification) names.get(filename);
>                         PDEmbeddedFile embeddedFile = fileSpec.getEmbeddedFile();
>                         String embeddedFilename = filename;//filePath + filename;
>                         File file = new File(filename);//filePath + filename);
>                         System.out.println("Writing " + embeddedFilename);
>                         FileOutputStream fos = new FileOutputStream(file);
>
>                         fos.write(embeddedFile.getByteArray());
>                         fos.close();
>                     }
>                 }
>             } finally {
>                 if (document != null) {
>                     document.close();
>                 }
>             }
>         }
>     }
>     /**
>      * This will print the usage for this program.
>      */
>     private static void usage() {
>         System.err.println("Usage: java "
>                 + PDFEmbeddedFiles.class.getName() + " <input-pdf>");
>     }
> }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
