[ 
https://issues.apache.org/jira/browse/PDFBOX-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evgeny Chesnokov updated PDFBOX-3238:
-------------------------------------
    Description: 
Attached is a sample file with a single image on its 1st page. When I 
append the 1st page of the loaded document to a new document, the new document 
does not contain the image (the page is displayed blank; Acrobat Reader says 
the file is broken).

Steps to reproduce:
1. load the attached PDF file using PDFBox (checked versions 1.8.11 and 
2.0.0-RC2, tried both {{#load()}} and {{#loadNonSeq()}})
2. create a new document
3. add a page from the loaded document to the new document
4. save the new document to a new file.

Expected: a new PDF file gets created; when opened, it contains the image on 
the 1st page.
Actual behaviour: a new PDF file gets created; when opened, the 1st page is 
empty and Acrobat Reader reports an error ("An error exists on this page. 
Acrobat may not display the page correctly.").

Code to reproduce the issue for version 1.8.11:
{code}
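        // Needs: java.io.File, org.apache.pdfbox.pdmodel.PDDocument,
        //        org.apache.pdfbox.pdmodel.PDPage
        PDDocument source = PDDocument.load(new File("Welding Fixture Model.dwg.pdf"));
        // Take the 1st page; it has no /Resources entry of its own
        PDPage page = (PDPage) source.getDocumentCatalog().getAllPages().get(0);

        PDDocument destination = new PDDocument();
        destination.addPage(page);

        destination.save("Welding Fixture Model.dwg.page0.pdf");
        destination.close();
        // Close the source only after saving; the page still refers to its data
        source.close();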
{code}

==========

Research summary: I've decoded the attached PDF using the {{qpdf}} utility and 
investigated its structure. Basically, there's no {{/Resources}} entry in 
the {{/Page}} object, so it should be inherited from the {{/Pages}} object. 
Instead it is replaced with an empty resources object, so the saved file does 
not have the image in it.

Research details:

Below are pieces of a decoded structure of the attached PDF.

*Pages list declaration:*
{noformat}
3 0 obj
<<
  /Count 1
  /Kids [
    4 0 R
  ]
  /Resources 5 0 R
  /Type /Pages
>>
endobj
{noformat}
Explanation:
 - {{/Type /Pages}} says this object is a list of pages;
 - {{/Kids}} is an array of references to the individual page objects. In this 
case, object #4 is the only page in a document;
 - {{/Resources 5 0 R}} stores a reference to a single resource that is used by 
the {{/Pages}} object. This is object #5, an image.

*1st page declaration:*
{noformat}
4 0 obj
<<
  /Contents 6 0 R
  /MediaBox [
    0
    0
    1984
    2551
  ]
  /Parent 3 0 R
  /Type /Page
>>
endobj
{noformat}
Explanation:
 - {{/Type /Page}} says it's a page (duh);
 - {{/Contents 6 0 R}} references object #6, which is used to render the 
content of the page (I won't include it here, but it uses the image object #5 
mentioned above);
 - {{/Parent 3 0 R}} is a reference to a {{/Pages}} object described above.

The important thing here is that this object does not have a {{/Resources}} 
entry of its own. For this case, the PDF spec says:
bq. (Required; inheritable) A dictionary containing any resources required by 
the page (see 7.8.3, "Resource Dictionaries"). If the page requires no 
resources, the value of this entry shall be an empty dictionary. *Omitting the 
entry entirely indicates that the resources shall be inherited from an ancestor 
node in the page tree*.

This last sentence means that page 1 uses the resources of its parent 
{{/Pages}} object, and this is where PDFBox misbehaves: when exporting a 
page with no {{/Resources}} entry, it uses an *EMPTY* resources dictionary 
instead of the inherited one.
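
To make the inheritance rule concrete, here is a minimal sketch of the lookup a reader is expected to perform. It is a toy model, not PDFBox code: dictionaries are plain Java maps, references such as "5 0 R" are plain strings, and the class and method names ({{InheritedLookup}}, {{findInherited}}) are made up for illustration. It does not guard against cyclic {{/Parent}} links.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of PDF dictionaries as plain maps; this is NOT PDFBox code.
class InheritedLookup {

    /**
     * Returns the value stored under key in the node itself or, if the
     * node omits it, in the nearest ancestor reached via "Parent".
     * Returns null when no node on the way up defines the key.
     */
    @SuppressWarnings("unchecked")
    static Object findInherited(Map<String, Object> node, String key) {
        Map<String, Object> current = node;
        while (current != null) {
            if (current.containsKey(key)) {
                return current.get(key);
            }
            Object parent = current.get("Parent");
            current = (parent instanceof Map) ? (Map<String, Object>) parent : null;
        }
        return null;
    }

    public static void main(String[] args) {
        // Object 3: the /Pages node, which carries /Resources
        Map<String, Object> pages = new HashMap<>();
        pages.put("Type", "Pages");
        pages.put("Resources", "5 0 R");

        // Object 4: the /Page node, which has no /Resources of its own
        Map<String, Object> page = new HashMap<>();
        page.put("Type", "Page");
        page.put("Contents", "6 0 R");
        page.put("Parent", pages);

        System.out.println(findInherited(page, "Resources")); // prints 5 0 R
    }
}
```

With the sample file's structure, the lookup for the page ends at the {{/Pages}} object, which is the behaviour I expected when exporting the page.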

To verify this, I added a {{/Resources 5 0 R}} line to the 1st page 
declaration of the sample PDF:
{noformat}
4 0 obj
<<
  /Contents 6 0 R
  /MediaBox [
    0
    0
    1984
    2551
  ]
  /Parent 3 0 R
  /Resources 5 0 R
  /Type /Page
>>
endobj
{noformat}
After I did this, PDFBox successfully extracted the 1st page of this document 
and the image was displayed correctly.
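
The manual edit above amounts to writing the inherited entries onto the page itself before it leaves its page tree. Below is a rough sketch of that idea in the same kind of toy model (plain maps, string references). The names {{DetachPage}}, {{lookUp}} and {{detach}} are hypothetical, and this is not how PDFBox is implemented; it only illustrates the step I would expect to happen.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: PDF dictionaries are plain maps, references are strings.
// None of this is PDFBox API.
class DetachPage {

    // The page attributes that the PDF spec marks as inheritable.
    static final String[] INHERITABLE = {"Resources", "MediaBox", "CropBox", "Rotate"};

    /** Walks "Parent" links upward and returns the first value found, or null. */
    @SuppressWarnings("unchecked")
    static Object lookUp(Map<String, Object> node, String key) {
        Map<String, Object> current = node;
        while (current != null) {
            if (current.containsKey(key)) {
                return current.get(key);
            }
            Object parent = current.get("Parent");
            current = (parent instanceof Map) ? (Map<String, Object>) parent : null;
        }
        return null;
    }

    /**
     * Returns a copy of the page in which every inheritable attribute the
     * page omitted is filled in from its ancestors, and "Parent" is dropped
     * so the copy can be attached to another page tree.
     */
    static Map<String, Object> detach(Map<String, Object> page) {
        Map<String, Object> copy = new HashMap<>(page);
        for (String key : INHERITABLE) {
            if (!copy.containsKey(key)) {
                Object inherited = lookUp(page, key);
                if (inherited != null) {
                    copy.put(key, inherited);
                }
            }
        }
        copy.remove("Parent");
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Object> pages = new HashMap<>();
        pages.put("Type", "Pages");
        pages.put("Resources", "5 0 R");

        Map<String, Object> page = new HashMap<>();
        page.put("Type", "Page");
        page.put("Contents", "6 0 R");
        page.put("Parent", pages);

        Map<String, Object> detached = detach(page);
        System.out.println(detached.get("Resources"));      // prints 5 0 R
        System.out.println(detached.containsKey("Parent")); // prints false
    }
}
```

The detached copy ends up with the same {{/Resources 5 0 R}} entry that I added by hand, while the original page dictionary is left untouched.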

  was:
Attached is a sample file with a single image on the 1st page in it. When I 
append the 1st page of a loaded document to a new document, the new document 
does not have an image in it (displayed as a blank page; Acrobat Reader says 
the file is broken).

Steps to reproduce:
1. load an attached PDF file using PdfBox (checked versions 1.8.11 and 
2.0.0-RC2)
2. create a new document
3. add a page from a loaded document to a new document
4. save a document to a new file.

Expected: a new PDF file gets created, when opened, it contains an image on the 
1st page.
Actual behaviour: a new PDF file gets created, when opened, the 1st page is 
empty and Acrobat Reader reports an error ("An error exists on this page. 
Acrobat may not display the page correctly.").

Code to reproduce the issue for version 1.8.11:
{code}
        PDDocument source = PDDocument.load(new File("Welding Fixture Model.dwg.pdf"));
        PDPage page = (PDPage) source.getDocumentCatalog().getAllPages().get(0);
        
        PDDocument destination = new PDDocument();
        destination.addPage(page);

        destination.save("Welding Fixture Model.dwg.page0.pdf");
        destination.close();
{code}

==========

Research summary: I've decoded the attached PDF using {{qpdf}} utility and  
investigated its structure. Basically, there's no {{/Resources}} declaration in 
a {{/Page}} object, so it should get inherited from a {{/Pages}} object. 
Instead it is replaced with an empty resources object, so when saved, it does 
not have an image in it.

Research details:

Below are pieces of a decoded structure of the attached PDF.

*Pages list declaration:*
{noformat}
3 0 obj
<<
  /Count 1
  /Kids [
    4 0 R
  ]
  /Resources 5 0 R
  /Type /Pages
>>
endobj
{noformat}
Explanation:
 - {{/Type /Pages}} says this object is a list of pages;
 - {{/Kids}} is an array of references to the individual page objects. In this 
case, object #4 is the only page in a document;
 - {{/Resources 5 0 R}} stores a reference to a single resource that is used by 
the {{/Pages}} object. This is object #5, an image.

*1st page declaration:*
{noformat}
4 0 obj
<<
  /Contents 6 0 R
  /MediaBox [
    0
    0
    1984
    2551
  ]
  /Parent 3 0 R
  /Type /Page
>>
endobj
{noformat}
Explanation:
 - {{/Type /Page}} says it's a page (duh);
 - {{/Contents 6 0 R}} references an object #6 that is used to render the 
content of the page (I won't provide it but it uses the image object #5 
mentioned above);
 - {{/Parent 3 0 R}} is a reference to a {{/Pages}} object described above.

An important thing here is that this object does not have a {{/Resources}} 
section of its own. In this case, PDF spec says:
bq. (Required; inheritable) A dictionary containing any resources required by 
the page (see 7.8.3, "Resource Dictionaries"). If the page requires no 
resources, the value of this entry shall be an empty dictionary. *Omitting the 
entry entirely indicates that the resources shall be inherited from an ancestor 
node in the page tree*.

This last sentence means that Page 1 has the same list of resources as its 
parent /Pages object, and this is where PdfBox misbehaves. When exporting a 
page with no {{/Resources}} tag, it uses an **EMPTY** list of resources instead 
of an inherited one.

To verify this, I've added {{/Resources 5 0 R}} line to the sample PDF 1st page 
declaration:
{noformat}
4 0 obj
<<
  /Contents 6 0 R
  /MediaBox [
    0
    0
    1984
    2551
  ]
  /Parent 3 0 R
  /Resources 5 0 R
  /Type /Page
>>
endobj
{noformat}
After I did this, PdfBox successfully extracted the 1st page of this document 
and it correctly displayed an image.


> Page resources are not inherited from an ancestor node in the page tree
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-3238
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3238
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 1.8.11, 2.0.0
>         Environment: Found on Windows 7 x64
>            Reporter: Evgeny Chesnokov
>         Attachments: Welding Fixture Model.dwg.pdf
>


