[ 
https://issues.apache.org/jira/browse/PDFBOX-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158494#comment-17158494
 ] 

Andreas Lehmkühler commented on PDFBOX-4915:
--------------------------------------------

[~mkl] Thanks for the details.

PDFBox stumbles over the incorrect "0" offsets. It checks all xref offsets, 
and the wrong ones trigger a brute-force search. That search mixes up object 
2 0 because the file contains two versions of it, and unfortunately PDFBox 
picks the broken one. I'm inclined to skip objects with an offset of "0" when 
parsing the xref table. That works for the given object, but I'm not sure it 
would work for every (broken) PDF.
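
To illustrate the idea of skipping "0" offsets, here is a minimal standalone sketch. It is not PDFBox's parser code: the class name XrefZeroOffsetSketch and the helper parseSubsection are hypothetical, and it assumes classic xref entries of the form "offset generation n|f". An in-use ("n") entry with offset 0 cannot be valid, because byte 0 of the file holds the %PDF- header, so the sketch simply leaves such entries out of the offset map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not PDFBox code: shows the "skip offset 0" idea only.
public class XrefZeroOffsetSketch {

    /**
     * Parses the entry lines of one xref subsection and returns a map from
     * object number to byte offset. Free entries and in-use entries with an
     * offset of 0 are skipped.
     */
    static Map<Integer, Long> parseSubsection(int firstObjectNumber, String[] entryLines) {
        Map<Integer, Long> offsets = new LinkedHashMap<>();
        for (int i = 0; i < entryLines.length; i++) {
            String[] parts = entryLines[i].trim().split("\\s+");
            if (parts.length != 3) {
                continue; // malformed entry, ignore it in this sketch
            }
            long offset = Long.parseLong(parts[0]);
            boolean inUse = "n".equals(parts[2]);
            if (!inUse) {
                continue; // free entry ("f"), nothing to record
            }
            if (offset == 0) {
                continue; // in-use entry pointing at byte 0: treat as broken and skip
            }
            offsets.put(firstObjectNumber + i, offset);
        }
        return offsets;
    }

    public static void main(String[] args) {
        String[] entries = {
            "0000000000 65535 f",
            "0000000015 00000 n",
            "0000000000 00000 n", // broken: in use, but offset 0
            "0000000120 00000 n"
        };
        // prints {1=15, 3=120}
        System.out.println(parseSubsection(0, entries));
    }
}
```

In this sketch object 2 never enters the map, so a later lookup would have to find it some other way (for example through a previous xref section), which is exactly where the uncertainty about other broken files comes from.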

> "Page tree root must be a dictionary" on PDDocument.load
> --------------------------------------------------------
>
>                 Key: PDFBOX-4915
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4915
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.19
>            Reporter: Gauthier Roebroeck
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>         Attachments: Black Bullet - Volume 01 - Those Who Would Be Gods [Yen 
> Press][Kobo_Kitzoku].pdf, Screenshot 2020-07-14 at 20.19.40.png
>
>
> Hi,
> I have a PDF file that throws the following exception:
> {{java.io.IOException: Page tree root must be a dictionary at 
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198) 
> ~[pdfbox-2.0.19.jar:2.0.19] at 
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226) 
> ~[pdfbox-2.0.19.jar:2.0.19] at 
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1222) 
> ~[pdfbox-2.0.19.jar:2.0.19] at 
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1122) 
> ~[pdfbox-2.0.19.jar:2.0.19]}}
> This happens when loading the document from an InputStream.
> The document can be opened properly using Preview on Mac.
>  
> I have checked the PDF structure (even though I don't know it very well). 
> From what I can see, it could be because /Pages is not the first element 
> under /Root.
>  
> !Screenshot 2020-07-14 at 20.19.40.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

