Hi,

I ran into an issue while trying to open a PDF file generated by
Word 365 with an old version of PDFBox (2.x).

I was able to reproduce the same error on the latest version of PDFBox
(3.0.6+) with sample code that just tries to open such a PDF file.

The error reported is:
Jan 10, 2026 2:13:23 PM org.apache.pdfbox.pdfparser.COSParser validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 827917, length: 2118, expected end position: 830035
Jan 10, 2026 2:13:23 PM org.apache.pdfbox.pdfparser.COSParser validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 934902, length: 1097, expected end position: 935999
Exception in thread "main" java.io.IOException: Page tree root must be a dictionary
    at org.apache.pdfbox.pdfparser.COSParser.checkPages(COSParser.java:1416)
    at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:120)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:171)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:136)
    at org.apache.pdfbox.Loader.loadPDF(Loader.java:483)
    at org.apache.pdfbox.Loader.loadPDF(Loader.java:359)

Looking further at the code, the parser falls back to 'brute force'
parsing and then does not find the expected Pages entry.

What happens is that
org.apache.pdfbox.pdfparser.PDFXrefStreamParser#parse() is called and, at
some stage, records an xref entry with offset 0. That entry is later
checked by org.apache.pdfbox.pdfparser.COSParser#validateXrefOffsets(),
which cannot resolve an object at that offset
(see org.apache.pdfbox.pdfparser.COSParser#findObjectKey). The parser
therefore resets the parsing, which triggers the 'brute force' approach.
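
To make the failing check concrete, here is a minimal standalone sketch of
the idea (not PDFBox's actual implementation): it tests whether the bytes
at a recorded offset start with the expected "<objNr> <gen> obj" header.
The class and method names are hypothetical and the sample bytes are made
up for illustration. An offset of 0 points at the "%PDF-" file header, so
it can never match.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, only illustrating the kind of validation described above.
final class XrefOffsetCheck {

    /** Returns true if the bytes at the given offset start with "<objNr> <gen> obj". */
    static boolean pointsToObject(byte[] pdf, long offset, long objNr, int gen) {
        byte[] expected = (objNr + " " + gen + " obj").getBytes(StandardCharsets.ISO_8859_1);
        if (offset < 0 || offset + expected.length > pdf.length) {
            return false; // offset lies outside the file
        }
        for (int i = 0; i < expected.length; i++) {
            if (pdf[(int) offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up miniature file: the header occupies bytes 0..8, object 12 starts at 9.
        byte[] pdf = "%PDF-1.7\n12 0 obj\n<< >>\nendobj\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(pointsToObject(pdf, 9, 12, 0)); // true
        System.out.println(pointsToObject(pdf, 0, 12, 0)); // false: offset 0 is the file header
    }
}
```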

The org.apache.pdfbox.pdfparser.PDFXrefStreamParser#parse() method
currently does:

...

// second field holds the offset (type 1) or the object stream number (type 2)
long offset = parseValue(currLine, w[0], w[1]);
// third field may hold the generation number (type 1) or the index
// within an object stream (type 2)
int thirdValue = (int) parseValue(currLine, w[0] + w[1], w[2]);

...


*Q1*: Could a check be added to the code to skip recording an xref entry
if its offset is 0, or more generally less than 6
(org.apache.pdfbox.pdfparser.COSParser#MINIMUM_SEARCH_OFFSET), so that
PDFBox can accept such incorrect files?

i.e.:

// second field holds the offset (type 1) or the object stream number (type 2)
long offset = parseValue(currLine, w[0], w[1]);
if (0 == offset) {
    // found some incorrect PDF files that were showing such an xref entry
    continue;
}
// third field may hold the generation number (type 1) or the index
// within an object stream (type 2)
int thirdValue = (int) parseValue(currLine, w[0] + w[1], w[2]);
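
To show the effect of such a check outside of PDFBox, here is a small
self-contained sketch that decodes fixed-width xref stream entries and
skips type 1 entries whose offset is below the minimum. It is only an
illustration under assumptions: the class name, the readOffsets helper and
the sample bytes are hypothetical, parseValue is a simplified stand-in for
the PDFBox method of the same name, and the constant merely mirrors the
value 6 mentioned above. Type 2 (object stream) entries are ignored here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the PDFBox implementation.
final class XrefStreamSketch {

    // Assumption: same value as COSParser#MINIMUM_SEARCH_OFFSET quoted in Q1.
    static final long MINIMUM_SEARCH_OFFSET = 6;

    /** Reads a big-endian unsigned integer of the given width in bytes. */
    static long parseValue(byte[] data, int start, int length) {
        long value = 0;
        for (int i = 0; i < length; i++) {
            value = (value << 8) | (data[start + i] & 0xFF);
        }
        return value;
    }

    /** Returns object number -> byte offset for the usable type 1 entries. */
    static Map<Long, Long> readOffsets(byte[] stream, int[] w, long firstObjNr) {
        int entrySize = w[0] + w[1] + w[2];
        Map<Long, Long> offsets = new LinkedHashMap<>();
        long objNr = firstObjNr;
        for (int pos = 0; pos + entrySize <= stream.length; pos += entrySize, objNr++) {
            // a first field of width 0 means the type defaults to 1
            int type = w[0] == 0 ? 1 : (int) parseValue(stream, pos, w[0]);
            if (type != 1) {
                continue; // free entries and object stream entries are not handled here
            }
            long offset = parseValue(stream, pos + w[0], w[1]);
            if (offset < MINIMUM_SEARCH_OFFSET) {
                continue; // proposed check: no object can start this early in the file
            }
            offsets.put(objNr, offset);
        }
        return offsets;
    }

    public static void main(String[] args) {
        int[] w = {1, 2, 1};
        byte[] stream = {
            0, 0, 0, (byte) 0xFF, // object 0: free entry
            1, 0, 0x11, 0,        // object 1: type 1, offset 17
            1, 0, 0, 0,           // object 2: type 1, offset 0 (the broken case)
            2, 0, 5, 0            // object 3: type 2, inside object stream 5
        };
        System.out.println(readOffsets(stream, w, 0)); // {1=17}
    }
}
```

With the check in place, the entry with offset 0 is simply never recorded,
so a later validation pass has nothing invalid to trip over.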


If PDFBox cannot be changed to accommodate such files ...

*Q2*: would you have any recommendation to share?

Thank you,
