[jira] [Updated] (PDFBOX-6201) Enhance XRef Brute Force to keep valid entries

Stefan Ziegler (Jira) Thu, 30 Apr 2026 07:45:08 -0700


     [ 
https://issues.apache.org/jira/browse/PDFBOX-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Stefan Ziegler updated PDFBOX-6201:
-----------------------------------
    Attachment: COSParser-1.patch

> Enhance XRef Brute Force to keep valid entries
> ----------------------------------------------
>
>                 Key: PDFBOX-6201
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6201
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 3.0.7 PDFBox
>            Reporter: Stefan Ziegler
>            Priority: Major
>         Attachments: COSParser-1.java, COSParser-1.patch, COSParser.java, 
> COSParser.patch, pdfbox-6201-1.pdf, pdfbox-6201-2.pdf
>
>
> PDFBox Bug: Form field values lost when loading PDFs with many incremental 
> saves
> ================================================================================
> Component: pdfbox - COSParser, BruteForceParser
> Affects:   3.0.7 (confirmed); likely all prior versions
> Severity:  Major - visible data loss (form field values silently set to empty)
> SYMPTOM
> -------
> Loading a PDF with many incremental saves (e.g. 1948 startxref/%%EOF sections)
> causes PDFBox to silently lose form field values. The original PDF, when 
> viewed in
> Adobe Acrobat, Chrome, or qpdf, correctly shows filled-in values such as
> "xxxx", "xxxx", "xxxx", "xxxx". After loading with
> PDFBox and saving, all fields are empty.
> qpdf and Ghostscript process the same PDF without errors or warnings.
> Running "qpdf" on the PDF beforehand produces a clean file that
> PDFBox handles correctly.
> ROOT CAUSE
> ----------
> The bug is in COSParser.checkXrefOffsets() (called from parseXref(), lenient 
> mode only).
> Step-by-step trace:
> 1. parseXref() correctly traverses the full /Prev chain (5 XRef streams):
>      Depth 0: XRef@7165114  /Size=721   8 entries   /Prev=7148230
>      Depth 1: XRef@7148230  /Size=715  10 entries   /Prev=7144285
>      Depth 2: XRef@7144285  /Size=708  340 entries  /Prev=116       <- has 
> Obj 185
>      Depth 3: XRef@116       /Size=159  131 entries  /Prev=128867
>      Depth 4: XRef@128867    /Size=28   28 entries   /Prev=none
>    After setStartxref(), xrefTrailerResolver.getXrefTable() has 384 entries.
>    Obj 185 -> offset 2523997, which contains /V (xxxx). CORRECT.
> 2. checkXrefOffsets() is called (lenient mode). It calls 
> validateXrefOffsets().
> 3. validateXrefOffsets() iterates over all 384 entries. At the FIRST entry 
> whose
>    offset cannot be dereferenced (findObjectKey returns null), it immediately
>    returns false -- without checking the remaining entries.
> 4. Back in checkXrefOffsets(), because validateXrefOffsets() returned false:
>        xrefOffset.clear();                        // DESTROYS all 384 correct 
> entries
>        xrefOffset.putAll(bfCOSObjectKeyOffsets);  // replaces with 
> brute-force results
> 5. BruteForceParser.getBFCOSObjectOffsets() scans the file linearly using 
> map.put()
>    (not putIfAbsent). For each "N 0 obj" marker found, it overwrites the 
> previous
>    entry for that object number. The LAST physical occurrence wins.
> 6. Object 185 appears 44 times physically. The last occurrence (offset 
> 7019154) is
>    an empty copy written by a later auto-save -- it has no /V entry.
> 7. PDFBox loads the empty object. Text4.getValueAsString() returns "".
> Verification:
>   Before checkXrefOffsets: xrefTable.size()=384, obj185=2523997  <- CORRECT
>   After  checkXrefOffsets: xrefTable.size()=85,  obj185=null     <- BUG
> THE FIX
> -----------------------
> FIX 1: COSParser.checkXrefOffsets()
> Replace the all-or-nothing logic with selective correction.
> Collect all invalid keys (don't stop at first failure), then only replace 
> those
> specific invalid entries with brute-force results. Leave valid entries 
> untouched.
> See attached COSParser.java for the full implementation:
> - checkXrefOffsets() now calls collectInvalidXrefKeys() instead of 
> validateXrefOffsets()
> - collectInvalidXrefKeys() checks ALL entries and returns only the invalid 
> ones
> - Only invalid entries are corrected via brute force; valid ones are preserved
> Note: Fix 1 alone fully resolves the reported issue.
> BruteForceParser is not involved in this bug path at all -- obj 185 has a 
> valid
> XRef entry and is therefore never touched by the brute-force scan after the 
> fix.
> FULL XRef CHAIN
> ---------------
> Offset     Obj    Size  Entries  Prev      Contains obj 185?
> 7165114    720    721      8     7148230   no
> 7148230    714    715     10     7144285   no  (has obj 184, not 185)
> 7144285    707    708    340     116       YES -> offset 2523997
> 116         67    159    131     128867    no
> 128867       6     28     28     -         no
> All 5 XRef streams decompress without error. The chain is valid.
> PDFBox reads all 384 entries correctly before checkXrefOffsets() destroys 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (PDFBOX-6201) Enhance XRef Brute Force to keep valid entries

Reply via email to