[ https://issues.apache.org/jira/browse/PDFBOX-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stefan Ziegler updated PDFBOX-6201:
-----------------------------------
Attachment: COSParser-1.patch
> Enhance XRef Brute Force to keep valid entries
> ----------------------------------------------
>
> Key: PDFBOX-6201
> URL: https://issues.apache.org/jira/browse/PDFBOX-6201
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 3.0.7 PDFBox
> Reporter: Stefan Ziegler
> Priority: Major
> Attachments: COSParser-1.java, COSParser-1.patch, COSParser.java,
> COSParser.patch, pdfbox-6201-1.pdf, pdfbox-6201-2.pdf
>
>
> PDFBox Bug: Form field values lost when loading PDFs with many incremental saves
> ================================================================================
> Component: pdfbox - COSParser, BruteForceParser
> Affects: 3.0.7 (confirmed); likely all prior versions
> Severity: Major - visible data loss (form field values silently set to empty)
> SYMPTOM
> -------
> Loading a PDF with many incremental saves (e.g. 1948 startxref/%%EOF sections)
> causes PDFBox to silently lose form field values. The original PDF, when viewed
> in Adobe Acrobat, Chrome, or qpdf, correctly shows filled-in values such as
> "xxxx", "xxxx", "xxxx", "xxxx". After loading with PDFBox and saving, all
> fields are empty.
> qpdf and Ghostscript process the same PDF without errors or warnings.
> Running "qpdf" on the PDF beforehand produces a clean file that PDFBox
> handles correctly.
> ROOT CAUSE
> ----------
> The bug is in COSParser.checkXrefOffsets() (called from parseXref(), lenient
> mode only).
> Step-by-step trace:
> 1. parseXref() correctly traverses the full /Prev chain (5 XRef streams):
>      Depth 0: XRef@7165114  /Size=721    8 entries  /Prev=7148230
>      Depth 1: XRef@7148230  /Size=715   10 entries  /Prev=7144285
>      Depth 2: XRef@7144285  /Size=708  340 entries  /Prev=116      <- has Obj 185
>      Depth 3: XRef@116      /Size=159  131 entries  /Prev=128867
>      Depth 4: XRef@128867   /Size=28    28 entries  /Prev=none
>    After setStartxref(), xrefTrailerResolver.getXrefTable() has 384 entries.
>    Obj 185 -> offset 2523997, which contains /V (xxxx). CORRECT.
> 2. checkXrefOffsets() is called (lenient mode). It calls validateXrefOffsets().
> 3. validateXrefOffsets() iterates over all 384 entries. At the FIRST entry whose
>    offset cannot be dereferenced (findObjectKey returns null), it immediately
>    returns false -- without checking the remaining entries.
> 4. Back in checkXrefOffsets(), because validateXrefOffsets() returned false:
>      xrefOffset.clear();                        // DESTROYS all 384 correct entries
>      xrefOffset.putAll(bfCOSObjectKeyOffsets);  // replaces with brute-force results
> 5. BruteForceParser.getBFCOSObjectOffsets() scans the file linearly using
>    map.put() (not putIfAbsent). For each "N 0 obj" marker found, it overwrites
>    the previous entry for that object number. The LAST physical occurrence wins.
> 6. Object 185 appears 44 times physically. The last occurrence (offset 7019154)
>    is an empty copy written by a later auto-save -- it has no /V entry.
> 7. PDFBox loads the empty object. Text4.getValueAsString() returns "".
> Verification:
> Before checkXrefOffsets: xrefTable.size()=384, obj185=2523997 <- CORRECT
> After checkXrefOffsets: xrefTable.size()=85, obj185=null <- BUG
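
Steps 3 to 6 can be modelled with plain Java maps. The following is a minimal sketch, not the PDFBox implementation: object numbers map directly to byte offsets, the class and method names are made up for illustration, and object 999 is a hypothetical entry with a bad offset. Only the two offsets for object 185 come from the report.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model: object number -> byte offset. These maps stand in for
// PDFBox's internal tables; nothing here is actual PDFBox code.
class AllOrNothingDemo {

    // Mimics a linear scan that stores markers with put():
    // the last physical occurrence of an object number wins.
    static Map<Integer, Long> bruteForceScan(List<long[]> markersInFileOrder) {
        Map<Integer, Long> found = new LinkedHashMap<>();
        for (long[] marker : markersInFileOrder) {
            found.put((int) marker[0], marker[1]);
        }
        return found;
    }

    // Mimics the all-or-nothing replacement from step 4.
    static void replaceAll(Map<Integer, Long> xref, Map<Integer, Long> bruteForce) {
        xref.clear();
        xref.putAll(bruteForce);
    }

    public static void main(String[] args) {
        Map<Integer, Long> xref = new LinkedHashMap<>();
        xref.put(185, 2523997L); // valid entry (offset from the report)
        xref.put(999, 42L);      // hypothetical entry that cannot be dereferenced

        // Two of the physical copies of object 185, in file order.
        List<long[]> markers = List.of(
                new long[] {185, 2523997L},
                new long[] {185, 7019154L});

        replaceAll(xref, bruteForceScan(markers));
        System.out.println(xref);
    }
}
```

In this model object 185 ends up at offset 7019154, the empty copy, even though its original entry was valid.
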
> THE FIX
> -------
> FIX 1: COSParser.checkXrefOffsets()
> Replace the all-or-nothing logic with selective correction.
> Collect all invalid keys (don't stop at first failure), then only replace those
> specific invalid entries with brute-force results. Leave valid entries untouched.
> See attached COSParser.java for the full implementation:
> - checkXrefOffsets() now calls collectInvalidXrefKeys() instead of
>   validateXrefOffsets()
> - collectInvalidXrefKeys() checks ALL entries and returns only the invalid ones
> - Only invalid entries are corrected via brute force; valid ones are preserved
> Note: Fix 1 alone fully resolves the reported issue.
> With Fix 1 applied, BruteForceParser no longer affects this case: obj 185 has
> a valid XRef entry and is therefore never replaced by a brute-force result.
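
The selective correction described above might look roughly like this. It is a sketch under assumptions, not the attached patch: maps of object number to offset replace PDFBox's internal tables, the validity check is passed in as a hypothetical predicate, and dropping entries that brute force cannot recover is an assumption the report does not confirm.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

// Illustrative only: names and data structures are not taken from PDFBox.
class SelectiveXrefRepair {

    // Checks EVERY entry and returns the keys whose offsets fail the check.
    static Set<Integer> collectInvalidKeys(Map<Integer, Long> xref,
                                           BiPredicate<Integer, Long> pointsAtObject) {
        Set<Integer> invalid = new HashSet<>();
        for (Map.Entry<Integer, Long> entry : xref.entrySet()) {
            if (!pointsAtObject.test(entry.getKey(), entry.getValue())) {
                invalid.add(entry.getKey());
            }
        }
        return invalid;
    }

    // Replaces only the invalid entries; valid entries are left untouched.
    static void repair(Map<Integer, Long> xref, Set<Integer> invalid,
                       Map<Integer, Long> bruteForce) {
        for (Integer key : invalid) {
            Long offset = bruteForce.get(key);
            if (offset != null) {
                xref.put(key, offset);
            } else {
                xref.remove(key); // assumption: drop entries that cannot be recovered
            }
        }
    }

    public static void main(String[] args) {
        Map<Integer, Long> xref = new HashMap<>();
        xref.put(185, 2523997L); // valid entry (offset from the report)
        xref.put(999, 42L);      // hypothetical broken entry

        Map<Integer, Long> bruteForce = new HashMap<>();
        bruteForce.put(185, 7019154L); // last physical copy, must NOT be used
        bruteForce.put(999, 5000L);    // hypothetical recovered offset

        // Hypothetical validity check: only object 999 is considered broken.
        Set<Integer> invalid = collectInvalidKeys(xref, (key, offset) -> key != 999);
        repair(xref, invalid, bruteForce);
        System.out.println(xref);
    }
}
```

Here object 185 keeps offset 2523997, and only the hypothetical object 999 takes its offset from the brute-force results.
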
> FULL XRef CHAIN
> ---------------
> Offset   Obj  Size  Entries  Prev     Contains obj 185?
> 7165114  720  721   8        7148230  no
> 7148230  714  715   10       7144285  no (has obj 184, not 185)
> 7144285  707  708   340      116      YES -> offset 2523997
> 116      67   159   131      128867   no
> 128867   6    28    28       -        no
> All 5 XRef streams decompress without error. The chain is valid.
> PDFBox reads all 384 entries correctly before checkXrefOffsets() destroys them.
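
The chain is followed from the newest section back through /Prev, and under PDF incremental-update rules a newer section's entry for an object takes precedence over older ones. A minimal sketch of such a merge follows. The class name, the map-based model, and every offset except the one for object 185 are hypothetical, and this is not how PDFBox's XrefTrailerResolver is written.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: each section is modelled as object number -> byte offset.
class PrevChainMerge {

    // Sections are ordered newest first, as when following /Prev links.
    // putIfAbsent keeps the newest definition of each object.
    static Map<Integer, Long> merge(List<Map<Integer, Long>> newestFirst) {
        Map<Integer, Long> table = new LinkedHashMap<>();
        for (Map<Integer, Long> section : newestFirst) {
            for (Map.Entry<Integer, Long> entry : section.entrySet()) {
                table.putIfAbsent(entry.getKey(), entry.getValue());
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Newer section: redefines object 184 (hypothetical offset).
        Map<Integer, Long> newer = Map.of(184, 7100000L);
        // Older section: object 185 at the offset from the report,
        // plus an older, hypothetical definition of object 184.
        Map<Integer, Long> older = Map.of(184, 2500000L, 185, 2523997L);

        Map<Integer, Long> table = merge(List.of(newer, older));
        System.out.println("obj 184 -> " + table.get(184));
        System.out.println("obj 185 -> " + table.get(185));
    }
}
```

Object 185 is defined in only one section of this model, so the merge resolves it to offset 2523997 no matter how many newer sections precede it.
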
--
This message was sent by Atlassian Jira
(v8.20.10#820010)