[
https://issues.apache.org/jira/browse/PDFBOX-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18077369#comment-18077369
]
Maruan Sahyoun commented on PDFBOX-6201:
----------------------------------------
[~mkl] Thank you for your insights. How to handle malformed PDFs is indeed
undefined, and the best approach often depends on the specific scenario.
Aligning with how Adobe Acrobat and other software handle such cases provides
common ground for consistency.
However, without a standardized test corpus that lets us compare how different
vendors repair malformed files, it is challenging to establish a universal best
practice. As a result, repairs are handled case by case on a best-effort basis.
Given the potential for non-deterministic behaviour in handling non-conforming
files, applying the patch for version 3.0 might disrupt the current workflow
for existing users. That said, the contribution is valuable and worth
considering for future iterations.
> Enhance XRef Brute Force to keep valid entries
> ----------------------------------------------
>
> Key: PDFBOX-6201
> URL: https://issues.apache.org/jira/browse/PDFBOX-6201
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 3.0.7 PDFBox
> Reporter: Stefan Ziegler
> Priority: Major
> Attachments: 0001-fix-for-PDFBOX-6201.patch,
> 0002-validateXrefOffsets-is-no-longer-needed.patch, COSParser-1.java,
> COSParser-1.patch, COSParser.java, COSParser.patch, pdfbox-6201-1.pdf,
> pdfbox-6201-2.pdf
>
>
> PDFBox Bug: Form field values lost when loading PDFs with many incremental
> saves
> ================================================================================
> Component: pdfbox - COSParser, BruteForceParser
> Affects: 3.0.7 (confirmed); likely all prior versions
> Severity: Major - visible data loss (form field values silently set to empty)
> SYMPTOM
> -------
> Loading a PDF with many incremental saves (e.g. 1948 startxref/%%EOF sections)
> causes PDFBox to silently lose form field values. The original PDF, when
> viewed in Adobe Acrobat, Chrome, or qpdf, correctly shows filled-in values
> such as "xxxx", "xxxx", "xxxx", "xxxx". After loading with PDFBox and saving,
> all fields are empty.
> qpdf and Ghostscript process the same PDF without errors or warnings.
> Running "qpdf" on the PDF beforehand produces a clean file that
> PDFBox handles correctly.
> ROOT CAUSE
> ----------
> The bug is in COSParser.checkXrefOffsets() (called from parseXref(), lenient
> mode only).
> Step-by-step trace:
> 1. parseXref() correctly traverses the full /Prev chain (5 XRef streams):
>      Depth 0: XRef@7165114  /Size=721    8 entries  /Prev=7148230
>      Depth 1: XRef@7148230  /Size=715   10 entries  /Prev=7144285
>      Depth 2: XRef@7144285  /Size=708  340 entries  /Prev=116  <- has Obj 185
>      Depth 3: XRef@116      /Size=159  131 entries  /Prev=128867
>      Depth 4: XRef@128867   /Size=28    28 entries  /Prev=none
>    After setStartxref(), xrefTrailerResolver.getXrefTable() has 384 entries.
>    Obj 185 -> offset 2523997, which contains /V (xxxx). CORRECT.
> 2. checkXrefOffsets() is called (lenient mode). It calls validateXrefOffsets().
> 3. validateXrefOffsets() iterates over all 384 entries. At the FIRST entry
>    whose offset cannot be dereferenced (findObjectKey returns null), it
>    immediately returns false -- without checking the remaining entries.
> 4. Back in checkXrefOffsets(), because validateXrefOffsets() returned false:
>      xrefOffset.clear();                       // DESTROYS all 384 correct entries
>      xrefOffset.putAll(bfCOSObjectKeyOffsets); // replaces with brute-force results
> 5. BruteForceParser.getBFCOSObjectOffsets() scans the file linearly using
>    map.put() (not putIfAbsent). For each "N 0 obj" marker found, it overwrites
>    the previous entry for that object number. The LAST physical occurrence wins.
> 6. Object 185 appears 44 times physically. The last occurrence (offset
>    7019154) is an empty copy written by a later auto-save -- it has no /V entry.
> 7. PDFBox loads the empty object. Text4.getValueAsString() returns "".
> Verification:
> Before checkXrefOffsets: xrefTable.size()=384, obj185=2523997 <- CORRECT
> After checkXrefOffsets: xrefTable.size()=85, obj185=null <- BUG
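
The failure mode in steps 3 to 6 can be reproduced on plain maps. The following
is only a minimal sketch, not PDFBox code: the class and method names are
hypothetical, object numbers stand in for COSObjectKey, and the stale entry for
obj 9 (with its offsets) is invented to trigger the repair. Only the two
offsets of obj 185 are taken from the report.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for the behaviour described in steps 3-6; not PDFBox code.
class AllOrNothingSketch {

    // True if the file has an "N 0 obj" marker for objNum at exactly this offset.
    static boolean pointsAtObject(List<long[]> markers, int objNum, long offset) {
        for (long[] m : markers) {
            if (m[0] == objNum && m[1] == offset) {
                return true;
            }
        }
        return false;
    }

    // Early exit: returns false at the first bad entry, the rest is never inspected.
    static boolean allValid(Map<Integer, Long> xref, List<long[]> markers) {
        for (Map.Entry<Integer, Long> e : xref.entrySet()) {
            if (!pointsAtObject(markers, e.getKey(), e.getValue())) {
                return false;
            }
        }
        return true;
    }

    // Linear scan using put(): the last physical occurrence of each object wins.
    static Map<Integer, Long> bruteForce(List<long[]> markers) {
        Map<Integer, Long> found = new LinkedHashMap<>();
        for (long[] m : markers) {
            found.put((int) m[0], m[1]);
        }
        return found;
    }

    // All-or-nothing repair: a single bad entry discards every entry.
    static Map<Integer, Long> repair(Map<Integer, Long> xref, List<long[]> markers) {
        Map<Integer, Long> result = new LinkedHashMap<>(xref);
        if (!allValid(result, markers)) {
            result.clear();
            result.putAll(bruteForce(markers));
        }
        return result;
    }

    static Map<Integer, Long> demo() {
        List<long[]> markers = new ArrayList<>();  // {object number, offset}, in file order
        markers.add(new long[] {9, 1010});         // made-up object
        markers.add(new long[] {185, 2523997});    // copy that holds /V
        markers.add(new long[] {185, 7019154});    // later empty copy

        Map<Integer, Long> xref = new LinkedHashMap<>();
        xref.put(9, 1000L);       // made-up stale offset, triggers the repair
        xref.put(185, 2523997L);  // correct entry from the XRef chain
        return repair(xref, markers);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this toy setup the correct entry for obj 185 is replaced by offset 7019154
only because the unrelated entry for obj 9 was stale.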
> THE FIX
> -------
> FIX 1: COSParser.checkXrefOffsets()
> Replace the all-or-nothing logic with selective correction.
> Collect all invalid keys (don't stop at first failure), then only replace
> those specific invalid entries with brute-force results. Leave valid entries
> untouched.
> See attached COSParser.java for the full implementation:
> - checkXrefOffsets() now calls collectInvalidXrefKeys() instead of
>   validateXrefOffsets()
> - collectInvalidXrefKeys() checks ALL entries and returns only the invalid ones
> - Only invalid entries are corrected via brute force; valid ones are preserved
> Note: Fix 1 alone fully resolves the reported issue.
> With the fix applied, BruteForceParser no longer affects obj 185: the object
> has a valid XRef entry and is therefore never replaced by a brute-force result.
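
The selective correction described under FIX 1 might look roughly like the
sketch below. It is not the attached COSParser.java. The types are simplified
(object numbers instead of COSObjectKey), the method signatures are assumptions,
and the entry for obj 9 is invented. Leaving an invalid entry unchanged when the
brute-force scan has no replacement for it is also an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of "collect invalid keys, replace only those"; not PDFBox code.
class SelectiveRepairSketch {

    // Checks EVERY entry and returns the object numbers whose offset is not an object start.
    static Set<Integer> collectInvalidKeys(Map<Integer, Long> xref,
                                           Map<Integer, Set<Long>> objectStarts) {
        Set<Integer> invalid = new LinkedHashSet<>();
        for (Map.Entry<Integer, Long> e : xref.entrySet()) {
            Set<Long> starts = objectStarts.get(e.getKey());
            if (starts == null || !starts.contains(e.getValue())) {
                invalid.add(e.getKey());
            }
        }
        return invalid;
    }

    // Replaces only the invalid entries; valid entries are copied through untouched.
    static Map<Integer, Long> repair(Map<Integer, Long> xref,
                                     Map<Integer, Set<Long>> objectStarts,
                                     Map<Integer, Long> bruteForce) {
        Map<Integer, Long> result = new LinkedHashMap<>(xref);
        for (Integer key : collectInvalidKeys(xref, objectStarts)) {
            Long replacement = bruteForce.get(key);
            if (replacement != null) {  // assumption: keep the old entry if nothing was found
                result.put(key, replacement);
            }
        }
        return result;
    }

    static Map<Integer, Long> demo() {
        // Offsets at which each object physically starts in the file.
        Map<Integer, Set<Long>> objectStarts = new LinkedHashMap<>();
        objectStarts.put(9, Set.of(1010L));                 // made-up object
        objectStarts.put(185, Set.of(2523997L, 7019154L));  // filled copy and empty copy

        Map<Integer, Long> xref = new LinkedHashMap<>();
        xref.put(9, 1000L);       // made-up stale offset
        xref.put(185, 2523997L);  // correct entry from the XRef chain

        // Last-occurrence-wins result of a brute-force scan.
        Map<Integer, Long> bruteForce = new LinkedHashMap<>();
        bruteForce.put(9, 1010L);
        bruteForce.put(185, 7019154L);

        return repair(xref, objectStarts, bruteForce);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Here obj 9 is corrected from the scan while obj 185 keeps offset 2523997, so the
last-occurrence-wins behaviour of the scan never reaches a valid entry.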
> FULL XRef CHAIN
> ---------------
> Offset    Obj   Size   Entries   Prev      Contains obj 185?
> 7165114   720   721    8         7148230   no
> 7148230   714   715    10        7144285   no (has obj 184, not 185)
> 7144285   707   708    340       116       YES -> offset 2523997
> 116       67    159    131       128867    no
> 128867    6     28     28        -         no
> All 5 XRef streams decompress without error. The chain is valid.
> PDFBox reads all 384 entries correctly before checkXrefOffsets()
> destroys them.
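
For readers less familiar with incremental updates, walking such a /Prev chain
can be sketched as follows. This is a simplified model, not how PDFBox's
resolver is implemented: the Section class and merge() are hypothetical, and
apart from the five section offsets and the entry for obj 185, all object
numbers and offsets are invented. The sketch starts at the newest section and
uses putIfAbsent so that a newer definition of an object takes precedence over
an older one.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative model of merging XRef sections along a /Prev chain; not PDFBox code.
class PrevChainSketch {

    static final class Section {
        final Map<Integer, Long> entries;  // object number -> byte offset
        final Long prev;                   // offset of the previous section, null if none

        Section(Map<Integer, Long> entries, Long prev) {
            this.entries = entries;
            this.prev = prev;
        }
    }

    // Walks from the newest section backwards; the first definition seen (the newest) wins.
    static Map<Integer, Long> merge(Map<Long, Section> sectionsByOffset, long startxref) {
        Map<Integer, Long> table = new LinkedHashMap<>();
        Set<Long> visited = new HashSet<>();  // guards against a cyclic /Prev chain
        Long offset = startxref;
        while (offset != null && visited.add(offset)) {
            Section section = sectionsByOffset.get(offset);
            if (section == null) {
                break;  // dangling /Prev pointer
            }
            for (Map.Entry<Integer, Long> e : section.entries.entrySet()) {
                table.putIfAbsent(e.getKey(), e.getValue());
            }
            offset = section.prev;
        }
        return table;
    }

    static Map<Integer, Long> demo() {
        Map<Long, Section> sections = new HashMap<>();
        sections.put(7165114L, new Section(Map.of(5, 7000000L), 7148230L));   // made-up entry
        sections.put(7148230L, new Section(Map.of(184, 7100000L), 7144285L)); // made-up offset
        sections.put(7144285L, new Section(Map.of(185, 2523997L), 116L));     // from the report
        sections.put(116L, new Section(Map.of(12, 90000L), 128867L));         // made-up entry
        sections.put(128867L, new Section(Map.of(5, 500L), null));            // older obj 5
        return merge(sections, 7165114L);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the demo, obj 5 resolves to the newer made-up offset 7000000 rather than 500,
and obj 185 resolves to 2523997, the value the report identifies as correct.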
--
This message was sent by Atlassian Jira
(v8.20.10#820010)