[ 
https://issues.apache.org/jira/browse/PDFBOX-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18077333#comment-18077333
 ] 

Stefan Ziegler commented on PDFBOX-6201:
----------------------------------------

> Apart from the ghost objects which are not referenced at all, is there at
> least one reference in the xref which is at the wrong location?
 
Yes, exactly. There are 12 genuinely invalid XRef entries, all originating
from the oldest XRef at offset 116 (the base PDF, before any incremental
saves). These broken references cause validateXrefOffsets() to return
false, which in turn activates the brute-force scan.
 
The 12 invalid entries:
 
  Obj  31: offset 6881282  → points into middle of a compressed stream (binary garbage)
  Obj  37: offset 4849666  → same, binary data
  Obj  43: offset 4325378  → points into a Form XObject stream, not an object header
  Obj  49: offset 8192002  → beyond end of file (file size = 7,165,472 bytes)
  Obj  83: offset 131072   → points to "706 0 obj" (wrong object)
  Obj 107: offset 131072   → same
  Obj 125: offset 131072   → same
  Obj 131: offset 131072   → same
  Obj 137: offset 131072   → same
  Obj 143: offset 131072   → same
  Obj 149: offset 131072   → same
  Obj 155: offset 131072   → same
 
Eight objects (83, 107, 125, ..., 155) all point to offset 131072, which
suggests they were originally stored in a compressed object stream at that
location in the base PDF. After 1948 incremental saves, offset 131072 no
longer holds that content: it now contains "706 0 obj" instead.
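
To make the categories above concrete, here is a minimal sketch of the kind
of check that separates a valid entry from a stale one. It is not PDFBox's
code: the class and method names are hypothetical, and it assumes the object
header is written exactly as "N G obj" with single spaces, which a real
parser cannot rely on.

```java
import java.nio.charset.StandardCharsets;

final class XrefOffsetCheck {
    /**
     * Returns true if the bytes at 'offset' start with "<objNum> <gen> obj".
     * Hypothetical helper for illustration only; not PDFBox's implementation.
     */
    static boolean pointsToObjectHeader(byte[] file, long offset, int objNum, int gen) {
        byte[] expected = (objNum + " " + gen + " obj").getBytes(StandardCharsets.US_ASCII);
        // An offset beyond the end of the file (as for obj 49) can never be valid.
        if (offset < 0 || offset + expected.length > file.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (file[(int) offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Tiny stand-in for a PDF: the header line is 9 bytes, so "706 0 obj" starts at offset 9.
        byte[] file = "%PDF-1.7\n706 0 obj\n<< >>\nendobj\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(pointsToObjectHeader(file, 9, 706, 0));   // header matches the entry
        System.out.println(pointsToObjectHeader(file, 9, 83, 0));    // another object lives here
        System.out.println(pointsToObjectHeader(file, 5000, 49, 0)); // offset beyond end of file
    }
}
```

The second call mirrors the eight entries at offset 131072: the offset is a
real object header, just not the header of the object the XRef entry names.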
 
So to summarise the full picture:
 
  1. Base PDF has 12 broken XRef entries (stale offsets from object stream
     reorganisation).
 
  2. validateXrefOffsets() returns false on the very first invalid entry →
     the entire XRef table (384 entries, 372 of them correct, including
     obj 185 → 2523997) is discarded and replaced by the brute-force scan
     result.
 
  3. The brute-force scan finds object 185 physically 44 times. The last
     occurrence (offset 7019154, no /V) wins because the scanner uses
     Map.put() in a forward pass. This overwrites the correct entry.
 
  4. The correct entry for object 185 was never broken — it was collateral
     damage from the all-or-nothing replacement in step 2.
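
Step 3 can be illustrated with plain Java maps. This is only a sketch of the
"last occurrence wins" effect of Map.put() in a forward pass, with
putIfAbsent() shown for contrast; the class and method names are
hypothetical and the scan input is reduced to three markers taken from the
numbers above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ForwardScanDemo {
    /** Forward pass with put(): a later marker replaces an earlier one for the same object. */
    static Map<Integer, Long> lastWins(int[] objNums, long[] offsets) {
        Map<Integer, Long> result = new LinkedHashMap<>();
        for (int i = 0; i < objNums.length; i++) {
            result.put(objNums[i], offsets[i]);
        }
        return result;
    }

    /** Same pass with putIfAbsent(): the first marker for an object is kept. */
    static Map<Integer, Long> firstWins(int[] objNums, long[] offsets) {
        Map<Integer, Long> result = new LinkedHashMap<>();
        for (int i = 0; i < objNums.length; i++) {
            result.putIfAbsent(objNums[i], offsets[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Markers in file order: obj 706, then two of the 44 copies of obj 185.
        int[] objNums = {706, 185, 185};
        long[] offsets = {131072L, 2523997L, 7019154L};
        System.out.println(lastWins(objNums, offsets).get(185));  // 7019154 (copy without /V)
        System.out.println(firstWins(objNums, offsets).get(185)); // 2523997
    }
}
```

Neither policy is right in general: with incremental saves the copy that
counts is the one the XRef chain references, which a purely positional scan
cannot know.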
 
The patch avoids step 2 by only replacing the 12 genuinely broken entries
with brute-force results, leaving the 372 valid entries (including obj 185)
untouched.
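
The idea of the selective replacement can be sketched as follows. This is
not the attached patch: the class, the method names and the validity
predicate are hypothetical, the maps use plain object numbers instead of
COSObjectKey, and dropping an entry that the scan cannot locate is an
assumption of this sketch. The offset 4000000 for obj 83 is made up for the
example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BiPredicate;

final class SelectiveXrefRepair {
    /** Checks every entry (no early exit) and returns the keys whose offsets fail validation. */
    static Set<Integer> collectInvalidKeys(Map<Integer, Long> xref,
                                           BiPredicate<Integer, Long> isValid) {
        Set<Integer> invalid = new TreeSet<>();
        for (Map.Entry<Integer, Long> e : xref.entrySet()) {
            if (!isValid.test(e.getKey(), e.getValue())) {
                invalid.add(e.getKey());
            }
        }
        return invalid;
    }

    /** Replaces only the invalid entries with brute-force results; valid entries stay as they are. */
    static void repair(Map<Integer, Long> xref, Set<Integer> invalidKeys,
                       Map<Integer, Long> bruteForce) {
        for (Integer key : invalidKeys) {
            Long found = bruteForce.get(key);
            if (found != null) {
                xref.put(key, found);
            } else {
                xref.remove(key); // assumption: drop entries the scan cannot locate
            }
        }
    }

    public static void main(String[] args) {
        Map<Integer, Long> xref = new HashMap<>();
        xref.put(185, 2523997L); // valid entry
        xref.put(83, 131072L);   // stale: offset holds "706 0 obj"
        xref.put(49, 8192002L);  // stale: beyond end of file

        Map<Integer, Long> bruteForce = new HashMap<>();
        bruteForce.put(185, 7019154L); // last physical copy, no /V
        bruteForce.put(83, 4000000L);  // hypothetical offset for this example
        bruteForce.put(706, 131072L);

        // Stand-in validator: in this toy setup only obj 185 has a valid offset.
        Set<Integer> invalid = collectInvalidKeys(xref, (objNum, offset) -> objNum == 185);
        repair(xref, invalid, bruteForce);
        System.out.println(xref);
    }
}
```

After the repair, obj 185 still maps to 2523997, obj 83 takes the
brute-force offset, and obj 49 is removed because the scan has no entry for
it.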

> PDFBox Bug: Form field values lost when loading PDFs with many incremental saves
> --------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6201
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6201
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 3.0.7 PDFBox
>            Reporter: Stefan Ziegler
>            Priority: Major
>         Attachments: COSParser.java, COSParser.patch, pdfbox-6201-1.pdf, pdfbox-6201-2.pdf
>
>
> PDFBox Bug: Form field values lost when loading PDFs with many incremental saves
> ================================================================================
> Component: pdfbox - COSParser, BruteForceParser
> Affects:   3.0.7 (confirmed); likely all prior versions
> Severity:  Major - visible data loss (form field values silently set to empty)
>
> SYMPTOM
> -------
> Loading a PDF with many incremental saves (e.g. 1948 startxref/%%EOF sections)
> causes PDFBox to silently lose form field values. The original PDF, when
> viewed in Adobe Acrobat, Chrome, or qpdf, correctly shows filled-in values
> such as "xxxx", "xxxx", "xxxx", "xxxx". After loading with PDFBox and saving,
> all fields are empty.
> qpdf and Ghostscript process the same PDF without errors or warnings.
> Running "qpdf" on the PDF beforehand produces a clean file that PDFBox
> handles correctly.
>
> ROOT CAUSE
> ----------
> The bug is in COSParser.checkXrefOffsets() (called from parseXref(), lenient mode only).
>
> Step-by-step trace:
> 1. parseXref() correctly traverses the full /Prev chain (5 XRef streams):
>      Depth 0: XRef@7165114  /Size=721    8 entries  /Prev=7148230
>      Depth 1: XRef@7148230  /Size=715   10 entries  /Prev=7144285
>      Depth 2: XRef@7144285  /Size=708  340 entries  /Prev=116      <- has Obj 185
>      Depth 3: XRef@116      /Size=159  131 entries  /Prev=128867
>      Depth 4: XRef@128867   /Size=28    28 entries  /Prev=none
>    After setStartxref(), xrefTrailerResolver.getXrefTable() has 384 entries.
>    Obj 185 -> offset 2523997, which contains /V (xxxx). CORRECT.
> 2. checkXrefOffsets() is called (lenient mode). It calls validateXrefOffsets().
> 3. validateXrefOffsets() iterates over all 384 entries. At the FIRST entry whose
>    offset cannot be dereferenced (findObjectKey returns null), it immediately
>    returns false -- without checking the remaining entries.
> 4. Back in checkXrefOffsets(), because validateXrefOffsets() returned false:
>        xrefOffset.clear();                        // DESTROYS all 384 correct entries
>        xrefOffset.putAll(bfCOSObjectKeyOffsets);  // replaces with brute-force results
> 5. BruteForceParser.getBFCOSObjectOffsets() scans the file linearly using map.put()
>    (not putIfAbsent). For each "N 0 obj" marker found, it overwrites the previous
>    entry for that object number. The LAST physical occurrence wins.
> 6. Object 185 appears 44 times physically. The last occurrence (offset 7019154) is
>    an empty copy written by a later auto-save -- it has no /V entry.
> 7. PDFBox loads the empty object. Text4.getValueAsString() returns "".
>
> Verification:
>   Before checkXrefOffsets: xrefTable.size()=384, obj185=2523997  <- CORRECT
>   After  checkXrefOffsets: xrefTable.size()=85,  obj185=null     <- BUG
>
> THE FIX
> -------
> FIX 1: COSParser.checkXrefOffsets()
> Replace the all-or-nothing logic with selective correction.
> Collect all invalid keys (don't stop at first failure), then only replace those
> specific invalid entries with brute-force results. Leave valid entries untouched.
> See attached COSParser.java for the full implementation:
> - checkXrefOffsets() now calls collectInvalidXrefKeys() instead of validateXrefOffsets()
> - collectInvalidXrefKeys() checks ALL entries and returns only the invalid ones
> - Only invalid entries are corrected via brute force; valid ones are preserved
> Note: Fix 1 alone fully resolves the reported issue.
> BruteForceParser is not involved in this bug path at all -- obj 185 has a valid
> XRef entry and is therefore never touched by the brute-force scan after the fix.
>
> FULL XRef CHAIN
> ---------------
> Offset     Obj    Size  Entries  Prev      Contains obj 185?
> 7165114    720    721      8     7148230   no
> 7148230    714    715     10     7144285   no  (has obj 184, not 185)
> 7144285    707    708    340     116       YES -> offset 2523997
> 116         67    159    131     128867    no
> 128867       6     28     28     -         no
> All 5 XRef streams decompress without error. The chain is valid.
> PDFBox reads all 384 entries correctly before checkXrefOffsets() destroys them.



