Stefan Ziegler created PDFBOX-6201:
--------------------------------------

             Summary: PDFBox Bug: Form field values lost when loading PDFs with many incremental saves
                 Key: PDFBOX-6201
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6201
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 3.0.7 PDFBox
            Reporter: Stefan Ziegler
         Attachments: COSParser.java, COSParser.patch

PDFBox Bug: Form field values lost when loading PDFs with many incremental saves
================================================================================

Component: pdfbox - COSParser, BruteForceParser
Affects:   3.0.7 (confirmed); likely all prior versions
Severity:  Major - visible data loss (form field values silently set to empty)


SYMPTOM
-------
Loading a PDF with many incremental saves (e.g. 1948 startxref/%%EOF sections)
causes PDFBox to silently lose form field values. The original PDF, when viewed
in Adobe Acrobat, Chrome, or qpdf, correctly shows filled-in values such as
"xxxx", "xxxx", "xxxx", "xxxx". After loading with PDFBox and saving, all
fields are empty.

qpdf and Ghostscript process the same PDF without errors or warnings.
Running "qpdf" on the PDF beforehand produces a clean file that
PDFBox handles correctly.


ROOT CAUSE
----------
The bug is in COSParser.checkXrefOffsets() (called from parseXref(), lenient
mode only).

Step-by-step trace:

1. parseXref() correctly traverses the full /Prev chain (5 XRef streams):

     Depth 0: XRef@7165114  /Size=721   8 entries   /Prev=7148230
     Depth 1: XRef@7148230  /Size=715  10 entries   /Prev=7144285
     Depth 2: XRef@7144285  /Size=708  340 entries  /Prev=116       <- has Obj 185
     Depth 3: XRef@116       /Size=159  131 entries  /Prev=128867
     Depth 4: XRef@128867    /Size=28   28 entries   /Prev=none

   After setStartxref(), xrefTrailerResolver.getXrefTable() has 384 entries.
   Obj 185 -> offset 2523997, which contains /V (xxxx). CORRECT.

2. checkXrefOffsets() is called (lenient mode). It calls validateXrefOffsets().

3. validateXrefOffsets() iterates over all 384 entries. At the FIRST entry whose
   offset cannot be dereferenced (findObjectKey returns null), it immediately
   returns false -- without checking the remaining entries.

4. Back in checkXrefOffsets(), because validateXrefOffsets() returned false:

       xrefOffset.clear();                        // DESTROYS all 384 correct entries
       xrefOffset.putAll(bfCOSObjectKeyOffsets);  // replaces with brute-force results

5. BruteForceParser.getBFCOSObjectOffsets() scans the file linearly using
   map.put() (not putIfAbsent). For each "N 0 obj" marker found, it overwrites
   the previous entry for that object number. The LAST physical occurrence wins.

6. Object 185 appears 44 times physically. The last occurrence (offset 7019154)
   is an empty copy written by a later auto-save -- it has no /V entry.

7. PDFBox loads the empty object. Text4.getValueAsString() returns "".

Verification:
  Before checkXrefOffsets: xrefTable.size()=384, obj185=2523997  <- CORRECT
  After  checkXrefOffsets: xrefTable.size()=85,  obj185=null     <- BUG
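
Steps 3 to 6 can be illustrated with a small stand-alone sketch. It is a toy
model and not PDFBox code: the class name, the method signatures, the validity
predicate, and the bad entry for object 42 are assumptions made for
illustration. Only the two offsets for object 185 come from the trace above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongPredicate;

// Toy model of the reported behavior; names and signatures are hypothetical.
class XrefAllOrNothingSketch {

    /** Linear scan in which the last physical occurrence of an object wins. */
    static Map<Integer, Long> bruteForceScan(int[] objNumbers, long[] offsets) {
        Map<Integer, Long> found = new LinkedHashMap<>();
        for (int i = 0; i < objNumbers.length; i++) {
            found.put(objNumbers[i], offsets[i]); // put(), not putIfAbsent()
        }
        return found;
    }

    /** One invalid offset is enough to discard the whole parsed table. */
    static Map<Integer, Long> checkXrefOffsets(Map<Integer, Long> xrefOffset,
                                               Map<Integer, Long> bruteForce,
                                               LongPredicate offsetIsValid) {
        Map<Integer, Long> result = new LinkedHashMap<>(xrefOffset);
        for (long offset : xrefOffset.values()) {
            if (!offsetIsValid.test(offset)) {
                result.clear();            // all correct entries are dropped
                result.putAll(bruteForce); // replaced by brute-force results
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Long> xref = new LinkedHashMap<>();
        xref.put(185, 2523997L); // correct offset from the XRef chain
        xref.put(42, 999L);      // hypothetical entry that fails validation

        // Object 185 found twice; the later, empty copy is at 7019154.
        Map<Integer, Long> bf = bruteForceScan(
                new int[] {185, 185}, new long[] {2523997L, 7019154L});

        Map<Integer, Long> after = checkXrefOffsets(xref, bf, off -> off != 999L);
        System.out.println(after); // prints {185=7019154}
    }
}
```

In this model the single bad entry for object 42 causes the valid offset of
object 185 to be replaced by the offset of its last physical copy.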


THE FIX
-------

FIX 1: COSParser.checkXrefOffsets()

Replace the all-or-nothing logic with selective correction.
Collect all invalid keys (don't stop at the first failure), then replace only
those specific invalid entries with brute-force results. Leave valid entries
untouched.

See attached COSParser.java for the full implementation:
- checkXrefOffsets() now calls collectInvalidXrefKeys() instead of
  validateXrefOffsets()
- collectInvalidXrefKeys() checks ALL entries and returns only the invalid ones
- Only invalid entries are corrected via brute force; valid ones are preserved

Note: Fix 1 alone fully resolves the reported issue.
With the fix applied, BruteForceParser no longer plays a part in this bug path:
obj 185 has a valid XRef entry, so its offset is never replaced by a
brute-force result.
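
The idea behind Fix 1 can be sketched as follows. This is not the attached
COSParser.java: the signatures, the validity predicate, and the choice to drop
an invalid entry when the brute-force scan has no replacement are assumptions
made for illustration. Only the name collectInvalidXrefKeys() is taken from
the description above.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

// Sketch of selective correction; signatures are hypothetical.
class SelectiveCorrectionSketch {

    /** Checks ALL entries and returns the keys whose offsets are invalid. */
    static Set<Integer> collectInvalidXrefKeys(Map<Integer, Long> xrefOffset,
                                               BiPredicate<Integer, Long> pointsAtObject) {
        Set<Integer> invalid = new HashSet<>();
        for (Map.Entry<Integer, Long> e : xrefOffset.entrySet()) {
            if (!pointsAtObject.test(e.getKey(), e.getValue())) {
                invalid.add(e.getKey()); // keep going, do not stop at first failure
            }
        }
        return invalid;
    }

    /** Replaces only the invalid entries; valid entries stay untouched. */
    static void correctInvalidEntries(Map<Integer, Long> xrefOffset,
                                      Map<Integer, Long> bruteForce,
                                      BiPredicate<Integer, Long> pointsAtObject) {
        for (Integer key : collectInvalidXrefKeys(xrefOffset, pointsAtObject)) {
            Long found = bruteForce.get(key);
            if (found != null) {
                xrefOffset.put(key, found);
            } else {
                xrefOffset.remove(key); // assumption: drop entries with no replacement
            }
        }
    }
}
```

Because the invalid keys are collected into a separate set first, the table can
be modified safely afterwards, and an entry such as obj 185 -> 2523997 is never
looked up in the brute-force results.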

FULL XRef CHAIN
---------------
Offset     Obj    Size  Entries  Prev      Contains obj 185?
7165114    720    721      8     7148230   no
7148230    714    715     10     7144285   no  (has obj 184, not 185)
7144285    707    708    340     116       YES -> offset 2523997
116         67    159    131     128867    no
128867       6     28     28     -         no

All 5 XRef streams decompress without error. The chain is valid.
PDFBox reads all 384 entries correctly before checkXrefOffsets() destroys them.


