[ 
https://issues.apache.org/jira/browse/PDFBOX-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18077324#comment-18077324
 ] 

Stefan Ziegler edited comment on PDFBOX-6201 at 4/30/26 12:42 PM:
------------------------------------------------------------------

Response to PDFBox team questions
==================================

> Is object 185 at offset 7019154 referenced in an xref, or is it referenced
> but with a wrong location?

It is NOT referenced by any XRef table in the update chain. The XRef chain
correctly references object 185 at offset 2523997, which contains
/V (XXXXXX). The occurrence at offset 7019154 is a ghost object —
physically present in the byte stream but not pointed to by any XRef entry.
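
One way to list the physical occurrences of an object and separate the ghosts from the referenced one is sketched below. This is an illustration only, not PDFBox code: the class and method names are hypothetical, and the XRef offsets are assumed to have been obtained separately.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class ObjMarkerScan {

    /** Returns the byte offset of every "<objNr> <gen> obj" marker, in file order. */
    static List<Long> findOccurrences(byte[] pdf, int objNr, int gen) {
        // ISO-8859-1 maps each byte to exactly one char, so a char index equals a byte offset.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        // The lookbehind keeps "185 0 obj" from matching inside "1185 0 obj".
        Pattern marker = Pattern.compile("(?<![0-9])" + objNr + "\\s+" + gen + "\\s+obj\\b");
        List<Long> offsets = new ArrayList<>();
        Matcher m = marker.matcher(text);
        while (m.find()) {
            offsets.add((long) m.start());
        }
        return offsets;
    }

    /** Ghosts are physical occurrences that no XRef entry points to. */
    static List<Long> ghosts(List<Long> physical, Set<Long> xrefOffsets) {
        List<Long> result = new ArrayList<>(physical);
        result.removeAll(xrefOffsets);
        return result;
    }
}
```

For the file discussed here, calling findOccurrences(bytes, 185, 0) and then ghosts(...) with the single XRef offset 2523997 would be expected to leave offset 7019154 among the ghosts. The marker text can also occur inside stream data, so the scan is a heuristic.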

> What you are proposing is that we discard that as there are other instances
> of object 185 which are referenced by a valid xref. Correct? And that's
> what your patch does.

Exactly. The patch leaves the XRef entry for object 185 (offset 2523997,
valid) untouched, because collectInvalidXrefKeys() recognises it as valid.
The physically later occurrence at offset 7019154 is never consulted.
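
The idea of selective correction can be sketched as follows. This is a simplified stand-in, not the attached patch: keys are plain object numbers instead of COSObjectKey, the validity check is passed in as a predicate, and dropping an invalid entry that brute force cannot locate is an assumption made for this example.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

// Hypothetical illustration of "repair only what is broken".
final class SelectiveXrefRepair {

    /** Checks ALL entries and returns the keys whose offsets fail validation. */
    static Set<Integer> collectInvalidKeys(Map<Integer, Long> xref,
                                           BiPredicate<Integer, Long> isValid) {
        Set<Integer> invalid = new HashSet<>();
        for (Map.Entry<Integer, Long> e : xref.entrySet()) {
            if (!isValid.test(e.getKey(), e.getValue())) {
                invalid.add(e.getKey());
            }
        }
        return invalid;
    }

    /** Replaces only the invalid entries with brute-force offsets; valid entries are untouched. */
    static void repair(Map<Integer, Long> xref, Map<Integer, Long> bruteForce,
                       BiPredicate<Integer, Long> isValid) {
        for (Integer key : collectInvalidKeys(xref, isValid)) {
            Long found = bruteForce.get(key);
            if (found != null) {
                xref.put(key, found);
            } else {
                xref.remove(key); // assumption: drop entries that cannot be located
            }
        }
    }
}
```

With this shape, an entry such as object 185 at offset 2523997 passes the check and keeps its offset, regardless of what the brute-force map holds for the same object number.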

> Was the value of object 185 empty when auto-save saved it? So did it save
> the correct state but wrongly, or did it not even save the correct state?

Neither — the 6 empty occurrences are structurally different from the 38
filled ones. They are template objects from the base PDF (the original blank
form), not a mis-saved copy of the filled state.

The 38 occurrences WITH a value look like this:

  185 0 obj
  <</M (D:20260423093304+02'00') /Type /Annot /V (XXXXX)
    /T (Text4) /AP <</N 635 0 R>> ...>>

The 6 occurrences WITHOUT a value have no /M (no timestamp), no /V, and
reference a different appearance stream:

  185 0 obj
  <</AP <</N 499 0 R>> /BS <</S /I>> /F 4 /FT /Tx /MK <<>>
    /P 30 0 R /Rect [...] /T (Text4) /Type /Annot>>

Obj 499 is the appearance stream of the blank template field; obj 635 is the
appearance stream of the filled field. So the empty occurrences are verbatim
copies of the original blank template object, not a snapshot of the filled
state.
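
A rough way to sort the occurrences into filled and blank is to test the raw dictionary text for a /V key. The helper below is a hypothetical sketch, not PDFBox API; it works on the dictionary source as a string and would also match "/V" inside a string literal, so it is only a heuristic.

```java
import java.util.regex.Pattern;

// Hypothetical helper for classifying raw field dictionaries.
final class FieldDictCheck {

    // "/V" followed by a non-name character, so that longer names starting with /V do not match.
    private static final Pattern V_KEY = Pattern.compile("/V(?![A-Za-z0-9])");

    /** Returns true if the dictionary source text appears to contain a /V entry. */
    static boolean hasValueKey(String dictSource) {
        return V_KEY.matcher(dictSource).find();
    }
}
```

Applied to the two dictionaries quoted above, the first (with /V and /AP 635 0 R) is reported as filled and the second (with /AP 499 0 R) as blank.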

Our interpretation: during certain auto-save cycles, PDF Expert re-embedded
large blocks of the original template (possibly for XRef reconstruction or
padding) without substituting the current field values. The correct state was
never lost from the file — it was always present at offset 2523997 and
referenced by the XRef chain — but the ghost copies of the blank template
confused the brute-force scanner.
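
Why a later ghost copy wins in a linear scan can be shown with a small model. This is not the BruteForceParser code; it only mimics the behaviour described in the issue, where each marker found is recorded with Map.put, and the class name is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a linear scan that records offsets with Map.put.
final class LastWinsDemo {

    /** Each hit is {objectNumber, offset}, given in file order. */
    static Map<Integer, Long> scan(long[][] hitsInFileOrder) {
        Map<Integer, Long> offsets = new LinkedHashMap<>();
        for (long[] hit : hitsInFileOrder) {
            // put() overwrites any earlier entry: the last physical occurrence wins
            offsets.put((int) hit[0], hit[1]);
        }
        return offsets;
    }
}
```

Feeding it the two offsets discussed here, 2523997 followed by 7019154, leaves object 185 mapped to 7019154, the blank ghost copy.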

> Which application has the auto-save defect?

PDF Expert 7.25.4.1276 on iOS. Every startxref/%%EOF section in the file is
preceded by the comment:

  % PDF Expert 7.25.4.1276 iOS d7182b6d2d
  % Smart Incremental Update Autosave

The base PDF was originally created with Adobe Acrobat (the file contains
Adobe XMP Core 9.1 metadata), then filled and repeatedly auto-saved by
PDF Expert on iOS, producing 1948 incremental updates.
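
A figure such as the number of incremental updates can be approximated by counting %%EOF markers in the raw bytes. The sketch below is a generic byte search with a hypothetical class name; it assumes the marker does not also appear inside stream data, which is not guaranteed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: counts non-overlapping occurrences of an ASCII marker.
final class MarkerCounter {

    static int count(byte[] data, String marker) {
        byte[] m = marker.getBytes(StandardCharsets.US_ASCII);
        if (m.length == 0) {
            throw new IllegalArgumentException("marker must not be empty");
        }
        int count = 0;
        for (int i = 0; i + m.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < m.length; j++) {
                if (data[i + j] != m[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                count++;
                i += m.length - 1; // skip past this marker
            }
        }
        return count;
    }
}
```

Usage would be along the lines of MarkerCounter.count(Files.readAllBytes(path), "%%EOF"); the same call with "% PDF Expert" would count the producer comments.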



> PDFBox Bug: Form field values lost when loading PDFs with many incremental saves
> --------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6201
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6201
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 3.0.7 PDFBox
>            Reporter: Stefan Ziegler
>            Priority: Major
>         Attachments: COSParser.java, COSParser.patch, pdfbox-6201-1.pdf, pdfbox-6201-2.pdf
>
>
> PDFBox Bug: Form field values lost when loading PDFs with many incremental saves
> ================================================================================
> Component: pdfbox - COSParser, BruteForceParser
> Affects:   3.0.7 (confirmed); likely all prior versions
> Severity:  Major - visible data loss (form field values silently set to empty)
>
> SYMPTOM
> -------
> Loading a PDF with many incremental saves (e.g. 1948 startxref/%%EOF sections)
> causes PDFBox to silently lose form field values. The original PDF, when viewed
> in Adobe Acrobat, Chrome, or qpdf, correctly shows filled-in values such as
> "xxxx", "xxxx", "xxxx", "xxxx". After loading with PDFBox and saving, all
> fields are empty.
>
> qpdf and Ghostscript process the same PDF without errors or warnings.
> Running "qpdf" on the PDF beforehand produces a clean file that
> PDFBox handles correctly.
>
> ROOT CAUSE
> ----------
> The bug is in COSParser.checkXrefOffsets() (called from parseXref(), lenient
> mode only).
>
> Step-by-step trace:
> 1. parseXref() correctly traverses the full /Prev chain (5 XRef streams):
>      Depth 0: XRef@7165114  /Size=721    8 entries  /Prev=7148230
>      Depth 1: XRef@7148230  /Size=715   10 entries  /Prev=7144285
>      Depth 2: XRef@7144285  /Size=708  340 entries  /Prev=116      <- has Obj 185
>      Depth 3: XRef@116      /Size=159  131 entries  /Prev=128867
>      Depth 4: XRef@128867   /Size=28    28 entries  /Prev=none
>    After setStartxref(), xrefTrailerResolver.getXrefTable() has 384 entries.
>    Obj 185 -> offset 2523997, which contains /V (xxxx). CORRECT.
> 2. checkXrefOffsets() is called (lenient mode). It calls validateXrefOffsets().
> 3. validateXrefOffsets() iterates over all 384 entries. At the FIRST entry whose
>    offset cannot be dereferenced (findObjectKey returns null), it immediately
>    returns false -- without checking the remaining entries.
> 4. Back in checkXrefOffsets(), because validateXrefOffsets() returned false:
>        xrefOffset.clear();                        // DESTROYS all 384 correct entries
>        xrefOffset.putAll(bfCOSObjectKeyOffsets);  // replaces with brute-force results
> 5. BruteForceParser.getBFCOSObjectOffsets() scans the file linearly using map.put()
>    (not putIfAbsent). For each "N 0 obj" marker found, it overwrites the previous
>    entry for that object number. The LAST physical occurrence wins.
> 6. Object 185 appears 44 times physically. The last occurrence (offset 7019154) is
>    an empty copy written by a later auto-save -- it has no /V entry.
> 7. PDFBox loads the empty object. Text4.getValueAsString() returns "".
>
> Verification:
>   Before checkXrefOffsets: xrefTable.size()=384, obj185=2523997  <- CORRECT
>   After  checkXrefOffsets: xrefTable.size()=85,  obj185=null     <- BUG
>
> THE FIX
> -------
> FIX 1: COSParser.checkXrefOffsets()
> Replace the all-or-nothing logic with selective correction.
> Collect all invalid keys (don't stop at first failure), then only replace those
> specific invalid entries with brute-force results. Leave valid entries untouched.
>
> See attached COSParser.java for the full implementation:
> - checkXrefOffsets() now calls collectInvalidXrefKeys() instead of
>   validateXrefOffsets()
> - collectInvalidXrefKeys() checks ALL entries and returns only the invalid ones
> - Only invalid entries are corrected via brute force; valid ones are preserved
>
> Note: Fix 1 alone fully resolves the reported issue. With the fix applied,
> BruteForceParser no longer plays a part in this bug path: obj 185 has a valid
> XRef entry and is therefore never replaced by a brute-force result.
>
> FULL XRef CHAIN
> ---------------
> Offset    Obj  Size  Entries  Prev     Contains obj 185?
> 7165114   720   721        8  7148230  no
> 7148230   714   715       10  7144285  no  (has obj 184, not 185)
> 7144285   707   708      340  116      YES -> offset 2523997
> 116        67   159      131  128867   no
> 128867      6    28       28  -        no
>
> All 5 XRef streams decompress without error. The chain is valid.
> PDFBox reads all 384 entries correctly before checkXrefOffsets() destroys them.


