[ 
https://issues.apache.org/jira/browse/PDFBOX-6201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18077299#comment-18077299
 ] 

Stefan Ziegler commented on PDFBOX-6201:
----------------------------------------

The extensive description of this issue is my analysis of a PDF that PDFBox 
(version 3.0.7) couldn't read correctly. The brute-force parser picked up 
copies of PDF objects that shouldn't have been used, including copies with an 
empty /V value, so the values of certain form fields were lost. I can't attach 
the original PDF here, but I've created synthetic PDFs that demonstrate the 
problem. The problem can be solved with a small modification to COSParser. The 
test PDFs and the patch itself are attached.

One could argue either way whether this is an improvement or a bug. PDFBox 
was unable to read the PDF and retrieve the necessary information from the 
forms. I also opened and processed the PDF in many other PDF programs, 
including Chrome, Adobe Reader, QPDF, Ghostscript, and PDF.js. In all of these 
programs I could see the form fields with their correct values; only with 
PDFBox were the fields empty. This patch corrects that.

Regarding "Object 185 appears 44 times physically": it is a fairly large PDF 
that has been incrementally updated many times.

> PDFBox Bug: Form field values lost when loading PDFs with many incremental 
> saves
> --------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6201
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6201
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 3.0.7 PDFBox
>            Reporter: Stefan Ziegler
>            Priority: Major
>         Attachments: COSParser.java, COSParser.patch, pdfbox-6201-1.pdf, 
> pdfbox-6201-2.pdf
>
>
> PDFBox Bug: Form field values lost when loading PDFs with many incremental 
> saves
> ================================================================================
> Component: pdfbox - COSParser, BruteForceParser
> Affects:   3.0.7 (confirmed); likely all prior versions
> Severity:  Major - visible data loss (form field values silently set to empty)
> SYMPTOM
> -------
> Loading a PDF with many incremental saves (e.g. 1948 startxref/%%EOF sections)
> causes PDFBox to silently lose form field values. The original PDF, when
> viewed in Adobe Acrobat, Chrome, or qpdf, correctly shows filled-in values
> such as "xxxx", "xxxx", "xxxx", "xxxx". After loading with PDFBox and saving,
> all fields are empty.
> qpdf and Ghostscript process the same PDF without errors or warnings.
> Running "qpdf" on the PDF beforehand produces a clean file that
> PDFBox handles correctly.
> ROOT CAUSE
> ----------
> The bug is in COSParser.checkXrefOffsets() (called from parseXref(),
> lenient mode only).
> Step-by-step trace:
> 1. parseXref() correctly traverses the full /Prev chain (5 XRef streams):
>      Depth 0: XRef@7165114  /Size=721   8 entries   /Prev=7148230
>      Depth 1: XRef@7148230  /Size=715  10 entries   /Prev=7144285
>      Depth 2: XRef@7144285  /Size=708  340 entries  /Prev=116       <- has Obj 185
>      Depth 3: XRef@116       /Size=159  131 entries  /Prev=128867
>      Depth 4: XRef@128867    /Size=28   28 entries   /Prev=none
>    After setStartxref(), xrefTrailerResolver.getXrefTable() has 384 entries.
>    Obj 185 -> offset 2523997, which contains /V (xxxx). CORRECT.
> 2. checkXrefOffsets() is called (lenient mode). It calls validateXrefOffsets().
> 3. validateXrefOffsets() iterates over all 384 entries. At the FIRST entry
>    whose offset cannot be dereferenced (findObjectKey returns null), it
>    immediately returns false -- without checking the remaining entries.
> 4. Back in checkXrefOffsets(), because validateXrefOffsets() returned false:
>        // DESTROYS all 384 correct entries
>        xrefOffset.clear();
>        // replaces with brute-force results
>        xrefOffset.putAll(bfCOSObjectKeyOffsets);
> 5. BruteForceParser.getBFCOSObjectOffsets() scans the file linearly using
>    map.put() (not putIfAbsent). For each "N 0 obj" marker found, it overwrites
>    the previous entry for that object number. The LAST physical occurrence wins.
> 6. Object 185 appears 44 times physically. The last occurrence (offset 7019154)
>    is an empty copy written by a later auto-save -- it has no /V entry.
> 7. PDFBox loads the empty object. Text4.getValueAsString() returns "".
> Verification:
>   Before checkXrefOffsets: xrefTable.size()=384, obj185=2523997  <- CORRECT
>   After  checkXrefOffsets: xrefTable.size()=85,  obj185=null     <- BUG
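
To illustrate steps 5 and 6 of the trace, here is a minimal, self-contained sketch of how put() and putIfAbsent() differ when the same object number is seen more than once. It is not PDFBox code: the class name, the scan() helper and the marker array are hypothetical, and only the two offsets for object 185 are taken from the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LastOccurrenceWins {
    // Hypothetical stand-in for a linear scan. Each row is
    // {objectNumber, offset}, in the order the "N 0 obj" markers are visited.
    static Map<Long, Long> scan(long[][] markers, boolean keepFirst) {
        Map<Long, Long> offsets = new LinkedHashMap<>();
        for (long[] m : markers) {
            if (keepFirst) {
                offsets.putIfAbsent(m[0], m[1]); // first occurrence is kept
            } else {
                offsets.put(m[0], m[1]);         // later occurrences overwrite
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Two of the occurrences of object 185 named in the description.
        long[][] markers = { {185, 2523997L}, {185, 7019154L} };
        System.out.println(scan(markers, false).get(185L)); // prints 7019154
        System.out.println(scan(markers, true).get(185L));  // prints 2523997
    }
}
```

Neither policy is reliable on its own for an incrementally updated file, since only the XRef chain says which physical copy is the current one. That is why the proposed fix keeps the valid XRef entries instead of changing the scan order.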
> THE FIX
> -----------------------
> FIX 1: COSParser.checkXrefOffsets()
> Replace the all-or-nothing logic with selective correction.
> Collect all invalid keys (don't stop at first failure), then only replace
> those specific invalid entries with brute-force results. Leave valid entries
> untouched.
> See attached COSParser.java for the full implementation:
> - checkXrefOffsets() now calls collectInvalidXrefKeys() instead of
>   validateXrefOffsets()
> - collectInvalidXrefKeys() checks ALL entries and returns only the invalid
>   ones
> - Only invalid entries are corrected via brute force; valid ones are preserved
> Note: Fix 1 alone fully resolves the reported issue.
> After the fix, BruteForceParser no longer affects obj 185 at all: obj 185
> has a valid XRef entry, so it is never replaced by a brute-force result.
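
The selective correction described under FIX 1 could be sketched roughly as follows. This is not the attached patch: it uses plain Long keys and a caller-supplied validity check instead of COSObjectKey and the real dereferencing logic. The repair() helper, object 200 and its offsets are hypothetical, and dropping an invalid entry that the brute-force scan cannot find is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

public class SelectiveXrefRepair {
    // Checks ALL entries and returns the keys that fail the check,
    // instead of stopping at the first failure.
    static Set<Long> collectInvalidKeys(Map<Long, Long> xref,
                                        BiPredicate<Long, Long> isValid) {
        Set<Long> invalid = new HashSet<>();
        for (Map.Entry<Long, Long> e : xref.entrySet()) {
            if (!isValid.test(e.getKey(), e.getValue())) {
                invalid.add(e.getKey());
            }
        }
        return invalid;
    }

    // Replaces only the invalid entries; valid entries stay untouched.
    static void repair(Map<Long, Long> xref, Map<Long, Long> bruteForce,
                       BiPredicate<Long, Long> isValid) {
        for (Long key : collectInvalidKeys(xref, isValid)) {
            Long found = bruteForce.get(key);
            if (found != null) {
                xref.put(key, found);
            } else {
                xref.remove(key); // assumption: drop what cannot be located
            }
        }
    }

    public static void main(String[] args) {
        Map<Long, Long> xref = new HashMap<>();
        xref.put(185L, 2523997L); // valid entry from the XRef chain
        xref.put(200L, 999L);     // hypothetical broken entry
        Map<Long, Long> bruteForce = new HashMap<>();
        bruteForce.put(185L, 7019154L); // last physical copy, no /V
        bruteForce.put(200L, 4242L);    // hypothetical scan result
        repair(xref, bruteForce, (key, offset) -> offset != 999L);
        System.out.println(xref.get(185L)); // prints 2523997
        System.out.println(xref.get(200L)); // prints 4242
    }
}
```

In this sketch the entry for object 185 survives because it passes the check, while only the broken entry is taken from the brute-force results.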
> FULL XRef CHAIN
> ---------------
> Offset     Obj    Size  Entries  Prev      Contains obj 185?
> 7165114    720    721      8     7148230   no
> 7148230    714    715     10     7144285   no  (has obj 184, not 185)
> 7144285    707    708    340     116       YES -> offset 2523997
> 116         67    159    131     128867    no
> 128867       6     28     28     -         no
> All 5 XRef streams decompress without error. The chain is valid.
> PDFBox reads all 384 entries correctly before checkXrefOffsets() destroys them.
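
The per-section entry counts in the table add up to 517 (8 + 10 + 340 + 131 + 28), while the merged table has 384 entries. That is consistent with some objects being defined in more than one section, where the entry from the newer section takes precedence. A minimal sketch of such a merge, assuming the sections are supplied newest first: the class, the merge() helper and the offsets for object 184 are hypothetical, and this is not how XrefTrailerResolver is implemented.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class XrefChainMerge {
    // Merges XRef sections given newest first. An entry from a newer
    // section is never overwritten by an older one.
    static Map<Long, Long> merge(List<Map<Long, Long>> sectionsNewestFirst) {
        Map<Long, Long> merged = new HashMap<>();
        for (Map<Long, Long> section : sectionsNewestFirst) {
            for (Map.Entry<Long, Long> e : section.entrySet()) {
                merged.putIfAbsent(e.getKey(), e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<Long, Long> newer = new HashMap<>();
        newer.put(184L, 7100000L);   // hypothetical newer copy of obj 184
        Map<Long, Long> older = new HashMap<>();
        older.put(184L, 2500000L);   // hypothetical older copy of obj 184
        older.put(185L, 2523997L);   // obj 185 as in the description
        Map<Long, Long> merged = merge(Arrays.asList(newer, older));
        System.out.println(merged.size());    // prints 2 (3 entries in total)
        System.out.println(merged.get(184L)); // prints 7100000
        System.out.println(merged.get(185L)); // prints 2523997
    }
}
```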



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
