[ 
https://issues.apache.org/jira/browse/PDFBOX-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902760#action_12902760
 ] 

Andreas Lehmkühler commented on PDFBOX-796:
-------------------------------------------

The patch looks good to me. In the long run, however, we should add support 
for incremental updates and signed documents so that this workaround becomes 
redundant.

> Objects from streams overwrite objects already read with the same 
> ID/Generation
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-796
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-796
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>         Environment: 32-bit Windows Vista, Java 1.5, PDFBox head tag
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>             Fix For: 1.3.0
>
>         Attachments: PDFBOX-796.patch
>
>
> When trying to merge some documents (using the PDFMergerUtility class) I got 
> a NullPointerException and the merge failed.  I traced the failure and 
> eventually discovered that some objects were being overwritten when the 
> PDFParser called document.dereferenceObjectStreams() (line 207 of 
> PDFParser.java).
> Having multiple objects with the same object ID is a violation of the PDF 
> specification, so the correct way to handle it is undefined.  The "use the 
> first object" approach enabled my file to be processed and is consistent 
> with the other code in PDFBox.  For another example of how PDFBox handles 
> reading an object that already exists, see PDFParser (line 541), which 
> checks whether the object has already been read and put in the pool.  If it 
> has, the newly read object is added to the list of conflicts.  Later, when 
> resolveConflicts() is called, it overwrites the pooled object only if the 
> conflicting object is the one specifically referenced in the xref table.  
> This is a reasonable way to resolve conflicts because if the object isn't in 
> the xref table, it is likely the wrong one.
> Since we're reading from a stream of compressed data, we cannot give a 
> particular byte offset for the object.  This means we can't add these 
> conflicts to the conflict list and try to determine whether the object is 
> legitimate.  It's best to use the data we've already read, as using the one 
> from the stream has been confirmed to cause problems.  I've done regression 
> testing with other files which have this problem, including the file from 
> PDFBOX-720, and have not seen any issues.
> Unfortunately, I cannot provide the PDF which demonstrates this problem and 
> solution as it contains information I'm not authorized to release.
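
The "use the first object" policy described in the quoted report can be 
sketched in Java roughly as follows. This is only an illustration under 
stated assumptions, not the attached patch: ObjectPoolSketch, Key, and 
addIfAbsent are hypothetical names rather than PDFBox classes or methods, and 
plain strings stand in for parsed object contents.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a "first object wins" pool; not PDFBox code.
final class ObjectPoolSketch {

    /** Identifies an indirect object by object number and generation. */
    static final class Key {
        final long number;
        final int generation;

        Key(long number, int generation) {
            this.number = number;
            this.generation = generation;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Key)) {
                return false;
            }
            Key that = (Key) other;
            return number == that.number && generation == that.generation;
        }

        @Override
        public int hashCode() {
            return 31 * Long.hashCode(number) + generation;
        }
    }

    // Assumption: values are never null, so a null result from
    // putIfAbsent always means "no object was stored under this key yet".
    private final Map<Key, String> pool = new HashMap<>();

    /**
     * Stores the object only if nothing with the same number/generation
     * has been read before. Returns true if the object was stored and
     * false if an earlier object was kept instead.
     */
    boolean addIfAbsent(Key key, String contents) {
        return pool.putIfAbsent(key, contents) == null;
    }

    /** Returns the pooled contents, or null if the key is unknown. */
    String get(Key key) {
        return pool.get(key);
    }
}
```

With a pool like this, an object decoded from an object stream after the 
file body has been parsed is dropped whenever its number and generation are 
already present, which is the behaviour the report argues for when no byte 
offset is available to check against the xref table.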

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
