Christian Appl created PDFBOX-5263:
--------------------------------------

             Summary: Suggestion: Signing actual document changes - Enhancing 
incremental saving
                 Key: PDFBOX-5263
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5263
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing, PDModel, Writing
    Affects Versions: 3.0.0 PDFBox
            Reporter: Christian Appl
             Fix For: 3.0.0 PDFBox
         Attachments: Enhanced_incremental_saving_PDFBox3.patch, 
image-2021-08-23-14-55-24-077.png

*TL;DR:*
Currently it is rather tedious to create incremental changes in between 
signatures via PDFBox. I attempted to simplify that and wrote a patch.
This is a proof of concept (POC) rather than a suggestion for direct 
inclusion (for reasons explained later).

*Signatures and incremental PDF documents:*
A typical reason for wanting to sign a document multiple times (creating an 
incremental PDF) is that the document changed in between signatures and the 
additional signature shall sign the new state of the document.

If one wanted to implement such incremental changes using PDFBox, one would 
find that most of the time the changes made are completely ignored when calling 
"saveIncremental".
As documented for the "saveIncremental" methods and especially the matching 
constructors in "COSWriter", one has to identify the "path" of all changes 
made and set the "needToBeUpdated" flag on every element of that path.

*But:*
As documented, one needs an exact understanding of what one changed and of how 
the PDF standard represents it, and one has to identify the affected 
structures. The more complex the changes are, the more tedious this becomes.

*Also:*
Because of how incremental saving is implemented in COSWriter, the whole 
path must be flagged as requiring an update.
This results in unnecessarily large increments, as not all ancestors might 
actually have changed.
E.g. if one added an image to a preexisting page of the document, the 
content stream, the resources of the page and the page dictionary would have 
changed. The "pages" array and all its ancestors would not have changed a 
bit, but would still have to be flagged and included.
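
The effect can be illustrated with a small toy model. This is only a sketch of 
the behaviour described above, not PDFBox code: "FlagNode" and 
"PathFlaggingDemo" are hypothetical names, and the writer is assumed to 
descend only through flagged objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a COS object; not a PDFBox class.
class FlagNode {
    final String name;
    final List<FlagNode> children = new ArrayList<>();
    boolean needToBeUpdated; // imitates the idea of the "needToBeUpdated" flag

    FlagNode(String name) {
        this.name = name;
    }

    FlagNode add(FlagNode child) {
        children.add(child);
        return child;
    }
}

class PathFlaggingDemo {
    // Assumed writer behaviour: stop descending at the first unflagged object.
    static void collect(FlagNode node, List<String> out) {
        if (!node.needToBeUpdated) {
            return; // an unflagged ancestor hides everything below it
        }
        out.add(node.name);
        for (FlagNode child : node.children) {
            collect(child, out);
        }
    }

    static List<String> increment(FlagNode root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    // Builds the chain catalog -> pages -> page -> contents, all unflagged.
    static FlagNode[] buildChain() {
        FlagNode catalog = new FlagNode("catalog");
        FlagNode pages = catalog.add(new FlagNode("pages"));
        FlagNode page = pages.add(new FlagNode("page"));
        FlagNode contents = page.add(new FlagNode("contents"));
        return new FlagNode[] { catalog, pages, page, contents };
    }

    public static void main(String[] args) {
        FlagNode[] chain = buildChain();
        chain[2].needToBeUpdated = true; // the page really changed
        chain[3].needToBeUpdated = true; // its contents really changed
        System.out.println(increment(chain[0])); // prints []

        chain[0].needToBeUpdated = true; // unchanged ancestors flagged anyway
        chain[1].needToBeUpdated = true;
        System.out.println(increment(chain[0])); // prints [catalog, pages, page, contents]
    }
}
```

In this model the real changes are lost until the unchanged "catalog" and 
"pages" objects are flagged too, and they then end up in the increment.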

*Assumptions that lead to this patch:*
- COSWriter should not stop iterating a COS tree just because a parent element 
did not change. Its descendants still could have changed!

- Externally managing an object's update state is tedious and error-prone.
Objects that implement "COSUpdateInfo" should know and manage by themselves 
whether they were freshly created or altered
(e.g.: a COSDictionary should be able to remember that a setter has been 
called).

- If "COSUpdateInfo" objects were self-aware and solved this by 
themselves, it would no longer be necessary to set update states manually.
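
A minimal sketch of these assumptions, using hypothetical classes 
("TrackingDict", "FullTraversalDemo") rather than the classes from the patch: 
the dictionary records by itself that a setter was called, and the traversal 
visits every object, even below unchanged parents. "markUnchanged" is only a 
stand-in for treating freshly built objects as preexisting.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical self-tracking dictionary; it imitates the idea, it is not COSDictionary.
class TrackingDict {
    final String name;
    private final Map<String, Object> items = new LinkedHashMap<>();
    private boolean updated;

    TrackingDict(String name) {
        this.name = name;
    }

    void setItem(String key, Object value) {
        items.put(key, value);
        updated = true; // the object itself remembers that a setter was called
    }

    boolean isUpdated() {
        return updated;
    }

    // Stand-in for "treat as preexisting"; not how the patch does it.
    void markUnchanged() {
        updated = false;
    }

    Iterable<Object> values() {
        return items.values();
    }
}

class FullTraversalDemo {
    // Visits every reachable dictionary and reports the changed ones.
    static List<String> changedObjects(TrackingDict root) {
        List<String> out = new ArrayList<>();
        Set<TrackingDict> seen =
                Collections.newSetFromMap(new IdentityHashMap<TrackingDict, Boolean>());
        walk(root, seen, out);
        return out;
    }

    private static void walk(TrackingDict dict, Set<TrackingDict> seen, List<String> out) {
        if (!seen.add(dict)) {
            return; // guards against cyclic references
        }
        if (dict.isUpdated()) {
            out.add(dict.name);
        }
        for (Object value : dict.values()) {
            if (value instanceof TrackingDict) {
                walk((TrackingDict) value, seen, out); // descend even if the parent is unchanged
            }
        }
    }

    // Builds catalog -> pages -> page and marks all three as unchanged.
    static TrackingDict[] buildUnchangedTree() {
        TrackingDict catalog = new TrackingDict("catalog");
        TrackingDict pages = new TrackingDict("pages");
        TrackingDict page = new TrackingDict("page");
        catalog.setItem("Pages", pages);
        pages.setItem("Kids", page);
        TrackingDict[] tree = { catalog, pages, page };
        for (TrackingDict dict : tree) {
            dict.markUnchanged();
        }
        return tree;
    }

    public static void main(String[] args) {
        TrackingDict[] tree = buildUnchangedTree();
        tree[2].setItem("Contents", "new content stream");
        System.out.println(changedObjects(tree[0])); // prints [page]
    }
}
```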

*Problems:*
The first and obvious problem is that the initial loading of a document 
creates and alters new COS structures, and we obviously don't want objects to 
observe and remember those changes. An object that is created during document 
initialization must be treated as preexisting.

However: COSBase is not context aware. It does know its descendants, but 
it knows neither its parent nor its root.
If it did, that would actually present the optimal solution, as in that case 
the object could ask its root for the current load state and would therefore 
be able to ignore the changes caused by the initial loading of a document.
But it does not. (My opinion is: it should! But more on that later.)

Therefore a helper named COSUpdateInfoList was implemented, which is capable 
of finding COSUpdateInfo objects in a COS structure and allows resetting 
their update state after loading has completed.
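
The idea of such a reset pass can be sketched as follows. This is not the 
COSUpdateInfoList from the patch; "LoadedDict" and "UpdateStateResetter" are 
hypothetical names, and the sketch only shows the principle of one extra 
traversal after loading.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical dictionary whose setter always marks it as changed.
class LoadedDict {
    private final Map<String, Object> items = new LinkedHashMap<>();
    private boolean needToBeUpdated;

    void setItem(String key, Object value) {
        items.put(key, value);
        needToBeUpdated = true;
    }

    boolean isNeedToBeUpdated() {
        return needToBeUpdated;
    }

    void setNeedToBeUpdated(boolean flag) {
        needToBeUpdated = flag;
    }

    Iterable<Object> values() {
        return items.values();
    }
}

// Hypothetical helper in the spirit of the reset pass described above.
class UpdateStateResetter {
    // Clears the flag on every reachable dictionary; returns how many were visited.
    static int resetAll(LoadedDict root) {
        Set<LoadedDict> seen =
                Collections.newSetFromMap(new IdentityHashMap<LoadedDict, Boolean>());
        reset(root, seen);
        return seen.size();
    }

    private static void reset(LoadedDict dict, Set<LoadedDict> seen) {
        if (!seen.add(dict)) {
            return; // already handled, e.g. reached again via a back-reference
        }
        dict.setNeedToBeUpdated(false);
        for (Object value : dict.values()) {
            if (value instanceof LoadedDict) {
                reset((LoadedDict) value, seen);
            }
        }
    }
}

class ResetAfterLoadDemo {
    public static void main(String[] args) {
        // "Loading": building the structure flags every object as changed.
        LoadedDict catalog = new LoadedDict();
        LoadedDict pages = new LoadedDict();
        LoadedDict page = new LoadedDict();
        catalog.setItem("Pages", pages);
        pages.setItem("Kids", page);
        page.setItem("Parent", pages); // back-reference creates a cycle

        System.out.println(UpdateStateResetter.resetAll(catalog)); // prints 3
        page.setItem("Rotate", 90); // a real change after loading
        System.out.println(page.isNeedToBeUpdated());  // prints true
        System.out.println(pages.isNeedToBeUpdated()); // prints false
    }
}
```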

*Description of the patch:*
The patch implements self-aware COSUpdateInfo objects, which the COSWriter has 
been adapted to process. PDFBox is therefore capable of monitoring changes as 
they happen and of automatically including altered structures in an 
incremental save of the document, thereby creating an increment (or 
increments) that a signature would sign.

*Result:*
Using this patch, documents could be created by incrementally adding pages, 
adding contents to pages, adding annotations, altering structures and 
removing structures.
As far as has been tested initially, the resulting documents were valid and 
viewable in a reader, and the objects overwritten in increments seemed correct.

*But -* *Caveat:*
This patch does introduce at least one ugly class (most likely you will be able 
to point out more that could be optimized :)), and that is "COSUpdateInfoList". 
As already explained: in my opinion such a class should not exist; the 
COSUpdateInfo objects should be context aware and capable of regulating their 
own behaviour.
Whenever the alternatives are to either manage an object externally or to 
"teach" an object to solve problems autonomously, I tend to prefer the 
latter... but I did not dare to do that.

Making the objects context aware would require further constructors or setters 
for COSBase objects that allow setting the parent/root/context of the object.
This would result in further massive changes for applications using PDFBox and 
for PDFBox itself, as all instantiations of COSBase objects (PDObjects) would 
have to be adapted.
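
To make the trade-off more concrete, here is a sketch of what a context-aware 
object could look like. All names ("DocState", "ContextDict") are 
hypothetical and nothing like this exists in the patch; it only illustrates 
why a context reference would make the separate reset pass unnecessary, and 
why every instantiation would have to pass a context.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical root/context object that knows whether the document is still loading.
class DocState {
    private boolean loading = true;

    boolean isLoading() {
        return loading;
    }

    void finishLoading() {
        loading = false;
    }
}

// Hypothetical context-aware dictionary: the context is a constructor argument,
// which is exactly the kind of API change discussed above.
class ContextDict {
    private final DocState context;
    private final Map<String, Object> items = new HashMap<>();
    private boolean needToBeUpdated;

    ContextDict(DocState context) {
        this.context = context;
    }

    void setItem(String key, Object value) {
        items.put(key, value);
        if (!context.isLoading()) {
            needToBeUpdated = true; // changes made while loading are ignored
        }
    }

    boolean isNeedToBeUpdated() {
        return needToBeUpdated;
    }
}

class ContextAwareDemo {
    public static void main(String[] args) {
        DocState state = new DocState();
        ContextDict page = new ContextDict(state);

        page.setItem("Type", "Page"); // happens during loading
        System.out.println(page.isNeedToBeUpdated()); // prints false

        state.finishLoading();
        page.setItem("Rotate", 90); // a real change after loading
        System.out.println(page.isNeedToBeUpdated()); // prints true
    }
}
```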

However: I would prefer it if COSBase objects actually were context aware.
But as stated... I did not dare to touch that and instead chose the ugly 
workaround, which introduces yet another iteration over the whole 
COSDocument structure.
Eliminating COSUpdateInfoList would be preferable!

*Suggestion:*
As PDFBox 3 is already changing how documents and objects are handled, I would 
suggest that COSBase objects also be made context- and self-aware in 
PDFBox 3.
This would simplify the handling of COS objects in PDFBox and would allow 
easier, automated handling of incremental saving.

*Usage example:*
The following "pseudo code" (actually using simplified Helper classes) 
demonstrates the intended usage:
!image-2021-08-23-14-55-24-077.png!

*As always:* Thank you very much for your work and support! I hope this 
suggestion is to your liking.



