[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406598#comment-17406598
 ] 

Christian Appl edited comment on PDFBOX-5263 at 8/30/21, 7:29 AM:
------------------------------------------------------------------

*Update:*
 I had to redesign and rethink much of what I presented above, but I made some 
progress concerning this. I don't know whether I will make any progress this 
week concerning this issue, therefore I will upload the current suggestion.

*This update:* [^Enhanced_incremental_saving_.patch]
 - Does not dereference objects, that have not been loaded by interactions with 
the user. (assuming such objects could not possibly have changed.)
 - It also does not change constructors of COS or PD objects - but still is 
keeping track of the parent document and update state.
 - It automatically includes changed and new objects in increments of that 
document.
 - It does not require the whole path to be updated, only including objects, 
that require inclusion.
 - The preexisting "hasToBeUpdated" logic remains functional and can be used to 
add objects to increments "manually".

*Caveat:*
 There still is one remaining test failure, that I could not resolve yet and 
most likely won't find the time to fix this week.
 "pdfbox-examples:" TestCreateSignature.testPDFBox4784:763->signEncrypted:1051 
» ArrayIndexOutOfBounds fails.

*Current state:*
 Test failures are not acceptable, therefore the patch is not ready yet. But it 
adresses a lot of the initial issues and once the remaining issue is resolved 
(and the JavaDoc has been extended), this could be suggested for inclusion.


was (Author: capsvd):
*Update:*
I had to redesign and rethink much of what I presented above, but I made some 
progress concerning this. I don't know whether I will make any progress this 
week concerning this issue, therefore I will upload the current suggestion.

*This sugesstion:* [^Enhanced_incremental_saving_.patch] 
- Does not dereference objects, that have not been loaded by interactions with 
the user. (assuming such objects could not possibly have changed.
- It also does not change constructors of COS or PD objects - but still keeping 
track of the document and update state.
- It automatically includes changed and new objects in increments of that 
document.
- It does not require the whole path to be updated, only including objects, 
that require inclusion.
- The preexisting "hasToBeUpdated" logic remains functional and can be used to 
add objects to increments "manually".

*Caveat:*
There still is one remaining test failure, that I could not resolve yet and 
most likely won't find the time to fix this week.
"pdfbox-examples:" TestCreateSignature.testPDFBox4784:763->signEncrypted:1051 » 
ArrayIndexOutOfBounds fails.

*Current state:*
Test failures are not acceptable, therefore the patch is not ready yet. But it 
adresses a lot of the initial issues and once the remaining issue is resolved 
(and the JavaDoc has been extended), this could be suggested for inclusion.

> Suggestion: Signing actual document changes - Enhancing incremental saving
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-5263
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5263
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, PDModel, Writing
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Christian Appl
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>         Attachments: Enhanced_incremental_saving_.patch, 
> Enhanced_incremental_saving_PDFBox3.patch, image-2021-08-23-14-55-24-077.png, 
> image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, 
> image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, 
> image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, 
> image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png
>
>
> *TL;DR:*
> Currently it is rather tedious to create incremental changes in between 
> signatures via PDFBox. I attempted to simplify that and wrote a patch.
> This is rather a POC, than an actual suggestion for direct inclusion. (For 
> reasons explained later.)
> *Signatures and incremental PDF documents:*
> A typical reason for wanting to sign a document multiple times (creating an 
> incremental PDF) is , that in between signatures the document changed and the 
> additional signature shall sign the new state of the document.
> If one wanted to implement such incremental changes using PDFBox, he would 
> find, that most of the time made changes are completly ignored, when calling 
> "saveIncremental".
> As documented for the "saveIncremental" methods and especially the matching 
> constructors in "COSWriter", this would require, to identify the "path" of 
> all made changes and one would need to set the "needToBeUpdated" flag of all 
> elements of that path.
> *But:*
> As documented one would have to have exact understanding of what he did and 
> how the PDF standard does implement this, he would have to identify said 
> structures and the more complex the changes were, the more tedious this would 
> become.
> *Also:*
> Because of the implementation of incremental saving in COSWriter, the whole 
> path must be informed that it required an update.
> Resulting in unnecessary large increments, as not all ancestors might 
> actually have changed.
> e.g. If one added an image to a preexisting page of the document - the 
> contentstream, the resources of the page and the page dictionary would have 
> changed. But the "pages" array and all it's ancestors would not have changed 
> a bit, but still would have to be informed and included.
> *Assumptions that lead to this patch:*
> - COSWriter should not stop iterating a COSTree just because a parent element 
> did not change. It's descendants still could have changed!
> - Externally managing an object´s update state is tedious and error-prone.
> Objects that implement "COSUpdateInfo" should know and manage by themselves 
> whether they were freshly created or altered
> (e.g.: A COSDictionary should be able to remember, that a setter had been 
> called).
> - If "COSUpdateInfo" objects were self aware and would solve this by 
> themselves, it would not be necessary anymore to set update states manually.
> *Problems:*
> The first and obvious problem is, that the initial loading of a document is 
> creating and altering new COS structures and we obviously don't want objects 
> to observe and remember those changes. An object that is created during 
> document initialization must be treated as preexisting.
> However: COSBase is not context aware - it does know it's descendants, but 
> neither does it know it's parent, nor does it know it's root.
> If it was, that actually would present the optimal solution, as in that case 
> the Object could ask it's root for the current load state and therefore would 
> be able to ignore said changes caused by the initial loading of a document.
> But it is not. (My opinion is - it should be! But more on that later.)
> Therefore a a helper named COSUpdateInfoList was implemented, which was 
> capable of finding COSUpdateInfo objects in a COS structure, and that allowed 
> resetting their update state after loading was completed.
> *Description of the patch:*
> The patch implements selfaware COSUpdateInfo objects, which the COSWriter has 
> been adapted to process. PDFBox therefore is capable of monitoring changes in 
> realtime and to automatically include altered structures in an incremental 
> save of the document, therefore creating increments (or an increment), that a 
> signature would sign.
> *Result:*
> Using this patch documents could be created:
> incrementally adding pages, adding contents to pages, adding annotations, 
> altering structures, removing structures.
> As far as has been initially tested the resulting documents were valid, 
> viewable in a reader and the objects overwritten in increments seemed correct.
> *But -* *Caveat:*
> This patch does introduce atleast one ugly class (most likely you will be 
> able to point out more, that could be optimized :)) and that is 
> "COSUpdateInfoList" - as already explained: In my opinion such a class should 
> not exist, the COSUpdateInfo objects should be context aware and should be 
> capable of regulating their own behaviour.
> Whenever the alternatives are to either manage an object externally, or to 
> "teach" an object to solve problems autonomously, I will tend to prefer the 
> latter... but I did not dare to do that.
> This would require, that either further constructors or setters would have to 
> be introduced for COSBase objects, that allowed setting parent/root/context 
> for the object.
> Which would result in further massive changes for using applications and 
> PDFBox itself - as all instantiations of COSBase objects (PDObjects) would 
> have to be adapted.
> However: I would prefer if COSBase objects actually were context aware.
> But as stated... I did not dare to touch it and instead chose the ugly 
> workarround, that would introduce yet another iteration over the whole 
> COSDocument structure.
> Eliminating COSUpdateInfoList would be preferable!
> *Suggestion:*
> As PDFBox 3 is already changing how documents and objects are handled, I 
> would suggest, that also COSBase objects should be made context and selfaware 
> in PDFBox 3.
> This would allow simplifying handling COS objects using PDFBox and it would 
> allow for an easier and automized handling of incremental saving.
> *Usage example:*
> The following "pseudo code" (actually using simplified Helper classes) 
> demonstrates the intended usage:
> !image-2021-08-23-14-55-24-077.png!
> *As always:* Thank you very much for your work and support! I hope this 
> suggestion is to your liking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

Reply via email to