[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425602#comment-17425602
 ] 

Maruan Sahyoun commented on PDFBOX-5263:
----------------------------------------

!profiling-2021-10-07 16-27-06.png!

with prepareIncrement -> parseObjectDynamically ... taking most of the time. 
Why does prepareIncrement need to parse? 

> Suggestion: Signing actual document changes - Enhancing incremental saving
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-5263
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5263
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, PDModel, Writing
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Christian Appl
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>         Attachments: Enhanced_incremental_saving_.patch, 
> Enhanced_incremental_saving_PDFBox3.patch, NoSignatureFound.zip, 
> Observer_based_incremental_saving(09-13-2021).patch, 
> Observer_based_incremental_saving_(09-09-2021-09-09).patch, 
> Observer_based_incremental_saving_(12-00-2021-09-08).patch, 
> Observer_based_incremental_saving_(15-19-2021-09-08).patch, 
> Observer_based_incremental_saving_(17-00-2021-09-07).patch, 
> Observer_based_incremental_saving_(fixed_).patch, 
> Observer_based_incremental_saving_-_still_eroneous.patch, 
> Observer_based_incremental_saving_.patch, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_afterPatch.pdf, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_beforePatch.pdf, 
> PDFBOX-5263_Introduce_COSReferenceInfo.patch, 
> PDFBOX-5263_Introduce_COSReferenceInfo_(LinkedHashMap).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling_.patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(simplified).patch, 
> Prototype_COSChangeObserver_(code_base).patch, 
> Prototype__Document_and_reference_holder_aware_COSContext_.patch, 
> Updating_context_management_.patch, image-2021-08-23-14-55-24-077.png, 
> image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, 
> image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, 
> image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, 
> image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png, 
> image-2021-09-06-17-06-59-667.png, image-2021-09-07-09-35-31-408.png, 
> image-2021-09-07-13-33-00-161.png, image-2021-09-07-15-40-59-080.png, 
> image-2021-09-08-10-23-44-036.png, image-2021-09-08-11-18-34-211.png, 
> image-2021-09-13-14-40-33-049.png, image-2021-09-13-14-41-13-206.png, 
> image-2021-09-30-09-33-04-449.png, image-2021-09-30-09-34-19-175.png, 
> image-2021-10-07-13-06-40-576.png, image-2021-10-07-13-38-34-817.png, 
> out.pdf, out2.pdf, profiling-2021-10-07 16-27-06.png
>
>
> *TL;DR:*
> Currently it is rather tedious to create incremental changes in between 
> signatures via PDFBox. I attempted to simplify that and wrote a patch.
> This is rather a POC, than an actual suggestion for direct inclusion. (For 
> reasons explained later.)
> *Signatures and incremental PDF documents:*
> A typical reason for wanting to sign a document multiple times (creating an 
> incremental PDF) is , that in between signatures the document changed and the 
> additional signature shall sign the new state of the document.
> If one wanted to implement such incremental changes using PDFBox, he would 
> find, that most of the time made changes are completly ignored, when calling 
> "saveIncremental".
> As documented for the "saveIncremental" methods and especially the matching 
> constructors in "COSWriter", this would require, to identify the "path" of 
> all made changes and one would need to set the "needToBeUpdated" flag of all 
> elements of that path.
> *But:*
> As documented one would have to have exact understanding of what he did and 
> how the PDF standard does implement this, he would have to identify said 
> structures and the more complex the changes were, the more tedious this would 
> become.
> *Also:*
> Because of the implementation of incremental saving in COSWriter, the whole 
> path must be informed that it required an update.
> Resulting in unnecessary large increments, as not all ancestors might 
> actually have changed.
> e.g. If one added an image to a preexisting page of the document - the 
> contentstream, the resources of the page and the page dictionary would have 
> changed. But the "pages" array and all it's ancestors would not have changed 
> a bit, but still would have to be informed and included.
> *Assumptions that lead to this patch:*
> - COSWriter should not stop iterating a COSTree just because a parent element 
> did not change. It's descendants still could have changed!
> - Externally managing an object´s update state is tedious and error-prone.
> Objects that implement "COSUpdateInfo" should know and manage by themselves 
> whether they were freshly created or altered
> (e.g.: A COSDictionary should be able to remember, that a setter had been 
> called).
> - If "COSUpdateInfo" objects were self aware and would solve this by 
> themselves, it would not be necessary anymore to set update states manually.
> *Problems:*
> The first and obvious problem is, that the initial loading of a document is 
> creating and altering new COS structures and we obviously don't want objects 
> to observe and remember those changes. An object that is created during 
> document initialization must be treated as preexisting.
> However: COSBase is not context aware - it does know it's descendants, but 
> neither does it know it's parent, nor does it know it's root.
> If it was, that actually would present the optimal solution, as in that case 
> the Object could ask it's root for the current load state and therefore would 
> be able to ignore said changes caused by the initial loading of a document.
> But it is not. (My opinion is - it should be! But more on that later.)
> Therefore a a helper named COSUpdateInfoList was implemented, which was 
> capable of finding COSUpdateInfo objects in a COS structure, and that allowed 
> resetting their update state after loading was completed.
> *Description of the patch:*
> The patch implements selfaware COSUpdateInfo objects, which the COSWriter has 
> been adapted to process. PDFBox therefore is capable of monitoring changes in 
> realtime and to automatically include altered structures in an incremental 
> save of the document, therefore creating increments (or an increment), that a 
> signature would sign.
> *Result:*
> Using this patch documents could be created:
> incrementally adding pages, adding contents to pages, adding annotations, 
> altering structures, removing structures.
> As far as has been initially tested the resulting documents were valid, 
> viewable in a reader and the objects overwritten in increments seemed correct.
> *But -* *Caveat:*
> This patch does introduce atleast one ugly class (most likely you will be 
> able to point out more, that could be optimized :)) and that is 
> "COSUpdateInfoList" - as already explained: In my opinion such a class should 
> not exist, the COSUpdateInfo objects should be context aware and should be 
> capable of regulating their own behaviour.
> Whenever the alternatives are to either manage an object externally, or to 
> "teach" an object to solve problems autonomously, I will tend to prefer the 
> latter... but I did not dare to do that.
> This would require, that either further constructors or setters would have to 
> be introduced for COSBase objects, that allowed setting parent/root/context 
> for the object.
> Which would result in further massive changes for using applications and 
> PDFBox itself - as all instantiations of COSBase objects (PDObjects) would 
> have to be adapted.
> However: I would prefer if COSBase objects actually were context aware.
> But as stated... I did not dare to touch it and instead chose the ugly 
> workarround, that would introduce yet another iteration over the whole 
> COSDocument structure.
> Eliminating COSUpdateInfoList would be preferable!
> *Suggestion:*
> As PDFBox 3 is already changing how documents and objects are handled, I 
> would suggest, that also COSBase objects should be made context and selfaware 
> in PDFBox 3.
> This would allow simplifying handling COS objects using PDFBox and it would 
> allow for an easier and automized handling of incremental saving.
> *Usage example:*
> The following "pseudo code" (actually using simplified Helper classes) 
> demonstrates the intended usage:
> !image-2021-08-23-14-55-24-077.png!
> *As always:* Thank you very much for your work and support! I hope this 
> suggestion is to your liking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

Reply via email to