[ https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425681#comment-17425681 ]
Christian Appl commented on PDFBOX-5263: ---------------------------------------- [~msahyoun] forget about everything I said - you are right, I am wrong, and I obviously fail to understand simple diagrams.... COSincrement has nothing to do with this and of course prepareIncrement is the one dereferencing objects here... (Even though: nice... COSIncrement does not cause objects to be dereferenced unnecessarily - so atleast that worked.) sorry.... Will have a closer look at prepareIncrement now. > Suggestion: Signing actual document changes - Enhancing incremental saving > -------------------------------------------------------------------------- > > Key: PDFBOX-5263 > URL: https://issues.apache.org/jira/browse/PDFBOX-5263 > Project: PDFBox > Issue Type: Improvement > Components: Parsing, PDModel, Writing > Affects Versions: 3.0.0 PDFBox > Reporter: Christian Appl > Priority: Major > Fix For: 3.0.0 PDFBox > > Attachments: Enhanced_incremental_saving_.patch, > Enhanced_incremental_saving_PDFBox3.patch, NoSignatureFound.zip, > Observer_based_incremental_saving(09-13-2021).patch, > Observer_based_incremental_saving_(09-09-2021-09-09).patch, > Observer_based_incremental_saving_(12-00-2021-09-08).patch, > Observer_based_incremental_saving_(15-19-2021-09-08).patch, > Observer_based_incremental_saving_(17-00-2021-09-07).patch, > Observer_based_incremental_saving_(fixed_).patch, > Observer_based_incremental_saving_-_still_eroneous.patch, > Observer_based_incremental_saving_.patch, > PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_afterPatch.pdf, > PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_beforePatch.pdf, > PDFBOX-5263_Introduce_COSReferenceInfo.patch, > PDFBOX-5263_Introduce_COSReferenceInfo_(LinkedHashMap).patch, > PDFBOX-5263_Introduce_state_based_increment_handling_.patch, > PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch, > PDFBOX-5263_Introduce_state_based_increment_handling__(simplified).patch, > Prototype_COSChangeObserver_(code_base).patch, > Prototype__Document_and_reference_holder_aware_COSContext_.patch, > Updating_context_management_.patch, image-2021-08-23-14-55-24-077.png, > image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, > image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, > image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, > image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png, > image-2021-09-06-17-06-59-667.png, image-2021-09-07-09-35-31-408.png, > image-2021-09-07-13-33-00-161.png, image-2021-09-07-15-40-59-080.png, > image-2021-09-08-10-23-44-036.png, image-2021-09-08-11-18-34-211.png, > image-2021-09-13-14-40-33-049.png, image-2021-09-13-14-41-13-206.png, > image-2021-09-30-09-33-04-449.png, image-2021-09-30-09-34-19-175.png, > image-2021-10-07-13-06-40-576.png, image-2021-10-07-13-38-34-817.png, > image-2021-10-07-16-42-48-685.png, image-2021-10-07-16-46-30-004.png, > image-2021-10-07-17-46-41-793.png, image-2021-10-07-18-13-05-737.png, > out.pdf, out2.pdf, profiling-2021-10-07 16-27-06.png > > > *TL;DR:* > Currently it is rather tedious to create incremental changes in between > signatures via PDFBox. I attempted to simplify that and wrote a patch. > This is rather a POC, than an actual suggestion for direct inclusion. (For > reasons explained later.) > *Signatures and incremental PDF documents:* > A typical reason for wanting to sign a document multiple times (creating an > incremental PDF) is , that in between signatures the document changed and the > additional signature shall sign the new state of the document. > If one wanted to implement such incremental changes using PDFBox, he would > find, that most of the time made changes are completly ignored, when calling > "saveIncremental". > As documented for the "saveIncremental" methods and especially the matching > constructors in "COSWriter", this would require, to identify the "path" of > all made changes and one would need to set the "needToBeUpdated" flag of all > elements of that path. > *But:* > As documented one would have to have exact understanding of what he did and > how the PDF standard does implement this, he would have to identify said > structures and the more complex the changes were, the more tedious this would > become. > *Also:* > Because of the implementation of incremental saving in COSWriter, the whole > path must be informed that it required an update. > Resulting in unnecessary large increments, as not all ancestors might > actually have changed. > e.g. If one added an image to a preexisting page of the document - the > contentstream, the resources of the page and the page dictionary would have > changed. But the "pages" array and all it's ancestors would not have changed > a bit, but still would have to be informed and included. > *Assumptions that lead to this patch:* > - COSWriter should not stop iterating a COSTree just because a parent element > did not change. It's descendants still could have changed! > - Externally managing an object´s update state is tedious and error-prone. > Objects that implement "COSUpdateInfo" should know and manage by themselves > whether they were freshly created or altered > (e.g.: A COSDictionary should be able to remember, that a setter had been > called). > - If "COSUpdateInfo" objects were self aware and would solve this by > themselves, it would not be necessary anymore to set update states manually. > *Problems:* > The first and obvious problem is, that the initial loading of a document is > creating and altering new COS structures and we obviously don't want objects > to observe and remember those changes. An object that is created during > document initialization must be treated as preexisting. > However: COSBase is not context aware - it does know it's descendants, but > neither does it know it's parent, nor does it know it's root. > If it was, that actually would present the optimal solution, as in that case > the Object could ask it's root for the current load state and therefore would > be able to ignore said changes caused by the initial loading of a document. > But it is not. (My opinion is - it should be! But more on that later.) > Therefore a a helper named COSUpdateInfoList was implemented, which was > capable of finding COSUpdateInfo objects in a COS structure, and that allowed > resetting their update state after loading was completed. > *Description of the patch:* > The patch implements selfaware COSUpdateInfo objects, which the COSWriter has > been adapted to process. PDFBox therefore is capable of monitoring changes in > realtime and to automatically include altered structures in an incremental > save of the document, therefore creating increments (or an increment), that a > signature would sign. > *Result:* > Using this patch documents could be created: > incrementally adding pages, adding contents to pages, adding annotations, > altering structures, removing structures. > As far as has been initially tested the resulting documents were valid, > viewable in a reader and the objects overwritten in increments seemed correct. > *But -* *Caveat:* > This patch does introduce atleast one ugly class (most likely you will be > able to point out more, that could be optimized :)) and that is > "COSUpdateInfoList" - as already explained: In my opinion such a class should > not exist, the COSUpdateInfo objects should be context aware and should be > capable of regulating their own behaviour. > Whenever the alternatives are to either manage an object externally, or to > "teach" an object to solve problems autonomously, I will tend to prefer the > latter... but I did not dare to do that. > This would require, that either further constructors or setters would have to > be introduced for COSBase objects, that allowed setting parent/root/context > for the object. > Which would result in further massive changes for using applications and > PDFBox itself - as all instantiations of COSBase objects (PDObjects) would > have to be adapted. > However: I would prefer if COSBase objects actually were context aware. > But as stated... I did not dare to touch it and instead chose the ugly > workarround, that would introduce yet another iteration over the whole > COSDocument structure. > Eliminating COSUpdateInfoList would be preferable! > *Suggestion:* > As PDFBox 3 is already changing how documents and objects are handled, I > would suggest, that also COSBase objects should be made context and selfaware > in PDFBox 3. > This would allow simplifying handling COS objects using PDFBox and it would > allow for an easier and automized handling of incremental saving. > *Usage example:* > The following "pseudo code" (actually using simplified Helper classes) > demonstrates the intended usage: > !image-2021-08-23-14-55-24-077.png! > *As always:* Thank you very much for your work and support! I hope this > suggestion is to your liking. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org