[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422579#comment-17422579
 ] 

Christian Appl edited comment on PDFBOX-5263 at 9/30/21, 8:29 AM:
------------------------------------------------------------------

You are right about that.
 Okay here it is: 
-[^PDFBOX-5263_Introduce_state_based_increment_handling__(POC).patch]-

*Edit:*
 Minor improvement concerning "origins":
 *[^PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch]*

*Caveats:*
 - No Javadoc whatsoever only some very basic comments
 - Not final and still in progress
 - It does succeed for the "old" TestCOSUpdateInfoObserver (now named 
TestCOSIncrement) - hence I assume it is creating usefull increments
 - It does fail TestCreateSignature.testPDFBox3811 however (Something is still 
missing, that the observer could do)
 - No guarantees at all, that this actually works :D

*Also:*
 I have thought about your findings concerning 7653 dereference events in a 
small document...
 The observer claimed this:
 !image-2021-09-30-09-33-04-449.png!

I wonder if that claim is correct....
 It again shows up in the new approach in method 
"COSIncrement.collect(COSObject)":
 !image-2021-09-30-09-34-19-175.png!
 Which now actually is checking for the update state of the object, which may 
be better or even worse....
 One thing is obvious: Objects should not be dereferenced, if that is not 
absolutely required.

*Scenarios:*
 The following two scenarios are leading to those assumptions:
 *1. adding pages to existing document:*
 I load a document, I add a new Page to the document. I expect, that the 
document's "pages" array will be dereferenced while doing so - it must not be 
added to the increment for now, as it has not changed. I expect the single 
"page" objects in the "pages" array to not be dereferenced, as I didn´t access 
those.
 I expect the pages array to be updated when I actually add the new page to the 
array.
*Edit:*
In this scenario it is expected, that dereferencing does not occur, as all 
added structures should already be loaded (as they are freshly created).

*2. Copying a page from another document:*
 I access a page from a document and add it to another document. I expect that 
the page and all it's contained objects and resources are being dereferenced 
and added to an increment - I require the fonts, images etc. and the old 
document can not already contain those, therefore the contained objects are 
also "new" and must be expanded.

Is there a flaw in that logic?
 Is there another scenario I am missing and should consider?
 Is the "new" approach actually fullfilling scenario 2? O.o


was (Author: capsvd):
You are right about that.
 Okay here it is: 
-[^PDFBOX-5263_Introduce_state_based_increment_handling__(POC).patch]-

*Edit:*
Minor improvement concerning "origins":
 *[^PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch]*

*Caveats:*
 - No Javadoc whatsoever only some very basic comments
 - Not final and still in progress
 - It does succeed for the "old" TestCOSUpdateInfoObserver (now named 
TestCOSIncrement) - hence I assume it is creating usefull increments
 - It does fail TestCreateSignature.testPDFBox3811 however (Something is still 
missing, that the observer could do)
 - No guarantees at all, that this actually works :D

*Also:*
 I have thought about your findings concerning 7653 dereference events in a 
small document...
 The observer claimed this:
 !image-2021-09-30-09-33-04-449.png!

I wonder if that claim is correct....
 It again shows up in the new approach in method 
"COSIncrement.collect(COSObject)":
 !image-2021-09-30-09-34-19-175.png!
 Which now actually is checking for the update state of the object, which may 
be better or even worse....
 One thing is obvious: Objects should not be dereferenced, if that is not 
absolutely required.

*Scenarios:*
 The following two scenarios are leading to those assumptions:
 *1. adding pages to existing document:*
 I load a document, I add a new Page to the document. I expect, that the 
document's "pages" array will be dereferenced while doing so - it must not be 
added to the increment for now, as it has not changed. I expect the single 
"page" objects in the "pages" array to not be dereferenced, as I didn´t access 
those.
 I expect the pages array to be updated when I actually add the new page to the 
array.

*2. Copying a page from another document:*
 I access a page from a document and add it to another document. I expect that 
the page and all it's contained objects and resources are being dereferenced 
and added to an increment - I require the fonts, images etc. and the old 
document can not already contain those, therefore the contained objects are 
also "new" and must be expanded.

Is there a flaw in that logic?
 Is there another scenario I am missing and should consider?
 Is the "new" approach actually fullfilling scenario 2? O.o

> Suggestion: Signing actual document changes - Enhancing incremental saving
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-5263
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5263
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, PDModel, Writing
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Christian Appl
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>         Attachments: Enhanced_incremental_saving_.patch, 
> Enhanced_incremental_saving_PDFBox3.patch, NoSignatureFound.zip, 
> Observer_based_incremental_saving(09-13-2021).patch, 
> Observer_based_incremental_saving_(09-09-2021-09-09).patch, 
> Observer_based_incremental_saving_(12-00-2021-09-08).patch, 
> Observer_based_incremental_saving_(15-19-2021-09-08).patch, 
> Observer_based_incremental_saving_(17-00-2021-09-07).patch, 
> Observer_based_incremental_saving_(fixed_).patch, 
> Observer_based_incremental_saving_-_still_eroneous.patch, 
> Observer_based_incremental_saving_.patch, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_afterPatch.pdf, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_beforePatch.pdf, 
> PDFBOX-5263_Introduce_COSReferenceInfo.patch, 
> PDFBOX-5263_Introduce_COSReferenceInfo_(LinkedHashMap).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch, 
> Prototype_COSChangeObserver_(code_base).patch, 
> Prototype__Document_and_reference_holder_aware_COSContext_.patch, 
> Updating_context_management_.patch, image-2021-08-23-14-55-24-077.png, 
> image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, 
> image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, 
> image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, 
> image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png, 
> image-2021-09-06-17-06-59-667.png, image-2021-09-07-09-35-31-408.png, 
> image-2021-09-07-13-33-00-161.png, image-2021-09-07-15-40-59-080.png, 
> image-2021-09-08-10-23-44-036.png, image-2021-09-08-11-18-34-211.png, 
> image-2021-09-13-14-40-33-049.png, image-2021-09-13-14-41-13-206.png, 
> image-2021-09-30-09-33-04-449.png, image-2021-09-30-09-34-19-175.png, 
> out.pdf, out2.pdf
>
>
> *TL;DR:*
> Currently it is rather tedious to create incremental changes in between 
> signatures via PDFBox. I attempted to simplify that and wrote a patch.
> This is rather a POC, than an actual suggestion for direct inclusion. (For 
> reasons explained later.)
> *Signatures and incremental PDF documents:*
> A typical reason for wanting to sign a document multiple times (creating an 
> incremental PDF) is , that in between signatures the document changed and the 
> additional signature shall sign the new state of the document.
> If one wanted to implement such incremental changes using PDFBox, he would 
> find, that most of the time made changes are completly ignored, when calling 
> "saveIncremental".
> As documented for the "saveIncremental" methods and especially the matching 
> constructors in "COSWriter", this would require, to identify the "path" of 
> all made changes and one would need to set the "needToBeUpdated" flag of all 
> elements of that path.
> *But:*
> As documented one would have to have exact understanding of what he did and 
> how the PDF standard does implement this, he would have to identify said 
> structures and the more complex the changes were, the more tedious this would 
> become.
> *Also:*
> Because of the implementation of incremental saving in COSWriter, the whole 
> path must be informed that it required an update.
> Resulting in unnecessary large increments, as not all ancestors might 
> actually have changed.
> e.g. If one added an image to a preexisting page of the document - the 
> contentstream, the resources of the page and the page dictionary would have 
> changed. But the "pages" array and all it's ancestors would not have changed 
> a bit, but still would have to be informed and included.
> *Assumptions that lead to this patch:*
> - COSWriter should not stop iterating a COSTree just because a parent element 
> did not change. It's descendants still could have changed!
> - Externally managing an object´s update state is tedious and error-prone.
> Objects that implement "COSUpdateInfo" should know and manage by themselves 
> whether they were freshly created or altered
> (e.g.: A COSDictionary should be able to remember, that a setter had been 
> called).
> - If "COSUpdateInfo" objects were self aware and would solve this by 
> themselves, it would not be necessary anymore to set update states manually.
> *Problems:*
> The first and obvious problem is, that the initial loading of a document is 
> creating and altering new COS structures and we obviously don't want objects 
> to observe and remember those changes. An object that is created during 
> document initialization must be treated as preexisting.
> However: COSBase is not context aware - it does know it's descendants, but 
> neither does it know it's parent, nor does it know it's root.
> If it was, that actually would present the optimal solution, as in that case 
> the Object could ask it's root for the current load state and therefore would 
> be able to ignore said changes caused by the initial loading of a document.
> But it is not. (My opinion is - it should be! But more on that later.)
> Therefore a a helper named COSUpdateInfoList was implemented, which was 
> capable of finding COSUpdateInfo objects in a COS structure, and that allowed 
> resetting their update state after loading was completed.
> *Description of the patch:*
> The patch implements selfaware COSUpdateInfo objects, which the COSWriter has 
> been adapted to process. PDFBox therefore is capable of monitoring changes in 
> realtime and to automatically include altered structures in an incremental 
> save of the document, therefore creating increments (or an increment), that a 
> signature would sign.
> *Result:*
> Using this patch documents could be created:
> incrementally adding pages, adding contents to pages, adding annotations, 
> altering structures, removing structures.
> As far as has been initially tested the resulting documents were valid, 
> viewable in a reader and the objects overwritten in increments seemed correct.
> *But -* *Caveat:*
> This patch does introduce atleast one ugly class (most likely you will be 
> able to point out more, that could be optimized :)) and that is 
> "COSUpdateInfoList" - as already explained: In my opinion such a class should 
> not exist, the COSUpdateInfo objects should be context aware and should be 
> capable of regulating their own behaviour.
> Whenever the alternatives are to either manage an object externally, or to 
> "teach" an object to solve problems autonomously, I will tend to prefer the 
> latter... but I did not dare to do that.
> This would require, that either further constructors or setters would have to 
> be introduced for COSBase objects, that allowed setting parent/root/context 
> for the object.
> Which would result in further massive changes for using applications and 
> PDFBox itself - as all instantiations of COSBase objects (PDObjects) would 
> have to be adapted.
> However: I would prefer if COSBase objects actually were context aware.
> But as stated... I did not dare to touch it and instead chose the ugly 
> workarround, that would introduce yet another iteration over the whole 
> COSDocument structure.
> Eliminating COSUpdateInfoList would be preferable!
> *Suggestion:*
> As PDFBox 3 is already changing how documents and objects are handled, I 
> would suggest, that also COSBase objects should be made context and selfaware 
> in PDFBox 3.
> This would allow simplifying handling COS objects using PDFBox and it would 
> allow for an easier and automized handling of incremental saving.
> *Usage example:*
> The following "pseudo code" (actually using simplified Helper classes) 
> demonstrates the intended usage:
> !image-2021-08-23-14-55-24-077.png!
> *As always:* Thank you very much for your work and support! I hope this 
> suggestion is to your liking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

Reply via email to