[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425404#comment-17425404
 ] 

Christian Appl edited comment on PDFBOX-5263 at 10/7/21, 8:20 AM:
------------------------------------------------------------------

*TL;DR: I underlined the most central assumptions for COSIncrement, that 
possibly contain errors or space for improvement.*

*Loading:*
 I would actually expect the load method to be a lot faster, than the saving 
itself, as during loading the only thing happening is setting the update states 
and document context of updatable COS structures.
 If the information was determined during the loading or editing of the 
COSDocument (as it was for the observer) you would see that turned arround.
 This is actually by choice - as this way a "normal" save can ignore all this 
and is not slowed down by the determination of information, that it does not 
require.

If loading time shall be sped up: currently the most costly operation during 
loading is the determination of a COSUpdateInfo´s context 
(COSUpdateState#setOriginDocumentState()). Instead the COSParser could set the 
update states of updatable structures, without evaluating or providing a 
context at all. Which might be a lot of work for possibly not that much of a 
gain (160ms). If one wanted to do that anyway - that method is the place to 
look at and should best be eliminated - as it is the only additional iteration 
during loading, that possibly could be replaced with something else.

*Saving:*
 As suggested this does not do much during loading and postpones the heavy load 
of the actual increment creation to a point in time, when saving incrementally 
is actually required.
 To alter how this behaves +the only class that would have to be changed is 
{color:#172b4d}COSIncrement{color}+. No other class does contain code for the 
actual processing of a COSIncrement. If one wants to speed it up and eliminate 
unnecessary steps, that is the place to go.

*What and Why:*
 Originally a user marked a path of the document "manually" as updated starting 
at the trailer object of a document. The COSWriter would only detect and update 
objects if it could iterate that path, always finding the next updated child, 
until the last updated leaf is reached.
 Which leads to the inclusion of path elements, that had not actually changed, 
but still required inclusion in an increment, as otherwise the actually updated 
node would not have been found during writing.

*Identifying updated nodes:*
 {color:#172b4d}- ... Hence: +The first (and most costly) assumption of 
COSIncrement is: The whole loaded structure must be iterated to find actually 
updated and possibly isolated nodes that can not rely on their whole path being 
updated and may be contained in substructures, that have neither changed, nor 
have been informed to be updated. Only those updated nodes shall be part of an 
increment, which is avoiding to add their unaltered ancestry to an increment. 
The iteration shall start at a given node (in this case - always the documents 
trailer object).+{color}

The observer attempted to collect such nodes in the moment they were being 
updated - which would be an alternative and would avoid the iteration of the 
whole tree, as this would avoid searching for nodes altogether. But as it 
turned out, that had some even worse issues.

*Searching the tree:*
 +{color:#172b4d}- COSObject, COSArray and COSDictionary nodes are updatable - 
find those, check those, possibly include those in an 
increment.{color}+{color:#172b4d}Starting at a given node COSIncrement will 
iterate the children of such nodes and will only process instances of those 
three classes, using the matching collect(something) method.
 
 - If a COSDictionary is encountered and has been updated and it is not flagged 
as a "direct" node, it is added to the increment. If one of the children of a 
COSDictionary is a direct node and updated, it is updating the containing 
COSDictionary instead (possibly causing the inclusion of the parent{color} in 
an increment), also such direct nodes are excluded from the increment. +It 
shall be assumed, that COSDictionaries could be added as top level objects.+
 - If a COSArray is encountered it is by default assuming to be written 
directly (only exception - if it had already been contained in a COSObject.).

- +The updatable children of COSDictionaries and COSArray shall be iterated, as 
they could be updated or contain updated descendants themselves.+

- +If a COSObject is encountered and the COSObject wrapper itself is marked as 
having been updated, the object is actively dereferenced. Otherwise COSObjects 
that have not already been dereferenced are skipped.+
 - +COSObjects that shall be updated must be dereferenced.+
 - COSObjects that have been dereferenced shall be iterated, as they could be 
updated or contain updated descendants themselves.

+*Collecting results, avoiding repetitions:*+
 - Results are collected in an "objects" field (LinkedHashSet) - ensuring the 
results to be well ordered and hopefully increasing the speed of "contains" 
checks.+
 - +Excluded objects, that are either direct and shall update their parent 
instead or are otherwise not fit for inclusion in an increment are collected in 
"excluded" (HashSet)+
 - +COSObjects are collected in "processedObjects", as it shall be avoided to 
process COSObjects redundantly.+

 - +No object shall be included in an increment redundantly, hence the 
collections explained above and the used "contains" checks are required.+

*Optimization:*
 - Possibly the best point to start optimizations would be to rethink the 
collections, that are used by COSIncrement. All of them have a purpose and none 
of them are entirely superfluous, but possibly the information stored within 
them could be reorganized more efficiently? Possbily they could be eliminated, 
if the iteration would work differently?
 - The optimal solution would be to *not* iterate the tree at all, but to know 
which objects were updated instead. Collecting updated objects on the go 
requires even more contextual information for the individual updatable node... 
determining those seems to be even more costly - hence I gave up on that 
thought.

And that actually is all that COSIncrement really does.


was (Author: capsvd):
*TL;DR: I underlined the most central assumptions for COSIncrement, that 
possibly contain errors or space for improvement.*

*Loading:*
I would actually expect the load method to be a lot faster, than the saving 
itself, as during loading the only thing happening is setting the update states 
and document context of updatable COS structures.
If the information was determined during the loading or editing of the 
COSDocument (as it was for the observer) you would see that turned arround.
This is actually by choice - as this way a "normal" save can ignore all this 
and is not slowed down by the determination of information, that it does not 
require.

If loading time shall be sped up: currently the most costly operation during 
loading is the determination of a COSUpdateInfo´s context 
(COSUpdateState#setOriginDocumentState()). Instead the COSParser could set the 
update states of updatable structures, without evaluating or providing a 
context at all. Which might be a lot of work for possibly not that much of a 
gain (160ms). If one wanted to do that anyway - that method is the place to 
look at and should best be eliminated - as it is the only additional iteration 
during loading, that possibly could be replaced with something else.

*Saving:*
As suggested this does not do much during loading and postpones the heavy load 
of the actual increment creation to a point in time, when saving incrementally 
is actually required.
To alter how this behaves +the only class that would have to be changed is 
{color:#172b4d}COSIncrement{color}+. No other class does contain code for the 
actual processing of a COSIncrement. If one wants to speed it up and eliminate 
unnecessary steps, that is the place to go.

*What and Why:*
Originally a user marked a path of the document "manually" as updated starting 
at the trailer object of a document. The COSWriter would only detect and update 
objects if it could iterate that path, always finding the next updated child, 
until the last updated leaf is reached.
Which leads to the inclusion of path elements, that had not actually changed, 
but still required inclusion in an increment, as otherwise the actually updated 
node would not have been found during writing.

*Identifying updated nodes:*
{color:#172b4d}- ... Hence: +The first (and most costly) assumption of 
COSIncrement is: The whole loaded structure must be iterated to find actually 
updated and possibly isolated nodes that can not rely on their whole path being 
updated and may be contained in substructures, that have neither changed, nor 
have been informed to be updated. Only those updated nodes shall be part of an 
increment, which is avoiding to add their unaltered ancestry to an increment. 
The iteration shall start at a given node (in this case - always the documents 
trailer object).+{color}

 The observer attempted to collect such nodes in the moment they were being 
updated - which would be an alternative and would avoid the iteration of the 
whole tree, as this would avoid searching for nodes altogether. But as it 
turned out, that had some even worse issues.

*Searching the tree:*
+{color:#172b4d}- COSObject, COSArray and COSDictionary nodes are updatable - 
find those, check those, possibly include those in an increment.
{color}+{color:#172b4d}Starting at a given node COSIncrement will iterate the 
children of such nodes and will only process instances of those three classes, 
using the matching collect(something) method.

- If a COSDictionary is encountered and has been updated and it is not flagged 
as a "direct" node, it is added to the increment. If one of the children of a 
COSDictionary is a direct node and updated, it is updating the containing 
COSDictionary instead (possibly causing the inclusion of the parent{color} in 
an increment), also such direct nodes are excluded from the increment. +It 
shall be assumed, that COSDictionaries could be added as top level objects.+
- If a COSArray is encountered it is by default assuming to be written directly 
(only exception - if it had already been contained in a COSObject.).

+- The updatable children of COSDictionaries and COSArray shall be iterated, as 
they could be updated or contain updated descendants themselves.

+- If a COSObject is encountered and the COSObject wrapper itself is marked as 
having been updated, the object is actively dereferenced. Otherwise COSObjects 
that have not already been dereferenced are skipped.
+- COSObjects that shall be updated must be dereferenced.
- COSObjects that have been dereferenced shall be iterated, as they could be 
updated or contain updated descendants themselves.

+*Collecting results, avoiding repetitions:*+
- Results are collected in an "objects" field (LinkedHashSet) - ensuring the 
results to be well ordered and hopefully increasing the speed of "contains" 
checks.+
+- Excluded objects, that are either direct and shall update their parent 
instead or are otherwise not fit for inclusion in an increment are collected in 
"excluded" (HashSet)+
+- COSObjects are collected in "processedObjects", as it shall be avoided to 
process COSObjects redundantly.

- No object shall be included in an increment redundantly, hence the 
collections explained above and the used "contains" checks are required.+

*Optimization:*
- Possibly the best point to start optimizations would be to rethink the 
collections, that are used by COSIncrement. All of them have a purpose and none 
of them are entirely superfluous, but possibly the information stored within 
them could be reorganized more efficiently? Possbily they could be eliminated, 
if the iteration would work differently?
- The optimal solution would be to *not* iterate the tree at all, but to know 
which objects were updated instead. Collecting updated objects on the go 
requires even more contextual information for the individual updatable node... 
determining those seems to be even more costly - hence I gave up on that 
thought.

And that actually is all that COSIncrement really does.

> Suggestion: Signing actual document changes - Enhancing incremental saving
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-5263
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5263
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, PDModel, Writing
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Christian Appl
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>         Attachments: Enhanced_incremental_saving_.patch, 
> Enhanced_incremental_saving_PDFBox3.patch, NoSignatureFound.zip, 
> Observer_based_incremental_saving(09-13-2021).patch, 
> Observer_based_incremental_saving_(09-09-2021-09-09).patch, 
> Observer_based_incremental_saving_(12-00-2021-09-08).patch, 
> Observer_based_incremental_saving_(15-19-2021-09-08).patch, 
> Observer_based_incremental_saving_(17-00-2021-09-07).patch, 
> Observer_based_incremental_saving_(fixed_).patch, 
> Observer_based_incremental_saving_-_still_eroneous.patch, 
> Observer_based_incremental_saving_.patch, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_afterPatch.pdf, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_beforePatch.pdf, 
> PDFBOX-5263_Introduce_COSReferenceInfo.patch, 
> PDFBOX-5263_Introduce_COSReferenceInfo_(LinkedHashMap).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling_.patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(simplified).patch, 
> Prototype_COSChangeObserver_(code_base).patch, 
> Prototype__Document_and_reference_holder_aware_COSContext_.patch, 
> Updating_context_management_.patch, image-2021-08-23-14-55-24-077.png, 
> image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, 
> image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, 
> image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, 
> image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png, 
> image-2021-09-06-17-06-59-667.png, image-2021-09-07-09-35-31-408.png, 
> image-2021-09-07-13-33-00-161.png, image-2021-09-07-15-40-59-080.png, 
> image-2021-09-08-10-23-44-036.png, image-2021-09-08-11-18-34-211.png, 
> image-2021-09-13-14-40-33-049.png, image-2021-09-13-14-41-13-206.png, 
> image-2021-09-30-09-33-04-449.png, image-2021-09-30-09-34-19-175.png, 
> out.pdf, out2.pdf
>
>
> *TL;DR:*
> Currently it is rather tedious to create incremental changes in between 
> signatures via PDFBox. I attempted to simplify that and wrote a patch.
> This is rather a POC, than an actual suggestion for direct inclusion. (For 
> reasons explained later.)
> *Signatures and incremental PDF documents:*
> A typical reason for wanting to sign a document multiple times (creating an 
> incremental PDF) is , that in between signatures the document changed and the 
> additional signature shall sign the new state of the document.
> If one wanted to implement such incremental changes using PDFBox, he would 
> find, that most of the time made changes are completly ignored, when calling 
> "saveIncremental".
> As documented for the "saveIncremental" methods and especially the matching 
> constructors in "COSWriter", this would require, to identify the "path" of 
> all made changes and one would need to set the "needToBeUpdated" flag of all 
> elements of that path.
> *But:*
> As documented one would have to have exact understanding of what he did and 
> how the PDF standard does implement this, he would have to identify said 
> structures and the more complex the changes were, the more tedious this would 
> become.
> *Also:*
> Because of the implementation of incremental saving in COSWriter, the whole 
> path must be informed that it required an update.
> Resulting in unnecessary large increments, as not all ancestors might 
> actually have changed.
> e.g. If one added an image to a preexisting page of the document - the 
> contentstream, the resources of the page and the page dictionary would have 
> changed. But the "pages" array and all it's ancestors would not have changed 
> a bit, but still would have to be informed and included.
> *Assumptions that lead to this patch:*
> - COSWriter should not stop iterating a COSTree just because a parent element 
> did not change. It's descendants still could have changed!
> - Externally managing an object´s update state is tedious and error-prone.
> Objects that implement "COSUpdateInfo" should know and manage by themselves 
> whether they were freshly created or altered
> (e.g.: A COSDictionary should be able to remember, that a setter had been 
> called).
> - If "COSUpdateInfo" objects were self aware and would solve this by 
> themselves, it would not be necessary anymore to set update states manually.
> *Problems:*
> The first and obvious problem is, that the initial loading of a document is 
> creating and altering new COS structures and we obviously don't want objects 
> to observe and remember those changes. An object that is created during 
> document initialization must be treated as preexisting.
> However: COSBase is not context aware - it does know it's descendants, but 
> neither does it know it's parent, nor does it know it's root.
> If it was, that actually would present the optimal solution, as in that case 
> the Object could ask it's root for the current load state and therefore would 
> be able to ignore said changes caused by the initial loading of a document.
> But it is not. (My opinion is - it should be! But more on that later.)
> Therefore a a helper named COSUpdateInfoList was implemented, which was 
> capable of finding COSUpdateInfo objects in a COS structure, and that allowed 
> resetting their update state after loading was completed.
> *Description of the patch:*
> The patch implements selfaware COSUpdateInfo objects, which the COSWriter has 
> been adapted to process. PDFBox therefore is capable of monitoring changes in 
> realtime and to automatically include altered structures in an incremental 
> save of the document, therefore creating increments (or an increment), that a 
> signature would sign.
> *Result:*
> Using this patch documents could be created:
> incrementally adding pages, adding contents to pages, adding annotations, 
> altering structures, removing structures.
> As far as has been initially tested the resulting documents were valid, 
> viewable in a reader and the objects overwritten in increments seemed correct.
> *But -* *Caveat:*
> This patch does introduce atleast one ugly class (most likely you will be 
> able to point out more, that could be optimized :)) and that is 
> "COSUpdateInfoList" - as already explained: In my opinion such a class should 
> not exist, the COSUpdateInfo objects should be context aware and should be 
> capable of regulating their own behaviour.
> Whenever the alternatives are to either manage an object externally, or to 
> "teach" an object to solve problems autonomously, I will tend to prefer the 
> latter... but I did not dare to do that.
> This would require, that either further constructors or setters would have to 
> be introduced for COSBase objects, that allowed setting parent/root/context 
> for the object.
> Which would result in further massive changes for using applications and 
> PDFBox itself - as all instantiations of COSBase objects (PDObjects) would 
> have to be adapted.
> However: I would prefer if COSBase objects actually were context aware.
> But as stated... I did not dare to touch it and instead chose the ugly 
> workarround, that would introduce yet another iteration over the whole 
> COSDocument structure.
> Eliminating COSUpdateInfoList would be preferable!
> *Suggestion:*
> As PDFBox 3 is already changing how documents and objects are handled, I 
> would suggest, that also COSBase objects should be made context and selfaware 
> in PDFBox 3.
> This would allow simplifying handling COS objects using PDFBox and it would 
> allow for an easier and automized handling of incremental saving.
> *Usage example:*
> The following "pseudo code" (actually using simplified Helper classes) 
> demonstrates the intended usage:
> !image-2021-08-23-14-55-24-077.png!
> *As always:* Thank you very much for your work and support! I hope this 
> suggestion is to your liking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

Reply via email to