[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426044#comment-17426044
 ] 

Christian Appl edited comment on PDFBOX-5263 at 10/8/21, 9:04 AM:
------------------------------------------------------------------

[~tilman] okay, I will fix that.

[~msahyoun] hmmm, I need to take a step back here.

*Concerning "our" goals:* 
 Currently my company is still using 2.0 (obviously as that is the released 
stable version). I developed this for both 3.0 and our local mirror of 2.0 - 
3.0 to contribute here - 2.0 to use it for our own development (at our own 
risk).
 2.0 does not organize COSObjects in this manner, so... you are right, possibly 
there would/should be ways to avoid dereferencing those objects at the end. But 
currently on our side this choice does not even exist. 2.0 always loads all 
objects during the initial parsing. As far as we see it the increment creation 
itself seems to work and the additional loading of unaltered structures would 
happen anyway.
 There is no way for our mirror to reduce that loading time, so this would not 
benefit our current development.

*Disclosure concerning "my" goals:*
 For those reasons - if I would like to contribute further to this, I will have 
to do so privately - which I would like to do.
 Which however comes with the caveat: I will have to work on this in my free 
time, which means from here on out things will have to move at a slower pace.

*Concerning the issue:*
 It is possible, that this is entirely solvable by patching the way the 
COSWriter attempts to prepare incremental saving.
 But I think that I might also have to touch the way "Object dereferencation" 
works - I will try to avoid that at all costs as I don't want to bother you, 
while work on that might still be in progress anyway.

*My thoughts on delayed object parsing*
 *(Nothing I will actively chase - just my train of thought and a bit of 
feedback):*
 Currently the new dereferencation logic has great benefits when one only needs 
reading access to a document. Say - I want to know if a document is signed - I 
want to extract some Metadata - some resource, from a specific page. Not having 
to read the whole document for this is a major benefit. I tip my hat to that.
 I would assume that this currently does not speed up or benefit writing 
operations in any way, as saving incrementally does what you found above and 
saving "normally" will have to access all objects anyway.

It would be great if objects would not need to be derefernced in the end, just 
for writing the document. aka: An object that had not been dereferenced 
previously and for that reason can not have changed - is parsed.
 I would assume: If during writing objects - that were not dereferenced and are 
still contained/must be written - could be copied bitwise (as a raw stream) - 
instead of being parsed - this would be a faster and perfect solution. (If it 
was possible - which I can't know... just guessing here.)

*Sort of back to topic:*
 As far as I can see it, "prepareIncrement" is mostly busy determining the keys 
and their matching actual objects and needs to dereference the COSObject 
wrappers for that reason. What I am asking myself: Why does it have to do that? 
And: Wouldn´t the wrapping COSObject (placeholder) be enough to reach the same 
goal? Don't the COSObjects already know (or could know) their assigned keys?

*Two things concerning keys.*
 - I have not checked yet, whether 3.0 still does that, but 2.0 had a tendency 
to redetermine keys and renumber COSObjects (as far as I remember) - why is 
that? As far as I can remember the specification is assuming keys to be 
constant and never be altered in between saves. This had not been a problem at 
the time I encountered it, but I always wondered why and whether it could be 
prevented.
 - Also: I have seen that COSObjects now have the ability to store their 
assigned keys - I really like that change! I wonder if that is something that 
could help with improving the matter at hand. We shall see.


was (Author: capsvd):
[~tilman] okay, I will fix that.

[~msahyoun] hmmm, I need to take a step back here.

*Concerning "our" goals:* 
Currently my company is still using 2.0 (obviously as that is the released 
stable version). I developed this for both 3.0 and our local mirror of 2.0 - 
3.0 to contribute here - 2.0 to use it for our own development (at our own 
risk).
2.0 does not organize COSObjects in this manner, so... you are right, possibly 
there would/should be ways to avoid dereferencing those objects at the end. But 
currently on our side this choice does not even exist. 2.0 always loads all 
objects during the initial parsing. As far as we see it the increment creation 
itself seems to work and the additional loading of unaltered structures would 
happen anyway.
There is no way for our mirror to reduce that loading time, so this would not 
benefit our current development.

*Disclosure concerning "my" goals:*
For those reasons - if I would like to contribute further to this, I will have 
to do so privately - which I would like to do.
 Which however comes with the caveat: I will have to work on this in my free 
time, which means from here on out things will have to move at a slower pace.

*Concerning the issue:*
It is possible, that this is entirely solvable by patching the way the 
COSWriter attempts to prepare incremental saving.
But I think that I might also have to touch the way "Object dereferencation" 
works - I will try to avoid that at all costs as I don't want to bother you, 
while work on that might still be in progress anyway.

*My thoughts on delayed object parsing*
*(Nothing I will actively chase - just my train of thought and a bit of 
feedback):*
Currently the new dereferencation logic has great benefits when one only needs 
reading access to a document. Say - I want to know if a document is signed - I 
want to extract some Metadata - some resource, from a specific page. Not having 
to read the whole document for this is a major benefit. I tip my hat to that.
 I would assume that this currently does not speed up or benefit writing 
operations in any way, as saving incrementally does what you found above and 
saving "normally" will have to access all objects anyway.

It would be great if objects would not need to be derefernced in the end, just 
for writing the document. aka: An object that had not been dereferenced 
previously and for that reason can not have changed - is parsed.
I would assume: If during writing objects - that were not dereferenced and are 
still contained/must be written - could be copied bitwise - instead of being 
parsed - this would be a faster and perfect solution. (If it was possible - 
which I can't know... just guessing here.)

*Sort of back to topic:*
As far as I can see it, "prepareIncrement" is mostly busy determining the keys 
and their matching actual objects and needs to dereference the COSObject 
wrappers for that reason. What I am asking myself: Why does it have to do that? 
And: Wouldn´t the wrapping COSObject (placeholder) be enough to reach the same 
goal? Don't the COSObjects already know (or could know) their assigned keys?

*Two things concerning keys.*
- I have not checked yet, whether 3.0 still does that, but 2.0 had a tendency 
to redetermine keys and renumber COSObjects (as far as I remember) - why is 
that? As far as I can remember the specification is assuming keys to be 
constant and never be altered in between saves. This had not been a problem at 
the time I encountered it, but I always wondered why and whether it could be 
prevented.
- Also: I have seen that COSObjects now have the ability to store their 
assigned keys - I really like that change! I wonder if that is something that 
could help with improving the matter at hand. We shall see.

> Suggestion: Signing actual document changes - Enhancing incremental saving
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-5263
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5263
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, PDModel, Writing
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Christian Appl
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>         Attachments: Enhanced_incremental_saving_.patch, 
> Enhanced_incremental_saving_PDFBox3.patch, NoSignatureFound.zip, 
> Observer_based_incremental_saving(09-13-2021).patch, 
> Observer_based_incremental_saving_(09-09-2021-09-09).patch, 
> Observer_based_incremental_saving_(12-00-2021-09-08).patch, 
> Observer_based_incremental_saving_(15-19-2021-09-08).patch, 
> Observer_based_incremental_saving_(17-00-2021-09-07).patch, 
> Observer_based_incremental_saving_(fixed_).patch, 
> Observer_based_incremental_saving_-_still_eroneous.patch, 
> Observer_based_incremental_saving_.patch, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_afterPatch.pdf, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_beforePatch.pdf, 
> PDFBOX-5263_Introduce_COSReferenceInfo.patch, 
> PDFBOX-5263_Introduce_COSReferenceInfo_(LinkedHashMap).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling_.patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(simplified).patch, 
> Prototype_COSChangeObserver_(code_base).patch, 
> Prototype__Document_and_reference_holder_aware_COSContext_.patch, 
> Updating_context_management_.patch, image-2021-08-23-14-55-24-077.png, 
> image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, 
> image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, 
> image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, 
> image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png, 
> image-2021-09-06-17-06-59-667.png, image-2021-09-07-09-35-31-408.png, 
> image-2021-09-07-13-33-00-161.png, image-2021-09-07-15-40-59-080.png, 
> image-2021-09-08-10-23-44-036.png, image-2021-09-08-11-18-34-211.png, 
> image-2021-09-13-14-40-33-049.png, image-2021-09-13-14-41-13-206.png, 
> image-2021-09-30-09-33-04-449.png, image-2021-09-30-09-34-19-175.png, 
> image-2021-10-07-13-06-40-576.png, image-2021-10-07-13-38-34-817.png, 
> image-2021-10-07-16-42-48-685.png, image-2021-10-07-16-46-30-004.png, 
> image-2021-10-07-17-46-41-793.png, image-2021-10-07-18-13-05-737.png, 
> out.pdf, out2.pdf, profiling-2021-10-07 16-27-06.png
>
>
> *TL;DR:*
> Currently it is rather tedious to create incremental changes in between 
> signatures via PDFBox. I attempted to simplify that and wrote a patch.
> This is rather a POC, than an actual suggestion for direct inclusion. (For 
> reasons explained later.)
> *Signatures and incremental PDF documents:*
> A typical reason for wanting to sign a document multiple times (creating an 
> incremental PDF) is , that in between signatures the document changed and the 
> additional signature shall sign the new state of the document.
> If one wanted to implement such incremental changes using PDFBox, he would 
> find, that most of the time made changes are completly ignored, when calling 
> "saveIncremental".
> As documented for the "saveIncremental" methods and especially the matching 
> constructors in "COSWriter", this would require, to identify the "path" of 
> all made changes and one would need to set the "needToBeUpdated" flag of all 
> elements of that path.
> *But:*
> As documented one would have to have exact understanding of what he did and 
> how the PDF standard does implement this, he would have to identify said 
> structures and the more complex the changes were, the more tedious this would 
> become.
> *Also:*
> Because of the implementation of incremental saving in COSWriter, the whole 
> path must be informed that it required an update.
> Resulting in unnecessary large increments, as not all ancestors might 
> actually have changed.
> e.g. If one added an image to a preexisting page of the document - the 
> contentstream, the resources of the page and the page dictionary would have 
> changed. But the "pages" array and all it's ancestors would not have changed 
> a bit, but still would have to be informed and included.
> *Assumptions that lead to this patch:*
> - COSWriter should not stop iterating a COSTree just because a parent element 
> did not change. It's descendants still could have changed!
> - Externally managing an object´s update state is tedious and error-prone.
> Objects that implement "COSUpdateInfo" should know and manage by themselves 
> whether they were freshly created or altered
> (e.g.: A COSDictionary should be able to remember, that a setter had been 
> called).
> - If "COSUpdateInfo" objects were self aware and would solve this by 
> themselves, it would not be necessary anymore to set update states manually.
> *Problems:*
> The first and obvious problem is, that the initial loading of a document is 
> creating and altering new COS structures and we obviously don't want objects 
> to observe and remember those changes. An object that is created during 
> document initialization must be treated as preexisting.
> However: COSBase is not context aware - it does know it's descendants, but 
> neither does it know it's parent, nor does it know it's root.
> If it was, that actually would present the optimal solution, as in that case 
> the Object could ask it's root for the current load state and therefore would 
> be able to ignore said changes caused by the initial loading of a document.
> But it is not. (My opinion is - it should be! But more on that later.)
> Therefore a a helper named COSUpdateInfoList was implemented, which was 
> capable of finding COSUpdateInfo objects in a COS structure, and that allowed 
> resetting their update state after loading was completed.
> *Description of the patch:*
> The patch implements selfaware COSUpdateInfo objects, which the COSWriter has 
> been adapted to process. PDFBox therefore is capable of monitoring changes in 
> realtime and to automatically include altered structures in an incremental 
> save of the document, therefore creating increments (or an increment), that a 
> signature would sign.
> *Result:*
> Using this patch documents could be created:
> incrementally adding pages, adding contents to pages, adding annotations, 
> altering structures, removing structures.
> As far as has been initially tested the resulting documents were valid, 
> viewable in a reader and the objects overwritten in increments seemed correct.
> *But -* *Caveat:*
> This patch does introduce atleast one ugly class (most likely you will be 
> able to point out more, that could be optimized :)) and that is 
> "COSUpdateInfoList" - as already explained: In my opinion such a class should 
> not exist, the COSUpdateInfo objects should be context aware and should be 
> capable of regulating their own behaviour.
> Whenever the alternatives are to either manage an object externally, or to 
> "teach" an object to solve problems autonomously, I will tend to prefer the 
> latter... but I did not dare to do that.
> This would require, that either further constructors or setters would have to 
> be introduced for COSBase objects, that allowed setting parent/root/context 
> for the object.
> Which would result in further massive changes for using applications and 
> PDFBox itself - as all instantiations of COSBase objects (PDObjects) would 
> have to be adapted.
> However: I would prefer if COSBase objects actually were context aware.
> But as stated... I did not dare to touch it and instead chose the ugly 
> workarround, that would introduce yet another iteration over the whole 
> COSDocument structure.
> Eliminating COSUpdateInfoList would be preferable!
> *Suggestion:*
> As PDFBox 3 is already changing how documents and objects are handled, I 
> would suggest, that also COSBase objects should be made context and selfaware 
> in PDFBox 3.
> This would allow simplifying handling COS objects using PDFBox and it would 
> allow for an easier and automized handling of incremental saving.
> *Usage example:*
> The following "pseudo code" (actually using simplified Helper classes) 
> demonstrates the intended usage:
> !image-2021-08-23-14-55-24-077.png!
> *As always:* Thank you very much for your work and support! I hope this 
> suggestion is to your liking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to