[GitHub] [pdfbox] THausherr commented on pull request #107: potential memory leaks and small performance improvements

2021-10-07 Thread GitBox


THausherr commented on pull request #107:
URL: https://github.com/apache/pdfbox/pull/107#issuecomment-938320827


   > PDSignature.getByteRange can return an empty array, but the getContents(...)
   > methods have no checks. The comment on getContents() says "@throws
   > IOException if the pdfFile can't be read", but there is no "@throws
   > IndexOutOfBoundsException if ...".
   
   I've added a "@throws".
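
   The kind of check the quoted comment asks about could be sketched as below. This is a minimal, self-contained illustration and not PDFBox's code: `ByteRangeGuard` and `contentsSpan` are hypothetical names, and the arithmetic assumes the usual four-entry /ByteRange layout [offset1, length1, offset2, length2], where the signature contents (including their delimiters) sit in the gap between the two signed ranges.

```java
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox: validates a /ByteRange array
// before any index is read from it.
final class ByteRangeGuard {

    /**
     * Returns {begin, length} of the gap between the two signed ranges.
     *
     * @throws IllegalArgumentException if byteRange is not four non-negative
     *         entries or the second range starts before the first one ends
     */
    static int[] contentsSpan(int[] byteRange) {
        if (byteRange == null || byteRange.length != 4) {
            throw new IllegalArgumentException(
                    "ByteRange must have 4 entries, got " + Arrays.toString(byteRange));
        }
        for (int value : byteRange) {
            if (value < 0) {
                throw new IllegalArgumentException("negative ByteRange entry: " + value);
            }
        }
        int begin = byteRange[0] + byteRange[1];   // end of the first signed range
        int length = byteRange[2] - begin;         // bytes up to the second signed range
        if (length < 0) {
            throw new IllegalArgumentException("second range overlaps the first");
        }
        return new int[] { begin, length };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(contentsSpan(new int[] { 0, 100, 200, 50 })));
        try {
            contentsSpan(new int[0]);              // the empty-array case from the comment
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

   Whether to fail fast like this or only document the exception (as was done here) is a design choice; the sketch merely shows what an explicit check would look like.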


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Created] (PDFBOX-5291) Need API for the following PDF conversions

2021-10-07 Thread Joseph (Jira)
Joseph created PDFBOX-5291:
--

 Summary: Need API for the following PDF conversions
 Key: PDFBOX-5291
 URL: https://issues.apache.org/jira/browse/PDFBOX-5291
 Project: PDFBox
  Issue Type: New Feature
Reporter: Joseph


Hi,

It would be great if someone could develop and publish an API for the following 
PDF conversions, so that anyone can integrate it or write a Java Swing program 
(or any other program) to run them and get the appropriate result. Currently 
this looks very difficult. A general implementation would help many users.

#1 - Twin Booklet to Booklet
#2 - Twin Booklet to A4
#3 - A4 to Twin Booklet
#4 - Booklet to Twin Booklet

Thanks

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[GitHub] [pdfbox] valerybokov commented on pull request #107: potential memory leaks and small performance improvements

2021-10-07 Thread GitBox


valerybokov commented on pull request #107:
URL: https://github.com/apache/pdfbox/pull/107#issuecomment-937549360


   PDSignature.getByteRange can return an empty array, but the getContents(...) 
methods have no checks. The comment on getContents() says "@throws 
IOException if the pdfFile can't be read", but there is no "@throws 
IndexOutOfBoundsException if ...".





Jenkins build became unstable: PDFBox » PDFBox-Trunk-jdk17 #396

2021-10-07 Thread Apache Jenkins Server
See 






Jenkins build became unstable: PDFBox » PDFBox-Trunk-jdk17 » Apache PDFBox examples #396

2021-10-07 Thread Apache Jenkins Server
See 






[jira] [Commented] (PDFBOX-4892) Improve code quality (4)

2021-10-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425717#comment-17425717
 ] 

ASF subversion and git services commented on PDFBOX-4892:
-

Commit 1894001 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1894001 ]

PDFBOX-4892: improve javadoc, as suggested by valerybokov

> Improve code quality (4)
> 
>
> Key: PDFBOX-4892
> URL: https://issues.apache.org/jira/browse/PDFBOX-4892
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.20
>Reporter: Tilman Hausherr
>Priority: Minor
>
> This is a long-term issue for the task of improving code quality, using the 
> [SonarQube report|https://sonarcloud.io/project/issues?id=pdfbox-reactor], 
> hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-4071, which was getting too long.






[jira] [Commented] (PDFBOX-4892) Improve code quality (4)

2021-10-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425716#comment-17425716
 ] 

ASF subversion and git services commented on PDFBOX-4892:
-

Commit 1894000 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1894000 ]

PDFBOX-4892: improve javadoc, as suggested by valerybokov







[jira] [Comment Edited] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-07 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425697#comment-17425697
 ] 

Tilman Hausherr edited comment on PDFBOX-5290 at 10/7/21, 5:11 PM:
---

No, they're the same (or rather, based on the same subprojects; the app is a 
merge of several jars). Please try a clean build and remove all old versions 
from the classpath, i.e. look in the directories to see what's there. If it 
still happens, please share the stack trace.


was (Author: tilman):
No, they're the same. Please try a clean build and remove all old versions from 
the classpath, i.e. look in the directories to see what's there. If it still 
happens, please share the stack trace.
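
One way to check for leftover versions on the classpath is to ask the class loader where a given class can be found. The following is a rough sketch using only the JDK: `ClasspathProbe`, `toResourceName` and `locate` are hypothetical names, and the default class name is just the PDFBox class used in the report. More than one location in the output would suggest that several jars provide the same class.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper: lists every classpath location that
// provides a given class, to spot duplicate or stale jars.
final class ClasspathProbe {

    /** Converts a class name such as "a.b.C" to the resource path "a/b/C.class". */
    static String toResourceName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns all locations from which the class file could be loaded. */
    static List<URL> locate(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(loader.getResources(toResourceName(className)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.pdfbox.pdmodel.PDDocument";
        List<URL> hits = locate(name);
        System.out.println(name + " found in " + hits.size() + " location(s)");
        for (URL url : hits) {
            System.out.println("  " + url);
        }
    }
}
```

Run with the same classpath as the failing application; the result only reflects the class loader it is run under.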

> ClassCastException during Text Extraction
> -
>
> Key: PDFBOX-5290
> URL: https://issues.apache.org/jira/browse/PDFBOX-5290
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.20, 2.0.24
>Reporter: Eric R Manzitti
>Priority: Major
> Attachments: newBroke.pdf, newBroke.txt
>
>
> I am getting: 
>  
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be 
> cast to org.apache.pdfbox.cos.COSArray
> When executing the following code:
>  
> public byte[] extractTextPDFBox(String fileNamePath) throws PQException {
>     String UTF_8 = "UTF-8";
>     PDFLibraryProperties pdfLibraryProperties = PDFLibraryProperties.getInstance();
>     String regex = pdfLibraryProperties.getAsString(
>             PDFLibraryConstants.REGEX_TO_REMOVE_FROM_EXTRACTED_TEXT);
>     byte[] bytesToReturn;
>     // try-with-resources closes the stream and the document even if getText() throws
>     try (FileInputStream fis = new FileInputStream(new File(fileNamePath));
>          PDDocument pdfDoc = PDDocument.load(fis)) {
>         PDFTextStripper pdfStripper = new PDFTextStripper();
>         String textFromPDF = pdfStripper.getText(pdfDoc);
>         // strip the configured pattern, then encode once as UTF-8
>         String textStr = textFromPDF.replaceAll(regex, PDFLibraryConstants.BLANK_SPACE);
>         bytesToReturn = textStr.getBytes(UTF_8);
>     } catch (IOException e) {
>         pqUtilityLogger.logError(e.getMessage());
>         throw new PQException(e.getMessage());
>     }
>     return bytesToReturn;
> }
>  
> It dies on String textFromPDF = pdfStripper.getText(pdfDoc);
>  






[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-07 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425697#comment-17425697
 ] 

Tilman Hausherr commented on PDFBOX-5290:
-

No, they're the same. Please try a clean build and remove all old versions from 
the classpath, i.e. look in the directories to see what's there. If it still 
happens, please share the stack trace.







[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425691#comment-17425691
 ] 

Tilman Hausherr commented on PDFBOX-5263:
-

This assignment
{code}
if(object.isDereferenced() && (reference = object.getObject()) instanceof COSUpdateInfo)
{code}
should be fixed; it is confusing. The only place where this type of assignment 
makes sense is code like {{while ((c = read()) != -1)}}, because everybody does 
it.
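
One possible way to untangle such a condition is to move the assignment onto its own line. The sketch below is self-contained and uses hypothetical stand-in types (`Holder`, `UpdateInfo`, `AssignmentStyle`), not the PDFBox classes from the patch; it only illustrates the shape of the rewrite and keeps the property that `getObject()` is not called unless the object is already dereferenced.

```java
// Hypothetical stand-ins for the types named in the snippet; not the PDFBox classes.
interface UpdateInfo { }

final class Holder {
    private final Object target;   // null models "not yet dereferenced"

    Holder(Object target) {
        this.target = target;
    }

    boolean isDereferenced() {
        return target != null;
    }

    Object getObject() {
        return target;
    }
}

final class AssignmentStyle {

    /** Returns the held object if it is already loaded and is an UpdateInfo, else null. */
    static UpdateInfo updateInfoOf(Holder object) {
        if (!object.isDereferenced()) {
            return null;                          // getObject() is never called here
        }
        Object reference = object.getObject();    // assignment on its own line
        return reference instanceof UpdateInfo ? (UpdateInfo) reference : null;
    }

    public static void main(String[] args) {
        UpdateInfo info = new UpdateInfo() { };
        System.out.println(updateInfoOf(new Holder(info)) == info);
        System.out.println(updateInfoOf(new Holder("plain string")) == null);
    }
}
```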

> Suggestion: Signing actual document changes - Enhancing incremental saving
> --
>
> Key: PDFBOX-5263
> URL: https://issues.apache.org/jira/browse/PDFBOX-5263
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing, PDModel, Writing
>Affects Versions: 3.0.0 PDFBox
>Reporter: Christian Appl
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: Enhanced_incremental_saving_.patch, 
> Enhanced_incremental_saving_PDFBox3.patch, NoSignatureFound.zip, 
> Observer_based_incremental_saving(09-13-2021).patch, 
> Observer_based_incremental_saving_(09-09-2021-09-09).patch, 
> Observer_based_incremental_saving_(12-00-2021-09-08).patch, 
> Observer_based_incremental_saving_(15-19-2021-09-08).patch, 
> Observer_based_incremental_saving_(17-00-2021-09-07).patch, 
> Observer_based_incremental_saving_(fixed_).patch, 
> Observer_based_incremental_saving_-_still_eroneous.patch, 
> Observer_based_incremental_saving_.patch, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_afterPatch.pdf, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_beforePatch.pdf, 
> PDFBOX-5263_Introduce_COSReferenceInfo.patch, 
> PDFBOX-5263_Introduce_COSReferenceInfo_(LinkedHashMap).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling_.patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(simplified).patch, 
> Prototype_COSChangeObserver_(code_base).patch, 
> Prototype__Document_and_reference_holder_aware_COSContext_.patch, 
> Updating_context_management_.patch, image-2021-08-23-14-55-24-077.png, 
> image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, 
> image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, 
> image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, 
> image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png, 
> image-2021-09-06-17-06-59-667.png, image-2021-09-07-09-35-31-408.png, 
> image-2021-09-07-13-33-00-161.png, image-2021-09-07-15-40-59-080.png, 
> image-2021-09-08-10-23-44-036.png, image-2021-09-08-11-18-34-211.png, 
> image-2021-09-13-14-40-33-049.png, image-2021-09-13-14-41-13-206.png, 
> image-2021-09-30-09-33-04-449.png, image-2021-09-30-09-34-19-175.png, 
> image-2021-10-07-13-06-40-576.png, image-2021-10-07-13-38-34-817.png, 
> image-2021-10-07-16-42-48-685.png, image-2021-10-07-16-46-30-004.png, 
> image-2021-10-07-17-46-41-793.png, image-2021-10-07-18-13-05-737.png, 
> out.pdf, out2.pdf, profiling-2021-10-07 16-27-06.png
>
>
> *TL;DR:*
> Currently it is rather tedious to create incremental changes in between 
> signatures via PDFBox. I attempted to simplify that and wrote a patch.
> This is a POC rather than an actual suggestion for direct inclusion. (For 
> reasons explained later.)
> *Signatures and incremental PDF documents:*
> A typical reason for wanting to sign a document multiple times (creating an 
> incremental PDF) is that the document changed in between signatures and the 
> additional signature shall sign the new state of the document.
> If one wanted to implement such incremental changes using PDFBox, one would 
> find that most of the time the changes made are completely ignored when 
> calling "saveIncremental".
> As documented for the "saveIncremental" methods and especially the matching 
> constructors in "COSWriter", this would require identifying the "path" of 
> all changes made and setting the "needToBeUpdated" flag of all 
> elements of that path.
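
The path marking described in the paragraph above can be pictured with a small model. This is only a sketch under stated assumptions: `TreeNode` and `UpdatePath` are hypothetical stand-ins for an object tree, not PDFBox's COS classes, and the boolean field simply mirrors the "needToBeUpdated" flag named in the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a node in the document's object tree; not a PDFBox class.
final class TreeNode {
    final String name;
    final TreeNode parent;        // null for the root
    boolean needToBeUpdated;      // mirrors the flag named in the description

    TreeNode(String name, TreeNode parent) {
        this.name = name;
        this.parent = parent;
    }
}

final class UpdatePath {

    /** Flags the changed node and every ancestor; returns the flagged names, root first. */
    static List<String> markPath(TreeNode changed) {
        List<String> flagged = new ArrayList<>();
        for (TreeNode node = changed; node != null; node = node.parent) {
            node.needToBeUpdated = true;
            flagged.add(0, node.name);
        }
        return flagged;
    }

    public static void main(String[] args) {
        TreeNode catalog = new TreeNode("catalog", null);
        TreeNode pages = new TreeNode("pages", catalog);
        TreeNode page = new TreeNode("page", pages);
        TreeNode resources = new TreeNode("resources", page);
        // Only "resources" changed, yet all of its ancestors end up flagged.
        System.out.println(markPath(resources));
    }
}
```

In this model every ancestor is flagged regardless of whether its own content changed.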
> *But:*
> As documented, one would need an exact understanding of what one did and 
> how the PDF standard implements this, and one would have to identify said 
> structures; the more complex the changes were, the more tedious this would 
> become.
> *Also:*
> Because of the implementation of incremental saving in COSWriter, the whole 
> path must be informed that it requires an update.
> This results in unnecessarily large increments, as not all ancestors might 
> actually have changed.
> E.g. if one added an image to a preexisting page of the document, the 
> content stream, the resources of the page and the page dictionary would have 
> changed. The "pages" array and all its ancestors would not have changed 
> a bit, but would still have to be informed and i

[jira] [Comment Edited] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425681#comment-17425681
 ] 

Christian Appl edited comment on PDFBOX-5263 at 10/7/21, 4:41 PM:
--

[~msahyoun] forget about everything I said - you are right, I am wrong, and I 
obviously fail to understand simple diagrams. COSIncrement has nothing to do 
with this, and of course prepareIncrement is the one dereferencing objects 
here...

(Even though: nice... COSIncrement does not cause objects to be dereferenced 
unnecessarily - so at least that worked.)

Sorry.
Will have a closer look at prepareIncrement now.


I will have to have another look at that tomorrow, with a clear head, but I 
would tend to agree: this dereferences all objects known to a document, and 
it cannot be removed; it is central for the COSWriter to work that this 
happens. But what is the point of delaying the parsing of COSObjects at all 
if that is the case?


was (Author: capsvd):
[~msahyoun] forget about everything I said - you are right, I am wrong, and I 
obviously fail to understand simple diagrams. COSIncrement has nothing to do 
with this, and of course prepareIncrement is the one dereferencing objects 
here...

(Even though: nice... COSIncrement does not cause objects to be dereferenced 
unnecessarily - so at least that worked.)

sorry
Will have a closer look at prepareIncrement now.


[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425684#comment-17425684
 ] 

Maruan Sahyoun commented on PDFBOX-5263:


That's the nice thing about having two people looking at the code. And I'm also 
learning about that part of PDFBox - I'm sure we will be getting there. And most 
of the actual work is on you.

> Suggestion: Signing actual document changes - Enhancing incremental saving
> --
>
> Key: PDFBOX-5263
> URL: https://issues.apache.org/jira/browse/PDFBOX-5263
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing, PDModel, Writing
>Affects Versions: 3.0.0 PDFBox
>Reporter: Christian Appl
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: Enhanced_incremental_saving_.patch, 
> Enhanced_incremental_saving_PDFBox3.patch, NoSignatureFound.zip, 
> Observer_based_incremental_saving(09-13-2021).patch, 
> Observer_based_incremental_saving_(09-09-2021-09-09).patch, 
> Observer_based_incremental_saving_(12-00-2021-09-08).patch, 
> Observer_based_incremental_saving_(15-19-2021-09-08).patch, 
> Observer_based_incremental_saving_(17-00-2021-09-07).patch, 
> Observer_based_incremental_saving_(fixed_).patch, 
> Observer_based_incremental_saving_-_still_eroneous.patch, 
> Observer_based_incremental_saving_.patch, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_afterPatch.pdf, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_beforePatch.pdf, 
> PDFBOX-5263_Introduce_COSReferenceInfo.patch, 
> PDFBOX-5263_Introduce_COSReferenceInfo_(LinkedHashMap).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling_.patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(simplified).patch, 
> Prototype_COSChangeObserver_(code_base).patch, 
> Prototype__Document_and_reference_holder_aware_COSContext_.patch, 
> Updating_context_management_.patch, image-2021-08-23-14-55-24-077.png, 
> image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, 
> image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, 
> image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, 
> image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png, 
> image-2021-09-06-17-06-59-667.png, image-2021-09-07-09-35-31-408.png, 
> image-2021-09-07-13-33-00-161.png, image-2021-09-07-15-40-59-080.png, 
> image-2021-09-08-10-23-44-036.png, image-2021-09-08-11-18-34-211.png, 
> image-2021-09-13-14-40-33-049.png, image-2021-09-13-14-41-13-206.png, 
> image-2021-09-30-09-33-04-449.png, image-2021-09-30-09-34-19-175.png, 
> image-2021-10-07-13-06-40-576.png, image-2021-10-07-13-38-34-817.png, 
> image-2021-10-07-16-42-48-685.png, image-2021-10-07-16-46-30-004.png, 
> image-2021-10-07-17-46-41-793.png, image-2021-10-07-18-13-05-737.png, 
> out.pdf, out2.pdf, profiling-2021-10-07 16-27-06.png
>
>
> *TL;DR:*
> Currently it is rather tedious to create incremental changes in between 
> signatures via PDFBox. I attempted to simplify that and wrote a patch.
> This is a POC rather than an actual suggestion for direct inclusion. (For 
> reasons explained later.)
> *Signatures and incremental PDF documents:*
> A typical reason for wanting to sign a document multiple times (creating an 
> incremental PDF) is that the document changed in between signatures and the 
> additional signature shall sign the new state of the document.
> If one wanted to implement such incremental changes using PDFBox, one would 
> find that most of the time the changes made are completely ignored when 
> calling "saveIncremental".
> As documented for the "saveIncremental" methods and especially the matching 
> constructors in "COSWriter", this would require identifying the "path" of 
> all changes made and setting the "needToBeUpdated" flag of all 
> elements of that path.
> *But:*
> As documented, one would need an exact understanding of what one did and 
> how the PDF standard implements this, and one would have to identify said 
> structures; the more complex the changes were, the more tedious this would 
> become.
> *Also:*
> Because of the implementation of incremental saving in COSWriter, the whole 
> path must be informed that it requires an update.
> This results in unnecessarily large increments, as not all ancestors might 
> actually have changed.
> E.g. if one added an image to a preexisting page of the document, the 
> content stream, the resources of the page and the page dictionary would have 
> changed. The "pages" array and all its ancestors would not have changed 
> a bit, but would still have to be informed and included.
> *Assumptions that lead to this patch:*
> - COSWriter should not stop iterating a COSTree just

[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425681#comment-17425681
 ] 

Christian Appl commented on PDFBOX-5263:


[~msahyoun] forget about everything I said - you are right, I am wrong, and I 
obviously fail to understand simple diagrams. COSIncrement has nothing to do 
with this, and of course prepareIncrement is the one dereferencing objects 
here...

(Even though: nice... COSIncrement does not cause objects to be dereferenced 
unnecessarily - so at least that worked.)

sorry
Will have a closer look at prepareIncrement now.


[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425668#comment-17425668
 ] 

Maruan Sahyoun commented on PDFBOX-5263:


https://crossasia-books.ub.uni-heidelberg.de/xasia/reader/download/849/849-42-94772-1-10-20210818.pdf

> Suggestion: Signing actual document changes - Enhancing incremental saving
> --
>
> Key: PDFBOX-5263
> URL: https://issues.apache.org/jira/browse/PDFBOX-5263
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing, PDModel, Writing
>Affects Versions: 3.0.0 PDFBox
>Reporter: Christian Appl
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: Enhanced_incremental_saving_.patch, 
> Enhanced_incremental_saving_PDFBox3.patch, NoSignatureFound.zip, 
> Observer_based_incremental_saving(09-13-2021).patch, 
> Observer_based_incremental_saving_(09-09-2021-09-09).patch, 
> Observer_based_incremental_saving_(12-00-2021-09-08).patch, 
> Observer_based_incremental_saving_(15-19-2021-09-08).patch, 
> Observer_based_incremental_saving_(17-00-2021-09-07).patch, 
> Observer_based_incremental_saving_(fixed_).patch, 
> Observer_based_incremental_saving_-_still_eroneous.patch, 
> Observer_based_incremental_saving_.patch, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_afterPatch.pdf, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_beforePatch.pdf, 
> PDFBOX-5263_Introduce_COSReferenceInfo.patch, 
> PDFBOX-5263_Introduce_COSReferenceInfo_(LinkedHashMap).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling_.patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(simplified).patch, 
> Prototype_COSChangeObserver_(code_base).patch, 
> Prototype__Document_and_reference_holder_aware_COSContext_.patch, 
> Updating_context_management_.patch, image-2021-08-23-14-55-24-077.png, 
> image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, 
> image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, 
> image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, 
> image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png, 
> image-2021-09-06-17-06-59-667.png, image-2021-09-07-09-35-31-408.png, 
> image-2021-09-07-13-33-00-161.png, image-2021-09-07-15-40-59-080.png, 
> image-2021-09-08-10-23-44-036.png, image-2021-09-08-11-18-34-211.png, 
> image-2021-09-13-14-40-33-049.png, image-2021-09-13-14-41-13-206.png, 
> image-2021-09-30-09-33-04-449.png, image-2021-09-30-09-34-19-175.png, 
> image-2021-10-07-13-06-40-576.png, image-2021-10-07-13-38-34-817.png, 
> image-2021-10-07-16-42-48-685.png, image-2021-10-07-16-46-30-004.png, 
> image-2021-10-07-17-46-41-793.png, image-2021-10-07-18-13-05-737.png, 
> out.pdf, out2.pdf, profiling-2021-10-07 16-27-06.png
>
>
> *TL;DR:*
> Currently it is rather tedious to create incremental changes in between 
> signatures via PDFBox. I attempted to simplify that and wrote a patch.
> This is a POC rather than an actual suggestion for direct inclusion. (For 
> reasons explained later.)
> *Signatures and incremental PDF documents:*
> A typical reason for wanting to sign a document multiple times (creating an 
> incremental PDF) is that the document changed in between signatures and the 
> additional signature shall sign the new state of the document.
> If one wanted to implement such incremental changes using PDFBox, one would 
> find that most of the time the changes made are completely ignored when 
> calling "saveIncremental".
> As documented for the "saveIncremental" methods and especially the matching 
> constructors in "COSWriter", this would require identifying the "path" of 
> all changes made and setting the "needToBeUpdated" flag of all elements of 
> that path.
> *But:*
> As documented, one would need an exact understanding of what one did and 
> how the PDF standard implements this, and would have to identify said 
> structures; the more complex the changes were, the more tedious this would 
> become.
> *Also:*
> Because of the implementation of incremental saving in COSWriter, the whole 
> path must be informed that it requires an update.
> This results in unnecessarily large increments, as not all ancestors might 
> actually have changed.
> E.g. if one added an image to a preexisting page of the document, the 
> content stream, the resources of the page and the page dictionary would have 
> changed. The "pages" array and all its ancestors would not have changed 
> a bit, but still would have to be informed and included.
> *Assumptions that led to this patch:*
> - COSWriter should not stop iterating a COSTree just because a parent element 
> did not change. Its descendants still could have ch

[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425667#comment-17425667
 ] 

Christian Appl commented on PDFBOX-5263:


[~msahyoun] I was not able to reproduce the issue using my documents. Where 
can I find "849-42-94772-1-10-20210818.pdf"?

 


[jira] [Updated] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Appl updated PDFBOX-5263:
---
Attachment: image-2021-10-07-18-13-05-737.png

> Suggestion: Signing actual document changes - Enhancing incremental saving
> --
>
> Key: PDFBOX-5263
> URL: https://issues.apache.org/jira/browse/PDFBOX-5263
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing, PDModel, Writing
>Affects Versions: 3.0.0 PDFBox
>Reporter: Christian Appl
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: Enhanced_incremental_saving_.patch, 
> Enhanced_incremental_saving_PDFBox3.patch, NoSignatureFound.zip, 
> Observer_based_incremental_saving(09-13-2021).patch, 
> Observer_based_incremental_saving_(09-09-2021-09-09).patch, 
> Observer_based_incremental_saving_(12-00-2021-09-08).patch, 
> Observer_based_incremental_saving_(15-19-2021-09-08).patch, 
> Observer_based_incremental_saving_(17-00-2021-09-07).patch, 
> Observer_based_incremental_saving_(fixed_).patch, 
> Observer_based_incremental_saving_-_still_eroneous.patch, 
> Observer_based_incremental_saving_.patch, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_afterPatch.pdf, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_beforePatch.pdf, 
> PDFBOX-5263_Introduce_COSReferenceInfo.patch, 
> PDFBOX-5263_Introduce_COSReferenceInfo_(LinkedHashMap).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling_.patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(simplified).patch, 
> Prototype_COSChangeObserver_(code_base).patch, 
> Prototype__Document_and_reference_holder_aware_COSContext_.patch, 
> Updating_context_management_.patch, image-2021-08-23-14-55-24-077.png, 
> image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, 
> image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, 
> image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, 
> image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png, 
> image-2021-09-06-17-06-59-667.png, image-2021-09-07-09-35-31-408.png, 
> image-2021-09-07-13-33-00-161.png, image-2021-09-07-15-40-59-080.png, 
> image-2021-09-08-10-23-44-036.png, image-2021-09-08-11-18-34-211.png, 
> image-2021-09-13-14-40-33-049.png, image-2021-09-13-14-41-13-206.png, 
> image-2021-09-30-09-33-04-449.png, image-2021-09-30-09-34-19-175.png, 
> image-2021-10-07-13-06-40-576.png, image-2021-10-07-13-38-34-817.png, 
> image-2021-10-07-16-42-48-685.png, image-2021-10-07-16-46-30-004.png, 
> image-2021-10-07-17-46-41-793.png, image-2021-10-07-18-13-05-737.png, 
> out.pdf, out2.pdf, profiling-2021-10-07 16-27-06.png
>
>
> *TL;DR:*
> Currently it is rather tedious to create incremental changes in between 
> signatures via PDFBox. I attempted to simplify that and wrote a patch.
> This is a POC rather than an actual suggestion for direct inclusion. (For 
> reasons explained later.)
> *Signatures and incremental PDF documents:*
> A typical reason for wanting to sign a document multiple times (creating an 
> incremental PDF) is that the document changed in between signatures and the 
> additional signature shall sign the new state of the document.
> If one wanted to implement such incremental changes using PDFBox, one would 
> find that most of the time the changes made are completely ignored when 
> calling "saveIncremental".
> As documented for the "saveIncremental" methods and especially the matching 
> constructors in "COSWriter", this would require identifying the "path" of 
> all changes made and setting the "needToBeUpdated" flag of all elements of 
> that path.
> *But:*
> As documented, one would need an exact understanding of what one did and 
> how the PDF standard implements this, and would have to identify said 
> structures; the more complex the changes were, the more tedious this would 
> become.
> *Also:*
> Because of the implementation of incremental saving in COSWriter, the whole 
> path must be informed that it requires an update.
> This results in unnecessarily large increments, as not all ancestors might 
> actually have changed.
> E.g. if one added an image to a preexisting page of the document, the 
> content stream, the resources of the page and the page dictionary would have 
> changed. The "pages" array and all its ancestors would not have changed 
> a bit, but still would have to be informed and included.
> *Assumptions that led to this patch:*
> - COSWriter should not stop iterating a COSTree just because a parent element 
> did not change. Its descendants still could have changed!
> - Externally managing an object's update state is tedious and error-prone.
> Objects that implement "

[jira] [Comment Edited] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425641#comment-17425641
 ] 

Christian Appl edited comment on PDFBOX-5263 at 10/7/21, 3:46 PM:
--

[~msahyoun] O.o
 Okay... That is not actually part of my patch, but it would negate 
most of what I am doing here, if that were the case.

Could very well be that I altered that in my first attempt... I actually did 
change a lot more about the COSWriter then.
 Will debug a little and see what I come up with...

Actually "prepareIncrement()" is called after the increment is created... which 
in itself sounds wrong. But as it is, at least it can be said that 
"prepareIncrement" is not causing the odd behaviour of COSIncrement. That 
is still on me, but I will have a closer look at "prepareIncrement" anyway.

!image-2021-10-07-17-46-41-793.png!


was (Author: capsvd):
[~msahyoun] O.o
Okay... That is not actually part of my patch, but it would negate 
most of what I am doing here, if that were the case.

Could very well be that I altered that in my first attempt... I actually did 
change a lot more about the COSWriter then.
Will debug a little and see what I come up with...


[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425641#comment-17425641
 ] 

Christian Appl commented on PDFBOX-5263:


[~msahyoun] O.o
Okay... That is not actually part of my patch, but it would negate 
most of what I am doing here, if that were the case.

Could very well be that I altered that in my first attempt... I actually did 
change a lot more about the COSWriter then.
Will debug a little and see what I come up with...


[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425623#comment-17425623
 ] 

Maruan Sahyoun commented on PDFBOX-5263:


My suspicion (still digging into understanding all this)

COSWriter has this
{code:java}
private void prepareIncrement()
{
    COSDocument cosDoc = pdDocument.getDocument();
    Set<COSObjectKey> keySet = cosDoc.getXrefTable().keySet();
    for (COSObjectKey cosObjectKey : keySet)
    {
        // getObject() dereferences the pooled COSObject
        COSBase object = cosDoc.getObjectFromPool(cosObjectKey).getObject();
        if (object != null && cosObjectKey != null && !(object instanceof COSNumber))
        {
            // FIXME see PDFBOX-4997: objectKeys is (theoretically) risky because a COSName in
            // different objects would appear only once. Rev 1092855 considered this
            // but only for COSNumber.
            objectKeys.put(object, cosObjectKey);
            keyObject.put(cosObjectKey, object);
        }
    }
}
{code}

with COSDocument
{code:java}
public COSObject getObjectFromPool(COSObjectKey key)
{
    COSObject obj = null;
    if (key != null)
    {
        // make "proxy" object if this was a forward reference
        obj = objectPool.computeIfAbsent(key, k -> new COSObject(k, parser));
    }
    return obj;
}
{code}

Doesn't that cause all objects to be parsed at the end?
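
The suspicion above can be illustrated with a minimal, self-contained sketch. It does not use PDFBox: LazyObject, LazyPoolDemo and the counting parse() method are hypothetical stand-ins for COSObject, the object pool and the parser, and only model the pattern "create a proxy if absent, then call getObject() on it". Under that assumption, walking every key of the table parses every object once, while a walk that only touches already dereferenced objects parses nothing.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a lazily parsed indirect object (not the PDFBox class). */
final class LazyObject {
    private final int key;
    private final Function<Integer, String> parser;
    private String value;
    private boolean dereferenced;

    LazyObject(int key, Function<Integer, String> parser) {
        this.key = key;
        this.parser = parser;
    }

    boolean isDereferenced() {
        return dereferenced;
    }

    /** Parses on first access only, like a proxy that resolves itself on demand. */
    String getObject() {
        if (!dereferenced) {
            value = parser.apply(key);
            dereferenced = true;
        }
        return value;
    }
}

final class LazyPoolDemo {
    /** Counts parser invocations so the cost of a walk becomes visible. */
    static int parseCount = 0;

    static String parse(int key) {
        parseCount++;
        return "object-" + key;
    }

    /** Walks every key and dereferences it; returns how many parses that triggered. */
    static int walkAll(int objectCount) {
        parseCount = 0;
        Map<Integer, LazyObject> pool = new LinkedHashMap<>();
        for (int key = 1; key <= objectCount; key++) {
            // The proxy is created if absent; nothing is parsed yet.
            LazyObject proxy = pool.computeIfAbsent(key, k -> new LazyObject(k, LazyPoolDemo::parse));
            // This call is what forces the parse.
            proxy.getObject();
        }
        return parseCount;
    }

    /** Same pool, but the walk skips objects that were never dereferenced before. */
    static int walkDereferencedOnly(int objectCount, int touchedKey) {
        Map<Integer, LazyObject> pool = new LinkedHashMap<>();
        for (int key = 1; key <= objectCount; key++) {
            pool.put(key, new LazyObject(key, LazyPoolDemo::parse));
        }
        pool.get(touchedKey).getObject(); // simulates an earlier user access
        parseCount = 0;
        for (LazyObject proxy : pool.values()) {
            if (proxy.isDereferenced()) {
                proxy.getObject(); // already loaded, so no parse happens
            }
        }
        return parseCount;
    }

    public static void main(String[] args) {
        System.out.println(walkAll(5));                 // prints 5
        System.out.println(walkDereferencedOnly(5, 2)); // prints 0
    }
}
```

Whether the real prepareIncrement() behaves like walkAll() depends on what COSObject#getObject() does for objects that are not yet loaded; the sketch only shows why the combination would be expensive if it does parse.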


[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425614#comment-17425614
 ] 

Christian Appl commented on PDFBOX-5263:


[~msahyoun]

"with prepareIncrement -> parseObjectDynamically ... taking most of the time. 
Why does prepareIncrement need to parse?"

*Stating the obvious (skip this):*
It would have to parse a "COSObject" instance if that instance had to be 
actively dereferenced.
In that case "getObject()" causes the "COSObject" to use its given 
"ICOSParser" to parse its "actual" contents, which is delayed until such a 
COSObject is actually accessed and requires parsing. This behaviour was added 
with 3.0.
!image-2021-10-07-16-46-30-004.png!

*Why is COSIncrement causing this???*
The one and only call in COSIncrement to "COSObject#getObject()" is the 
following:
!image-2021-10-07-16-42-48-685.png!
Not a single other line in COSIncrement could cause this.

So this line is reached... But _why_?
This says: 
- If the object has already been dereferenced, then have a look at its 
contained "actual" substructures (something the COSWriter would do in that 
case as well).
That second condition can be ignored, as that object is already dereferenced 
and should not cause the ICOSParser to run (unless "isDereferenced()" is 
actively lying to me here, which I tend to doubt).

If the object has not already been dereferenced by a user interaction, the only 
case in which this would try to actively dereference the COSObject is 
if the update state of that object had been set to "true", which is more 
likely... But _why_?
I could remove the first condition and that would take care of that... But 
_why_ is a child COSObject that has not even been dereferenced yet deemed to 
be updated (in any situation)?
I can see why this would slow things down, but I fail to understand how this 
could happen.
This is again what you already found happening for the observer. I still 
entirely fail to understand what on earth is causing this issue.

Will have a look into that.
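
The two-part condition described above can be sketched as follows. This is a rough model, not the code from the patch or from COSIncrement: LazyNode, IncrementWalk and the needsUpdate field are hypothetical names, and the sketch assumes the condition is "flagged for update OR already dereferenced". It shows that an unchanged but loaded parent is still traversed, so a changed descendant is found, while a sibling that was never dereferenced is never parsed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical lazily loaded tree node; it only models the behaviour discussed above. */
final class LazyNode {
    final String name;
    boolean needsUpdate;
    int loadCount; // how often the "parser" ran for this node
    private final Supplier<List<LazyNode>> loader;
    private List<LazyNode> children; // null until dereferenced

    LazyNode(String name, Supplier<List<LazyNode>> loader) {
        this.name = name;
        this.loader = loader;
    }

    boolean isDereferenced() {
        return children != null;
    }

    /** Loads the children on first access only. */
    List<LazyNode> getChildren() {
        if (children == null) {
            loadCount++;
            children = loader.get();
        }
        return children;
    }
}

final class IncrementWalk {
    /** Collects the names of changed nodes without forcing untouched nodes to load. */
    static void collect(LazyNode node, List<String> changed) {
        if (node.needsUpdate) {
            changed.add(node.name);
        }
        // Descend if the node is flagged or already loaded; otherwise leave it unparsed.
        if (node.needsUpdate || node.isDereferenced()) {
            for (LazyNode child : node.getChildren()) {
                collect(child, changed);
            }
        }
    }

    static String demo() {
        LazyNode image = new LazyNode("image", () -> new ArrayList<>());
        image.needsUpdate = true;
        LazyNode page1 = new LazyNode("page1", () -> Arrays.asList(image));
        LazyNode page2 = new LazyNode("page2", () -> new ArrayList<>());
        LazyNode pages = new LazyNode("pages", () -> Arrays.asList(page1, page2));

        // Simulated user interaction: pages and page1 were opened, page2 never was.
        pages.getChildren();
        page1.getChildren();

        List<String> changed = new ArrayList<>();
        collect(pages, changed);
        return changed + ";page2Loads=" + page2.loadCount;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [image];page2Loads=0
    }
}
```

If a node that was never dereferenced nevertheless carries needsUpdate == true, the first half of the condition forces it to load, which is the situation questioned in the comment above.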


[jira] [Updated] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Appl updated PDFBOX-5263:
---
Attachment: image-2021-10-07-16-46-30-004.png

> Suggestion: Signing actual document changes - Enhancing incremental saving
> --
>
> Key: PDFBOX-5263
> URL: https://issues.apache.org/jira/browse/PDFBOX-5263
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing, PDModel, Writing
>Affects Versions: 3.0.0 PDFBox
>Reporter: Christian Appl
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: Enhanced_incremental_saving_.patch, 
> Enhanced_incremental_saving_PDFBox3.patch, NoSignatureFound.zip, 
> Observer_based_incremental_saving(09-13-2021).patch, 
> Observer_based_incremental_saving_(09-09-2021-09-09).patch, 
> Observer_based_incremental_saving_(12-00-2021-09-08).patch, 
> Observer_based_incremental_saving_(15-19-2021-09-08).patch, 
> Observer_based_incremental_saving_(17-00-2021-09-07).patch, 
> Observer_based_incremental_saving_(fixed_).patch, 
> Observer_based_incremental_saving_-_still_eroneous.patch, 
> Observer_based_incremental_saving_.patch, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_afterPatch.pdf, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_beforePatch.pdf, 
> PDFBOX-5263_Introduce_COSReferenceInfo.patch, 
> PDFBOX-5263_Introduce_COSReferenceInfo_(LinkedHashMap).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling_.patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(simplified).patch, 
> Prototype_COSChangeObserver_(code_base).patch, 
> Prototype__Document_and_reference_holder_aware_COSContext_.patch, 
> Updating_context_management_.patch, image-2021-08-23-14-55-24-077.png, 
> image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, 
> image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, 
> image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, 
> image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png, 
> image-2021-09-06-17-06-59-667.png, image-2021-09-07-09-35-31-408.png, 
> image-2021-09-07-13-33-00-161.png, image-2021-09-07-15-40-59-080.png, 
> image-2021-09-08-10-23-44-036.png, image-2021-09-08-11-18-34-211.png, 
> image-2021-09-13-14-40-33-049.png, image-2021-09-13-14-41-13-206.png, 
> image-2021-09-30-09-33-04-449.png, image-2021-09-30-09-34-19-175.png, 
> image-2021-10-07-13-06-40-576.png, image-2021-10-07-13-38-34-817.png, 
> image-2021-10-07-16-42-48-685.png, image-2021-10-07-16-46-30-004.png, 
> out.pdf, out2.pdf, profiling-2021-10-07 16-27-06.png
>
>
> *TL;DR:*
> Currently it is rather tedious to create incremental changes in between 
> signatures via PDFBox. I attempted to simplify that and wrote a patch.
> This is a POC rather than an actual suggestion for direct inclusion. (For 
> reasons explained later.)
> *Signatures and incremental PDF documents:*
> A typical reason for wanting to sign a document multiple times (creating an 
> incremental PDF) is that the document changed in between signatures and the 
> additional signature shall sign the new state of the document.
> If one wanted to implement such incremental changes using PDFBox, one would 
> find that most of the time the changes made are completely ignored when 
> calling "saveIncremental".
> As documented for the "saveIncremental" methods and especially the matching 
> constructors in "COSWriter", this would require identifying the "path" of 
> all changes made and setting the "needToBeUpdated" flag of all elements of 
> that path.
> *But:*
> As documented, one would need an exact understanding of what one did and of 
> how the PDF standard implements this; one would have to identify said 
> structures, and the more complex the changes were, the more tedious this 
> would become.
> *Also:*
> Because of the implementation of incremental saving in COSWriter, the whole 
> path must be informed that it requires an update.
> This results in unnecessarily large increments, as not all ancestors might 
> actually have changed.
> E.g. if one added an image to a preexisting page of the document, the 
> content stream, the resources of the page and the page dictionary would have 
> changed. The "pages" array and all its ancestors would not have changed 
> a bit, but would still have to be informed and included.
> *Assumptions that led to this patch:*
> - COSWriter should not stop iterating a COSTree just because a parent element 
> did not change. Its descendants could still have changed!
> - Externally managing an object's update state is tedious and error-prone.
> Objects that implement "COSUpdateInfo" should know and manage by themselves 
> whether they were 

[jira] [Updated] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Appl updated PDFBOX-5263:
---
Attachment: image-2021-10-07-16-42-48-685.png


[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425602#comment-17425602
 ] 

Maruan Sahyoun commented on PDFBOX-5263:


!profiling-2021-10-07 16-27-06.png!

with prepareIncrement -> parseObjectDynamically ... taking most of the time. 
Why does prepareIncrement need to parse? 


[jira] [Updated] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Maruan Sahyoun (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun updated PDFBOX-5263:
---
Attachment: profiling-2021-10-07 16-27-06.png


[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425557#comment-17425557
 ] 

Christian Appl commented on PDFBOX-5263:


*Notes to self - thinking about whether and how this could be improved*
*Rules:*
- Do not iterate COSStructures - _at all!_ (Iteration costs time; skip steps 
wherever possible. Determine all required information at exactly the moment a 
change is detected.)
- Everything that is not a COSUpdateInfo is irrelevant and shall be skipped: 
not handled, not even traversed. Those objects must not be touched and must 
not be seen.
- Do not collect information that is not required.
- Do not include objects that are not being updated! (For signatures that 
might be okay, but increments the size of the original document are 
unacceptable.)
- Do not evaluate paths and reference holders for indirect objects! (Indirect 
objects can have multiple parents, so processing paths and reference holders 
is costly.)
- Do not cause direct objects to become indirect! (Which somewhat conflicts 
with the previous point, as the parent must be known and updated for direct 
objects.)
- Collect updated objects in a flat list at a central place that can be used 
by the COSWriter directly, without adapting it further!
Well I do not know whether that is a can-do...

*Thoughts:*
- Every object is either direct or indirect. In the end only indirect objects 
can be added to an increment. Storing direct objects for later evaluation 
should be avoided at all costs, as it results in having to find the indirect 
object that we would actually like to update.
- The document already knew all contained indirect objects of a document in 2.0 
- possibly that is something we could work with.
- If the parser can tell objects which changes it causes (hidden 
method/state/flag/lock?), those changes could be ignored without the object 
even having to know its document context. (The parser, however, would do that.)
- Dereferencing objects is not a special case if parsers inform objects about 
their activity, as dereferencing is also handled by an ICOSParser, which can 
do the same.
- Not all COSUpdateInfo structures are relevant: if a COSUpdateInfo is direct, 
it should inform the next higher (containing) indirect object about the update 
instead.
- When a COSUpdateInfo is told to be direct, the containing indirect object 
could be given to it; the object itself could then redirect all updates to the 
containing structure. (setNeedToBeUpdated)
- Indirect objects must know a central place to report updates to (a reduced 
core of COSIncrement), which collects updated indirect objects without further 
questions, in an ordered manner. That structure would create the increment 
by doing so.
- The increment would again be created in real time, without having to 
postprocess the document for the COSWriter, using structures that are as cheap 
as can be.
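
To make the thoughts above a bit more concrete, here is a minimal, self-contained sketch of the reporting idea. It is not PDFBox code: the classes Increment and Obj and the method markUpdated are hypothetical stand-ins invented for illustration (they only loosely correspond to COSIncrement, COSUpdateInfo and setNeedToBeUpdated). It assumes that a direct object knows its container and that every indirect object knows one central, ordered collection.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

public class IncrementSketch {

    /** Hypothetical central place that collects updated indirect objects in order. */
    static final class Increment {
        private final Set<Obj> updated = new LinkedHashSet<>();

        void report(Obj indirect) {
            updated.add(indirect); // a second report of the same object is a no-op
        }

        List<String> names() {
            List<String> result = new ArrayList<>();
            for (Obj obj : updated) {
                result.add(obj.name);
            }
            return result;
        }
    }

    /** Hypothetical stand-in for an updatable COS structure. */
    static final class Obj {
        private final String name;
        private final Obj container;       // null for indirect objects
        private final Increment increment; // null for direct objects

        private Obj(String name, Obj container, Increment increment) {
            this.name = name;
            this.container = container;
            this.increment = increment;
        }

        static Obj indirect(String name, Increment increment) {
            return new Obj(name, null, Objects.requireNonNull(increment));
        }

        static Obj direct(String name, Obj container) {
            return new Obj(name, Objects.requireNonNull(container), null);
        }

        /** Redirects the update to the nearest containing indirect object. */
        void markUpdated() {
            Obj target = this;
            while (target.container != null) {
                target = target.container;
            }
            target.increment.report(target);
        }
    }

    public static void main(String[] args) {
        Increment increment = new Increment();
        Obj page = Obj.indirect("page", increment);
        Obj resources = Obj.direct("resources", page);
        Obj fontDict = Obj.direct("fontDict", resources);
        Obj contents = Obj.indirect("contents", increment);

        fontDict.markUpdated();  // ends up reporting "page"
        contents.markUpdated();  // reports "contents"
        resources.markUpdated(); // "page" again, ignored

        System.out.println(increment.names()); // prints [page, contents]
    }
}
```

No tree is iterated at any point; the only walk is from a direct object up to its containing indirect object. Whether the real object model can supply that container reference cheaply is exactly the open question.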

This sounds _somewhat_ plausible and possible.
I will give it another go at the weekend, but I can give no guarantees at all 
that I will end up with something that actually works.


[jira] [Commented] (PDFBOX-5290) ClassCastException during Text Extraction

2021-10-07 Thread Eric R Manzitti (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425552#comment-17425552
 ] 

Eric R Manzitti commented on PDFBOX-5290:
-

I also double checked in my IDE that my "external dependency" to PDFBox was 
indeed 2.0.24.  It was.  Is it at all possible the app and the library are 
different?

> ClassCastException during Text Extraction
> -
>
> Key: PDFBOX-5290
> URL: https://issues.apache.org/jira/browse/PDFBOX-5290
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.20, 2.0.24
>Reporter: Eric R Manzitti
>Priority: Major
> Attachments: newBroke.pdf, newBroke.txt
>
>
> I am getting: 
>  
> java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be 
> cast to org.apache.pdfbox.cos.COSArray
> When executing the following code:
>  
> public byte[] extractTextPDFBox(String fileNamePath) throws PQException {
>     String UTF_8 = "UTF-8";
>     PDFLibraryProperties pdfLibraryProperties = PDFLibraryProperties.getInstance();
>     String regex = pdfLibraryProperties.getAsString(
>             PDFLibraryConstants.REGEX_TO_REMOVE_FROM_EXTRACTED_TEXT);
>     byte[] bytesToReturn;
>     try {
>         FileInputStream fis = new FileInputStream(new File(fileNamePath));
>         PDDocument pdfDoc = PDDocument.load(fis);
>         PDFTextStripper pdfStripper = new PDFTextStripper();
>         // The ClassCastException is thrown by this call.
>         String textFromPDF = pdfStripper.getText(pdfDoc);
>         pdfDoc.close();
>         bytesToReturn = textFromPDF.getBytes(UTF_8);
>         String textStr = new String(bytesToReturn).replaceAll(regex,
>                 PDFLibraryConstants.BLANK_SPACE);
>         bytesToReturn = textStr.getBytes();
>         fis.close();
>     } catch (IOException e) {
>         pqUtilityLogger.logError(e.getMessage());
>         throw new PQException(e.getMessage());
>     }
>     return bytesToReturn;
> }
>  
> It dies on String textFromPDF = pdfStripper.getText(pdfDoc);
>  






[jira] [Comment Edited] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425479#comment-17425479
 ] 

Christian Appl edited comment on PDFBOX-5263 at 10/7/21, 12:05 PM:
---

-H-
 -Iterating the COSTree is not that special and the COSWriter would do exactly 
that.-
 -The idea was to spare the COSWriter the trouble of iterating the tree, by 
creating the increment first, setting the objectsToWrite and relying on 
"addObjectsToWrite" to skip those objects, as they have already been added, 
but I think this contains an obvious flaw: say the trailer and its children 
are not actually updated, but some distant substructure is... the increment 
would add only the substructure to the objects already processed.-
 -Meaning the COSWriter would still iterate the ancestry of that, just to 
recognize (again) that those were not updated!?-

-If that is the case, it is a major flaw - COSIncrement shall result in the 
COSWriter not iterating the document at all, it shall produce a good to go 
increment the COSWriter simply must write, without having to iterate the 
structures, that have not changed-
 -If that is the case, this would still waste a lot of time-

*Edit:*
 Will try to have a look into that at the weekend.


was (Author: capsvd):
H
 Iterating the COSTree is not that special and the COSWriter would do exactly 
that.
 The idea was to spare the COSWriter the trouble of iterating the tree, by 
creating the increment first, setting the objectsToWrite and relying on 
"addObjectsToWrite" to skip those objects, as they have already been added, 
but I think this contains an obvious flaw: say the trailer and its children 
are not actually updated, but some distant substructure is... the increment 
would add only the substructure to the objects already processed.
 Meaning the COSWriter would still iterate the ancestry of that, just to 
recognize (again) that those were not updated!?

If that is the case, it is a major flaw - COSIncrement shall result in the 
COSWriter not iterating the document at all, it shall produce a good to go 
increment the COSWriter simply must write, without having to iterate the 
structures, that have not changed
 If that is the case, this would still waste a lot of time

*Edit:*
Will try to have a look into that at the weekend.


[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425498#comment-17425498
 ] 

Christian Appl commented on PDFBOX-5263:


[~msahyoun] that I actually can answer:

*TL;DR:*
If some mechanism could identify and store updated nodes on the go, without 
having to iterate the tree at some point, that would be the superior 
solution and could be about as fast as it used to be. But the observer, as I 
wrote it, seemingly was not able to do that without slowing things down even 
more than iterating the tree does.
If the tree has to be iterated in its entirety to automate the detection of 
updated nodes, this cannot be faster than, or even as fast as, the original logic.

*Explanation:*
COSWriter#addObjectToWrite(COSBase): (which is somewhat answering my own 
question above...)
!image-2021-10-07-13-06-40-576.png!
skips structures that have not been updated when an active COSWriter is 
writing an increment.

If you start the iteration at the root of the tree (trailer) and skip all 
nodes that do not claim to be updated, you (most of the time) skip the 
majority of the document and save a lot of time that way.
This is what 2.0 did and what 3.0 does without my changes.

*However:*
Scenario: In path "Trailer/A/B/C", node C is the only node requiring an update, 
and it alone justifies inclusion in an increment.
But to fulfill the requirements of the above method, the nodes Trailer, A and B 
must also be marked as updated, otherwise COSWriter will never reach node C. 
This is how adding a signature in 2.0 works (and I assume that has not changed 
much in 3.0).
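
The scenario can be sketched roughly as follows. This is only an illustration under assumptions: Node, its needToBeUpdated field and collect are hypothetical stand-ins, not the PDFBox COS classes and not the actual COSWriter#addObjectToWrite logic.

```java
import java.util.ArrayList;
import java.util.List;

final class PrunedTraversalSketch {
    /** Hypothetical stand-in for a COS object; not a PDFBox class. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        boolean needToBeUpdated;

        Node(String name) {
            this.name = name;
        }

        Node add(Node child) {
            children.add(child);
            return child;
        }
    }

    /** Collects nodes for the increment and skips every subtree whose root is not flagged. */
    static void collect(Node node, List<String> out) {
        if (!node.needToBeUpdated) {
            return; // pruning: the descendants of this node are never visited
        }
        out.add(node.name);
        for (Node child : node.children) {
            collect(child, out);
        }
    }

    /** Builds Trailer/A/B/C, flags C (and optionally its ancestors) and returns what would be written. */
    static List<String> run(boolean flagAncestors) {
        Node trailer = new Node("Trailer");
        Node a = trailer.add(new Node("A"));
        Node b = a.add(new Node("B"));
        Node c = b.add(new Node("C"));
        c.needToBeUpdated = true;
        if (flagAncestors) {
            trailer.needToBeUpdated = true;
            a.needToBeUpdated = true;
            b.needToBeUpdated = true;
        }
        List<String> out = new ArrayList<>();
        collect(trailer, out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // C is flagged but unreachable
        System.out.println(run(true));  // the whole path ends up in the increment
    }
}
```

With only C flagged, the traversal stops at the trailer and nothing is collected. With the whole path flagged, Trailer, A and B are collected along with C although they did not change.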

*Automation:*
- If you don't want to add such a path manually and also want to avoid adding 
nodes that don't actually require inclusion in an increment, you would either 
have to know the nodes that were updated (what the observer did)
- Or you would have to iterate the tree in its entirety and identify the nodes 
that require an update (what COSIncrement does).
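
The second option, the full scan, might look roughly like this. It is a sketch under assumptions: Node, Result and scan are hypothetical names, the structure is assumed to be a plain tree without shared or cyclic references, and this is not the code of COSIncrement.

```java
import java.util.ArrayList;
import java.util.List;

final class FullScanSketch {
    /** Hypothetical stand-in for a COS object; not a PDFBox class. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        boolean updated;

        Node(String name) {
            this.name = name;
        }

        Node add(Node child) {
            children.add(child);
            return child;
        }
    }

    /** Outcome of a scan: the updated nodes found and the number of nodes touched. */
    static final class Result {
        final List<String> updated = new ArrayList<>();
        int visited;
    }

    /** Visits every node, flagged or not, and records only those that were updated. */
    static void scan(Node node, Result result) {
        result.visited++;
        if (node.updated) {
            result.updated.add(node.name);
        }
        for (Node child : node.children) {
            scan(child, result); // no pruning: unchanged parents may hide changed descendants
        }
    }

    /** Builds Trailer/A/B/C plus an unrelated node D and marks only C as updated. */
    static Result run() {
        Node trailer = new Node("Trailer");
        Node a = trailer.add(new Node("A"));
        Node b = a.add(new Node("B"));
        Node c = b.add(new Node("C"));
        trailer.add(new Node("D"));
        c.updated = true;
        Result result = new Result();
        scan(trailer, result);
        return result;
    }

    public static void main(String[] args) {
        Result result = run();
        System.out.println(result.updated + " found after visiting " + result.visited + " nodes");
    }
}
```

The increment is precise here (only C), but every node of the tree is visited to get there, which is where the time goes.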

*Alternative - loss of precision:*
If, however, you are okay with adding nodes to an increment that were not 
actually updated, but require inclusion to enable the COSWriter to find and 
process such updates:
You would need the means to determine the path of a COSUpdateInfo when 
it is updated (to update its whole path as well). This could possibly also be 
a solution that is closer to the original one and would also avoid 
searching for updated nodes (without any guarantee of this being faster, however; 
that would depend on how, and how fast, the path can be determined for a node).
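
A rough sketch of that alternative, assuming each node knows exactly one parent. That assumption is the crux: the parent field, markUpdated and collect are hypothetical, and real COS objects can be referenced from several places, so determining "the path" is the open question mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

final class PathPropagationSketch {
    /** Hypothetical stand-in for a COS object with a single known parent; not a PDFBox class. */
    static final class Node {
        final String name;
        final Node parent;
        final List<Node> children = new ArrayList<>();
        boolean needToBeUpdated;

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
            if (parent != null) {
                parent.children.add(this);
            }
        }

        /** Flags this node and walks up until the root or an already flagged ancestor is reached. */
        void markUpdated() {
            for (Node n = this; n != null && !n.needToBeUpdated; n = n.parent) {
                n.needToBeUpdated = true;
            }
        }
    }

    /** Same pruned traversal as before: unflagged subtrees are skipped entirely. */
    static void collect(Node node, List<String> out) {
        if (!node.needToBeUpdated) {
            return;
        }
        out.add(node.name);
        for (Node child : node.children) {
            collect(child, out);
        }
    }

    /** Builds Trailer/A/B/C plus an untouched sibling D, updates only C and collects the increment. */
    static List<String> run() {
        Node trailer = new Node("Trailer", null);
        Node a = new Node("A", trailer);
        Node b = new Node("B", a);
        Node c = new Node("C", b);
        new Node("D", a); // never updated, so it must not show up in the increment
        c.markUpdated();
        List<String> out = new ArrayList<>();
        collect(trailer, out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

No search is needed at write time, at the price of including Trailer, A and B in the increment although only C changed.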

*Therefore - back to TL;DR:*
If some mechanism could identify and store updated nodes on the go, without 
having to iterate the tree at some point, that would be the superior 
solution and could be about as fast as it used to be. But the observer, as I 
wrote it, seemingly was not able to do that without slowing things down even 
more than iterating the tree does.
If the tree has to be iterated in its entirety to automate the detection of 
updated nodes, this cannot be faster than, or even as fast as, the original logic.

*Doubts:*
Because of all that, I doubted whether I could find a solution that would come 
close to the original solution.
Either you know what to update and can tell the COSWriter what to do,
or you find what to update.
If that shall happen automatically, both solutions cost time and require 
reflecting upon the context of nodes.

The old solution did not require that, as someone told the COSWriter 
"hardcoded" what shall be done in some special case, without the necessity to 
provide a solution for all other possible cases.

*How it used to be:*
Excerpt from PDDocument#addSignature(PDSignature, SignatureInterface, 
SignatureOptions):
!image-2021-10-07-13-38-34-817.png!


[jira] [Updated] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Appl updated PDFBOX-5263:
---
Attachment: image-2021-10-07-13-38-34-817.png

> Suggestion: Signing actual document changes - Enhancing incremental saving
> --
>
> Key: PDFBOX-5263
> URL: https://issues.apache.org/jira/browse/PDFBOX-5263
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing, PDModel, Writing
>Affects Versions: 3.0.0 PDFBox
>Reporter: Christian Appl
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: Enhanced_incremental_saving_.patch, 
> Enhanced_incremental_saving_PDFBox3.patch, NoSignatureFound.zip, 
> Observer_based_incremental_saving(09-13-2021).patch, 
> Observer_based_incremental_saving_(09-09-2021-09-09).patch, 
> Observer_based_incremental_saving_(12-00-2021-09-08).patch, 
> Observer_based_incremental_saving_(15-19-2021-09-08).patch, 
> Observer_based_incremental_saving_(17-00-2021-09-07).patch, 
> Observer_based_incremental_saving_(fixed_).patch, 
> Observer_based_incremental_saving_-_still_eroneous.patch, 
> Observer_based_incremental_saving_.patch, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_afterPatch.pdf, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_beforePatch.pdf, 
> PDFBOX-5263_Introduce_COSReferenceInfo.patch, 
> PDFBOX-5263_Introduce_COSReferenceInfo_(LinkedHashMap).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling_.patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(simplified).patch, 
> Prototype_COSChangeObserver_(code_base).patch, 
> Prototype__Document_and_reference_holder_aware_COSContext_.patch, 
> Updating_context_management_.patch, image-2021-08-23-14-55-24-077.png, 
> image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, 
> image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, 
> image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, 
> image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png, 
> image-2021-09-06-17-06-59-667.png, image-2021-09-07-09-35-31-408.png, 
> image-2021-09-07-13-33-00-161.png, image-2021-09-07-15-40-59-080.png, 
> image-2021-09-08-10-23-44-036.png, image-2021-09-08-11-18-34-211.png, 
> image-2021-09-13-14-40-33-049.png, image-2021-09-13-14-41-13-206.png, 
> image-2021-09-30-09-33-04-449.png, image-2021-09-30-09-34-19-175.png, 
> image-2021-10-07-13-06-40-576.png, image-2021-10-07-13-38-34-817.png, 
> out.pdf, out2.pdf
>
>
> *TL;DR:*
> Currently it is rather tedious to create incremental changes in between 
> signatures via PDFBox. I attempted to simplify that and wrote a patch.
> This is a POC rather than an actual suggestion for direct inclusion. (For 
> reasons explained later.)
> *Signatures and incremental PDF documents:*
> A typical reason for wanting to sign a document multiple times (creating an 
> incremental PDF) is that the document changed in between signatures and the 
> additional signature shall sign the new state of the document.
> If one wanted to implement such incremental changes using PDFBox, one would 
> find that, most of the time, the changes made are completely ignored when 
> calling "saveIncremental".
> As documented for the "saveIncremental" methods and especially the matching 
> constructors in "COSWriter", this would require identifying the "path" of 
> all changes made, and one would need to set the "needToBeUpdated" flag of all 
> elements of that path.
> *But:*
> As documented, one would need an exact understanding of what one did and 
> how the PDF standard implements this; one would have to identify said 
> structures, and the more complex the changes were, the more tedious this would 
> become.
> *Also:*
> Because of the implementation of incremental saving in COSWriter, the whole 
> path must be informed that it requires an update.
> This results in unnecessarily large increments, as not all ancestors might 
> actually have changed.
> E.g., if one added an image to a preexisting page of the document, the 
> content stream, the resources of the page and the page dictionary would have 
> changed. The "pages" array and all its ancestors would not have changed 
> a bit, but still would have to be informed and included.
> *Assumptions that led to this patch:*
> - COSWriter should not stop iterating a COSTree just because a parent element 
> did not change. Its descendants still could have changed!
> - Externally managing an object's update state is tedious and error-prone.
> Objects that implement "COSUpdateInfo" should know and manage by themselves 
> whether they were freshly created or altered
> (e.g.: A COSDictionary should be able to remember, that a setter had been 
> ca

[jira] [Updated] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Appl updated PDFBOX-5263:
---
Attachment: image-2021-10-07-13-06-40-576.png


[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425483#comment-17425483
 ] 

Maruan Sahyoun commented on PDFBOX-5263:


[~capSVD] I won't have any time today or tomorrow. In some benchmarks, saving 
a mid-sized document normally and saving it incrementally took about the same 
time, although the incremental save really should be much faster. The initial 
parsing took only a fraction of a second, and so did the plain copy of the 
original file from source to target. Since no changes were made to the 
document, my first thought was: why does it take so long? That led me to 
suspect an issue. So I'm really thankful that you are looking into that.
 


[jira] [Comment Edited] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425479#comment-17425479
 ] 

Christian Appl edited comment on PDFBOX-5263 at 10/7/21, 10:53 AM:
---

H
Iterating the COSTree is not that special and the COSWriter would do exactly 
that.
The idea was to spare the COSWriter the trouble of iterating the tree by 
creating the increment first, setting the objectsToWrite and relying on 
"addObjectsToWrite" to skip those objects, as they have already been added. 
But I think this contains an obvious flaw: say the trailer and its children 
are not actually updated, but some distant substructure is. The increment 
would then add only that substructure to the objects already processed, 
meaning the COSWriter would still iterate its ancestry, just to recognize 
(again) that those ancestors were not updated!?

If that is the case, it is a major flaw. COSIncrement shall result in the 
COSWriter not iterating the document at all; it shall produce a good-to-go 
increment that the COSWriter simply has to write, without iterating the 
structures that have not changed.
If that is the case, this would still waste a lot of time.

*Edit:*
Will try to have a look into that at the weekend.




[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425479#comment-17425479
 ] 

Christian Appl commented on PDFBOX-5263:


H
Iterating the COSTree is not that special and the COSWriter would do exactly 
that.
The idea was to spare the COSWriter the trouble of iterating the tree by 
creating the increment first, setting the objectsToWrite and relying on 
"addObjectsToWrite" to skip those objects, as they have already been added. 
But I think this contains an obvious flaw: say the trailer and its children 
are not actually updated, but some distant substructure is. The increment 
would then add only that substructure to the objects already processed, 
meaning the COSWriter would still iterate its ancestry, just to recognize 
(again) that those ancestors were not updated!?

If that is the case, it is a major flaw. COSIncrement shall result in the 
COSWriter not iterating the document at all; it shall produce a good-to-go 
increment that the COSWriter simply has to write, without iterating the 
structures that have not changed.
If that is the case, this would still waste a lot of time.


[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425468#comment-17425468
 ] 

Christian Appl commented on PDFBOX-5263:


The whole document tree, starting with the document trailer (minus objects 
that would have to be dereferenced and are not updated, as those have not been 
loaded before and their substructures cannot have changed).

> Suggestion: Signing actual document changes - Enhancing incremental saving
> --
>
> Key: PDFBOX-5263
> URL: https://issues.apache.org/jira/browse/PDFBOX-5263
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing, PDModel, Writing
>Affects Versions: 3.0.0 PDFBox
>Reporter: Christian Appl
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: Enhanced_incremental_saving_.patch, 
> Enhanced_incremental_saving_PDFBox3.patch, NoSignatureFound.zip, 
> Observer_based_incremental_saving(09-13-2021).patch, 
> Observer_based_incremental_saving_(09-09-2021-09-09).patch, 
> Observer_based_incremental_saving_(12-00-2021-09-08).patch, 
> Observer_based_incremental_saving_(15-19-2021-09-08).patch, 
> Observer_based_incremental_saving_(17-00-2021-09-07).patch, 
> Observer_based_incremental_saving_(fixed_).patch, 
> Observer_based_incremental_saving_-_still_eroneous.patch, 
> Observer_based_incremental_saving_.patch, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_afterPatch.pdf, 
> PDFBOX-5263-YTW2VWJQTDAE67PGJT6GS7QSKW3GNUQR_beforePatch.pdf, 
> PDFBOX-5263_Introduce_COSReferenceInfo.patch, 
> PDFBOX-5263_Introduce_COSReferenceInfo_(LinkedHashMap).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling_.patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(POC2).patch, 
> PDFBOX-5263_Introduce_state_based_increment_handling__(simplified).patch, 
> Prototype_COSChangeObserver_(code_base).patch, 
> Prototype__Document_and_reference_holder_aware_COSContext_.patch, 
> Updating_context_management_.patch, image-2021-08-23-14-55-24-077.png, 
> image-2021-08-26-09-52-33-567.png, image-2021-08-26-09-54-24-897.png, 
> image-2021-08-26-10-00-07-383.png, image-2021-08-26-10-02-08-003.png, 
> image-2021-08-26-10-03-47-940.png, image-2021-08-26-10-06-42-925.png, 
> image-2021-08-26-10-09-12-698.png, image-2021-08-26-10-12-19-265.png, 
> image-2021-09-06-17-06-59-667.png, image-2021-09-07-09-35-31-408.png, 
> image-2021-09-07-13-33-00-161.png, image-2021-09-07-15-40-59-080.png, 
> image-2021-09-08-10-23-44-036.png, image-2021-09-08-11-18-34-211.png, 
> image-2021-09-13-14-40-33-049.png, image-2021-09-13-14-41-13-206.png, 
> image-2021-09-30-09-33-04-449.png, image-2021-09-30-09-34-19-175.png, 
> out.pdf, out2.pdf
>
>
> *TL;DR:*
> Currently it is rather tedious to create incremental changes in between 
> signatures via PDFBox. I attempted to simplify that and wrote a patch.
> This is a POC rather than an actual suggestion for direct inclusion (for 
> reasons explained later).
> *Signatures and incremental PDF documents:*
> A typical reason for wanting to sign a document multiple times (creating an 
> incremental PDF) is that the document changed in between signatures and the 
> additional signature shall sign the new state of the document.
> If one wanted to implement such incremental changes using PDFBox, one would 
> find that most of the time the changes made are completely ignored when 
> calling "saveIncremental".
> As documented for the "saveIncremental" methods and especially the matching 
> constructors in "COSWriter", this would require identifying the "path" of 
> all changes made, and one would need to set the "needToBeUpdated" flag of all 
> elements of that path.
> *But:*
> As documented, one would need an exact understanding of what one did and 
> how the PDF standard implements this; one would have to identify said 
> structures, and the more complex the changes were, the more tedious this would 
> become.
> *Also:*
> Because of the implementation of incremental saving in COSWriter, the whole 
> path must be informed that it requires an update.
> This results in unnecessarily large increments, as not all ancestors might 
> actually have changed.
> E.g. if one added an image to a preexisting page of the document, the 
> content stream, the resources of the page and the page dictionary would have 
> changed. The "pages" array and all its ancestors would not have changed 
> a bit, but would still have to be informed and included.
> *Assumptions that led to this patch:*
> - COSWriter should not stop iterating a COSTree just because a parent element 
> did not change. Its descendants could still have changed!
> - Externally managing an object's update state is tedious and error-prone.
> Objects that implement "COSUpdateInfo" should know a

[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425465#comment-17425465
 ] 

Maruan Sahyoun commented on PDFBOX-5263:


{quote}
 The first (and most costly) assumption of COSIncrement is: The whole loaded 
structure must be iterated ...
{quote}

What is "... the loaded structure ..." i.e. which variable, collection, ...?


[jira] [Comment Edited] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425445#comment-17425445
 ] 

Christian Appl edited comment on PDFBOX-5263 at 10/7/21, 9:52 AM:
--

*A COSUpdateInfo is set to be updated if:*
*Basic conditions / Accepting updates / Denying updates:*
 - An origin "COSDocumentState" has been set and is "accepting updates".
 - The "COSDocumentState" is accepting updates if "parsing" is set to 
false, i.e. there is no active parser.

All updates made before that are ignored and are treated as parsing operations 
(document initialization / loading).

*When a COSUpdateInfo is accepting updates, it shall itself be updated for the 
following triggers:*
*COSDictionary / COSArray:*
 - An entry is set, replaced or removed.
*COSObject:*
 - The referenced base object is changed.
 - If an object is being dereferenced, updates are suppressed for all added 
structures (as they result from delayed parsing operations).

*When a parent updates a child:*
 - A parent that has a "COSDocumentState" will attempt to copy that state to a 
child that is added to its entries (removed children are ignored and dropped 
without further consequence for that child).
 - A child that already has a "COSDocumentState" will ignore the parent's 
attempt to initialize its context and shall never receive an update in that 
way.
 - A child that receives a context initialization from a parent shall only be 
treated as updated if the given context is accepting updates and that 
update does not result from dereferencing its parent.

As stated, dereferencing is treated as a delayed parsing operation and does 
not justify updates.
A child that has a COSDocumentState has already been touched or may even 
originate from another document (this is handled later, during increment 
evaluation).
A child that receives a parent's context that is accepting updates is treated 
as a freshly created structure that has not been created by a parser 
and hence must be updated.

*The increment creation itself sets update states for the following events:*
 - A parent is iterated that contains a child with a "COSDocumentState" 
context different from the COSIncrement's origin: it is assumed that 
the child originates from another document and has been copied to the 
parent, so that child must receive an update.
 - A child is marked as updated that is also a COSArray or is otherwise marked 
as a "direct" structure: in that case the parent, instead of the child, must be 
updated and added to the increment (even though the parent may not have been 
marked as updated before).

Any updates that occur and are not covered by the statements above are 
not intended to happen.
These are the intended and expected cases that I would claim to be correct. 
Possibly some of those claims are wrong or could be improved?
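
To make the rules above easier to check, here is a minimal sketch of the state handling as a toy model. It is not taken from the patch and uses no PDFBox classes: `DocumentState`, `Node`, `initOrigin` and `add` are hypothetical names chosen for illustration, and the model only covers the "accepting updates", "existing context wins" and "dereferencing is parsing" rules.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the rules above. None of these types are PDFBox classes. */
class UpdateStateSketch {

    /** Hypothetical stand-in for the "COSDocumentState" described above. */
    static final class DocumentState {
        private boolean parsing = true;

        void setParsing(boolean parsing) { this.parsing = parsing; }

        /** Updates are accepted only while no parser is active. */
        boolean isAcceptingUpdates() { return !parsing; }
    }

    /** Hypothetical stand-in for an updatable structure (dictionary or array). */
    static final class Node {
        private DocumentState origin;
        private boolean updated;
        private final List<Node> children = new ArrayList<>();

        boolean isUpdated() { return updated; }

        DocumentState getOrigin() { return origin; }

        /** Gives this node its document context, e.g. when a parser creates it. */
        void initOrigin(DocumentState state, boolean dereferencing) {
            if (origin != null || state == null) {
                return; // an existing context is never overwritten
            }
            origin = state;
            if (state.isAcceptingUpdates() && !dereferencing) {
                updated = true; // treated as a freshly created structure
            }
        }

        /** Adds a child; pass dereferencing=true for delayed parsing. */
        void add(Node child, boolean dereferencing) {
            children.add(child);
            if (origin == null) {
                return; // no context yet, nothing to propagate or record
            }
            child.initOrigin(origin, dereferencing);
            if (origin.isAcceptingUpdates() && !dereferencing) {
                updated = true; // an entry was set outside of parsing
            }
        }
    }

    public static void main(String[] args) {
        DocumentState state = new DocumentState();
        Node root = new Node();
        Node page = new Node();
        root.initOrigin(state, false);
        root.add(page, false);      // still parsing: nothing is marked
        state.setParsing(false);    // loading finished
        Node image = new Node();
        page.add(image, false);     // a real edit after loading
        System.out.println("root updated:  " + root.isUpdated());
        System.out.println("page updated:  " + page.isUpdated());
        System.out.println("image updated: " + image.isUpdated());
    }
}
```

In this model only the edited node and the newly added child end up flagged; the untouched root stays unflagged, which is the behaviour the statements above are meant to describe. A child coming from another document keeps its own context and is left for the increment evaluation to handle.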



[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425445#comment-17425445
 ] 

Christian Appl commented on PDFBOX-5263:




[jira] [Comment Edited] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425404#comment-17425404
 ] 

Christian Appl edited comment on PDFBOX-5263 at 10/7/21, 8:30 AM:
--

*TL;DR: I underlined the most central assumptions for COSIncrement that 
possibly contain errors or room for improvement.*

*Loading:*
I would actually expect the load method to be a lot faster than the saving 
itself, as the only thing happening during loading is setting the update states 
and document context of updatable COS structures.
If the information were determined during the loading or editing of the 
COSDocument (as it was for the observer), you would see that turned around.
This is by choice, as this way a "normal" save can ignore all this 
and is not slowed down by the determination of information that it does not 
require.

If loading time shall be sped up: currently the most costly operation during 
loading is the determination of a COSUpdateInfo's context 
(COSUpdateState#setOriginDocumentState()). Instead the COSParser could set the 
update states of updatable structures without evaluating or providing a 
context at all, which might be a lot of work for possibly not that much of a 
gain (160ms). If one wanted to do that anyway, that method is the place to 
look at and should best be eliminated, as it is the only additional iteration 
during loading that possibly could be replaced with something else.

*Saving:*
As suggested, this does not do much during loading and postpones the heavy load 
of the actual increment creation to the point in time when saving incrementally 
is actually required.
To alter how this behaves, +the only class that would have to be changed is 
COSIncrement+. No other class contains code for the 
actual processing of a COSIncrement. If one wants to speed it up and eliminate 
unnecessary steps, that is the place to go.

*What and Why:*
Originally a user marked a path of the document "manually" as updated, starting 
at the trailer object of a document. The COSWriter would only detect and update 
objects if it could iterate that path, always finding the next updated child, 
until the last updated leaf was reached.
This leads to the inclusion of path elements that had not actually changed 
but still required inclusion in an increment, as otherwise the actually updated 
node would not have been found during writing.
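
The path problem described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not the COSWriter code: `Node`, `needToBeUpdated` as a plain field and `collectAlongFlaggedPath` are hypothetical names, and the real writer does considerably more than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of path-based detection. Not PDFBox code. */
class PathBasedSketch {

    /** Hypothetical tree node carrying a manually set update flag. */
    static final class Node {
        final String name;
        boolean needToBeUpdated;
        final List<Node> children = new ArrayList<>();

        Node(String name) { this.name = name; }

        /** Adds a child and returns it, so that a chain can be built. */
        Node add(Node child) {
            children.add(child);
            return child;
        }
    }

    /** Collects flagged nodes, descending only through flagged parents. */
    static List<String> collectAlongFlaggedPath(Node root) {
        List<String> found = new ArrayList<>();
        collect(root, found);
        return found;
    }

    private static void collect(Node node, List<String> found) {
        if (!node.needToBeUpdated) {
            return; // an unflagged parent hides everything below it
        }
        found.add(node.name);
        for (Node child : node.children) {
            collect(child, found);
        }
    }

    public static void main(String[] args) {
        Node trailer = new Node("trailer");
        Node catalog = trailer.add(new Node("catalog"));
        Node pages = catalog.add(new Node("pages"));
        Node page = pages.add(new Node("page"));
        Node contents = page.add(new Node("contents"));

        contents.needToBeUpdated = true; // only the real change is flagged
        System.out.println(collectAlongFlaggedPath(trailer));

        for (Node n : Arrays.asList(trailer, catalog, pages, page)) {
            n.needToBeUpdated = true;    // flag the whole path as well
        }
        System.out.println(collectAlongFlaggedPath(trailer));
    }
}
```

With only the changed leaf flagged, nothing is found. Once the whole path is flagged, the leaf is found, but every unchanged ancestor is collected along with it.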

*Identifying updated nodes:*
 - ... Hence: +The first (and most costly) assumption of 
COSIncrement is: The whole loaded structure must be iterated to find actually 
updated and possibly isolated nodes that cannot rely on their whole path being 
updated and may be contained in substructures that have neither changed nor 
been informed to be updated. Only those updated nodes shall be part of an 
increment, which avoids adding their unaltered ancestry to an increment. 
The iteration shall start at a given node (in this case, always the document's 
trailer object).+

The observer attempted to collect such nodes at the moment they were being 
updated, which would be an alternative and would avoid the iteration of the 
whole tree, as there would be no need to search for nodes at all. But as it 
turned out, that had some even worse issues.

*Searching the tree:*
 - +COSObject, COSArray and COSDictionary nodes are updatable: 
find those, check those, possibly include those in an 
increment.+ Starting at a given node, COSIncrement will 
iterate the children of such nodes and will only process instances of those 
three classes, using the matching collect(something) method.

 - If a COSDictionary is encountered that has been updated and is not flagged 
as a "direct" node, it is added to the increment. If one of the children of a 
COSDictionary is a direct node and updated, it updates the containing 
COSDictionary instead (possibly causing the inclusion of the parent in 
an increment); such direct nodes are themselves excluded from the increment. +It 
shall be assumed that COSDictionaries could be added as top level objects.+
 - If a COSArray is encountered, it is by default assumed to be written 
directly (the only exception: it had already been contained in a COSObject).

 - +The updatable children of COSDictionaries and COSArrays shall be iterated, 
as they could be updated or contain updated descendants themselves.+

 - If a COSObject is encountered and the COSObject wrapper itself is marked as 
having been updated, the object is actively dereferenced. Otherwise COSObjects 
that have not already been dereferenced are skipped.
 - +COSObjects that shall be updated must be dereferenced.+
 - +COSObjects that have been dereferenced shall be iterated, as they could be 
updated or contain updated descendants themselves.+
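
The search described above can be sketched roughly as follows. This is a simplified toy model, not the COSIncrement code from the patch: `Node`, its `direct` and `updated` fields and `collectIncrement` are hypothetical names, a single `direct` flag stands in for the COSArray and direct-dictionary cases, and the dereferencing rules for COSObject are left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a full-tree search for updated nodes. Not PDFBox code. */
class FullTraversalSketch {

    /** Hypothetical node; "direct" nodes are written inside their holder. */
    static final class Node {
        final String name;
        final boolean direct;
        boolean updated;
        final List<Node> children = new ArrayList<>();

        Node(String name, boolean direct) {
            this.name = name;
            this.direct = direct;
        }

        /** Adds a child and returns it, so that a chain can be built. */
        Node add(Node child) {
            children.add(child);
            return child;
        }
    }

    /** Returns the names of all nodes that belong in the increment. */
    static List<String> collectIncrement(Node root) {
        Set<Node> visited = new HashSet<>();
        Set<Node> increment = new LinkedHashSet<>();
        collect(root, null, visited, increment);
        List<String> names = new ArrayList<>();
        for (Node node : increment) {
            names.add(node.name);
        }
        return names;
    }

    private static void collect(Node node, Node holder,
                                Set<Node> visited, Set<Node> increment) {
        if (!visited.add(node)) {
            return; // already seen, e.g. via a back reference to a parent
        }
        // An updated direct node causes its nearest non-direct holder to be written.
        Node target = node.direct ? holder : node;
        if (node.updated && target != null) {
            increment.add(target);
        }
        // Always descend: unchanged parents may still contain updated descendants.
        for (Node child : node.children) {
            collect(child, target, visited, increment);
        }
    }

    public static void main(String[] args) {
        Node trailer = new Node("trailer", false);
        Node catalog = trailer.add(new Node("catalog", false));
        Node pages = catalog.add(new Node("pages", false));
        Node page = pages.add(new Node("page", false));
        Node resources = page.add(new Node("resources", true));
        Node contents = page.add(new Node("contents", false));
        page.add(pages); // back reference, as a page points to its parent

        resources.updated = true;
        contents.updated = true;
        System.out.println(collectIncrement(trailer));
    }
}
```

In this model the page is collected because its direct resources changed, the content stream is collected because it changed itself, and trailer, catalog and pages are left out although the search passed through them. The price is the one named above: every reachable node is visited once.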

+*Collecting results,

[jira] [Comment Edited] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425404#comment-17425404
 ] 

Christian Appl edited comment on PDFBOX-5263 at 10/7/21, 8:29 AM:
--


[jira] [Comment Edited] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425404#comment-17425404
 ] 

Christian Appl edited comment on PDFBOX-5263 at 10/7/21, 8:21 AM:
--



[jira] [Commented] (PDFBOX-5263) Suggestion: Signing actual document changes - Enhancing incremental saving

2021-10-07 Thread Christian Appl (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425404#comment-17425404
 ] 

Christian Appl commented on PDFBOX-5263:


*TL;DR: I underlined the most central assumptions of COSIncrement that 
possibly contain errors or room for improvement.*

*Loading:*
I would actually expect the load method to be a lot faster than the saving 
itself, as the only thing happening during loading is setting the update states 
and document context of updatable COS structures.
If the information were determined during the loading or editing of the 
COSDocument (as it was for the observer), you would see that turned around.
This is by choice: this way a "normal" save can ignore all this and is not 
slowed down by determining information that it does not require.

If loading time shall be sped up: currently the most costly operation during 
loading is the determination of a COSUpdateInfo's context 
(COSUpdateState#setOriginDocumentState()). Instead, the COSParser could set the 
update states of updatable structures without evaluating or providing a 
context at all. That might be a lot of work for possibly not much of a gain 
(160ms). If one wanted to do that anyway, that method is the place to look at 
and should best be eliminated, as it is the only additional iteration during 
loading and could possibly be replaced with something else.

*Saving:*
As suggested, this does not do much during loading and postpones the heavy 
work of the actual increment creation to the point in time when saving 
incrementally is actually required.
To alter how this behaves, +the only class that would have to be changed is 
COSIncrement+. No other class contains code for the actual processing of a 
COSIncrement. If one wants to speed it up and eliminate unnecessary steps, that 
is the place to go.

*What and Why:*
Originally, a user marked a path of the document "manually" as updated, 
starting at the trailer object of the document. The COSWriter would only detect 
and update objects if it could iterate that path, always finding the next 
updated child until the last updated leaf was reached.
This led to the inclusion of path elements that had not actually changed but 
still had to be included in an increment, as otherwise the actually updated 
node would not have been found during writing.

*Identifying updated nodes:*
- ... Hence: +The first (and most costly) assumption of COSIncrement is: the 
whole loaded structure must be iterated to find nodes that were actually 
updated and are possibly isolated, i.e. that cannot rely on their whole path 
being updated and may be contained in substructures that have neither changed 
nor been informed of an update. Only those updated nodes shall be part of an 
increment, which avoids adding their unaltered ancestry to the increment. 
The iteration shall start at a given node (in this case always the document's 
trailer object).+

The observer attempted to collect such nodes at the moment they were being 
updated. That would be an alternative that avoids iterating the whole tree, 
since no search for nodes would be needed at all. But as it turned out, that 
approach had some even worse issues.

*Searching the tree:*
- +COSObject, COSArray and COSDictionary nodes are updatable - find those, 
check those, possibly include those in an increment.+
Starting at a given node, COSIncrement will iterate the children of such nodes 
and will only process instances of those three classes, using the matching 
collect(something) method.

- If a COSDictionary is encountered, has been updated and is not flagged as a 
"direct" node, it is added to the increment. If one of the children of a 
COSDictionary is a direct node and has been updated, it marks the containing 
COSDictionary as updated instead (possibly causing the inclusion of the parent 
in an increment); such direct nodes are themselves excluded from the increment. 
+It shall be assumed that COSDictionaries could be added as top level objects.+
- If a COSArray is encountered, it is by default assumed to be written directly 
(the only exception being an array that is already contained in a COSObject).

- +The updatable children of COSDictionaries and COSArrays shall be iterated, as 
they could be updated or contain updated descendants themselves.+

- +If a COSObject is encountered and the COSObject wrapper itself is marked as 
having been updated, the object is actively dereferenced. Otherwise, COSObjects 
that have not already been dereferenced are skipped.+
- +COSObjects that shall be updated must be dereferenced.+
- COSObjects that have been dereferenced shall be iterated, as they could be 
updated or contain updated descendants themselves.
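
The search rules above can be sketched roughly as follows. This is a simplified model and not the actual COSIncrement code: Node, Dict, Arr, Ref and IncrementSketch are hypothetical stand-ins for COSBase, COSDictionary, COSArray, COSObject and COSIncrement, and the identity-based "visited" set is an assumption made here to keep the walk from looping. Propagation of updates from direct children is only modelled for dictionaries, as in the description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for the COS classes; none of this is PDFBox API.
abstract class Node {
    boolean updated; // the node itself has been changed
    boolean direct;  // written inline in its parent, never as a top level object
}

class Dict extends Node {
    final Map<String, Node> entries = new LinkedHashMap<String, Node>();
}

class Arr extends Node {
    final List<Node> items = new ArrayList<Node>();
    Arr() { direct = true; } // arrays are assumed to be written directly by default
}

class Ref extends Node {
    Node target;          // the referenced object
    boolean dereferenced; // whether the target has already been loaded
}

class IncrementSketch {
    // Nodes that would become top level objects of the increment.
    final List<Node> objects = new ArrayList<Node>();
    // Assumption: an identity set avoids repetitions and endless loops on cycles.
    private final Set<Node> visited =
            Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());

    void collect(Node node, boolean insideRef) {
        if (node == null || !visited.add(node)) {
            return;
        }
        if (node instanceof Ref) {
            Ref ref = (Ref) node;
            if (ref.updated) {
                ref.dereferenced = true; // updated references are actively dereferenced
            }
            if (ref.dereferenced) {
                collect(ref.target, true); // untouched, unloaded references are skipped
            }
        } else if (node instanceof Dict) {
            Dict dict = (Dict) node;
            // Children first, so that updates of nested direct nodes can bubble up.
            for (Node child : dict.entries.values()) {
                collect(child, false);
            }
            for (Node child : dict.entries.values()) {
                if (child.direct && child.updated) {
                    dict.updated = true; // a direct child updates its container instead
                }
            }
            if (dict.updated && !dict.direct) {
                objects.add(dict);
            }
        } else if (node instanceof Arr) {
            Arr arr = (Arr) node;
            for (Node child : arr.items) {
                collect(child, false);
            }
            // Only an array held by a reference is treated as an object of its own.
            if (arr.updated && insideRef) {
                objects.add(arr);
            }
        }
    }
}
```

Calling collect(trailer, false) on such a model leaves the unaltered ancestors of an updated node out of "objects", which is the point of the first assumption.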

+*Collecting results, avoiding repetitions:*+
- Results are collected in an "objects" fiel

[GitHub] [pdfbox] valerybokov commented on pull request #107: potential memory leaks and small performance improvements

2021-10-07 Thread GitBox


valerybokov commented on pull request #107:
URL: https://github.com/apache/pdfbox/pull/107#issuecomment-937549360


   PDSignature.getByteRange can return an empty array, but the getContents(...) 
methods have no checks. The Javadoc of the getContents() method says "@throws 
IOException if the pdfFile can't be read" but has no entry "@throws 
IndexOutOfBoundsException if ..."
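
A check of the kind meant here could look roughly like the sketch below. ByteRangeCheck and gap are hypothetical names and not part of PDFBox, and the sketch assumes the usual four-value layout [offset1, length1, offset2, length2], with the signature contents lying between the two signed ranges.

```java
import java.util.Arrays;

// Hypothetical helper, not PDFBox API: validates a signature byte range
// before it is used to locate the contents bytes.
class ByteRangeCheck {
    /**
     * Returns {begin, length} of the gap between the two signed ranges.
     * Assumes the layout [offset1, length1, offset2, length2].
     */
    static int[] gap(int[] byteRange) {
        if (byteRange == null || byteRange.length != 4) {
            throw new IllegalStateException(
                    "expected 4 byte range values but got " + Arrays.toString(byteRange));
        }
        int begin = byteRange[0] + byteRange[1];
        int length = byteRange[2] - begin;
        if (begin < 0 || length < 0) {
            throw new IllegalStateException(
                    "inconsistent byte range " + Arrays.toString(byteRange));
        }
        return new int[] { begin, length };
    }
}
```

With such a guard an empty array fails with a descriptive message instead of an index error.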

