[jira] [Commented] (PDFBOX-5851) When this PDF is rendered with the "f" Operator, a black screen appears.

2024-07-19 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867312#comment-17867312
 ] 

Michael Klink commented on PDFBOX-5851:
---

{quote}/P12 is a pattern. What's missing here is to set this to a pattern 
colorspace. The PDF created by the CreatePatterns.java example looks like this:
{noformat}
50 500 200 200 re
/cs1 cs  < this is missing
/p1 scn
f{noformat}{quote}

That code is invalid!
Between defining the path and painting it no **cs** or **scn** is allowed.
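
The rule can be sketched as a small check over the operator sequence of a content stream. This is only an illustration: {{PathOperatorCheck}} and its method are hypothetical names, not PDFBox code, and the check is simplified (it looks at operators only, ignores operands, and does not model the stricter rules after a clipping operator).

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: flags operators that appear
// between starting a path and painting it.
final class PathOperatorCheck {
    private static final Set<String> PATH_START = Set.of("m", "re");
    private static final Set<String> PATH_CONSTRUCTION =
            Set.of("m", "l", "c", "v", "y", "h", "re");
    private static final Set<String> CLIPPING = Set.of("W", "W*");
    private static final Set<String> PATH_PAINTING =
            Set.of("S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n");

    // Returns the first operator that is not allowed inside a path object,
    // or null if the operator sequence respects the rule.
    static String firstIllegalOperator(List<String> operators) {
        boolean inPath = false;
        for (String op : operators) {
            if (!inPath) {
                // A path object starts with m or re.
                inPath = PATH_START.contains(op);
            } else if (PATH_PAINTING.contains(op)) {
                // Painting ends the path object.
                inPath = false;
            } else if (!PATH_CONSTRUCTION.contains(op) && !CLIPPING.contains(op)) {
                return op;
            }
        }
        return null;
    }
}
```

For the quoted sequence ({{re}}, {{cs}}, {{scn}}, {{f}}) the check reports {{cs}}; with {{cs}} and {{scn}} moved in front of {{re}} it reports nothing.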

> When this PDF is rendered with the "f" Operator, a black screen appears.
> 
>
> Key: PDFBOX-5851
> URL: https://issues.apache.org/jira/browse/PDFBOX-5851
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Rendering
>Affects Versions: 2.0.31, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: Pattern
> Fix For: 2.0.32, 3.0.3 PDFBox, 4.0.0
>
> Attachments: image-2024-07-19-16-58-35-439.png, 
> image-2024-07-19-16-58-57-515.png, image-2024-07-19-17-41-18-618.png, 
> image2-scratch_unc.pdf, image2.pdf, screenshot-1.png
>
>
> [^image2.pdf]
> !image-2024-07-19-16-58-35-439.png|width=345,height=187!
> !image-2024-07-19-16-58-57-515.png|width=214,height=277!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Comment Edited] (PDFBOX-5851) When this PDF is rendered with the "f" Operator, a black screen appears.

2024-07-19 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17867312#comment-17867312
 ] 

Michael Klink edited comment on PDFBOX-5851 at 7/19/24 12:46 PM:
-

{quote}/P12 is a pattern. What's missing here is to set this to a pattern 
colorspace. The PDF created by the CreatePatterns.java example looks like this:
{noformat}
50 500 200 200 re
/cs1 cs  < this is missing
/p1 scn
f{noformat}{quote}

That code is invalid!
Between defining the path and painting it no *cs* or *scn* is allowed.


was (Author: mkl):
{quote}/P12 is a pattern. What's missing here is to set this to a pattern 
colorspace. The PDF created by the CreatePatterns.java example looks like this:
{noformat}
50 500 200 200 re
/cs1 cs  < this is missing
/p1 scn
f{noformat}{quote}

That code is invalid!
Between defining the path and painting it no **cs** or **scn** is allowed.







[jira] [Commented] (PDFBOX-5834) [PATCH] PDF split missing names from documentCatalog

2024-06-05 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852506#comment-17852506
 ] 

Michael Klink commented on PDFBOX-5834:
---

{quote}I didnt want to copy visible ones to avoid increasing filesize{quote}

But isn't that a very arbitrary choice? Maybe others would only need the 
visible named pages, not the invisible ones. Or some other subset.

How about a {{Splitter}} with a setter to set the selection of names of 
templates to copy? Or to set an acceptor callback?



I'm sorry if I sound very negative here. I merely think that if one wants to 
improve the {{Splitter}} (IMO it's not a _major bug_ that the {{Splitter}} does 
not copy most document level material), one should not blindly add an arbitrary 
amount of hidden named pages (and also of other named resources like document 
level JavaScript, document level file attachments, ...) to each output PDF. As 
a minimum one should add a switch whether or not to add such document level 
material, better even some acceptors to select. E.g. maybe one needs the 
templates only with the first splitter output, or each output needs only one of 
the file attachments, and so on.
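
Such an acceptor could look roughly like the following sketch. All names here are hypothetical; this is not the existing {{Splitter}} API, only a stand-alone model of the selection step in which a callback decides per output document which document-level names get copied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch, not the PDFBox Splitter API: an acceptor decides per
// output document which document-level names are copied.
final class NamedMaterialSelection {
    // Default: copy nothing, i.e. leave out document-level material.
    private BiPredicate<Integer, String> acceptor = (outputIndex, name) -> false;

    void setAcceptor(BiPredicate<Integer, String> acceptor) {
        this.acceptor = acceptor;
    }

    // Returns the names to copy into the output document with the given index.
    List<String> namesFor(int outputIndex, List<String> allNames) {
        List<String> selected = new ArrayList<>();
        for (String name : allNames) {
            if (acceptor.test(outputIndex, name)) {
                selected.add(name);
            }
        }
        return selected;
    }
}
```

With {{setAcceptor((outputIndex, name) -> outputIndex == 0)}}, for example, the templates would only go into the first splitter output.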

> [PATCH] PDF split missing names from documentCatalog
> 
>
> Key: PDFBOX-5834
> URL: https://issues.apache.org/jira/browse/PDFBOX-5834
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Simon Steiner
>Priority: Major
> Attachments: tmp.patch
>
>
> java -jar app/target/pdfbox-app-2.0.32-SNAPSHOT.jar PDFSplit xxx.pdf
> I would expect to see the names dict inside the documentCatalog which is used 
> to store pdf templates






[jira] [Commented] (PDFBOX-5834) [PATCH] PDF split missing names from documentCatalog

2024-06-05 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17852472#comment-17852472
 ] 

Michael Klink commented on PDFBOX-5834:
---

Your patch looks very focused on a specific use case - copying invisible named 
pages while explicitly dropping visible ones. Why that?

Also you copy page-like objects (the invisible templates) between documents by 
simply adding them. If you look at {{Splitter.processPage}}, though, you'll see 
that regular pages are copied in a much different way.

I'd propose reconsidering the intention (why only copy invisible named pages) 
and making sure the eventual implementation handles the named pages with as 
much care as the remaining {{Splitter}} does.







[jira] [Commented] (PDFBOX-5829) IOException: Error expected floating point numberactual='-12.-1'

2024-05-26 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849522#comment-17849522
 ] 

Michael Klink commented on PDFBOX-5829:
---

{quote}Registering some listener sounds like another complexity level {quote}

Yes, but at least all the interpretation of malformed data would be bundled.

If someone now asks you for the code that repairs data, you have to point here 
and there and elsewhere etc.

(OK, admittedly, being more focused on signing use cases, I'm interested in 
appearances not changing because different processors re-interpret invalid 
data differently, while other users are simply happy to be able to process all 
incoming junk and to point to the original PDF producer if a repair suddenly 
makes a difference. So I may be a minority voice in this regard.)

> IOException: Error expected floating point numberactual='-12.-1'
> 
>
> Key: PDFBOX-5829
> URL: https://issues.apache.org/jira/browse/PDFBOX-5829
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.31, 3.0.2 PDFBox, 4.0.0
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.32, 3.0.3 PDFBox, 4.0.0
>
> Attachments: PDFBOX-5829.pdf
>
>







[jira] [Commented] (PDFBOX-5829) IOException: Error expected floating point numberactual='-12.-1'

2024-05-26 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849518#comment-17849518
 ] 

Michael Klink commented on PDFBOX-5829:
---

Are you sure you really want to force-interpret every bit of junk that is in a 
PDF instead of a number?

There is always a chance that you interpret it differently than originally 
intended. The PDFBOX-3500 interpretations are already questionable: what makes 
you sure {{0.-262}} was meant to mean {{-0.262}} and not e.g. two numbers 
{{0. -262}}? Similarly here, is {{-12.-1}} actually {{12.1}} (minus times 
minus), {{-12.1}} (overeager minus addition), {{-12. -1}} (two numbers), or 
something else entirely?

Yes, you can of course look at how Acrobat appears to interpret that and copy 
that behavior, but Acrobat is allowed to be a moving target concerning its 
interpretation of invalid data.

As an alternative, what about an option to register some listener that allows 
customizing the handling of invalid numbers (or other data structures with 
invalid format, e.g. invalid dates)? PDFBox could already come with two 
implementations, a strict one that rejects all invalid stuff, and a more 
relaxed one that tries to repair the data the way Acrobat does.
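
A minimal sketch of what such a listener might look like follows. The interface and both implementations are hypothetical and do not exist in PDFBox; the relaxed variant implements just one of the possible repairs (any minus sign makes the number negative), and whether that matches the producer's intent or Acrobat's behavior is an assumption.

```java
import java.io.IOException;

// Hypothetical callback, not an existing PDFBox interface.
interface InvalidNumberHandler {
    // Returns the value to use for a malformed number token,
    // or throws to reject the token.
    float handle(String token) throws IOException;
}

final class InvalidNumberHandlers {
    // Strict: every malformed token is an error.
    static final InvalidNumberHandler STRICT = token -> {
        throw new IOException("Invalid number: '" + token + "'");
    };

    // Relaxed: one possible repair. If the token contains any '-', the result
    // is negative; all '-' characters are dropped from the digits.
    static final InvalidNumberHandler RELAXED = token -> {
        String sign = token.contains("-") ? "-" : "";
        String digits = token.replace("-", "");
        try {
            return Float.parseFloat(sign + digits);
        } catch (NumberFormatException e) {
            throw new IOException("Invalid number: '" + token + "'", e);
        }
    };
}
```

Bundling the repairs behind one interface like this would keep every interpretation of malformed numbers in a single place, and a user who must not accept guessed values could plug in the strict variant.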








[jira] [Commented] (PDFBOX-5788) ID References changes when saving PDFs.

2024-03-20 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829197#comment-17829197
 ] 

Michael Klink commented on PDFBOX-5788:
---

You should not expect separately saved copies of a PDF to be identical byte 
for byte. In general, numerous details may differ: the second part of the ID, 
the modification date and time, even all encrypted data (as some encryption 
algorithms require random inputs).
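
The point about random inputs can be illustrated with plain JDK classes, independent of PDFBox. The sketch below (class name hypothetical) encrypts with AES in CBC mode under a fresh random initialization vector and prepends the IV to the ciphertext, similar to how AES-encrypted PDF strings and streams carry their IV in the first 16 bytes; encrypting the same bytes twice with the same key gives different results.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Stand-alone illustration, not PDFBox code.
final class RandomIvDemo {
    // Encrypts with AES-CBC under a fresh random IV and returns IV + ciphertext.
    static byte[] encrypt(byte[] key, byte[] plain) throws GeneralSecurityException {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                new IvParameterSpec(iv));
        byte[] encrypted = cipher.doFinal(plain);
        // Prepend the IV so that a reader could decrypt later.
        byte[] result = Arrays.copyOf(iv, iv.length + encrypted.length);
        System.arraycopy(encrypted, 0, result, iv.length, encrypted.length);
        return result;
    }
}
```

A test that hashes the saved bytes therefore cannot be stable for encrypted documents, whatever the writer does with object numbers.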


> ID References changes when saving PDFs.
> ---
>
> Key: PDFBOX-5788
> URL: https://issues.apache.org/jira/browse/PDFBOX-5788
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 3.0.1 PDFBox, 3.0.2 PDFBox
>Reporter: Daniel Persson
>Priority: Minor
>
>  
> {code:java}
> private static void runPDF(String name) throws IOException, 
> NoSuchAlgorithmException {
> PDDocument doc = Loader.loadPDF(new File(name));
> File tmpFile = File.createTempFile("tmp", ".pdf");
> doc.save(tmpFile);
> byte[] data = Files.readAllBytes(Paths.get(tmpFile.getAbsolutePath()));
> byte[] hash = MessageDigest.getInstance("SHA256").digest(data);
> System.out.println(encodeHexString(hash));
> File tmpFile2 = File.createTempFile("tmp", ".pdf");
> doc.save(tmpFile2);
> byte[] data2 = Files.readAllBytes(Paths.get(tmpFile2.getAbsolutePath()));
> byte[] hash2 = MessageDigest.getInstance("SHA256").digest(data2);
> System.out.println(encodeHexString(hash2));
> } {code}
> Not sure, this might be expected behavior but it makes my testing framework a 
> bit less robust so I thought I'd report it here. In the newer versions 3.0.2 
> and 3.0.1 when you save a PDF the second time the reference ID's continue 
> incrementing which means that the PDF stored the first time is not identical 
> to the second time.
> In my test case depending on what thread executes first there might be 
> difference in the run and the expected result changes.
> I've not seen this with 3.0.0 and earlier versions of PDFBox.






[jira] [Commented] (PDFBOX-5772) Inconsistent signature page handling when signing in existing signature fields

2024-02-22 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819596#comment-17819596
 ] 

Michael Klink commented on PDFBOX-5772:
---

Great!

I forwarded that information to the DSS issue.

> Inconsistent signature page handling when signing in existing signature fields
> --
>
> Key: PDFBOX-5772
> URL: https://issues.apache.org/jira/browse/PDFBOX-5772
> Project: PDFBox
>  Issue Type: Bug
>  Components: Signing
>Affects Versions: 2.0.30, 3.0.1 PDFBox
>Reporter: Michael Klink
>Priority: Major
> Fix For: 2.0.31, 3.0.2 PDFBox, 4.0.0
>
>
> In eSig DSS issue DSS-3269 - 
> [https://ec.europa.eu/digital-building-blocks/tracker/browse/DSS-3269] - it 
> became apparent that {{PDDocument.addSignature(PDSignature, 
> SignatureInterface, SignatureOptions)}} does not consistently handle the 
> signature page while signing in existing signature fields:
> On one hand that method does look for an existing signature field of the 
> given name and - if found - explicitly does not overwrite the link to the 
> page of the widget. This implies that existing signature fields are supported 
> as they are and are not to be re-located.
> On the other hand that method unconditionally adds the signature widget to 
> the page indicated by the signature option page number.
> In the linked DSS issue this causes signing an existing signature field on a 
> different page than in the options to appear both on the original page and 
> the option page, which is not desired.
> Expected behavior would be that in case of an existing signature field and 
> widget 
> * either the signature option page is ignored and only the current page of 
> the signature field is used
> * or the field widget is consistently moved to the signature option page, 
> removing it from its original page.
> The natural option would be the former one.






[jira] [Created] (PDFBOX-5772) Inconsistent signature page handling when signing in existing signature fields

2024-02-19 Thread Michael Klink (Jira)
Michael Klink created PDFBOX-5772:
-

 Summary: Inconsistent signature page handling when signing in 
existing signature fields
 Key: PDFBOX-5772
 URL: https://issues.apache.org/jira/browse/PDFBOX-5772
 Project: PDFBox
  Issue Type: Bug
  Components: Signing
Affects Versions: 3.0.1 PDFBox, 2.0.30
Reporter: Michael Klink


In eSig DSS issue DSS-3269 - 
[https://ec.europa.eu/digital-building-blocks/tracker/browse/DSS-3269] - it 
became apparent that {{PDDocument.addSignature(PDSignature, SignatureInterface, 
SignatureOptions)}} does not consistently handle the signature page while 
signing in existing signature fields:

On one hand that method does look for an existing signature field of the given 
name and - if found - explicitly does not overwrite the link to the page of the 
widget. This implies that existing signature fields are supported as they are 
and are not to be re-located.

On the other hand that method unconditionally adds the signature widget to the 
page indicated by the signature option page number.

In the linked DSS issue this causes signing an existing signature field on a 
different page than in the options to appear both on the original page and the 
option page, which is not desired.

Expected behavior would be that in case of an existing signature field and 
widget 

* either the signature option page is ignored and only the current page of the 
signature field is used
* or the field widget is consistently moved to the signature option page, 
removing it from its original page.

The natural option would be the former one.
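
The former option boils down to a simple rule, sketched here with hypothetical names (this is not the actual {{PDDocument.addSignature}} code): the widget of an existing field stays on its page, and the page from the signature options is only used for new fields.

```java
// Hypothetical model, not PDFBox code: picks the page a signature widget
// belongs on.
final class SignaturePageChoice {
    // existingWidgetPage is the zero-based page index of an existing field's
    // widget, or -1 if the field is new; optionsPage is the page index taken
    // from the signature options.
    static int pageFor(int existingWidgetPage, int optionsPage) {
        return existingWidgetPage >= 0 ? existingWidgetPage : optionsPage;
    }
}
```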






[jira] [Commented] (PDFBOX-5770) When adding watermark to PDF, there may be a native memory leak

2024-02-18 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818291#comment-17818291
 ] 

Michael Klink commented on PDFBOX-5770:
---

ok, thanks, I also saw that moving on.

 

> When adding watermark to PDF, there may be a native memory leak
> ---
>
> Key: PDFBOX-5770
> URL: https://issues.apache.org/jira/browse/PDFBOX-5770
> Project: PDFBox
>  Issue Type: Bug
>Reporter: weiteFeng
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When using the following code to add a watermark to a PDF file, the memory 
> usage of the Java process will gradually increase, even exceeding the limit 
> of the maximum heap memory usage. When the process uses memory exceeding the 
> maximum memory of the machine, the Java process will be killed by the 
> operating system.
> When analyzing the dumped memory, I found that when the Java process occupies 
> a large amount of memory (viewed through the top command), the heap memory of 
> the process actually does not occupy too much space, so I inferred that there 
> may be a native memory leak in this code, due to I don't have a deep 
> understanding of Linux memory analysis, so I can't find the problem in this 
> code.
> I wonder if you have any suggestion.
> The following is the code I use to add watermarks to PDF:






[jira] [Commented] (PDFBOX-5770) When adding watermark to PDF, there may be a native memory leak

2024-02-18 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818288#comment-17818288
 ] 

Michael Klink commented on PDFBOX-5770:
---

Where is the code?







[jira] [Commented] (PDFBOX-5717) NullPointerException calling saveIncrementalForExternalSigning

2023-11-22 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788843#comment-17788843
 ] 

Michael Klink commented on PDFBOX-5717:
---

PDFBox has a problem with object 3 - in the catalog it is referred to as name 
tree base for JavaScript:

{code}
4
0
obj
<<
/Type
/Catalog
/Names
<<
/JavaScript
3
0
R
>>
...
{code}

But there is no object 3 in the file:

{code}
xref
0 246
0000000002 65535 f 
0002000782 00000 n 
0000000003 00000 f 
0000000000 00000 f 
...
{code}

According to the spec, PDFBox should treat this as a {{null}} object:

{quote}
An indirect reference to an undefined object shall not be considered an error 
by a PDF processor; it shall be treated as a reference to the null object.
{quote}

Unfortunately it doesn't:

{code}
Thread [main] (Suspended (exception NullPointerException))  
owns: Hashtable<K,V>  (id=64)   
Hashtable<K,V>.computeIfAbsent(K, Function<? super K,? extends V>) line: 1032   
COSWriter.getObjectKey(COSBase) line: 1089  
COSWriter.writeReference(COSBase) line: 1367
COSWriter.visitFromDictionary(COSDictionary) line: 1207 
COSWriter.writeDictionary(COSDictionary) line: 1155 
COSWriter.visitFromDictionary(COSDictionary) line: 1202 
COSDictionary.accept(ICOSVisitor) line: 1265
COSWriter.doWriteObject(COSObjectKey, COSBase) line: 610
COSWriter.doWriteObject(COSBase) line: 643  
COSWriter.doWriteObjects() line: 540
COSWriter.doWriteBody(COSDocument) line: 450
COSWriter.visitFromDocument(COSDocument) line: 1299 
COSDocument.accept(ICOSVisitor) line: 413   
COSWriter.write(PDDocument, SignatureInterface) line: 1568  
COSWriter.write(PDDocument) line: 1444  
PDDocument.saveIncrementalForExternalSigning(OutputStream) line: 1186   
{code}
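
The following stand-alone sketch shows the two halves of the problem with JDK classes only. The names are hypothetical and this is not the {{COSWriter}} code: it assumes that a Java {{null}} reached {{Hashtable.computeIfAbsent}} as the key (which is what the trace suggests), and it shows how resolving undefined objects to a null-object sentinel would avoid that.

```java
import java.util.Hashtable;
import java.util.Map;

// Stand-alone illustration, not PDFBox code.
final class NullReferenceDemo {
    // Stand-in for the PDF null object (hypothetical sentinel).
    static final Object NULL_OBJECT = new Object();

    // Resolves an object number; undefined objects resolve to NULL_OBJECT
    // instead of a Java null.
    static Object resolve(Map<Integer, Object> objects, int objectNumber) {
        Object resolved = objects.get(objectNumber);
        return resolved != null ? resolved : NULL_OBJECT;
    }

    // Shows that Hashtable rejects a null key with a NullPointerException.
    static boolean hashtableRejectsNullKey() {
        Hashtable<Object, Integer> keys = new Hashtable<>();
        try {
            keys.computeIfAbsent(null, key -> 1);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```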

> NullPointerException calling saveIncrementalForExternalSigning
> --
>
> Key: PDFBOX-5717
> URL: https://issues.apache.org/jira/browse/PDFBOX-5717
> Project: PDFBox
>  Issue Type: Bug
>  Components: Signing
>Affects Versions: 3.0.0 PDFBox
>Reporter: Daniele Ribaudo
>Priority: Major
> Attachments: 
> Cryptomathic_White_Paper_-_eIDAS_Compliant_Remote_eSigning.pdf
>
>
> I tried to apply a digital signature to the attached PDF using the method 
> PDDocument.saveIncrementalForExternalSigning in the release 3.0.0 of PDFBox 
> but a NPE is thrown every time.
> The same action executed on a 2.0.x release is successfully completed.
> Here you are a snipped code to reproduce the error:
>
> {code:java}
> PDDocument document = Loader.loadPDF(new 
> File("Cryptomathic_White_Paper_-_eIDAS_Compliant_Remote_eSigning.pdf"));
> document.addSignature(new PDSignature(), new SignatureOptions());
> ExternalSigningSupport externalSigning = 
> document.saveIncrementalForExternalSigning(new ByteArrayOutputStream());
> {code}






[jira] [Commented] (PDFBOX-5709) Getting document corrupted while signing hash which has DER encoded signed attributes

2023-11-02 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17782062#comment-17782062
 ] 

Michael Klink commented on PDFBOX-5709:
---

From former Stack Overflow questions by you I assume you use that 
{{ContentSigner}} in a {{CMSSignedDataGenerator}} that produces a CMS 
signature container for an {{ExternalSigningSupport}}-based PDF signing routine.

In that case it is completely wrong to create your own set of signed attributes 
in the {{ContentSigner.getSignature}} method: The BouncyCastle 
{{CMSSignedDataGenerator}} usually creates (and embeds!) its own set of such 
attributes and merely asks your {{ContentSigner}} to sign it but you return a 
signature for your own set which is not embedded.

You can customize the signed attributes generated by {{CMSSignedDataGenerator}} 
by setting a custom signed attribute generator (a {{CMSAttributeTableGenerator}}) 
on the {{SignerInfoGeneratorBuilder}} via {{setSignedAttributeGenerator}}.



> Getting document corrupted while signing hash which has DER encoded signed 
> attributes
> -
>
> Key: PDFBOX-5709
> URL: https://issues.apache.org/jira/browse/PDFBOX-5709
> Project: PDFBox
>  Issue Type: Bug
>  Components: Signing
>Reporter: Tanmay Sharma
>Priority: Critical
>
> I am trying to do external signing. For that we use to calculate hash of pdf 
> and get it sign using some external trust service provider. Now our use case 
> is that instead of signing hash bytes we need to do signing over DER encoding 
> signing attributes. But after generating signed hash and embedding it to 
> document we are getting document corrupted error.
> Code of content signer is 
> {code:java}
> ContentSigner contentSigner = new ContentSigner() {
> private MessageDigest digest = MessageDigest.getInstance("SHA-256");
> private OutputStream stream = OutputStreamFactory.createStream(digest);
> @SneakyThrows
> @Override
> public byte[] getSignature() {
> try {
> byte[] b = new byte[4096];
> int count;
> while ((count = inputStream.read(b)) > 0) {
> digest.update(b, 0, count);
> }
> byte[] hashBytes = digest.digest();
> byte[] derEncoded = getAuthenticatedAttributeSet(hashBytes, 
> calendar).getEncoded(ASN1Encoding.DER);
> List hash = Arrays.asList(new 
> String(org.bouncycastle.util.encoders.Base64.encode(derEncoded)));
> byte[] signedHash = getSignedHash(hash, 
> cscCredentialOptions.getAuthorizationContext().getAccessToken(),
> cscCredentialOptions.getCredentialId(), 
> cscCredentialOptions.getCredentialAuthParameters().getPin(), signAlgo);
> return signedHash;
> } catch (Exception e) {
> LOG.error(e.getMessage());
> }
> }
> @Override
> public OutputStream getOutputStream() {
> return stream;
> }
> @Override
> public AlgorithmIdentifier getAlgorithmIdentifier() {
> return new AlgorithmIdentifier(new 
> ASN1ObjectIdentifier("1.2.840.113549.1.1.11"));
> }
> };{code}
> {code:java}
> public DERSet getAuthenticatedAttributeSet(byte secondDigest[], Calendar 
> signingTime) {
> ASN1EncodableVector attribute = new ASN1EncodableVector();
> ASN1EncodableVector v = new ASN1EncodableVector();
> v.add(new ASN1ObjectIdentifier("1.2.840.113549.1.9.3"));
> v.add(new DERSet(new ASN1ObjectIdentifier("1.2.840.113549.1.7.1")));
> attribute.add(new DERSequence(v));
> v = new ASN1EncodableVector();
> v.add(new ASN1ObjectIdentifier("1.2.840.113549.1.9.5"));
> v.add(new DERSet(new DERUTCTime(signingTime.getTime())));
> attribute.add(new DERSequence(v));
> v = new ASN1EncodableVector();
> v.add(new ASN1ObjectIdentifier("1.2.840.113549.1.9.4"));
> v.add(new DERSet(new DEROctetString(secondDigest)));
> attribute.add(new DERSequence(v));
> return new DERSet(attribute);
> }{code}
>  
>  






[jira] [Commented] (PDFBOX-5696) COSStream lost, becomes a COSDictionary

2023-10-09 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773210#comment-17773210
 ] 

Michael Klink commented on PDFBOX-5696:
---

{quote}because the appearance stream is now a dictionary{quote}

Not only has that appearance stream lost its stream data, it also has been 
added to an object stream which is not possible for streams. This might even 
have caused the loss of the stream data.

Furthermore, another issue becomes apparent already in the first saved object: 
This single revision document has a cross reference stream with segmented 
object number ranges:
{noformat}/Index [0 3 4 1 8 13 22 3 26 9
36 15 52 5]
{noformat}
But a single revision document must have a cross reference with a single 
subsection only, starting with object number 0 and having _Size_ entries. Also 
there must be a mapping for every object number from 0 to _Size - 1_, which 
isn't the case here either.
Usually PDF viewers have no problems if these rules aren't followed, but in some 
cases (e.g. signature validation) that might result in issues.
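
Both rules can be checked mechanically. The sketch below uses hypothetical names and is not PDFBox code; it takes the {{/Index}} array as non-negative ints plus the _Size_ value and reports whether there is a single full subsection and which object numbers have no mapping.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone illustration, not PDFBox code.
final class XrefIndexCheck {
    // True if /Index consists of the single subsection [0 size].
    static boolean isSingleFullSubsection(int[] index, int size) {
        return index.length == 2 && index[0] == 0 && index[1] == size;
    }

    // Lists the object numbers in 0..size-1 that no subsection covers.
    static List<Integer> uncoveredObjectNumbers(int[] index, int size) {
        boolean[] covered = new boolean[size];
        // /Index holds pairs: first object number, number of entries.
        for (int i = 0; i + 1 < index.length; i += 2) {
            for (int n = index[i]; n < index[i] + index[i + 1] && n < size; n++) {
                covered[n] = true;
            }
        }
        List<Integer> missing = new ArrayList<>();
        for (int n = 0; n < size; n++) {
            if (!covered[n]) {
                missing.add(n);
            }
        }
        return missing;
    }
}
```

For the {{/Index}} array quoted above, and assuming a _Size_ of 57 (the highest covered object number plus one; the actual value is not quoted here), the object numbers 3, 5, 6, 7, 21, 25, 35 and 51 have no mapping.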

> COSStream lost, becomes a COSDictionary
> ---
>
> Key: PDFBOX-5696
> URL: https://issues.apache.org/jira/browse/PDFBOX-5696
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 3.0.0 PDFBox
>Reporter: Tilman Hausherr
>Priority: Critical
> Attachments: 30-1.pdf, 30-2.pdf, CambriaMath.ttf, TemplateTank.pdf
>
>
> This is reduced from the example presented by Pados Attila on the mailing list
> {code}
> public static void main(String[] args) throws IOException
> {
> InputStream resourceAsStream = 
> NewClass.class.getResourceAsStream("/pdf/TemplateTank.pdf");
> try (PDDocument a1doc = Loader.loadPDF(new 
> RandomAccessReadBuffer(resourceAsStream)))
> {
> PDAcroForm form = a1doc.getDocumentCatalog().getAcroForm();
> PDResources dr = form.getDefaultResources();
> form.getField("Site Name").setValue("Site Name");
> a1doc.save(new File("30-1.pdf"));
> PDFont font = PDType0Font.load(a1doc, 
> PdfGenerator.class.getResourceAsStream("/fonts/CambriaMath.ttf"), false);
> dr.add(font);
> a1doc.save(new File("30-2.pdf"));
> }
> }
> {code}
> The result file 30-1.pdf has "Site Name" in the rendering, the result file 
> 30-2.pdf doesn't, because the appearance stream is now a dictionary.
> (The test was done on jdk21, I didn't test on another jdk)






[jira] [Commented] (PDFBOX-5692) Can we use pdfbox to finde certificate information of timestamp token present in signature timestamp attribute

2023-10-05 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17772106#comment-17772106
 ] 

Michael Klink commented on PDFBOX-5692:
---

See the answer to your 
[cross-post|https://stackoverflow.com/a/77234972/1729265] on Stack Overflow.

> Can we use pdfbox to finde certificate information of timestamp token present 
> in signature timestamp attribute
> --
>
> Key: PDFBOX-5692
> URL: https://issues.apache.org/jira/browse/PDFBOX-5692
> Project: PDFBox
>  Issue Type: Wish
>  Components: Signing
>Reporter: Tanmay Sharma
>Priority: Major
>
> A document is digitally signed. A timestamp token is embedded as signature 
> timestamp attribute while signing the document. How can we find the 
> certificate information of that timestamp token using pdfbox?
>  






[jira] [Commented] (PDFBOX-5688) PDFTextStripper not returning full text of document

2023-09-29 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770514#comment-17770514
 ] 

Michael Klink commented on PDFBOX-5688:
---

The PDF does not contain the information required to extract text.

There is a font and it contains some glyph drawing instructions for a number of 
character codes. Thus, you see plain text. What is missing, though, is a 
mapping from those character codes or those glyph drawing instructions to a 
character in a known encoding like Unicode. Thus, PDFBox has to guess such a 
mapping but the guess goes wrong.

(If you do a regular copy in Adobe Acrobat, the result also is garbage, 
by the way. That document has been created without the intent to allow text 
extraction, maybe even with the intent to make text extraction difficult.)
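
To illustrate why a missing mapping is fatal for extraction, here is a minimal sketch. It is not PDFBox code; the class {{CodeToUnicodeDemo}}, the method {{decode}} and the map standing in for a ToUnicode-like table are all made up for this example. Character codes without an entry simply cannot be turned into text, so the sketch emits U+FFFD for them, whereas a real extractor would have to guess.

```java
import java.util.Map;

final class CodeToUnicodeDemo {
    /**
     * Translates character codes to text using a code-to-Unicode table.
     * Codes without an entry become U+FFFD (REPLACEMENT CHARACTER).
     */
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }
}
```

With a complete table the codes decode cleanly; with an empty table, which is the situation assumed for the attached PDF, nothing meaningful can be recovered even though the glyphs still render fine.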

> PDFTextStripper not returning full text of document
> ---
>
> Key: PDFBOX-5688
> URL: https://issues.apache.org/jira/browse/PDFBOX-5688
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.29
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: pdfbox-get-text-problem.pdf
>
>
>  
> {code:java}
> try (PDDocument document = PDDocument.load(pdfFile)) {
>   String text = new PDFTextStripper().getText(document);
> } {code}
> > getText above returns only a few characters (\n, \r, 0, 1), but when the PDF 
> > is opened in Chrome or Acrobat it appears to have lots of text.






[jira] [Commented] (PDFBOX-5674) refactor string operations

2023-09-04 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761870#comment-17761870
 ] 

Michael Klink commented on PDFBOX-5674:
---

{quote}The equality check for empty strings was converted from an equals method 
call ("".equals(stringVar)) to use the isEmpty() method.{quote}

Beware, {{"".equals(stringVar)}} does work for {{null}} values, 
{{stringVar.isEmpty()}} throws NPEs in that case. Thus, please make sure that 
{{stringVar}} cannot be {{null}} there.
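
To make the difference concrete, here is a small self-contained sketch. The class and method names are made up for illustration and are not taken from the patch.

```java
final class EmptyCheckDemo {
    /** Null-safe: the receiver is a constant, so null input just yields false. */
    static boolean isEmptyViaEquals(String s) {
        return "".equals(s);
    }

    /** Not null-safe: dereferences s and throws NullPointerException for null. */
    static boolean isEmptyViaIsEmpty(String s) {
        return s.isEmpty();
    }

    /** A variant of the refactoring that keeps the old behaviour for null input. */
    static boolean isEmptyNullSafe(String s) {
        return s != null && s.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isEmptyViaEquals(null));
        System.out.println(isEmptyNullSafe(null));
        try {
            isEmptyViaIsEmpty(null);
        } catch (NullPointerException e) {
            System.out.println("NPE for null input");
        }
    }
}
```

Where {{stringVar}} may be {{null}}, something like {{isEmptyNullSafe}} keeps the previous semantics.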

> refactor string operations
> --
>
> Key: PDFBOX-5674
> URL: https://issues.apache.org/jira/browse/PDFBOX-5674
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Axel Howind
>Priority: Minor
> Attachments: refactor_String_operations.patch
>
>
> * Conversion of StringBuilder to String was simplified by using the objects 
> directly in the concatenation, instead of calling the toString() function.
> * The equality check for empty strings was converted from an equals method 
> call ("".equals(stringVar)) to use the isEmpty() method.
> * When obtaining a one-character substring from a string, charAt() was used 
> in preference over substring().
> * The new keyword was discarded when converting a single character into a 
> string, using String.valueOf() instead.
> * The creation of a string from a ByteArrayOutputStream was simplified to 
> call toString(encoding) directly.






[jira] [Comment Edited] (PDFBOX-5665) NPE when converting pdf to image.

2023-08-30 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760272#comment-17760272
 ] 

Michael Klink edited comment on PDFBOX-5665 at 8/30/23 8:14 AM:


{quote}The cause is a comment in a content stream. I doubt that this is legit: 
"Any occurrence of the PERCENT SIGN (25h) outside a string or stream introduces 
a comment".
{quote}
It is valid.

You quote from ISO 32000-1. Even there that characterization was incomplete, 
as some example content streams show, e.g. the colored tiling pattern 
examples.

In ISO 32000-2 that has been changed to
{quote}Any occurrence of the PERCENT SIGN (25h) outside a string or inside a 
content stream (see 7.8.2, "Content streams") introduces a comment.
{quote}
which isn't much better: strictly speaking, it even makes a `%` in an 
arbitrary binary stream the start of a comment, which of course is not intended.

A clarification hereof has been the topic of 
[https://github.com/pdf-association/pdf-issues/issues/273] for quite a while 
now.
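
As an illustration of the intended reading (a `%` inside a content stream starts a comment unless it is inside a string), here is a deliberately simplified sketch. It is not PDFBox's parser; the class {{ContentStreamComments}} and its method are hypothetical, it works on text rather than bytes, and it ignores inline image data and hex strings.

```java
final class ContentStreamComments {
    /**
     * Removes comments from content stream text: everything from a '%'
     * outside a literal string up to (not including) the end of the line.
     */
    static String stripComments(String content) {
        StringBuilder out = new StringBuilder();
        int depth = 0;              // nesting depth of '(' ... ')' literal strings
        boolean inComment = false;
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (inComment) {
                // A comment ends at the end of the line; keep the line break.
                if (c == '\n' || c == '\r') {
                    inComment = false;
                    out.append(c);
                }
                continue;
            }
            if (depth > 0) {
                // Inside a literal string everything is kept, including '%'.
                out.append(c);
                if (c == '\\' && i + 1 < content.length()) {
                    out.append(content.charAt(++i)); // escaped character, keep verbatim
                } else if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                }
                continue;
            }
            if (c == '%') {
                inComment = true;
            } else {
                if (c == '(') {
                    depth++;
                }
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, {{0 0 10 10 re % a box}} loses the trailing comment, while {{(50% off) Tj}} is left untouched.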


was (Author: mkl):
{quote}The cause is a comment in a content stream. I doubt that this is legit: 
"Any occurrence of the PERCENT SIGN (25h) outside a string or stream introduces 
a comment".{quote}
It is valid.

You quote from ISO 32000-1. Already here that was an error as one can see in 
some example content streams, e.g. the colored tiling pattern examples.

In ISO 32000-2 that has been changed to
{quote}Any occurrence of the PERCENT SIGN (25h) outside a string or inside a 
content stream (see 7.8.2, "Content streams") introduces a comment.{quote}
which isn't that much better as it, strictly speaking, even makes `%`s in 
arbitrary binary streams starts of comments. Which is not intended of course.

A clarification hereof has been the topic of 
https://github.com/pdf-association/pdf-issues/issues/273 for quite a while now.

> NPE when converting pdf to image.
> -
>
> Key: PDFBOX-5665
> URL: https://issues.apache.org/jira/browse/PDFBOX-5665
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.25, 2.0.29, 3.0.0 PDFBox, 4.0.0
>Reporter: Ritu Dubey
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.30, 3.0.1 PDFBox, 4.0.0
>
> Attachments: PDFBOX-5665-2_unc.pdf, test.pdf
>
>
> For attached pdf I am getting a null pointer exception when converting it to 
> image. Log attached.
> java.lang.NullPointerException
>     at org.apache.pdfbox.rendering.PageDrawer.getPaint(PageDrawer.java:355)
>     at org.apache.pdfbox.rendering.PageDrawer.getNonStrokingPaint(PageDrawer.java:747)
>     at org.apache.pdfbox.rendering.PageDrawer.fillPath(PageDrawer.java:914)
>     at org.apache.pdfbox.rendering.PageDrawer.fillAndStrokePath(PageDrawer.java:1019)
>     at org.apache.pdfbox.contentstream.operator.graphics.FillNonZeroAndStrokePath.process(FillNonZeroAndStrokePath.java:39)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:939)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:514)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:492)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:282)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:355)






[jira] [Comment Edited] (PDFBOX-5665) NPE when converting pdf to image.

2023-08-30 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760272#comment-17760272
 ] 

Michael Klink edited comment on PDFBOX-5665 at 8/30/23 8:13 AM:


{quote}The cause is a comment in a content stream. I doubt that this is legit: 
"Any occurrence of the PERCENT SIGN (25h) outside a string or stream introduces 
a comment".{quote}
It is valid.

You quote from ISO 32000-1. Already here that was an error as one can see in 
some example content streams, e.g. the colored tiling pattern examples.

In ISO 32000-2 that has been changed to
{quote}Any occurrence of the PERCENT SIGN (25h) outside a string or inside a 
content stream (see 7.8.2, "Content streams") introduces a comment.{quote}
which isn't much better: strictly speaking, it even makes a `%` in an 
arbitrary binary stream the start of a comment, which of course is not intended.

A clarification hereof has been the topic of 
https://github.com/pdf-association/pdf-issues/issues/273 for quite a while now.


was (Author: mkl):
{quote}The cause is a comment in a content stream. I doubt that this is legit: 
"Any occurrence of the PERCENT SIGN (25h) outside a string or stream introduces 
a comment".{quote}
It is valid.

You quote from ISO 32000-1. Already here that was an error as one can see in 
some example content streams, e.g. the colored tiling pattern examples.

In ISO 32000-2 that has been changed to
{quote}Any occurrence of the PERCENT SIGN (25h) outside a string or inside a 
content stream (see 7.8.2, "Content streams") introduces a comment.{quote}
which isn't that much better as it, strictly speaking, even makes `%`s in 
arbitrary binary streams starts of comments. Which is not intended of course.

> NPE when converting pdf to image.
> -
>
> Key: PDFBOX-5665
> URL: https://issues.apache.org/jira/browse/PDFBOX-5665
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.25, 2.0.29, 3.0.0 PDFBox, 4.0.0
>Reporter: Ritu Dubey
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.30, 3.0.1 PDFBox, 4.0.0
>
> Attachments: PDFBOX-5665-2_unc.pdf, test.pdf
>
>
> For attached pdf I am getting a null pointer exception when converting it to 
> image. Log attached.
> java.lang.NullPointerException
>     at org.apache.pdfbox.rendering.PageDrawer.getPaint(PageDrawer.java:355)
>     at org.apache.pdfbox.rendering.PageDrawer.getNonStrokingPaint(PageDrawer.java:747)
>     at org.apache.pdfbox.rendering.PageDrawer.fillPath(PageDrawer.java:914)
>     at org.apache.pdfbox.rendering.PageDrawer.fillAndStrokePath(PageDrawer.java:1019)
>     at org.apache.pdfbox.contentstream.operator.graphics.FillNonZeroAndStrokePath.process(FillNonZeroAndStrokePath.java:39)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:939)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:514)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:492)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:282)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:355)






[jira] [Commented] (PDFBOX-5665) NPE when converting pdf to image.

2023-08-30 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17760272#comment-17760272
 ] 

Michael Klink commented on PDFBOX-5665:
---

{quote}The cause is a comment in a content stream. I doubt that this is legit: 
"Any occurrence of the PERCENT SIGN (25h) outside a string or stream introduces 
a comment".{quote}
It is valid.

You quote from ISO 32000-1. Already here that was an error as one can see in 
some example content streams, e.g. the colored tiling pattern examples.

In ISO 32000-2 that has been changed to
{quote}Any occurrence of the PERCENT SIGN (25h) outside a string or inside a 
content stream (see 7.8.2, "Content streams") introduces a comment.{quote}
which isn't much better: strictly speaking, it even makes a `%` in an 
arbitrary binary stream the start of a comment, which of course is not intended.

> NPE when converting pdf to image.
> -
>
> Key: PDFBOX-5665
> URL: https://issues.apache.org/jira/browse/PDFBOX-5665
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.25, 2.0.29, 3.0.0 PDFBox, 4.0.0
>Reporter: Ritu Dubey
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.30, 3.0.1 PDFBox, 4.0.0
>
> Attachments: PDFBOX-5665-2_unc.pdf, test.pdf
>
>
> For attached pdf I am getting a null pointer exception when converting it to 
> image. Log attached.
> java.lang.NullPointerException
>     at org.apache.pdfbox.rendering.PageDrawer.getPaint(PageDrawer.java:355)
>     at org.apache.pdfbox.rendering.PageDrawer.getNonStrokingPaint(PageDrawer.java:747)
>     at org.apache.pdfbox.rendering.PageDrawer.fillPath(PageDrawer.java:914)
>     at org.apache.pdfbox.rendering.PageDrawer.fillAndStrokePath(PageDrawer.java:1019)
>     at org.apache.pdfbox.contentstream.operator.graphics.FillNonZeroAndStrokePath.process(FillNonZeroAndStrokePath.java:39)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:939)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:514)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:492)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:282)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:355)






[jira] [Commented] (PDFBOX-5664) 3.0.0: PDFCloneUtility needs a protected constructor to be useable outside of PDFBox when using Java 9 JPMS

2023-08-27 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759385#comment-17759385
 ] 

Michael Klink commented on PDFBOX-5664:
---

Have you considered copying that class into your own package structure?

(I have to admit I haven't checked whether that class uses any 
package-private methods of other classes in PDFBox...)

> 3.0.0: PDFCloneUtility needs a protected constructor to be useable outside of 
> PDFBox when using Java 9 JPMS
> ---
>
> Key: PDFBOX-5664
> URL: https://issues.apache.org/jira/browse/PDFBOX-5664
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 3.0.0 PDFBox
>Reporter: Emmeran Seehuber
>Priority: Major
>
> The constructor of PDFCloneUtility is package private. I did not have a 
> problem with this, because I did an ugly workaround in my pdfbox-graphics2d 
> 3.0.0 branch. I created a derived class InternalDeprecatedCOSCloner in the 
> org.apache.pdfbox.multipdf package inside my project. And could access the 
> constructor.
> This works fine as long as you don't plan to use the JPMS modules introduced 
> with Java 9, which I personally don't ever plan to do.
> But it seems Apache POI is going to use those JPMS modules, at least 
> [~fanningpj] is trying to get POI working with PDFBox 3.0.0 and my 
> pdfbox-graphics2d with version 3.0.0. And now he gets a not so nice 
>  
> {{/Users/pj.fanning/svn/poi/poi-ooxml/src/main/java9/module-info.java:18: 
> error: module org.apache.poi.ooxml reads package org.apache.pdfbox.multipdf 
> from both de.rototor.pdfbox.graphics2d and org.apache.pdfbox}}
> As the - to be honest rather dirty - workaround done by me no longer works 
> with JPMS...
> You can find the concrete usage for the cloner here 
> [https://github.com/rototor/pdfbox-graphics2d/blob/master/graphics2d/src/main/java/de/rototor/pdfbox/graphics2d/PdfBoxGraphics2DPaintApplier.java].
>  Just search for PDFCloneUtility. I use it to clone PDShading when I'm 
> "rewriting" PDFs. I.e. I use PDFBox to draw on my Graphics2D adapter to 
> create new PDFs and filter / change stuff in the PDF on the fly. Mostly to 
> split PDFs for separation colors and such stuff.
> Just making the PDFCloneUtility constructor public again would of course 
> work. But I'm not sure that this is the right solution. AFAIR it was made 
> package private because of many problems of users which did not really 
> understand what this class was for.
> Maybe a solution could be to make the constructor protected and create a 
> package private getCloner() factory method? That would allow me to derive 
> from the class from outside the original package but would also prevent 
> people who don't know for sure that they really want to use this class from 
> using it.






[jira] [Commented] (PDFBOX-5647) Showing signature verified for tampered document

2023-08-13 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753785#comment-17753785
 ] 

Michael Klink commented on PDFBOX-5647:
---

{quote}my team will do a poc to check if they can solve that issue{quote}

Some time ago I implemented a simple POC for revision comparison based on iText 
7, see https://stackoverflow.com/a/69617158/1729265 . The same concept should 
be implementable with PDFBox. Your team may draw some inspiration from that 
POC.

Beware, though: That POC only reports differences in a very low-level manner, 
not in the terms used in the specification of allowed and disallowed changes.

> Showing signature verified for tampered document
> 
>
> Key: PDFBOX-5647
> URL: https://issues.apache.org/jira/browse/PDFBOX-5647
> Project: PDFBox
>  Issue Type: Bug
>  Components: Signing
>Reporter: Tanmay Sharma
>Priority: Blocker
> Attachments: Doc1_signed.pdf, Doc1_signed_corrupted.pdf
>
>
> A 2 page document was signed. The signature of document was verified by 
> [ShowSignature 
> sample|https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/signature/ShowSignature.java]
>  and it prints "Signature Verified". 
> Then a corrupted signed PDF was created by deleting the second page of the 
> same signed PDF and the signature of the corrupted PDF was also verified 
> using [ShowSignature 
> sample|https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/signature/ShowSignature.java].
>  Ideally the verification should fail because hash of the document is changed 
> (as second page is deleted). But instead of printing "Signature verification 
> failed", it still prints "Signature Verified". 
> How is the signature of the corrupted PDF still verified successfully?
> Both the signed PDF and the corrupted signed PDF are attached.






[jira] [Commented] (PDFBOX-5647) Showing signature verified for tampered document

2023-08-11 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17753226#comment-17753226
 ] 

Michael Klink commented on PDFBOX-5647:
---

The PDF format allows changes to be appended to a PDF without touching the 
original bytes of the former document revision. These appended changes are 
called incremental updates.

If you apply that mechanism to a signed PDF file, the signature mathematically 
remains valid because the original bytes remain the same. For details see [this 
old security stack exchange 
answer|https://security.stackexchange.com/a/35131/16096].

This is why PDFBox outputs that the signature in [^Doc1_signed_corrupted.pdf] 
is ok: The change, the deletion of the second page, is done in an incremental 
update. You can verify using file compare tools that 
[^Doc1_signed_corrupted.pdf] is [^Doc1_signed.pdf] plus some additions at the 
end.

The PDFBox sample also tells you that there were additional changes, it outputs 
"Signature does not cover whole document". Whenever you see that in the output 
of ShowSignature, there may be arbitrary changes added after the signed 
document revision.
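
The idea behind such a coverage check can be sketched as follows. This is an illustration only, not the actual ShowSignature code; the class and method names are made up. It assumes the usual four-entry signature *ByteRange* (offset and length of the part before the signature container, then offset and length of the part after it), which in PDFBox can be obtained via {{PDSignature.getByteRange()}}.

```java
final class ByteRangeCheck {
    /**
     * Returns true if the signed byte ranges start at the beginning of the
     * file and end exactly at its end, i.e. nothing was appended afterwards.
     */
    static boolean coversWholeFile(int[] byteRange, long fileLength) {
        if (byteRange == null || byteRange.length != 4) {
            return false;
        }
        long firstEnd = (long) byteRange[0] + byteRange[1];
        long secondEnd = (long) byteRange[2] + byteRange[3];
        return byteRange[0] == 0
                && byteRange[2] >= firstEnd
                && secondEnd == fileLength;
    }
}
```

If the file is longer than the end of the second range, something was added after signing, typically by an incremental update. Whether that addition is an allowed change is a separate and much harder question.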



Of course signatures that remain valid after arbitrary manipulations are not 
helpful. Thus, only certain changes are allowed in incremental updates to 
signed PDFs, see [this old stack overflow 
answer|https://stackoverflow.com/a/16711745/1729265].

Analyzing the changes in an incremental update is non-trivial. Also, the 
allowed changes are technically not well-specified. Thus, PDFBox has not 
implemented a check whether incremental updates to a signed PDF are allowed, 
its example validation code merely outputs if there are incremental updates 
after the signature or not.

Adobe Acrobat, on the other hand, has implemented a check of the incremental 
updates. Due to the mentioned deficits in the specification of the allowed 
changes, this implementation has changed quite a bit in recent years, and its 
reports still contain multiple false positives and false negatives.


> Showing signature verified for tampered document
> 
>
> Key: PDFBOX-5647
> URL: https://issues.apache.org/jira/browse/PDFBOX-5647
> Project: PDFBox
>  Issue Type: Bug
>  Components: Signing
>Reporter: Tanmay Sharma
>Priority: Blocker
> Attachments: Doc1_signed.pdf, Doc1_signed_corrupted.pdf
>
>
> A 2 page document was signed. The signature of document was verified by 
> [ShowSignature 
> sample|https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/signature/ShowSignature.java]
>  and it prints "Signature Verified". 
> Then a corrupted signed PDF was created by deleting the second page of the 
> same signed PDF and the signature of the corrupted PDF was also verified 
> using [ShowSignature 
> sample|https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/signature/ShowSignature.java].
>  Ideally the verification should fail because hash of the document is changed 
> (as second page is deleted). But instead of printing "Signature verification 
> failed", it still prints "Signature Verified". 
> How is the signature of the corrupted PDF still verified successfully?
> Both the signed PDF and the corrupted signed PDF are attached.






[jira] [Commented] (PDFBOX-5639) Password protected PDF opens in GUI apps but PDFbox says invalid password

2023-07-18 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744354#comment-17744354
 ] 

Michael Klink commented on PDFBOX-5639:
---

R=5 was Adobe's first shot at AES-256, which they defined in an Adobe extension 
to ISO 32000-1 and which has been deprecated in ISO 32000-2. Before this 
example I had only seen R=5 used in Adobe PDFs.

Interestingly, this scanner has copied an Adobe error, too: It pads the *O* 
and *U* values (which are specified to be 48 bytes long) with zeros to 128 
bytes. In case of R=6 PDFBox repairs such values and cuts off any bytes in 
excess of the specified ones (see the {{computeHash2A}} method) but it doesn't 
do so for R=5. Maybe it suffices to cut down the *O* and *U* values to 48 bytes 
here, too...
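
The suggested repair could look roughly like this. It is a sketch under the assumption that truncation is all that is needed; the helper class and method are hypothetical and not part of PDFBox, and whether this actually makes the password check succeed for the attached file is untested.

```java
import java.util.Arrays;

final class PasswordEntryTrim {
    /** Length of the O and U values as specified for these revisions. */
    static final int EXPECTED_LENGTH = 48;

    /**
     * Cuts off any bytes in excess of the specified 48; shorter or exactly
     * sized values are returned unchanged.
     */
    static byte[] trimToSpecifiedLength(byte[] value) {
        return value.length > EXPECTED_LENGTH
                ? Arrays.copyOf(value, EXPECTED_LENGTH)
                : value;
    }
}
```

Applied to a 128-byte *O* or *U* value this keeps only the first 48 bytes, which is the part the specification defines.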

> Password protected PDF opens in GUI apps but PDFbox says invalid password
> -
>
> Key: PDFBOX-5639
> URL: https://issues.apache.org/jira/browse/PDFBOX-5639
> Project: PDFBox
>  Issue Type: Bug
>  Components: Crypto
>Affects Versions: 2.0.29, 3.0.0 PDFBox
> Environment: Java 17 on both Linux and Windows
>Reporter: Steve Davies
>Priority: Major
> Attachments: incorrect_password.pdf
>
>
> I am using PDFbox to test whether a password is correct for a protected PDF 
> before handling it with a different process. This is working for the vast 
> majority of files I receive but from one particular source PDFbox reports an 
> invalid password when the files can be opened without complaint in GUI 
> applications (Adobe Reader et al.).
> These files are created by scanning to PDF on Xerox MFDs and using the MFD's 
> menus to add a password to the document. As luck would have it I have access 
> to the same MFDs in my location and my files created using the same method 
> are correctly read by PDFbox.
> The issue can be seen using the command line utilities and the same is seen 
> on v2.0.29 and v3.0.0-beta1 (I am using the 2.0 series in the application):
> {code:java}
> ❯ java -jar pdfbox-app-2.0.29.jar Decrypt -password JUL2023rfi 
> invalid_password.pdf
> Exception in thread "main" 
> org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException: Cannot decrypt 
> PDF, the password is incorrect
>         at 
> org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.prepareForDecryption(StandardSecurityHandler.java:284)
>         at 
> org.apache.pdfbox.pdfparser.COSParser.prepareDecryption(COSParser.java:2992)
>         at 
> org.apache.pdfbox.pdfparser.COSParser.retrieveTrailer(COSParser.java:285)
>         at 
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:173)
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1110)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1093)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1070)
>         at org.apache.pdfbox.tools.Decrypt.decrypt(Decrypt.java:143)
>         at org.apache.pdfbox.tools.Decrypt.main(Decrypt.java:65)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:52){code}
> {code:java}
> ❯ java -jar pdfbox-app-3.0.0-beta1.jar Decrypt -password JUL2023rfi -i 
> invalid_password.pdf
> Error decrypting document [InvalidPasswordException]: Cannot decrypt PDF, the 
> password is incorrect{code}
> A sample file is attached, the password is JUL2023rfi






[jira] [Commented] (PDFBOX-5623) Signature Image not Rendered starting with PDFBox 2.0.23 + patch provided

2023-06-28 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738178#comment-17738178
 ] 

Michael Klink commented on PDFBOX-5623:
---

[~lionel.fradin],

it's great that you tried and identified the component that created the broken 
*Index*!

It's better to create PDFs according to the standard instead of counting on PDF 
processors to be lax.


> Signature Image not Rendered starting with PDFBox 2.0.23 + patch provided
> -
>
> Key: PDFBOX-5623
> URL: https://issues.apache.org/jira/browse/PDFBOX-5623
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.23, 3.0.0 PDFBox
> Environment: Java 8, Windows 10 and Ubuntu 22
>Reporter: Lionel Fradin
>Assignee: Andreas Lehmkühler
>Priority: Major
> Attachments: 
> Fixing_the_problem_when_the_COSArray_is_not_sorted_in_increasing_order_.patch,
>  PDFBOX-issue-rendering-signature.pdf, pdfbox22-page9-br.jpg, 
> pdfbox23-page9-br.jpg
>
>
> We have an online service where our customers post their PDF files so that we 
> can render them. 
> One of our customers recently noticed that one of their signed documents did not 
> show the image associated with the signature. They gave me the right to share 
> this document and you will find it attached 
> ([^PDFBOX-issue-rendering-signature.pdf]).
> The problem is in the last page, page 9. The issue can easily be reproduced 
> using pdfbox-app-2.0*.jar PDFToImage.
> Result with pdfbox 2.0.22 is:
> !pdfbox22-page9-br.jpg!
> Result with pdfbox 2.0.23 or later is:
> !pdfbox23-page9-br.jpg!
> The regression was introduced with commit (seen in git) 
> [f34a33824c4363b9b683245cb582328dc92b79ca|https://github.com/apache/pdfbox/commit/f34a33824c4363b9b683245cb582328dc92b79ca],
>  dated 2021-03-02 07:12:11+. The associated ticket was PDFBOX-5112.
> The issue is in PDFXrefStreamParser's ObjectNumbers constructor, as it 
> assumes that the COSInteger objects in the COSArray are necessarily sorted. 
> In the case of the attached pdf, they are not, and this causes the parser to 
> abort browsing the array too soon.
> I have a patch for that on branch 2.0: 
> [^Fixing_the_problem_when_the_COSArray_is_not_sorted_in_increasing_order_.patch]
> With this patch the image is created successfully. However, warnings appear 
> that did not exist in version 2.0.22:
> {noformat}
> Jun 16, 2023 5:18:29 PM org.apache.pdfbox.pdfparser.COSParser findObjectKey
> WARNING: found wrong object number. expected [6789] found [6791]
> Jun 16, 2023 5:18:29 PM org.apache.pdfbox.pdfparser.COSParser findObjectKey
> WARNING: found wrong object number. expected [6790] found [5327]
> Jun 16, 2023 5:18:29 PM org.apache.pdfbox.pdfparser.COSParser findObjectKey
> WARNING: found wrong object number. expected [6791] found [6485]
> Jun 16, 2023 5:18:29 PM org.apache.pdfbox.pdfparser.COSParser findObjectKey
> WARNING: found wrong object number. expected [6485] found [6789]
> {noformat}
> There may be additional fixes to be made in order to fully support this PDF. 
> I did not have time to investigate, and my knowledge of the codebase is 
> fairly limited, so help would be appreciated here.
> Thanks.






[jira] [Commented] (PDFBOX-5623) Signature Image not Rendered starting with PDFBox 2.0.23 + patch provided

2023-06-19 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17734132#comment-17734132
 ] 

Michael Klink commented on PDFBOX-5623:
---

{quote}This one in particular, according to the metadata, was created using 
Acrobat PDFMaker 22 for Word, which might be quite common, and modified with 
iText 7.1.15, which might also be common.{quote}

There are three revisions of the document in the PDF file.

The first one is the one created by Acrobat PDFMaker and modified using iText. 
Its *Index* entry is ok: {{[0 6788]}}

The second one is created by some unknown tool which has added two signature 
fields on the last page of the document and changed the *SigFlags* entry from 
{{1}} to {{3}}. Its *Index* entry is broken: {{[ 6788 1 6789 1 6790 1 6791 1 
5327 1 6485 1 ]}}

The third one is created by Dictao D2S v5.8 which has signed one of the new 
signature fields. Its *Index* entry is ok: {{[ 5327 1 6788 1 6792 1 6793 1 6794 
1 6795 1 6796 1 6797 1 6798 1 6799 1 6800 1 6801 1 6802 1 6803 1 ]}}

Thus, the tools you mention (PDFMaker 22, iText 7.x) most likely have not 
introduced the error.
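
The three *Index* entries above can be told apart mechanically. The following sketch assumes that what makes the second entry broken is that its subsections are not in ascending order, as the issue description suggests; the class and method names are made up and this is not the PDFBox parser code.

```java
final class XrefIndexCheck {
    /**
     * index holds pairs (first object number, number of entries).
     * Returns true if the subsections are in ascending order and do not overlap.
     */
    static boolean isAscending(int[] index) {
        if (index.length % 2 != 0) {
            return false;
        }
        long nextFree = 0; // lowest object number the next subsection may start at
        for (int i = 0; i < index.length; i += 2) {
            int first = index[i];
            int count = index[i + 1];
            if (count < 0 || first < nextFree) {
                return false;
            }
            nextFree = (long) first + count;
        }
        return true;
    }
}
```

Fed with the entries of the first and third revision the check passes; for the second revision it fails at the pair {{5327 1}}, which follows {{6791 1}}.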

Furthermore, Adobe Acrobat signature validation has become stricter in recent 
years and has started considering a number of (otherwise ignored) errors in 
signed PDFs to invalidate signatures. While your signature currently is not 
subject to that, it may well become so in the future. Thus, unless those 
signatures are of interest only for a few months, your customer should take 
steps to ensure that their signed PDFs are valid according to spec.

Nonetheless, PDFBox developers are known to try and emulate Adobe Acrobat 
behavior as far as possible, including under-the-hood error repairs.

> Signature Image not Rendered starting with PDFBox 2.0.23 + patch provided
> -
>
> Key: PDFBOX-5623
> URL: https://issues.apache.org/jira/browse/PDFBOX-5623
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.23, 2.0.24, 2.0.25, 2.0.26, 2.0.27, 2.0.28
> Environment: Java 8, Windows 10 and Ubuntu 22
>Reporter: Lionel Fradin
>Assignee: Andreas Lehmkühler
>Priority: Major
> Attachments: 
> Fixing_the_problem_when_the_COSArray_is_not_sorted_in_increasing_order_.patch,
>  PDFBOX-issue-rendering-signature.pdf, pdfbox22-page9-br.jpg, 
> pdfbox23-page9-br.jpg
>
>
> We have an online service where our customers post their PDF files so that we 
> can render them. 
> One of our customers recently noticed that one of their signed documents did not 
> show the image associated with the signature. They gave me the right to share 
> this document and you will find it attached 
> ([^PDFBOX-issue-rendering-signature.pdf]).
> The problem is in the last page, page 9. The issue can easily be reproduced 
> using pdfbox-app-2.0*.jar PDFToImage.
> Result with pdfbox 2.0.22 is:
> !pdfbox22-page9-br.jpg!
> Result with pdfbox 2.0.23 or later is:
> !pdfbox23-page9-br.jpg!
> The regression was introduced with commit (seen in git) 
> [f34a33824c4363b9b683245cb582328dc92b79ca|https://github.com/apache/pdfbox/commit/f34a33824c4363b9b683245cb582328dc92b79ca],
>  dated 2021-03-02 07:12:11+. The associated ticket was PDFBOX-5112.
> The issue is in PDFXrefStreamParser's ObjectNumbers constructor, as it 
> assumes that the COSInteger objects in the COSArray are necessarily sorted. 
> In the case of the attached pdf, they are not, and this causes the parser to 
> abort browsing the array too soon.
> I have a patch for that on branch 2.0: 
> [^Fixing_the_problem_when_the_COSArray_is_not_sorted_in_increasing_order_.patch]
> With this patch the image is created successfully. However, warnings appear 
> that did not exist in version 2.0.22:
> {noformat}
> Jun 16, 2023 5:18:29 PM org.apache.pdfbox.pdfparser.COSParser findObjectKey
> WARNING: found wrong object number. expected [6789] found [6791]
> Jun 16, 2023 5:18:29 PM org.apache.pdfbox.pdfparser.COSParser findObjectKey
> WARNING: found wrong object number. expected [6790] found [5327]
> Jun 16, 2023 5:18:29 PM org.apache.pdfbox.pdfparser.COSParser findObjectKey
> WARNING: found wrong object number. expected [6791] found [6485]
> Jun 16, 2023 5:18:29 PM org.apache.pdfbox.pdfparser.COSParser findObjectKey
> WARNING: found wrong object number. expected [6485] found [6789]
> {noformat}
> There may be additional fixes to be made in order to fully support this PDF. 
> I did not have time to investigate, and also my knowledge of the codebase is 
> fairly limited. So help would be appreciated here.
> Thanks.




[jira] [Comment Edited] (PDFBOX-5623) Signature Image not Rendered starting with PDFBox 2.0.23 + patch provided

2023-06-18 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733612#comment-17733612
 ] 

Michael Klink edited comment on PDFBOX-5623 at 6/18/23 10:19 AM:
-

{quote}The issue is in PDFXrefStreamParser's ObjectNumbers constructor, as it 
assumes that the COSInteger objects in the COSArray are necessarily sorted. In 
the case of the attached pdf, they are not, and this causes the parser to abort 
browsing the array too soon.{quote}
Strictly speaking, according to the specification that assumption is correct:

||Key||Type||Value||
|*Index*|array|(Optional) An array containing a pair of integers for each 
subsection in this section. The first integer shall be the first object number 
in the subsection; the second integer shall be the number of entries in the 
subsection
*The array shall be sorted in ascending order by object number.*
Subsections cannot overlap; an object number shall have no more than one entry 
in a section.
Default value: [0 Size].|
_(ISO 32000-2:2020 Table 17 — Additional entries specific to a cross-reference 
stream dictionary)_

Thus, the issue actually is that the PDF is broken.

So, even if the PDFBox developers decide to enable PDFBox to process your PDF, 
your customers are likely to run into problems again and again if they do not 
fix those PDFs.

Concerning the patch - unfortunately it only sorts the *Index* entries and not 
the associated stream data elements. Thus, the sorting mixes up object numbers 
and the associated offsets. Consequently, object lookups fail with the 
warnings shown.
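
To illustrate why sorting the *Index* pairs alone cannot work, here is a minimal sketch in plain Java. It is not the PDFBox implementation: the class and method names are hypothetical, the object numbers and offsets in the usage note are made up, and each cross-reference entry is simplified to a single offset. The idea is to attach each entry to its object number while walking the subsections in file order, and to order by object number only afterwards.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: pair each object number with its entry BEFORE any sorting. */
class XrefIndexSketch {

    /**
     * @param index   flattened Index array: first object number, count, first, count, ...
     * @param offsets one simplified entry per object, in the order found in the stream data
     * @return object number -> offset, ordered by object number
     */
    static Map<Integer, Long> resolve(int[] index, long[] offsets) {
        if (index.length % 2 != 0) {
            throw new IllegalArgumentException("Index must contain pairs of integers");
        }
        Map<Integer, Long> result = new TreeMap<>();
        int entry = 0; // position in the stream data, advanced in file order
        for (int i = 0; i < index.length; i += 2) {
            int first = index[i];
            int count = index[i + 1];
            for (int k = 0; k < count; k++) {
                if (entry >= offsets.length) {
                    throw new IllegalArgumentException("Index describes more entries than the data holds");
                }
                // The entry belongs to the object number of the subsection as written in the file.
                if (result.put(first + k, offsets[entry++]) != null) {
                    throw new IllegalArgumentException("Overlapping subsections at object " + (first + k));
                }
            }
        }
        if (entry != offsets.length) {
            throw new IllegalArgumentException("Stream data holds more entries than Index describes");
        }
        return result;
    }
}
```

With an unsorted Index such as `[6791 1 5327 2]` and entries `100, 200, 300`, this maps 6791 to 100, 5327 to 200 and 5328 to 300; sorting the Index first and then reading the entries sequentially would instead assign 100 to 5327.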




[jira] [Comment Edited] (PDFBOX-5623) Signature Image not Rendered starting with PDFBox 2.0.23 + patch provided

2023-06-18 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733612#comment-17733612
 ] 

Michael Klink edited comment on PDFBOX-5623 at 6/18/23 10:05 AM:
-

{quote}The issue is in PDFXrefStreamParser's ObjectNumbers constructor, as it 
assumes that the COSInteger objects in the COSArray are necessarily sorted. In 
the case of the attached pdf, they are not, and this causes the parser to abort 
browsing the array too soon.{quote}
Strictly speaking, according to the specification that assumption is correct:

||Key||Type||Value||
|*Index*|array|(Optional) An array containing a pair of integers for each 
subsection in this section. The first integer shall be the first object number 
in the subsection; the second integer shall be the number of entries in the 
subsection
*The array shall be sorted in ascending order by object number.*
Subsections cannot overlap; an object number shall have no more than one entry 
in a section.
Default value: [0 Size].|
_(ISO 32000-2:2020 Table 17 — Additional entries specific to a cross-reference 
stream dictionary)_

Thus, the issue actually is that the PDF is broken.

So, even if the PDFBox developers decide to enable PDFBox to process your PDF, 
your customers are likely to run into problems again and again if they do not 
fix those PDFs.


was (Author: mkl):
{quote}The issue is in PDFXrefStreamParser's ObjectNumbers constructor, as it 
assumes that the COSInteger objects in the COSArray are necessarily sorted. In 
the case of the attached pdf, they are not, and this causes the parser to abort 
browsing the array too soon.{quote}
Strictly speaking, according to the specification that assumption is correct:

||Key||Type||Value||
|*Index*|array|(Optional) An array containing a pair of integers for each 
subsection in this section. The first integer shall be the first object number 
in the subsection; the second integer shall be the number of entries in the 
subsection
*The array shall be sorted in ascending order by object number.*
Subsections cannot overlap; an object number shall have no more than one entry 
in a section.
Default value: [0 Size].|
_(ISO 32000-2:2020 Table 17 — Additional entries specific to a cross-reference 
stream dictionary)_

Thus, the issue actually is that the PDF is broken.

So, even if PDFBox decides to enable PDFBox to process your PDF, your customers 
are likely to run into problems again and again if they do not fix those PDFs.


[jira] [Commented] (PDFBOX-5623) Signature Image not Rendered starting with PDFBox 2.0.23 + patch provided

2023-06-16 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733612#comment-17733612
 ] 

Michael Klink commented on PDFBOX-5623:
---

{quote}The issue is in PDFXrefStreamParser's ObjectNumbers constructor, as it 
assumes that the COSInteger objects in the COSArray are necessarily sorted. In 
the case of the attached pdf, they are not, and this causes the parser to abort 
browsing the array too soon.{quote}
Strictly speaking, according to the specification that assumption is correct:

||Key||Type||Value||
|*Index*|array|(Optional) An array containing a pair of integers for each 
subsection in this section. The first integer shall be the first object number 
in the subsection; the second integer shall be the number of entries in the 
subsection
*The array shall be sorted in ascending order by object number.*
Subsections cannot overlap; an object number shall have no more than one entry 
in a section.
Default value: [0 Size].|
_(ISO 32000-2:2020 Table 17 — Additional entries specific to a cross-reference 
stream dictionary)_

Thus, the issue actually is that the PDF is broken.

So, even if PDFBox decides to enable PDFBox to process your PDF, your customers 
are likely to run into problems again and again if they do not fix those PDFs.

> Signature Image not Rendered starting with PDFBox 2.0.23 + patch provided
> -
>
> Key: PDFBOX-5623
> URL: https://issues.apache.org/jira/browse/PDFBOX-5623
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.23, 2.0.24, 2.0.25, 2.0.26, 2.0.27, 2.0.28
> Environment: Java 8, Windows 10 and Ubuntu 22
>Reporter: Lionel Fradin
>Priority: Major
> Attachments: 
> Fixing_the_problem_when_the_COSArray_is_not_sorted_in_increasing_order_.patch,
>  PDFBOX-issue-rendering-signature.pdf, pdfbox22-page9-br.jpg, 
> pdfbox23-page9-br.jpg
>
>
> We have an online service where our customers post their PDF files so that we 
> can render them. 
> One of our customers recently noticed that one of their signed documents did 
> not show the image associated with the signature. They gave me the right to 
> share this document and you will find it attached 
> ([^PDFBOX-issue-rendering-signature.pdf]).
> The problem is in the last page, page 9. The issue can easily be reproduced 
> using pdfbox-app-2.0*.jar PDFToImage.
> Result with pdfbox 2.0.22 is:
> !pdfbox22-page9-br.jpg!
> Result with pdfbox 2.0.23 or later is:
> !pdfbox23-page9-br.jpg!
> The regression was introduced with commit (seen in git) 
> f34a33824c4363b9b683245cb582328dc92b79ca, dated 2021-03-02 07:12:11+. The 
> associated ticket was PDFBOX-5112.
> The issue is in PDFXrefStreamParser's ObjectNumbers constructor, as it 
> assumes that the COSInteger objects in the COSArray are necessarily sorted. 
> In the case of the attached pdf, they are not, and this causes the parser to 
> abort browsing the array too soon.
> I have a patch for that on branch 2.0: 
> [^Fixing_the_problem_when_the_COSArray_is_not_sorted_in_increasing_order_.patch]
> With this patch the image is created successfully. However, warnings appear 
> that did not exist in version 2.0.22:
> {noformat}
> Jun 16, 2023 5:18:29 PM org.apache.pdfbox.pdfparser.COSParser findObjectKey
> WARNING: found wrong object number. expected [6789] found [6791]
> Jun 16, 2023 5:18:29 PM org.apache.pdfbox.pdfparser.COSParser findObjectKey
> WARNING: found wrong object number. expected [6790] found [5327]
> Jun 16, 2023 5:18:29 PM org.apache.pdfbox.pdfparser.COSParser findObjectKey
> WARNING: found wrong object number. expected [6791] found [6485]
> Jun 16, 2023 5:18:29 PM org.apache.pdfbox.pdfparser.COSParser findObjectKey
> WARNING: found wrong object number. expected [6485] found [6789]
> {noformat}
> There may be additional fixes to be made in order to fully support this PDF. 
> I did not have time to investigate, and also my knowledge of the codebase is 
> fairly limited. So help would be appreciated here.
> Thanks.






[jira] [Commented] (PDFBOX-5610) Security-Related Findings in OSS-Fuzz for PDFBox (Issue 58353)

2023-05-30 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727618#comment-17727618
 ] 

Michael Klink commented on PDFBOX-5610:
---

Catching it in {{parse}} _and discarding every result from the parsing process_ 
(by clearing caches and returning {{null}} or throwing a dedicated exception) 
may well be safe enough.

To really prevent the error from occurring, one could explicitly limit 
recursion depth (not too difficult) or switch to a non-recursive parsing 
mechanism (more difficult).
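
As a rough illustration of the first option (an explicit recursion depth limit), here is a toy recursive-descent parser for nested `[` ... `]` arrays. It is a sketch only, not PDFBox code: the class name and the limit of 500 are assumptions chosen for the example.

```java
import java.io.IOException;

/** Toy parser for nested '[' ... ']' arrays that refuses to recurse too deeply. */
class DepthLimitedParser {
    private static final int MAX_DEPTH = 500; // hypothetical limit, not a PDFBox setting

    private final String input;
    private int pos = 0;

    DepthLimitedParser(String input) {
        this.input = input;
    }

    /** Returns the deepest nesting level found; throws if the input is malformed or too deep. */
    int parse() throws IOException {
        int deepest = parseArray(1);
        if (pos != input.length()) {
            throw new IOException("Trailing data at position " + pos);
        }
        return deepest;
    }

    private int parseArray(int depth) throws IOException {
        // Fail with an ordinary exception long before the JVM stack is exhausted.
        if (depth > MAX_DEPTH) {
            throw new IOException("Nesting deeper than " + MAX_DEPTH + " levels");
        }
        expect('[');
        int deepest = depth;
        while (pos < input.length() && input.charAt(pos) == '[') {
            deepest = Math.max(deepest, parseArray(depth + 1));
        }
        expect(']');
        return deepest;
    }

    private void expect(char c) throws IOException {
        if (pos >= input.length() || input.charAt(pos) != c) {
            throw new IOException("Expected '" + c + "' at position " + pos);
        }
        pos++;
    }
}
```

The depth counter is passed as a parameter, so it needs no cleanup on error paths. A suitable limit for a real parser would have to be chosen so that legitimate documents are not rejected.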

> Security-Related Findings in OSS-Fuzz for PDFBox (Issue 58353)
> --
>
> Key: PDFBOX-5610
> URL: https://issues.apache.org/jira/browse/PDFBOX-5610
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Henry Lin
>Priority: Major
> Attachments: crashing_input
>
>
> Dear PDFBox maintainers,
>  
> Fuzzing has found a security related issue in 
> [OSS-Fuzz|https://github.com/google/oss-fuzz] with JVM Fuzzer 
> [Jazzer|https://github.com/CodeIntelligenceTesting/jazzer] in PDFBox. We have 
> reviewed the finding and regarded it as security-related due to the potential 
> of a denial of service. We would appreciate it if you could take a look at 
> the finding. Do you see a risk that this might be exploited by untrusted 
> input?
>  
> Part of the stack trace:
> == Java Exception: com.code_intelligence.jazzer.api.FuzzerSecurityIssueLow: Stack overflow (use '-Xss921k' to reproduce)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:187)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:347)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:263)
> at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:882)
> Caused by: java.lang.StackOverflowError
> at org.apache.commons.logging.impl.Jdk14Logger.log(Jdk14Logger.java:76)
> at org.apache.commons.logging.impl.Jdk14Logger.warn(Jdk14Logger.java:260)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:271)
> at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:882)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:187)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:347)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:263)
> at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:882)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:187)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:347)
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:263)
> ...
>  
> We have added a reproducer zip which contains a README that describes how to 
> reproduce the issue.
> Reproducer Zip: 
> [https://drive.google.com/file/d/1CrVPoQhnTZ6FdAOr7tuny7vhG0gsnZZa/view?usp=share_link]
>  
> Fuzz target: 
> [https://github.com/google/oss-fuzz/blob/master/projects/pdfbox/project-parent/fuzz-targets/src/test/java/com/example/PDFStreamParserFuzzer.java]
> OSS-Fuzz issue: [https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=58353 
> |https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=58353]
> Hint: The provided OSS-Fuzz Issue link is only accessible if the issue is 
> fixed or you are the maintainer of the OSS-Fuzz project.






[jira] [Commented] (PDFBOX-5613) uncorrent paragraph split

2023-05-28 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17726936#comment-17726936
 ] 

Michael Klink commented on PDFBOX-5613:
---

As the PDF in question is tagged, you may want to use tags in extraction.

The document is tagged like this:

{noformat}

Daily Report

1) which language is your text in? - English 2) some examples of sentences 
containing addresses you'd want to pick up - Data are contarct documents, it 
contains addresses in different formates(of different countries),some are comma 
saperated, some are new line saperated etc 3) perhaps examples of mistakes - 
currently en model of SpaCy is even not able to tag entities clearly 4) Are you 
training your own model or are you using a model as is? - tried as it is but 
very poor in results to need to know a generic approach to train own model. any 
referance code will be helpfu;  Can you please edit your question to add what 
you wrote in your last comment (that was what I was trying to do by asking all 
of them). And please do add actual examples and not just "addresses are in 
different formats", that doesn't really help us understand what you are facing. 
I have added a link on how to train a SpaCy NER model in my answer. It's very 
well documented on their website
;
 Please look at my comment to add more information to your post. Based on the 
information you provided, here are my remarks:

• SpaCy is trained to find locations, not addresses per se

If you use a "common" language, SpaCy is trained using WikiNER data, where 
locations aren't addresses but more like geographical places like city names, 
country names etc. So it's quite normal to not be able to detect full 
addresses.

You likely need to train your own entity recognizer. They detail how to do this 
on their website, including code samples:

?org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDObjectReference@66498326
?org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDObjectReference@cad498c

https://spacy.io/usage/training#ner

• Don't underestimate SpaCy's rule-based matching

Is it a fancy neural network? No. Does it matter? Also no. SpaCy allows you to 
create

?org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDObjectReference@1e6454ec

rules to find entities

 and in cases like addresses which are generally following a pattern across 
entities.

{noformat}

(Ah, I see my simple implementation does not correctly inspect links.)

> uncorrent paragraph split
> -
>
> Key: PDFBOX-5613
> URL: https://issues.apache.org/jira/browse/PDFBOX-5613
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Text extraction
>Affects Versions: 2.0.1, 2.0.28
>Reporter: Key Hutu
>Priority: Major
> Attachments: Daily Report.pdf
>
>
> When I use PDFBox to extract paragraph text, I get incorrect paragraph information.
> {code}
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> // PDFParagraphTextStripper.java
> public class PDFParagraphTextStripper extends PDFTextStripper {
>     public PDFParagraphTextStripper() throws IOException {
>         // Join the lines of a paragraph with spaces and end each paragraph with a line break.
>         this.setLineSeparator(" ");
>         this.setParagraphStart("");
>         this.setParagraphEnd(this.LINE_SEPARATOR);
>         this.setPageStart("");
>         this.setPageEnd("");
>         this.setArticleStart(this.LINE_SEPARATOR);
>         this.setArticleEnd(this.LINE_SEPARATOR);
>     }
> }
>
> // PdfParser.java
> public class PdfParser {
>     private static final String dataPath = 
> "D:\\IdeaProject\\PdfParser\\PdfParser\\data";
>
>     public static void main(String[] args) {
>         String fileName = "Daily Report.pdf";
>         try {
>             // File(parent, child) inserts the path separator that plain
>             // concatenation of dataPath and fileName would omit.
>             extract_pdfbox(new File(dataPath, fileName).getPath());
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
>
>     private static void extract_pdfbox(String filePath) throws Exception {
>         File file = new File(filePath);
>         // try-with-resources closes the document even if extraction fails.
>         try (PDDocument document = PDDocument.load(file)) {
>             PDFTextStripper pdfTextStripper = new PDFParagraphTextStripper();
>             String text = pdfTextStripper.getText(document);
>             System.out.println(text);
>         }
>     }
> }
> {code}
> {noformat}
> Daily Report 1) which language is your text in? - English 
> 2) some examples of sentences containing 
> addresses you'd want to pick up - Data are 
> contarct documents, it contains addresses in 
> different formates(of different 
> countries),some are comma saperated, some 
> are new line saperated etc 3) perhaps 
> examples of mistakes - currently en model 
> of SpaCy is even not able to tag entities 
> clearly 4) Are you training your own model 
> or are you using a model as is? - tried as it is 
> but very poor in results to need to know a 
> generic approach to train own model. any 
> {noformat}




[jira] [Commented] (PDFBOX-5610) Security-Related Findings in OSS-Fuzz for PDFBox (Issue 58353)

2023-05-28 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17726934#comment-17726934
 ] 

Michael Klink commented on PDFBOX-5610:
---

As long as PDFBox parsing is implemented by recursion with unlimited depths, it 
will remain possible to throw PDFs at it that will cause stack overflows.

The simplest option would be to catch stack overflows in {{parse}}.
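
A minimal sketch of that idea, using a toy recursive parser rather than PDFBox code (the class name, the `parsed` list standing in for parser caches, and the choice of `IOException` are all assumptions for illustration):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Toy parser that converts a stack overflow during recursion into an ordinary exception. */
class GuardedParser {
    // Stands in for whatever caches a real parser fills while it runs.
    private final List<String> parsed = new ArrayList<>();

    List<String> parse(String input) throws IOException {
        try {
            descend(input, 0);
            return new ArrayList<>(parsed);
        } catch (StackOverflowError e) {
            // By the time we get here the stack has unwound, so there is room to work again.
            // Partial results may be inconsistent, so they are dropped rather than returned.
            parsed.clear();
            throw new IOException("Input is nested too deeply to parse", e);
        }
    }

    // Unbounded recursion: one call per nesting level.
    private void descend(String input, int pos) {
        if (pos < input.length() && input.charAt(pos) == '[') {
            parsed.add("array@" + pos);
            descend(input, pos + 1);
        }
    }
}
```

The sketch deliberately discards everything collected so far before rethrowing, because an overflow can interrupt the parser at an arbitrary point and leave half-built state behind.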







[jira] [Commented] (PDFBOX-5583) Adding font (or changing font subset) not coming through in saveIncremental

2023-04-21 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17715042#comment-17715042
 ] 

Michael Klink commented on PDFBOX-5583:
---

{quote}Given that re-statement, do you think I'm missing anything when it comes 
to marking things during a `saveIncremental` scenario?{quote}

Looking at your code I see that you use {{PDAcroForm.getFields}} to iterate 
over the fields. This only retrieves the root fields, for general use you may 
want to walk the {{PDAcroForm.getFieldTree}} instead. Don't forget to mark 
intermediary fields.

When iterating over the widgets of a field you forget to mark the *AP* and the 
*N* values.
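
The traversal described above can be sketched with a toy object model. These are not PDFBox classes: `Obj`, `Widget` and `Field` are hypothetical stand-ins, and the boolean flag plays the role that `setNeedToBeUpdated(true)` plays on the real objects. The point is only which objects get flagged: every field in the tree including intermediary ones, each widget, and each widget's *AP* and *N* values.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not PDFBox classes): which objects to flag when walking a field tree. */
class FieldMarkingSketch {

    static class Obj {
        final String name;
        boolean needToBeUpdated;

        Obj(String name) {
            this.name = name;
        }
    }

    static class Widget extends Obj {
        final Obj ap; // stands in for the widget's AP dictionary
        final Obj n;  // stands in for the N value inside AP

        Widget(String name) {
            super(name);
            this.ap = new Obj(name + "/AP");
            this.n = new Obj(name + "/AP/N");
        }
    }

    static class Field extends Obj {
        final List<Field> kids = new ArrayList<>();     // child fields, empty for terminal fields
        final List<Widget> widgets = new ArrayList<>(); // widgets of a terminal field

        Field(String name) {
            super(name);
        }
    }

    /** Flags the field, its widgets with their AP and N, then recurses into child fields. */
    static int mark(Field field) {
        int marked = 1;
        field.needToBeUpdated = true;
        for (Widget widget : field.widgets) {
            widget.needToBeUpdated = true;
            widget.ap.needToBeUpdated = true;
            widget.n.needToBeUpdated = true;
            marked += 3;
        }
        for (Field kid : field.kids) {
            marked += mark(kid); // intermediary fields are flagged on the way down
        }
        return marked;
    }
}
```

Starting only from the root fields without recursing would leave nested fields and their appearance objects unflagged, which is the gap pointed out above.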

> Adding font (or changing font subset) not coming through in saveIncremental
> ---
>
> Key: PDFBOX-5583
> URL: https://issues.apache.org/jira/browse/PDFBOX-5583
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm, PDModel, Signing
>Affects Versions: 2.0.24
>Reporter: Chris Newhouse
>Priority: Major
> Attachments: image-2023-03-31-17-40-48-710.png, 
> update-after-signature-includes-font-change.pdf, 
> update-after-signature-that-changes-font-two.pdf
>
>
> In an effort to keep file sizes small, we leverage Font Subsets in our PDFs.
> Also, we already have "incremental change and signing" (where fields are 
> changed _after_ a signature, without voiding the prior signature thanks to 
> using `saveIncremental` and signing the changes) working just fine in most 
> cases.
> However, when the Font on a field is changed, or a new Font Subset must be 
> added to the document because a field uses characters that its 
> tightly-subsetted Font does not contain, things don't work correctly:
>  * The signatures stay valid, which is great
>  * But the new Font information does not appear to get written to the new 
> version appendix, so you get nonsense rendering in a PDF viewer since the field 
> still points to a Font resource that is either not there or is a subset that 
> does not contain all the necessary characters. I'm not super proficient in this 
> so I don't know exactly what's going on.
>  
> I have ensured that the fields we update are getting marked as 
> `setNeedToBeUpdated(true)` (this is, I believe, why the fields changes _do_ 
> end up in the version changes), it just seems that the Font resources are not 
> getting updated in the version.
> I have also tried to mark the document's PDResource as 
> `setNeedToBeUpdated(true)` but to no avail.
>  
> Is this behavior possible/allowed? Is it a bug or am I doing something wrong?
>  
> I have included an example file where the field named 
> `incrementalChangeField` goes from `NotoSans-Regular` to 
> `NotoSansCJK-Regular` during the incremental change, but the font resource 
> does not get added.
> Thanks for your time!






[jira] [Commented] (PDFBOX-5585) Adding new Field to form during saveIncremental invalidates prior signatures

2023-04-16 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712796#comment-17712796
 ] 

Michael Klink commented on PDFBOX-5585:
---

{quote}
It's possible that this is expected behavior, and that what I'm trying to do is 
just not possible, but:

When I have some fields in a document, then sign the document, then add a new 
field to the document and `saveIncremental` sign it, the prior signatures are 
invalidated.
{quote}

Indeed, that's the expected behavior.

After a document is signed, only very few changes remain allowed. While form 
fill-in may or may not be allowed, structural changes to the form (like the 
addition of new form fields) are always forbidden. The only exception can be the 
addition of new signature fields.

For details see [this Stack Overflow 
answer|https://stackoverflow.com/a/16711745/1729265].

> Adding new Field to form during saveIncremental invalidates prior signatures
> 
>
> Key: PDFBOX-5585
> URL: https://issues.apache.org/jira/browse/PDFBOX-5585
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm, PDModel
>Affects Versions: 2.0.24
>Reporter: Chris Newhouse
>Priority: Major
> Attachments: signed-then-new-field-added-save-incremental.pdf
>
>
> It's possible that this is expected behavior, and that what I'm trying to do 
> is just not possible, but:
> When I have some fields in a document, then sign the document, then add a new 
> field to the document and `saveIncremental` sign it, the prior signatures are 
> invalidated.
>  
> The message is "Document has been altered or corrupted since it was signed".
>  
> The final signature on top of that is good, and the new field is there...but 
> the old signatures are invalid. Is it possible to add a new field and not 
> invalidate the old signatures? I've successfully edited existing fields, but 
> can't get it to be OK after adding a new field.
>  
> I've attached a sample document.
>  
> Thank you for your help!






[jira] [Commented] (PDFBOX-5583) Adding font (or changing font subset) not coming through in saveIncremental

2023-04-16 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712782#comment-17712782
 ] 

Michael Klink commented on PDFBOX-5583:
---

I think you overestimate what {{setNeedToBeUpdated}} does.

Your code assumes that the method updates the _contents_ of the object in 
question to match changes in related properties. In particular, you seem to 
assume that the method updates an appearance to use updated default appearance 
strings and updated default resources.

This is not the case. {{setNeedToBeUpdated}} merely marks the object to be 
included in an incremental update if the document is saved incrementally. But 
the object will be included with the contents it has at the time of saving. So 
if you don't update the contents of the appearance stream you marked, the 
appearance stream will be stored with the old contents. Thus, you will not see 
any differences in the displayed PDF.

Furthermore, I'm not sure how exactly new font subsets are actually created. 
It's automatically done for fonts used on pages, but for fonts used only 
elsewhere you may have to do some extra housekeeping...

Essentially, the method is a bit misnamed; it probably should have been called 
something like {{setNeedToBeStored}}.
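
The difference between marking and updating can be sketched with a toy model in plain Java. This is not PDFBox code: {{ToyObject}} and {{ToyIncrementalSave}} are hypothetical names, and the {{needToBeUpdated}} field merely mirrors the flag described above. Under those assumptions, the sketch shows that a marked object is stored with whatever contents it has at the time of saving.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: these are NOT PDFBox classes.
class ToyObject {
    String contents;
    boolean needToBeUpdated; // mirrors the flag set by setNeedToBeUpdated(true)

    ToyObject(String contents) {
        this.contents = contents;
    }
}

class ToyIncrementalSave {
    // Collects what an incremental update would contain: every marked object,
    // with the contents it holds at the time of saving. Nothing is regenerated.
    static Map<String, String> saveIncremental(Map<String, ToyObject> objects) {
        Map<String, String> update = new LinkedHashMap<>();
        for (Map.Entry<String, ToyObject> e : objects.entrySet()) {
            if (e.getValue().needToBeUpdated) {
                update.put(e.getKey(), e.getValue().contents);
            }
        }
        return update;
    }

    public static void main(String[] args) {
        Map<String, ToyObject> objects = new LinkedHashMap<>();
        objects.put("field", new ToyObject("/DA (/NotoSans 12 Tf)"));
        objects.put("appearance", new ToyObject("/NotoSans 12 Tf (old text) Tj"));

        // Change the field and mark both objects, but never rewrite the appearance.
        objects.get("field").contents = "/DA (/NotoSansCJK 12 Tf)";
        objects.get("field").needToBeUpdated = true;
        objects.get("appearance").needToBeUpdated = true;

        // The appearance is stored, but still with its old contents.
        System.out.println(saveIncremental(objects));
    }
}
```

In this model the appearance only changes in the saved update if the calling code rewrites its contents before saving.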

> Adding font (or changing font subset) not coming through in saveIncremental
> ---
>
> Key: PDFBOX-5583
> URL: https://issues.apache.org/jira/browse/PDFBOX-5583
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm, PDModel, Signing
>Affects Versions: 2.0.24
>Reporter: Chris Newhouse
>Priority: Major
> Attachments: image-2023-03-31-17-40-48-710.png, 
> update-after-signature-includes-font-change.pdf, 
> update-after-signature-that-changes-font-two.pdf
>
>
> In an effort to keep file sizes small, we leverage Font Subsets in our PDFs.
> Also, we already have "incremental change and signing" (where fields are 
> changed _after_ a signature, without voiding the prior signature thanks to 
> using `saveIncremental` and signing the changes) working just fine in most 
> cases.
> However, when the Font on a field is changed, or a new Font Subset must be 
> added to the document because the characters used in a field are not covered 
> by a tightly subsetted Font, things don't work correctly:
>  * The signatures stay valid, which is great
>  * But the new Font information does not appear to get written to the new 
> version appendix, so you get nonsense rendering in a PDF viewer since the 
> field still points to a Font resource that is either not there or is a subset 
> that does not contain all the necessary characters. I'm not super proficient 
> in this so I don't know exactly what's going on.
>  
> I have ensured that the fields we update are getting marked as 
> `setNeedToBeUpdated(true)` (this is, I believe, why the field changes _do_ 
> end up in the version changes); it just seems that the Font resources are not 
> getting updated in the version.
> I have also tried to mark the document's PDResource as 
> `setNeedToBeUpdated(true)` but to no avail.
>  
> Is this behavior possible/allowed? Is it a bug or am I doing something wrong?
>  
> I have included an example file where the field named 
> `incrementalChangeField` goes from `NotoSans-Regular` to 
> `NotoSansCJK-Regular` during the incremental change, but the font resource 
> does not get added.
> Thanks for your time!






[jira] [Commented] (PDFBOX-5583) Adding font (or changing font subset) not coming through in saveIncremental

2023-04-02 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17707640#comment-17707640
 ] 

Michael Klink commented on PDFBOX-5583:
---

{quote}but it looks like the `AP` value of the changed field/widget is not 
updating to reference the new font or something? I've attached the PDF again so 
hopefully you can make sense of it?  Perhaps I need to update the widgets more 
completely or something?{quote}
Unfortunately you don't show your pivotal code, so I don't know what it does or 
does not do, and in particular what else is needed. Thus, I can merely guess.

Your first question in the quote appears to indicate that your code updates 
some font objects and references (*DA*, *DR*) and that you expect the existing 
appearance stream to automatically use those new values. *DA* and *DR* are 
there for creating a new appearance stream whenever the underlying field value 
changes. Do you perhaps first change the value of the field and only 
thereafter the default appearance and default resources?
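
Why the order matters can be sketched with a toy model in plain Java. {{ToyTextField}} is a hypothetical class, not PDFBox's text field, and it rests on the assumption that setting the value is the only step that regenerates the appearance from the current default appearance.

```java
// Toy model only: ToyTextField is a hypothetical class, not a PDFBox class.
class ToyTextField {
    String defaultAppearance;      // stands in for the DA string
    String value = "";
    String appearanceStream = "";  // stands in for the widget's appearance stream

    ToyTextField(String defaultAppearance) {
        this.defaultAppearance = defaultAppearance;
    }

    // Assumption: a value change is what triggers building a new appearance,
    // using the default appearance as it is at that moment.
    void setValue(String newValue) {
        value = newValue;
        appearanceStream = defaultAppearance + " (" + newValue + ") Tj";
    }

    // Value first, default appearance afterwards: the appearance stays stale.
    static String valueFirst() {
        ToyTextField field = new ToyTextField("/NotoSans 12 Tf");
        field.setValue("text");
        field.defaultAppearance = "/NotoSansCJK 12 Tf";
        return field.appearanceStream;
    }

    // Default appearance first, value afterwards: the new font is picked up.
    static String appearanceFirst() {
        ToyTextField field = new ToyTextField("/NotoSans 12 Tf");
        field.defaultAppearance = "/NotoSansCJK 12 Tf";
        field.setValue("text");
        return field.appearanceStream;
    }

    public static void main(String[] args) {
        System.out.println(valueFirst());
        System.out.println(appearanceFirst());
    }
}
```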

> Adding font (or changing font subset) not coming through in saveIncremental
> ---
>
> Key: PDFBOX-5583
> URL: https://issues.apache.org/jira/browse/PDFBOX-5583
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm, PDModel, Signing
>Affects Versions: 2.0.24
>Reporter: Chris Newhouse
>Priority: Major
> Attachments: image-2023-03-31-17-40-48-710.png, 
> update-after-signature-includes-font-change.pdf, 
> update-after-signature-that-changes-font-two.pdf
>
>






[jira] [Comment Edited] (PDFBOX-5583) Adding font (or changing font subset) not coming through in saveIncremental

2023-03-30 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17706957#comment-17706957
 ] 

Michael Klink edited comment on PDFBOX-5583 at 3/30/23 4:53 PM:


{quote}I have included an example file where the field named 
`incrementalChangeField` goes from `NotoSans-Regular` to `NotoSansCJK-Regular` 
during the incremental change, but the font resource does not get added.{quote}

Indeed, the font resources of that field do not change. Most likely you forgot 
a {{setNeedToBeUpdated}} somewhere. You say you _have also tried to mark the 
document's PDResource as `setNeedToBeUpdated(true)` but to no avail._ It does 
not suffice to mark the overall *Resources* dictionary as changed; you also 
have to mark the dictionaries therein, in particular the value of *Font*.
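
Marking every dictionary along the way, rather than only the outermost one, can be sketched with a toy model in plain Java. {{ToyDict}} and {{markPath}} are hypothetical names and not part of PDFBox; the flag only mirrors what {{setNeedToBeUpdated}} sets on the real objects.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: ToyDict is a hypothetical stand-in for a PDF dictionary.
class ToyDict {
    final Map<String, ToyDict> entries = new HashMap<>();
    boolean needToBeUpdated;

    // Marks the root and every dictionary reached by following the keys in
    // order. Stops at the first missing key. Returns how many were marked.
    static int markPath(ToyDict root, String... keys) {
        ToyDict current = root;
        current.needToBeUpdated = true;
        int marked = 1;
        for (String key : keys) {
            current = current.entries.get(key);
            if (current == null) {
                break;
            }
            current.needToBeUpdated = true;
            marked++;
        }
        return marked;
    }

    public static void main(String[] args) {
        ToyDict resources = new ToyDict();
        ToyDict font = new ToyDict();
        resources.entries.put("Font", font);

        // Marking only the outer dictionary leaves the nested one unmarked.
        resources.needToBeUpdated = true;
        System.out.println(font.needToBeUpdated);

        // Marking along the path reaches the value of Font as well.
        markPath(resources, "Font");
        System.out.println(font.needToBeUpdated);
    }
}
```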

{quote}Related-ish: a similar-ish thing happens when I add a new field to the 
document after the first signing. In this case, I can see some bits and pieces 
of the new field in the version change, but something about it is missing and 
it won't render the new field in Preview, etc.{quote}

Most likely you also forgot to mark some objects as changed here.




> Adding font (or changing font subset) not coming through in saveIncremental
> ---
>
> Key: PDFBOX-5583
> URL: https://issues.apache.org/jira/browse/PDFBOX-5583
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm, PDModel, Signing
>Affects Versions: 2.0.24
>Reporter: Chris Newhouse
>Priority: Major
> Attachments: update-after-signature-includes-font-change.pdf
>
>






[jira] [Commented] (PDFBOX-5583) Adding font (or changing font subset) not coming through in saveIncremental

2023-03-30 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17706957#comment-17706957
 ] 

Michael Klink commented on PDFBOX-5583:
---

{quote}I have included an example file where the field named 
`incrementalChangeField` goes from `NotoSans-Regular` to `NotoSansCJK-Regular` 
during the incremental change, but the font resource does not get added.{quote}

Indeed, the font resources of that field do not change. Most likely you forgot 
a {{setNeedToBeUpdated}} somewhere. You say you _have also tried to mark the 
document's PDResource as `setNeedToBeUpdated(true)` but to no avail._ It does 
not suffice to mark the overall *Resources* dictionary as changed; you also 
have to mark the dictionaries therein, in particular the value of *Font*.


> Adding font (or changing font subset) not coming through in saveIncremental
> ---
>
> Key: PDFBOX-5583
> URL: https://issues.apache.org/jira/browse/PDFBOX-5583
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm, PDModel, Signing
>Affects Versions: 2.0.24
>Reporter: Chris Newhouse
>Priority: Major
> Attachments: update-after-signature-includes-font-change.pdf
>
>






[jira] [Commented] (PDFBOX-5568) Document getting corrupted on adding Signed Attributes

2023-02-20 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691309#comment-17691309
 ] 

Michael Klink commented on PDFBOX-5568:
---

As already mentioned in a comment on your Stack Overflow question, your code 
completely ignores that the _signed attributes_ shall be 
_*signed* attributes_. Your code simply signs the plain document hash, not 
the to-be-signed attributes.

To fix this, you first need to make sure that the {{attrGen}} creates a 
message-digest attribute with the value of {{hashBytes}}, and then replace 
your {{ContentSigner nonSigner}} with a {{ContentSigner}} that actually signs 
the bytes it receives for signing.

Reading [RFC 5652|https://www.rfc-editor.org/rfc/rfc5652] may help you 
understand what needs to be done...

> Document getting corrupted on adding Signed Attributes
> --
>
> Key: PDFBOX-5568
> URL: https://issues.apache.org/jira/browse/PDFBOX-5568
> Project: PDFBox
>  Issue Type: Bug
>  Components: Signing
>Reporter: Piyush
>Priority: Major
>
> I am trying to digitally sign a document using *filter* 
> *_FILTER_ADOBE_PPKLITE_* and *subfilter* 
> {*}_SUBFILTER_ETSI_CADES_DETACHED_{*}. For {*}ETSI_CADES_Detached{*}, a 
> signing attribute needs to be added. I am fetching the signed hash and 
> certificates from CSC. But adding the signing attribute makes the document 
> corrupt. Below is a screenshot for reference. It seems like the hash is 
> getting changed.
> !https://i.stack.imgur.com/KKgRh.png!
>  
> *Code snippet for reference:*
> {code:java}
> PDDocument document = PDDocument.load(inputStream);
> outFile = File.createTempFile("signedFIle", ".pdf");
> Certificate[] certificateChain = //retrieve certificate chain from CSC integration
> setCertificateChain(certificateChain);
> // sign
> FileOutputStream output = new FileOutputStream(outFile);
> IOUtils.copy(inputStream, output);
> // create signature dictionary
> PDSignature signature = new PDSignature();
> int accessPermissions = SigUtils.getMDPPermission(document);
> if (accessPermissions == 1)
> {
> throw new IllegalStateException("No changes to the document are permitted due to DocMDP transform parameters dictionary");
> }
> signature.setFilter(PDSignature.FILTER_ADOBE_PPKLITE);
> signature.setSubFilter(PDSignature.SUBFILTER_ETSI_CADES_DETACHED);
> signature.setName("Test Name");
> signature.setLocation("Bucharest, RO");
> signature.setReason("PDFBox Signing");
> signature.setSignDate(Calendar.getInstance());
> Rectangle2D humanRect = new Rectangle2D.Float(location.getLeft(), 
> location.getBottom(), location.getRight(), location.getTop());
> PDRectangle rect = createSignatureRectangle(document, humanRect);
> SignatureOptions signatureOptions = new SignatureOptions();
> signatureOptions.setVisualSignature(createVisualSignatureTemplate(document, 
> 0, rect, signature));
> signatureOptions.setPage(0);
> document.addSignature(signature, signatureOptions);
> ExternalSigningSupport externalSigning =
> document.saveIncrementalForExternalSigning(output);
> InputStream content = externalSigning.getContent();
> CMSSignedDataGenerator gen = new CMSSignedDataGenerator();
> X509Certificate cert = (X509Certificate) certificateChain[0];
> gen.addCertificates(new JcaCertStore(Arrays.asList(certificateChain)));
> MessageDigest digest = MessageDigest.getInstance("SHA-256");
> // Use a buffer to read the input stream in chunks
> byte[] buffer = new byte[4096];
> int bytesRead;
> while ((bytesRead = content.read(buffer)) != -1) {
> digest.update(buffer, 0, bytesRead);
> }
> byte[] hashBytes = digest.digest();
> ESSCertIDv2 certid = new ESSCertIDv2(
> new AlgorithmIdentifier(new ASN1ObjectIdentifier("*")),
> MessageDigest.getInstance("SHA-256").digest(cert.getEncoded())
> );
> SigningCertificateV2 sigcert = new SigningCertificateV2(certid);
> final DERSet attrValues = new DERSet(sigcert);
> Attribute attr = new 
> Attribute(PKCSObjectIdentifiers.id_aa_signingCertificateV2, attrValues);
> ASN1EncodableVector v = new ASN1EncodableVector();
> v.add(attr);
> AttributeTable atttributeTable = new AttributeTable(v);
> //Create a standard attribute table from the passed in parameters - certhash
> CMSAttributeTableGenerator attrGen = new 
> DefaultSignedAttributeTableGenerator(atttributeTable);
> final byte[] signedHash = // Retrieve signed hash from CSC.
> ContentSigner nonSigner = new ContentSigner() {
> @Override
> public byte[] getSignature()
> { return signedHash; }
> @Override
> public OutputStream getOutputStream() {
> return new ByteArrayOutputStream();
> }
> @Override
> public AlgorithmIdentifier getAlgorithmIdentifier() {
> return new DefaultSignatureAlgorithmIdentifierFinder().find( "SHA256WithRSA" 
> );
> }
> };
> org.bouncycastle.asn1.x509.Certificate cert2 = 
> 

[jira] [Comment Edited] (PDFBOX-5561) qpdf shows warnings trying to linearize file modified by PDFBOX

2023-01-27 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17681394#comment-17681394
 ] 

Michael Klink edited comment on PDFBOX-5561 at 1/27/23 4:12 PM:


Hhmmm, the problem is that there is no entry for the xref stream object 12990 
in the xref stream. According to spec there must be an entry in the xrefs for 
the xref stream object, too.
{panel:title=ISO 32000-2 section 7.5.8.3 "Cross-reference stream data"}
Like any stream, a cross-reference stream shall be an indirect object. 
Therefore, an entry for it shall exist in either a cross-reference stream 
(usually itself) or in a cross-reference table (in hybrid-reference files; see 
7.5.8.4, "Compatibility with applications that do not support compressed 
reference streams").
{panel}
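
The numbers from the qpdf warning fit this explanation, as a small check in plain Java can illustrate. {{XrefEntryCheck}} is a hypothetical helper, not part of PDFBox or qpdf, and it assumes that the reported number of objects (12991) is the size declared for the cross-reference data while entries exist only for objects 0 to 12989.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class XrefEntryCheck {
    // Returns the object numbers in [0, size) that have no cross-reference entry.
    static List<Integer> missingEntries(int size, Set<Integer> entries) {
        List<Integer> missing = new ArrayList<>();
        for (int number = 0; number < size; number++) {
            if (!entries.contains(number)) {
                missing.add(number);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Entries for objects 0..12989, as implied by the qpdf warning.
        Set<Integer> entries = new TreeSet<>();
        for (int number = 0; number <= 12989; number++) {
            entries.add(number);
        }
        // Prints [12990]: the xref stream object itself has no entry.
        System.out.println(missingEntries(12991, entries));
    }
}
```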



> qpdf shows warnings trying to linearize file modified by PDFBOX
> ---
>
> Key: PDFBOX-5561
> URL: https://issues.apache.org/jira/browse/PDFBOX-5561
> Project: PDFBox
>  Issue Type: Bug
>  Components: Writing
>Affects Versions: 2.0.27
>Reporter: menteith85
>Priority: Minor
>
> I have a PDF file* that is generated by a software other than PDFBox. When 
> the PDF is modified by code given below using PDFBOX, *qpdf* shows the 
> following warning:
> {code:java}
> WARNING: modified.pdf: reported number of objects (12991) is not one plus the 
> highest object number (12989)
> qpdf: operation succeeded with warnings; resulting file may have some 
> problems{code}
> Note the warning is not shown when *qpdf* analyses the original PDF file 
> (i.e. the PDF not modified by PDFBox).
> Here's the code to modify PDF in question:
>  
> {code:java}
> for (final PDPage page: document.getPages()) {
>     page.getAnnotations().forEach(annotation -> {
>         if (annotation instanceof PDAnnotationLink link) {
>             final PDPageXYZDestination destination = new 
> PDPageXYZDestination();
>             destination.setPage(document.getPage(1));
>             final PDActionGoTo action = new PDActionGoTo();
>             action.setDestination(destination);
>             link.setAction(action);
>         }
>     });
> } {code}
>  
> I forgot to mention that the result file generated by PDFBox is almost 
> twice as big as the original one.
> *I've sent the file to Tilman Hausherr.






[jira] [Commented] (PDFBOX-5561) qpdf shows warnings trying to linearize file modified by PDFBOX

2023-01-27 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17681394#comment-17681394
 ] 

Michael Klink commented on PDFBOX-5561:
---

Hhmmm, the problem is that there is no entry for the xref stream object 12990 
in the xref stream. According to spec there must be an entry in the xrefs for 
the xref stream object, too.

> qpdf shows warnings trying to linearize file modified by PDFBOX
> ---
>
> Key: PDFBOX-5561
> URL: https://issues.apache.org/jira/browse/PDFBOX-5561
> Project: PDFBox
>  Issue Type: Bug
>  Components: Writing
>Affects Versions: 2.0.27
>Reporter: menteith85
>Priority: Minor
>






[jira] [Commented] (PDFBOX-5559) QPDF prints warnings about a PDF modified by PDFBOX

2023-01-23 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17679727#comment-17679727
 ] 

Michael Klink commented on PDFBOX-5559:
---

Just like [~tilman] says, the attached files refer to a garbage-in/garbage-out 
problem. The original file already has an error, which is also present in the 
result file.

The other issue mentioned, though,
{quote}qpdf complains also about a different pdf (copyrighted) modified in 
similar way. The warning reads:

{noformat}WARNING: file.pdf: reported number of objects (12991) is not one plus 
the highest object number (12989){noformat}{quote}
might point to an actual issue. If you find a PDF reproducing the issue you can 
share, please do so.

> QPDF prints warnings about a PDF modified by PDFBOX 
> 
>
> Key: PDFBOX-5559
> URL: https://issues.apache.org/jira/browse/PDFBOX-5559
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.27
>Reporter: menteith85
>Priority: Minor
> Attachments: sample.pdf, sample_modified.pdf, screenshot-1.png, 
> screenshot-2.png
>
>
> Hi!
> I created a sample PDF file with PDAnnotationLink using PDFBox. Then I 
> changed action from PDActionURI to PDActionGoTo. The modified pdf is 
> correctly working in Okular (a pdf viewer for Linux) but *qpdf* (version 
> 11.2.0) emits the following warning:
> {code:java}
> ❯ qpdf --linearize --replace-input sample_modified.pdf 
> WARNING: sample_modified.pdf, object 2 0 at offset 88: kid 1 (from 0) appears 
> more than once in the pages tree; creating a new page object as a copy 
> qpdf: there are warnings; original file kept in 
> sample_modified.pdf.~qpdf-orig 
> qpdf: operation succeeded with warnings; resulting file may have some 
> problems{code}
> Please find below the code I used to modify pdf. I can also provide code to 
> create that pdf if needed.
> {code:java}
> final PDPage page = doc.getPage(0);
> final PDPageXYZDestination destination = new PDPageXYZDestination();
> destination.setPage(page);
> final PDActionGoTo action = new PDActionGoTo();
> action.setDestination(destination);
> final PDAnnotationLink annotationLink = new PDAnnotationLink();
> annotationLink.setAction(action);
> float X_MARGIN_LEFT = 50F;
> float BOX_WIDTH = 240F;
> float TEXT_LINE_HEIGHT = 14F;
> final PDRectangle position = new PDRectangle();
> final int x = 120;
> final int y = 120;
> position.setLowerLeftX(x);
> position.setLowerLeftY(y);
> position.setUpperRightX(X_MARGIN_LEFT + BOX_WIDTH);
> position.setUpperRightY(y + TEXT_LINE_HEIGHT);
> annotationLink.setRectangle(position);
> page.setAnnotations(List.of(annotationLink));
> doc.save("sample_modified.pdf");{code}
>  
> *qpdf* complains also about a different pdf (copyrighted) modified in similar 
> way. The warning reads:
> {code:java}
> WARNING: file.pdf: reported number of objects (12991) is not one plus the 
> highest object number (12989){code}






[jira] [Commented] (PDFBOX-5556) The font name displayed in the exported PDF is incorrect

2023-01-03 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653929#comment-17653929
 ] 

Michael Klink commented on PDFBOX-5556:
---

The embedded subset font does not include a *name* table.
Maybe the Acrobat Edit tool guesses a name when that table is missing, and can 
be persuaded to use a specific name by adding the table with the desired name 
entries.

> The font name displayed in the exported PDF is incorrect
> 
>
> Key: PDFBOX-5556
> URL: https://issues.apache.org/jira/browse/PDFBOX-5556
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.24
>Reporter: bai yuan
>Priority: Major
> Attachments: Swiss 721 Bold BT.ttf, fontName.png, screenshot-1.png, 
> screenshot-2.png, test.pdf
>
>
> Load the attached TTF font and save the document:
> {code:java}
> PDDocument doc = new PDDocument();
> PDPage page = new PDPage();
> doc.addPage(page);
> PDPageContentStream stream = new PDPageContentStream(doc, page);
> TrueTypeFont ttFont = new TTFParser().parse("resources//fonts//Swiss 721 Bold BT.ttf");
> PDFont font = PDType0Font.load(doc, ttFont, true);
> stream.setFont(font, 14);
> stream.beginText();
> stream.newLineAtOffset(100, 700);
> stream.setNonStrokingColor(Color.BLACK);
> String text = "Lazy dog";
> stream.showText(text);
> stream.endText();
> stream.stroke();
> stream.close();
> doc.save("test.pdf");
> doc.close(); {code}
> Open the exported document with Adobe Acrobat Pro, the font name is 
> incorrect. It should be "Swis721 BT", but it is "Swiss 72 1 BT". 
> !fontName.png!






[jira] [Commented] (PDFBOX-5549) Invisible signature field is not referenced from /Annots dictionary of a Page

2022-12-05 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643331#comment-17643331
 ] 

Michael Klink commented on PDFBOX-5549:
---

I'm a bit late to the party but as [~bsanchezb] asked, I'll add my 2c anyways, 
essentially summing up what has already been said or implied ;) .

If a signature has a widget annotation, that widget _may_ have a *P* entry 
pointing to the page the widget is associated with. If it has such an entry, 
the *Annots* of that page _must_ point back to the widget. 

Thus, the prior PDFBox behavior was wrong. It could have been fixed by not 
adding a *P* entry to the widgets of invisible signatures or by adding the 
widget to the *Annots* of the page.

The latter option, which has been implemented, might be slightly better for 
interoperability.

But indeed, software must not rely on a signature field being referenced from 
a page in order to find it; it must look into the *AcroForm* form definition.
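
The rule about the *P* entry and the page's *Annots* can be written down as a small consistency check. The following is only a sketch over a simplified in-memory model; {{Page}}, {{Widget}} and {{findDanglingWidgets}} are hypothetical names made up for illustration and are not PDFBox API.

```java
import java.util.ArrayList;
import java.util.List;

final class WidgetPageCheckSketch {
    /** Simplified page: only its Annots array is modelled. */
    static final class Page {
        final List<Widget> annots = new ArrayList<>();
    }

    /** Simplified widget annotation; p is null when the optional P entry is absent. */
    static final class Widget {
        final String fieldName;
        final Page p;

        Widget(String fieldName, Page p) {
            this.fieldName = fieldName;
            this.p = p;
        }
    }

    /**
     * Takes the widgets found via the form definition (not via the pages) and
     * returns those that have a P entry whose page does not point back to them.
     */
    static List<Widget> findDanglingWidgets(List<Widget> widgetsFromAcroForm) {
        List<Widget> dangling = new ArrayList<>();
        for (Widget w : widgetsFromAcroForm) {
            // A widget without P is fine; a widget with P must be listed in that page's Annots.
            if (w.p != null && !w.p.annots.contains(w)) {
                dangling.add(w);
            }
        }
        return dangling;
    }
}
```

Note that the check starts from the form fields, so it also sees widgets that no page references at all.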


> Invisible signature field is not referenced from /Annots dictionary of a Page
> -
>
> Key: PDFBOX-5549
> URL: https://issues.apache.org/jira/browse/PDFBOX-5549
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Signing
>Affects Versions: 2.0.27
>Reporter: Aleksandr Beliakov
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 2.0.28, 3.0.0 PDFBox
>
> Attachments: screenshot-1.png, signed.pdf
>
>
> Hello,
>  
> Recently we received a complaint that a reference to a newly created 
> signature field is not added to the /Annots array of the page dictionary.
> After analyzing the code, we found that the PDFBox dependency used in our 
> project skips adding an invisible signature field to the page dictionary. 
> See 
> [PDDocument.java#L455:|https://github.com/apache/pdfbox/blob/2.0.27/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDDocument.java#L455]
> {code:java}
> if (visualSignature == null) 
> {
>     prepareNonVisibleSignature(firstWidget);
>     return;
> } {code}
> whereas for a visible signature the signature widget is added to the given 
> page afterwards.
>  
> After analyzing ISO 32000-1/2 I was not able to determine the expected 
> behavior for invisible signatures. While the _/Annots_ array within a page 
> dictionary is optional and shall contain references to annotations associated 
> with a page, the chapter "12.5.2 Annotation dictionaries" also says "{+}_A 
> given annotation dictionary shall be referenced from the Annots array of only 
> one page._{+}", which is also ambiguous.
> After checking the [OpenPDF|https://github.com/LibrePDF/OpenPDF] library, it 
> seems that it associates an invisible signature field with the first page 
> explicitly by providing the reference within the /Annots array.
>  
> Could you please explain the rationale for not adding the invisible 
> signature field to a page's /Annots array and confirm whether this behavior 
> is correct?
>  
> Thank you!
>  
> Best regards,
> Aleksandr.



[jira] [Commented] (PDFBOX-5530) Java heap space

2022-10-26 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17624344#comment-17624344
 ] 

Michael Klink commented on PDFBOX-5530:
---

{quote}Parsing such files seems to be an attack{quote}
I doubt it's an attack. In particular I doubt it's an attack to prevent 
_arbitrary loading_ by causing out-of-memory situations.

I think it's more likely that the creator of this document attempted to prevent 
_text and bitmap extraction_. Text extraction is made difficult by drawing the 
characters using vector graphics paths instead of using fonts with the side 
effect of gigantic content streams. And bitmap extraction is made difficult by 
partitioning the bitmaps (of official looking stamps) into thousands of mini 
parts, resulting in the thousands and thousands of tiny bitmap images.

> Java heap space
> ---
>
> Key: PDFBOX-5530
> URL: https://issues.apache.org/jira/browse/PDFBOX-5530
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.25
>Reporter: liu
>Priority: Blocker
> Attachments: image-2022-10-20-14-30-19-790.png, 
> image-2022-10-20-14-30-57-332.png, image-2022-10-20-14-32-10-258.png, 
> image-2022-10-20-15-01-06-688.png, image-2022-10-20-19-07-42-632.png, 
> image-2022-10-20-19-08-23-932.png, screenshot-1.png, 引起宕机-1.pdf, 引起宕机.pdf
>
>
> Code (only this part of the code):
> PDDocument load = PDDocument.load(file, 
> MemoryUsageSetting.setupTempFileOnly(-1));
>  
> Hi. Why does it still take up so much memory even though I configure it 
> like this? What is the effect of using setupTempFileOnly?
> !image-2022-10-20-14-30-19-790.png!
> !image-2022-10-20-14-30-57-332.png!
> !image-2022-10-20-14-32-10-258.png!
> [^引起宕机.pdf]



[jira] [Commented] (PDFBOX-5533) Store password from PDF document in a byte array

2022-10-25 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17623950#comment-17623950
 ] 

Michael Klink commented on PDFBOX-5533:
---

If you use {{byte}} arrays, then the users have to do the conversion from 
{{String}} themselves.

This sounds trivial but it is not: The exact conversion to apply depends on the 
encryption algorithm used. For example, for the current (revision 6 as defined 
in ISO 32000-2) encryption, _the UTF-8 password string shall be generated from 
Unicode input by processing the input string with the SASLprep (Internet RFC 
4013) profile of stringprep (Internet RFC 3454) using the Normalize and BiDi 
options, and then converting to a UTF-8 representation._

I doubt most users will follow that routine; most will simply call 
{{getBytes()}} and run into errors in internationalized contexts.

Switching from {{String}} to {{char[]}}, on the other hand, would leave the 
conversion to bytes in PDFBox, allowing for proper conversion.
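
To illustrate why the conversion is not trivial, here is a minimal sketch using only the JDK. It shows that a naive {{getBytes}} call yields different bytes for two visually identical passwords, while a normalization step makes them agree. Note the assumptions: the class and method names are made up, NFKC normalization is only one step of SASLprep and not the complete profile, and this is not what PDFBox does internally.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

final class PasswordBytesSketch {
    /** Naive conversion: encodes the string exactly as typed, without normalization. */
    static byte[] naiveUtf8(String password) {
        return password.getBytes(StandardCharsets.UTF_8);
    }

    /** NFKC-normalizes first. This is only ONE step of SASLprep, not the full profile. */
    static byte[] nfkcUtf8(String password) {
        return Normalizer.normalize(password, Normalizer.Form.NFKC)
                .getBytes(StandardCharsets.UTF_8);
    }

    /** Encodes a char[] to UTF-8 without creating a String, then wipes the buffers. */
    static byte[] utf8AndWipe(char[] password) {
        ByteBuffer encoded = StandardCharsets.UTF_8.encode(CharBuffer.wrap(password));
        byte[] result = new byte[encoded.remaining()];
        encoded.get(result);
        // Overwrite the caller's array and the encoder's backing array.
        Arrays.fill(password, '\0');
        if (encoded.hasArray()) {
            Arrays.fill(encoded.array(), (byte) 0);
        }
        return result;
    }
}
```

With a {{char[]}} based API the library can apply whichever preparation the encryption revision requires and the caller can still wipe the password afterwards; with a {{byte[]}} based API the caller would have to get the preparation right.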

> Store password from PDF document in a byte array
> 
>
> Key: PDFBOX-5533
> URL: https://issues.apache.org/jira/browse/PDFBOX-5533
> Project: PDFBox
>  Issue Type: Improvement
>Affects Versions: 2.0.27
>Reporter: Aleksandr Beliakov
>Priority: Minor
>
> Hello,
>  
> I would like to propose a security improvement regarding storing and handling 
> a provided user-password when opening a protected PDF document.
> Currently the class 
> [COSParser|https://github.com/apache/pdfbox/blob/2.0.27/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java#L98]
>  stores the password as a String object, which is not the best practice.
> The problem is that sensitive data (such as passwords) stored in memory can 
> be leaked if it is stored in a managed String object. String objects are not 
> pinned, so the garbage collector can relocate these objects at will and leave 
> several copies in memory. These objects are not encrypted by default, so 
> anyone that can read the process' memory will be able to see the contents. 
> Furthermore, if the process' memory gets swapped out to disk, the unencrypted 
> contents of the string will be written to a swap file. Lastly, since String 
> objects are immutable, removing the value of a String from memory can only be 
> done by the JVM garbage collector.
>  
> Therefore, it would be preferable to handle all user-passwords as a byte[] or 
> char[] array instead of String, which can be cleaned after the use. You may 
> also see that when passing a password to JDK classes, the password is 
> converted to an array of characters (e.g. 
> [here|https://github.com/apache/pdfbox/blob/2.0.27/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java#L2979]).
>  
> To avoid unnecessary transformations and improve security, it would be 
> good to handle all passwords as an array starting from the 
> [PDDocument.load(...)|https://github.com/apache/pdfbox/blob/2.0.27/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDDocument.java#L1030]
>  method(s).
>  
> For backward compatibility, you may keep the old constructors and methods.
>  
> Thank you for your nice job!
>  
> Best regards,
> Aleksandr.



[jira] [Commented] (PDFBOX-5532) COSString field non-ascii characters

2022-10-25 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17623945#comment-17623945
 ] 

Michael Klink commented on PDFBOX-5532:
---

For a general solution you indeed need to know the encoding associated with the 
font in question. I.e. you also have to keep track of the *Tf* commands and 
look up the associated font resource in the page resources. And you also have 
to handle *q* and *Q* accordingly. Also don't forget that there are not only 
page content streams; text can also be in form XObjects or Patterns referred to 
from there.

Also you may have to deal with fonts whose encoding does not really help, in 
particular encodings with non-standard names for glyphs and *Identity-H* and 
*Identity-V*.

So this really is no trivial problem to solve in general.

If you can be sure that in your use case all PDFs are generated by the same 
software, consider analyzing the PDF contents and determining whether 
the text drawing in them allows for some short cuts in your task...
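
As a rough illustration of the bookkeeping described above, the following sketch tracks the current font through *Tf*, *q* and *Q* over a flat token list. It is a simplification under stated assumptions: tokens are plain strings rather than PDFBox objects, the class and method names are hypothetical, and form XObjects, Patterns and fonts set by other means are not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class FontTrackingSketch {
    /** Marker for "no font selected yet" (ArrayDeque does not accept null). */
    static final String NO_FONT = "";

    /** Returns, for each Tj/TJ operator in the token list, the font name current at that point. */
    static List<String> fontsAtShowText(List<String> tokens) {
        Deque<String> savedStates = new ArrayDeque<>();
        String current = NO_FONT;
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            switch (tokens.get(i)) {
                case "Tf":
                    // Operands are "/Name size Tf", so the font name is two tokens back.
                    if (i >= 2) {
                        current = tokens.get(i - 2);
                    }
                    break;
                case "q":
                    savedStates.push(current); // save graphics state
                    break;
                case "Q":
                    if (!savedStates.isEmpty()) {
                        current = savedStates.pop(); // restore graphics state
                    }
                    break;
                case "Tj":
                case "TJ":
                    result.add(current);
                    break;
                default:
                    break; // operands and all other operators are ignored here
            }
        }
        return result;
    }
}
```

Only with the font name known for each text-showing operator can the string bytes be decoded with the right encoding.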

> COSString field non-ascii characters
> 
>
> Key: PDFBOX-5532
> URL: https://issues.apache.org/jira/browse/PDFBOX-5532
> Project: PDFBox
>  Issue Type: Bug
>Reporter: David
>Priority: Major
>
>  
> Hello,
> I am reading a pdf document but in the COSString field non-ascii characters 
> are being retrieved. What can be the motive? I am using version 
> pdfbox-2.0.24.jar
> This would be an example of the pdf document parsed:
> COSInt\{50} 
> COSInt\{0} 
> PDFOperator\{Td} 
> COSString\{åÅÕãÁâ@} 
> PDFOperator\{Tj} 
> COSFloat\{770.18} 
> COSInt\{0} 
> PDFOperator\{Td} 
> COSString\{×–Ž–©@} 
> PDFOperator\{Tj} 
> COSFloat\{520.21} 
> COSInt\{0}
> Java function:
> public static PDDocument replaceText(PDDocument document, String searchString, String replacement) throws IOException {
>     PDPageTree pages = document.getDocumentCatalog().getPages();
>     for (PDPage page : pages) {
>         PDFStreamParser parser = new PDFStreamParser(page);
>         parser.parse();
>         List<Object> tokens = parser.getTokens();
>         for (int j = 0; j < tokens.size(); j++) {
>             Object next = tokens.get(j);
>             if (next instanceof Operator) {
>                 Operator op = (Operator) next;
>                 if (op.getName().equals("Tj")) {
>                     COSString previous = (COSString) tokens.get(j - 1);
>                     String string = previous.getString();
>                     System.out.println("previous:=" + string);
>                     if (string.equals(searchString)) {
>                         COSString sx = new COSString(replacement);
>                         previous.setValue(sx.getBytes());
>                     }
>                 }
>             }
>         }
>         // now that the tokens are updated we will replace the page content stream.
>         PDStream updatedStream = new PDStream(document);
>         OutputStream out = updatedStream.createOutputStream();
>         ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
>         tokenWriter.writeTokens(tokens);
>         page.setContents(updatedStream);
>         out.close();
>     }
>     return document;
> }
>



[jira] [Commented] (PDFBOX-5532) COSString field non-ascii characters

2022-10-24 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17623319#comment-17623319
 ] 

Michael Klink commented on PDFBOX-5532:
---

{quote}I am reading a pdf document but in the COSString field non-ascii 
characters are being retrieved. What can be the motive?{quote}

Please be aware that the encoding of strings in content streams can be 
completely arbitrary and is defined by whichever font is current at that point.

Your {{replaceText}} method makes very specific assumptions which hold only 
in simple PDFs.
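
A small sketch of what "defined by the font" means in practice: the same bytes give different text depending on the code table used. The table below is hypothetical and hand-made for illustration (it happens to follow EBCDIC code points); whether the font in the reporter's PDF uses such a mapping is not established in this thread. In a real PDF the table would come from the font's encoding information.

```java
import java.util.HashMap;
import java.util.Map;

final class FontEncodingSketch {
    /** Decodes single-byte character codes with a per-font code-to-Unicode table. */
    static String decode(byte[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // bytes are signed in Java
            sb.append(toUnicode.getOrDefault(code, "\uFFFD")); // replacement char if unmapped
        }
        return sb.toString();
    }

    /** Hypothetical table, for illustration only. */
    static Map<Integer, String> hypotheticalTable() {
        Map<Integer, String> m = new HashMap<>();
        m.put(0x40, " ");
        m.put(0xC1, "A");
        m.put(0xC5, "E");
        m.put(0xD5, "N");
        m.put(0xE2, "S");
        m.put(0xE3, "T");
        m.put(0xE5, "V");
        return m;
    }
}
```

For the bytes {{E5 C5 D5 E3 C1 E2 40}}, a Latin-1 style interpretation gives the string shown in the report, while the hypothetical table gives readable text. Comparing the decoded string with a search string only makes sense after the font-specific decoding, and a replacement has to be encoded back with the same table.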




[jira] [Commented] (PDFBOX-5528) PDF/UA: Add marked content sections when flattening acro forms

2022-10-20 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621302#comment-17621302
 ] 

Michael Klink commented on PDFBOX-5528:
---

Hhmmm, I just skimmed the PDF spec on tagging. It looks like one does _not_ 
need much extra information *if the original PDF is tagged very well*. On the 
other hand, if the original tags in the PDF are not so good, the result after 
flattening may be horrible for a user requiring accessibility information.

> PDF/UA: Add marked content sections when flattening acro forms
> --
>
> Key: PDFBOX-5528
> URL: https://issues.apache.org/jira/browse/PDFBOX-5528
> Project: PDFBox
>  Issue Type: Improvement
>  Components: AcroForm
>Reporter: Andre Wachsmuth
>Priority: Minor
> Attachments: correct.png, wrong.png
>
>
> We need to support PDF/UA compliant documents to some extent. I noticed that 
> when we take a PDF/UA compliant PDF document and flatten it via 
> PDAcroForm#flatten, the resulting output is not PDF/UA compliant anymore.
> After a little bit of research, the problem is that PDFBox creates Do 
> operators with paths representing the appearance of the form fields. 
> According to the PDF/UA standard, such paths need to be enclosed in marked 
> content sections (BMC ... EMC, BDC ... EMC, see attached images).
> By copying some code from PDAcroForm#flatten and adding 
> contentStream.beginMarkedContent and contentStream.endMarkedContent myself, I 
> can work around the problem, but that's less than ideal; it would be great if 
> this could be included in PDFBox.
>  
> {code:java}
> public void flatten(List<PDField> fields, boolean refreshAppearances) throws IOException
> {
>     // ...
>     final var dict = new COSDictionary();
>     dict.setLong(COSName.MCID, mcid);
>     dict.setItem(COSName.BBOX, bBox);
>     dict.setItem(COSName.TYPE, COSName.BACKGROUND);
>     final var propList = PDPropertyList.create(dict);
>     contentStream.beginMarkedContent(COSName.ARTIFACT, propList);
>     contentStream.saveGraphicsState();
>     // see https://stackoverflow.com/a/54091766/1729265 for an explanation
>     // of the steps required
>     // this will transform the appearance stream form object into the
>     // rectangle of the annotation bbox and map the coordinate systems
>     final var transformationMatrix = pdfbox_resolveTransformationMatrix(form, annotation, appearanceStream);
>     contentStream.transform(transformationMatrix);
>     contentStream.drawForm(fieldObject);
>     contentStream.restoreGraphicsState();
>     contentStream.endMarkedContent();
>     // ...
> }{code}
>  

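For readers unfamiliar with the operators involved, the content stream fragment that such a workaround aims to produce can be sketched as plain text. This is only an illustration under assumptions: the class and method names are made up, the simpler BMC form without a property list is used instead of the BDC form from the report, and whether marking flattened fields as artifacts is semantically appropriate depends on the document.

```java
import java.math.BigDecimal;

final class ArtifactWrapSketch {
    /** Builds the operators that draw a form XObject inside an /Artifact marked-content section. */
    static String wrapAsArtifact(String xObjectName, double[] matrix) {
        if (matrix.length != 6) {
            throw new IllegalArgumentException("a transformation matrix needs 6 values");
        }
        StringBuilder sb = new StringBuilder();
        sb.append("/Artifact BMC\n");   // begin marked content
        sb.append("q\n");               // save graphics state
        for (double v : matrix) {
            sb.append(number(v)).append(' ');
        }
        sb.append("cm\n");              // map the form into the annotation rectangle
        sb.append('/').append(xObjectName).append(" Do\n");
        sb.append("Q\n");               // restore graphics state
        sb.append("EMC\n");             // end marked content
        return sb.toString();
    }

    /** Plain decimal notation; PDF numbers must not use exponent notation. */
    private static String number(double v) {
        return BigDecimal.valueOf(v).stripTrailingZeros().toPlainString();
    }
}
```

For example, {{wrapAsArtifact("Fm0", new double[] {1, 0, 0, 1, 99.325, 547.72})}} yields the lines {{/Artifact BMC}}, {{q}}, {{1 0 0 1 99.325 547.72 cm}}, {{/Fm0 Do}}, {{Q}}, {{EMC}}.
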


[jira] [Commented] (PDFBOX-5530) Java heap space

2022-10-20 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621283#comment-17621283
 ] 

Michael Klink commented on PDFBOX-5530:
---

Thousands and thousands of tiny bitmap images. Gigantic content streams.

This is a very uncommon PDF internally... optimizing resource usage while still 
remaining performant would be quite a challenge.

{quote}Can this hashmap be changed to soft reference or weak reference? Like 
WeakHashMap or ConcurrentReferenceHashMap.{quote}

PDFBox 2.x is based on an architecture that requires all objects in the PDF to 
be parsed and represented in memory, so "no".

You can try PDFBox 3 which offers just-in-time loading. Unfortunately it also 
requires all loaded objects to remain in memory, so if your processing 
eventually touches most of the PDF, the resource requirement eventually will be 
the same. So even there "no".

A mode that allows loaded but currently unused objects to be freed again (which 
would allow for a "yes") is not yet implemented in the mainstream PDFBox.
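
To make clear what such a mode would need, here is a generic sketch of a cache whose entries may be dropped and reloaded on demand. It is not PDFBox code; the class name and the {{simulateCollection}} helper are hypothetical. The key precondition is the loader: an entry can only be held softly if it can be re-read from the source at any time, which also means that modified objects could not simply be dropped.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class ReloadingCacheSketch<K, V> {
    private final Map<K, SoftReference<V>> cache = new HashMap<>();
    private final Function<K, V> loader;
    private int loadCount;

    ReloadingCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, or reloads it if it was never loaded or has been freed. */
    V get(K key) {
        SoftReference<V> ref = cache.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = loader.apply(key);
            loadCount++;
            cache.put(key, new SoftReference<>(value));
        }
        return value;
    }

    /** Test helper: clears the reference the way the garbage collector might under memory pressure. */
    void simulateCollection(K key) {
        SoftReference<V> ref = cache.get(key);
        if (ref != null) {
            ref.clear();
        }
    }

    int loadCount() {
        return loadCount;
    }
}
```

Simply swapping a {{HashMap}} for a {{WeakHashMap}} does not provide the reloading part, which is why the answer above is "no".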




[jira] [Commented] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-20 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621240#comment-17621240
 ] 

Michael Klink commented on PDFBOX-5529:
---

{quote}I looked for *ActualText* information, but I didn't find any tag like 
this in the PDF content.{quote}
Then please share the PDF for further analysis.
While you're right that in case of your document the text extraction result 
would improve by _not_ trying to identify gaps, in general one needs this gap 
detection.
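
To show how such gap detection can lead to the reported output, here is a simplified sketch based on the numbers in a *TJ* array. A number in the array shifts the next glyph by {{-number / 1000 * fontSize}}, so negative numbers widen the gap. Assumptions: the class and method names are made up, horizontal scaling is 100% with no extra character spacing, and the threshold of 1.0 units is hand-picked for illustration; the heuristics in {{PDFTextStripper}} are more involved than this.

```java
final class GapDetectionSketch {
    /** Extra horizontal displacement caused by a TJ number. */
    static double tjDisplacement(double adjustment, double fontSize) {
        return -adjustment / 1000.0 * fontSize;
    }

    /**
     * Joins glyphs, inserting a space wherever the extra gap exceeds the threshold.
     * adjustments[i] is the TJ number between glyphs[i] and glyphs[i + 1].
     */
    static String joinWithGaps(String[] glyphs, double[] adjustments, double fontSize, double threshold) {
        if (glyphs.length == 0) {
            return "";
        }
        if (adjustments.length != glyphs.length - 1) {
            throw new IllegalArgumentException("need one adjustment between each pair of glyphs");
        }
        StringBuilder sb = new StringBuilder(glyphs[0]);
        for (int i = 1; i < glyphs.length; i++) {
            if (tjDisplacement(adjustments[i - 1], fontSize) > threshold) {
                sb.append(' '); // gap is wide enough to be taken for a word break
            }
            sb.append(glyphs[i]);
        }
        return sb.toString();
    }
}
```

Applied to {{[ (D) 22 (a) -131 (t) -109 (e) ] TJ}} at font size 8.8, the gaps are about -0.19, 1.15 and 0.96 units, so with the assumed threshold only the second gap counts as a space and the result is "Da te". The same heuristic is what finds real word breaks in PDFs that position words without space characters.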

> Wrong Text Extraction - Unwanted Extra Spaces in the middle of words
> 
>
> Key: PDFBOX-5529
> URL: https://issues.apache.org/jira/browse/PDFBOX-5529
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.0.5, 2.0.6, 2.0.7, 
> 2.0.8, 2.0.9, 2.0.10, 2.0.11, 2.0.12, 2.0.13, 2.0.14, 2.0.15, 2.0.16, 2.0.17, 
> 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22, 2.0.23, 2.0.24, 2.0.25, 2.0.26, 2.0.27
>Reporter: Carlos Alfonso Maya
>Priority: Major
> Attachments: image-2022-10-18-15-53-06-512.png, 
> image-2022-10-18-16-23-00-123.png, image-2022-10-18-16-26-15-001.png, 
> image-2022-10-19-16-48-36-198.png
>
>
> *Overview:* 
> We are using PDFBOX as a third party API to extract text from financial PDF 
> documents.
> We have been using PDFBox for a long time, and we have detected a 
> problem with bad text extraction from a customer's PDFs. 
> Since we work with customer data we cannot share the PDFs; besides, they are 
> signed and we cannot even edit them.
> *Description of the problem:*
> By opening the PDF in Adobe Reader we can see several cases like the 
> following screenshot:
> !image-2022-10-18-15-53-06-512.png|width=221,height=211!
> Visually it appears to have spaces between words, but if we copy the text 
> from Adobe Reader and paste it into a text editor there is no extra spaces. 
> The following is the output that PDFBOX generates at the moment of doing text 
> extraction:
> {code:java}
> Da te
> In v oice number
> Ou r r eference
> You r reference
> Con tact person{code}
> (!) *Important note: this behavior is present in all the versions of PDFBox.*
> *Analysis:*
> By downloading the PDFBox source code 2.0.27 (this was checked as well in 
> 2.0.26, 2.0.25 and 2.0.24) and testing/debugging we found that the method 
> _*writePage()* inside *PDFTextStripper.java*_ declares a list of objects:
> {code:java}
> List<LineItem> line = new ArrayList<LineItem>();{code}
> The code subsequently adds elements to the list:
> {code:java}
> line.add(LineItem.getWordSeparator()); 
> .
> .
> .
> line.add(new LineItem(position));{code}
>  
> And at some point it passes the list as a parameter into the following 
> statement:
> {code:java}
> writeLine(normalize(line));{code}
> (!) *The important point about this list called "line" is that somehow the 
> "LineItem" objects have NULL values in them, and these values 
> are at some point interpreted as "blank spaces", causing the behavior 
> described above.*
> Here is a screenshot of how it is shown in the debugger:
> !image-2022-10-18-16-23-00-123.png|width=621,height=195!
> !image-2022-10-18-16-26-15-001.png|width=620,height=431!
>  
> We tried to look for a method that manipulates this list and that we can 
> override, but all of the methods that modify or access the list are 
> protected.
>  
> (!) *This is an example of how it is displayed in the PDF Debugger:*
> {code:java}
>     q
>       94.525 545.32 141 11.2 re
>       W*
>       n
>       BT
>         /F3 8.8 Tf
>         1 0 0 1 99.325 547.72 Tm
>         0 g
>         0 G
>         [ (D) 22 (a) -131 (t) -109 (e) ] TJ
>       ET
>     Q 
>     q
>       94.525 530.9 141 11.225 re
>       W*
>       n
>       BT
>         /F3 8.8 Tf
>         1 0 0 1 99.325 533.3 Tm
>         0 G
>         [ (I) 26 (n) -135 (v) -229 (o) -5 (i) 20 (ce) -62 ( ) 59 (n) -44 (u) 30 (m) -27 (b) -75 (e) 28 (r) ] TJ
>       ET
>     Q
>     q
>       94.525 516.5 141 11.2 re
>       W*
>       n
>       BT
>         /F3 8.8 Tf
>         1 0 0 1 99.325 519.7 Tm
>         0 G
>         [ (O) -73 (u) -151 (r) -44 ( ) 59 (r) -134 (e) 28 (f) -38 (e) 28 (r) -44 (e) 28 (n) -44 (ce) ] TJ
>       ET
>     Q{code}
>  
>  



[jira] [Commented] (PDFBOX-5528) PDF/UA: Add marked content sections when flattening acro forms

2022-10-20 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620899#comment-17620899
 ] 

Michael Klink commented on PDFBOX-5528:
---

Well, when flattening form fields into the static content and trying to 
integrate them into the existing structure tree, one strictly speaking would 
need to know more details of how the flattened content _semantically_ fits in.

At least one needs that for a good tagging result. For a tagging result that 
merely passes automated tests it is not necessary, but human users relying on 
the document's accessibility information may well complain.




[jira] [Commented] (PDFBOX-5529) Wrong Text Extraction - Unwanted Extra Spaces in the middle of words

2022-10-19 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620474#comment-17620474
 ] 

Michael Klink commented on PDFBOX-5529:
---

Looking at the screen shot it is clear why an extractor would add those spaces, 
after all you say yourself:
{quote}Visually it appears to have spaces between words,{quote}
And you only wonder why the spaces are there after observing
{quote}but if we copy the text from Adobe Reader and paste it into a text 
editor there is no extra spaces.{quote}
Please be aware that Adobe Acrobat also takes tagging information into account; 
if there is *ActualText* information, Acrobat uses it instead of heuristics 
based on the appearance. PDFBox on the other hand does not use the tagging 
information in its text stripper.

Thus, please check whether your example file has such tags or not. The easiest 
option would be for you to share the file (or at least a page of it with that 
behavior).






[jira] [Commented] (PDFBOX-5521) Signing tries to set byteRange of old signature

2022-09-28 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610436#comment-17610436
 ] 

Michael Klink commented on PDFBOX-5521:
---

{quote}COSWriter code hits an "old" signature that is (for whatever reason, 
maybe it was incorrectly included) present in the incremental part.{quote}

Most likely the problem signature in question is the usage rights signature. 
(The message "This document enabled extended features in Adobe Acrobat Reader." 
indicates that there is a usage rights signature in the PDF in question.)

In contrast to other signatures, a usage rights signature dictionary need not 
be an indirect object, it may be a direct object in the *Perms* dictionary 
which in turn may be a direct object in the catalog dictionary. Thus, such a 
usage rights signature may occur again and again in each incremental update 
touching the catalog.

In particular, such a recurring usage rights signature is not _incorrectly 
included_ and the PDFBox signing code must be able to recognize that its 
signature dictionary is not the dictionary of the signature field currently 
being signed.

> Signing tries to set byteRange of old signature
> ---
>
> Key: PDFBOX-5521
> URL: https://issues.apache.org/jira/browse/PDFBOX-5521
> Project: PDFBox
>  Issue Type: Bug
>  Components: Signing
>Affects Versions: 2.0.27
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.28, 3.0.0 PDFBox
>
>
> A long bug report on the users mailing lists leads to the finding that the 
> COSWriter code hits an "old" signature that is (for whatever reason, maybe it 
> was incorrectly included) present in the incremental part. The signing then 
> fails because the byte range to be written is longer than the existing byte 
> range.
> To avoid this, we improve signature detection by checking that the size 
> indicated by byteRange is higher than the existing PDF size.



-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] (PDFBOX-5288) Valid PDF/A 1B is rejected

2022-09-27 Thread Michael Klink (Jira)


[ https://issues.apache.org/jira/browse/PDFBOX-5288 ]


Michael Klink deleted comment on PDFBOX-5288:
---

was (Author: mkl):
Do you get the identical errors as the OP got? Or merely similar ones? I'd 
think it unlikely that the identical object numbers would be reported...

> Valid PDF/A 1B is rejected
> --
>
> Key: PDFBOX-5288
> URL: https://issues.apache.org/jira/browse/PDFBOX-5288
> Project: PDFBox
>  Issue Type: Bug
>  Components: Preflight
>Affects Versions: 2.0.21, 2.0.24
> Environment: Java 1.8
>Reporter: Manuel
>Priority: Major
> Attachments: pdfa1a.pdf
>
>
> When we try to validate a PDF/A 1B file, we get a "not valid" result with 
> these error messages:
>  * 7.3 - error on metadata, schema is not set in this document : 
> http://ns.adobe.com/xap/1.0/stype/resourceevent#||
>  * 1.2.5 - body syntax error, stream length is invalid [cobj=cosobject\{10, 
> 0}; defined length=15; buffer2=endstream]
>  * 1.2.5 - body syntax error, stream length is invalid [cobj=cosobject\{13, 
> 0}; defined length=15; buffer2=endstream]
> But this file is valid according to veraPDF and an online validator 
> ([https://www.pdf-online.com/osa/validate.aspx]).
>  
> We are working with version 2.0.21 and we also tried 2.0.24, but the same 
> result is returned.






[jira] [Commented] (PDFBOX-5288) Valid PDF/A 1B is rejected

2022-09-27 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17610119#comment-17610119
 ] 

Michael Klink commented on PDFBOX-5288:
---

Do you get the identical errors as the OP got? Or merely similar ones? I'd 
think it unlikely that the identical object numbers would be reported...

> Valid PDF/A 1B is rejected
> --
>
> Key: PDFBOX-5288
> URL: https://issues.apache.org/jira/browse/PDFBOX-5288
> Project: PDFBox
>  Issue Type: Bug
>  Components: Preflight
>Affects Versions: 2.0.21, 2.0.24
> Environment: Java 1.8
>Reporter: Manuel
>Priority: Major
> Attachments: pdfa1a.pdf
>
>
> When we try to validate a PDF/A 1B file, we get a "not valid" result with 
> these error messages:
>  * 7.3 - error on metadata, schema is not set in this document : 
> http://ns.adobe.com/xap/1.0/stype/resourceevent#||
>  * 1.2.5 - body syntax error, stream length is invalid [cobj=cosobject\{10, 
> 0}; defined length=15; buffer2=endstream]
>  * 1.2.5 - body syntax error, stream length is invalid [cobj=cosobject\{13, 
> 0}; defined length=15; buffer2=endstream]
> But this file is valid according to veraPDF and an online validator 
> ([https://www.pdf-online.com/osa/validate.aspx]).
>  
> We are working with version 2.0.21 and we also tried 2.0.24, but the same 
> result is returned.






[jira] [Commented] (PDFBOX-5519) Can PDFbox create the ability to extract tab numbers from pdf fields?

2022-09-25 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17609094#comment-17609094
 ] 

Michael Klink commented on PDFBOX-5519:
---

Don't forget the *Tabs* entry of the page object:

||Key||Type||Value||
|*Tabs*|name|_(Optional; PDF 1.5)_ A name specifying the tab order that shall 
be used for annotations on the page. The possible values shall be _R_ (row 
order), _C_ (column order), and _S_ (structure order). See 12.5, "Annotations" 
for details.|

_(ISO 32000-1, Table 30 – Entries in a page object)_

In ISO 32000-2 (now Table 31) the following has been added to the value 
description:

{quote}Beginning with PDF 2.0, additional values also include _A_ (annotations 
array order) and _W_ (widget order). Annotations array order refers to the 
order of the annotation enumerated in the *Annots* entry of the Page dictionary 
(see "Table 31 — Entries in a page object"). Widget order means using the same 
array ordering but making two passes, the first only picking the widget 
annotations and the second picking all other annotations.{quote}

Interestingly no default is defined for this optional entry...
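The values quoted above could be modelled roughly as follows. This is only a sketch: the enum {{TabOrder}} and its methods are made-up names for illustration, not an existing PDFBox type. Because no default is defined, an absent or unknown name maps to an empty {{Optional}} instead of a guessed fallback.

```java
import java.util.Optional;

/** Hypothetical model of the Tabs values listed in ISO 32000-1 and ISO 32000-2. */
public enum TabOrder {
    ROW("R"),
    COLUMN("C"),
    STRUCTURE("S"),
    ANNOTATIONS_ARRAY("A"), // added in PDF 2.0
    WIDGET("W");            // added in PDF 2.0

    private final String pdfName;

    TabOrder(String pdfName) {
        this.pdfName = pdfName;
    }

    /** The name as it appears in the page dictionary. */
    public String pdfName() {
        return pdfName;
    }

    /** Returns the matching order, or empty for null or unknown names. */
    public static Optional<TabOrder> fromPdfName(String name) {
        for (TabOrder order : values()) {
            if (order.pdfName.equals(name)) {
                return Optional.of(order);
            }
        }
        return Optional.empty();
    }
}
```

With PDFBox the raw name could presumably be read from the page dictionary, e.g. via {{page.getCOSObject().getNameAsString(COSName.getPDFName("Tabs"))}}; I have not tested that call chain against the attached file, so treat it as an assumption.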

> Can PDFbox create the ability to extract tab numbers from pdf fields? 
> --
>
> Key: PDFBOX-5519
> URL: https://issues.apache.org/jira/browse/PDFBOX-5519
> Project: PDFBox
>  Issue Type: Wish
>Affects Versions: 2.0.26
>Reporter: Tony C
>Priority: Major
> Attachments: DummyPdf-1.pdf, Screen Shot 2022-09-23 at 5.19.50 
> PM-1.png, image-2022-09-24-05-36-27-659.png
>
>
> I am in the process of converting a PDF into HTML. The PDF I am using has tab 
> numbers set on its fields.
> This is where I run into an issue. I am trying to extract the tab number 
> from the PDF fields but I don't think the library offers that. I would need 
> that value in order to set the tabindex when I create the corresponding HTML 
> elements.
> The PDF I am using is [^DummyPdf.pdf]
> ^!Screen Shot 2022-09-23 at 5.19.50 PM.png!^






[jira] [Commented] (PDFBOX-5513) getPageLayout throws IllegalArgumentException for empty mode

2022-09-21 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607906#comment-17607906
 ] 

Michael Klink commented on PDFBOX-5513:
---

{quote}If we create a PageMode.UNSPECIFIED then what would be its value?{quote}

Most appropriately an attempt to retrieve its value should cause an exception. 
Thus, this enumeration member would essentially shift the exception to a later 
time.

Also an attempt to set the document page mode to PageMode.UNSPECIFIED should be 
rejected.

Oh well, maybe not a good alternative after all...

{quote}I think this is now evolving into something complex.{quote}

Indeed, and that wasn't my intention at all. I just wanted to express that a 
plain getter IMO should not return a value clearly different from the actual 
value.

If the method name had indicated that some interpretation takes place (e.g. 
{{interpretPageMode}} or {{getBestPageModeMatch}} or {{getEffectivePageMode}}), 
I probably wouldn't have started such an argument at all ;). But as the method 
with that name has been around for so many years, one also shouldn't rename it 
on a whim.

> getPageLayout throws IllegalArgumentException for empty mode
> 
>
> Key: PDFBOX-5513
> URL: https://issues.apache.org/jira/browse/PDFBOX-5513
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.25, 2.0.26, 3.0.0 PDFBox
>Reporter: Karol Bryd
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.27, 3.0.0 PDFBox
>
> Attachments: page_layout_issue.patch
>
>
> The getPageLayout() method in PDDocumentCatalog can throw an 
> IllegalArgumentException when the PageLayout mode is not one of those defined 
> in the PageLayout class. In my case the mode is simply an empty string. The 
> PDF documents which contain such unexpected Page Layout values are all 
> rendered by the quite old Adobe PDF library 7.0 from 2014 (I can't share the 
> document, it is confidential).
> My suggestion is to modify the method so that, similarly to the getPageMode() 
> method, the eventual exception is caught and the method returns the default 
> PageLayout.SINGLE_PAGE mode.
>  
> This problem affects the current version in trunk, as well as at least 2.0.25 
> and 2.0.26.
>  
> I have created a very simple patch which fixes the problem, please consider 
> applying it to the trunk and 2.0.x branch.
>  
>  






[jira] [Commented] (PDFBOX-5513) getPageLayout throws IllegalArgumentException for empty mode

2022-09-18 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606354#comment-17606354
 ] 

Michael Klink commented on PDFBOX-5513:
---

{quote}What's your proposal? ...{quote}

Something along those lines, I'm merely not sure why one should default for the 
empty name; the empty name is a name after all.

An alternative to the exception could be a new {{PageLayout}} value 
UNSPECIFIED. People could then decide whether to handle this situation like the 
default value, as an error, or whether to retrieve the actual value and look at 
it more closely.
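A rough sketch of that alternative, to make the idea concrete. The enum name {{SketchPageLayout}} and both methods are hypothetical, not the actual PDFBox {{PageLayout}} class; that the UNSPECIFIED member refuses to hand out a value is likewise an assumption of this sketch.

```java
/** Hypothetical variant of a page layout enum with an UNSPECIFIED member. */
public enum SketchPageLayout {
    SINGLE_PAGE("SinglePage"),
    ONE_COLUMN("OneColumn"),
    TWO_COLUMN_LEFT("TwoColumnLeft"),
    TWO_COLUMN_RIGHT("TwoColumnRight"),
    TWO_PAGE_LEFT("TwoPageLeft"),
    TWO_PAGE_RIGHT("TwoPageRight"),
    /** Stands for any name that is not one of the above, including the empty name. */
    UNSPECIFIED(null);

    private final String value;

    SketchPageLayout(String value) {
        this.value = value;
    }

    /** Returns the matching member, or UNSPECIFIED instead of throwing. */
    public static SketchPageLayout fromString(String value) {
        for (SketchPageLayout layout : values()) {
            if (layout.value != null && layout.value.equals(value)) {
                return layout;
            }
        }
        return UNSPECIFIED;
    }

    /** The PDF name of this layout; UNSPECIFIED has none, so asking for it fails. */
    public String stringValue() {
        if (this == UNSPECIFIED) {
            throw new IllegalStateException("UNSPECIFIED has no PDF name");
        }
        return value;
    }
}
```

A caller that is happy with the default can map UNSPECIFIED to SINGLE_PAGE itself, while a caller that wants to check the document can treat it as an error.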

The OP mentions that {{getPageMode}} already caught an 
`IllegalArgumentException` and returned the default instead. I think that 
handling these situations alike is even more important than the approach used.

> getPageLayout throws IllegalArgumentException for empty mode
> 
>
> Key: PDFBOX-5513
> URL: https://issues.apache.org/jira/browse/PDFBOX-5513
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.25, 2.0.26, 3.0.0 PDFBox
>Reporter: Karol Bryd
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.27, 3.0.0 PDFBox
>
> Attachments: page_layout_issue.patch
>
>
> The getPageLayout() method in PDDocumentCatalog can throw an 
> IllegalArgumentException when the PageLayout mode is not one of those defined 
> in the PageLayout class. In my case the mode is simply an empty string. The 
> PDF documents which contain such unexpected Page Layout values are all 
> rendered by the quite old Adobe PDF library 7.0 from 2014 (I can't share the 
> document, it is confidential).
> My suggestion is to modify the method so that, similarly to the getPageMode() 
> method, the eventual exception is caught and the method returns the default 
> PageLayout.SINGLE_PAGE mode.
>  
> This problem affects the current version in trunk, as well as at least 2.0.25 
> and 2.0.26.
>  
> I have created a very simple patch which fixes the problem, please consider 
> applying it to the trunk and 2.0.x branch.
>  
>  






[jira] [Commented] (PDFBOX-5513) getPageLayout throws IllegalArgumentException for empty mode

2022-09-15 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605460#comment-17605460
 ] 

Michael Klink commented on PDFBOX-5513:
---

Yes, one can argue along that line. If PDFBox users want to get default values 
instead of invalid ones or exceptions, then go ahead.

Nonetheless, it feels wrong to me.

> getPageLayout throws IllegalArgumentException for empty mode
> 
>
> Key: PDFBOX-5513
> URL: https://issues.apache.org/jira/browse/PDFBOX-5513
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.25, 2.0.26, 3.0.0 PDFBox
>Reporter: Karol Bryd
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.27, 3.0.0 PDFBox
>
> Attachments: page_layout_issue.patch
>
>
> The getPageLayout() method in PDDocumentCatalog can throw an 
> IllegalArgumentException when the PageLayout mode is not one of those defined 
> in the PageLayout class. In my case the mode is simply an empty string. The 
> PDF documents which contain such unexpected Page Layout values are all 
> rendered by the quite old Adobe PDF library 7.0 from 2014 (I can't share the 
> document, it is confidential).
> My suggestion is to modify the method so that, similarly to the getPageMode() 
> method, the eventual exception is caught and the method returns the default 
> PageLayout.SINGLE_PAGE mode.
>  
> This problem affects the current version in trunk, as well as at least 2.0.25 
> and 2.0.26.
>  
> I have created a very simple patch which fixes the problem, please consider 
> applying it to the trunk and 2.0.x branch.
>  
>  






[jira] [Commented] (PDFBOX-5513) getPageLayout throws IllegalArgumentException for empty mode

2022-09-15 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605276#comment-17605276
 ] 

Michael Klink commented on PDFBOX-5513:
---

I'm a bit skeptical seeing a {*}get{*}ter that reports a value different from 
what's really there. This makes the {*}get{*}ter unusable when one wants to 
check the value...

> getPageLayout throws IllegalArgumentException for empty mode
> 
>
> Key: PDFBOX-5513
> URL: https://issues.apache.org/jira/browse/PDFBOX-5513
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.25, 2.0.26, 3.0.0 PDFBox
>Reporter: Karol Bryd
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.27, 3.0.0 PDFBox
>
> Attachments: page_layout_issue.patch
>
>
> The getPageLayout() method in PDDocumentCatalog can throw an 
> IllegalArgumentException when the PageLayout mode is not one of those defined 
> in the PageLayout class. In my case the mode is simply an empty string. The 
> PDF documents which contain such unexpected Page Layout values are all 
> rendered by the quite old Adobe PDF library 7.0 from 2014 (I can't share the 
> document, it is confidential).
> My suggestion is to modify the method so that, similarly to the getPageMode() 
> method, the eventual exception is caught and the method returns the default 
> PageLayout.SINGLE_PAGE mode.
>  
> This problem affects the current version in trunk, as well as at least 2.0.25 
> and 2.0.26.
>  
> I have created a very simple patch which fixes the problem, please consider 
> applying it to the trunk and 2.0.x branch.
>  
>  






[jira] [Commented] (PDFBOX-5499) Performance issue since 2.0.18

2022-09-06 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600725#comment-17600725
 ] 

Michael Klink commented on PDFBOX-5499:
---

{quote}How about that we replace the SmallMap at runtime with a LinkedHashMap 
if the count exceeds some arbitrary number, e.g. 100 ?
{quote}
It would be worth a try.

There is one tiny behavior change associated with that, though: According to 
the {{Map}} interface JavaDocs the derived collections of the map, 
{{entrySet}}, {{keySet}}, and {{values}}, shall be backed by the 
map itself. {{SmallMap}} violates this requirement, {{LinkedHashMap}} 
implements it correctly.

Thus, code using the {{COSDictionary}} methods {{entrySet}}, 
{{keySet}}, and {{getValues}} may have to deal with a changing behavior: As 
long as the map is a {{SmallMap}} these collections are not backed by the 
underlying map, but as soon as it is replaced by a {{LinkedHashMap}}, they 
suddenly are.

(Most usages will ignore the difference as the returned collections usually are 
used only once...)
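The difference can be illustrated with plain {{java.util}} classes. In this sketch the copied key list merely stands in for a collection that is not backed by its map; it is an assumption for illustration, not the actual {{SmallMap}} code, and the class name {{BackedViewDemo}} is made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BackedViewDemo {
    /** Size of a live keySet view after the map has been modified. */
    public static int backedSizeAfterPut() {
        Map<String, Integer> map = new LinkedHashMap<>();
        map.put("Type", 1);
        Set<String> keys = map.keySet(); // view backed by the map
        map.put("Pages", 2);             // the later change is visible through the view
        return keys.size();              // 2
    }

    /** Size of a copied key collection after the map has been modified. */
    public static int snapshotSizeAfterPut() {
        Map<String, Integer> map = new LinkedHashMap<>();
        map.put("Type", 1);
        List<String> keys = new ArrayList<>(map.keySet()); // independent snapshot
        map.put("Pages", 2);                               // not visible in the copy
        return keys.size();                                // 1
    }

    public static void main(String[] args) {
        System.out.println(backedSizeAfterPut());
        System.out.println(snapshotSizeAfterPut());
    }
}
```

Code that keeps such a collection around while the dictionary changes would therefore see different contents depending on which map implementation is currently in place.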

> Performance issue since 2.0.18
> --
>
> Key: PDFBOX-5499
> URL: https://issues.apache.org/jira/browse/PDFBOX-5499
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.19
>Reporter: Thomas Debray Luyat
>Priority: Major
> Attachments: image-2022-09-05-12-48-04-608.png, 
> image-2022-09-05-17-37-55-155.png, image-2022-09-05-17-40-22-416.png, 
> image-2022-09-05-19-55-40-753.png
>
>
> Our PDF is parsed in less than 200ms in 2.0.18 and more than 8 seconds in 
> 2.0.19. The same issue is still there in 2.0.26.
>  
> In version 2.0.19, SmallMap has been introduced. We're facing a performance 
> issue since this modification.
> !image-2022-09-05-12-48-04-608.png|width=968,height=377!
> We patch our code to just replace the SmallMap implementation like this:
> {code:java}
> package org.apache.pdfbox.util;
> import java.util.LinkedHashMap;
> public class SmallMap extends LinkedHashMap {
> // nothing : use the standard LinkedHashMap
> }{code}
> And the performance issue disappears. 
> Our test is really simple:
> {code:java}
> long start = System.currentTimeMillis();
>     try (PDDocument document = PDDocument.load(new File(inFile))) {
>       // nothing : only parsing is evaluated
> }
> long duration = System.currentTimeMillis() -start;
>     assertTrue(duration < 500);{code}
>  
> I can understand that the SmallMap can solve issues in some cases, but would 
> it be possible to implement a factory to create this map, so that we can 
> configure which Map implementation we want to use?






[jira] [Comment Edited] (PDFBOX-5499) Performance issue since 2.0.18

2022-09-05 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600344#comment-17600344
 ] 

Michael Klink edited comment on PDFBOX-5499 at 9/5/22 12:25 PM:


Just to make sure: Your code measures not only _parsing_ but also _closing_ the 
document (by means of the try-with-resources feature). Please test whether the 
performance issue really is in the parsing and not in the closing.

But indeed, the {{SmallMap}} is performant only for small numbers of entries. 
If your PDF has many dictionaries with very many entries, {{SmallMap}} may 
cause that difference.
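One way to separate the two phases, sketched with a generic helper. The class {{ParseCloseTiming}} and its members are made-up names, and the helper only assumes that opening the document yields something {{AutoCloseable}}.

```java
import java.util.concurrent.Callable;

public class ParseCloseTiming {
    /** Durations of the two phases in nanoseconds. */
    public static final class Timing {
        public final long openNanos;
        public final long closeNanos;

        Timing(long openNanos, long closeNanos) {
            this.openNanos = openNanos;
            this.closeNanos = closeNanos;
        }
    }

    /** Times opening (e.g. parsing) and closing of a resource separately. */
    public static Timing measure(Callable<? extends AutoCloseable> opener) throws Exception {
        long start = System.nanoTime();
        AutoCloseable resource = opener.call(); // phase 1: open / parse
        long opened = System.nanoTime();
        resource.close();                       // phase 2: close
        long closed = System.nanoTime();
        return new Timing(opened - start, closed - opened);
    }
}
```

For the test in the issue description this could be called as {{ParseCloseTiming.measure(() -> PDDocument.load(new File(inFile)))}}; if {{openNanos}} stays small and {{closeNanos}} is large, the slowdown is not in the parsing.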


was (Author: mkl):
Just to make sure: Your code measures not only _parsing_ but also _closing_ the 
document (by means of the try-with-resources feature). Please test whether the 
performance issue really is in the parsing and not in the closing.

> Performance issue since 2.0.18
> --
>
> Key: PDFBOX-5499
> URL: https://issues.apache.org/jira/browse/PDFBOX-5499
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.18
>Reporter: Thomas Debray Luyat
>Priority: Major
> Attachments: image-2022-09-05-12-48-04-608.png
>
>
> Our PDF is parsed in less than 200ms in 2.0.18 and more than 8 seconds in 
> 2.0.19. The same issue is still there in 2.0.26.
>  
> In version 2.0.19, SmallMap has been introduced. We're facing a performance 
> issue since this modification.
> !image-2022-09-05-12-48-04-608.png|width=968,height=377!
> We patch our code to just replace the SmallMap implementation like this:
> {code:java}
> package org.apache.pdfbox.util;
> import java.util.LinkedHashMap;
> public class SmallMap extends LinkedHashMap {
> // nothing : use the standard LinkedHashMap
> }{code}
> And the performance issue disappears. 
> Our test is really simple:
> {code:java}
> long start = System.currentTimeMillis();
>     try (PDDocument document = PDDocument.load(new File(inFile))) {
>       // nothing : only parsing is evaluated
> }
> long duration = System.currentTimeMillis() -start;
>     assertTrue(duration < 500);{code}
>  
> I can understand that the SmallMap can solve issues in some cases, but would 
> it be possible to implement a factory to create this map, so that we can 
> configure which Map implementation we want to use?






[jira] [Commented] (PDFBOX-5499) Performance issue since 2.0.18

2022-09-05 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17600344#comment-17600344
 ] 

Michael Klink commented on PDFBOX-5499:
---

Just to make sure: Your code measures not only _parsing_ but also _closing_ the 
document (by means of the try-with-resources feature). Please test whether the 
performance issue really is in the parsing and not in the closing.

> Performance issue since 2.0.18
> --
>
> Key: PDFBOX-5499
> URL: https://issues.apache.org/jira/browse/PDFBOX-5499
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.18
>Reporter: Thomas Debray Luyat
>Priority: Major
> Attachments: image-2022-09-05-12-48-04-608.png
>
>
> Our PDF is parsed in less than 200ms in 2.0.18 and more than 8 seconds in 
> 2.0.19. The same issue is still there in 2.0.26.
>  
> In version 2.0.19, SmallMap has been introduced. We're facing a performance 
> issue since this modification.
> !image-2022-09-05-12-48-04-608.png|width=968,height=377!
> We patch our code to just replace the SmallMap implementation like this:
> {code:java}
> package org.apache.pdfbox.util;
> import java.util.LinkedHashMap;
> public class SmallMap extends LinkedHashMap {
> // nothing : use the standard LinkedHashMap
> }{code}
> And the performance issue disappears. 
> Our test is really simple:
> {code:java}
> long start = System.currentTimeMillis();
>     try (PDDocument document = PDDocument.load(new File(inFile))) {
>       // nothing : only parsing is evaluated
> }
> long duration = System.currentTimeMillis() -start;
>     assertTrue(duration < 500);{code}
>  
> I can understand that the SmallMap can solve issues in some cases, but would 
> it be possible to implement a factory to create this map, so that we can 
> configure which Map implementation we want to use?






[jira] [Commented] (PDFBOX-5498) Setting NonStrokingColor with RGB checks wrong color ranges

2022-09-02 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599466#comment-17599466
 ] 

Michael Klink commented on PDFBOX-5498:
---

{quote}How would I then set the color, e.g. RGB(49, 154, 201), when I have to 
pass the values from 0 to 1?{quote}

If your RGB value range is from 0 to 255, divide each value by 255.
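A minimal sketch of that conversion, assuming 8-bit integer components. The class and method names ({{RgbScale}}, {{toUnit}}) are made up for illustration and are not part of PDFBox.

```java
public class RgbScale {
    /** Maps an 8-bit colour component (0..255) to the 0..1 range. */
    public static float toUnit(int component) {
        if (component < 0 || component > 255) {
            throw new IllegalArgumentException(
                    "Component must be within 0..255, but is " + component);
        }
        return component / 255f;
    }

    public static void main(String[] args) {
        // RGB(49, 154, 201) from the question, scaled for an API expecting 0..1
        float r = toUnit(49);
        float g = toUnit(154);
        float b = toUnit(201);
        System.out.println(r + " " + g + " " + b);
    }
}
```

The resulting floats can then be passed to {{setNonStrokingColor(float r, float g, float b)}} as quoted in the issue description.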



> Setting NonStrokingColor with RGB checks wrong color ranges
> ---
>
> Key: PDFBOX-5498
> URL: https://issues.apache.org/jira/browse/PDFBOX-5498
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 3.0.0 PDFBox
>Reporter: Holger Herrmann
>Priority: Minor
>  Labels: Color, Font
>
> In the PDAbstractContentStream, you can set the NonStrokingColor by passing 
> RGB values.
> However, these values may only be 0 or 1, as they are checked with 
> isOutsideOneInterval():
>  
> {{public void setNonStrokingColor(float r, float g, float b) throws 
> IOException {}}
> {{   if (isOutsideOneInterval(r) || isOutsideOneInterval(g) || 
> isOutsideOneInterval(b))}}
> {{   {}}
> {{      throw new IllegalArgumentException("Parameters must be within 0..1, 
> but are "}}
> {{      + String.format("(%.2f,%.2f,%.2f)", r, g, b));}}
> {{   }}}
> {{   ...}}
> {{}}}
>  
> {{The comment of the method seems correct to me: "Range is 0..255."}}
>  
> {{So I suppose the values have to be checked using isOutside255Interval(.).}}






[jira] [Comment Edited] (PDFBOX-5498) Setting NonStrokingColor with RGB checks wrong color ranges

2022-09-02 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599407#comment-17599407
 ] 

Michael Klink edited comment on PDFBOX-5498 at 9/2/22 11:02 AM:


Which version exactly are you using? Looking into the sources I only see `Range 
is 0..1.`

There was a fix in February by [~tilman] .


was (Author: mkl):
Which version exactly are you using? Looking into the sources I only see `Range 
is 0..1.`

There was a fix in February, see PDFBOX-4892

> Setting NonStrokingColor with RGB checks wrong color ranges
> ---
>
> Key: PDFBOX-5498
> URL: https://issues.apache.org/jira/browse/PDFBOX-5498
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 3.0.0 PDFBox
>Reporter: Holger Herrmann
>Priority: Minor
>  Labels: Color, Font
>
> In the PDAbstractContentStream, you can set the NonStrokingColor by passing 
> RGB values.
> However, these values may only be 0 or 1, as they are checked with 
> isOutsideOneInterval():
>  
> {{public void setNonStrokingColor(float r, float g, float b) throws 
> IOException {}}
> {{   if (isOutsideOneInterval(r) || isOutsideOneInterval(g) || 
> isOutsideOneInterval(b))}}
> {{   {}}
> {{      throw new IllegalArgumentException("Parameters must be within 0..1, 
> but are "}}
> {{      + String.format("(%.2f,%.2f,%.2f)", r, g, b));}}
> {{   }}}
> {{   ...}}
> {{}}}
>  
> {{The comment of the method seems correct to me: "Range is 0..255."}}
>  
> {{So I suppose the values have to be checked using isOutside255Interval(.).}}






[jira] [Comment Edited] (PDFBOX-5498) Setting NonStrokingColor with RGB checks wrong color ranges

2022-09-02 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599407#comment-17599407
 ] 

Michael Klink edited comment on PDFBOX-5498 at 9/2/22 11:01 AM:


Which version exactly are you using? Looking into the sources I only see `Range 
is 0..1.`

There was a fix in February, see PDFBOX-4892


was (Author: mkl):
Which version exactly are you using? Looking into the sources I only see `Range 
is 0..1.`

> Setting NonStrokingColor with RGB checks wrong color ranges
> ---
>
> Key: PDFBOX-5498
> URL: https://issues.apache.org/jira/browse/PDFBOX-5498
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 3.0.0 PDFBox
>Reporter: Holger Herrmann
>Priority: Minor
>  Labels: Color, Font
>
> In the PDAbstractContentStream, you can set the NonStrokingColor by passing 
> RGB values.
> However, these values may only be 0 or 1, as they are checked with 
> isOutsideOneInterval():
>  
> {{public void setNonStrokingColor(float r, float g, float b) throws 
> IOException {}}
> {{   if (isOutsideOneInterval(r) || isOutsideOneInterval(g) || 
> isOutsideOneInterval(b))}}
> {{   {}}
> {{      throw new IllegalArgumentException("Parameters must be within 0..1, 
> but are "}}
> {{      + String.format("(%.2f,%.2f,%.2f)", r, g, b));}}
> {{   }}}
> {{   ...}}
> {{}}}
>  
> {{The comment of the method seems correct to me: "Range is 0..255."}}
>  
> {{So I suppose the values have to be checked using isOutside255Interval(.).}}






[jira] [Commented] (PDFBOX-5498) Setting NonStrokingColor with RGB checks wrong color ranges

2022-09-02 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599407#comment-17599407
 ] 

Michael Klink commented on PDFBOX-5498:
---

Which version exactly are you using? Looking into the sources I only see `Range 
is 0..1.`

> Setting NonStrokingColor with RGB checks wrong color ranges
> ---
>
> Key: PDFBOX-5498
> URL: https://issues.apache.org/jira/browse/PDFBOX-5498
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 3.0.0 PDFBox
>Reporter: Holger Herrmann
>Priority: Minor
>  Labels: Color, Font
>
> In the PDAbstractContentStream, you can set the NonStrokingColor by passing 
> RGB values.
> However, these values may only be 0 or 1, as they are checked with 
> isOutsideOneInterval():
>  
> {{public void setNonStrokingColor(float r, float g, float b) throws 
> IOException {}}
> {{   if (isOutsideOneInterval(r) || isOutsideOneInterval(g) || 
> isOutsideOneInterval(b))}}
> {{   {}}
> {{      throw new IllegalArgumentException("Parameters must be within 0..1, 
> but are "}}
> {{      + String.format("(%.2f,%.2f,%.2f)", r, g, b));}}
> {{   }}}
> {{   ...}}
> {{}}}
>  
> {{The comment of the method seems correct to me: "Range is 0..255."}}
>  
> {{So I suppose the values have to be checked using isOutside255Interval(.).}}






[jira] [Commented] (PDFBOX-5493) Signature byte range is Invalid after singing

2022-08-17 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17580924#comment-17580924
 ] 

Michael Klink commented on PDFBOX-5493:
---

The original file uses hybrid cross reference information (see ISO 32000-1 
section 7.5.8.4). Albeit it looks a bit weird, it is according to spec. MS Word 
has exported PDFs with this quirk for many many years.

PDFBox 3 has problems handling hybrid reference files, see also PDFBOX-5261, 
PDFBOX-5170, and probably other issues.

> Signature byte range is Invalid after singing
> -
>
> Key: PDFBOX-5493
> URL: https://issues.apache.org/jira/browse/PDFBOX-5493
> Project: PDFBox
>  Issue Type: Bug
>  Components: Signing
>Affects Versions: 3.0.0 PDFBox
>Reporter: Dmitry
>Priority: Blocker
> Attachments: doc.pdf, doc_signed.pdf
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> After signing a PDF with PDFBox, Adobe Reader tells me "The signature byte 
> range is invalid".
> I will attach the original and the signed document.
> For signing I used the example code. 
> The initial PDF was created by MS Word 2007 (2016 is also bad) (save as PDF). 
> Other PDFs work fine.
>  
>  






[jira] [Commented] (PDFBOX-5490) Add reconstruction information to the PDDocument

2022-08-10 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578065#comment-17578065
 ] 

Michael Klink commented on PDFBOX-5490:
---

Sounds like a good idea.

That would also allow implementing some customized parsing strictness if an 
exception thrown by the listener is interpreted as a rejected repair...
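
A rough sketch of that idea, under the assumption that loading reports each repair to a callback. {{RepairListener}} and {{ToyLoader}} are hypothetical names for illustration and not existing PDFBox API:

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical callback; not an existing PDFBox interface. */
interface RepairListener {
    void repaired(String description) throws IOException;
}

final class ToyLoader {
    /** Pretends to load a document and reports each repair it had to make. */
    static void load(List<String> neededRepairs, RepairListener listener) throws IOException {
        for (String repair : neededRepairs) {
            // An exception thrown by the listener aborts loading: the repair is rejected.
            listener.repaired(repair);
        }
    }
}
```

A lenient caller would merely collect the descriptions per document, while a strict caller would throw from the listener for repairs it does not want to accept.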

> Add reconstruction information to the PDDocument
> 
>
> Key: PDFBOX-5490
> URL: https://issues.apache.org/jira/browse/PDFBOX-5490
> Project: PDFBox
>  Issue Type: Wish
>  Components: Parsing
>Reporter: Tim Allison
>Priority: Minor
>
> When the xref has to be rebuilt or there are other anomalies in the parsing 
> of the PDDocument, the results are currently logged.  In a multithreaded 
> environment it is not easy to reconstruct which documents had which problems.
> It would be helpful if a PDF was able to be successfully loaded to include 
> information about what had to be fixed in order to load it successfully.  
> Certainly, rebuilding the xref table comes to mind, but any other info would 
> also be useful.
> This is a wish for 3.x.  I don't think I'll have time to contribute. :(






[jira] [Commented] (PDFBOX-5483) Replace methods using an InputStream from Loader.loadPDF

2022-08-03 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574627#comment-17574627
 ] 

Michael Klink commented on PDFBOX-5483:
---

Indeed, I didn't necessarily mean keeping the original method signature but at 
least keeping it simple.

E.g. one can introduce an enumeration {{PdfCaching}} with values {{inMemory}}, 
{{inFile}}, and {{inMemoryMappedFile}}. Then one could change
{code:java}
public static PDDocument loadPDF(InputStream input) throws IOException
{code}
to
{code:java}
public static PDDocument loadPDF(InputStream input, PdfCaching pdfCaching) 
throws IOException
{code}

IMO it is more friendly and less frustrating to have to write
{code:java}
PDDocument pdDocument = Loader.loadPDF(inputStream, PdfCaching.inMemory);
{code}
than
{code:java}
PDDocument pdDocument = 
Loader.loadPDF(RandomAccessReadBuffer.createBufferFromStream(inputStream));
{code}
in particular as IDEs often support enumeration value proposals there.

To keep things in one place, the actual code for creating the 
{{RandomAccessRead}} for an {{InputStream}} may be a method of the enumeration.
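
A minimal sketch of such an enumeration, using only JDK classes. {{RandomAccessSource}} is a hypothetical stand-in for {{RandomAccessRead}}, and the {{open}}, {{spool}}, and {{wrap}} methods are assumptions made for illustration, not PDFBox code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical stand-in for PDFBox's RandomAccessRead; not the real interface. */
interface RandomAccessSource {
    long length() throws IOException;
    /** Returns the unsigned byte at the position, or -1 past the end. */
    int readAt(long position) throws IOException;
}

enum PdfCaching {
    inMemory {
        @Override
        RandomAccessSource open(InputStream input) throws IOException {
            return wrap(ByteBuffer.wrap(input.readAllBytes()));
        }
    },
    inFile {
        @Override
        RandomAccessSource open(InputStream input) throws IOException {
            // The channel stays open; a real implementation would also need a close method.
            FileChannel channel = FileChannel.open(spool(input), StandardOpenOption.READ);
            return new RandomAccessSource() {
                public long length() throws IOException { return channel.size(); }
                public int readAt(long position) throws IOException {
                    ByteBuffer one = ByteBuffer.allocate(1);
                    return channel.read(one, position) == 1 ? one.get(0) & 0xFF : -1;
                }
            };
        }
    },
    inMemoryMappedFile {
        @Override
        RandomAccessSource open(InputStream input) throws IOException {
            try (FileChannel channel = FileChannel.open(spool(input), StandardOpenOption.READ)) {
                // The mapping remains valid after the channel is closed.
                return wrap(channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size()));
            }
        }
    };

    /** Creates the random access view for the stream according to the caching strategy. */
    abstract RandomAccessSource open(InputStream input) throws IOException;

    /** Copies the stream into a temporary file. */
    static Path spool(InputStream input) throws IOException {
        Path file = Files.createTempFile("pdf-cache", ".pdf");
        file.toFile().deleteOnExit();
        Files.copy(input, file, StandardCopyOption.REPLACE_EXISTING);
        return file;
    }

    static RandomAccessSource wrap(ByteBuffer buffer) {
        return new RandomAccessSource() {
            public long length() { return buffer.limit(); }
            public int readAt(long position) {
                return position < buffer.limit() ? buffer.get((int) position) & 0xFF : -1;
            }
        };
    }
}
```

A two-argument {{loadPDF}} would then only need to call {{pdfCaching.open(input)}} and hand the result to the existing random access based loading code.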

> Replace methods using an InputStream from Loader.loadPDF
> 
>
> Key: PDFBOX-5483
> URL: https://issues.apache.org/jira/browse/PDFBOX-5483
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing
>Affects Versions: 3.0.0 PDFBox
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
>
> As discussed on dev@pdfbox
> {quote}
> We have to remove the loadPDF variants using InputStream and replace them 
> with RandomAccessRead.
> When it comes to InputStreams, users have to decide how to proceed:
> * copy the InputStream to memory by using RandomAccessReadBuffer
> * copy the InputStream to a file and use RandomAccessReadBufferedFile or 
> RandomAccessReadMemoryMappedFile
> This would make it more transparent what happens under the hood when using 
> the different kinds of loadPDF methods:
> * a byte array as source is already in memory and the obvious choice is to 
> use RandomAccessReadBuffer as a wrapper
> * a file as source targets a local file and the most obvious choice is to use 
> RandomAccessReadBufferedFile as a wrapper. We should document that as the 
> other alternative RandomAccessReadMemoryMappedFile is offered in this case
> * RandomAccessRead as source is the most obvious one and the user decides how 
> to create it. Additionally, it is possible to implement a custom loading 
> and/or caching mechanism
> {quote}
> see PDFBOX-5462 and [High memory usage with pdfbox 
> 3|https://lists.apache.org/thread/6mmgp23v8b2yztj4hghkgkd14s1gzs8g] as well






[jira] [Comment Edited] (PDFBOX-5483) Replace methods using an InputStream from Loader.loadPDF

2022-08-02 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574264#comment-17574264
 ] 

Michael Klink edited comment on PDFBOX-5483 at 8/2/22 2:50 PM:
---

Is it really necessary to remove all {{InputStream}} constructors?

Wouldn't adding an enumeration argument with good names have sufficed?

IMO this change will frustrate PDFBox users if the {{InputStream}} constructors 
are completely removed...


was (Author: mkl):
Is it really necessary to remove all {{InputStream}} constructors?

Wouldn't adding an enumeration argument with good names have sufficed?

IMO this change will frustrate PDFBox users in its current form...







[jira] [Commented] (PDFBOX-5483) Replace methods using an InputStream from Loader.loadPDF

2022-08-02 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574264#comment-17574264
 ] 

Michael Klink commented on PDFBOX-5483:
---

Is it really necessary to remove all {{InputStream}} constructors?

Wouldn't adding an enumeration argument with good names have sufficed?

IMO this change will frustrate PDFBox users in its current form...







[jira] [Commented] (PDFBOX-5479) PDFTextStripper needs 1GB heap for a 3.6 MB pdf

2022-07-21 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569732#comment-17569732
 ] 

Michael Klink commented on PDFBOX-5479:
---

Wow, some 3000 form XObjects on page 1, many of them with an own font object, 
most of which point to the same font descriptor... that adds up...

> PDFTextStripper needs 1GB heap for a 3.6 MB pdf
> ---
>
> Key: PDFBOX-5479
> URL: https://issues.apache.org/jira/browse/PDFBOX-5479
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.26
> Environment: JDK11.0.2 on MacOS 12.4
>Reporter: Manfred Schauer
>Priority: Minor
> Attachments: heapDump.png, x.pdf
>
>
> Extracting text from the attached x.pdf:
> PDDocument pdDocument = PDDocument.load(new File("/tmp/x.pdf"));
> PDFTextStripper stripper = new PDFTextStripper();
> stripper.getText(pdDocument);
> succeeds with -Xmx1G but throws OOME with -Xmx900m
> Heapdump shows 2923 instances of TrueTypeFont, PDRessources.cache contains 
> SoftReferences to lots of fonts keyed by different COSObjects;






[jira] [Commented] (PDFBOX-5471) NPE when Transparency Group is missing the BBox

2022-07-04 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562125#comment-17562125
 ] 

Michael Klink commented on PDFBOX-5471:
---

Strictly speaking the missing *BBox* entry is an error. You may consider at 
least logging a warning.

> NPE when Transparency Group is missing the BBox
> ---
>
> Key: PDFBOX-5471
> URL: https://issues.apache.org/jira/browse/PDFBOX-5471
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.26
>Reporter: Henry Iguaro
>Priority: Major
> Fix For: 2.0.27, 3.0.0 PDFBox
>
> Attachments: getbbox-null.pdf, image-2022-07-04-14-55-28-527.png
>
>
> Some files contain transparency groups with no {{BBox}}. When this 
> happens, PDFBox rendering code throws a {{NullPointerException}} in the 
> {{TransparencyGroup}} constructor:
>  
> {code:java}
> // transform the bbox; getBBox() returns null here, which causes the NPE
> GeneralPath transformedBox = form.getBBox().transform(transform);
> {code}
> The following is a screenshot taken from {{pdf-debugger}} when trying to open 
> a file that has this issue:
> !image-2022-07-04-14-55-28-527.png!
> The stack trace:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot invoke "org.apache.pdfbox.pdmodel.common.PDRectangle.transform(org.apache.pdfbox.util.Matrix)" because the return value of "org.apache.pdfbox.pdmodel.graphics.form.PDTransparencyGroup.getBBox()" is null
>     org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.done(PagePane.java:485)
>     java.desktop/sun.swing.AccumulativeRunnable.run(AccumulativeRunnable.java:112)
>     java.base/java.security.AccessController.doPrivileged(AccessController.java:391)
>     java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85)
> Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot invoke "org.apache.pdfbox.pdmodel.common.PDRectangle.transform(org.apache.pdfbox.util.Matrix)" because the return value of "org.apache.pdfbox.pdmodel.graphics.form.PDTransparencyGroup.getBBox()" is null
>     org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.done(PagePane.java:465)
>     java.desktop/sun.swing.AccumulativeRunnable.run(AccumulativeRunnable.java:112)
>     java.base/java.security.AccessController.doPrivileged(AccessController.java:391)
>     java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85)
> Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.pdfbox.pdmodel.common.PDRectangle.transform(org.apache.pdfbox.util.Matrix)" because the return value of "org.apache.pdfbox.pdmodel.graphics.form.PDTransparencyGroup.getBBox()" is null
>     org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1672)
>     org.apache.pdfbox.rendering.PageDrawer$TransparencyGroup.<init>(PageDrawer.java:1637)
>     org.apache.pdfbox.rendering.PageDrawer.showTransparencyGroupOnGraphics(PageDrawer.java:1575)
>     org.apache.pdfbox.rendering.PageDrawer.showTransparencyGroup(PageDrawer.java:1553)
>     org.apache.pdfbox.contentstream.operator.graphics.DrawObject.process(DrawObject.java:81)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:966)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:541)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:516)
>     org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:155)
>     org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:279)
>     org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:355)
>     org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.doInBackground(PagePane.java:453)
>     org.apache.pdfbox.debugger.pagepane.PagePane$RenderWorker.doInBackground(PagePane.java:435)
>     java.base/java.lang.Thread.run(Thread.java:832)
>  {code}
> The following is an example file that reproduces this problem:
> [^getbbox-null.pdf]
>  
> A potential fix PR: https://github.com/apache/pdfbox/pull/145






[jira] [Commented] (PDFBOX-5455) java.lang.ExceptionInInitializerError in org.apache.pdfbox.util.PDFTextStripper class

2022-06-10 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552605#comment-17552605
 ] 

Michael Klink commented on PDFBOX-5455:
---

In the version 1.8.9 you mention, lines 121..124 are
{code:java}
String[] versionComponents = System.getProperty("java.version").split("\\.");
int javaMajorVersion = Integer.parseInt(versionComponents[0]);
int javaMinorVersion = Integer.parseInt(versionComponents[1]);
is16orLess = javaMajorVersion == 1 && javaMinorVersion <= 6;
{code}
Your {{ArrayIndexOutOfBoundsException}} at {{Index 1}}, therefore, refers to 
{{versionComponents[1]}}: the {{split}} result has length 1, so your 
{{java.version}} system property contains no dot, which is not the format the 
PDFBox developers expected at the time.
As a quick fix you may try and append {{.9}} to that system property before 
using PDFBox classes.
As a real fix, you should update your PDFBox version. 1.8.9 is ancient.
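
To make the failure mode concrete, here is a small sketch. {{is16OrLessFragile}} mirrors the quoted logic with the property value passed in as a parameter; {{is16OrLessTolerant}} is a hypothetical, more forgiving variant and not what any PDFBox version actually does:

```java
final class JavaVersionCheck {
    /** Mirrors the quoted logic; fails for a value without a dot, such as "17". */
    static boolean is16OrLessFragile(String javaVersion) {
        String[] versionComponents = javaVersion.split("\\.");
        int javaMajorVersion = Integer.parseInt(versionComponents[0]);
        int javaMinorVersion = Integer.parseInt(versionComponents[1]);
        return javaMajorVersion == 1 && javaMinorVersion <= 6;
    }

    /** Hypothetical tolerant variant: splits on any non-digit and treats missing parts as 0. */
    static boolean is16OrLessTolerant(String javaVersion) {
        String[] parts = javaVersion.split("[^0-9]+");
        int major = parts.length > 0 && !parts[0].isEmpty() ? Integer.parseInt(parts[0]) : 0;
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return major == 1 && minor <= 6;
    }
}
```

Calling {{is16OrLessFragile("17")}} throws the {{ArrayIndexOutOfBoundsException}} from the report, while {{is16OrLessFragile("17.9")}} succeeds, which is why appending {{.9}} works as a stopgap.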

> java.lang.ExceptionInInitializerError in  
> org.apache.pdfbox.util.PDFTextStripper class
> --
>
> Key: PDFBOX-5455
> URL: https://issues.apache.org/jira/browse/PDFBOX-5455
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 1.8.9
>Reporter: Kalpesh Patel
>Priority: Minor
>
> Unable to read pdf file . Getting below exception - 
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
>     at org.apache.pdfbox.util.PDFTextStripper.<clinit>(PDFTextStripper.java:123)
>  
> Let me know if more details needed
>  
> [~Bettenburg] 
>  
> [~will86] 
>  
>  
>  
>  






[jira] [Commented] (PDFBOX-2776) support "Long Term Validation" signature extensions (LTV)

2022-06-07 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550903#comment-17550903
 ] 

Michael Klink commented on PDFBOX-2776:
---

The problem here is that the task in this issue is unclear.

Both LTV related mechanisms of the old ISO 32000-1 and of the newer PAdES 
specifications were discussed. These mechanisms differ considerably.

Both "LTV-enabled" in Adobe Acrobat and PAdES LTV were mentioned as targets. 
They are not identical.

Up to now the issue has essentially served as a brainstorming platform for LTV 
related features.

 

> support "Long Term Validation" signature extensions (LTV)
> -
>
> Key: PDFBOX-2776
> URL: https://issues.apache.org/jira/browse/PDFBOX-2776
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Signing
>Affects Versions: 2.0.0
>Reporter: Ralf Hauser
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
> Attachments: certified_368835_Sig_de_201026171017_LTV.pdf, 
> nonSigPdf-sig1.pdf, notCertified_368835_Sig_en_201026090509.pdf, 
> notCertified_368835_Sig_en_201026090509_report.png, shortLivedCrlAsLTV-sig.pdf
>
>
> in recent acrobat readers, every signature is commented w.r.t. "LTV"
> ETSI TS 102 778-4 V1.1.2 (2009-12) Technical Specification
> referenced as part 4 in
> http://en.wikipedia.org/wiki/PAdES 
> It would be great if pdf signatures created with PDFBox would assist in 
> creatign those.
> Target test setup: 
> 1) input of an unsigned PDF-1.5 document
> 2) signature with
> a) local key pair
> b) hsm
> c) remote signature service (e.g. via soap)
> 3) add ocsp response for LTV (crls typically are larger)
> ==> Result: signed pdf where acrobat reader claims it to be "LTV enabled"
> see also PDFBOX-1848
> more in 
> http://stackoverflow.com/questions/26090558/ltv-enabled-signature-in-pdf






[jira] [Commented] (PDFBOX-5451) Avoid copying byte array for COSString

2022-06-05 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550227#comment-17550227
 ] 

Michael Klink commented on PDFBOX-5451:
---

Please consider documenting this in the JavaDocs and probably also in the 3.0.0 
migration guide.

People used to the cloning may re-use their byte arrays...
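
A toy illustration of the risk, using a made-up {{ByteHolder}} class rather than {{COSString}} itself: {{copying}} models the old cloning behaviour, {{sharing}} models keeping the caller's array.

```java
import java.nio.charset.StandardCharsets;

/** Toy stand-in for a string object wrapping a byte array; not PDFBox's COSString. */
final class ByteHolder {
    private final byte[] bytes;

    private ByteHolder(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Old behaviour: the holder gets its own copy of the array. */
    static ByteHolder copying(byte[] source) {
        return new ByteHolder(source.clone());
    }

    /** New behaviour: the holder keeps a reference to the caller's array. */
    static ByteHolder sharing(byte[] source) {
        return new ByteHolder(source);
    }

    String text() {
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}
```

If a caller fills one buffer with "first", creates both holders from it, and then overwrites the buffer with "other", the copying holder still reads "first" while the sharing holder now reads "other".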

> Avoid copying byte array for COSString
> --
>
> Key: PDFBOX-5451
> URL: https://issues.apache.org/jira/browse/PDFBOX-5451
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing
>Affects Versions: 3.0.0 PDFBox
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 3.0.0 PDFBox
>
>
> When creating a COSString the given byte array is cloned. As in most cases 
> the array is just an intermediate object we should remove that to reduce the 
> memory footprint.
> Furthermore the {{getBytes}} returns the internal byte array so that I don't 
> see any reason not to use the given byte array itself instead of cloning it






[jira] [Comment Edited] (PDFBOX-4951) Sequences of DIN SPEC 91379 with combining letters are rendered incorrectly

2022-06-05 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550225#comment-17550225
 ] 

Michael Klink edited comment on PDFBOX-4951 at 6/5/22 2:59 PM:
---

{quote}In my understanding you always need to embed the complete font for a 
form.{quote}
No. You merely need to embed the glyphs that will be used in the fields in 
question. For example, if you have reason to expect only English inputs in 
those form fields, there is no need to embed the Greek or Cyrillic characters, 
let alone the lot of CJK glyphs, of the font.

For full DIN SPEC 91379 support that means a lot of characters but by far not 
all in case of very generic fonts.


was (Author: mkl):
{quote}In my understanding you always need to embed the complete font for a 
form.{quote}
No. You merely need to embed the glyphs that will be used in the fields in 
question. For example, if you have reason to expect only English inputs in 
those form fields, there is no need to embed the Greek or Cyrillic characters, 
let alone the lot of CJK glyphs, of the font. 

> Sequences of DIN SPEC 91379 with combining letters are rendered incorrectly
> ---
>
> Key: PDFBOX-4951
> URL: https://issues.apache.org/jira/browse/PDFBOX-4951
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.21
>Reporter: Volker Kunert
>Priority: Major
> Attachments: DIN_SPEC_91379_Sequences-aa.pdf, 
> DIN_SPEC_91379_Sequences-ab.pdf, DIN_SPEC_91379_Sequences-ac.pdf, 
> DIN_SPEC_91379_Sequences.txt, DefaultScriptProcessor.java, 
> DoGlyphLayoutDinSpec91379.pdf, DoGlyphLayoutDinSpec91379Form.pdf, 
> DoGlyphPositionBengali.pdf, ExamplePdfboxFopPos-By-Tilman.pdf, 
> ExamplePdfboxFopPos.java, ExamplePdfboxFopPos.pdf, 
> ExamplePdfboxFopPosForm.java, ExamplePdfboxFopPosForm.pdf, TestPdfbox.java, 
> TestPdfboxFop2.java, TestPdfboxFop2.pdf, TestPdfboxJava2D.java, 
> TestPdfboxJava2D.pdf, patch-2020-10-02.txt, pdfbox.patch, pdfbox.pdf, 
> screenshot-1.png
>
>
> Accented Letters composed of Unicode base letter and combining accent are 
> rendered wrong. E.g. with 0041 030B LATIN CAPITAL LETTER A WITH COMBINING 
> DOUBLE ACUTE ACCENT the accent appears at the right hand side of the letter 
> A, not above the letter A.
> The position is wrong for most of the sequences defined in the following spec:
> DIN SPEC 91379: Characters in Unicode for the electronic processing of names 
> and data 
>  exchange in Europe; with digital attachment
>  [https://www.xoev.de/downloads-2316#StringLatin]
>  [https://www.din.de/de/wdc-beuth:din21:301228458]
>  
> The correct rendering should look like the output of hb-view 2.6.8, see files 
> DIN_SPEC_91379_Sequences*.pdf.
> The output of PDFBox is appended in pdfbox.pdf, which is created by running 
> TestPdfbox.java. The sequences are read from file 
> DIN_SPEC_91379_Sequences.txt.
>  
> Font used for testing: NotoSansMono-Regular.ttf, see 
> [https://www.google.com/get/noto/] 
> download: 
> [https://noto-website-2.storage.googleapis.com/pkgs/NotoSansMono-hinted.zip]
>  See also FOP-2969
>  






[jira] [Commented] (PDFBOX-4951) Sequences of DIN SPEC 91379 with combining letters are rendered incorrectly

2022-06-05 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550225#comment-17550225
 ] 

Michael Klink commented on PDFBOX-4951:
---

{quote}In my understanding you always need to embed the complete font for a 
form.{quote}
No. You merely need to embed the glyphs that will be used in the fields in 
question. For example, if you have reason to expect only English inputs in 
those form fields, there is no need to embed the Greek or Cyrillic characters, 
let alone the lot of CJK glyphs, of the font. 







[jira] [Commented] (PDFBOX-5439) Details of form fields with same form field name not getting stored using PDAcroform

2022-05-20 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17540247#comment-17540247
 ] 

Michael Klink commented on PDFBOX-5439:
---

Let's have a look at the spec...

h3. ISO 32000-1

{panel:title=ISO 32000-1, section 12.7.3.2 "Field Names"}
It is possible for different field dictionaries to have the same fully 
qualified field name if they are descendants of a common ancestor with that 
name and have no partial field names (*T* entries) of their own. Such field 
dictionaries are different representations of the same underlying field; they 
should differ only in properties that specify their visual appearance. In 
particular, field dictionaries with the same fully qualified field name shall 
have the same field type (*FT*), value (*V*), and default value (*DV*).
{panel}

As _such field dictionaries are different representations of the same 
underlying field_, it is appropriate to treat them as a single field. One 
merely must make sure that the non-widget properties of all the different 
representations of the field are the same and are used as properties of one's 
field object, not the properties of the base field itself.

h3. ISO 32000-2

{panel:title=ISO 32000-2, 12.7.4.2 Field names}
A field dictionary that does not have a partial field name (*T* entry) of its 
own shall not be considered a field but simply a Widget annotation. Such 
annotations are different representations of the same underlying field; they 
should differ only in properties that specify their visual appearance. In 
addition, actual field dictionaries with the same fully qualified field name 
shall have the same field type (*FT*), value (*V*), and default value (*DV*).
{panel}

If working with the current PDF standard, the "different fields with the same 
name" in the example explicitly are merely different widget annotations of the 
same field.
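
A simplified model of that rule may help. {{FieldNode}} is a toy class, not a PDFBox type: its {{partialName}} models the *T* entry and is {{null}} when the entry is absent, and the fully qualified name joins the partial names from the root downwards with periods.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy field dictionary; partialName models the T entry (null when absent). */
final class FieldNode {
    final String partialName;
    final FieldNode parent;

    FieldNode(String partialName, FieldNode parent) {
        this.partialName = partialName;
        this.parent = parent;
    }

    /** Partial names from the root to this node, joined by periods; nodes without T add nothing. */
    String fullyQualifiedName() {
        String inherited = parent == null ? null : parent.fullyQualifiedName();
        if (partialName == null) {
            return inherited;
        }
        return inherited == null ? partialName : inherited + "." + partialName;
    }

    /** Groups terminal nodes by name: each map entry is one field with its representations. */
    static Map<String, List<FieldNode>> groupByName(List<FieldNode> terminalNodes) {
        Map<String, List<FieldNode>> fields = new LinkedHashMap<>();
        for (FieldNode node : terminalNodes) {
            fields.computeIfAbsent(node.fullyQualifiedName(), k -> new ArrayList<>()).add(node);
        }
        return fields;
    }
}
```

In this model, two nameless kids under a node named {{name}} end up in a single map entry with two representations, which corresponds to one field being reported for two visible form elements.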

> Details of form fields with same form field name not getting stored using 
> PDAcroform
> 
>
> Key: PDFBOX-5439
> URL: https://issues.apache.org/jira/browse/PDFBOX-5439
> Project: PDFBox
>  Issue Type: Improvement
>  Components: AcroForm
>Reporter: Shubham Gupta
>Priority: Minor
> Attachments: Expected And Actual Result.docx, SAMPLE PDF.pdf, sample 
> code.txt
>
>
> Steps to reproduce:
>  # Develop a program that will take the PDF in PDDocument and then get the 
> Acroform details in PDAcroform now in a list of PDField try to get all the 
> fields. (I have attached a sample code for a better understanding of the 
> team).
>  # Now use a PDF which is having forms and keep two form fields with the same 
> name (let's say you are using Adobe Acrobat when you will go to tools and 
> then to Forms and then to Edit Form option and when you will click any form 
> field TEXT FIELD PROPERTIES will open. Just Go Click on the General tab and 
> Keep the two form fields names the same.)
>  # Now if the PDF contains in total of 10 form fields, the list the we got 
> from pdfbox that we have will be of size 9, This is because PDAcroform is not 
> taking those form fields that have the same form field name, they are storing 
> only those form fields whose name are unique. 
>  # This needs to be improved so that a developer using PDFBOX library, which 
> is by the way superb,  wants to validate those Form Fields which have no 
> tooltip and the duplicate form fields are the ones that don't have a tooltip 
> but since only one is getting stored he will get the wrong result every time, 
> I have given a simple example to make the team understand but this needs to 
> be improved.
> Please find the attachment for your reference.






[jira] [Commented] (PDFBOX-5437) COSStream has been closed Exception on saving PDF document

2022-05-18 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538872#comment-17538872
 ] 

Michael Klink commented on PDFBOX-5437:
---

This sounds like you work with two documents at a time, say A and B, add an 
object of document A to document B, close A (explicitly or via garbage 
collection), and then save B.

This cannot work.

You should _clone_ the object from A and only add the clone to B, or you should 
keep document A open and referenced until after you save B.

The PDFBox {{PDFCloneUtility}} gives you a hint how cloning is done.
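
The lifetime problem can be sketched with a toy model. This is not PDFBox code: {{ToyStream}}, {{ToyDocument}}, {{addShared}} and {{addClone}} are hypothetical names invented for illustration, and an unchecked exception stands in for the {{IOException}} from the report.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a stream object whose data is owned by one document. */
class ToyStream {
    private byte[] data;
    private boolean closed;

    ToyStream(byte[] data) {
        this.data = data.clone();
    }

    byte[] read() {
        if (closed) {
            throw new IllegalStateException(
                    "stream has been closed; was its owning document closed?");
        }
        return data.clone();
    }

    /** Deep copy that no longer depends on the lifetime of the original. */
    ToyStream deepCopy() {
        return new ToyStream(read());
    }

    void close() {
        closed = true;
        data = null;
    }
}

/** Toy stand-in for a document that closes the streams it owns. */
class ToyDocument implements AutoCloseable {
    private final List<ToyStream> owned = new ArrayList<>();
    private final List<ToyStream> content = new ArrayList<>();

    ToyStream createStream(byte[] data) {
        ToyStream stream = new ToyStream(data);
        owned.add(stream);
        content.add(stream);
        return stream;
    }

    /** Adds a reference only; ownership stays with the creating document. */
    void addShared(ToyStream stream) {
        content.add(stream);
    }

    /** Adds an independent copy that this document owns itself. */
    void addClone(ToyStream stream) {
        ToyStream copy = stream.deepCopy();
        owned.add(copy);
        content.add(copy);
    }

    /** Reads every stream, as saving would; returns the number of bytes read. */
    int save() {
        int written = 0;
        for (ToyStream stream : content) {
            written += stream.read().length;
        }
        return written;
    }

    @Override
    public void close() {
        owned.forEach(ToyStream::close);
    }
}
```

In this model, saving a document that merely shares a stream fails once the owning document is closed, while a document holding a clone can still be saved.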

> COSStream has been closed Exception on saving PDF document
> --
>
> Key: PDFBOX-5437
> URL: https://issues.apache.org/jira/browse/PDFBOX-5437
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.24, 2.0.25, 2.0.26
>Reporter: Sanjivani
>Priority: Major
>
> The exception below occurs when saving the created PDF:
> java.io.IOException: COSStream has been closed and cannot be read. Perhaps 
> its enclosing PDDocument has been closed?
>     at org.apache.pdfbox.cos.COSStream.checkClosed(COSStream.java:83) ~[pdfbox-2.0.26.jar:2.0.26]
>     at org.apache.pdfbox.cos.COSStream.createRawInputStream(COSStream.java:133) ~[pdfbox-2.0.26.jar:2.0.26]
>     at org.apache.pdfbox.pdfwriter.COSWriter.visitFromStream(COSWriter.java:1268) ~[pdfbox-2.0.26.jar:2.0.26]
>     at org.apache.pdfbox.cos.COSStream.accept(COSStream.java:416) ~[pdfbox-2.0.26.jar:2.0.26]
>     at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObject(COSWriter.java:570) ~[pdfbox-2.0.26.jar:2.0.26]
>     at org.apache.pdfbox.pdfwriter.COSWriter.doWriteObjects(COSWriter.java:496) ~[pdfbox-2.0.26.jar:2.0.26]
>     at org.apache.pdfbox.pdfwriter.COSWriter.doWriteBody(COSWriter.java:480) ~[pdfbox-2.0.26.jar:2.0.26]
>     at org.apache.pdfbox.pdfwriter.COSWriter.visitFromDocument(COSWriter.java:1162) ~[pdfbox-2.0.26.jar:2.0.26]
>     at org.apache.pdfbox.cos.COSDocument.accept(COSDocument.java:452) ~[pdfbox-2.0.26.jar:2.0.26]
>     at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1435) ~[pdfbox-2.0.26.jar:2.0.26]
>     at org.apache.pdfbox.pdfwriter.COSWriter.write(COSWriter.java:1322) ~[pdfbox-2.0.26.jar:2.0.26]
>     at org.apache.pdfbox.pdmodel.PDDocument.save(PDDocument.java:1377) ~[pdfbox-2.0.26.jar:2.0.26]






[jira] [Commented] (PDFBOX-5436) PDTerminalField.applyChange() no longer check for getAcroForm().getNeedAppearances()

2022-05-17 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538304#comment-17538304
 ] 

Michael Klink commented on PDFBOX-5436:
---

[~msahyoun],

{quote}Comments/Feedback welcome{quote}

I'm definitely in favor of generally creating appearances by default.

But as long as there is no replacement font mechanism to handle tasks like 
[~waiwai_]'s task, there should be a not-too-low-level way to set the value 
without creating an appearance. Optimally this way should _remove_ any existing 
appearance. Field appearances containing something different than the field 
value have been abused to change signed PDFs; nowadays, therefore, such 
differences may actually cause Adobe Acrobat to validate a signature as invalid.

> PDTerminalField.applyChange() no longer check for 
> getAcroForm().getNeedAppearances()
> 
>
> Key: PDFBOX-5436
> URL: https://issues.apache.org/jira/browse/PDFBOX-5436
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.0 PDFBox
>Reporter: wai
>Priority: Major
> Attachments: image-2022-05-17-18-36-33-378.png
>
>
> In version 2.0.26, I fill fields in PDF form by
> {code:java}
> PDDocument pdf = ...; // loaded a PDF form
> PDAcroForm pdfForm = pdf.getDocumentCatalog().getAcroForm(); 
> pdfForm.setNeedAppearances(true);
> pdfForm.getField("field_name_").setValue("some text");{code}
> Although the PDF form doesn't contain all glyph for the text set, 
> {{org.apache.pdfbox.pdmodel.interactive.form.PDTerminalField.applyChange()}} 
> would not apply {{constructAppearances()}} as {{setNeedAppearances(true)}} 
> configured.
> However when we come to version 3.0.0-alpha3, 
> {{PDTerminalField.applyChange()}} won't check the status from 
> {{getAcroForm().getNeedAppearances()}} before invoking 
> {{{}constructAppearances(){}}}. This behaviour contradicted the comment wrote 
> "{{{}Applies a value change to the field. Generates appearances if required 
> and raises events.{}}}"
> +version 2.0.26+
>  
> {code:java}
> package org.apache.pdfbox.pdmodel.interactive.form;
> public abstract class PDTerminalField extends PDField
> {
>     /**
>      * Applies a value change to the field. Generates appearances if required 
> and raises events.
>      * 
>      * @throws IOException if the appearance couldn't be generated
>      */
>     protected final void applyChange() throws IOException
>     {
>         if (!getAcroForm().getNeedAppearances())
>         {
>             constructAppearances();
>         }
>         // if we supported JavaScript we would raise a field changed event 
> here
>     }{code}
>  
> +3.0.0-alpha3+
>  
> {code:java}
> package org.apache.pdfbox.pdmodel.interactive.form;
> public abstract class PDTerminalField extends PDField
> { 
>     /**
>      * Applies a value change to the field. Generates appearances if required 
> and raises events.
>      * 
>      * @throws IOException if the appearance couldn't be generated
>      */
>     protected final void applyChange() throws IOException
>     {
>         constructAppearances();
>         // if we supported JavaScript we would raise a field changed event 
> here
>     }{code}






[jira] [Commented] (PDFBOX-5062) IllegalBlockSizeException when loading the file

2022-05-17 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538107#comment-17538107
 ] 

Michael Klink commented on PDFBOX-5062:
---

{quote}if that is the case why I am able to open this document in any pdf 
viewer.{quote}

Viewers often are built to ignore or fix many errors under the hood.

While this may be ok for PDF viewers where the user usually can directly see 
whether a repair attempt failed or succeeded, it is not ok for a PDF library 
running automatically where the output may be forwarded directly to thousands 
of potential customers or be archived in some legal archive. 

(Strictly speaking it is *not ok* even for viewers as something wrong might be 
shown without being recognizable as rubbish; thus, never trust what you see in 
a PDF viewer!) 

> IllegalBlockSizeException when loading the file
> ---
>
> Key: PDFBOX-5062
> URL: https://issues.apache.org/jira/browse/PDFBOX-5062
> Project: PDFBox
>  Issue Type: Bug
>  Components: Crypto, PDModel
>Affects Versions: 2.0.22
>Reporter: Zubair Uddin Farooqui
>Priority: Major
> Attachments: Medical services-1 (dup-keywords) (1).pdf
>
>
> Getting _IllegalBlockSizeException_ when loading the file 
> *Code:*
> {code:java}
> PDDocument pdDoc = PDDocument.load(file);{code}
> *Exception:*
> {code:java}
> java.io.IOException: javax.crypto.IllegalBlockSizeException: Input length 
> must be multiple of 16 when decrypting with padded cipher
>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptDataAESother(SecurityHandler.java:315)
>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.encryptData(SecurityHandler.java:201)
>     at org.apache.pdfbox.pdmodel.encryption.SecurityHandler.decryptStream(SecurityHandler.java:510)
>     at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:929)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:886)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:806)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:766)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1041)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:989)
> {code}






[jira] [Commented] (PDFBOX-5436) PDTerminalField.applyChange() no longer check for getAcroForm().getNeedAppearances()

2022-05-17 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538071#comment-17538071
 ] 

Michael Klink commented on PDFBOX-5436:
---

As mentioned in PDFBOX-3356, *NeedAppearances* has been deprecated in PDF 2.0. 
Also, appearance streams are required since PDF 2.0. Thus, for a version of 2 
and higher, the PDFBox 3 code change is correct.

But in PDFBOX-3356 [~msahyoun] said that new values shall *always* be 
reflected, not only for PDFs of version 2 and up. Thus, there apparently was 
some reason for that change for arbitrary PDF versions.

> PDTerminalField.applyChange() no longer check for 
> getAcroForm().getNeedAppearances()
> 
>
> Key: PDFBOX-5436
> URL: https://issues.apache.org/jira/browse/PDFBOX-5436
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.0 PDFBox
>Reporter: wai
>Priority: Major
>






[jira] [Commented] (PDFBOX-5433) PDFStreamEngine creating new operators that do not exist in document

2022-05-15 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537224#comment-17537224
 ] 

Michael Klink commented on PDFBOX-5433:
---

Beware, *TD* is not the only operator implemented by means of other 
operators; *"* and *'* are further examples.

An alternative to replacing each such {{OperatorProcessor}} would be to 
determine in {{processOperator}} whether it has been called recursively (e.g. 
by using a counter which you increase at the start and decrease upon leaving) 
and only apply special processing if not (e.g. if such a counter is 0).
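
A minimal sketch of the counter idea, using a toy engine rather than PDFBox itself: {{ToyEngine}} and {{TopLevelRecorder}} are hypothetical names, and the delegation table is an assumption made for illustration (the *TD* case mirrors the *TL* and *Td* calls visible in the reported output).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy engine: some operators are implemented by re-dispatching to others. */
class ToyEngine {
    void processOperator(String name) {
        switch (name) {
            case "TD":  // delegates to TL, then Td
                processOperator("TL");
                processOperator("Td");
                break;
            case "\"":  // delegates to Tw, Tc, then '
                processOperator("Tw");
                processOperator("Tc");
                processOperator("'");
                break;
            case "'":   // delegates to T*, then Tj
                processOperator("T*");
                processOperator("Tj");
                break;
            default:    // primitive operator, nothing to delegate
                break;
        }
    }
}

/** Records only the operators that were dispatched from outside the engine. */
class TopLevelRecorder extends ToyEngine {
    final List<String> recorded = new ArrayList<>();
    private int depth;

    @Override
    void processOperator(String name) {
        if (depth == 0) {
            recorded.add(name);  // special processing for top-level calls only
        }
        depth++;                 // increase at the start
        try {
            super.processOperator(name);
        } finally {
            depth--;             // decrease upon leaving, even on exceptions
        }
    }
}
```

Feeding such a recorder the operators of a content stream should leave only those operators in {{recorded}}; the delegated calls run at a depth greater than 0 and are skipped, however deeply they nest.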

I did something like that in my 
[{{PdfContentStreamEditor}}|https://github.com/mkl-public/testarea-pdfbox2/blob/master/src/main/java/mkl/testarea/pdfbox2/content/PdfContentStreamEditor.java]
 using a {{boolean}}, but I just realized that that doesn't suffice in case of 
*"*.

> PDFStreamEngine creating new operators that do not exist in document
> 
>
> Key: PDFBOX-5433
> URL: https://issues.apache.org/jira/browse/PDFBOX-5433
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Mike Cantrell
>Priority: Major
> Attachments: pdfbox-stream-engine-operators.zip, screenshot-1.png
>
>
> We're using PDFStreamEngine to do some analysis and filtering (optimizations) 
> to the document's content streams. I've found an odd case where a form giving 
> us extra (unwanted) operators that don't exist in the original stream.
> According to the PDFDebugger, the form's stream has the following contents:
>  
> {code:java}
> 0 TL
> q
>   BT
>     1 0 0 rg
>     0 i
>     /TT0 20 Tf
>     0 Tc
>     0 Tw
>     0 Ts
>     100 Tz
>     0 Tr
>     0 -15.791 TD
>     (HOODHD035236) Tj
>   ET
> Q{code}
> I created a debug utility to output the operators given by the PDFStreamEngine
> {code:java}
> @Getter
> static class StreamDebugger extends PDFStreamEngine {
>     String formName;
>     Operator operator;
>     List<COSBase> operands;
>     int operatorCount;
> 
>     public StreamDebugger() {
>         addOperator(new BeginText());
>         addOperator(new Concatenate());
>         addOperator(new DrawObject()); // special text version
>         addOperator(new EndText());
>         addOperator(new SetGraphicsStateParameters());
>         addOperator(new Save());
>         addOperator(new Restore());
>         addOperator(new NextLine());
>         addOperator(new SetCharSpacing());
>         addOperator(new MoveText());
>         addOperator(new MoveTextSetLeading());
>         addOperator(new SetFontAndSize());
>         addOperator(new ShowText());
>         addOperator(new ShowTextAdjusted());
>         addOperator(new SetTextLeading());
>         addOperator(new SetMatrix());
>         addOperator(new SetTextRenderingMode());
>         addOperator(new SetTextRise());
>         addOperator(new SetWordSpacing());
>         addOperator(new SetTextHorizontalScaling());
>         addOperator(new ShowTextLine());
>         addOperator(new ShowTextLineAndSpace());
>     }
> 
>     @Override
>     public void showForm(PDFormXObject form) throws IOException {
>         // operands still hold the arguments of the Do operator that draws the form
>         this.formName = ((COSName) operands.get(0)).getName();
>         super.showForm(form);
>         this.formName = null;
>     }
> 
>     @Override
>     protected void processOperator(Operator operator, List<COSBase> operands) 
>             throws IOException {
>         this.operator = operator;
>         this.operands = operands;
>         // only log operators seen while the form named Fm0 is being processed
>         if (Objects.equals(this.formName, "Fm0")) {
>             this.operatorCount++;
>             System.out.printf("%s:%s%n", operator.getName(), 
>                     operands.toString());
>         }
>         super.processOperator(operator, operands);
>     }
> } {code}
> The resulting output:
> {code:java}
> TL:[COSInt{0}]
> q:[]
> BT:[]
> rg:[COSInt{1}, COSInt{0}, COSInt{0}]
> i:[COSInt{0}]
> Tf:[COSName{TT0}, COSInt{20}]
> Tc:[COSInt{0}]
> Tw:[COSInt{0}]
> Ts:[COSInt{0}]
> Tz:[COSInt{100}]
> Tr:[COSInt{0}]
> TD:[COSInt{0}, COSFloat{-15.791}]
> TL:[COSFloat{15.791}]
> Td:[COSInt{0}, COSFloat{-15.791}]
> Tj:[COSString{HOODHD035236}]
> ET:[]
> Q:[] {code}
> These operators do not exist in the original stream:
> {code:java}
> TL:[COSFloat{15.791}]
> Td:[COSInt{0}, COSFloat{-15.791}]{code}
> If you were to re-write the stream given the operators from the engine, it 
> causes display issues in the resulting PDF.
> I'm attaching a test case which demonstrates the issue. 
>  






[jira] [Commented] (PDFBOX-5430) PDFStreamEngine.showTextStrings with font switch

2022-05-08 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533452#comment-17533452
 ] 

Michael Klink commented on PDFBOX-5430:
---

Indeed, this content stream simply is broken. As [~tilman] has shown, a number 
of instructions therein have - incorrectly! - been made the contents of the 
array argument of a *TJ* instruction.

Adobe Acrobat apparently ignores that the instructions are so enclosed and acts 
as if there was no *\[* or *\] TJ*.

Other viewers might simply ignore (or treat as strings) everything that is 
neither string nor number in the array.

In case of content that matters (as invoice content does), this might lead to 
completely different appearances if viewed with different viewers. Thus, PDFBox 
should definitely throw an exception here and not repair it one way or the 
other.
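
The strict behavior argued for here can be sketched as a check that accepts only strings and numbers in a *TJ* array. This is an illustration with plain Java types, not the PDFBox implementation: {{ToyName}} and {{TjArrayCheck}} are hypothetical, and an unchecked exception stands in for the {{IOException}} PDFBox throws.

```java
import java.util.List;

/** Hypothetical stand-in for a name object such as /F3. */
final class ToyName {
    final String name;

    ToyName(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return "/" + name;
    }
}

final class TjArrayCheck {
    private TjArrayCheck() {
    }

    /**
     * Accepts only strings (text to show) and numbers (position adjustments).
     * Anything else is rejected instead of being skipped or reinterpreted.
     *
     * @return the number of elements checked
     */
    static int requireStringsAndNumbers(List<?> tjArray) {
        for (Object element : tjArray) {
            if (!(element instanceof String) && !(element instanceof Number)) {
                String type = element == null
                        ? "null" : element.getClass().getSimpleName();
                throw new IllegalArgumentException(
                        "Unexpected " + type + " in array for TJ operation: " + element);
            }
        }
        return tjArray.size();
    }
}
```

Rejecting the array outright keeps the decision visible to the caller, whereas a silent repair would pick one of several possible interpretations.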

> PDFStreamEngine.showTextStrings with font switch
> 
>
> Key: PDFBOX-5430
> URL: https://issues.apache.org/jira/browse/PDFBOX-5430
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.26
>Reporter: Oliver Schmidtmer
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.27, 3.0.0 PDFBox
>
> Attachments: keine Vorschau ELO-1228188_20220228_11462_HD_online.pdf
>
>
> The attached PDF fails to render with an PDFStreamEngine.showTextStrings with 
> the following exception:
> "java.io.IOException: Unknown type COSName in array for TJ 
> operation:COSName\{F3}"
> This seems to be a font switch.
> {code:java}
> diff --git 
> "a/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDFStreamEngine.java" 
> "b/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDFStreamEngine.java"
> index e4f2259a5..12edadd2b 100644
> --- 
> "a/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDFStreamEngine.java"
> +++ 
> "b/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDFStreamEngine.java"
> @@ -680,6 +680,18 @@ public abstract class PDFStreamEngine
>  byte[] string = ((COSString)obj).getBytes();
>  showText(string);
>  }
> +else if (obj instanceof COSName)
> +{
> +if(((COSName) obj).getName().startsWith("F"))
> +{
> +textState.setFont(resources.getFont((COSName) obj));
> +}
> +else
> +{
> +throw new IOException("Unknown type " + 
> obj.getClass().getSimpleName()
> ++ " in array for TJ operation:" + obj);
> +}
> +}
>  else if (obj instanceof COSArray)
>  {
>  LOG.error("Nested arrays are not allowed in an array for TJ 
> operation:" + obj);
> {code}






[jira] [Commented] (PDFBOX-5408) The FlowCollection Licensing report is fine in the Desktop Client but when exported as a PDF the scale makes the report unusable.

2022-04-25 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527575#comment-17527575
 ] 

Michael Klink commented on PDFBOX-5408:
---

Also looking into the internal page representation, there is a bitmap of the 
chart without any red bars and without the text field. Other than that there 
merely is page header and footer material.

Apparently when you updated PDFBox, your PDF creation code broke. Maybe the 
update caused a transitive update of some other dependency while your PDF 
creation code depends on the previous version of that dependency.

It looks like you use iText-5.5.13 (AGPL version) to create the PDF. Thus, it 
should be no problem to share the pivotal parts thereof. 

> The FlowCollection Licensing report is fine in the Desktop Client but when 
> exported as a PDF the scale makes the report unusable.
> -
>
> Key: PDFBOX-5408
> URL: https://issues.apache.org/jira/browse/PDFBOX-5408
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.16, 2.0.25
>Reporter: Thrinadh
>Priority: Major
> Attachments: FlowRateUsageReport-Acrobat.png, PDF_not 
> _printing_chart_properly.jpg, actual_chart_in_desktop_client.jpg, 
> flow_rate_usage_report.pdf
>
>
> We are using pdfbox version 2.0.11 in lower version of product and in both 
> swing client and pdf we can see similar chart
>  
> But in higher version of product we updated pdfbox version to v2.0.16 to 
> overcome security vulnerability after that in pdf the license report chart is 
> not printing properly
> We don't see any errors
>  
> Note: we tried with pdfbox version v2.0.25 but still it is not working






[jira] [Comment Edited] (PDFBOX-5420) PDFTextStripper does not use cm to infer correct font size

2022-04-24 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527199#comment-17527199
 ] 

Michael Klink edited comment on PDFBOX-5420 at 4/24/22 3:49 PM:


This is not a bug but a design decision. If you consider the JavaDoc 
documentation of {{TextPosition.getFontSizeInPt}}, you'll see

{code:java}
/**
 * This will get the font size in pt. To get this size we have to multiply 
the font size from
 * {@link #getFontSize() getFontSize()} with the text matrix (set by the 
"Tm" operator)
 * horizontal scaling factor and truncate the result to integer. The actual 
rendering may appear
 * bigger or smaller depending on the current transformation matrix (set by 
the "cm" operator).
 * To get the size in rendering, use {@link #getXScale() getXScale()}.
 *
 * @return The font size in pt.
 */
public float getFontSizeInPt()
{code}

Thus, the behavior you observed is the documented behavior!

Nonetheless, one might wonder whether this _documented_ behavior is the 
_desired_ behavior. So you might consider changing your *bug* issue to an 
*improvement* or *wish* issue. Be aware, though, that this effectively would be 
an API change which would be unlikely to be included in a 2.x update. But maybe 
you're still in time for a 3.0 change.

That being said, though, in that case the proper improvement would be 
different: Both the existing and the proposed code only work (in their 
respective fashion) if the considered matrices only scale. As soon as 
non-trivial rotation is involved, the value returned by {{getFontSizeInPt}} can 
be any number whose absolute value is not larger than the value expected for 
the respective implementation. 

Also, both the existing and the proposed implementation focus on the 
_horizontal_ scaling. Wouldn't the _vertical_ extent be more relevant for a 
font size value?

Furthermore, the page *UserUnit* value is ignored. As _The range of supported 
values shall be implementation-dependent,_ though, both the original 
implementation and your fix could claim that only the value {{1}} is 
supported... ;)
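
The rotation point can be illustrated with plain arithmetic. The sketch below assumes the usual PDF matrix layout [a b c d e f] and a hypothetical helper class {{MatrixScale}}; it is not PDFBox code. For a uniform scale s combined with a rotation by theta, the entry a alone equals s * cos(theta), while the lengths of the transformed unit vectors recover s.

```java
final class MatrixScale {
    private MatrixScale() {
    }

    /** Entries a, b, c, d of a uniform scale by s combined with a rotation by theta (radians). */
    static double[] scaleThenRotate(double s, double theta) {
        return new double[] {
            s * Math.cos(theta), s * Math.sin(theta),
            -s * Math.sin(theta), s * Math.cos(theta)
        };
    }

    /** Length of the image of the horizontal unit vector (1, 0), i.e. of (a, b). */
    static double horizontalScale(double[] m) {
        return Math.hypot(m[0], m[1]);
    }

    /** Length of the image of the vertical unit vector (0, 1), i.e. of (c, d). */
    static double verticalScale(double[] m) {
        return Math.hypot(m[2], m[3]);
    }
}
```

With s = 2 and theta = 60 degrees, multiplying a font size of 12 by the entry a alone gives about 12, whereas either vector length gives about 24.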




> PDFTextStripper does not use cm to infer correct font size
> --
>
> Key: PDFBOX-5420
> URL: https://issues.apache.org/jira/browse/PDFBOX-5420
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Franken
>Priority: Minor
> Attachments: TextStripperTest.kt, 
> TextStripperUsesTransformationMatrix.java, ec_2.fixed.pdf, 
> image-2022-04-23-14-46-34-929.png
>
>
> *Given*
> Given is a PDF where the cm operator is used to scale the transformation 
> matrix by a factor of 0.02834933. The font size is then set to 282 using the 
> Tf operator. 
> !image-2022-04-23-14-46-34-929.png|width=389,height=84!
>  
> *Error Description*
> When the PdfTextStripper is used to fetch the text from that pdf, the 
> internal representation of the Textpositions contains the wrong font size of 
> 282pt. The correct font size would be 10pt. The reason for this 

[jira] [Commented] (PDFBOX-5420) PDFTextStripper does not use cm to infer correct font size

2022-04-24 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527199#comment-17527199
 ] 

Michael Klink commented on PDFBOX-5420:
---

This is not a bug but a design decision. If you consider the JavaDoc 
documentation of {{TextPosition.getFontSizeInPt}}, you'll see

{code:java}
/**
 * This will get the font size in pt. To get this size we have to multiply 
the font size from
 * {@link #getFontSize() getFontSize()} with the text matrix (set by the 
"Tm" operator)
 * horizontal scaling factor and truncate the result to integer. The actual 
rendering may appear
 * bigger or smaller depending on the current transformation matrix (set by 
the "cm" operator).
 * To get the size in rendering, use {@link #getXScale() getXScale()}.
 *
 * @return The font size in pt.
 */
public float getFontSizeInPt()
{code}

Thus, the behavior you observed is the documented behavior!

Nonetheless, one might wonder whether this _documented_ behavior is the 
_desired_ behavior. So you might consider changing your *bug* issue to an 
*improvement* or *wish* issue. Be aware, though, that this effectively would be 
an API change which would be unlikely to be included in a 2.x update. But maybe 
you're still in time for a 3.0 change.

That being said, though, in that case the proper improvement would be 
different: Both the existing and the proposed code only work (in their 
respective fashion) if the considered matrices only scale. As soon as 
non-trivial rotation is involved, the value returned by {{getFontSizeInPt}} can 
be any number whose absolute value is not larger than the value expected for 
the respective implementation. 

Furthermore, the page *UserUnit* value is ignored. As _The range of supported 
values shall be implementation-dependent,_ though, both the original 
implementation and your fix could claim that only the value {{1}} is 
supported... ;)

> PDFTextStripper does not use cm to infer correct font size
> --
>
> Key: PDFBOX-5420
> URL: https://issues.apache.org/jira/browse/PDFBOX-5420
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Franken
>Priority: Minor
> Attachments: TextStripperTest.kt, 
> TextStripperUsesTransformationMatrix.java, ec_2.fixed.pdf, 
> image-2022-04-23-14-46-34-929.png
>
>
> *Given*
> Given is a PDF where the cm operator is used to scale the transformation 
> matrix by a factor of 0.02834933. The font size is then set to 282 using the 
> Tf operator. 
> !image-2022-04-23-14-46-34-929.png|width=389,height=84!
>  
> *Error Description*
> When the PdfTextStripper is used to fetch the text from that pdf, the 
> internal representation of the Textpositions contains the wrong font size of 
> 282pt. The correct font size would be 10pt. The reason for this 
> miscalculation is the fact, that the PdfTextStripper does not scale the text 
> size based on the current transformation matrix. 
>  
>  *Proposed fix*
> In the file LegacyPDFStreamEngine.java that bug can be fixed in the showGlyph 
> function. There the fontSizeInPt must be calculated using the following code:
> {code:java}
> processTextPosition(new TextPosition(pageRotation, pageSize.getWidth(),
> pageSize.getHeight(), translatedTextRenderingMatrix, nextX, nextY,
> Math.abs(dyDisplay), dxDisplay,
> Math.abs(spaceWidthDisplay), unicodeMapping, new int[] { code }, font,
> fontSize,
> (int)(fontSize * textMatrix.getScalingFactorX() * 
> graphicsState.currentTransformationMatrix.scalingFactorX)));{code}
> *Further remarks*
> To easily triage the error, i attached a unit test and a sample file. The 
> sample was manually edited to remove all unnecessary data and fixed with 
> qpdf. However, i redacted only the content stream, other objects in the pdf 
> are still present, thus the pdf is pretty large. As i'm mainly programming 
> kotlin, i attached the original version of the test i used to debug that 
> issue. There is also a java version attached. 






[jira] [Commented] (PDFBOX-5411) PDFTextStripper could use text size in reconstruction

2022-04-10 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17520126#comment-17520126
 ] 

Michael Klink commented on PDFBOX-5411:
---

{quote}it could make use of glyph size to disambiguate "easy" cases like this 
one{quote}
In the example disambiguation by the glyph size would result in a better 
output. But there are other cases in which it would result in a worse result, 
e.g. in a poor man's caps/small caps emulation.

Of course, your example also offers slightly different base lines, overlapping 
actual glyph drawings, and different colors as hints. Each hint by itself would 
not suffice, all together probably would.
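
Both sides of that trade-off show up in a toy sketch of size-based grouping. {{ToyGlyph}} and {{SizeGrouping}} are hypothetical and unrelated to the actual {{PDFTextStripper}} logic; the sketch only uses a left x coordinate and a font size per glyph.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical glyph record: character, left x coordinate, font size. */
final class ToyGlyph {
    final char ch;
    final float x;
    final float size;

    ToyGlyph(char ch, float x, float size) {
        this.ch = ch;
        this.x = x;
        this.size = size;
    }
}

final class SizeGrouping {
    private SizeGrouping() {
    }

    /** Orders all glyphs by x only, which interleaves overlapping runs. */
    static String mergeByX(List<ToyGlyph> glyphs) {
        List<ToyGlyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((ToyGlyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (ToyGlyph g : sorted) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    /** Splits glyphs into one run per font size, then orders each run by x. */
    static List<String> splitBySize(List<ToyGlyph> glyphs) {
        Map<Float, List<ToyGlyph>> bySize = new LinkedHashMap<>();
        for (ToyGlyph g : glyphs) {
            bySize.computeIfAbsent(g.size, k -> new ArrayList<>()).add(g);
        }
        List<String> runs = new ArrayList<>();
        for (List<ToyGlyph> run : bySize.values()) {
            runs.add(mergeByX(run));
        }
        return runs;
    }
}
```

For two overlapping runs in different sizes this separates the texts, but a word set as a large initial followed by smaller capitals would be split into two runs as well, which is the small-caps case mentioned above.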

> PDFTextStripper could use text size in reconstruction
> -
>
> Key: PDFBOX-5411
> URL: https://issues.apache.org/jira/browse/PDFBOX-5411
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Text extraction
>Affects Versions: 2.0.25, 3.0.0 PDFBox
>Reporter: Lapo Luchini
>Priority: Minor
> Attachments: image-2022-04-08-16-13-17-334.png, textDoubleText.pdf
>
>
> When two texts are partially overlapping {{PDFTextStripper}} seems to return 
> a mix simply based on "leftmost x coordinate of the glyph", which makes 
> sense, but it could make use of glyph size to disambiguate "easy" cases like 
> this one:
> !image-2022-04-08-16-13-17-334.png!
> Currently this is the first parameter of PDFTextStripper.writeString(String 
> string, List<TextPosition> textPositions):
> {{"T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"}}
> I would of course hope for two calls:
> {{"TEST LINE"}}
> {{"051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"}}






[jira] [Commented] (PDFBOX-5407) Fields visible on click if NeedAppearances = false

2022-04-04 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17517038#comment-17517038
 ] 

Michael Klink commented on PDFBOX-5407:
---

[~msahyoun],
{quote}
This is intended. Setting {{needAppearances(false)}} will skip generating the 
visual content of the field (the appearance).
{quote}

Hmm, the JavaDocs of {{PDAcroForm.setNeedAppearances(Boolean)}} claim that 
for {{false}} PDFBox *does generate the visual contents*:

{code:java}
/**
 * Set the NeedAppearances value. If this is false, PDFBox will create appearances for all field
 * widget.
 * 
 * @param value the value for NeedAppearances
 */
public void setNeedAppearances(Boolean value)
{
    dictionary.setBoolean(COSName.NEED_APPEARANCES, value);
}
{code}

And this makes sense: setting *NeedAppearances* to *false* in the *AcroForm* 
dictionary effectively tells the next PDF processor that all the widget 
appearances are present and up to date...

> Fields visible on click if NeedAppearances = false
> --
>
> Key: PDFBOX-5407
> URL: https://issues.apache.org/jira/browse/PDFBOX-5407
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm, Rendering
>Affects Versions: 2.0.24
>Reporter: Dmitry Betanov
>Priority: Minor
>  Labels: Appearance
> Attachments: Main.java, input.pdf, invisible_fields.mov, output.pdf
>
>
> We have an issue: if we use NeedAppearances = false, some of the input 
> field values are only visible on click.
> This happens only in a few viewers, such as Safari and the macOS default PDF viewer.
> The problem is that we cannot use NeedAppearances = true, which is what similar 
> issues suggest.
> PDFBox version: 2.0.24, JDK: 11.
> The video shows an example of invisible values in the macOS default PDF viewer and 
> visible values in Chrome.






[jira] [Commented] (PDFBOX-5405) "Page tree root must be a dictionary" when attempting to parse pdf

2022-03-31 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515401#comment-17515401
 ] 

Michael Klink commented on PDFBOX-5405:
---

Indeed it's truncated, considerably so. According to its linearization 
dictionary the file should have had a size of 1886887 bytes and not merely 
491520 bytes...
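
Such a check can be sketched roughly as follows. The sketch relies on the *L* entry of the linearization parameter dictionary, which holds the length of the entire file in bytes. The class and method names are hypothetical, and using a regular expression instead of a real PDF tokenizer is a simplification made for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinearizationCheck {
    // Matches "/L 1886887" but not "/Linearized", because digits must follow.
    private static final Pattern LENGTH = Pattern.compile("/L\\s*(\\d+)");

    /** Returns the file length declared in the linearization dictionary, if one is found. */
    static OptionalLong declaredLength(byte[] head) {
        String text = new String(head, StandardCharsets.ISO_8859_1);
        int marker = text.indexOf("/Linearized");
        if (marker < 0) {
            return OptionalLong.empty();
        }
        int dictStart = text.lastIndexOf("<<", marker);
        int dictEnd = text.indexOf(">>", marker);
        if (dictStart < 0 || dictEnd < 0) {
            return OptionalLong.empty();
        }
        Matcher matcher = LENGTH.matcher(text.substring(dictStart, dictEnd));
        return matcher.find()
                ? OptionalLong.of(Long.parseLong(matcher.group(1)))
                : OptionalLong.empty();
    }

    /** True if the file is shorter than its linearization dictionary claims. */
    static boolean isTruncated(byte[] head, long actualSize) {
        OptionalLong declared = declaredLength(head);
        return declared.isPresent() && actualSize < declared.getAsLong();
    }
}
```

Passing the first kilobyte or so of the file together with the file size is enough here, since the linearization dictionary sits at the very start of a linearized file.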

> "Page tree root must be a dictionary" when attempting to parse pdf 
> ---
>
> Key: PDFBOX-5405
> URL: https://issues.apache.org/jira/browse/PDFBOX-5405
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.25
>Reporter: Johannes Wirkkala Westlund
>Priority: Minor
> Attachments: Grafiska riktlinjer, fordon LRV.pdf
>
>
> Hi,
> I have a PDF file that throws the following error when I try to parse it:
> {code:java}
> Caused by: java.io.IOException: Page tree root must be a dictionary
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1228)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1202)
>     at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:191)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:149)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
>     ... 5 more {code}
> I have attached the file in question with this issue.
> Might be related to PDFBOX-4915






[jira] [Commented] (PDFBOX-5401) A carefully crafted pdf can trigger an infinite loop while parsing

2022-03-25 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512489#comment-17512489
 ] 

Michael Klink commented on PDFBOX-5401:
---

Indeed, there is an inconsistency in 
{{org.apache.pdfbox.pdfparser.COSParser.parseXref(long)}}: the {{prevSet}} 
stores the _actual_ start positions of the cross reference tables, but the 
tests to prevent recursion check the _claimed_ start positions 
(the *Prev* values).
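
Why this mismatch defeats loop detection can be shown with a small model. This is not the PDFBox code; {{Section}}, {{walk}}, and the {{lookup}} map (which stands in for the parser correcting a claimed offset to the position where a table is actually found) are hypothetical. The {{maxSteps}} limit exists only so that the sketch terminates.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class XrefChainModel {
    /** Hypothetical xref section: where it really starts and which Prev value it claims. */
    static final class Section {
        final long actualStart;
        final long claimedPrev; // a value <= 0 means there is no Prev entry

        Section(long actualStart, long claimedPrev) {
            this.actualStart = actualStart;
            this.claimedPrev = claimedPrev;
        }
    }

    /**
     * Follows the Prev chain. Actual start positions are always recorded in prevSet.
     * If checkActual is false, the claimed offset is tested against prevSet (the
     * inconsistent variant); if true, the actual start position is tested.
     * Returns the number of sections visited, or -1 if maxSteps was reached.
     */
    static int walk(Map<Long, Section> lookup, long startClaimed, boolean checkActual, int maxSteps) {
        Set<Long> prevSet = new HashSet<>();
        long claimed = startClaimed;
        int visited = 0;
        while (claimed > 0) {
            Section section = lookup.get(claimed);
            if (section == null) {
                break; // nothing found at or near the claimed offset
            }
            long tested = checkActual ? section.actualStart : claimed;
            if (prevSet.contains(tested)) {
                break; // loop detected
            }
            if (visited >= maxSteps) {
                return -1; // stand-in for "would loop forever"
            }
            prevSet.add(section.actualStart);
            visited++;
            claimed = section.claimedPrev;
        }
        return visited;
    }
}
```

If two sections point at each other with *Prev* values that are each off by a few bytes, the claimed offsets never appear in {{prevSet}}, so the inconsistent variant never stops, while the variant that tests actual positions stops after two sections.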

> A carefully crafted pdf can trigger an infinite loop while parsing
> --
>
> Key: PDFBOX-5401
> URL: https://issues.apache.org/jira/browse/PDFBOX-5401
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing, PDModel
>Affects Versions: 3.0.0 PDFBox
> Environment: Mac OS 12.1 & Ubuntu Linux 16.04 (4.15.0-163-generic)
>Reporter: Xiaohan Zhang
>Priority: Major
> Attachments: verified.zip
>
>
> Hi, I found a crafted PDF that can trigger an infinite loop while parsing 
> with PDFBox. I have tested on the latest commit of PDFBox on GitHub.
>  
> This bug can be triggered by the following code.
> ```
> File ff = new File("path/to/the/sample");
> PDDocument document = Loader.loadPDF(ff);
> ```
>  
> I found that the root cause of this infinite loop resides in the while loop 
> at line 321 of [COSParser.java|#L321]. When parsing the provided PDF files, 
> the variable $prev is never changed during this loop.






[jira] [Commented] (PDFBOX-5398) Parsing fails in 2.0.26 that worked in 2.0.25

2022-03-24 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511722#comment-17511722
 ] 

Michael Klink commented on PDFBOX-5398:
---

Yes, that file is _kaputt_. It suffices to make sure that PDFBox neither 
hangs nor kills the VM when processing it. An exception during parsing is 
completely appropriate. I would prefer a declared (checked) one, though, not a 
RuntimeException or Error.
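
One way to turn unexpected unchecked failures into a declared exception is a thin wrapper around the parsing step. The following is a generic sketch and not how PDFBox handles this; {{DeclaredFailures}}, {{ParseStep}}, and {{guarded}} are hypothetical names, and catching {{StackOverflowError}} alongside {{RuntimeException}} is a design choice made for this illustration.

```java
import java.io.IOException;

final class DeclaredFailures {
    /** Hypothetical parsing step that may fail with a declared IOException. */
    @FunctionalInterface
    interface ParseStep<T> {
        T run() throws IOException;
    }

    /**
     * Runs a parsing step and converts unchecked failures caused by broken input
     * into an IOException, so that callers only have to handle the declared type.
     */
    static <T> T guarded(ParseStep<T> step) throws IOException {
        try {
            return step.run();
        } catch (RuntimeException | StackOverflowError e) {
            throw new IOException("Malformed input: " + e, e);
        }
    }
}
```

The original failure stays available as the cause of the {{IOException}}, so no diagnostic information is lost.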

> Parsing fails in 2.0.26 that worked in 2.0.25
> -
>
> Key: PDFBOX-5398
> URL: https://issues.apache.org/jira/browse/PDFBOX-5398
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.26, 3.0.0 PDFBox
>Reporter: Tilman Hausherr
>Assignee: Andreas Lehmkühler
>Priority: Major
>  Labels: regression
> Attachments: 077867.pdf, 392443.pdf, 
> crash-024bde7e01045bb3a6ab9d86b13cf411bc35.pdf
>
>
> {noformat}
> März 23, 2022 4:14:13 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSDictionaryNameValuePair
> WARNUNG: Empty COSName at offset 12313
> Exception in thread "main" java.io.IOException: Unknown dir object c='>' cInt=62 peek='>' peekInt=62 at offset 12326 (start offset: 12326)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:928)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:303)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:228)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:872)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:303)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:228)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:872)
>     at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:916)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:883)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:796)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:756)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
> {noformat}
> The cause is not PDFBOX-5283.






[jira] [Commented] (PDFBOX-5398) Parsing fails in 2.0.26 that worked in 2.0.25

2022-03-23 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511263#comment-17511263
 ] 

Michael Klink commented on PDFBOX-5398:
---

Indeed, the approach to _stop reading dictionaries containing empty COSName 
entries, most likely they are broken_ can mean ignoring perfectly correct 
dictionaries.

> Parsing fails in 2.0.26 that worked in 2.0.25
> -
>
> Key: PDFBOX-5398
> URL: https://issues.apache.org/jira/browse/PDFBOX-5398
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.26, 3.0.0 PDFBox
>Reporter: Tilman Hausherr
>Assignee: Andreas Lehmkühler
>Priority: Major
>  Labels: regression
> Attachments: 077867.pdf, 392443.pdf
>
>
> {noformat}
> März 23, 2022 4:14:13 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSDictionaryNameValuePair
> WARNUNG: Empty COSName at offset 12313
> Exception in thread "main" java.io.IOException: Unknown dir object c='>' cInt=62 peek='>' peekInt=62 at offset 12326 (start offset: 12326)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:928)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:303)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:228)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:872)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:303)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:228)
>     at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:872)
>     at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:916)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:883)
>     at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:796)
>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:756)
>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
> {noformat}
> The cause is not PDFBOX-5283.






[jira] [Commented] (PDFBOX-5400) Page tree root must be a dictionary

2022-03-23 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511261#comment-17511261
 ] 

Michael Klink commented on PDFBOX-5400:
---

(PDF with broken cross reference table. PDFBox used to be able to repair. Might 
be related to changes from PDFBOX-5283.)

> Page tree root must be a dictionary
> ---
>
> Key: PDFBOX-5400
> URL: https://issues.apache.org/jira/browse/PDFBOX-5400
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.26
>Reporter: Tilman Hausherr
>Priority: Major
>  Labels: regression
> Attachments: 4ECBGZDM5GUZG7UT75RV5GTUFWF5TSXK.pdf
>
>
> worked in 2.0.25
> {noformat}
> Caused by: java.io.IOException: Page tree root must be a dictionary
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:198)
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1107)
> {noformat}






[jira] [Comment Edited] (PDFBOX-5400) Page tree root must be a dictionary

2022-03-23 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511261#comment-17511261
 ] 

Michael Klink edited comment on PDFBOX-5400 at 3/23/22, 1:38 PM:
-

(PDF with broken cross reference table. PDFBox used to be able to repair. Might 
be related to changes from PDFBOX-5283.)


was (Author: mkl):
(PDF with broken cross reference table. PDFBox used to be able to repair. Might 
be related to changes from to PDFBOX-5283.)







[jira] [Commented] (PDFBOX-5391) showGlyph override not working

2022-03-18 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17508737#comment-17508737
 ] 

Michael Klink commented on PDFBOX-5391:
---

In PDFBox 3 there is only {{showGlyph(Matrix textRenderingMatrix, PDFont font, 
int code, Vector displacement)}}. The {{String unicode}} parameter was 
removed from trunk two years ago.

If _drilling_ tells you something different, you probably have multiple PDFBox 
versions among the classes you use for drilling, and you actually end up in some 
2.x version.
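
A quick way to check which version is actually in use is to ask the JVM where a class was loaded from and which signatures it sees. The helper below is a hypothetical sketch that uses only standard JDK reflection; {{ClassOrigin}}, {{locationOf}}, and {{signaturesOf}} are names invented for this example.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.List;

final class ClassOrigin {
    /** Returns the location (usually a jar) a class was loaded from, or a marker if unknown. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // Classes from the JDK runtime itself have no code source.
        return source == null ? "(no code source)" : String.valueOf(source.getLocation());
    }

    /** Lists the signatures of all methods with the given name declared in the class or its superclasses. */
    static List<String> signaturesOf(Class<?> type, String methodName) {
        List<String> result = new ArrayList<>();
        for (Class<?> current = type; current != null; current = current.getSuperclass()) {
            for (Method method : current.getDeclaredMethods()) {
                if (method.getName().equals(methodName)) {
                    result.add(method.toString());
                }
            }
        }
        return result;
    }
}
```

Calling these two methods with the stripper class being extended and the name {{"showGlyph"}} would show which jar is really on the classpath and which parameter list the compiler or runtime sees.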

> showGlyph override not working
> --
>
> Key: PDFBOX-5391
> URL: https://issues.apache.org/jira/browse/PDFBOX-5391
> Project: PDFBox
>  Issue Type: Wish
>Affects Versions: 3.0.0 PDFBox
>Reporter: Benjamin
>Priority: Major
>
> I use the showGlyph function by extending 
> org.apache.pdfbox.text.PDFTextStripperByArea, when testing using the 
> 3.0.0-RC1 this was not working, the error is that it does not override or 
> implement a method from a supertype, because the pdfbox I am using has a 
> different showGlyph signature.
> The signature I am using, as per
> https://pdfbox.apache.org/docs/2.0.1/javadocs/org/apache/pdfbox/text/PDFTextStripperByArea.html#showGlyph(org.apache.pdfbox.util.Matrix,%20org.apache.pdfbox.pdmodel.font.PDFont,%20int,%20java.lang.String,%20org.apache.pdfbox.util.Vector)
>  
> showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, 
> Vector displacement)
> the signature in 3 that is coming through with the error is:
> showGlyph(Matrix textRenderingMatrix, PDFont font, int code, Vector 
> displacement)
> When I drill into the super class, this is not the signature that I see; I see 
> the same one as in 2. But it is the signature that I am getting an error from 
> when parsing using 17.0.2 openjdk. I can't explain this but can't get 
> through it either.
> To test, try to override the showGlyph by extending 
> org.apache.pdfbox.text.PDFTextStripperByArea;






[jira] [Commented] (PDFBOX-5386) Wrong image generated

2022-03-11 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505033#comment-17505033
 ] 

Michael Klink commented on PDFBOX-5386:
---

Actually, the appearance of the screenshot with the error reminds me of images 
with transmission errors, in particular the sudden switch of colors in the 
middle of a line followed by a shift of the image content.

> Wrong image generated
> -
>
> Key: PDFBOX-5386
> URL: https://issues.apache.org/jira/browse/PDFBOX-5386
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.25
>Reporter: med medin
>Priority: Major
> Attachments: Screenshot_20220310_223538.png, cover.pdf, 
> screenshot-1.png
>
>
> Only the cover of the PDF, which is yellow, is wrongly converted to a 
> BufferedImage
> !Screenshot_20220310_223538.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org


