[jira] [Comment Edited] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882327#comment-17882327
 ] 

Tilman Hausherr edited comment on PDFBOX-5879 at 9/17/24 9:08 AM:
--

I added a simple test for the rotationMagic feature because it turns out we 
didn't have any. However, this isn't a test of the fixed bug; creating a file 
for that would have been more difficult, and there is no risk that this fix 
gets reverted anyway.


was (Author: tilman):
I added a simple test for the feature because it turns out we didn't have any. 
However this isn't a test of the fixed bug, that would have been more difficult 
to create a file, and there is no risk that this fix gets reverted anyway.

> Regression from PDFBOX-5841: Text extraction with rotation magic fails for 
> PDF with multiple content streams in a page
> --
>
>              Key: PDFBOX-5879
>              URL: https://issues.apache.org/jira/browse/PDFBOX-5879
>          Project: PDFBox
>       Issue Type: Bug
>       Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
>         Reporter: Gábor Stefanik
>         Assignee: Tilman Hausherr
>         Priority: Major
>          Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>      Attachments: MVM_Aram_augusztus.pdf
>
>
> {code:java}
> java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic -i="MVM_Aram_augusztus.pdf"
> {code}
> fails with the following error:
> {code:java}
> java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
>         at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>         at picocli.CommandLine.access$1500(CommandLine.java:148)
>         at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>         at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>         at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>         at picocli.CommandLine.execute(CommandLine.java:2174)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
> {code}
> The same command succeeds in 3.0.2.
> The triggering PDF can be downloaded from 
> [https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf]
> and is also attached.
> The root cause appears to be this change: 
> [https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2]
> from PDFBOX-5841
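
The exception in the report is the kind that appears when an entry is stored as an indirect reference but the code casts it as if it were the array itself. The sketch below only illustrates that general failure mode and a defensive alternative. It is not the PDFBox code or the committed fix: Base, ArrayValue, IndirectRef, unsafeAsArray and safeAsArray are hypothetical stand-ins invented for this example, and the assumption that the page's content entry was an indirect reference to an array is a reading of the stack trace, not something stated in the issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT PDFBox classes.
final class IndirectCastSketch {

    // Common supertype of every value in this toy object model.
    interface Base {}

    // A direct array value, e.g. a list of content streams.
    static final class ArrayValue implements Base {
        final List<String> items = new ArrayList<>();
    }

    // An indirect reference that points at another value.
    static final class IndirectRef implements Base {
        private final Base target;

        IndirectRef(Base target) {
            this.target = target;
        }

        Base resolve() {
            return target;
        }
    }

    // Unchecked cast: works only while the entry is stored directly.
    static ArrayValue unsafeAsArray(Base entry) {
        return (ArrayValue) entry;
    }

    // Defensive lookup: follow one level of indirection, then type-check.
    // Returns null when the entry is not an array.
    static ArrayValue safeAsArray(Base entry) {
        Base resolved = entry instanceof IndirectRef
                ? ((IndirectRef) entry).resolve()
                : entry;
        return resolved instanceof ArrayValue ? (ArrayValue) resolved : null;
    }

    public static void main(String[] args) {
        ArrayValue streams = new ArrayValue();
        streams.items.add("stream-1");
        streams.items.add("stream-2");

        // Assumed situation: the array is reached through an indirect reference.
        Base entry = new IndirectRef(streams);

        try {
            unsafeAsArray(entry);
            System.out.println("unchecked cast succeeded");
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed with ClassCastException");
        }

        ArrayValue resolved = safeAsArray(entry);
        System.out.println("defensive lookup found "
                + (resolved == null ? 0 : resolved.items.size()) + " entries");
    }
}
```

The defensive variant costs one extra instanceof check and keeps working whether the entry is stored directly or behind a reference, which is why an unchecked cast of this kind can stay unnoticed until some other change alters how the entry is stored.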



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882326#comment-17882326
 ] 

ASF subversion and git services commented on PDFBOX-5879:
-

Commit 1920739 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920739 ]

PDFBOX-5879: remove test message




[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882325#comment-17882325
 ] 

ASF subversion and git services commented on PDFBOX-5879:
-

Commit 1920738 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920738 ]

PDFBOX-5879: remove test message




[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882324#comment-17882324
 ] 

ASF subversion and git services commented on PDFBOX-5879:
-

Commit 1920737 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920737 ]

PDFBOX-5879: add test for rotationMagic




[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882322#comment-17882322
 ] 

ASF subversion and git services commented on PDFBOX-5879:
-

Commit 1920736 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920736 ]

PDFBOX-5879: add test for rotationMagic




[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882318#comment-17882318
 ] 

ASF subversion and git services commented on PDFBOX-5879:
-

Commit 1920735 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920735 ]

PDFBOX-5879: add test for rotationMagic




[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882299#comment-17882299
 ] 

ASF subversion and git services commented on PDFBOX-5879:
-

Commit 1920732 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920732 ]

PDFBOX-5879: remove unused import




[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882298#comment-17882298
 ] 

ASF subversion and git services commented on PDFBOX-5879:
-

Commit 1920731 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920731 ]

PDFBOX-5879: remove unused import




[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882297#comment-17882297
 ] 

ASF subversion and git services commented on PDFBOX-5879:
-

Commit 1920730 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920730 ]

PDFBOX-5879: remove unused import




[jira] [Resolved] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5879.
-
Fix Version/s: 2.0.33
               3.0.4 PDFBox
               4.0.0
     Assignee: Tilman Hausherr
   Resolution: Fixed

Thank you. The commit itself is not at fault; it exposed poor programming 
that was already there.




[jira] [Updated] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5879:

Affects Version/s: 2.0.32

> Regression from PDFBOX-5841: Text extraction with rotation magic fails for 
> PDF with multiple content streams in a page
> --
>
> Key: PDFBOX-5879
> URL: https://issues.apache.org/jira/browse/PDFBOX-5879
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Gábor Stefanik
>Priority: Major
> Attachments: MVM_Aram_augusztus.pdf
>
>
> {code:java}
> java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic 
> -i="MVM_Aram_augusztus.pdf" {code}
> fails with the following error:
> {code:java}
> java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be 
> cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject 
> and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
>         at 
> org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>         at picocli.CommandLine.access$1500(CommandLine.java:148)
>         at 
> picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>         at 
> picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>         at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>         at picocli.CommandLine.execute(CommandLine.java:2174)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76) {code}
> The same command succeeds in 3.0.2.
> The triggering PDF can be downloaded from 
> [https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf,]
>  and is also attached.
> The root cause appears to be this change: 
> [https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2]
>  from PDFBOX-5841
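
The exception above means that a value was cast to an array type while it was
still an indirect reference (COSObject) that had not been resolved. The
following is a minimal sketch of that failure mode and of a defensive lookup.
It makes no claim about what the actual fix looks like: Node, StreamNode,
ArrayNode, RefNode and contentStreams() are hypothetical stand-ins invented for
illustration, not PDFBox classes or methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-ins for PDF object types; these are NOT PDFBox classes. */
interface Node {}

/** A single content stream, identified here only by a label. */
final class StreamNode implements Node {
    final String label;
    StreamNode(String label) { this.label = label; }
}

/** An array of objects, e.g. the several content streams of one page. */
final class ArrayNode implements Node {
    final List<Node> items;
    ArrayNode(Node... items) { this.items = Arrays.asList(items); }
}

/** An indirect reference; it must be resolved before its target is used. */
final class RefNode implements Node {
    final Node target;
    RefNode(Node target) { this.target = target; }
}

final class ContentsLookupSketch {
    /** Follows indirect references until a direct object is reached. */
    static Node resolve(Node node) {
        while (node instanceof RefNode) {
            node = ((RefNode) node).target;
        }
        return node;
    }

    /** Collects content streams without assuming the shape of the entry. */
    static List<StreamNode> contentStreams(Node contents) {
        List<StreamNode> result = new ArrayList<>();
        Node direct = resolve(contents);
        if (direct instanceof StreamNode) {
            // Single content stream.
            result.add((StreamNode) direct);
        } else if (direct instanceof ArrayNode) {
            // Multiple content streams; each element may itself be indirect.
            for (Node item : ((ArrayNode) direct).items) {
                Node inner = resolve(item);
                if (inner instanceof StreamNode) {
                    result.add((StreamNode) inner);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // An indirect reference to an array of two streams.
        Node contents = new RefNode(
                new ArrayNode(new RefNode(new StreamNode("a")), new StreamNode("b")));
        // A blind cast such as (ArrayNode) contents would throw ClassCastException.
        System.out.println(contentStreams(contents).size());
    }
}
```

The point of the sketch is the order of operations: resolve first, then check
the type with instanceof, and only then cast.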



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882284#comment-17882284
 ] 

ASF subversion and git services commented on PDFBOX-5879:
-

Commit 1920729 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920729 ]

PDFBOX-5879: avoid ClassCastException




[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882282#comment-17882282
 ] 

ASF subversion and git services commented on PDFBOX-5879:
-

Commit 1920728 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920728 ]

PDFBOX-5879: avoid ClassCastException




[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882281#comment-17882281
 ] 

ASF subversion and git services commented on PDFBOX-5879:
-

Commit 1920727 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920727 ]

PDFBOX-5879: avoid ClassCastException




[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882260#comment-17882260
 ] 

ASF subversion and git services commented on PDFBOX-5852:
-

Commit 1920726 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920726 ]

PDFBOX-5852: deprecate IntPoint

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 3.0.3 PDFBox, 4.0.0
>
> Attachments: minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to 
> images when the PDF contains type 4 shading.  This is especially noticeable 
> when the conversion is done with a high DPI.  Can this be improved?
>  
> Conversation from the PDFBox users mailing list follows
> Initial email:
> {quote}
> Hi CPU and memory usage when converting a PDF with type 4 shading
> Hello PDFBox users and maintainers,
> We have a PDF that causes performance problems when we use PDFBox to
> convert it to an image with renderImageWithDPI().  We're calling
> renderImageWithDPI()
> with 650 DPI.  I realize this is a very high value - we're using it for
> high fidelity original images that will later be downsampled.  On my work
> laptop which has fairly strong hardware, the conversion takes 25 minutes
> and consumes 20GB of memory.  CPU and memory usage is reduced if we use a
> lower DPI.
> The PDF is 1 page long.  It contains type 4 shading / Gouraud free form
> triangle meshes.  We've been aware of some performance issues with type 4
> shading for a little while now, but the PDFs that contained the type 4
> shading belonged to our customers and we were not authorized to share
> them.  We finally found a problem input document that is non-sensitive and
> that we are authorized to share.  I've attached a copy of the problem PDF
> to this email.
> I searched the archives for the users and the developers mailing list and I
> didn't find anything specifically about this issue.
> I searched through the PDFBox jira tickets and I found a couple of tickets
> that looked similar: PDFBOX-2901 & PDFBOX-4491.  PDFBOX-2901 seems to most
> closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
> and our issue still reproduces with PDFBox 2.0.28.
> Should I refer this issue over to the developers mailing list or create a
> PDFBox Jira ticket for this?
> Thanks and Regards,
> Larry Lynn {quote}
> Response:
> {quote}
> Hi,
> Yes shading can be very slow, especially at high dpi. The attachment 
> didn't get through, please upload to a sharehoster or create a ticket. 
> If you need to register then add a meaningful text, e.g. the subject of 
> this post so we know you're not a spammer. Also retry with 2.0.31 and 
> 3.0.2 just to be sure. However I'm pessimistic that this can be fixed.
> Tilman {quote}
>  
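
For a sense of scale, the arithmetic behind rendering at 650 DPI can be
sketched as follows. The page size of 612 x 792 points (US Letter) and the 4
bytes per pixel are assumptions, since the ticket states neither; RasterEstimate
is a hypothetical helper, and rounding to the nearest pixel is a simplification.
Under those assumptions the output raster alone has about 39.5 million pixels
(about 158 MB), and any per-pixel bookkeeping grows with the square of the DPI.

```java
/** Back-of-the-envelope raster size estimate; not part of PDFBox. */
final class RasterEstimate {
    /** PDF user space uses 72 points per inch by default. */
    static final double POINTS_PER_INCH = 72.0;

    /** Number of pixels needed to render a page of the given size at the given DPI. */
    static long pixels(double widthPt, double heightPt, double dpi) {
        long widthPx = Math.round(widthPt * dpi / POINTS_PER_INCH);
        long heightPx = Math.round(heightPt * dpi / POINTS_PER_INCH);
        return widthPx * heightPx;
    }

    /** Raw size of an uncompressed raster with the given pixel depth. */
    static long bytes(long pixelCount, int bytesPerPixel) {
        return pixelCount * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 612 x 792 points.
        long px = pixels(612, 792, 650);
        System.out.println(px + " pixels, " + bytes(px, 4) + " bytes at 4 bytes per pixel");
    }
}
```

Doubling the DPI quadruples the pixel count, which is consistent with the
report that CPU and memory usage drop at a lower DPI.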






[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882258#comment-17882258
 ] 

ASF subversion and git services commented on PDFBOX-5852:
-

Commit 1920725 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920725 ]

PDFBOX-5852: replace IntPoint with Point




[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882256#comment-17882256
 ] 

ASF subversion and git services commented on PDFBOX-5852:
-

Commit 1920724 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920724 ]

PDFBOX-5852: replace Map with a two-dimensional array




[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-16 Thread Jira


[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882254#comment-17882254
 ] 

Andreas Lehmkühler commented on PDFBOX-5852:


[~tilman] thanks for the fast feedback. BTW, you gave a valuable hint yourself 
where to look for a possible optimization. :-)

I'm going to add the changes to the other branches as well.




[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882252#comment-17882252
 ] 

ASF subversion and git services commented on PDFBOX-5852:
-

Commit 1920723 from le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920723 ]

PDFBOX-5852: remove no longer needed class




[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-16 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882240#comment-17882240
 ] 

Tilman Hausherr commented on PDFBOX-5852:
-

Wow!

No regressions.



[jira] [Comment Edited] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-16 Thread Jira


[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882144#comment-17882144
 ] 

Andreas Lehmkühler edited comment on PDFBOX-5852 at 9/16/24 5:59 PM:
-

I've replaced the map for the pre-calculated color values with a 
two-dimensional array. It speeds up rendering by a factor of 2 at 100%, a 
factor of 5 at 200% and a factor of 10 at 400% using the pdf debugger.

I guess there is some positive effect concerning the memory consumption as 
well, but it is not as clear as the effect on the rendering time.

[~tilman] please run some rendering tests to be sure there aren't any 
regressions. I'm going to backport the changes to 3.0.x and 2.0.x if everything 
works as expected.
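
To illustrate the kind of change described above, here is a minimal sketch of
a table of pre-calculated pixel colors backed by a two-dimensional array
instead of a map keyed by point objects. ColorGrid and its methods are
hypothetical and do not reflect the actual PDFBox shading code; the sketch only
shows why plain array indexing avoids allocating and hashing a key per pixel.

```java
/** Hypothetical color table for a rectangular pixel area; not PDFBox code. */
final class ColorGrid {
    /**
     * Marker for pixels without a pre-calculated color. Assumption of this
     * sketch: fully transparent black (0) never occurs as a real value.
     */
    static final int UNSET = 0;

    private final int minX;
    private final int minY;
    private final int width;
    private final int height;
    private final int[][] argb; // indexed as [row][column], relative to (minX, minY)

    ColorGrid(int minX, int minY, int width, int height) {
        this.minX = minX;
        this.minY = minY;
        this.width = width;
        this.height = height;
        this.argb = new int[height][width]; // zero-filled, so every cell starts as UNSET
    }

    /** Stores a color; coordinates outside the area are ignored. */
    void put(int x, int y, int color) {
        if (contains(x, y)) {
            argb[y - minY][x - minX] = color;
        }
    }

    /** Returns the stored color, or UNSET if none was stored or the point is outside. */
    int get(int x, int y) {
        return contains(x, y) ? argb[y - minY][x - minX] : UNSET;
    }

    private boolean contains(int x, int y) {
        return x >= minX && x < minX + width && y >= minY && y < minY + height;
    }
}
```

The trade-off is that the array always covers the whole bounding rectangle,
whereas a map only stores the points that were actually put into it.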


was (Author: lehmi):
I've replace the Map for the pre-calculated color values with a two-dimensional 
array. It speeds up rendering by a factor 2 at 100%, a factor of 5 at 200% and 
a factor of 10 at 400% using the pdf debugger.

I guess there is some positive effect concerning the memory consumption as 
well, but not as clearly as the effect on the rendering time

[~tilman] please run some rendering tests to be sure there aren't any 
regressions. I'm going to backport the changes to 3.0.x and 2.0.x if everything 
works as expected

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
>     URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 3.0.3 PDFBox, 4.0.0
>
> Attachments: minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to 
> images when the PDF contains type 4 shading.  This is especially noticeable 
> when the conversion is done with a high DPI.  Can this be improved?
>  
> Conversation from the PDFBox users mailing list follows
> Initial email:
> {quote}
> Hi CPU and memory usage when converting a PDF with type 4 shadingHello PDFBox 
> users and maintainers,
> We have a PDF that causes performance problems when we use PDFBox to
> convert it to an image with renderImageWithDPI().  We're calling
> renderImageWithDPI()
> with 650 DPI.  I realize this is a very high value - we're using it for
> high fidelity original images that will later be downsampled.  On my work
> laptop which has fairly strong hardware, the conversion takes 25 minutes
> and consumes 20GB of memory.  CPU and memory usage is reduced if we use a
> lower DPI.
> The PDF is 1 page long.  It contains type 4 shading / Gouraud free form
> triangle meshes.  We've been aware of some performance issues with type 4
> shading for a little while now, but the PDFs that contained the type 4
> shading belonged to our customers and we were not authorized to share
> them.  We finally found a problem input document that is non-sensitive and
> that we are authorized to share.  I've attached a copy of the problem PDF
> to this email.
> I searched the archives for the users and the developers mailing list and I
> didn't find anything specifically about this issue.
> I searched through the PDFBox jira tickets and I found a couple of tickets
> that looked similar: PDFBOX-2901 & PDFBOX-4491.  PDFBOX-2901 seems to most
> closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
> and our issue still reproduces with PDFBox 2.0.28.
> Should I refer this issue over to the developers mailing list or create a
> PDFBox Jira ticket for this?
> Thanks and Regards,
> Larry Lynn {quote}
> Response:
> {quote}
> Hi,
> Yes shading can be very slow, especially at high dpi. The attachment 
> didn't get through, please upload to a sharehoster or create a ticket. 
> If you need to register then add a meaningful text, e.g. the subject of 
> this post so we know you're not a spammer. Also retry with 2.0.31 and 
> 3.0.2 just to be sure. However I'm pessimistic that this can be fixed.
> Tilman {quote}
>  
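
For scale, the output raster itself can be estimated from the page size and the DPI. The sketch below is only a back-of-the-envelope calculation: the page size (US Letter, 612 x 792 points) and the 4 bytes per pixel are assumptions, since the report states neither, and the class and method names are made up for illustration.

```java
// Back-of-the-envelope estimate; page size and bytes per pixel are assumptions.
class RasterSizeSketch
{
    // PDF user space has 72 units per inch by default, so points / 72 * dpi = pixels.
    static long pixels(double points, double dpi)
    {
        return (long) Math.ceil(points / 72.0 * dpi);
    }

    // Size in bytes of a raster that stores 4 bytes per pixel.
    static long rasterBytes(double widthPt, double heightPt, double dpi)
    {
        return pixels(widthPt, dpi) * pixels(heightPt, dpi) * 4L;
    }

    public static void main(String[] args)
    {
        // Assumed US Letter page rendered at the 650 DPI mentioned in the report.
        long bytes = rasterBytes(612, 792, 650);
        System.out.println(bytes + " bytes");
    }
}
```

Under these assumptions the image is 5525 x 7150 pixels, roughly 150 MiB. That is far below the reported 20GB, which suggests that most of the memory goes to intermediate data rather than to the rendered image.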



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-16 Thread Jira


[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882144#comment-17882144
 ] 

Andreas Lehmkühler commented on PDFBOX-5852:


I've replaced the Map for the pre-calculated color values with a two-dimensional 
array. It speeds up rendering by a factor of 2 at 100%, a factor of 5 at 200% and 
a factor of 10 at 400% using the pdf debugger.

I guess there is some positive effect concerning the memory consumption as 
well, but not as clearly as the effect on the rendering time.

[~tilman] please run some rendering tests to be sure there aren't any 
regressions. I'm going to backport the changes to 3.0.x and 2.0.x if everything 
works as expected.
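
For readers who have not looked at the commit, the general technique can be sketched as follows. This is not the PDFBox code from r1920718. The class and method names (ColorTableSketch, fillMap, fillArray) are hypothetical and colorAt() only produces placeholder values. The sketch merely shows why a dense int[][] indexed by pixel coordinates avoids the per-pixel key objects, boxing and hashing of a Map.

```java
import java.awt.Point;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not taken from the PDFBox sources.
class ColorTableSketch
{
    // Placeholder for a pre-calculated color value (packed RGB) of one pixel.
    static int colorAt(int x, int y)
    {
        return ((x & 0xFF) << 16) | ((y & 0xFF) << 8);
    }

    // Map-based table: allocates one Point key and one boxed Integer per pixel,
    // and every lookup has to hash the key.
    static Map<Point, Integer> fillMap(int width, int height)
    {
        Map<Point, Integer> table = new HashMap<>();
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                table.put(new Point(x, y), colorAt(x, y));
            }
        }
        return table;
    }

    // Array-based table: one primitive int per pixel, addressed directly by [y][x].
    static int[][] fillArray(int width, int height)
    {
        int[][] table = new int[height][width];
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                table[y][x] = colorAt(x, y);
            }
        }
        return table;
    }

    public static void main(String[] args)
    {
        Map<Point, Integer> map = fillMap(4, 3);
        int[][] array = fillArray(4, 3);
        // Both tables hold the same value for the same pixel.
        System.out.println(map.get(new Point(2, 1)) == array[1][2]);
    }
}
```

One trade-off of the array form is that it covers the whole bounding box, so pixels without a value need a sentinel entry instead of simply being absent.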

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
>     URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 3.0.3 PDFBox, 4.0.0
>
> Attachments: minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to 
> images when the PDF contains type 4 shading.  This is especially noticeable 
> when the conversion is done with a high DPI.  Can this be improved?
>  






[jira] [Updated] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-16 Thread Jira


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-5852:
---
Fix Version/s: 3.0.3 PDFBox

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 3.0.3 PDFBox, 4.0.0
>
> Attachments: minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to 
> images when the PDF contains type 4 shading.  This is especially noticeable 
> when the conversion is done with a high DPI.  Can this be improved?
>  






[jira] [Updated] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-16 Thread Jira


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-5852:
---
Fix Version/s: 4.0.0

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to 
> images when the PDF contains type 4 shading.  This is especially noticeable 
> when the conversion is done with a high DPI.  Can this be improved?
>  






[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882140#comment-17882140
 ] 

ASF subversion and git services commented on PDFBOX-5852:
-

Commit 1920718 from le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920718 ]

PDFBOX-5852: replace Map with a two-dimensional array

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Attachments: minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to 
> images when the PDF contains type 4 shading.  This is especially noticeable 
> when the conversion is done with a high DPI.  Can this be improved?
>  






[jira] [Created] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page

2024-09-16 Thread Jira
Gábor Stefanik created PDFBOX-5879:
--

 Summary: Regression from PDFBOX-5841: Text extraction with 
rotation magic fails for PDF with multiple content streams in a page
 Key: PDFBOX-5879
 URL: https://issues.apache.org/jira/browse/PDFBOX-5879
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 3.0.3 PDFBox
Reporter: Gábor Stefanik
 Attachments: MVM_Aram_augusztus.pdf

{code:java}
java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic 
-i="MVM_Aram_augusztus.pdf" {code}
fails with the following error:
{code:java}
java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be 
cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject 
and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
        at 
org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
        at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
        at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
        at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
        at picocli.CommandLine.access$1500(CommandLine.java:148)
        at 
picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
        at 
picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
        at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
        at picocli.CommandLine.execute(CommandLine.java:2174)
        at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76) {code}
The same command succeeds in 3.0.2.

The triggering PDF can be downloaded from 
[https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf]
and is also attached.

The root cause appears to be this change: 
[https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2]
 from PDFBOX-5841
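
The exception message suggests that a value was cast to COSArray while it was still an indirect reference (COSObject). A page's /Contents entry may be a single stream or an array of streams, and either form may be stored indirectly, so a reference has to be resolved before its type is checked. The sketch below models that rule with small stand-in types. CosNode, ContentStream, Arr and Ref are hypothetical and are not the PDFBox COS classes, and this is not the fix that was committed for this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration; these are NOT the PDFBox COS classes.
interface CosNode { }

final class ContentStream implements CosNode
{
    final String data;
    ContentStream(String data) { this.data = data; }
}

final class Arr implements CosNode
{
    final List<CosNode> items = new ArrayList<>();
    Arr add(CosNode n) { items.add(n); return this; }
}

// Models an indirect reference that must be resolved before use.
final class Ref implements CosNode
{
    final CosNode target;
    Ref(CosNode target) { this.target = target; }
}

class ContentsSketch
{
    // Follow indirect references until a direct object is reached.
    static CosNode resolve(CosNode n)
    {
        while (n instanceof Ref)
        {
            n = ((Ref) n).target;
        }
        return n;
    }

    // Collect the content streams whether the entry is one stream or an array,
    // resolving references at both levels instead of casting blindly.
    static List<ContentStream> contentStreams(CosNode contents)
    {
        List<ContentStream> result = new ArrayList<>();
        CosNode direct = resolve(contents);
        if (direct instanceof ContentStream)
        {
            result.add((ContentStream) direct);
        }
        else if (direct instanceof Arr)
        {
            for (CosNode item : ((Arr) direct).items)
            {
                CosNode element = resolve(item);
                if (element instanceof ContentStream)
                {
                    result.add((ContentStream) element);
                }
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        // An indirect array holding one indirect and one direct stream.
        CosNode contents = new Ref(new Arr()
                .add(new Ref(new ContentStream("BT")))
                .add(new ContentStream("ET")));
        System.out.println(contentStreams(contents).size());
    }
}
```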






[jira] [Updated] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-15 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5852:

Description: 
We've observed excessive CPU and memory consumption when converting a PDF to 
images when the PDF contains type 4 shading.  This is especially noticeable 
when the conversion is done with a high DPI.  Can this be improved?

 

Conversation from the PDFBox users mailing list follows

Initial email:
{quote}
Hi CPU and memory usage when converting a PDF with type 4 shading
Hello PDFBox users and maintainers,

We have a PDF that causes performance problems when we use PDFBox to
convert it to an image with renderImageWithDPI().  We're calling
renderImageWithDPI() with 650 DPI.  I realize this is a very high value -
we're using it for
high fidelity original images that will later be downsampled.  On my work
laptop which has fairly strong hardware, the conversion takes 25 minutes
and consumes 20GB of memory.  CPU and memory usage is reduced if we use a
lower DPI.

The PDF is 1 page long.  It contains type 4 shading / Gouraud free form
triangle meshes.  We've been aware of some performance issues with type 4
shading for a little while now, but the PDFs that contained the type 4
shading belonged to our customers and we were not authorized to share
them.  We finally found a problem input document that is non-sensitive and
that we are authorized to share.  I've attached a copy of the problem PDF
to this email.

I searched the archives for the users and the developers mailing list and I
didn't find anything specifically about this issue.
I searched through the PDFBox jira tickets and I found a couple of tickets
that looked similar: PDFBOX-2901 & PDFBOX-4491.  PDFBOX-2901 seems to most
closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
and our issue still reproduces with PDFBox 2.0.28.

Should I refer this issue over to the developers mailing list or create a
PDFBox Jira ticket for this?

Thanks and Regards,
Larry Lynn {quote}
Response:
{quote}
Hi,

Yes shading can be very slow, especially at high dpi. The attachment 
didn't get through, please upload to a sharehoster or create a ticket. 
If you need to register then add a meaningful text, e.g. the subject of 
this post so we know you're not a spammer. Also retry with 2.0.31 and 
3.0.2 just to be sure. However I'm pessimistic that this can be fixed.

Tilman {quote}
 

  was:
We've observed excessive CPU and memory consumption when converting a PDF to 
images when the PDF contains type 4 shading.  This is especially noticeable 
when the conversion is done with a high DPI.  Can this be improved?

 

Conversation from the PDFBox users mailing list follows

Initial email:
{code:java}
Hi CPU and memory usage when converting a PDF with type 4 shadingHello PDFBox 
users and maintainers,

We have a PDF that causes performance problems when we use PDFBox to
convert it to an image with renderImageWithDPI().  We're calling
renderImageWithDPI()
with 650 DPI.  I realize this is a very high value - we're using it for
high fidelity original images that will later be downsampled.  On my work
laptop which has fairly strong hardware, the conversion takes 25 minutes
and consumes 20GB of memory.  CPU and memory usage is reduced if we use a
lower DPI.

The PDF is 1 page long.  It contains type 4 shading / Gouraud free form
triangle meshes.  We've been aware of some performance issues with type 4
shading for a little while now, but the PDFs that contained the type 4
shading belonged to our customers and we were not authorized to share
them.  We finally found a problem input document that is non-sensitive and
that we are authorized to share.  I've attached a copy of the problem PDF
to this email.

I searched the archives for the users and the developers mailing list and I
didn't find anything specifically about this issue.
I searched through the PDFBox jira tickets and I found a couple of tickets
that looked similar: PDFBOX-2901 & PDFBOX-4491.  PDFBOX-2901 seems to most
closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
and our issue still reproduces with PDFBox 2.0.28.

Should I refer this issue over to the developers mailing list or create a
PDFBox Jira ticket for this?

Thanks and Regards,
Larry Lynn {code}
Response:
{code:java}
Hi,

Yes shading can be very slow, especially at high dpi. The attachment 
didn't get through, please upload to a sharehoster or create a ticket. 
If you need to register then add a meaningful text, e.g. the subject of 
this post so we know you're not a spammer. Also retry with 2.0.31 and 
3.0.2 just to be sure. However I'm pessimistic that this can be fixed.

Tilman {code}
 


> Hi CPU and memory usage when converting a PDF with type 4 shading
> --

[jira] [Resolved] (PDFBOX-5469) Make COSString immutable

2024-09-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-5469.

Resolution: Fixed

Set to resolved

> Make COSString immutable
> 
>
> Key: PDFBOX-5469
> URL: https://issues.apache.org/jira/browse/PDFBOX-5469
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing, PDModel
>Affects Versions: 2.0.26, 3.0.0 PDFBox
>Reporter: Andreas Lehmkühler
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 4.0.0
>
>
> We should change COSString to be immutable as discussed in PDFBOX-5451. 






[jira] [Commented] (PDFBOX-5469) Make COSString immutable

2024-09-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881856#comment-17881856
 ] 

ASF subversion and git services commented on PDFBOX-5469:
-

Commit 1920693 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920693 ]

PDFBOX-5469: use new constructor to avoid deprecated setter







[jira] [Commented] (PDFBOX-5469) Make COSString immutable

2024-09-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881855#comment-17881855
 ] 

ASF subversion and git services commented on PDFBOX-5469:
-

Commit 1920692 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920692 ]

PDFBOX-5469: add two constructors to avoid deprecated setter
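
The idea behind replacing a setter with constructors can be sketched as below. The class ImmutableBytesSketch, its fields and its constructor signatures are hypothetical. They are not the COSString API and are not taken from the commit. The sketch only shows the pattern: all state is passed at construction time, stored in final fields, and protected by defensive copies.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical value class; names and signatures are illustrative only.
final class ImmutableBytesSketch
{
    private final byte[] bytes;
    private final boolean forceHex;

    // The flag is fixed at construction time, so no setter is needed afterwards.
    ImmutableBytesSketch(byte[] bytes, boolean forceHex)
    {
        this.bytes = bytes.clone(); // defensive copy of the caller's array
        this.forceHex = forceHex;
    }

    // Convenience constructor for text input (encoding chosen for the sketch).
    ImmutableBytesSketch(String text, boolean forceHex)
    {
        this(text.getBytes(StandardCharsets.ISO_8859_1), forceHex);
    }

    // Returns a copy so that callers cannot modify the internal state.
    byte[] getBytes()
    {
        return bytes.clone();
    }

    boolean isForceHex()
    {
        return forceHex;
    }
}
```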







[jira] [Commented] (PDFBOX-5660) Improve code quality (5)

2024-09-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881854#comment-17881854
 ] 

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920689 from le...@apache.org in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920689 ]

PDFBOX-5660: close InputStream as MemoryCacheImageInputStream doesn't do it
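
The behaviour named in the commit message can be reproduced with the JDK alone: closing a javax.imageio.stream.MemoryCacheImageInputStream frees its cache but leaves the source InputStream open. The sketch below is not the committed change. CloseSketch and TrackingStream are hypothetical helpers that only make the effect visible.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

// Hypothetical demo class; not part of PDFBox.
class CloseSketch
{
    // Helper that remembers whether close() was called on it.
    static final class TrackingStream extends ByteArrayInputStream
    {
        boolean closed;

        TrackingStream(byte[] data)
        {
            super(data);
        }

        @Override
        public void close() throws IOException
        {
            closed = true;
            super.close();
        }
    }

    // Closes only the wrapper and reports whether the source was closed as well.
    static boolean closeWrapperOnly()
    {
        TrackingStream source = new TrackingStream(new byte[] { 1, 2, 3 });
        try
        {
            ImageInputStream iis = new MemoryCacheImageInputStream(source);
            iis.read();
            iis.close();
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        return source.closed;
    }

    // Declares both resources, so both are closed (wrapper first, then source).
    static boolean closeBoth()
    {
        TrackingStream source = new TrackingStream(new byte[] { 1, 2, 3 });
        try (InputStream in = source;
             ImageInputStream iis = new MemoryCacheImageInputStream(in))
        {
            iis.read();
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        return source.closed;
    }

    public static void main(String[] args)
    {
        System.out.println("wrapper only closes source: " + closeWrapperOnly());
        System.out.println("try-with-resources closes source: " + closeBoth());
    }
}
```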

> Improve code quality (5)
> 
>
> Key: PDFBOX-5660
>     URL: https://issues.apache.org/jira/browse/PDFBOX-5660
> Project: PDFBox
>  Issue Type: Improvement
>Reporter: Tilman Hausherr
>Priority: Minor
> Attachments: AnnotationSample.Standard.pdf, 
> DRY_refactoring_Typ2CharStringParser.patch, 
> Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch,
>  
> Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch,
>  Simplify_string_conversion_in_PDFHighlighter.patch, 
> Update_string_handling_and_regex_in_several_classes.patch, 
> avoid_multiple_unboxing.patch, code_cleanup.patch, 
> do_not_create_temporary_File_instance.patch, 
> extract_common_code,_move_toUpperCase()_out_of_loop.patch, 
> fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch, 
> introduce_COSArray_of(float___)_to_make_the_code_more_concise_and_avoid_creating_and_copyi.patch,
>  introduce_StringUtil_class_for_reusable_functionality.patch, 
> introduce_constants_COSFLOAT_ZERO_and_COSFloat_ONE_to_avoid_creating_unnecessary_instances.patch,
>  make_inner_class_static.patch, refactor_isEndOfName.patch, 
> remove_code_duplication_in_Type2CharStringParser.patch, 
> remove_obsolete_class_NullOutputStream.patch, 
> remove_unnecessary_calls_to_toString()_String_valueOf().patch, 
> replace_System_getProperty()_calls.patch, screenshot-1.png, 
> simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch,
>  simplify_stream_operations.patch, use_Map_ofEntries().patch, 
> use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch, 
> use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch,
>  use_String_join().patch, use_switch_for_readability.patch, 
> use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch
>
>
> This is a longterm issue for the task to improve code quality, by using the 
> SonarQube report, hints in different IDEs, the FindBugs tool and other code 
> quality tools.
> This is a follow-up of PDFBOX-4892, which was getting too long.






[jira] [Commented] (PDFBOX-5660) Improve code quality (5)

2024-09-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881852#comment-17881852
 ] 

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920687 from le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920687 ]

PDFBOX-5660: close InputStream as MemoryCacheImageInputStream doesn't do it







[jira] [Commented] (PDFBOX-5660) Improve code quality (5)

2024-09-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881853#comment-17881853
 ] 

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920688 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920688 ]

PDFBOX-5660: close InputStream as MemoryCacheImageInputStream doesn't do it

> Improve code quality (5)
> 
>
> Key: PDFBOX-5660
>     URL: https://issues.apache.org/jira/browse/PDFBOX-5660
> Project: PDFBox
>  Issue Type: Improvement
>Reporter: Tilman Hausherr
>Priority: Minor
> Attachments: AnnotationSample.Standard.pdf, 
> DRY_refactoring_Typ2CharStringParser.patch, 
> Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch,
>  
> Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch,
>  Simplify_string_conversion_in_PDFHighlighter.patch, 
> Update_string_handling_and_regex_in_several_classes.patch, 
> avoid_multiple_unboxing.patch, code_cleanup.patch, 
> do_not_create_temporary_File_instance.patch, 
> extract_common_code,_move_toUpperCase()_out_of_loop.patch, 
> fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch, 
> introduce_COSArray_of(float___)_to_make_the_code_more_concise_and_avoid_creating_and_copyi.patch,
>  introduce_StringUtil_class_for_reusable_functionality.patch, 
> introduce_constants_COSFLOAT_ZERO_and_COSFloat_ONE_to_avoid_creating_unnecessary_instances.patch,
>  make_inner_class_static.patch, refactor_isEndOfName.patch, 
> remove_code_duplication_in_Type2CharStringParser.patch, 
> remove_obsolete_class_NullOutputStream.patch, 
> remove_unnecessary_calls_to_toString()_String_valueOf().patch, 
> replace_System_getProperty()_calls.patch, screenshot-1.png, 
> simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch,
>  simplify_stream_operations.patch, use_Map_ofEntries().patch, 
> use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch, 
> use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch,
>  use_String_join().patch, use_switch_for_readability.patch, 
> use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch
>
>
> This is a long-term issue for the task of improving code quality, using the 
> SonarQube report, hints from different IDEs, the FindBugs tool, and other code 
> quality tools.
> This is a follow-up to PDFBOX-4892, which was getting too long.






[jira] [Commented] (PDFBOX-5660) Improve code quality (5)

2024-09-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881325#comment-17881325
 ] 

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920596 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920596 ]

PDFBOX-5660: update minimum maven version




[jira] [Commented] (PDFBOX-5660) Improve code quality (5)

2024-09-12 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881324#comment-17881324
 ] 

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920595 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920595 ]

PDFBOX-5660: update log4j




[jira] [Commented] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-09 Thread Joseph Jezerinac (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880425#comment-17880425
 ] 

Joseph Jezerinac commented on PDFBOX-5878:
--

Could an admin please delete the attachments? [~tilman]

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
> URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf
>
>
> After flattening a PDF AcroForm, the values of some fields become blurred:
> {code:java}
> PDDocument pdDocument = Loader.loadPDF(inFile, "");
> pdDocument.setResourceCache(new DefaultResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(outFile);
>     }
> }
> catch (Exception e) {}
> {code}
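
A side note on the reporter's snippet: it never closes the document, and its empty catch block discards any failure, which can make a problem like this harder to diagnose. The following is only a minimal sketch of the try-with-resources shape. FakeDocument, flattenAndSave and process are hypothetical stand-ins invented for this illustration, not PDFBox API; the only PDFBox fact relied on is that PDDocument is Closeable.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CloseAndReport {

    // Hypothetical stand-in for a document object; it only records what happens to it.
    static final class FakeDocument implements Closeable {
        private final List<String> log;

        FakeDocument(List<String> log) {
            this.log = log;
        }

        // Simulates the flatten-and-save step; fails on request.
        void flattenAndSave(boolean fail) throws IOException {
            if (fail) {
                throw new IOException("save failed");
            }
            log.add("saved");
        }

        @Override
        public void close() {
            log.add("closed");
        }
    }

    // Runs the step inside try-with-resources and reports errors instead of swallowing them.
    public static List<String> process(boolean fail) {
        List<String> log = new ArrayList<>();
        try (FakeDocument doc = new FakeDocument(log)) {
            doc.flattenAndSave(fail);
        } catch (IOException e) {
            // The resource has already been closed when this handler runs.
            log.add("error: " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(process(false));
        System.out.println(process(true));
    }
}
```

The resource is closed on both paths, and the failure ends up in the log rather than disappearing.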






[jira] [Updated] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-09 Thread Joseph Jezerinac (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Jezerinac updated PDFBOX-5878:
-
Attachment: (was: beforeFlattening.pdf)




[jira] [Assigned] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-09 Thread Jira


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler reassigned PDFBOX-5852:
--

Assignee: Andreas Lehmkühler

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Attachments: minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to 
> images when the PDF contains type 4 shading.  This is especially noticeable 
> when the conversion is done with a high DPI.  Can this be improved?
>  
> Conversation from the PDFBox users mailing list follows
> Initial email:
> {code:java}
> Hi CPU and memory usage when converting a PDF with type 4 shading
>
> Hello PDFBox users and maintainers,
>
> We have a PDF that causes performance problems when we use PDFBox to
> convert it to an image with renderImageWithDPI().  We're calling
> renderImageWithDPI() with 650 DPI.  I realize this is a very high value -
> we're using it for high fidelity original images that will later be
> downsampled.  On my work laptop which has fairly strong hardware, the
> conversion takes 25 minutes and consumes 20GB of memory.  CPU and memory
> usage is reduced if we use a lower DPI.
>
> The PDF is 1 page long.  It contains type 4 shading / Gouraud free form
> triangle meshes.  We've been aware of some performance issues with type 4
> shading for a little while now, but the PDFs that contained the type 4
> shading belonged to our customers and we were not authorized to share
> them.  We finally found a problem input document that is non-sensitive and
> that we are authorized to share.  I've attached a copy of the problem PDF
> to this email.
>
> I searched the archives for the users and the developers mailing list and I
> didn't find anything specifically about this issue.
>
> I searched through the PDFBox jira tickets and I found a couple of tickets
> that looked similar: PDFBOX-2901 & PDFBOX-4491.  PDFBOX-2901 seems to most
> closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
> and our issue still reproduces with PDFBox 2.0.28.
>
> Should I refer this issue over to the developers mailing list or create a
> PDFBox Jira ticket for this?
>
> Thanks and Regards,
> Larry Lynn {code}
> Response:
> {code:java}
> Hi,
>
> Yes, shading can be very slow, especially at high DPI. The attachment 
> didn't get through; please upload it to a file hosting service or create a 
> ticket. If you need to register, add a meaningful text, e.g. the subject of 
> this post, so we know you're not a spammer. Also retry with 2.0.31 and 
> 3.0.2 just to be sure. However, I'm pessimistic that this can be fixed.
>
> Tilman {code}
>  
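
To give a feel for why lowering the DPI helps, here is a rough back-of-the-envelope sketch of how the output raster alone grows with resolution. It rests on assumptions that are not in the report: a US Letter page (612 x 792 points), 4 bytes per pixel, and the default PDF user space of 72 units per inch. The class and method names are made up for this illustration. It covers only the final image buffer, not the intermediate data built while rendering the shading, so it does not by itself account for the 20 GB reported.

```java
public class RasterEstimate {

    // Default PDF user space: 72 units per inch.
    static final double POINTS_PER_INCH = 72.0;

    // Number of pixels along one page dimension at the given resolution.
    public static long pixels(double points, double dpi) {
        return Math.round(points / POINTS_PER_INCH * dpi);
    }

    // Size of the output raster in bytes, assuming 4 bytes per pixel.
    public static long rasterBytes(double widthPt, double heightPt, double dpi) {
        return pixels(widthPt, dpi) * pixels(heightPt, dpi) * 4L;
    }

    public static void main(String[] args) {
        // Assumed page size (US Letter); the real page size is not given in the report.
        double widthPt = 612;
        double heightPt = 792;
        for (double dpi : new double[] {72, 150, 300, 650}) {
            System.out.printf("%4.0f DPI: %d x %d px, about %.1f MB%n",
                    dpi, pixels(widthPt, dpi), pixels(heightPt, dpi),
                    rasterBytes(widthPt, heightPt, dpi) / 1e6);
        }
    }
}
```

Under these assumptions, 650 DPI gives a 5525 x 7150 pixel image of about 158 MB. The pixel count grows with the square of the DPI, so any per-pixel shading work grows at the same rate.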






[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-08 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880141#comment-17880141
 ] 

Michael Klink edited comment on PDFBOX-5878 at 9/8/24 2:01 PM:
---

As an aside: The document has set *NeedAppearances* to *true*, so regenerating 
the appearances for flattening would be appropriate here whether there are any 
issues in the PDF or not.

Furthermore, when (re-)creating form field appearances, one should always 
create them from scratch. The only exception may be if one has analyzed the 
existing appearance contents and has made sure that all old field contents 
have been removed from it, and also that nothing in it moves the to-be-drawn 
new content out of the bbox or otherwise obscures it. Otherwise this might be 
a variation of one of the well-known www.pdf-insecurity.org attacks. But I 
think it is generally really hard to tell whether some "rectangles and such" 
are drawn as part of a specific style or as a forgery attempt. Thus, I'd 
really propose to always re-create from scratch.


was (Author: mkl):
As an aside: The document has set *NeedAppearances* to *true*, so regenerating 
the appearances for flattening would be appropriate here whether there are any 
issues in the PDF or not.

Furthermore, when (re-)creating form field appearances, one should always 
create from scratch. The only exception may be if one has analyzed the existing 
appearance contents and has made sure that one has removed all old field 
contents from it and also that there is nothing in it that moves the 
to-be-drawn new content out of the bbox or otherwise obscure it. Else this 
might be a variation of one of the well-known www.pdf-insecurity.org attacks. 
But I think it in general is really hard to tell whether some "rectangles and 
such" are drawn as part of a specific style or as an forgery attempt. Thus, I'd 
really propose to always re-create from scratch.

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
>     URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, 
> flattened.pdf
>
>



[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-08 Thread Michael Klink (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880141#comment-17880141
 ] 

Michael Klink edited comment on PDFBOX-5878 at 9/8/24 1:59 PM:
---

As an aside: The document has set *NeedAppearances* to *true*, so regenerating 
the appearances for flattening would be appropriate here whether there are any 
issues in the PDF or not.

Furthermore, when (re-)creating form field appearances, one should always 
create them from scratch. The only exception may be if one has analyzed the 
existing contents and has made sure that all old field contents have been 
removed from it, and also that nothing in it moves the to-be-drawn new content 
out of the bbox or otherwise obscures it. Otherwise this might be a variation 
of one of the well-known www.pdf-insecurity.org attacks. But I think it is 
generally really hard to tell whether some "rectangles and such" are drawn as 
part of a specific style or as a forgery attempt. Thus, I'd really propose to 
always re-create from scratch.


was (Author: mkl):
As an aside: The document has set **NeedAppearances** to **true**, so 
regenerating the appearances for flattening would be appropriate here whether 
there are any issues in the PDF or not.

Furthermore, when (re-)creating form field appearances, one should always 
create from scratch. The only exception may be if one has analyzed the existing 
contents and has made sure that one has removed all old field contents from it 
and also that there is nothing in it that moves the to-be-drawn new content out 
of the bbox or otherwise obscure it. Else this might be a variation of one of 
the well-known www.pdf-insecurity.org attacks. But I think it in general is 
really hard to tell whether some "rectangles and such" are drawn as part of a 
specific style or as an forgery attempt. Thus, I'd really propose to always 
re-create from scratch.

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
>     URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, 
> flattened.pdf
>
>



[jira] [Commented] (PDFBOX-5660) Improve code quality (5)

2024-09-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880031#comment-17880031
 ] 

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920506 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920506 ]

PDFBOX-5660: update log4j




[jira] [Commented] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-06 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879888#comment-17879888
 ] 

Maruan Sahyoun commented on PDFBOX-5878:


It's not weird per se; it's only weird here because the field's appearance 
stream contains multiple occurrences of the value and other content where 
there should be only one. But we can look at changing that, maybe for 4.0, to 
create the appearance from scratch (which is more in line with what Adobe 
does, although according to the spec normally only with certain field 
settings, which do not apply here).

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
> URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, 
> flattened.pdf
>
>



[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-06 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879832#comment-17879832
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 10:35 AM:
--

Here's what worked:
{code:java}
for (PDField field: acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (field instanceof PDVariableText)
        {
            for (PDAnnotationWidget widget : field.getWidgets())
            {
                widget.setAppearance(null);
            }
        }
    }
}
acroForm.refreshAppearances();
{code}
 [^PDFBox5878-flattened.pdf]
[^PDFBox5878-saved.pdf] 
The only problem left is that the second multiline field starts a bit too low, 
but IIRC there's another issue about that.


was (Author: tilman):
Here's what worked:
{code:java}
for (PDField field: acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (field instanceof PDVariableText)
        {
            for (PDAnnotationWidget widget : field.getWidgets())
            {
                widget.setAppearance(null);
            }
        }
    }
}
acroForm.refreshAppearances();
{code}
 [^PDFBox5878-flattened.pdf]
[^PDFBox5878-saved.pdf] 

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
>     URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, 
> flattened.pdf
>
>



[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-06 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879832#comment-17879832
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 10:31 AM:
--

Here's what worked:
{code:java}
for (PDField field: acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (field instanceof PDVariableText)
        {
            for (PDAnnotationWidget widget : field.getWidgets())
            {
                widget.setAppearance(null);
            }
        }
    }
}
acroForm.refreshAppearances();
{code}
 [^PDFBox5878-flattened.pdf]
[^PDFBox5878-saved.pdf] 


was (Author: tilman):
Here's what worked:
{code:java}
for (PDField field: acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (field instanceof PDVariableText)
        {
            for (PDAnnotationWidget widget : field.getWidgets())
            {
                widget.setAppearance(null);
            }
        }
    }
    acroForm.refreshAppearances();
}
{code}
 [^PDFBox5878-flattened.pdf]
[^PDFBox5878-saved.pdf] 

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
> URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, 
> flattened.pdf
>
>



[jira] [Updated] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-06 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5878:

Attachment: PDFBox5878-flattened.pdf
PDFBox5878-saved.pdf




[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-06 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879822#comment-17879822
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 9:30 AM:
-

I added this for the missing fonts; whether these are the correct fonts is 
just a guess:
{code:java}
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
acroForm.setNeedAppearances(false);
PDFont font1 = PDType0Font.load(doc,
        new FileInputStream("c:/windows/fonts/times.ttf"), false);
PDFont font2 = PDType0Font.load(doc,
        new FileInputStream("c:/windows/fonts/timesbd.ttf"), false);
PDFont font3 = PDType0Font.load(doc,
        new FileInputStream("c:/windows/fonts/arial.ttf"), false);
acroForm.getDefaultResources().put(COSName.getPDFName("TimesNewRomanPSMT"), font1);
acroForm.getDefaultResources().put(COSName.getPDFName("TimesNewRomanPS-BoldMT"), font2);
acroForm.getDefaultResources().put(COSName.getPDFName("Helvetica"), font3);
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (((PDTextField) field).isMultiline())
        {
            field.setValue("XXX");
        }
    }
}
{code}
But when setting a value, this happens in 
AppearanceGeneratorHelper.setAppearanceContent():
{code}
if (bmcIndex == -1)
{
    // append to existing stream
    writer.writeTokens(tokens);
    writer.writeTokens(COSName.TX, BMC);
}
{code}
So it appends to the existing appearance stream. This is the result after 
calling setValue("XXX"):
{code}
q
Q
q
  9.613575 0.4609071 430.9062 41.31819 re
  W
  n
  q
0.9781767 0 0 -0.9781767 -87.43936 478.0107 cm
BT
  11 0 0 -11 102.2182 458.5622 Tm
  /TT21 1 Tf
  [ (N) -0.2 (a) 0.2 (m) 0.2 (e) 0.2 ( c) 0.2 (ha) 0.2 (nge) 0.2 (d 09/) 
0.2 (26/) 0.2 (2020) ] TJ
ET
  Q
Q
q
  6.43259 0.3084 434.0872 41.6232 re
  W
  n
  q
0.9853977 0 0 0.9853977 9.388783 29.51731 cm
BT
  11 0 0 11 0 0 Tm
  /TT18 1 Tf
  [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 
(ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 
111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 
(ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof 
of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ
ET
  Q
  q
0.9853977 0 0 0.9853977 9.388783 17.51355 cm
BT
  11 0 0 11 0 0 Tm
  /TT18 1 Tf
  [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 
(i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) ] TJ
ET
  Q
Q
q
  3.228123 0.1547671 437.2917 41.93047 re
  W
  n
  q
0.992672 0 0 0.992672 6.206139 29.5793 cm
BT
  11 0 0 11 0 0 Tm
  /TT19 1 Tf
  [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 
(ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 
111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 
(ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof 
of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ
ET
  Q
  q
0.992672 0 0 0.992672 6.206139 17.48693 cm
BT
  11 0 0 11 0 0 Tm
  /TT19 1 Tf
  [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 
(i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) ] TJ
ET
  Q
Q
q
  0 0 440.5198 42.24 re
  W
  n
  /Cs6 cs
  0 sc
  q
1 0 0 1 3 29.64175 cm
BT
  11 0 0 11 0 0 Tm
  /TT20 1 Tf
  [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 
(ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 
111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 
(ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof 
of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ
ET
  Q
  q
1 0 0 1 3 17.46011 cm
BT
  11 0 0 11 0 0 Tm
  /TT20 1 Tf
  [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 
(i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) ] TJ
ET
  Q
Q
/Tx BMC
  q
-2.252 1 441.7718 40.24 re
W
n
BT
  /TimesNewRomanPSMT 11 Tf
  /DeviceGray cs
  0 sc
  -1.252 25.4319 Td
  (\000;\000;\000;) Tj
ET
  Q
EMC
{code}
So the XXX is there, but also all the previous content.
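
The append behaviour can be illustrated with a small stand-alone model. This is only a sketch, not PDFBox code: tokens are plain strings instead of COS objects, the class and method names are made up, and the branch that handles an existing {{/Tx BMC}} section is an assumption about how the rest of the method behaves (only the "append" branch is quoted above).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the behaviour described above.
class AppearanceAppendModel
{
    // Returns the token list that results from "setting a value".
    static List<String> setAppearanceContent(List<String> tokens, List<String> newContent)
    {
        List<String> out = new ArrayList<>();
        int bmcIndex = tokens.indexOf("BMC");
        if (bmcIndex == -1)
        {
            // no marked-content section: keep everything and append a new one
            out.addAll(tokens);
            out.add("/Tx");
            out.add("BMC");
        }
        else
        {
            // assumed: keep only what precedes the old field content
            out.addAll(tokens.subList(0, bmcIndex + 1));
        }
        out.addAll(newContent);
        int emcIndex = tokens.indexOf("EMC");
        if (emcIndex == -1)
        {
            out.add("EMC");
        }
        else
        {
            // assumed: keep EMC and whatever follows it
            out.addAll(tokens.subList(emcIndex, tokens.size()));
        }
        return out;
    }

    public static void main(String[] args)
    {
        List<String> newContent = Arrays.asList("BT", "(XXX)", "Tj", "ET");
        // stream without /Tx BMC: the old text survives next to the new text
        System.out.println(setAppearanceContent(
                Arrays.asList("BT", "(old)", "Tj", "ET"), newContent));
        // stream with /Tx BMC ... EMC: the old text is replaced
        System.out.println(setAppearanceContent(
                Arrays.asList("/Tx", "BMC", "BT", "(old)", "Tj", "ET", "EMC"), newContent));
    }
}
```

In this model the old text only disappears when the existing stream already wraps the field content in {{/Tx BMC ... EMC}}, which the appearance streams of this file do not.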



[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-06 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879796#comment-17879796
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 8:00 AM:
-

There are so many things wrong with this PDF that I don't see a specific 
solution. I'm doing this just for fun. I was able to fix some of the fields 
(e.g. Last1) but not yet all (e.g. the multiline fields and some others), for 
some unknown reason. (I added the missing fonts to the default resources) Not 
all appearances are redrawn. Either there's a bug in my code or there is 
something in our code that skips the recreation of the appearances and I forgot 
about it.

It's not even recreated when changing the value to something else?!






[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-05 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879753#comment-17879753
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 4:04 AM:
-

I could try to getValue() and setValue() on the text fields and see whether it 
looks better when PDFBox recreates the appearances. These fields have a value 
that makes sense. I'm just wondering whether this person will have legal 
disadvantages if the file is refused? (Although I doubt that the content of 
field {{Root/Pages/Kids/[0]/Annots/[7]/V}} will work for the petitioner). OTOH 
it's from 22.2 so it may already have been decided in some way.






[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-05 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879480#comment-17879480
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/5/24 8:16 AM:
-

{code}
q
Q
q
  9.469598 0.4248199 206.7517 18.55036 re
  W
  n
  q
0.9562042 0 0 -0.9562042 -55.6218 672.8725 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT21 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
q
  6.360067 0.2853218 209.8612 18.82936 re
  W
  n
  q
0.9705854 0 0 -0.9705854 -59.7103 682.8466 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT18 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
q
  3.203769 0.1437257 213.0175 19.11255 re
  W
  n
  q
0.9851829 0 0 -0.9851829 -63.86029 692.9707 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT19 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
q
  0 0 216.2213 19.4 re
  W
  n
  /Cs6 cs
  0 sc
  q
1 0 0 -1 -68.0727 703.247 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT20 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
{code}
The text appears 3 times at slightly different positions in this appearance 
stream.
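
A quick way to check a decoded content stream for this kind of duplication is to count its text objects. The following is a minimal sketch using only the JDK; the class name and the abbreviated sample stream are made up for illustration, and the scanner is deliberately naive (it only skips literal strings in parentheses and does not handle hex strings, comments or inline images).

```java
// Hypothetical helper: counts BT operators outside of literal strings.
class TextObjectCounter
{
    static int countTextObjects(String content)
    {
        int count = 0;
        int depth = 0; // nesting depth inside a (literal string)
        StringBuilder token = new StringBuilder();
        for (int i = 0; i < content.length(); i++)
        {
            char c = content.charAt(i);
            if (depth > 0)
            {
                if (c == '\\')
                {
                    i++; // skip the escaped character
                }
                else if (c == '(')
                {
                    depth++;
                }
                else if (c == ')')
                {
                    depth--;
                }
                continue;
            }
            if (c == '(' || c == '[' || c == ']' || Character.isWhitespace(c))
            {
                // a delimiter ends the pending token
                if (token.toString().equals("BT"))
                {
                    count++;
                }
                token.setLength(0);
                if (c == '(')
                {
                    depth = 1;
                }
            }
            else
            {
                token.append(c);
            }
        }
        if (token.toString().equals("BT"))
        {
            count++;
        }
        return count;
    }

    public static void main(String[] args)
    {
        // abbreviated, made-up sample with two text objects
        String sample = "q BT /TT21 1 Tf [ (E) 0.2 (l) ] TJ ET Q "
                      + "q BT /TT18 1 Tf [ (E) 0.2 (l) ] TJ ET Q";
        System.out.println(countTextObjects(sample));
    }
}
```

A field whose appearance stream draws the same string in more than one text object, each with a slightly different {{cm}} matrix, is a likely candidate for the blurred look.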






[jira] [Updated] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-05 Thread Maruan Sahyoun (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun updated PDFBOX-5878:
---
Labels: Appearance  (was: )




[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-04 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879145#comment-17879145
 ] 

ASF subversion and git services commented on PDFBOX-5876:
-

Commit 1920451 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920451 ]

PDFBOX-5876: revert due to rendering regression test failure

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> }
> {code}






[jira] [Reopened] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-04 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reopened PDFBOX-5876:
-




[jira] [Closed] (PDFBOX-5877) After flattening a form pdf, the pdf loses content

2024-09-03 Thread Jira


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler closed PDFBOX-5877.
--
Resolution: Not A Problem

> After flattening a form pdf, the pdf loses content
> --
>
> Key: PDFBOX-5877
> URL: https://issues.apache.org/jira/browse/PDFBOX-5877
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: beforeFalttening.pdf, flattenedPdf.pdf
>
>
> After flattening the PDF form, the content changes. Please take a look at the 
> before and after PDFs. The flattening works fine in 2.0.31. After upgrading to 
> 3.0.3 we started getting many issues with PDF forms after flattening. 
> The code used for flattening is as follows:
> {code}
> PDDocument pdDocument = Loader.loadPDF(file, "");
> pdDocument.setResourceCache(new PdfResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(file);
>     }
> }
> {code}






[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content

2024-09-03 Thread Joseph Jezerinac (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879066#comment-17879066
 ] 

Joseph Jezerinac commented on PDFBOX-5877:
--

Writing to a different file does solve this and many other issues. Will keep 
checking other pdfs. Thank you very much for your help




[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content

2024-09-03 Thread Joseph Jezerinac (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879031#comment-17879031
 ] 

Joseph Jezerinac commented on PDFBOX-5877:
--

Thank you for the quick response. Yes, I tried in 3.0.3. Sorry about 
PdfResourceCache, that is our class. Our code was not changed and works fine 
with the older version of PDFBox, so I thought maybe something got changed in 
the new version. Will look into what was pointed out by Lehmkühler and let you 
know.




[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content

2024-09-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878964#comment-17878964
 ] 

Tilman Hausherr commented on PDFBOX-5877:
-

Yeah!! There's a log message, so it means you also disabled or disregarded logs 
:-(




[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content

2024-09-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878963#comment-17878963
 ] 

Andreas Lehmkühler commented on PDFBOX-5877:


Maybe more important: don't use the input file as output. The on-demand parser 
may read from the input file until it is closed. Most likely you are 
overwriting the source while saving the resulting file. 
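
If the result has to end up under the original name, one common pattern is to save to a temporary file first and move it over the original only after the source has been closed. The following is a minimal sketch using only the JDK; {{SafeOverwrite}}, {{Saver}} and {{replaceViaTempFile}} are made-up names, and the callback stands in for the PDFBox calls (saving the document to the temporary path and then closing it).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of PDFBox.
class SafeOverwrite
{
    // Callback that writes the result to the given path and releases the source.
    interface Saver
    {
        void saveTo(Path target) throws IOException;
    }

    static void replaceViaTempFile(Path original, Saver saver) throws IOException
    {
        // create the temp file next to the original so the move stays on one file system
        Path dir = original.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "flattened-", ".tmp");
        try
        {
            // with PDFBox, the callback would save the document to tmp and then close it
            saver.saveTo(tmp);
            // the source is no longer being read, so it is safe to replace it now
            Files.move(tmp, original, StandardCopyOption.REPLACE_EXISTING);
        }
        finally
        {
            // no-op after a successful move; cleans up if saving failed
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException
    {
        Path file = Files.createTempFile("safe-overwrite", ".txt");
        Files.write(file, "old".getBytes(StandardCharsets.UTF_8));
        // the callback may still read the original while it writes the new file
        replaceViaTempFile(file, tmp -> {
            String in = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            Files.write(tmp, (in + "+new").getBytes(StandardCharsets.UTF_8));
        });
        System.out.println(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```

The important part is the ordering: the original is only touched after the new file has been written completely and the source is no longer being read.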




[jira] [Comment Edited] (PDFBOX-5877) After flattening a form pdf, the pdf loses content

2024-09-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878961#comment-17878961
 ] 

Tilman Hausherr edited comment on PDFBOX-5877 at 9/3/24 5:55 PM:
-

What's this?
{code}
pdDocument.setResourceCache(new PdfResourceCache())
{code}
We have no class {{PdfResourceCache}}.



> After flattening a form pdf, the pdf loses content
> --
>
> Key: PDFBOX-5877
> URL: https://issues.apache.org/jira/browse/PDFBOX-5877
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: beforeFalttening.pdf, flattenedPdf.pdf
>
>
> After flattening, the content of the PDF form changes. Please take a look at
> the before and after PDFs. Flattening works fine in 2.0.31. After upgrading
> to 3.0.3 we started getting many issues with PDF forms after flattening.
> The code used for flattening is as follows:
> {code}
> PDDocument pdDocument = Loader.loadPDF(file, "");
> pdDocument.setResourceCache(new PdfResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         // Drop encryption so the saved copy is unprotected.
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             // Flatten all form fields into the page content.
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(file);
>     }
> } finally {
>     // Always release the document's resources.
>     pdDocument.close();
> }
> {code}






[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content

2024-09-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878961#comment-17878961
 ] 

Tilman Hausherr commented on PDFBOX-5877:
-

What's this?

pdDocument.setResourceCache(new PdfResourceCache())



> After flattening a form pdf, the pdf loses content
> --
>
> Key: PDFBOX-5877
> URL: https://issues.apache.org/jira/browse/PDFBOX-5877
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: beforeFalttening.pdf, flattenedPdf.pdf
>
>
> After flattening, the content of the PDF form changes. Please take a look at
> the before and after PDFs. Flattening works fine in 2.0.31. After upgrading
> to 3.0.3 we started getting many issues with PDF forms after flattening.
> The code used for flattening is as follows:
> {code}
> PDDocument pdDocument = Loader.loadPDF(file, "");
> pdDocument.setResourceCache(new PdfResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         // Drop encryption so the saved copy is unprotected.
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             // Flatten all form fields into the page content.
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(file);
>     }
> } finally {
>     // Always release the document's resources.
>     pdDocument.close();
> }
> {code}






[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content

2024-09-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878960#comment-17878960
 ] 

Tilman Hausherr commented on PDFBOX-5877:
-

Are you sure you used 3.0.3 and not 3.0.2? I just tried with the trunk and 
3.0.4-SNAPSHOT with our test and got only invisible differences (yours are 
clearly visible, and occur because all fonts are lost in the PDF).

> After flattening a form pdf, the pdf loses content
> --
>
> Key: PDFBOX-5877
> URL: https://issues.apache.org/jira/browse/PDFBOX-5877
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: beforeFalttening.pdf, flattenedPdf.pdf
>
>
> After flattening, the content of the PDF form changes. Please take a look at
> the before and after PDFs. Flattening works fine in 2.0.31. After upgrading
> to 3.0.3 we started getting many issues with PDF forms after flattening.
> The code used for flattening is as follows:
> {code}
> PDDocument pdDocument = Loader.loadPDF(file, "");
> pdDocument.setResourceCache(new PdfResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         // Drop encryption so the saved copy is unprotected.
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             // Flatten all form fields into the page content.
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(file);
>     }
> } finally {
>     // Always release the document's resources.
>     pdDocument.close();
> }
> {code}






[jira] [Updated] (PDFBOX-5877) After flattening a form pdf, the pdf loses content

2024-09-03 Thread Jira


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-5877:
---
Description: 
After flattening, the content of the PDF form changes. Please take a look at
the before and after PDFs. Flattening works fine in 2.0.31. After upgrading
to 3.0.3 we started getting many issues with PDF forms after flattening.
The code used for flattening is as follows:
{code}
PDDocument pdDocument = Loader.loadPDF(file, "");
pdDocument.setResourceCache(new PdfResourceCache());
try {
    boolean save = false;
    if (pdDocument.isEncrypted()) {
        // Drop encryption so the saved copy is unprotected.
        pdDocument.setAllSecurityToBeRemoved(true);
        save = true;
    }
    final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
    if (pdDocumentCatalog != null) {
        final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
        if (pdForm != null) {
            // Flatten all form fields into the page content.
            pdForm.flatten();
            save = true;
        }
    }
    if (save) {
        pdDocument.save(file);
    }
} finally {
    // Always release the document's resources.
    pdDocument.close();
}
{code}


  was:
After flattening, the content of the PDF form changes. Please take a look at
the before and after PDFs. Flattening works fine in 2.0.31. After upgrading
to 3.0.3 we started getting many issues with PDF forms after flattening.
The code used for flattening is as follows:

PDDocument pdDocument = Loader.loadPDF(file, "");
pdDocument.setResourceCache(new PdfResourceCache());
try {
    boolean save = false;
    if (pdDocument.isEncrypted()) {
        // Drop encryption so the saved copy is unprotected.
        pdDocument.setAllSecurityToBeRemoved(true);
        save = true;
    }
    final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
    if (pdDocumentCatalog != null) {
        final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
        if (pdForm != null) {
            // Flatten all form fields into the page content.
            pdForm.flatten();
            save = true;
        }
    }
    if (save) {
        pdDocument.save(file);
    }
} finally {
    // Always release the document's resources.
    pdDocument.close();
}


> After flattening a form pdf, the pdf loses content
> --
>
> Key: PDFBOX-5877
> URL: https://issues.apache.org/jira/browse/PDFBOX-5877
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: beforeFalttening.pdf, flattenedPdf.pdf
>
>
> After flattening, the content of the PDF form changes. Please take a look at
> the before and after PDFs. Flattening works fine in 2.0.31. After upgrading
> to 3.0.3 we started getting many issues with PDF forms after flattening.
> The code used for flattening is as follows:
> {code}
> PDDocument pdDocument = Loader.loadPDF(file, "");
> pdDocument.setResourceCache(new PdfResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         // Drop encryption so the saved copy is unprotected.
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             // Flatten all form fields into the page content.
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(file);
>     }
> } finally {
>     // Always release the document's resources.
>     pdDocument.close();
> }
> {code}






[jira] [Updated] (PDFBOX-5877) After flattening a form pdf, the pdf loses content

2024-09-03 Thread Jira


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-5877:
---
Affects Version/s: 3.0.3 PDFBox
   (was: 3.0.3 JBIG2)

> After flattening a form pdf, the pdf loses content
> --
>
> Key: PDFBOX-5877
> URL: https://issues.apache.org/jira/browse/PDFBOX-5877
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: beforeFalttening.pdf, flattenedPdf.pdf
>
>
> After flattening, the content of the PDF form changes. Please take a look at
> the before and after PDFs. Flattening works fine in 2.0.31. After upgrading
> to 3.0.3 we started getting many issues with PDF forms after flattening.
> The code used for flattening is as follows:
> PDDocument pdDocument = Loader.loadPDF(file, "");
> pdDocument.setResourceCache(new PdfResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         // Drop encryption so the saved copy is unprotected.
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             // Flatten all form fields into the page content.
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(file);
>     }
> } finally {
>     // Always release the document's resources.
>     pdDocument.close();
> }






[jira] [Created] (PDFBOX-5877) After flattening a form pdf, the pdf loses content

2024-09-03 Thread Joseph Jezerinac (Jira)
Joseph Jezerinac created PDFBOX-5877:


 Summary: After flattening a form pdf, the pdf loses content
 Key: PDFBOX-5877
 URL: https://issues.apache.org/jira/browse/PDFBOX-5877
 Project: PDFBox
  Issue Type: Bug
  Components: AcroForm
Affects Versions: 3.0.3 JBIG2
 Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
Reporter: Joseph Jezerinac
 Attachments: beforeFalttening.pdf, flattenedPdf.pdf

After flattening, the content of the PDF form changes. Please take a look at
the before and after PDFs. Flattening works fine in 2.0.31. After upgrading
to 3.0.3 we started getting many issues with PDF forms after flattening.
The code used for flattening is as follows:

PDDocument pdDocument = Loader.loadPDF(file, "");
pdDocument.setResourceCache(new PdfResourceCache());
try {
    boolean save = false;
    if (pdDocument.isEncrypted()) {
        // Drop encryption so the saved copy is unprotected.
        pdDocument.setAllSecurityToBeRemoved(true);
        save = true;
    }
    final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
    if (pdDocumentCatalog != null) {
        final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
        if (pdForm != null) {
            // Flatten all form fields into the page content.
            pdForm.flatten();
            save = true;
        }
    }
    if (save) {
        pdDocument.save(file);
    }
} finally {
    // Always release the document's resources.
    pdDocument.close();
}






[jira] [Comment Edited] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-03 Thread liu (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878880#comment-17878880
 ] 

liu edited comment on PDFBOX-5876 at 9/3/24 1:23 PM:
-

A 4G JVM can only convert 8 pictures concurrently... It overflows so easily.


was (Author: JIRAUSER297279):
A 4G JVM can only convert 8 pictures concurrently...

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // Code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>    PDFRenderer renderer = new PDFRenderer(pdf);
>    int numPages = 0;
>    renderer.setSubsamplingAllowed(true);
>    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>    pdf.close();
> }
> {code}






[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-03 Thread liu (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878880#comment-17878880
 ] 

liu commented on PDFBOX-5876:
-

A 4G JVM can only convert 8 pictures concurrently...
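
The arithmetic behind this comment, and one possible way to keep concurrent conversions within the heap, can be sketched as follows. This is only an illustration with assumed numbers (about 500 MB per picture, as reported elsewhere in this thread) and hypothetical names (`RenderLimiter`, `maxConcurrent`); none of it is part of PDFBox.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Hypothetical helper: caps the number of conversions that run at the same time
// so that their combined (estimated) memory use stays below the heap budget.
class RenderLimiter {
    private final Semaphore permits;

    RenderLimiter(long heapBudgetBytes, long bytesPerTask) {
        this.permits = new Semaphore(maxConcurrent(heapBudgetBytes, bytesPerTask));
    }

    // How many tasks of the given size fit into the budget (at least one).
    static int maxConcurrent(long heapBudgetBytes, long bytesPerTask) {
        return (int) Math.max(1, heapBudgetBytes / bytesPerTask);
    }

    // Runs the task once a permit is free; the permit is always returned.
    <T> T run(Callable<T> task) throws Exception {
        permits.acquire();
        try {
            return task.call();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) throws Exception {
        long heap = 4L * 1024 * 1024 * 1024;   // -Xmx4G
        long perPicture = 500L * 1024 * 1024;  // assumed ~500 MB per picture
        System.out.println("max concurrent: " + maxConcurrent(heap, perPicture));

        RenderLimiter limiter = new RenderLimiter(heap, perPicture);
        // The lambda stands in for a real rendering call.
        String result = limiter.run(() -> "rendered");
        System.out.println(result);
    }
}
```

With these assumed figures the limit works out to 8, which matches the number in the comment. In practice the per-picture estimate would have to be measured, since the heap is shared with everything else in the JVM.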

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // Code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>    PDFRenderer renderer = new PDFRenderer(pdf);
>    int numPages = 0;
>    renderer.setSubsamplingAllowed(true);
>    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>    pdf.close();
> }
> {code}






[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878879#comment-17878879
 ] 

Tilman Hausherr commented on PDFBOX-5876:
-

No... I used -Xmx4G for a production project.

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // Code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>    PDFRenderer renderer = new PDFRenderer(pdf);
>    int numPages = 0;
>    renderer.setSubsamplingAllowed(true);
>    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>    pdf.close();
> }
> {code}






[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-03 Thread liu (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878878#comment-17878878
 ] 

liu commented on PDFBOX-5876:
-

It's still very large: one picture takes up 500 MB. Are there any other 
ways to optimize this?

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // Code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>    PDFRenderer renderer = new PDFRenderer(pdf);
>    int numPages = 0;
>    renderer.setSubsamplingAllowed(true);
>    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>    pdf.close();
> }
> {code}






[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878846#comment-17878846
 ] 

Tilman Hausherr commented on PDFBOX-5876:
-

Are you sure you are using the new version? You have to build it yourself or 
wait until a new snapshot build is available. This time, instead of using 
PDFDebugger, I just tried your code as it is with a locally built 
3.0.4-SNAPSHOT, and it worked with -Xmx600m (also with 550, but not with 500).

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // Code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>    PDFRenderer renderer = new PDFRenderer(pdf);
>    int numPages = 0;
>    renderer.setSubsamplingAllowed(true);
>    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>    pdf.close();
> }
> {code}






[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-03 Thread liu (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878837#comment-17878837
 ] 

liu commented on PDFBOX-5876:
-

I tried it, but it still seems to overflow.

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // Code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>    PDFRenderer renderer = new PDFRenderer(pdf);
>    int numPages = 0;
>    renderer.setSubsamplingAllowed(true);
>    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>    pdf.close();
> }
> {code}






[jira] [Resolved] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-03 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5876.
-
Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0
 Assignee: Tilman Hausherr
   Resolution: Fixed

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // Code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>    PDFRenderer renderer = new PDFRenderer(pdf);
>    int numPages = 0;
>    renderer.setSubsamplingAllowed(true);
>    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>    pdf.close();
> }
> {code}






[jira] [Updated] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-03 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5876:

Affects Version/s: 2.0.32

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Priority: Major
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // Code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>    PDFRenderer renderer = new PDFRenderer(pdf);
>    int numPages = 0;
>    renderer.setSubsamplingAllowed(true);
>    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>    pdf.close();
> }
> {code}






[jira] [Updated] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-03 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5876:

Component/s: Rendering

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Priority: Major
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // Code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>    PDFRenderer renderer = new PDFRenderer(pdf);
>    int numPages = 0;
>    renderer.setSubsamplingAllowed(true);
>    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>    pdf.close();
> }
> {code}






[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878835#comment-17878835
 ] 

Tilman Hausherr commented on PDFBOX-5876:
-

The JPX image in that file is 7020 x 4964, which is quite big, and -Xmx600m is 
quite low. But I noticed that the subsampling parameter wasn't used when 
reading the JPX image the second time, which was the cause of the OOM. (JPX 
images have to be read twice because of some weirdness in the specification.) 
It should work now; I tried it with PDFDebugger, which doesn't allow setting a 
temp cache.
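
To put the numbers in this comment into perspective, here is a rough back-of-the-envelope sketch; it is not PDFBox code. It assumes 4 bytes per pixel for a decoded raster and a subsampling factor of 2, both chosen for illustration only. The real per-pixel cost and the factor PDFBox picks depend on the image and on the render scale.

```java
// Rough estimate of the decoded raster size for the 7020 x 4964 JPX image.
// Assumptions (not taken from PDFBox): 4 bytes per pixel, subsampling factor 2.
class JpxMemoryEstimate {
    static long rasterBytes(int width, int height, int subsampling, int bytesPerPixel) {
        // Subsampling keeps every n-th pixel in each direction (rounded up).
        long w = (width + subsampling - 1) / subsampling;
        long h = (height + subsampling - 1) / subsampling;
        return w * h * bytesPerPixel;
    }

    public static void main(String[] args) {
        long full = rasterBytes(7020, 4964, 1, 4);
        long half = rasterBytes(7020, 4964, 2, 4);
        System.out.printf("full size:     %d bytes (~%d MiB)%n", full, full / (1024 * 1024));
        System.out.printf("subsampled x2: %d bytes (~%d MiB)%n", half, half / (1024 * 1024));
    }
}
```

Under these assumptions a single full-size raster already needs over 130 MiB, so a second read at full resolution can take up a large share of a 600 MB heap, while a read subsampled by 2 needs roughly a quarter of that.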

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 3.0.2 PDFBox
>Reporter: liu
>Priority: Major
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // Code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>    PDFRenderer renderer = new PDFRenderer(pdf);
>    int numPages = 0;
>    renderer.setSubsamplingAllowed(true);
>    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>    pdf.close();
> }
> {code}






[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878827#comment-17878827
 ] 

ASF subversion and git services commented on PDFBOX-5876:
-

Commit 1920420 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920420 ]

PDFBOX-5876: pass subsampling for second read

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 3.0.2 PDFBox
>Reporter: liu
>Priority: Major
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // Code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>    PDFRenderer renderer = new PDFRenderer(pdf);
>    int numPages = 0;
>    renderer.setSubsamplingAllowed(true);
>    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>    pdf.close();
> }
> {code}






[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878829#comment-17878829
 ] 

ASF subversion and git services commented on PDFBOX-5876:
-

Commit 1920422 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920422 ]

PDFBOX-5876: pass subsampling for second read

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 3.0.2 PDFBox
>Reporter: liu
>Priority: Major
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // Code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>    PDFRenderer renderer = new PDFRenderer(pdf);
>    int numPages = 0;
>    renderer.setSubsamplingAllowed(true);
>    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>    pdf.close();
> }
> {code}






[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-03 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878828#comment-17878828
 ] 

ASF subversion and git services commented on PDFBOX-5876:
-

Commit 1920421 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920421 ]

PDFBOX-5876: pass subsampling for second read

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 3.0.2 PDFBox
>Reporter: liu
>Priority: Major
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // Code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>    PDFRenderer renderer = new PDFRenderer(pdf);
>    int numPages = 0;
>    renderer.setSubsamplingAllowed(true);
>    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>    pdf.close();
> }
> {code}






[jira] [Created] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-02 Thread liu (Jira)
liu created PDFBOX-5876:
---

 Summary: This jpeg2000 takes up a lot of memory, causing overflow.
 Key: PDFBOX-5876
 URL: https://issues.apache.org/jira/browse/PDFBOX-5876
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 3.0.2 PDFBox
Reporter: liu
 Attachments: jpeg2000.pdf

pdf:[^jpeg2000.pdf]
JVM:-Xmx600m
{code:java}
// Code placeholder
public static void main(String[] args) throws IOException, InterruptedException {
   File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
   PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
   PDFRenderer renderer = new PDFRenderer(pdf);
   int numPages = 0;
   renderer.setSubsamplingAllowed(true);
   BufferedImage bi = renderer.renderImage(numPages, 0.5f);
   pdf.close();
}
{code}






[jira] [Updated] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-02 Thread liu (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liu updated PDFBOX-5876:

Attachment: jpeg2000.pdf

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 3.0.2 PDFBox
> Reporter: liu
> Priority: Major
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
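> {code:java}
> import java.awt.image.BufferedImage;
> import java.io.File;
> import java.io.IOException;
> import org.apache.pdfbox.Loader;
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.rendering.PDFRenderer;
>
> // Hypothetical wrapper class so that the snippet compiles on its own
> public class Jpeg2000Repro {
>     // code placeholder
>     public static void main(String[] args) throws IOException {
>         File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>         // try-with-resources closes the document even if rendering fails
>         try (PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache())) {
>             PDFRenderer renderer = new PDFRenderer(pdf);
>             int pageIndex = 0; // zero-based index of the page to render
>             renderer.setSubsamplingAllowed(true);
>             BufferedImage bi = renderer.renderImage(pageIndex, 0.5f);
>         }
>     }
> } {code}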






[jira] [Commented] (PDFBOX-5660) Improve code quality (5)

2024-09-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878493#comment-17878493
 ] 

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920378 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920378 ]

PDFBOX-5660: update owasp plugin

> Improve code quality (5)
> 
>
> Key: PDFBOX-5660
> URL: https://issues.apache.org/jira/browse/PDFBOX-5660
> Project: PDFBox
> Issue Type: Improvement
> Reporter: Tilman Hausherr
> Priority: Minor
> Attachments: AnnotationSample.Standard.pdf, 
> DRY_refactoring_Typ2CharStringParser.patch, 
> Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch,
>  
> Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch,
>  Simplify_string_conversion_in_PDFHighlighter.patch, 
> Update_string_handling_and_regex_in_several_classes.patch, 
> avoid_multiple_unboxing.patch, code_cleanup.patch, 
> do_not_create_temporary_File_instance.patch, 
> extract_common_code,_move_toUpperCase()_out_of_loop.patch, 
> fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch, 
> introduce_COSArray_of(float___)_to_make_the_code_more_concise_and_avoid_creating_and_copyi.patch,
>  introduce_StringUtil_class_for_reusable_functionality.patch, 
> introduce_constants_COSFLOAT_ZERO_and_COSFloat_ONE_to_avoid_creating_unnecessary_instances.patch,
>  make_inner_class_static.patch, refactor_isEndOfName.patch, 
> remove_code_duplication_in_Type2CharStringParser.patch, 
> remove_obsolete_class_NullOutputStream.patch, 
> remove_unnecessary_calls_to_toString()_String_valueOf().patch, 
> replace_System_getProperty()_calls.patch, screenshot-1.png, 
> simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch,
>  simplify_stream_operations.patch, use_Map_ofEntries().patch, 
> use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch, 
> use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch,
>  use_String_join().patch, use_switch_for_readability.patch, 
> use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch
>
>
> This is a long-term issue for the task of improving code quality, using the 
> SonarQube report, hints in different IDEs, the FindBugs tool and other code 
> quality tools.
> This is a follow-up of PDFBOX-4892, which was getting too long.






[jira] [Commented] (PDFBOX-5660) Improve code quality (5)

2024-09-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878492#comment-17878492
 ] 

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920377 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920377 ]

PDFBOX-5660: update owasp plugin

> Improve code quality (5)
> 
>
> Key: PDFBOX-5660
> URL: https://issues.apache.org/jira/browse/PDFBOX-5660
> Project: PDFBox
> Issue Type: Improvement
> Reporter: Tilman Hausherr
> Priority: Minor
> Attachments: AnnotationSample.Standard.pdf, 
> DRY_refactoring_Typ2CharStringParser.patch, 
> Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch,
>  
> Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch,
>  Simplify_string_conversion_in_PDFHighlighter.patch, 
> Update_string_handling_and_regex_in_several_classes.patch, 
> avoid_multiple_unboxing.patch, code_cleanup.patch, 
> do_not_create_temporary_File_instance.patch, 
> extract_common_code,_move_toUpperCase()_out_of_loop.patch, 
> fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch, 
> introduce_COSArray_of(float___)_to_make_the_code_more_concise_and_avoid_creating_and_copyi.patch,
>  introduce_StringUtil_class_for_reusable_functionality.patch, 
> introduce_constants_COSFLOAT_ZERO_and_COSFloat_ONE_to_avoid_creating_unnecessary_instances.patch,
>  make_inner_class_static.patch, refactor_isEndOfName.patch, 
> remove_code_duplication_in_Type2CharStringParser.patch, 
> remove_obsolete_class_NullOutputStream.patch, 
> remove_unnecessary_calls_to_toString()_String_valueOf().patch, 
> replace_System_getProperty()_calls.patch, screenshot-1.png, 
> simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch,
>  simplify_stream_operations.patch, use_Map_ofEntries().patch, 
> use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch, 
> use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch,
>  use_String_join().patch, use_switch_for_readability.patch, 
> use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch
>
>
> This is a long-term issue for the task of improving code quality, using the 
> SonarQube report, hints in different IDEs, the FindBugs tool and other code 
> quality tools.
> This is a follow-up of PDFBOX-4892, which was getting too long.






[jira] [Commented] (PDFBOX-5660) Improve code quality (5)

2024-09-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878491#comment-17878491
 ] 

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920376 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920376 ]

PDFBOX-5660: update owasp plugin

> Improve code quality (5)
> 
>
> Key: PDFBOX-5660
> URL: https://issues.apache.org/jira/browse/PDFBOX-5660
> Project: PDFBox
> Issue Type: Improvement
> Reporter: Tilman Hausherr
> Priority: Minor
> Attachments: AnnotationSample.Standard.pdf, 
> DRY_refactoring_Typ2CharStringParser.patch, 
> Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch,
>  
> Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch,
>  Simplify_string_conversion_in_PDFHighlighter.patch, 
> Update_string_handling_and_regex_in_several_classes.patch, 
> avoid_multiple_unboxing.patch, code_cleanup.patch, 
> do_not_create_temporary_File_instance.patch, 
> extract_common_code,_move_toUpperCase()_out_of_loop.patch, 
> fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch, 
> introduce_COSArray_of(float___)_to_make_the_code_more_concise_and_avoid_creating_and_copyi.patch,
>  introduce_StringUtil_class_for_reusable_functionality.patch, 
> introduce_constants_COSFLOAT_ZERO_and_COSFloat_ONE_to_avoid_creating_unnecessary_instances.patch,
>  make_inner_class_static.patch, refactor_isEndOfName.patch, 
> remove_code_duplication_in_Type2CharStringParser.patch, 
> remove_obsolete_class_NullOutputStream.patch, 
> remove_unnecessary_calls_to_toString()_String_valueOf().patch, 
> replace_System_getProperty()_calls.patch, screenshot-1.png, 
> simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch,
>  simplify_stream_operations.patch, use_Map_ofEntries().patch, 
> use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch, 
> use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch,
>  use_String_join().patch, use_switch_for_readability.patch, 
> use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch
>
>
> This is a long-term issue for the task of improving code quality, using the 
> SonarQube report, hints in different IDEs, the FindBugs tool and other code 
> quality tools.
> This is a follow-up of PDFBOX-4892, which was getting too long.






[jira] [Commented] (PDFBOX-5660) Improve code quality (5)

2024-09-01 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878490#comment-17878490
 ] 

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 2ec6fce55444c6b137268ebb5957e0960d75276d in pdfbox-jbig2's branch 
refs/heads/master from Tilman Hausherr
[ https://gitbox.apache.org/repos/asf?p=pdfbox-jbig2.git;h=2ec6fce ]

PDFBOX-5660: update owasp plugin

> Improve code quality (5)
> 
>
> Key: PDFBOX-5660
> URL: https://issues.apache.org/jira/browse/PDFBOX-5660
> Project: PDFBox
> Issue Type: Improvement
> Reporter: Tilman Hausherr
> Priority: Minor
> Attachments: AnnotationSample.Standard.pdf, 
> DRY_refactoring_Typ2CharStringParser.patch, 
> Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch,
>  
> Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch,
>  Simplify_string_conversion_in_PDFHighlighter.patch, 
> Update_string_handling_and_regex_in_several_classes.patch, 
> avoid_multiple_unboxing.patch, code_cleanup.patch, 
> do_not_create_temporary_File_instance.patch, 
> extract_common_code,_move_toUpperCase()_out_of_loop.patch, 
> fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch, 
> introduce_COSArray_of(float___)_to_make_the_code_more_concise_and_avoid_creating_and_copyi.patch,
>  introduce_StringUtil_class_for_reusable_functionality.patch, 
> introduce_constants_COSFLOAT_ZERO_and_COSFloat_ONE_to_avoid_creating_unnecessary_instances.patch,
>  make_inner_class_static.patch, refactor_isEndOfName.patch, 
> remove_code_duplication_in_Type2CharStringParser.patch, 
> remove_obsolete_class_NullOutputStream.patch, 
> remove_unnecessary_calls_to_toString()_String_valueOf().patch, 
> replace_System_getProperty()_calls.patch, screenshot-1.png, 
> simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch,
>  simplify_stream_operations.patch, use_Map_ofEntries().patch, 
> use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch, 
> use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch,
>  use_String_join().patch, use_switch_for_readability.patch, 
> use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch
>
>
> This is a long-term issue for the task of improving code quality, using the 
> SonarQube report, hints in different IDEs, the FindBugs tool and other code 
> quality tools.
> This is a follow-up of PDFBOX-4892, which was getting too long.






[jira] [Commented] (PDFBOX-5873) Improve ExtractTTFFonts

2024-08-30 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878211#comment-17878211
 ] 

ASF subversion and git services commented on PDFBOX-5873:
-

Commit 1920306 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920306 ]

PDFBOX-5873: avoid NPE

> Improve ExtractTTFFonts
> ---
>
> Key: PDFBOX-5873
> URL: https://issues.apache.org/jira/browse/PDFBOX-5873
> Project: PDFBox
> Issue Type: Improvement
> Components: Utilities
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Minor
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>
> Add more places where resources exist; don't save fonts twice






[jira] [Commented] (PDFBOX-5873) Improve ExtractTTFFonts

2024-08-30 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878209#comment-17878209
 ] 

ASF subversion and git services commented on PDFBOX-5873:
-

Commit 1920304 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920304 ]

PDFBOX-5873: avoid NPE

> Improve ExtractTTFFonts
> ---
>
> Key: PDFBOX-5873
> URL: https://issues.apache.org/jira/browse/PDFBOX-5873
> Project: PDFBox
> Issue Type: Improvement
> Components: Utilities
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Minor
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>
> Add more places where resources exist; don't save fonts twice






[jira] [Updated] (PDFBOX-5875) using font data to process ligatures

2024-08-30 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5875:

Fix Version/s: (was: 3.0.4 PDFBox)

> using font data to process ligatures
> 
>
> Key: PDFBOX-5875
> URL: https://issues.apache.org/jira/browse/PDFBOX-5875
> Project: PDFBox
> Issue Type: New Feature
> Components: Parsing, PDModel, Text extraction
> Affects Versions: 3.0.3 PDFBox
> Reporter: Manish S N
> Priority: Major
> Labels: Asian, CIDFont, font, ligatures, unicodemapping
> Attachments: page.pdf
>
>
> To process ligatures from Asian languages (where a glyph is the combination 
> of two Unicode characters) using the data in embedded fonts.
>  
> *The problem:*
> Modern PDF creators put these ligatures in the /ActualText field, which we 
> only recently considered supporting in this issue. But this is not the case 
> in old PDFs with embedded CID fonts like [^page.pdf], where the ligature 
> glyphs lack a /ToUnicode mapping because no single Unicode code point exists 
> for them: each is a combination of more than one Unicode character. 
>  
> *The Potential Solution (if not perfect):* 
> I managed to extract the font files using PDFBox 
> ([code|https://gist.githubusercontent.com/incubated-geek-cc/640a74920b184274374af257cd1587bb/raw/c6fb02fa82f9883670d96b812bfe7f2f55b18125/Main.java]),
>  and when I viewed the font files in FontForge I found the ligature data 
> intact. So we can use this data to map the ligature glyphs to the Unicode 
> values of their constituent glyphs.
>  
> *Problems:*
> In some cases the constituent glyphs may not be present in the cmap at all, 
> having been removed by a PDF optimiser because they are never used directly 
> in the PDF apart from in ligatures. Such glyphs are empty, with only a glyph 
> ID and no /ToUnicode mapping, even if that particular glyph has a 
> corresponding Unicode character.
>  
> *The Hope:*
> This is not a common problem in large PDFs, and basic spell checkers could 
> easily rectify it. Some comprehension is better than no comprehension when 
> it comes to dealing with data. This would greatly improve the parsing of 
> non-Latin Asian languages.
>  
> (The PDF sample I attached is in Tamil.)






[jira] [Created] (PDFBOX-5875) using font data to process ligatures

2024-08-30 Thread Manish S N (Jira)
Manish S N created PDFBOX-5875:
--

 Summary: using font data to process ligatures
 Key: PDFBOX-5875
 URL: https://issues.apache.org/jira/browse/PDFBOX-5875
 Project: PDFBox
  Issue Type: New Feature
  Components: Parsing, PDModel, Text extraction
Affects Versions: 3.0.3 PDFBox
Reporter: Manish S N
 Fix For: 3.0.4 PDFBox
 Attachments: page.pdf

To process ligatures from Asian languages (where a glyph is the combination of 
two Unicode characters) using the data in embedded fonts.

 

*The problem:*

Modern PDF creators put these ligatures in the /ActualText field, which we 
only recently considered supporting in this issue. But this is not the case in 
old PDFs with embedded CID fonts like [^page.pdf], where the ligature glyphs 
lack a /ToUnicode mapping because no single Unicode code point exists for 
them: each is a combination of more than one Unicode character. 

 

*The Potential Solution (if not perfect):* 

I managed to extract the font files using PDFBox 
([code|https://gist.githubusercontent.com/incubated-geek-cc/640a74920b184274374af257cd1587bb/raw/c6fb02fa82f9883670d96b812bfe7f2f55b18125/Main.java]),
 and when I viewed the font files in FontForge I found the ligature data 
intact. So we can use this data to map the ligature glyphs to the Unicode 
values of their constituent glyphs.

 

*Problems:*

In some cases the constituent glyphs may not be present in the cmap at all, 
having been removed by a PDF optimiser because they are never used directly in 
the PDF apart from in ligatures. Such glyphs are empty, with only a glyph ID 
and no /ToUnicode mapping, even if that particular glyph has a corresponding 
Unicode character.
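
The mapping idea and its failure case can be sketched in a few lines of plain Java. This is only an illustration under assumptions, not PDFBox code: the class LigatureUnicodeSketch, the two maps and the glyph IDs are all hypothetical, and reading the ligature table out of the embedded font is left out entirely.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration class; not part of PDFBox.
public class LigatureUnicodeSketch {

    /**
     * Resolves a glyph to text.
     *
     * @param glyphToUnicode     glyph ID -> Unicode text, as found in the cmap (assumed input)
     * @param ligatureComponents ligature glyph ID -> component glyph IDs, as read
     *                           from the font's ligature data (assumed input)
     * @return the text, or empty if the glyph or one of its components has no mapping
     */
    static Optional<String> resolve(int glyphId,
                                    Map<Integer, String> glyphToUnicode,
                                    Map<Integer, List<Integer>> ligatureComponents) {
        String direct = glyphToUnicode.get(glyphId);
        if (direct != null) {
            return Optional.of(direct);
        }
        List<Integer> components = ligatureComponents.get(glyphId);
        if (components == null) {
            return Optional.empty();
        }
        StringBuilder text = new StringBuilder();
        for (int component : components) {
            String unicode = glyphToUnicode.get(component);
            if (unicode == null) {
                // The case described under "Problems": the component glyph
                // was dropped from the cmap, so the ligature stays unresolved.
                return Optional.empty();
            }
            text.append(unicode);
        }
        return Optional.of(text.toString());
    }

    public static void main(String[] args) {
        // Made-up glyph IDs: 10 = Tamil letter NA, 20 = Tamil vowel sign I.
        Map<Integer, String> cmap = Map.of(10, "\u0BA8", 20, "\u0BBF");
        Map<Integer, List<Integer>> ligatures = Map.of(
                151, List.of(10, 20),   // both components mapped
                150, List.of(10, 92));  // component 92 has no cmap entry
        System.out.println(resolve(151, cmap, ligatures).isPresent()); // prints true
        System.out.println(resolve(150, cmap, ligatures).isPresent()); // prints false
    }
}
```

The sketch does not recurse, so a ligature whose components are themselves ligatures would need one more step; returning an empty Optional keeps the unresolved case visible to the caller instead of emitting wrong text.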

 

*The Hope:*

This is not a common problem in large PDFs, and basic spell checkers could 
easily rectify it. Some comprehension is better than no comprehension when it 
comes to dealing with data. This would greatly improve the parsing of 
non-Latin Asian languages.

 

(The PDF sample I attached is in Tamil.)






[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

2024-08-30 Thread Manish S N (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878095#comment-17878095
 ] 

Manish S N commented on PDFBOX-5868:


the data is right there in the fonts, within reach

...

On second thought, yes, there are problems.

This approach assumes that every glyph that has a corresponding Unicode 
character is _present_ in the cmap, which isn't always true.

When looking at the ligature data for நூ, we see

!image-2024-08-30-17-55-41-423.png!

Here we can see glyph92 instead of the Unicode character for the dependent 
vowel ூ, which is not present in the cmap because that glyph itself is never 
used in the PDF. (All pure Tamil ligatures of ூ are irregular and have their 
own glyphs rather than being combined side by side like other dependent vowel 
glyphs, so the standalone glyph is never needed. Hence PDF optimizers will 
discard it along with its Unicode mapping.)

The problem is replicated by all ligatures of ூ (dependent vowel uu).

This is the case in languages like Tamil, but most other non-Latin languages 
may be fine. Hindi, for example, is more regular than Tamil when it comes to 
letters.

In the end there are also other problems, like a mangled cmap used as a 
method of obfuscation.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, image-2024-08-30-17-55-41-423.png, 
> multilingual_test.pdf, okular_out.txt, page.pdf, pdfbox_out.txt, 
> poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results.
>  * multilingual_test.pdf is the original PDF I made to test multilingual 
> text extraction.
>  * pdfbox_out.txt is the text file produced by PDFBox.
>  * adobe_out.txt is the text file created by Adobe Reader's "save as text" 
> feature.
>  
> Observation:
> As you can see in the attachments, the text file produced by PDFBox shows 
> unexpected Unicode characters for Tamil and Bengali (for Hindi the characters 
> are extracted but not overlapped; Japanese seems fine to me). In contrast, 
> the text file obtained from Adobe Reader's "save as text" feature seems fine, 
> and copying and pasting the text from my document viewer (Evince) also works.
> Questions:
>  # Why are the outputs from PDFBox and Adobe different?
>  # What can I do to extract the text from a multilingual PDF correctly?
>  # Is there a way to apply pattern matching to text in a PDF file and report 
> matches without extracting the text first? (say, if the problem is with fonts 
> and glyphs)
> —
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using 
> Apache Tika for parsing documents. I noticed a problem with extracted PDF 
> text (other file types parse fine). I used the executable PDFBox jar to 
> conclude that the _problem is in PDFBox and not in Tika_, and tested with 
> Adobe Reader's text extraction to confirm the problem is not with the PDF. I 
> want to extract this multilingual text only to run pattern matching on it; I 
> do not need to display the content, only to know whether the pattern is 
> present or not (say, if the problem is with fonts and glyphs).
>  






[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

2024-08-30 Thread Manish S N (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish S N updated PDFBOX-5868:
---
Attachment: image-2024-08-30-17-55-41-423.png

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, image-2024-08-30-17-55-41-423.png, 
> multilingual_test.pdf, okular_out.txt, page.pdf, pdfbox_out.txt, 
> poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results.
>  * multilingual_test.pdf is the original PDF I made to test multilingual 
> text extraction.
>  * pdfbox_out.txt is the text file produced by PDFBox.
>  * adobe_out.txt is the text file created by Adobe Reader's "save as text" 
> feature.
>  
> Observation:
> As you can see in the attachments, the text file produced by PDFBox shows 
> unexpected Unicode characters for Tamil and Bengali (for Hindi the characters 
> are extracted but not overlapped; Japanese seems fine to me). In contrast, 
> the text file obtained from Adobe Reader's "save as text" feature seems fine, 
> and copying and pasting the text from my document viewer (Evince) also works.
> Questions:
>  # Why are the outputs from PDFBox and Adobe different?
>  # What can I do to extract the text from a multilingual PDF correctly?
>  # Is there a way to apply pattern matching to text in a PDF file and report 
> matches without extracting the text first? (say, if the problem is with fonts 
> and glyphs)
> —
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using 
> Apache Tika for parsing documents. I noticed a problem with extracted PDF 
> text (other file types parse fine). I used the executable PDFBox jar to 
> conclude that the _problem is in PDFBox and not in Tika_, and tested with 
> Adobe Reader's text extraction to confirm the problem is not with the PDF. I 
> want to extract this multilingual text only to run pattern matching on it; I 
> do not need to display the content, only to know whether the pattern is 
> present or not (say, if the problem is with fonts and glyphs).
>  






[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

2024-08-30 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878089#comment-17878089
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Yes. But consider that Adobe didn't do it, and they're smarter than us; I just 
tried copy/paste and save as text. The ligature data in fonts is meant to be 
used when creating PDFs; I don't know whether it would work for extraction.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> page.pdf, pdfbox_out.txt, poppler_out.txt, screenshot-1.png, 
> screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results.
>  * multilingual_test.pdf is the original PDF I made to test multilingual 
> text extraction.
>  * pdfbox_out.txt is the text file produced by PDFBox.
>  * adobe_out.txt is the text file created by Adobe Reader's "save as text" 
> feature.
>  
> Observation:
> As you can see in the attachments, the text file produced by PDFBox shows 
> unexpected Unicode characters for Tamil and Bengali (for Hindi the characters 
> are extracted but not overlapped; Japanese seems fine to me). In contrast, 
> the text file obtained from Adobe Reader's "save as text" feature seems fine, 
> and copying and pasting the text from my document viewer (Evince) also works.
> Questions:
>  # Why are the outputs from PDFBox and Adobe different?
>  # What can I do to extract the text from a multilingual PDF correctly?
>  # Is there a way to apply pattern matching to text in a PDF file and report 
> matches without extracting the text first? (say, if the problem is with fonts 
> and glyphs)
> —
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using 
> Apache Tika for parsing documents. I noticed a problem with extracted PDF 
> text (other file types parse fine). I used the executable PDFBox jar to 
> conclude that the _problem is in PDFBox and not in Tika_, and tested with 
> Adobe Reader's text extraction to confirm the problem is not with the PDF. I 
> want to extract this multilingual text only to run pattern matching on it; I 
> do not need to display the content, only to know whether the pattern is 
> present or not (say, if the problem is with fonts and glyphs).
>  






[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

2024-08-30 Thread Manish S N (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878084#comment-17878084
 ] 

Manish S N commented on PDFBOX-5868:


So shall I open this as a feature/improvement type issue then?

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> page.pdf, pdfbox_out.txt, poppler_out.txt, screenshot-1.png, 
> screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results:
>  * multilingual_test.pdf is the original PDF I made to test multilingual 
> text extraction.
>  * pdfbox_out.txt is the text file produced by PDFBox.
>  * adobe_out.txt is the text file created by Adobe Reader's "Save as Text" 
> feature.
>  
> Observation:
> As you can see in the attachments, the text file produced by PDFBox shows weird 
> Unicode characters for Tamil and Bengali (for Hindi the characters are extracted 
> but not overlapped; Japanese seems fine to me). In contrast, the text file 
> obtained from Adobe Reader's "Save as Text" feature seems fine, and copy-pasting 
> the text from my document viewer (Evince) also works.
> Questions:
>  # Why are the outputs from PDFBox and Adobe different?
>  # What can I do to extract the text from a multilingual PDF correctly?
>  # Is there a way to apply pattern matching to text in a PDF file and report 
> matches without extracting the text first? (say, if the problem is with fonts 
> and glyphs)
> —
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using 
> Apache Tika for parsing documents. I noticed a problem with extracted PDF text 
> (other file types parse fine). I used the executable PDFBox jar to conclude that 
> the _problem is in PDFBox and not in Tika._ I tested with Adobe Reader's text 
> extraction to confirm the problem is not with the PDF. I want to extract this 
> multilingual text only to run pattern matching on it; I do not need to 
> display the content, only to know whether the pattern is present or not (say, 
> if the problem is with fonts and glyphs)
>  
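
For the use case described above (only checking whether a pattern is present, not displaying the text), a minimal sketch in plain Java could look like the following. It assumes the text has already been extracted by some tool; `PatternCheck` and `containsPattern` are hypothetical names, not part of PDFBox or Tika. Normalizing to NFC only helps when the same characters arrive in a different but canonically equivalent encoding; it cannot repair text whose glyphs were mapped to the wrong code points.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class PatternCheck {
    // Hypothetical helper: normalize already-extracted text to NFC, then
    // report whether the regex (assumed to be written in NFC) occurs in it.
    static boolean containsPattern(String extractedText, String regex) {
        String normalized = Normalizer.normalize(extractedText, Normalizer.Form.NFC);
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(normalized)
                      .find();
    }

    public static void main(String[] args) {
        // "e" followed by a combining acute accent becomes a single code point in NFC
        System.out.println(containsPattern("cafe\u0301 42", "caf\u00E9\\s+\\d+"));
    }
}
```

Only a yes/no result is returned here, which matches the stated need to know whether the pattern is present.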






[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

2024-08-30 Thread Manish S N (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878081#comment-17878081
 ] 

Manish S N commented on PDFBOX-5868:


{quote}this is a different problem
{quote}
Initially, yes, it was. Then upon closer inspection I saw a solution that relates 
to this existing problem ;)

 

P.S.: I know the cmap is incomplete and no library (including Adobe) can extract 
it. But should we follow only the cmap (like the other libraries XD)? I can open 
the extracted font in FontForge (View > Combinations) to find the Unicode 
combination for the glyphs that are ligatures, so we would no longer need to 
rely on ActualText data to process these ligatures.




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

2024-08-30 Thread Manish S N (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878077#comment-17878077
 ] 

Manish S N commented on PDFBOX-5868:


It is the modern word processors that put the actual Unicode of glyphs in 
ActualText tags, but in older PDFs with embedded CID fonts like [^page.pdf] 
there are no such tags and the ligatures are left without Unicode mappings 
(there is no single Unicode code point for these). But when I extracted the font 
files using PDFBox ( 
[code|https://gist.githubusercontent.com/incubated-geek-cc/640a74920b184274374af257cd1587bb/raw/c6fb02fa82f9883670d96b812bfe7f2f55b18125/Main.java]
 ) and analysed them with FontForge, I found that the font files do contain data 
about these ligatures.

So if we can process that data and assign these glyphs the Unicode combination 
for that ligature from the font, there is no need to worry about parsing 
ActualText, and I believe it would improve text extraction for non-Latin 
languages to a great extent. It would thus also solve your problem of 
ActualText being misused to prevent text extraction, [~tilman].
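
The idea above can be sketched roughly as follows. This is a stand-alone illustration in plain Java, not PDFBox code: `LigatureFallback`, `decode`, and the two maps are hypothetical names, and the glyph IDs and mappings are invented sample data. Each glyph is looked up in the (possibly incomplete) ToUnicode map first; only when that fails is a ligature table, assumed to have been recovered from the embedded font, consulted.

```java
import java.util.List;
import java.util.Map;

class LigatureFallback {
    // Hypothetical helper: resolve glyph IDs to text, preferring the ToUnicode
    // map and falling back to ligature components taken from the font.
    static String decode(List<Integer> glyphIds,
                         Map<Integer, String> toUnicode,
                         Map<Integer, String> ligatures) {
        StringBuilder sb = new StringBuilder();
        for (int gid : glyphIds) {
            String s = toUnicode.get(gid);
            if (s == null) {
                s = ligatures.get(gid);
            }
            // The replacement character marks glyphs that neither table resolves
            sb.append(s != null ? s : "\uFFFD");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Invented sample data: glyph 10 is TAMIL LETTER KA, glyph 42 is a
        // ligature of KA + VIRAMA + SSA, glyph 99 is unknown to both tables.
        Map<Integer, String> toUnicode = Map.of(10, "\u0B95");
        Map<Integer, String> ligatures = Map.of(42, "\u0B95\u0BCD\u0BB7");
        System.out.println(decode(List.of(10, 42, 99), toUnicode, ligatures));
    }
}
```

Consulting the ToUnicode map first keeps the existing behaviour for glyphs that are already mapped, so the fallback only affects glyphs that would otherwise have no text.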




[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

2024-08-30 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878076#comment-17878076
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/30/24 11:50 AM:
---

Please create a new ticket for the file you just added because this is a 
different problem (only if you manage to extract this properly from Adobe 
Reader).


was (Author: tilman):
Please create a new ticket for the file you just added because this is a 
different problem.




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

2024-08-30 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878076#comment-17878076
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Please create a new ticket for the file you just added because this is a 
different problem.




[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

2024-08-30 Thread Manish S N (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish S N updated PDFBOX-5868:
---
Attachment: page.pdf




[jira] [Commented] (PDFBOX-5660) Improve code quality (5)

2024-08-30 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878069#comment-17878069
 ] 

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920288 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920288 ]

PDFBOX-5660: update ant

> Improve code quality (5)
> 
>
> Key: PDFBOX-5660
> URL: https://issues.apache.org/jira/browse/PDFBOX-5660
> Project: PDFBox
> Issue Type: Improvement
> Reporter: Tilman Hausherr
> Priority: Minor
> Attachments: AnnotationSample.Standard.pdf, 
> DRY_refactoring_Typ2CharStringParser.patch, 
> Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch,
>  
> Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch,
>  Simplify_string_conversion_in_PDFHighlighter.patch, 
> Update_string_handling_and_regex_in_several_classes.patch, 
> avoid_multiple_unboxing.patch, code_cleanup.patch, 
> do_not_create_temporary_File_instance.patch, 
> extract_common_code,_move_toUpperCase()_out_of_loop.patch, 
> fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch, 
> introduce_COSArray_of(float___)_to_make_the_code_more_concise_and_avoid_creating_and_copyi.patch,
>  introduce_StringUtil_class_for_reusable_functionality.patch, 
> introduce_constants_COSFLOAT_ZERO_and_COSFloat_ONE_to_avoid_creating_unnecessary_instances.patch,
>  make_inner_class_static.patch, refactor_isEndOfName.patch, 
> remove_code_duplication_in_Type2CharStringParser.patch, 
> remove_obsolete_class_NullOutputStream.patch, 
> remove_unnecessary_calls_to_toString()_String_valueOf().patch, 
> replace_System_getProperty()_calls.patch, screenshot-1.png, 
> simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch,
>  simplify_stream_operations.patch, use_Map_ofEntries().patch, 
> use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch, 
> use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch,
>  use_String_join().patch, use_switch_for_readability.patch, 
> use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch
>
>
> This is a long-term issue for the task of improving code quality by using the 
> SonarQube report, hints in different IDEs, the FindBugs tool and other code 
> quality tools.
> This is a follow-up of PDFBOX-4892, which was getting too long.






[jira] [Commented] (PDFBOX-5660) Improve code quality (5)

2024-08-30 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878070#comment-17878070
 ] 

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920289 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920289 ]

PDFBOX-5660: update ant




[jira] [Commented] (PDFBOX-5660) Improve code quality (5)

2024-08-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877373#comment-17877373
 ] 

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920252 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920252 ]

PDFBOX-5660: update maven-plugin-annotations, mockito




[jira] [Resolved] (PDFBOX-5874) Change Loglevel from Warn to info when rebuilding font cache

2024-08-28 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/PDFBOX-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5874.
-
  Assignee: Tilman Hausherr
Resolution: Fixed

Thank you, you're right, there's no need to warn about something that harmless.

> Change Loglevel from Warn to info when rebuilding font cache
> 
>
> Key: PDFBOX-5874
> URL: https://issues.apache.org/jira/browse/PDFBOX-5874
> Project: PDFBox
> Issue Type: Improvement
> Components: PDModel
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Thomas Hoffmann
> Assignee: Tilman Hausherr
> Priority: Minor
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>
> We have a monitoring system for our logfiles and some people get notified 
> whenever there is an error or a warning in the logfiles.
> Due to OS updates, the fonts might be updated or changed. This triggers a 
> rebuild process within PDFBox. Unfortunately, the loglevel is set to Warning 
> and this triggers an alarm.
> The warnings occur in:
> org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java
> The logfile shows the following three entries:
> 2024-08-19T18:25:03.653+02:00 WARN FileSystemFontProvider: New fonts found, 
> font cache will be re-built
> 2024-08-19T18:25:03.654+02:00 WARN FileSystemFontProvider: Building on-disk 
> font cache, this may take a while
> 2024-08-19T18:25:04.105+02:00 WARN FileSystemFontProvider: Finished building 
> on-disk font cache, found 96 fonts
>  
> IMHO the message is informational and not necessarily a warning. It just 
> tells me that the cache is getting rebuilt.
> It would be great if you could consider setting these messages to info level.
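
The effect described above can be illustrated with a small stand-alone sketch. It uses java.util.logging purely for self-containment, which is not necessarily the logging API PDFBox itself uses; `LogLevelDemo`, `AlertingHandler` and the logger name are hypothetical. It only shows why a monitor that reacts to WARNING and above stays quiet once the same message is logged at INFO.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class LogLevelDemo {
    // Hypothetical stand-in for a log monitor: collects WARNING and above.
    static class AlertingHandler extends Handler {
        final List<String> alerts = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (record.getLevel().intValue() >= Level.WARNING.intValue()) {
                alerts.add(record.getMessage());
            }
        }

        @Override public void flush() { }
        @Override public void close() { }
    }

    // Logs the font cache message at the given level and returns the alerts raised.
    static List<String> alertsFor(Level level) {
        Logger log = Logger.getLogger("demo.FileSystemFontProvider." + level);
        log.setUseParentHandlers(false); // keep the demo output off the console
        log.setLevel(Level.ALL);
        AlertingHandler handler = new AlertingHandler();
        log.addHandler(handler);
        log.log(level, "New fonts found, font cache will be re-built");
        log.removeHandler(handler);
        return handler.alerts;
    }

    public static void main(String[] args) {
        System.out.println("alerts at WARNING: " + alertsFor(Level.WARNING).size());
        System.out.println("alerts at INFO: " + alertsFor(Level.INFO).size());
    }
}
```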






[jira] [Commented] (PDFBOX-5874) Change Loglevel from Warn to info when rebuilding font cache

2024-08-28 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877357#comment-17877357
 ] 

ASF subversion and git services commented on PDFBOX-5874:
-

Commit 1920251 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920251 ]

PDFBOX-5874: change Loglevel from warn to info when rebuilding font cache, as 
suggested by Thomas Hoffmann
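
The effect of this kind of change can be illustrated with a minimal, self-contained sketch. It uses java.util.logging rather than the logging facade PDFBox itself uses, and the class FontCacheLogSketch and its methods are hypothetical stand-ins, not PDFBox code. It emits the three messages quoted in the issue at a chosen level and counts how many a monitor that alerts on WARNING or above would flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical illustration only; not the actual FileSystemFontProvider code.
public class FontCacheLogSketch {
    private static final Logger LOG = Logger.getLogger("FontCacheLogSketch");

    // Stand-in for the cache rebuild: only the log calls matter here.
    static void rebuildCache(Level level, int fontCount) {
        LOG.log(level, "New fonts found, font cache will be re-built");
        LOG.log(level, "Building on-disk font cache, this may take a while");
        LOG.log(level, "Finished building on-disk font cache, found " + fontCount + " fonts");
    }

    // Counts records at WARNING or above, as an alerting monitor might.
    static int countAlerts(Runnable action) {
        List<LogRecord> alerts = new ArrayList<>();
        Handler handler = new Handler() {
            @Override
            public void publish(LogRecord record) {
                if (record.getLevel().intValue() >= Level.WARNING.intValue()) {
                    alerts.add(record);
                }
            }

            @Override
            public void flush() {
            }

            @Override
            public void close() {
            }
        };
        LOG.addHandler(handler);
        try {
            action.run();
        } finally {
            LOG.removeHandler(handler);
        }
        return alerts.size();
    }

    public static void main(String[] args) {
        // Before the change: every rebuild message is flagged.
        System.out.println("alerts at WARNING: " + countAlerts(() -> rebuildCache(Level.WARNING, 96)));
        // After the change: the same messages no longer trigger the monitor.
        System.out.println("alerts at INFO: " + countAlerts(() -> rebuildCache(Level.INFO, 96)));
    }
}
```

Under these assumptions the messages stay visible in the log at info level, but a monitor keyed on warnings and errors no longer fires when the font cache is rebuilt.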

> Change Loglevel from Warn to info when rebuilding font cache
> 
>
> Key: PDFBOX-5874
> URL: https://issues.apache.org/jira/browse/PDFBOX-5874
> Project: PDFBox
> Issue Type: Improvement
> Components: PDModel
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Thomas Hoffmann
> Priority: Minor
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>
> We have a monitoring system for our logfiles and some people get notified 
> whenever there is an error or a warning in the logfiles.
> Due to OS updates, the fonts might be updated or changed. This triggers a 
> rebuild process within PDFBox. Unfortunately, the loglevel is set to Warning 
> and this triggers an alarm.
> The warnings occur in:
> org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java
> The logfile shows the following three entries:
> 2024-08-19T18:25:03.653+02:00 WARN FileSystemFontProvider: New fonts found, 
> font cache will be re-built
> 2024-08-19T18:25:03.654+02:00 WARN FileSystemFontProvider: Building on-disk 
> font cache, this may take a while
> 2024-08-19T18:25:04.105+02:00 WARN FileSystemFontProvider: Finished building 
> on-disk font cache, found 96 fonts
>  
> Imho the message is more informational and not necessarily a warning. It just 
> gives me the information that the cache is getting rebuilt.
> It would be great if you could consider setting these messages to info level.





