[jira] [Comment Edited] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882327#comment-17882327 ]

Tilman Hausherr edited comment on PDFBOX-5879 at 9/17/24 9:08 AM:
------------------------------------------------------------------

I added a simple test for the rotationMagic feature because it turns out we didn't have any. However, this isn't a test of the fixed bug: creating a file for that would have been more difficult, and there is no risk that this fix gets reverted anyway.

was (Author: tilman):
I added a simple test for the feature because it turns out we didn't have any. However, this isn't a test of the fixed bug: creating a file for that would have been more difficult, and there is no risk that this fix gets reverted anyway.

> Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
> ----------------------------------------------------------------------
>
>                 Key: PDFBOX-5879
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5879
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.32, 3.0.3 PDFBox
>            Reporter: Gábor Stefanik
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>         Attachments: MVM_Aram_augusztus.pdf
>
>
> {code:java}
> java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic -i="MVM_Aram_augusztus.pdf" {code}
> fails with the following error:
> {code:java}
> java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
>     at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
>     at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
>     at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
>     at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>     at picocli.CommandLine.access$1500(CommandLine.java:148)
>     at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>     at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>     at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>     at picocli.CommandLine.execute(CommandLine.java:2174)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76) {code}
> The same command succeeds in 3.0.2.
> The triggering PDF can be downloaded from [https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf,] and is also attached.
> The root cause appears to be this change: [https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2] from PDFBOX-5841

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
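Editorial note: the thread does not show the code at ExtractText.java:336, so what follows is only a sketch of the general failure mode named in the exception, under the assumption that an entry was cast to an array type without first resolving an indirect reference. The types `Base`, `Stream`, `Array` and `Ref` are hypothetical stand-ins for the roles of a base object, a content stream, an array and an indirect object; they are not PDFBox classes, and the method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: hypothetical stand-in types, not the PDFBox object model. */
public class ContentsCastSketch {
    /** Common base type of all objects in this sketch. */
    public interface Base {}

    /** A single content stream. */
    public static final class Stream implements Base {
        public final String data;
        public Stream(String data) { this.data = data; }
    }

    /** An array of entries. */
    public static final class Array implements Base {
        public final List<Base> items = new ArrayList<>();
        public Array add(Base b) { items.add(b); return this; }
    }

    /** An indirect reference that wraps another object. */
    public static final class Ref implements Base {
        public final Base target;
        public Ref(Base target) { this.target = target; }
    }

    /** Follows indirect references until a direct object (or null) is reached. */
    public static Base resolve(Base b) {
        while (b instanceof Ref) {
            b = ((Ref) b).target;
        }
        return b;
    }

    /** Unchecked variant: throws ClassCastException unless given a direct Array. */
    public static int countStreamsUnchecked(Base contents) {
        return ((Array) contents).items.size();
    }

    /** Checked variant: resolves references and tests the type before casting. */
    public static int countStreams(Base contents) {
        Base direct = resolve(contents);
        if (direct instanceof Array) {
            return ((Array) direct).items.size();
        }
        return direct instanceof Stream ? 1 : 0;
    }

    public static void main(String[] args) {
        // Two content streams held in an array that is reached through a reference.
        Base contents = new Ref(new Array().add(new Stream("BT")).add(new Stream("ET")));
        System.out.println(countStreams(contents)); // prints 2
        try {
            countStreamsUnchecked(contents);
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed"); // this line is printed
        }
    }
}
```

The checked variant also accepts a single stream or a missing entry, which is why it tests the type instead of assuming an array. Whether the actual fix in r1920728 and r1920729 takes this shape is not stated in the thread.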
[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882326#comment-17882326 ]

ASF subversion and git services commented on PDFBOX-5879:
---------------------------------------------------------

Commit 1920739 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920739 ]

PDFBOX-5879: remove test message
[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882325#comment-17882325 ]

ASF subversion and git services commented on PDFBOX-5879:
---------------------------------------------------------

Commit 1920738 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920738 ]

PDFBOX-5879: remove test message
[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882324#comment-17882324 ]

ASF subversion and git services commented on PDFBOX-5879:
---------------------------------------------------------

Commit 1920737 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920737 ]

PDFBOX-5879: add test for rotationMagic
[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882322#comment-17882322 ]

ASF subversion and git services commented on PDFBOX-5879:
---------------------------------------------------------

Commit 1920736 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920736 ]

PDFBOX-5879: add test for rotationMagic
[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882318#comment-17882318 ]

ASF subversion and git services commented on PDFBOX-5879:
---------------------------------------------------------

Commit 1920735 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920735 ]

PDFBOX-5879: add test for rotationMagic
[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882299#comment-17882299 ]

ASF subversion and git services commented on PDFBOX-5879:
---------------------------------------------------------

Commit 1920732 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920732 ]

PDFBOX-5879: remove unused import
[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882298#comment-17882298 ]

ASF subversion and git services commented on PDFBOX-5879:
---------------------------------------------------------

Commit 1920731 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920731 ]

PDFBOX-5879: remove unused import
[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882297#comment-17882297 ]

ASF subversion and git services commented on PDFBOX-5879:
---------------------------------------------------------

Commit 1920730 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920730 ]

PDFBOX-5879: remove unused import
[jira] [Resolved] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-5879.
-------------------------------------
    Fix Version/s: 2.0.33
                   3.0.4 PDFBox
                   4.0.0
         Assignee: Tilman Hausherr
       Resolution: Fixed

Thank you. It's not the commit, it's poor programming that got exposed because of the commit.
[jira] [Updated] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5879:
------------------------------------
    Affects Version/s: 2.0.32
[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882284#comment-17882284 ]

ASF subversion and git services commented on PDFBOX-5879:
---------------------------------------------------------

Commit 1920729 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920729 ]

PDFBOX-5879: avoid ClassCastException

> Regression from PDFBOX-5841: Text extraction with rotation magic fails for
> PDF with multiple content streams in a page
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-5879
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5879
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.3 PDFBox
>            Reporter: Gábor Stefanik
>            Priority: Major
>         Attachments: MVM_Aram_augusztus.pdf
>
>
> {code:java}
> java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic -i="MVM_Aram_augusztus.pdf"
> {code}
> fails with the following error:
> {code:java}
> java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be
> cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject
> and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
>         at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>         at picocli.CommandLine.access$1500(CommandLine.java:148)
>         at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>         at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>         at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>         at picocli.CommandLine.execute(CommandLine.java:2174)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
> {code}
> The same command succeeds in 3.0.2.
> The triggering PDF can be downloaded from
> [https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf],
> and is also attached.
> The root cause appears to be this change:
> [https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2]
> from PDFBOX-5841

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
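One plausible reading of the trace above is an unchecked cast on a page entry that can be stored either directly or as an indirect reference (a page's /Contents may be a single stream or an array of streams). The following is a minimal, self-contained sketch of that failure mode and of a defensive alternative. It does not use PDFBox: `Base`, `Indirect`, `StreamArray`, `castDirectly` and `resolveArray` are hypothetical names invented for illustration, and the actual change in the commits referenced here may look different.

```java
import java.util.List;

// Hypothetical model, NOT PDFBox code: an entry is either an array of
// streams or an indirect reference that points at another entry.
final class ContentsSketch {
    interface Base {}

    static final class StreamArray implements Base {
        final List<String> streams;
        StreamArray(List<String> streams) { this.streams = streams; }
    }

    static final class Indirect implements Base {
        final Base target;
        Indirect(Base target) { this.target = target; }
    }

    // Unsafe: assumes the entry is stored directly as an array.
    static StreamArray castDirectly(Base entry) {
        return (StreamArray) entry; // ClassCastException if entry is Indirect
    }

    // Defensive: follow indirect references first, then check the type.
    // Returns null when the resolved entry is not an array.
    static StreamArray resolveArray(Base entry) {
        Base current = entry;
        while (current instanceof Indirect) {
            current = ((Indirect) current).target;
        }
        return current instanceof StreamArray ? (StreamArray) current : null;
    }

    public static void main(String[] args) {
        Base entry = new Indirect(new StreamArray(List.of("stream 1", "stream 2")));
        try {
            castDirectly(entry);
        } catch (ClassCastException e) {
            System.out.println("direct cast failed: " + e.getClass().getSimpleName());
        }
        System.out.println("resolved streams: " + resolveArray(entry).streams.size());
    }
}
```

The point of the sketch is only the order of operations: dereference first, type-check second, and treat "not an array" as a normal outcome rather than an exception.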
[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882282#comment-17882282 ]

ASF subversion and git services commented on PDFBOX-5879:
---------------------------------------------------------

Commit 1920728 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920728 ]

PDFBOX-5879: avoid ClassCastException
[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882281#comment-17882281 ]

ASF subversion and git services commented on PDFBOX-5879:
---------------------------------------------------------

Commit 1920727 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920727 ]

PDFBOX-5879: avoid ClassCastException
[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882260#comment-17882260 ]

ASF subversion and git services commented on PDFBOX-5852:
---------------------------------------------------------

Commit 1920726 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920726 ]

PDFBOX-5852: deprecate IntPoint

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -----------------------------------------------------------------
>
>                 Key: PDFBOX-5852
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5852
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Rendering
>    Affects Versions: 2.0.28
>            Reporter: Larry Lynn
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 3.0.3 PDFBox, 4.0.0
>
>         Attachments: minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to
> images when the PDF contains type 4 shading. This is especially noticeable
> when the conversion is done with a high DPI. Can this be improved?
>
> Conversation from the PDFBox users mailing list follows
> Initial email:
> {quote}
> Hi CPU and memory usage when converting a PDF with type 4 shading
> Hello PDFBox users and maintainers,
> We have a PDF that causes performance problems when we use PDFBox to
> convert it to an image with renderImageWithDPI(). We're calling
> renderImageWithDPI() with 650 DPI. I realize this is a very high value -
> we're using it for high fidelity original images that will later be
> downsampled. On my work laptop which has fairly strong hardware, the
> conversion takes 25 minutes and consumes 20GB of memory. CPU and memory
> usage is reduced if we use a lower DPI.
> The PDF is 1 page long. It contains type 4 shading / Gouraud free form
> triangle meshes. We've been aware of some performance issues with type 4
> shading for a little while now, but the PDFs that contained the type 4
> shading belonged to our customers and we were not authorized to share
> them. We finally found a problem input document that is non-sensitive and
> that we are authorized to share. I've attached a copy of the problem PDF
> to this email.
> I searched the archives for the users and the developers mailing list and I
> didn't find anything specifically about this issue.
> I searched through the PDFBox jira tickets and I found a couple of tickets
> that looked similar: PDFBOX-2901 & PDFBOX-4491. PDFBOX-2901 seems to most
> closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
> and our issue still reproduces with PDFBox 2.0.28.
> Should I refer this issue over to the developers mailing list or create a
> PDFBox Jira ticket for this?
> Thanks and Regards,
> Larry Lynn
> {quote}
> Response:
> {quote}
> Hi,
> Yes shading can be very slow, especially at high dpi. The attachment
> didn't get through, please upload to a sharehoster or create a ticket.
> If you need to register then add a meaningful text, e.g. the subject of
> this post so we know you're not a spammer. Also retry with 2.0.31 and
> 3.0.2 just to be sure. However I'm pessimistic that this can be fixed.
> Tilman
> {quote}
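The remark in the quoted email that CPU and memory usage drop at a lower DPI follows from how raster size scales: default PDF user space has 72 units per inch, so each pixel dimension grows linearly with DPI and the pixel count grows with its square. A rough sketch of that arithmetic is below. The page size (612 x 792 points, US Letter) is an assumption, since the ticket does not state the dimensions of the attached file, and `DpiScalingSketch` and `pixels` are hypothetical names. The figures cover only the output raster, not whatever intermediate data a renderer keeps for shading, so they are not meant to explain the 20GB reported.

```java
// Back-of-the-envelope raster size at different DPI values.
final class DpiScalingSketch {
    // Default PDF user space: 72 units per inch.
    static final double POINTS_PER_INCH = 72.0;

    // Pixel count of a page rendered at the given DPI; each dimension is
    // rounded up here for simplicity.
    static long pixels(double widthPt, double heightPt, double dpi) {
        long w = (long) Math.ceil(widthPt / POINTS_PER_INCH * dpi);
        long h = (long) Math.ceil(heightPt / POINTS_PER_INCH * dpi);
        return w * h;
    }

    public static void main(String[] args) {
        double widthPt = 612;  // assumed page width (US Letter)
        double heightPt = 792; // assumed page height (US Letter)
        for (double dpi : new double[] {72, 300, 650}) {
            long px = pixels(widthPt, heightPt, dpi);
            long mib = px * 4 / (1024 * 1024); // 4 bytes per pixel, e.g. ARGB
            System.out.printf("%4.0f DPI: %,d pixels, about %,d MiB%n", dpi, px, mib);
        }
    }
}
```

Under these assumptions, going from 300 to 650 DPI multiplies the pixel count by (650/300)^2, a little under 4.7, so any per-pixel work or storage grows by the same factor.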
[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882258#comment-17882258 ]

ASF subversion and git services commented on PDFBOX-5852:
---------------------------------------------------------

Commit 1920725 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920725 ]

PDFBOX-5852: replace IntPoint with Point
[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882256#comment-17882256 ]

ASF subversion and git services commented on PDFBOX-5852:
---------------------------------------------------------

Commit 1920724 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920724 ]

PDFBOX-5852: replace Map with a two-dimensional array
[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882254#comment-17882254 ]

Andreas Lehmkühler commented on PDFBOX-5852:
--------------------------------------------

[~tilman] thanks for the fast feedback. BTW, you gave a valuable hint yourself where to look for a possible optimization. :-)

I'm going to add the changes to the other branches as well
[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882252#comment-17882252 ]

ASF subversion and git services commented on PDFBOX-5852:
---------------------------------------------------------

Commit 1920723 from le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920723 ]

PDFBOX-5852: remove no longer needed class
[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882240#comment-17882240 ]

Tilman Hausherr commented on PDFBOX-5852:
-----------------------------------------

Wow! No regressions.
[jira] [Comment Edited] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882144#comment-17882144 ]

Andreas Lehmkühler edited comment on PDFBOX-5852 at 9/16/24 5:59 PM:
---------------------------------------------------------------------

I've replaced the map for the pre-calculated color values with a two-dimensional array. It speeds up rendering by a factor of 2 at 100%, a factor of 5 at 200% and a factor of 10 at 400% using the pdf debugger. I guess there is some positive effect concerning the memory consumption as well, but not as clearly as the effect on the rendering time.

[~tilman] please run some rendering tests to be sure there aren't any regressions. I'm going to backport the changes to 3.0.x and 2.0.x if everything works as expected.

was (Author: lehmi):
I've replace the Map for the pre-calculated color values with a two-dimensional array. It speeds up rendering by a factor 2 at 100%, a factor of 5 at 200% and a factor of 10 at 400% using the pdf debugger. I guess there is some positive effect concerning the memory consumption as well, but not as clearly as the effect on the rendering time

[~tilman] please run some rendering tests to be sure there aren't any regressions. I'm going to backport the changes to 3.0.x and 2.0.x if everything works as expected
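The change described in this comment, swapping a map of pre-calculated color values for a two-dimensional array, can be illustrated in isolation. The sketch below is not the PDFBox code: `PixelTableSketch`, `covered`, `color`, `key`, the packed `Long` key and the `EMPTY` sentinel are all assumptions made for the example. It stores 24-bit RGB values so that -1 can safely mark "no value", and it only demonstrates that both structures answer the same lookups.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Two ways to keep one pre-calculated color per covered pixel.
final class PixelTableSketch {
    static final int EMPTY = -1; // sentinel; valid colors here are 0..0xFFFFFF

    // Hypothetical coverage test: pixels below the anti-diagonal.
    static boolean covered(int x, int y, int width) {
        return x + y < width;
    }

    // Hypothetical 24-bit RGB value derived from the coordinates.
    static int color(int x, int y) {
        return ((x & 0xFF) << 16) | ((y & 0xFF) << 8);
    }

    // Packs a coordinate pair into one map key.
    static long key(int x, int y) {
        return ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    // Map variant: one boxed key and one boxed value per covered pixel,
    // plus a hash computation on every lookup.
    static Map<Long, Integer> buildMap(int width, int height) {
        Map<Long, Integer> map = new HashMap<>();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (covered(x, y, width)) {
                    map.put(key(x, y), color(x, y));
                }
            }
        }
        return map;
    }

    // Array variant: plain indexed access, no boxing, but the full
    // width x height rectangle is allocated even if few pixels are covered.
    static int[][] buildArray(int width, int height) {
        int[][] table = new int[height][width];
        for (int y = 0; y < height; y++) {
            Arrays.fill(table[y], EMPTY);
            for (int x = 0; x < width; x++) {
                if (covered(x, y, width)) {
                    table[y][x] = color(x, y);
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<Long, Integer> map = buildMap(200, 200);
        int[][] table = buildArray(200, 200);
        System.out.println("map entries: " + map.size());
        System.out.println("same value at (10, 20): "
                + (map.get(key(10, 20)) == table[20][10]));
    }
}
```

The speed-up factors quoted in the comment are the author's measurements on PDFBox itself; this sketch makes no timing claim and only shows the trade-off between per-entry object overhead and a dense, possibly sparse-filled array.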
[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882144#comment-17882144 ]

Andreas Lehmkühler commented on PDFBOX-5852:

I've replaced the Map for the pre-calculated color values with a two-dimensional array. Using the pdf debugger, this speeds up rendering by a factor of 2 at 100%, a factor of 5 at 200% and a factor of 10 at 400%. I guess there is some positive effect on memory consumption as well, but it is not as clear as the effect on rendering time.

[~tilman] please run some rendering tests to be sure there aren't any regressions. I'm going to backport the changes to 3.0.x and 2.0.x if everything works as expected.

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -----------------------------------------------------------------
>
>                 Key: PDFBOX-5852
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5852
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Rendering
>    Affects Versions: 2.0.28
>            Reporter: Larry Lynn
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 3.0.3 PDFBox, 4.0.0
>
>         Attachments: minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to
> images when the PDF contains type 4 shading. This is especially noticeable
> when the conversion is done with a high DPI. Can this be improved?
>
> Conversation from the PDFBox users mailing list follows
> Initial email:
> {quote}
> Hi CPU and memory usage when converting a PDF with type 4 shading
> Hello PDFBox users and maintainers,
> We have a PDF that causes performance problems when we use PDFBox to
> convert it to an image with renderImageWithDPI(). We're calling
> renderImageWithDPI()
> with 650 DPI. I realize this is a very high value - we're using it for
> high fidelity original images that will later be downsampled. On my work
> laptop which has fairly strong hardware, the conversion takes 25 minutes
> and consumes 20GB of memory. CPU and memory usage is reduced if we use a
> lower DPI.
> The PDF is 1 page long. It contains type 4 shading / Gouraud free form
> triangle meshes. We've been aware of some performance issues with type 4
> shading for a little while now, but the PDFs that contained the type 4
> shading belonged to our customers and we were not authorized to share
> them. We finally found a problem input document that is non-sensitive and
> that we are authorized to share. I've attached a copy of the problem PDF
> to this email.
> I searched the archives for the users and the developers mailing list and I
> didn't find anything specifically about this issue.
> I searched through the PDFBox jira tickets and I found a couple of tickets
> that looked similar: PDFBOX-2901 & PDFBOX-4491. PDFBOX-2901 seems to most
> closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
> and our issue still reproduces with PDFBox 2.0.28.
> Should I refer this issue over to the developers mailing list or create a
> PDFBox Jira ticket for this?
> Thanks and Regards,
> Larry Lynn
> {quote}
> Response:
> {quote}
> Hi,
> Yes shading can be very slow, especially at high dpi. The attachment
> didn't get through, please upload to a sharehoster or create a ticket.
> If you need to register then add a meaningful text, e.g. the subject of
> this post so we know you're not a spammer. Also retry with 2.0.31 and
> 3.0.2 just to be sure. However I'm pessimistic that this can be fixed.
> Tilman
> {quote}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
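For a sense of scale on the 650 DPI figure in the report above: PDF user space uses 72 units per inch by default, so the output raster grows with (dpi / 72) in each direction. The sketch below only does that arithmetic. The page size is an assumption (US Letter, 612 x 792 points), since the report does not give the dimensions of minimal.pdf, and the class and method names are made up for illustration; PDFBox's own rounding may differ slightly.

```java
// Rough raster-size arithmetic for rendering a page at a given DPI.
// Assumption: a US Letter page (612 x 792 pt); the real page size is not stated in the report.
final class RenderSizeEstimate {
    static final double POINTS_PER_INCH = 72.0;

    // Number of pixels along one side for a length given in points.
    static long pixels(double points, double dpi) {
        return Math.round(points * dpi / POINTS_PER_INCH);
    }

    // Bytes needed for the raster alone (no intermediate data structures).
    static long rasterBytes(double widthPt, double heightPt, double dpi, int bytesPerPixel) {
        return pixels(widthPt, dpi) * pixels(heightPt, dpi) * bytesPerPixel;
    }

    public static void main(String[] args) {
        double dpi = 650;
        long w = pixels(612, dpi);
        long h = pixels(792, dpi);
        long bytes = rasterBytes(612, 792, dpi, 4); // 4 bytes per pixel, e.g. ARGB
        System.out.println(w + " x " + h + " px, " + bytes + " bytes");
    }
}
```

Under these assumptions the raster itself is on the order of 150 MB, far below the reported 20GB, which suggests that most of the memory goes into intermediate data rather than the output image.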
[jira] [Updated] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-5852: --- Fix Version/s: 3.0.3 PDFBox
[jira] [Updated] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-5852: --- Fix Version/s: 4.0.0
[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882140#comment-17882140 ] ASF subversion and git services commented on PDFBOX-5852: - Commit 1920718 from le...@apache.org in branch 'pdfbox/trunk' [ https://svn.apache.org/r1920718 ] PDFBOX-5852: replace Map with a two-dimensional array
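To illustrate the kind of change the commit message describes (a Map of pre-calculated color values replaced by a two-dimensional array), here is a minimal sketch. It is not the PDFBox code: the class, the method names, the `covered` and `color` placeholders and the `NO_COLOR` sentinel are all assumptions made for the example, and colors are taken to be 24-bit RGB so that -1 can mean "no value".

```java
import java.awt.Point;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the same per-pixel lookup table, once as a Map and once as int[][].
final class PixelTableSketch {
    static final int NO_COLOR = -1; // sentinel; safe because colors here are 24-bit RGB

    // Placeholder for "is this pixel inside the shaded area" (a simple triangle here).
    static boolean covered(int x, int y) {
        return x <= y;
    }

    // Placeholder for the pre-calculated color of a pixel.
    static int color(int x, int y) {
        return ((x & 0xFF) << 16) | ((y & 0xFF) << 8);
    }

    // Map variant: one Point key and one boxed Integer per covered pixel.
    static Map<Point, Integer> buildMap(int width, int height) {
        Map<Point, Integer> table = new HashMap<>();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (covered(x, y)) {
                    table.put(new Point(x, y), color(x, y));
                }
            }
        }
        return table;
    }

    // Array variant: one int per pixel of the bounding box, covered or not.
    static int[][] buildArray(int width, int height) {
        int[][] table = new int[height][width];
        for (int[] row : table) {
            Arrays.fill(row, NO_COLOR);
        }
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (covered(x, y)) {
                    table[y][x] = color(x, y);
                }
            }
        }
        return table;
    }

    // Map lookup allocates a key, hashes it and unboxes the result.
    static int lookup(Map<Point, Integer> table, int x, int y) {
        Integer value = table.get(new Point(x, y));
        return value == null ? NO_COLOR : value;
    }

    // Array lookup is a bounds check plus two index operations.
    static int lookup(int[][] table, int x, int y) {
        if (y < 0 || y >= table.length || x < 0 || x >= table[y].length) {
            return NO_COLOR;
        }
        return table[y][x];
    }

    public static void main(String[] args) {
        Map<Point, Integer> map = buildMap(4, 4);
        int[][] array = buildArray(4, 4);
        System.out.println(lookup(map, 0, 1) == lookup(array, 0, 1));
    }
}
```

The array avoids key allocation, hashing and boxing on every pixel access, at the cost of reserving a slot for every pixel of the bounding box. Whether that trade-off matches the actual commit is something I have not verified.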
[jira] [Created] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
Gábor Stefanik created PDFBOX-5879:

             Summary: Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
                 Key: PDFBOX-5879
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5879
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 3.0.3 PDFBox
            Reporter: Gábor Stefanik
         Attachments: MVM_Aram_augusztus.pdf

{code:java}
java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic -i="MVM_Aram_augusztus.pdf"
{code}
fails with the following error:
{code:java}
java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
	at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
	at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
	at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
	at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
	at picocli.CommandLine.access$1500(CommandLine.java:148)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
	at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
	at picocli.CommandLine.execute(CommandLine.java:2174)
	at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
{code}
The same command succeeds in 3.0.2.

The triggering PDF can be downloaded from [https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf], and is also attached.

The root cause appears to be this change: [https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2] from PDFBOX-5841
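The stack trace above shows a COSObject being cast to a COSArray. In PDFBox a COSObject wraps an indirect object, and in PDF a page's /Contents entry may be a single stream or an array of streams, either of which can sit behind an indirect reference. A plausible reading, which I have not checked against the linked commit or the eventual fix, is that an entry was cast without being resolved first. The sketch below shows the "resolve, then check the type" pattern with stand-in types; `Base`, `Stream`, `Array` and `Ref` are hypothetical and are not PDFBox classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for PDF objects; these are not the PDFBox COS classes.
final class ContentsSketch {
    interface Base { }

    static final class Stream implements Base {
        final String data;
        Stream(String data) { this.data = data; }
    }

    static final class Array implements Base {
        final List<Base> items;
        Array(Base... items) { this.items = Arrays.asList(items); }
    }

    // Indirect reference: must be resolved before the target's type is known.
    static final class Ref implements Base {
        final Base target;
        Ref(Base target) { this.target = target; }
    }

    // Follow indirect references until a direct object is reached.
    static Base resolve(Base object) {
        while (object instanceof Ref) {
            object = ((Ref) object).target;
        }
        return object;
    }

    // Collect content streams whether the entry is a stream, an array, or a reference to either.
    static List<Stream> contentStreams(Base contents) {
        List<Stream> streams = new ArrayList<>();
        Base resolved = resolve(contents);
        if (resolved instanceof Stream) {
            streams.add((Stream) resolved);
        } else if (resolved instanceof Array) {
            for (Base item : ((Array) resolved).items) {
                Base element = resolve(item);
                if (element instanceof Stream) {
                    streams.add((Stream) element);
                }
            }
        }
        return streams;
    }

    public static void main(String[] args) {
        Base contents = new Ref(new Array(new Ref(new Stream("first")), new Stream("second")));
        System.out.println(contentStreams(contents).size());
    }
}
```

Casting `contents` straight to `Array` in this model would throw a ClassCastException as soon as the entry is a `Ref`, which mirrors the shape of the reported failure.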
[jira] [Updated] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5852: Description: We've observed excessive CPU and memory consumption when converting a PDF to images when the PDF contains type 4 shading. This is especially noticeable when the conversion is done with a high DPI. Can this be improved? Conversation from the PDFBox users mailing list follows Initial email: {quote} Hi CPU and memory usage when converting a PDF with type 4 shadingHello PDFBox users and maintainers, We have a PDF that causes performance problems when we use PDFBox to convert it to an image with renderImageWithDPI(). We're calling renderImageWithDPI() with 650 DPI. I realize this is a very high value - we're using it for high fidelity original images that will later be downsampled. On my work laptop which has fairly strong hardware, the conversion takes 25 minutes and consumes 20GB of memory. CPU and memory usage is reduced if we use a lower DPI. The PDF is 1 page long. It contains type 4 shading / Gouraud free form triangle meshes. We've been aware of some performance issues with type 4 shading for a little while now, but the PDFs that contained the type 4 shading belonged to our customers and we were not authorized to share them. We finally found a problem input document that is non-sensitive and that we are authorized to share. I've attached a copy of the problem PDF to this email. I searched the archives for the users and the developers mailing list and I didn't find anything specifically about this issue. I searched through the PDFBox jira tickets and I found a couple of tickets that looked similar: PDFBOX-2901 & PDFBOX-4491. PDFBOX-2901 seems to most closely describe what we're seeing, but that was closed in PDFBox 2.0.0, and our issue still reproduces with PDFBox 2.0.28. Should I refer this issue over to the developers mailing list or create a PDFBox Jira ticket for this? 
Thanks and Regards, Larry Lynn {quote} Response: {quote} Hi, Yes shading can be very slow, especially at high dpi. The attachment didn't get through, please upload to a sharehoster or create a ticket. If you need to register then add a meaningful text, e.g. the subject of this post so we know you're not a spammer. Also retry with 2.0.31 and 3.0.2 just to be sure. However I'm pessimistic that this can be fixed. Tilman {quote} was: We've observed excessive CPU and memory consumption when converting a PDF to images when the PDF contains type 4 shading. This is especially noticeable when the conversion is done with a high DPI. Can this be improved? Conversation from the PDFBox users mailing list follows Initial email: {code:java} Hi CPU and memory usage when converting a PDF with type 4 shadingHello PDFBox users and maintainers, We have a PDF that causes performance problems when we use PDFBox to convert it to an image with renderImageWithDPI(). We're calling renderImageWithDPI() with 650 DPI. I realize this is a very high value - we're using it for high fidelity original images that will later be downsampled. On my work laptop which has fairly strong hardware, the conversion takes 25 minutes and consumes 20GB of memory. CPU and memory usage is reduced if we use a lower DPI. The PDF is 1 page long. It contains type 4 shading / Gouraud free form triangle meshes. We've been aware of some performance issues with type 4 shading for a little while now, but the PDFs that contained the type 4 shading belonged to our customers and we were not authorized to share them. We finally found a problem input document that is non-sensitive and that we are authorized to share. I've attached a copy of the problem PDF to this email. I searched the archives for the users and the developers mailing list and I didn't find anything specifically about this issue. I searched through the PDFBox jira tickets and I found a couple of tickets that looked similar: PDFBOX-2901 & PDFBOX-4491. 
PDFBOX-2901 seems to most closely describe what we're seeing, but that was closed in PDFBox 2.0.0, and our issue still reproduces with PDFBox 2.0.28. Should I refer this issue over to the developers mailing list or create a PDFBox Jira ticket for this? Thanks and Regards, Larry Lynn {code} Response: {code:java} Hi, Yes shading can be very slow, especially at high dpi. The attachment didn't get through, please upload to a sharehoster or create a ticket. If you need to register then add a meaningful text, e.g. the subject of this post so we know you're not a spammer. Also retry with 2.0.31 and 3.0.2 just to be sure. However I'm pessimistic that this can be fixed. Tilman {code} > Hi CPU and memory usage when converting a PDF with type 4 shading > --
[jira] [Resolved] (PDFBOX-5469) Make COSString immutable
[ https://issues.apache.org/jira/browse/PDFBOX-5469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler resolved PDFBOX-5469. Resolution: Fixed Set to resolved > Make COSString immutable > > > Key: PDFBOX-5469 > URL: https://issues.apache.org/jira/browse/PDFBOX-5469 > Project: PDFBox > Issue Type: Improvement > Components: Parsing, PDModel >Affects Versions: 2.0.26, 3.0.0 PDFBox >Reporter: Andreas Lehmkühler >Assignee: Andreas Lehmkühler >Priority: Major > Fix For: 4.0.0 > > > We should change COSString to be immutable as discussed in PDFBOX-5451.
[jira] [Commented] (PDFBOX-5469) Make COSString immutable
[ https://issues.apache.org/jira/browse/PDFBOX-5469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881856#comment-17881856 ] ASF subversion and git services commented on PDFBOX-5469: - Commit 1920693 from le...@apache.org in branch 'pdfbox/branches/3.0' [ https://svn.apache.org/r1920693 ] PDFBOX-5469: use new constructor to avoid deprecated setter
[jira] [Commented] (PDFBOX-5469) Make COSString immutable
[ https://issues.apache.org/jira/browse/PDFBOX-5469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881855#comment-17881855 ] ASF subversion and git services commented on PDFBOX-5469: - Commit 1920692 from le...@apache.org in branch 'pdfbox/branches/3.0' [ https://svn.apache.org/r1920692 ] PDFBOX-5469: add two constructors to avoid deprecated setter
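The commit messages for PDFBOX-5469 talk about adding constructors so that callers no longer need a deprecated setter. As a general illustration of that step towards immutability, and not of COSString itself, here is a small sketch. The class name and the `forceHexForm` flag are assumptions chosen for the example; the commit messages do not say which setter is involved.

```java
// Hypothetical sketch of "pass state through a constructor instead of a setter".
// This is not the COSString API; the class and the flag are illustrative only.
final class ImmutableStringSketch {
    private final byte[] bytes;
    private final boolean forceHexForm;

    // Convenience constructor using the default for the flag.
    ImmutableStringSketch(byte[] bytes) {
        this(bytes, false);
    }

    // Full constructor: everything that a setter used to change is fixed at creation time.
    ImmutableStringSketch(byte[] bytes, boolean forceHexForm) {
        this.bytes = bytes.clone(); // defensive copy, so later changes by the caller have no effect
        this.forceHexForm = forceHexForm;
    }

    // Returns a copy, so callers cannot modify the internal array.
    byte[] getBytes() {
        return bytes.clone();
    }

    boolean isForceHexForm() {
        return forceHexForm;
    }

    public static void main(String[] args) {
        byte[] source = {1, 2};
        ImmutableStringSketch s = new ImmutableStringSketch(source, true);
        source[0] = 9; // does not affect s
        System.out.println(s.getBytes()[0] + " " + s.isForceHexForm());
    }
}
```

With all fields final and no setters, instances can be shared between threads and cached without copying, which is the usual motivation for this kind of change.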
[jira] [Commented] (PDFBOX-5660) Improve code quality (5)
[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881854#comment-17881854 ] ASF subversion and git services commented on PDFBOX-5660: - Commit 1920689 from le...@apache.org in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1920689 ] PDFBOX-5660: close InputStream as MemoryCacheImageInputStream doesn't do it > Improve code quality (5) > > > Key: PDFBOX-5660 > URL: https://issues.apache.org/jira/browse/PDFBOX-5660 > Project: PDFBox > Issue Type: Improvement >Reporter: Tilman Hausherr >Priority: Minor > Attachments: AnnotationSample.Standard.pdf, > DRY_refactoring_Typ2CharStringParser.patch, > Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch, > > Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch, > Simplify_string_conversion_in_PDFHighlighter.patch, > Update_string_handling_and_regex_in_several_classes.patch, > avoid_multiple_unboxing.patch, code_cleanup.patch, > do_not_create_temporary_File_instance.patch, > extract_common_code,_move_toUpperCase()_out_of_loop.patch, > fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch, > introduce_COSArray_of(float___)_to_make_the_code_more_concise_and_avoid_creating_and_copyi.patch, > introduce_StringUtil_class_for_reusable_functionality.patch, > introduce_constants_COSFLOAT_ZERO_and_COSFloat_ONE_to_avoid_creating_unnecessary_instances.patch, > make_inner_class_static.patch, refactor_isEndOfName.patch, > remove_code_duplication_in_Type2CharStringParser.patch, > remove_obsolete_class_NullOutputStream.patch, > remove_unnecessary_calls_to_toString()_String_valueOf().patch, > replace_System_getProperty()_calls.patch, screenshot-1.png, > simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch, > simplify_stream_operations.patch, use_Map_ofEntries().patch, > use_Math_min()_to_make_code_more_readable.patch, 
use_Objects_equals().patch, > use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch, > use_String_join().patch, use_switch_for_readability.patch, > use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch > > > This is a longterm issue for the task to improve code quality, by using the > SonarQube report, hints in different IDEs, the FindBugs tool and other code > quality tools. > This is a follow-up of PDFBOX-4892, which was getting too long.
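The commit message above ("close InputStream as MemoryCacheImageInputStream doesn't do it") refers to documented JDK behaviour: closing a javax.imageio.stream.MemoryCacheImageInputStream frees its cache but leaves the source InputStream open. The sketch below shows one way to close both with try-with-resources. It is not the PDFBox change itself; `TrackingStream` and the method names are hypothetical helpers that make the effect observable.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

// Sketch: the wrapper does not close its source, so the source is declared as its own resource.
final class CloseSourceSketch {
    // Hypothetical helper that records whether close() was called.
    static final class TrackingStream extends ByteArrayInputStream {
        private boolean closed;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }

        boolean isClosed() {
            return closed;
        }
    }

    // Closes only the wrapper and reports whether the source was closed as a side effect.
    static boolean sourceClosedByWrapperAlone() {
        TrackingStream source = new TrackingStream(new byte[] {42});
        try {
            ImageInputStream iis = new MemoryCacheImageInputStream(source);
            iis.read();
            iis.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.isClosed();
    }

    // Declares both streams as resources; they are closed in reverse order (wrapper, then source).
    static int readFirstByte(InputStream source) {
        try (InputStream in = source;
             ImageInputStream iis = new MemoryCacheImageInputStream(in)) {
            return iis.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TrackingStream source = new TrackingStream(new byte[] {42});
        System.out.println(readFirstByte(source) + " " + source.isClosed());
        System.out.println(sourceClosedByWrapperAlone());
    }
}
```

For an in-memory stream the leak is harmless, but with a file or network stream as the source the unclosed handle would stay open until garbage collection.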
[jira] [Commented] (PDFBOX-5660) Improve code quality (5)
[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881852#comment-17881852 ] ASF subversion and git services commented on PDFBOX-5660: - Commit 1920687 from le...@apache.org in branch 'pdfbox/trunk' [ https://svn.apache.org/r1920687 ] PDFBOX-5660: close InputStream as MemoryCacheImageInputStream doesn't do it
[jira] [Commented] (PDFBOX-5660) Improve code quality (5)
[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881853#comment-17881853 ] ASF subversion and git services commented on PDFBOX-5660: - Commit 1920688 from le...@apache.org in branch 'pdfbox/branches/3.0' [ https://svn.apache.org/r1920688 ] PDFBOX-5660: close InputStream as MemoryCacheImageInputStream doesn't do it
[jira] [Commented] (PDFBOX-5660) Improve code quality (5)
[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881325#comment-17881325 ] ASF subversion and git services commented on PDFBOX-5660: - Commit 1920596 from Tilman Hausherr in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1920596 ] PDFBOX-5660: update minimum maven version
[jira] [Commented] (PDFBOX-5660) Improve code quality (5)
[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881324#comment-17881324 ] ASF subversion and git services commented on PDFBOX-5660: - Commit 1920595 from Tilman Hausherr in branch 'pdfbox/branches/3.0' [ https://svn.apache.org/r1920595 ] PDFBOX-5660: update log4j > Improve code quality (5) > > > Key: PDFBOX-5660 > URL: https://issues.apache.org/jira/browse/PDFBOX-5660 > Project: PDFBox > Issue Type: Improvement >Reporter: Tilman Hausherr >Priority: Minor > Attachments: AnnotationSample.Standard.pdf, > DRY_refactoring_Typ2CharStringParser.patch, > Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch, > > Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch, > Simplify_string_conversion_in_PDFHighlighter.patch, > Update_string_handling_and_regex_in_several_classes.patch, > avoid_multiple_unboxing.patch, code_cleanup.patch, > do_not_create_temporary_File_instance.patch, > extract_common_code,_move_toUpperCase()_out_of_loop.patch, > fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch, > introduce_COSArray_of(float___)_to_make_the_code_more_concise_and_avoid_creating_and_copyi.patch, > introduce_StringUtil_class_for_reusable_functionality.patch, > introduce_constants_COSFLOAT_ZERO_and_COSFloat_ONE_to_avoid_creating_unnecessary_instances.patch, > make_inner_class_static.patch, refactor_isEndOfName.patch, > remove_code_duplication_in_Type2CharStringParser.patch, > remove_obsolete_class_NullOutputStream.patch, > remove_unnecessary_calls_to_toString()_String_valueOf().patch, > replace_System_getProperty()_calls.patch, screenshot-1.png, > simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch, > simplify_stream_operations.patch, use_Map_ofEntries().patch, > use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch, > 
use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch, > use_String_join().patch, use_switch_for_readability.patch, > use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch > > > This is a longterm issue for the task to improve code quality, by using the > SonarQube report, hints in different IDEs, the FindBugs tool and other code > quality tools. > This is a follow-up of PDFBOX-4892, which was getting too long. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880425#comment-17880425 ] Joseph Jezerinac commented on PDFBOX-5878: -- Please could any admin kindly delete the attachments. [~tilman] > pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Jezerinac updated PDFBOX-5878: - Attachment: (was: beforeFlattening.pdf) > pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Assigned] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler reassigned PDFBOX-5852: -- Assignee: Andreas Lehmkühler > Hi CPU and memory usage when converting a PDF with type 4 shading > - > > Key: PDFBOX-5852 > URL: https://issues.apache.org/jira/browse/PDFBOX-5852 > Project: PDFBox > Issue Type: Wish > Components: Rendering >Affects Versions: 2.0.28 >Reporter: Larry Lynn >Assignee: Andreas Lehmkühler >Priority: Major > Attachments: minimal.pdf > > > We've observed excessive CPU and memory consumption when converting a PDF to > images when the PDF contains type 4 shading. This is especially noticeable > when the conversion is done with a high DPI. Can this be improved? > > Conversation from the PDFBox users mailing list follows > Initial email: > {code:java} > Hi CPU and memory usage when converting a PDF with type 4 shading > Hello PDFBox > users and maintainers, > We have a PDF that causes performance problems when we use PDFBox to > convert it to an image with renderImageWithDPI(). We're calling > renderImageWithDPI() > with 650 DPI. I realize this is a very high value - we're using it for > high fidelity original images that will later be downsampled. On my work > laptop which has fairly strong hardware, the conversion takes 25 minutes > and consumes 20GB of memory. CPU and memory usage is reduced if we use a > lower DPI. > The PDF is 1 page long. It contains type 4 shading / Gouraud free form > triangle meshes. We've been aware of some performance issues with type 4 > shading for a little while now, but the PDFs that contained the type 4 > shading belonged to our customers and we were not authorized to share > them. We finally found a problem input document that is non-sensitive and > that we are authorized to share. I've attached a copy of the problem PDF > to this email. 
> I searched the archives for the users and the developers mailing list and I > didn't find anything specifically about this issue. > I searched through the PDFBox jira tickets and I found a couple of tickets > that looked similar: PDFBOX-2901 & PDFBOX-4491. PDFBOX-2901 seems to most > closely describe what we're seeing, but that was closed in PDFBox 2.0.0, > and our issue still reproduces with PDFBox 2.0.28. > Should I refer this issue over to the developers mailing list or create a > PDFBox Jira ticket for this? > Thanks and Regards, > Larry Lynn {code} > Response: > {code:java} > Hi, > Yes shading can be very slow, especially at high dpi. The attachment > didn't get through, please upload to a sharehoster or create a ticket. > If you need to register then add a meaningful text, e.g. the subject of > this post so we know you're not a spammer. Also retry with 2.0.31 and > 3.0.2 just to be sure. However I'm pessimistic that this can be fixed. > Tilman {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
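To put the 650 DPI figure from the report into perspective, the following is a rough sketch of how the size of the output raster grows with DPI. The page size is an assumption (US Letter, 8.5 x 11 inches), as is the 4 bytes per pixel; the thread states neither, and the class and method names are hypothetical and not part of PDFBox.

```java
// Back-of-the-envelope estimate of the raster produced when one page is
// rendered at a given DPI. Assumptions (not stated in the thread):
// a US Letter page (8.5 x 11 in) and 4 bytes per pixel.
final class RenderSizeEstimate {

    /** Pixels along one side of the page: inches * dpi, rounded up. */
    static long pixelsAlong(double inches, int dpi) {
        return (long) Math.ceil(inches * dpi);
    }

    /** Bytes needed for one uncompressed raster of the whole page. */
    static long rasterBytes(double widthIn, double heightIn, int dpi, int bytesPerPixel) {
        return pixelsAlong(widthIn, dpi) * pixelsAlong(heightIn, dpi) * bytesPerPixel;
    }

    public static void main(String[] args) {
        int[] dpis = {72, 300, 650};
        for (int dpi : dpis) {
            long bytes = rasterBytes(8.5, 11.0, dpi, 4);
            System.out.printf("%d DPI: %d x %d px, %d bytes%n",
                    dpi, pixelsAlong(8.5, dpi), pixelsAlong(11.0, dpi), bytes);
        }
    }
}
```

Under these assumptions a 650 DPI raster is 5525 x 7150 pixels, about 158 MB, roughly 81 times the pixel count at 72 DPI. The output image alone would therefore be far smaller than the reported 20GB, so under these assumptions most of that memory would have to be used elsewhere during rendering.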
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880141#comment-17880141 ] Michael Klink edited comment on PDFBOX-5878 at 9/8/24 2:01 PM: --- As an aside: The document has set *NeedAppearances* to *true*, so regenerating the appearances for flattening would be appropriate here whether there are any issues in the PDF or not. Furthermore, when (re-)creating form field appearances, one should always create from scratch. The only exception may be if one has analyzed the existing appearance contents and has made sure that one has removed all old field contents from it and also that there is nothing in it that moves the to-be-drawn new content out of the bbox or otherwise obscures it. Else this might be a variation of one of the well-known www.pdf-insecurity.org attacks. But I think it in general is really hard to tell whether some "rectangles and such" are drawn as part of a specific style or as a forgery attempt. Thus, I'd really propose to always re-create from scratch. was (Author: mkl): As an aside: The document has set *NeedAppearances* to *true*, so regenerating the appearances for flattening would be appropriate here whether there are any issues in the PDF or not. Furthermore, when (re-)creating form field appearances, one should always create from scratch. The only exception may be if one has analyzed the existing appearance contents and has made sure that one has removed all old field contents from it and also that there is nothing in it that moves the to-be-drawn new content out of the bbox or otherwise obscure it. Else this might be a variation of one of the well-known www.pdf-insecurity.org attacks. But I think it in general is really hard to tell whether some "rectangles and such" are drawn as part of a specific style or as a forgery attempt. Thus, I'd really propose to always re-create from scratch. 
> pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, > flattened.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880141#comment-17880141 ] Michael Klink edited comment on PDFBOX-5878 at 9/8/24 1:59 PM: --- As an aside: The document has set *NeedAppearances* to *true*, so regenerating the appearances for flattening would be appropriate here whether there are any issues in the PDF or not. Furthermore, when (re-)creating form field appearances, one should always create from scratch. The only exception may be if one has analyzed the existing contents and has made sure that one has removed all old field contents from it and also that there is nothing in it that moves the to-be-drawn new content out of the bbox or otherwise obscure it. Else this might be a variation of one of the well-known www.pdf-insecurity.org attacks. But I think it in general is really hard to tell whether some "rectangles and such" are drawn as part of a specific style or as a forgery attempt. Thus, I'd really propose to always re-create from scratch. was (Author: mkl): As an aside: The document has set **NeedAppearances** to **true**, so regenerating the appearances for flattening would be appropriate here whether there are any issues in the PDF or not. Furthermore, when (re-)creating form field appearances, one should always create from scratch. The only exception may be if one has analyzed the existing contents and has made sure that one has removed all old field contents from it and also that there is nothing in it that moves the to-be-drawn new content out of the bbox or otherwise obscure it. Else this might be a variation of one of the well-known www.pdf-insecurity.org attacks. But I think it in general is really hard to tell whether some "rectangles and such" are drawn as part of a specific style or as a forgery attempt. Thus, I'd really propose to always re-create from scratch. 
> pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, > flattened.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5660) Improve code quality (5)
[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880031#comment-17880031 ] ASF subversion and git services commented on PDFBOX-5660: - Commit 1920506 from Tilman Hausherr in branch 'pdfbox/trunk' [ https://svn.apache.org/r1920506 ] PDFBOX-5660: update log4j > Improve code quality (5) > > > Key: PDFBOX-5660 > URL: https://issues.apache.org/jira/browse/PDFBOX-5660 > Project: PDFBox > Issue Type: Improvement >Reporter: Tilman Hausherr >Priority: Minor > Attachments: AnnotationSample.Standard.pdf, > DRY_refactoring_Typ2CharStringParser.patch, > Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch, > > Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch, > Simplify_string_conversion_in_PDFHighlighter.patch, > Update_string_handling_and_regex_in_several_classes.patch, > avoid_multiple_unboxing.patch, code_cleanup.patch, > do_not_create_temporary_File_instance.patch, > extract_common_code,_move_toUpperCase()_out_of_loop.patch, > fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch, > introduce_COSArray_of(float___)_to_make_the_code_more_concise_and_avoid_creating_and_copyi.patch, > introduce_StringUtil_class_for_reusable_functionality.patch, > introduce_constants_COSFLOAT_ZERO_and_COSFloat_ONE_to_avoid_creating_unnecessary_instances.patch, > make_inner_class_static.patch, refactor_isEndOfName.patch, > remove_code_duplication_in_Type2CharStringParser.patch, > remove_obsolete_class_NullOutputStream.patch, > remove_unnecessary_calls_to_toString()_String_valueOf().patch, > replace_System_getProperty()_calls.patch, screenshot-1.png, > simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch, > simplify_stream_operations.patch, use_Map_ofEntries().patch, > use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch, > 
use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch, > use_String_join().patch, use_switch_for_readability.patch, > use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch > > > This is a longterm issue for the task to improve code quality, by using the > SonarQube report, hints in different IDEs, the FindBugs tool and other code > quality tools. > This is a follow-up of PDFBOX-4892, which was getting too long. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879888#comment-17879888 ] Maruan Sahyoun commented on PDFBOX-5878: It's not weird per se; it's only weird here because there are multiple occurrences of the value and other content within a field's appearance stream where there should be only one. But we can look at changing that, maybe for 4.0, to create from scratch (which is more in line with what Adobe does, though according to the spec normally only with some field settings that do not apply here). > pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, > flattened.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879832#comment-17879832 ] Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 10:35 AM: -- Here's what worked: {code:java} for (PDField field: acroForm.getFieldTree()) { if (field instanceof PDTextField) { if (field instanceof PDVariableText) { for (PDAnnotationWidget widget : field.getWidgets()) { widget.setAppearance(null); } } } } acroForm.refreshAppearances(); {code} [^PDFBox5878-flattened.pdf] [^PDFBox5878-saved.pdf] The only problem left is that the second multiline field starts a bit too low, but IIRC there's another issue about that. was (Author: tilman): Here's what worked: {code:java} for (PDField field: acroForm.getFieldTree()) { if (field instanceof PDTextField) { if (field instanceof PDVariableText) { for (PDAnnotationWidget widget : field.getWidgets()) { widget.setAppearance(null); } } } } acroForm.refreshAppearances(); {code} [^PDFBox5878-flattened.pdf] [^PDFBox5878-saved.pdf] > pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, > flattened.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > 
pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879832#comment-17879832 ] Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 10:31 AM: -- Here's what worked: {code:java} for (PDField field: acroForm.getFieldTree()) { if (field instanceof PDTextField) { if (field instanceof PDVariableText) { for (PDAnnotationWidget widget : field.getWidgets()) { widget.setAppearance(null); } } } } acroForm.refreshAppearances(); {code} [^PDFBox5878-flattened.pdf] [^PDFBox5878-saved.pdf] was (Author: tilman): Here's what worked: {code:java} for (PDField field: acroForm.getFieldTree()) { if (field instanceof PDTextField) { if (field instanceof PDVariableText) { for (PDAnnotationWidget widget : field.getWidgets()) { widget.setAppearance(null); } } } acroForm.refreshAppearances(); } {code} [^PDFBox5878-flattened.pdf] [^PDFBox5878-saved.pdf] > pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, > flattened.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = 
pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5878: Attachment: PDFBox5878-flattened.pdf PDFBox5878-saved.pdf > pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, > flattened.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879822#comment-17879822 ] Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 9:30 AM: - I added this for the missing fonts, which is just a guess that it's the correct font:
{code:java}
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
acroForm.setNeedAppearances(false);
PDFont font1 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/times.ttf"), false);
PDFont font2 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/timesbd.ttf"), false);
PDFont font3 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/arial.ttf"), false);
acroForm.getDefaultResources().put(COSName.getPDFName("TimesNewRomanPSMT"), font1);
acroForm.getDefaultResources().put(COSName.getPDFName("TimesNewRomanPS-BoldMT"), font2);
acroForm.getDefaultResources().put(COSName.getPDFName("Helvetica"), font3);
for (PDField field: acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (((PDTextField) field).isMultiline())
        {
            field.setValue("XXX");
        }
    }
}
{code}
But when setting a value, this happens in AppearanceGeneratorHelper.setAppearanceContent():
{code}
if (bmcIndex == -1)
{
    // append to existing stream
    writer.writeTokens(tokens);
    writer.writeTokens(COSName.TX, BMC);
}
{code}
So it appends to the existing appearance stream.
This is the result after calling setValue("XXX"): {code} q Q q 9.613575 0.4609071 430.9062 41.31819 re W n q 0.9781767 0 0 -0.9781767 -87.43936 478.0107 cm BT 11 0 0 -11 102.2182 458.5622 Tm /TT21 1 Tf [ (N) -0.2 (a) 0.2 (m) 0.2 (e) 0.2 ( c) 0.2 (ha) 0.2 (nge) 0.2 (d 09/) 0.2 (26/) 0.2 (2020) ] TJ ET Q Q q 6.43259 0.3084 434.0872 41.6232 re W n q 0.9853977 0 0 0.9853977 9.388783 29.51731 cm BT 11 0 0 11 0 0 Tm /TT18 1 Tf [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 (ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 (ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ ET Q q 0.9853977 0 0 0.9853977 9.388783 17.51355 cm BT 11 0 0 11 0 0 Tm /TT18 1 Tf [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 (i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) ] TJ ET Q Q q 3.228123 0.1547671 437.2917 41.93047 re W n q 0.992672 0 0 0.992672 6.206139 29.5793 cm BT 11 0 0 11 0 0 Tm /TT19 1 Tf [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 (ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 (ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ ET Q q 0.992672 0 0 0.992672 6.206139 17.48693 cm BT 11 0 0 11 0 0 Tm /TT19 1 Tf [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 (i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) 
] TJ ET Q Q q 0 0 440.5198 42.24 re W n /Cs6 cs 0 sc q 1 0 0 1 3 29.64175 cm BT 11 0 0 11 0 0 Tm /TT20 1 Tf [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 (ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 (ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ ET Q q 1 0 0 1 3 17.46011 cm BT 11 0 0 11 0 0 Tm /TT20 1 Tf [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 (i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) ] TJ ET Q Q /Tx BMC q -2.252 1 441.7718 40.24 re W n BT /TimesNewRomanPSMT 11 Tf /DeviceGray cs 0 sc -1.252 25.4319 Td (\000;\000;\000;) Tj ET Q EMC {code} So the XXX is there, but also all the previous content. was (Author: tilman): I added this for the missing fonts, which is just a guess that it's the correct font {code:java} acroForm.setNeedAppearances(false); PDFont font1 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/times.ttf"), false); PDFont font2
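The effect described in this comment (the new "XXX" value ends up after all the old drawing operators) can be illustrated with a small stand-alone model. This is a simplified sketch, not PDFBox code: tokens are plain strings, the class and method names are hypothetical, and only the `bmcIndex == -1` append branch is taken from the snippet quoted in the comment. The handling of a stream that already contains `BMC`/`EMC` markers is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of writing a new field value into an existing
// appearance stream. Tokens are plain strings here; this is not PDFBox code.
final class AppearanceAppendModel {
    static final String BMC = "BMC";
    static final String EMC = "EMC";

    /** Keeps the existing tokens and places the new content in a /Tx BMC ... EMC section. */
    static List<String> setContent(List<String> tokens, List<String> newContent) {
        List<String> out = new ArrayList<>();
        int bmc = tokens.indexOf(BMC);
        if (bmc == -1) {
            // Branch quoted in the comment: no marker, so everything old is kept
            // and a new marked-content section is started after it.
            out.addAll(tokens);
            out.add("/Tx");
            out.add(BMC);
        } else {
            // Assumption: keep everything up to and including the marker.
            out.addAll(tokens.subList(0, bmc + 1));
        }
        out.addAll(newContent);
        int emc = tokens.indexOf(EMC);
        if (emc == -1) {
            out.add(EMC);
        } else {
            // Assumption: keep the end marker and whatever follows it.
            out.addAll(tokens.subList(emc, tokens.size()));
        }
        return out;
    }

    /** The "create from scratch" alternative: old tokens are discarded entirely. */
    static List<String> fromScratch(List<String> newContent) {
        List<String> out = new ArrayList<>(Arrays.asList("/Tx", BMC));
        out.addAll(newContent);
        out.add(EMC);
        return out;
    }

    public static void main(String[] args) {
        List<String> oldStream = Arrays.asList("q", "(old text)", "Tj", "Q");
        List<String> value = Arrays.asList("(XXX)", "Tj");
        System.out.println(setContent(oldStream, value)); // old text is still drawn
        System.out.println(fromScratch(value));           // only the new value remains
    }
}
```

In the first call the old text operators are still present in front of the new section, which matches what the dump above shows. Removing the widget appearance before regenerating it corresponds to the second call.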
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879796#comment-17879796 ] Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 8:00 AM: - There are so many things wrong with this PDF that I don't see a specific solution. I'm doing this just for fun. I was able to fix some of the fields (e.g. Last1) but not yet all (e.g. the multiline fields and some others), for some unknown reason. (I added the missing fonts to the default resources.) Not all appearances are redrawn. Either there's a bug in my code or there is something in our code that skips the recreation of the appearances and I forgot about it. It's not even recreated when changing the value to something else?! was (Author: tilman): There are so many things wrong with this PDF that I don't see a specific solution. I'm doing this just for fun. I was able to fix some of the fields (e.g. Last1) but not yet all (e.g. the multiline fields and some others), for some unknown reason. (I added the missing fonts to the default resources) Not all appearances are redrawn. Either there's a bug in my code or there is something in our code that skips the recreation of the appearances and I forgot about it. 
> pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > beforeFlattening.pdf, flattened.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879753#comment-17879753 ] Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 4:04 AM: - I could try to getValue() and setValue() on the text fields and see whether it looks better when PDFBox recreates the appearances. These fields have a value that makes sense. I'm just wondering whether this person will have legal disadvantages if the file is refused? (Although I doubt that the content of field {{Root/Pages/Kids/[0]/Annots/[7]/V}} will work for the petitioner). OTOH it's from 22.2 so it may already have been decided in some way. was (Author: tilman): I could try to getValue() and setValue() on the text fields and see whether it looks better when PDFBox recreates the appearances. These fields have a value that makes sense. I'm just wondering whether this person will have legal disadvantages if the file is refused? (Although I doubt that the content of field {{Root/Pages/Kids/[0]/Annots/[7]/V}} will work for the petitioner). OTOH it's from 22.2 so it may already have been processed. 
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879480#comment-17879480 ] Tilman Hausherr edited comment on PDFBOX-5878 at 9/5/24 8:16 AM: - {code} q Q q 9.469598 0.4248199 206.7517 18.55036 re W n q 0.9562042 0 0 -0.9562042 -55.6218 672.8725 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT21 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q q 6.360067 0.2853218 209.8612 18.82936 re W n q 0.9705854 0 0 -0.9705854 -59.7103 682.8466 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT18 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q q 3.203769 0.1437257 213.0175 19.11255 re W n q 0.9851829 0 0 -0.9851829 -63.86029 692.9707 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT19 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q q 0 0 216.2213 19.4 re W n /Cs6 cs 0 sc q 1 0 0 -1 -68.0727 703.247 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT20 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q {code} The text appears 3 times at slightly different positions in this appearance stream. 
was (Author: tilman): {code} q Q q 9.469598 0.4248199 206.7517 18.55036 re W n q 0.9562042 0 0 -0.9562042 -55.6218 672.8725 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT21 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q q 6.360067 0.2853218 209.8612 18.82936 re W n q 0.9705854 0 0 -0.9705854 -59.7103 682.8466 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT18 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q q 3.203769 0.1437257 213.0175 19.11255 re W n q 0.9851829 0 0 -0.9851829 -63.86029 692.9707 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT19 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q q 0 0 216.2213 19.4 re W n /Cs6 cs 0 sc q 1 0 0 -1 -68.0727 703.247 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT20 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q {code} The text appears 3 times. 
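The repetition in the appearance stream quoted above can be checked mechanically. The following is a minimal sketch, not PDFBox code: it assumes the content stream is already available as a decoded string, and the class name `TjTextDump` and method `shownStrings` are made up for illustration. It collects the literal strings of every `[ ... ] TJ` operation, so identical entries in the result indicate text that is drawn more than once. The regular expressions are deliberately naive: escape sequences are not decoded, and a `]` inside a string would end the array match early.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TjTextDump {
    // Matches one "[ ... ] TJ" show-text operation (naive: assumes no ']' inside strings).
    private static final Pattern TJ_ARRAY =
            Pattern.compile("\\[(.*?)\\]\\s*TJ", Pattern.DOTALL);
    // Matches one "(...)" literal string inside such an array, skipping escaped characters.
    private static final Pattern LITERAL =
            Pattern.compile("\\(((?:\\\\.|[^\\\\)])*)\\)");

    /** Returns the concatenated literal strings of each TJ operation, in stream order. */
    public static List<String> shownStrings(String contentStream) {
        List<String> result = new ArrayList<>();
        Matcher array = TJ_ARRAY.matcher(contentStream);
        while (array.find()) {
            StringBuilder text = new StringBuilder();
            Matcher literal = LITERAL.matcher(array.group(1));
            while (literal.find()) {
                // Kerning numbers between the strings are ignored; only the text is kept.
                text.append(literal.group(1));
            }
            result.add(text.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical stream in the style of the one quoted in the comment.
        String stream = "q BT /TT21 1 Tf [ (E) 0.2 (l) 0.2 (ys) ] TJ ET Q "
                      + "q BT /TT18 1 Tf [ (E) 0.2 (l) 0.2 (ys) ] TJ ET Q";
        System.out.println(shownStrings(stream)); // prints [Elys, Elys]
    }
}
```

Run against the full stream from the comment, each entry would read "Elysia Joy Ancheta Torlao"; the number of entries is the number of times the value is painted.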
[jira] [Updated] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maruan Sahyoun updated PDFBOX-5878: --- Labels: Appearance (was: )
[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879145#comment-17879145 ] ASF subversion and git services commented on PDFBOX-5876: - Commit 1920451 from Tilman Hausherr in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1920451 ] PDFBOX-5876: revert due to rendering regression test failure > This jpeg2000 takes up a lot of memory, causing overflow. > - > > Key: PDFBOX-5876 > URL: https://issues.apache.org/jira/browse/PDFBOX-5876 > Project: PDFBox > Issue Type: Bug > Components: Rendering >Affects Versions: 2.0.32, 3.0.2 PDFBox >Reporter: liu >Assignee: Tilman Hausherr >Priority: Major > Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0 > > Attachments: jpeg2000.pdf > > > pdf:[^jpeg2000.pdf] > JVM:-Xmx600m > {code:java} > // code placeholder > public static void main(String[] args) throws IOException, > InterruptedException { >File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf"); >PDDocument pdf = Loader.loadPDF(file, > IOUtils.createTempFileOnlyStreamCache()); >PDFRenderer renderer = new PDFRenderer(pdf); >int numPages = 0; >renderer.setSubsamplingAllowed(true); >BufferedImage bi = renderer.renderImage(numPages, 0.5f); >pdf.close(); > } {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Reopened] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr reopened PDFBOX-5876: -
[jira] [Closed] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler closed PDFBOX-5877. -- Resolution: Not A Problem > After flattening a form pdf, the pdf loses content > -- > > Key: PDFBOX-5877 > URL: https://issues.apache.org/jira/browse/PDFBOX-5877 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Attachments: beforeFalttening.pdf, flattenedPdf.pdf > > > After flattening, the PDF form content changes. Please take a look at the before and > after PDFs. The flattening works fine in 2.0.31. After upgrading to 3.0.3 we > started getting many issues with PDF forms after flattening. > The code used for flattening is as follows > {code} > PDDocument pdDocument = Loader.loadPDF(file, ""); > pdDocument.setResourceCache(new PdfResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(file); > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879066#comment-17879066 ] Joseph Jezerinac commented on PDFBOX-5877: -- Writing to a different file does solve this and many other issues. Will keep checking other pdfs. Thank you very much for your help
[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879031#comment-17879031 ] Joseph Jezerinac commented on PDFBOX-5877: -- Thank you for the quick response. Yes, I tried in 3.0.3. Sorry about PdfResourceCache, that is our class. Our code was not changed and works fine with an older version of PDFBox, so I thought maybe something got changed in the new version. Will look into what was pointed out by Lehmkühler and let you know.
[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878964#comment-17878964 ] Tilman Hausherr commented on PDFBOX-5877: - Yeah!! There's a log message, so it means you also disabled or disregarded logs :-(
[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878963#comment-17878963 ] Andreas Lehmkühler commented on PDFBOX-5877: Maybe more important: don't use the input file as output. The on-demand parser may read from the input file until it is closed. Most likely you are overwriting the source while saving the resulting file.
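One way to follow the advice in this comment while still ending up with the result under the original file name is to save to a temporary file first and swap it in afterwards. The sketch below is only an illustration using the JDK; the class `SafeOverwrite`, the `Saver` interface and `replaceSafely` are hypothetical names, not part of PDFBox. It assumes the `Saver` callback loads the document from the original, saves it to the given temporary path, and closes the document before returning, so nothing is still reading the source when it gets replaced.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SafeOverwrite {
    /** Hypothetical callback: write the new file content to the given path. */
    @FunctionalInterface
    public interface Saver {
        void saveTo(Path target) throws IOException;
    }

    /**
     * Lets the saver write to a temporary file next to the original and replaces
     * the original only after the saver has finished without an exception.
     */
    public static void replaceSafely(Path original, Saver saver) throws IOException {
        Path dir = original.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "flatten-", ".tmp");
        try {
            // With PDFBox this callback would be something like (assumption):
            //   try (PDDocument doc = Loader.loadPDF(original.toFile())) {
            //       ... flatten ...
            //       doc.save(target.toFile());
            //   }
            saver.saveTo(tmp);
            Files.move(tmp, original, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; removes the leftover temp file on failure.
            Files.deleteIfExists(tmp);
        }
    }
}
```

If the callback throws, the original file is left untouched, which also avoids the situation where a failed save leaves a half-written PDF in place of the input.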
[jira] [Comment Edited] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878961#comment-17878961 ] Tilman Hausherr edited comment on PDFBOX-5877 at 9/3/24 5:55 PM: - What's this? {code} pdDocument.setResourceCache(new PdfResourceCache()) {code} We have no class {{PdfResourceCache}}. was (Author: tilman): What's this? pdDocument.setResourceCache(new PdfResourceCache())
[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878961#comment-17878961 ] Tilman Hausherr commented on PDFBOX-5877: - What's this? pdDocument.setResourceCache(new PdfResourceCache())
[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878960#comment-17878960 ] Tilman Hausherr commented on PDFBOX-5877: - Are you sure you used 3.0.3 and not 3.0.2? I just tried with the trunk and 3.0.4-SNAPSHOT with our test and I got only invisible differences (yours are clearly visible and are because all fonts are lost in the PDF).
[jira] [Updated] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-5877: --- Description: After flattening the pdf form content changes. Pls take a look at before and after pdf. The flattening works fine in 2.0.31. After upgrading to 3.0.3 we started getting many issues with pdf forms after flattening. The code that used for flattening is as follows {code} PDDocument pdDocument = Loader.loadPDF(file, “”); pdDocument.setResourceCache(new PdfResourceCache()) try { boolean save = false; if (pdDocument.isEncrypted()) { pdDocument.setAllSecurityToBeRemoved(true); save = true; } final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog(); if (pdDocumentCatalog != null) { final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); if (pdForm != null) { pdForm.flatten(); save = true; } } if (save) { pdDocument.save(file); } } {code} was: After flattening the pdf form content changes. Pls take a look at before and after pdf. The flattening works fine in 2.0.31. After upgrading to 3.0.3 we started getting many issues with pdf forms after flattening. 
The code that used for flattening is as follows PDDocument pdDocument = Loader.loadPDF(file, “”); pdDocument.setResourceCache(new PdfResourceCache()) try { boolean save = false; if (pdDocument.isEncrypted()) { pdDocument.setAllSecurityToBeRemoved(true); save = true; } final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog(); if (pdDocumentCatalog != null) { final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); if (pdForm != null) { pdForm.flatten(); save = true; } } if (save) { pdDocument.save(file); } }
[jira] [Updated] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Lehmkühler updated PDFBOX-5877: --- Affects Version/s: 3.0.3 PDFBox (was: 3.0.3 JBIG2)
[jira] [Created] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
Joseph Jezerinac created PDFBOX-5877:

Summary: After flattening a form pdf, the pdf loses content
Key: PDFBOX-5877
URL: https://issues.apache.org/jira/browse/PDFBOX-5877
Project: PDFBox
Issue Type: Bug
Components: AcroForm
Affects Versions: 3.0.3 JBIG2
Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
             Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
Reporter: Joseph Jezerinac
Attachments: beforeFalttening.pdf, flattenedPdf.pdf

After flattening the PDF form, the content changes. Please take a look at the before and after PDFs. Flattening works fine in 2.0.31. After upgrading to 3.0.3 we started getting many issues with PDF forms after flattening.

The code used for flattening is as follows:

    PDDocument pdDocument = Loader.loadPDF(file, "");
    pdDocument.setResourceCache(new PdfResourceCache());
    try {
        boolean save = false;
        if (pdDocument.isEncrypted()) {
            pdDocument.setAllSecurityToBeRemoved(true);
            save = true;
        }
        final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
        if (pdDocumentCatalog != null) {
            final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
            if (pdForm != null) {
                pdForm.flatten();
                save = true;
            }
        }
        if (save) {
            pdDocument.save(file);
        }
    }
[jira] [Comment Edited] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878880#comment-17878880 ]

liu edited comment on PDFBOX-5876 at 9/3/24 1:23 PM:

4G JVM can only convert 8 pictures concurrently...It overflows so easily..

was (Author: JIRAUSER297279):
4G JVM can only convert 8 pictures concurrently...

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.32, 3.0.2 PDFBox
> Reporter: liu
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
> Attachments: jpeg2000.pdf
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> }
> {code}
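One way to read the figures reported in this thread (a 4G heap, and roughly 500M per picture) is that about eight renders fit in the heap at once. The sketch below is not PDFBox code and not something proposed in the issue; it only illustrates, under that assumption, how a caller might cap the number of simultaneous renders with a plain `java.util.concurrent.Semaphore`. The class name `RenderLimiter` and the method `permitsFor` are hypothetical.

```java
import java.util.concurrent.Semaphore;

// Hypothetical helper: bounds how many memory-hungry jobs run at the same time.
final class RenderLimiter {

    /** Number of jobs that fit in heapBytes if each needs about perJobBytes (never less than 1). */
    static int permitsFor(long heapBytes, long perJobBytes) {
        return (int) Math.max(1L, heapBytes / perJobBytes);
    }

    private final Semaphore permits;

    RenderLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Blocks until a permit is free, runs the job, and always returns the permit. */
    void run(Runnable job) {
        permits.acquireUninterruptibly();
        try {
            job.run();
        } finally {
            permits.release();
        }
    }

    /** Permits currently free, exposed for inspection. */
    int available() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        // Assumption: 4096 MiB heap and about 500 MiB per render, as reported in the thread.
        int n = permitsFor(4096 * mib, 500 * mib);
        RenderLimiter limiter = new RenderLimiter(n);
        limiter.run(() -> System.out.println("at most " + n + " renders in flight"));
    }
}
```

With those two figures the integer division gives 8 permits, which matches the reported limit. The per-render figure is an observation from one file, so a real deployment would have to measure its own documents.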
[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878880#comment-17878880 ]

liu commented on PDFBOX-5876:

4G JVM can only convert 8 pictures concurrently...

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.32, 3.0.2 PDFBox
> Reporter: liu
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
> Attachments: jpeg2000.pdf
[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878879#comment-17878879 ]

Tilman Hausherr commented on PDFBOX-5876:

No... I used -Xmx4G for a production project.

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.32, 3.0.2 PDFBox
> Reporter: liu
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
> Attachments: jpeg2000.pdf
[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878878#comment-17878878 ]

liu commented on PDFBOX-5876:

It's still very large, one picture takes up 500M. Are there any other optimization solutions?

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.32, 3.0.2 PDFBox
> Reporter: liu
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
> Attachments: jpeg2000.pdf
[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878846#comment-17878846 ]

Tilman Hausherr commented on PDFBOX-5876:

Are you sure you are using the new version? You have to build yourself or wait until a new snapshot build is available.

Instead of using PDFDebugger now I just tried your code as it is with a locally built 3.0.4-SNAPSHOT and it did work with -Xmx600m. (Also with 550, but not with 500)

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.32, 3.0.2 PDFBox
> Reporter: liu
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
> Attachments: jpeg2000.pdf
[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878837#comment-17878837 ]

liu commented on PDFBOX-5876:

I tried it, but it still seems to overflow.

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.32, 3.0.2 PDFBox
> Reporter: liu
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
> Attachments: jpeg2000.pdf
[jira] [Resolved] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-5876.

    Fix Version/s: 2.0.33
                   3.0.4 PDFBox
                   4.0.0
         Assignee: Tilman Hausherr
       Resolution: Fixed

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.32, 3.0.2 PDFBox
> Reporter: liu
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
> Attachments: jpeg2000.pdf
[jira] [Updated] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5876:

    Affects Version/s: 2.0.32

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.32, 3.0.2 PDFBox
> Reporter: liu
> Priority: Major
> Attachments: jpeg2000.pdf
[jira] [Updated] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5876:

    Component/s: Rendering

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.32, 3.0.2 PDFBox
> Reporter: liu
> Priority: Major
> Attachments: jpeg2000.pdf
[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878835#comment-17878835 ]

Tilman Hausherr commented on PDFBOX-5876:

The JPX image in that file is 7020 x 4964, which is quite big, and -Xmx600m is quite low. But I noticed that the subsampling parameter wasn't used when reading the JPX image the second time, which was the cause for the OOM. (JPX images have to be read twice because of some weirdness in the specification)

It should work now, I tried it with PDFDebugger, which doesn't allow to set a temp cache.

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 3.0.2 PDFBox
> Reporter: liu
> Priority: Major
> Attachments: jpeg2000.pdf
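To put the 7020 x 4964 figure next to the -Xmx600m limit, a rough calculation of the raw decoded size may help. The sketch below is plain arithmetic, not PDFBox code: the 4 bytes per pixel and the subsampling step of 2 are assumptions chosen for illustration, and the class name `JpxMemoryEstimate` is hypothetical. It says nothing about how PDFBox actually allocates buffers.

```java
// Back-of-the-envelope estimate of an uncompressed raster's size; not PDFBox code.
final class JpxMemoryEstimate {

    /** Bytes needed for width x height pixels at the given bytes per pixel. */
    static long rawBytes(int width, int height, int bytesPerPixel) {
        return (long) width * height * bytesPerPixel;
    }

    /** Size of one dimension when only every step-th sample is kept (rounded up). */
    static int subsampled(int size, int step) {
        return (size + step - 1) / step;
    }

    public static void main(String[] args) {
        int width = 7020;       // image width quoted in the comment
        int height = 4964;      // image height quoted in the comment
        int bytesPerPixel = 4;  // assumption: one 32-bit value per pixel
        int step = 2;           // assumption: illustrative subsampling step

        long full = rawBytes(width, height, bytesPerPixel);
        long reduced = rawBytes(subsampled(width, step), subsampled(height, step), bytesPerPixel);

        System.out.printf("full size: %d bytes (%.1f MiB)%n", full, full / (1024.0 * 1024.0));
        System.out.printf("step %d:    %d bytes (%.1f MiB)%n", step, reduced, reduced / (1024.0 * 1024.0));
    }
}
```

Under these assumptions the full-size raster is 139,389,120 bytes (about 133 MiB) and the half-resolution one is 34,847,280 bytes (about 33 MiB), so a single read that ignores subsampling, together with any intermediate copies, can take a large share of a 600 MB heap.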
[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878827#comment-17878827 ]

ASF subversion and git services commented on PDFBOX-5876:

Commit 1920420 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920420 ]

PDFBOX-5876: pass subsampling for second read

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 3.0.2 PDFBox
> Reporter: liu
> Priority: Major
> Attachments: jpeg2000.pdf
[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878829#comment-17878829 ]

ASF subversion and git services commented on PDFBOX-5876:

Commit 1920422 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920422 ]

PDFBOX-5876: pass subsampling for second read

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 3.0.2 PDFBox
> Reporter: liu
> Priority: Major
> Attachments: jpeg2000.pdf
[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878828#comment-17878828 ]

ASF subversion and git services commented on PDFBOX-5876:

Commit 1920421 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920421 ]

PDFBOX-5876: pass subsampling for second read

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 3.0.2 PDFBox
> Reporter: liu
> Priority: Major
> Attachments: jpeg2000.pdf
[jira] [Created] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
liu created PDFBOX-5876:

Summary: This jpeg2000 takes up a lot of memory, causing overflow.
Key: PDFBOX-5876
URL: https://issues.apache.org/jira/browse/PDFBOX-5876
Project: PDFBox
Issue Type: Bug
Affects Versions: 3.0.2 PDFBox
Reporter: liu
Attachments: jpeg2000.pdf

pdf:[^jpeg2000.pdf]
JVM:-Xmx600m
{code:java}
// code placeholder
public static void main(String[] args) throws IOException, InterruptedException {
    File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
    PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
    PDFRenderer renderer = new PDFRenderer(pdf);
    int numPages = 0;
    renderer.setSubsamplingAllowed(true);
    BufferedImage bi = renderer.renderImage(numPages, 0.5f);
    pdf.close();
}
{code}
[jira] [Updated] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liu updated PDFBOX-5876:

    Attachment: jpeg2000.pdf

> This jpeg2000 takes up a lot of memory, causing overflow.
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 3.0.2 PDFBox
> Reporter: liu
> Priority: Major
> Attachments: jpeg2000.pdf
[jira] [Commented] (PDFBOX-5660) Improve code quality (5)
[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878493#comment-17878493 ] ASF subversion and git services commented on PDFBOX-5660: - Commit 1920378 from Tilman Hausherr in branch 'pdfbox/trunk' [ https://svn.apache.org/r1920378 ] PDFBOX-5660: update owasp plugin > Improve code quality (5) > > > Key: PDFBOX-5660 > URL: https://issues.apache.org/jira/browse/PDFBOX-5660 > Project: PDFBox > Issue Type: Improvement >Reporter: Tilman Hausherr >Priority: Minor > Attachments: AnnotationSample.Standard.pdf, > DRY_refactoring_Typ2CharStringParser.patch, > Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch, > > Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch, > Simplify_string_conversion_in_PDFHighlighter.patch, > Update_string_handling_and_regex_in_several_classes.patch, > avoid_multiple_unboxing.patch, code_cleanup.patch, > do_not_create_temporary_File_instance.patch, > extract_common_code,_move_toUpperCase()_out_of_loop.patch, > fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch, > introduce_COSArray_of(float___)_to_make_the_code_more_concise_and_avoid_creating_and_copyi.patch, > introduce_StringUtil_class_for_reusable_functionality.patch, > introduce_constants_COSFLOAT_ZERO_and_COSFloat_ONE_to_avoid_creating_unnecessary_instances.patch, > make_inner_class_static.patch, refactor_isEndOfName.patch, > remove_code_duplication_in_Type2CharStringParser.patch, > remove_obsolete_class_NullOutputStream.patch, > remove_unnecessary_calls_to_toString()_String_valueOf().patch, > replace_System_getProperty()_calls.patch, screenshot-1.png, > simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch, > simplify_stream_operations.patch, use_Map_ofEntries().patch, > use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch, > 
use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch, > use_String_join().patch, use_switch_for_readability.patch, > use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch > > > This is a longterm issue for the task to improve code quality, by using the > SonarQube report, hints in different IDEs, the FindBugs tool and other code > quality tools. > This is a follow-up of PDFBOX-4892, which was getting too long. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5660) Improve code quality (5)
[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878492#comment-17878492 ]

ASF subversion and git services commented on PDFBOX-5660:

Commit 1920377 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920377 ]

PDFBOX-5660: update owasp plugin
[jira] [Commented] (PDFBOX-5660) Improve code quality (5)
[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878491#comment-17878491 ]

ASF subversion and git services commented on PDFBOX-5660:

Commit 1920376 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920376 ]

PDFBOX-5660: update owasp plugin
[jira] [Commented] (PDFBOX-5660) Improve code quality (5)
[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878490#comment-17878490 ]

ASF subversion and git services commented on PDFBOX-5660:

Commit 2ec6fce55444c6b137268ebb5957e0960d75276d in pdfbox-jbig2's branch refs/heads/master from Tilman Hausherr
[ https://gitbox.apache.org/repos/asf?p=pdfbox-jbig2.git;h=2ec6fce ]

PDFBOX-5660: update owasp plugin
[jira] [Commented] (PDFBOX-5873) Improve ExtractTTFFonts
[ https://issues.apache.org/jira/browse/PDFBOX-5873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878211#comment-17878211 ] ASF subversion and git services commented on PDFBOX-5873: - Commit 1920306 from Tilman Hausherr in branch 'pdfbox/branches/3.0' [ https://svn.apache.org/r1920306 ] PDFBOX-5873: avoid NPE > Improve ExtractTTFFonts > --- > > Key: PDFBOX-5873 > URL: https://issues.apache.org/jira/browse/PDFBOX-5873 > Project: PDFBox > Issue Type: Improvement > Components: Utilities >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Minor > Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0 > > > Add more places where resources exist; don't save fonts twice
[jira] [Commented] (PDFBOX-5873) Improve ExtractTTFFonts
[ https://issues.apache.org/jira/browse/PDFBOX-5873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878209#comment-17878209 ] ASF subversion and git services commented on PDFBOX-5873: - Commit 1920304 from Tilman Hausherr in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1920304 ] PDFBOX-5873: avoid NPE
[jira] [Updated] (PDFBOX-5875) using font data to process ligatures
[ https://issues.apache.org/jira/browse/PDFBOX-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5875:
    Fix Version/s: (was: 3.0.4 PDFBox)

> using font data to process ligatures
> ---
>
> Key: PDFBOX-5875
> URL: https://issues.apache.org/jira/browse/PDFBOX-5875
> Project: PDFBox
> Issue Type: New Feature
> Components: Parsing, PDModel, Text extraction
> Affects Versions: 3.0.3 PDFBox
> Reporter: Manish S N
> Priority: Major
> Labels: Asian, CIDFont, font, ligatures, unicodemapping
> Attachments: page.pdf
>
> To process ligatures from Asian languages (where a glyph is the combination of two Unicode characters) using the data in embedded fonts.
>
> *The problem:*
> Modern PDF creators currently put these ligatures in the /ActualText field, which we only recently considered supporting in this issue . But this is not the case in old PDFs with embedded CID fonts like [^page.pdf], where the ligature glyphs lack a /ToUnicode character mapping because there is no single Unicode code point for them: each is a combination of more than one Unicode character.
>
> *The Potential Solution (if not perfect):*
> I managed to extract the font files using pdfbox ([code|https://gist.githubusercontent.com/incubated-geek-cc/640a74920b184274374af257cd1587bb/raw/c6fb02fa82f9883670d96b812bfe7f2f55b18125/Main.java]), and when I viewed the font files in FontForge I found the ligature data intact. So we can use this data to map the ligature glyphs to the Unicode characters of their constituent glyphs.
>
> *Problems:*
> In some cases the constituent glyphs may not be present in the cmap at all: a PDF optimiser removes them because they are never used directly in the PDF apart from in ligatures. Such glyphs are empty, with only a glyph ID and no /ToUnicode mapping, even if that particular glyph has a corresponding Unicode character.
>
> *The Hope:*
> This is not a common problem in large PDFs, and basic spell checkers could easily rectify it. Some comprehension is better than no comprehension when it comes to dealing with data. This will greatly enhance the parsing of non-Latin Asian languages.
>
> (The PDF sample I attached is in Tamil.)
[jira] [Created] (PDFBOX-5875) using font data to process ligatures
Manish S N created PDFBOX-5875:
--
Summary: using font data to process ligatures
Key: PDFBOX-5875
URL: https://issues.apache.org/jira/browse/PDFBOX-5875
Project: PDFBox
Issue Type: New Feature
Components: Parsing, PDModel, Text extraction
Affects Versions: 3.0.3 PDFBox
Reporter: Manish S N
Fix For: 3.0.4 PDFBox
Attachments: page.pdf

To process ligatures from Asian languages (where a glyph is the combination of two Unicode characters) using the data in embedded fonts.

*The problem:*
Modern PDF creators currently put these ligatures in the /ActualText field, which we only recently considered supporting in this issue . But this is not the case in old PDFs with embedded CID fonts like [^page.pdf], where the ligature glyphs lack a /ToUnicode character mapping because there is no single Unicode code point for them: each is a combination of more than one Unicode character.

*The Potential Solution (if not perfect):*
I managed to extract the font files using pdfbox ([code|https://gist.githubusercontent.com/incubated-geek-cc/640a74920b184274374af257cd1587bb/raw/c6fb02fa82f9883670d96b812bfe7f2f55b18125/Main.java]), and when I viewed the font files in FontForge I found the ligature data intact. So we can use this data to map the ligature glyphs to the Unicode characters of their constituent glyphs.

*Problems:*
In some cases the constituent glyphs may not be present in the cmap at all: a PDF optimiser removes them because they are never used directly in the PDF apart from in ligatures. Such glyphs are empty, with only a glyph ID and no /ToUnicode mapping, even if that particular glyph has a corresponding Unicode character.

*The Hope:*
This is not a common problem in large PDFs, and basic spell checkers could easily rectify it. Some comprehension is better than no comprehension when it comes to dealing with data. This will greatly enhance the parsing of non-Latin Asian languages.

(The PDF sample I attached is in Tamil.)
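The mapping proposed in this issue can be sketched roughly as follows. This is a minimal illustration and not PDFBox code: the class `LigatureResolver` and both lookup tables are hypothetical, and it assumes the ligature table (ligature glyph ID to component glyph IDs) and the glyph-to-Unicode table have already been read from the embedded font by some other means. The glyph IDs in `main` are made up; only the Tamil code points are real.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/**
 * Illustrative sketch: resolves a glyph to Unicode text, falling back to the
 * Unicode text of its ligature components. Both tables are hypothetical inputs.
 */
final class LigatureResolver {
    // ligature glyph ID -> glyph IDs of its components, in reading order
    private final Map<Integer, List<Integer>> ligatureComponents;
    // glyph ID -> Unicode text, for glyphs that do have a mapping
    private final Map<Integer, String> glyphToUnicode;

    LigatureResolver(Map<Integer, List<Integer>> ligatureComponents,
                     Map<Integer, String> glyphToUnicode) {
        this.ligatureComponents = ligatureComponents;
        this.glyphToUnicode = glyphToUnicode;
    }

    /** Returns the text for a glyph, or empty if it or any component is unmapped. */
    Optional<String> resolve(int glyphId) {
        String direct = glyphToUnicode.get(glyphId);
        if (direct != null) {
            return Optional.of(direct);
        }
        List<Integer> components = ligatureComponents.get(glyphId);
        if (components == null) {
            return Optional.empty(); // not a known ligature
        }
        StringBuilder text = new StringBuilder();
        for (int component : components) {
            String part = glyphToUnicode.get(component);
            if (part == null) {
                return Optional.empty(); // component has no Unicode mapping
            }
            text.append(part);
        }
        return Optional.of(text.toString());
    }

    public static void main(String[] args) {
        // Hypothetical glyph IDs: 40 = TAMIL LETTER KA, 50 = TAMIL VOWEL SIGN I,
        // 120 = a ligature glyph built from 40 + 50.
        LigatureResolver resolver = new LigatureResolver(
                Map.of(120, List.of(40, 50)),
                Map.of(40, "\u0B95", 50, "\u0BBF"));
        System.out.println(resolver.resolve(120).orElse("<unmapped>"));
    }
}
```

Returning an empty `Optional` rather than a partial string is a design choice for the sketch: it makes the case described under *Problems* (a component stripped from the cmap) visible to the caller instead of silently producing shortened text.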
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878095#comment-17878095 ]

Manish S N commented on PDFBOX-5868:

The data is right there in the fonts, within reach ... On second thought, yes, there are problems. This approach assumes that every glyph that has a corresponding Unicode character is _present_ in the cmap, which isn't always true. Looking at the ligature data for நூ, we see

!image-2024-08-30-17-55-41-423.png!

Here we can see glyph92 instead of the Unicode character for the dependent vowel ூ, which is not present in the cmap because that glyph itself is never used in the PDF. (All pure Tamil ligatures of ூ are irregular and have their own glyphs, rather than being combined side by side like other dependent vowel glyphs, so there is no use for the actual glyph; hence PDF optimizers will chuck it away along with its Unicode mapping.) The problem is replicated by all ligatures of ூ (dependent vowel uu). This is the case in languages like Tamil, but most other non-Latin languages can be fine, like Hindi, which is more regular than Tamil when it comes to letters. In the end there are also other problems, like a mangled cmap used as a method of obfuscation.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, EmptyActualText_reduced_poppler.txt, Main.java, PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, Tilman's_solution_out.txt, adobe_out.txt, content_diffs_with_exceptions-ActualText.xlsx, image-2024-08-19-10-38-13-472.png, image-2024-08-30-17-55-41-423.png, multilingual_test.pdf, okular_out.txt, page.pdf, pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
> I downloaded the latest executable jar of pdfbox (3.0.3) for testing and used the export:text command line tool to obtain the results:
> * multilingual_test.pdf is the original PDF I made to test multilingual text extraction
> * pdfbox_out.txt is the text file produced by pdfbox
> * adobe_out.txt is the text file created by Adobe Reader's save as text feature
>
> Observation:
> As you can see in the attachments, the text file obtained from pdfbox shows weird Unicode characters for Tamil and Bengali (for Hindi the characters are extracted but not overlapped; Japanese seems fine to me). In contrast, the text file obtained from Adobe Reader's save as text feature seems fine, and copy-pasting the text from my document viewer (evince) also works.
>
> Questions:
> # Why are the outputs from pdfbox and Adobe different?
> # What can I do to extract the text from a multilingual PDF correctly?
> # Is there a way to apply pattern matching to text in a PDF file and declare matches without extracting the text first? (say, if the problem is with fonts and glyphs)
> --
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using Apache Tika for parsing documents. I noticed a problem with the extracted PDF text (other file types parse fine). I used the executable pdfbox jar to conclude that the _problem is in pdfbox and not in tika._ I tested with Adobe Reader's extract text to confirm the problem is not with the PDF. I want to extract this multilingual text only to run pattern matching on it; I do not need to display the content, only to know whether the pattern is present or not (say, if the problem is with fonts and glyphs)
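To gauge how often the glyph92 situation described in this comment occurs in a given font, one could list the ligatures whose components have no Unicode mapping. The sketch below is only an illustration under stated assumptions: `UnmappedComponentReport` is a hypothetical helper, the ligature table and the set of mapped glyph IDs are assumed to have been extracted from the font beforehand, and treating "glyph92" as glyph ID 92 is a guess based on the glyph name in the screenshot. The other glyph IDs are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative sketch: reports ligatures that cannot be fully mapped to Unicode. */
final class UnmappedComponentReport {

    /**
     * @param ligatureComponents ligature glyph ID -> component glyph IDs (hypothetical input)
     * @param mappedGlyphs       glyph IDs that have a Unicode mapping (hypothetical input)
     * @return ligature glyph ID -> components lacking a mapping, sorted by glyph ID;
     *         fully resolvable ligatures are omitted
     */
    static Map<Integer, List<Integer>> findUnmapped(
            Map<Integer, List<Integer>> ligatureComponents, Set<Integer> mappedGlyphs) {
        Map<Integer, List<Integer>> report = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> entry : ligatureComponents.entrySet()) {
            List<Integer> missing = new ArrayList<>();
            for (Integer component : entry.getValue()) {
                if (!mappedGlyphs.contains(component)) {
                    missing.add(component);
                }
            }
            if (!missing.isEmpty()) {
                report.put(entry.getKey(), missing);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Assumed IDs: 60 = TAMIL LETTER NA (mapped), 50 = a mapped vowel sign,
        // 92 = the vowel sign UU glyph whose mapping was stripped.
        Map<Integer, List<Integer>> ligatures =
                Map.of(200, List.of(60, 92), 201, List.of(60, 50));
        System.out.println(findUnmapped(ligatures, Set.of(60, 50)));
    }
}
```

A report like this would show whether the gap is limited to one vowel sign, as suggested for Tamil above, or is widespread in a particular file.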
[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manish S N updated PDFBOX-5868: --- Attachment: image-2024-08-30-17-55-41-423.png
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878089#comment-17878089 ] Tilman Hausherr commented on PDFBOX-5868: - Yes. But consider that Adobe didn't do it, and they're smarter than us; I just tried copy/paste and save as text. The ligature data in fonts is meant to be used when creating PDFs; I don't know whether it would work in extraction.
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878084#comment-17878084 ] Manish S N commented on PDFBOX-5868: So shall I open this as a feature/improvement type issue then?
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878081#comment-17878081 ] Manish S N commented on PDFBOX-5868: {quote}this is a different problem{quote} Initially, yes, it was. Then upon closer inspection I saw a solution that relates to this existing problem ;) P.S.: I know the cmap is incomplete and no library can extract it (including Adobe). But should we follow the cmap (and the other libraries XD)? I can open the extracted font in FontForge (View > Combinations) to find the Unicode combination for the glyphs that are ligatures, so we would no longer need to rely on ActualText data to process these ligatures.
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878077#comment-17878077 ] Manish S N commented on PDFBOX-5868: It is the modern word processors that put the actual Unicode of glyphs in ActualText tags, but in older PDFs with embedded CID fonts like [^page.pdf] there are no such tags and the ligatures are left without Unicode mappings (there is no single Unicode code point for these). But when I extracted the font files using pdfbox ( [code|https://gist.githubusercontent.com/incubated-geek-cc/640a74920b184274374af257cd1587bb/raw/c6fb02fa82f9883670d96b812bfe7f2f55b18125/Main.java] ) and analysed them using FontForge, I found that the font files do contain data about these ligatures. So if we can process that data and assign these glyphs the data from the font (the Unicode combination for that ligature), there is no need to worry about parsing ActualText, and I believe it would improve text extraction for non-Latin languages to a great extent. It would thus also be a solution to your problem of ActualText being misused to prevent text extraction [~tilman].
[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878076#comment-17878076 ] Tilman Hausherr edited comment on PDFBOX-5868 at 8/30/24 11:50 AM: --- Please create a new ticket for the file you just added because this is a different problem (only if you manage to extract this properly from Adobe Reader). was (Author: tilman): Please create a new ticket for the file you just added because this is a different problem.
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878076#comment-17878076 ]

Tilman Hausherr commented on PDFBOX-5868:
-

Please create a new ticket for the file you just added because this is a different problem.
[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manish S N updated PDFBOX-5868:
---
    Attachment: page.pdf
[jira] [Commented] (PDFBOX-5660) Improve code quality (5)
[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878069#comment-17878069 ]

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920288 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1920288 ]

PDFBOX-5660: update ant

> Improve code quality (5)
> 
>
>                 Key: PDFBOX-5660
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5660
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Tilman Hausherr
>            Priority: Minor
>         Attachments: AnnotationSample.Standard.pdf,
> DRY_refactoring_Typ2CharStringParser.patch,
> Removed_the_readFully_method_in_the_PfbParser_class_and_replaced__with_calling_readAllByte.patch,
> Simplify_list_and_map_operations,_use_known_size_when_creating_StringBuilder.patch,
> Simplify_string_conversion_in_PDFHighlighter.patch,
> Update_string_handling_and_regex_in_several_classes.patch,
> avoid_multiple_unboxing.patch, code_cleanup.patch,
> do_not_create_temporary_File_instance.patch,
> extract_common_code,_move_toUpperCase()_out_of_loop.patch,
> fix_HTML_error_in_Javadoc.patch, fix_javadoc_problems.patch,
> introduce_COSArray_of(float___)_to_make_the_code_more_concise_and_avoid_creating_and_copyi.patch,
> introduce_StringUtil_class_for_reusable_functionality.patch,
> introduce_constants_COSFLOAT_ZERO_and_COSFloat_ONE_to_avoid_creating_unnecessary_instances.patch,
> make_inner_class_static.patch, refactor_isEndOfName.patch,
> remove_code_duplication_in_Type2CharStringParser.patch,
> remove_obsolete_class_NullOutputStream.patch,
> remove_unnecessary_calls_to_toString()_String_valueOf().patch,
> replace_System_getProperty()_calls.patch, screenshot-1.png,
> simplify_hashCode()_and_equals(),_test_name_first_because_Map_equals()_is_expensive.patch,
> simplify_stream_operations.patch, use_Map_ofEntries().patch,
> use_Math_min()_to_make_code_more_readable.patch, use_Objects_equals().patch,
> use_String_isEmpty()_Collection_isEmpty()_instead_of_checking_length_size.patch,
> use_String_join().patch, use_switch_for_readability.patch,
> use_try-with-resources_(since_Java_9_the_variable_declaration_in_the_try_is_not_necessary_.patch
>
>
> This is a long-term issue for the task of improving code quality, using the SonarQube report, hints in different IDEs, the FindBugs tool and other code quality tools.
> This is a follow-up of PDFBOX-4892, which was getting too long.
[jira] [Commented] (PDFBOX-5660) Improve code quality (5)
[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878070#comment-17878070 ]

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920289 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920289 ]

PDFBOX-5660: update ant
[jira] [Commented] (PDFBOX-5660) Improve code quality (5)
[ https://issues.apache.org/jira/browse/PDFBOX-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877373#comment-17877373 ]

ASF subversion and git services commented on PDFBOX-5660:
-

Commit 1920252 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1920252 ]

PDFBOX-5660: update maven-plugin-annotations, mockito
[jira] [Resolved] (PDFBOX-5874) Change Loglevel from Warn to info when rebuilding font cache
[ https://issues.apache.org/jira/browse/PDFBOX-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-5874.
-
      Assignee: Tilman Hausherr
    Resolution: Fixed

Thank you, you're right, there's no need to warn about something that harmless.

> Change Loglevel from Warn to info when rebuilding font cache
> 
>
>                 Key: PDFBOX-5874
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5874
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>    Affects Versions: 2.0.32, 3.0.3 PDFBox
>            Reporter: Thomas Hoffmann
>            Assignee: Tilman Hausherr
>            Priority: Minor
>             Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>
> We have a monitoring system for our logfiles, and some people get notified whenever there is an error or a warning in the logfiles.
> Due to OS updates, the fonts might be updated or changed. This triggers a rebuild process within PDFBox. Unfortunately, the loglevel is set to Warning and this triggers an alarm.
> The warnings occur in:
> org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java
> The logfile shows the following three entries:
> 2024-08-19T18:25:03.653+02:00 WARN FileSystemFontProvider: New fonts found, font cache will be re-built
> 2024-08-19T18:25:03.654+02:00 WARN FileSystemFontProvider: Building on-disk font cache, this may take a while
> 2024-08-19T18:25:04.105+02:00 WARN FileSystemFontProvider: Finished building on-disk font cache, found 96 fonts
>
> IMHO the message is informational and not necessarily a warning. It just gives me the information that the cache is getting rebuilt.
> It would be great if you could consider setting these messages to info level.
[jira] [Commented] (PDFBOX-5874) Change Loglevel from Warn to info when rebuilding font cache
[ https://issues.apache.org/jira/browse/PDFBOX-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877357#comment-17877357 ]

ASF subversion and git services commented on PDFBOX-5874:
-

Commit 1920251 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1920251 ]

PDFBOX-5874: change Loglevel from warn to info when rebuilding font cache, as suggested by Thomas Hoffmann