from:"\"Tilman Hausherr\""



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5882.
-
  Assignee: Tilman Hausherr
Resolution: Fixed

I've added a warning. My original idea was to throw an exception, but these 
might happen in rendering in some cases so the warning is a compromise.
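
For readers following along, here is a rough sketch of the "warn instead of throw" compromise. It is not the code committed to PDFBox: the class name, method name, and message text are made up for illustration, and it assumes the check compares the number of supplied color components with the number the underlying color space expects.

```java
import java.util.logging.Logger;

// Hypothetical helper, not part of PDFBox: illustrates logging a warning
// rather than throwing when a color carries an unexpected component count.
final class ComponentCountCheck {
    private static final Logger LOG =
            Logger.getLogger(ComponentCountCheck.class.getName());

    /**
     * Returns true when the component count matches the expected count.
     * On a mismatch it logs a warning and returns false; no exception is
     * thrown, so callers (for example a renderer) can keep going.
     */
    static boolean matchesExpected(float[] components, int expected) {
        if (components.length != expected) {
            LOG.warning("Color has " + components.length
                    + " components, but the color space expects " + expected);
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Four components (RGB plus alpha) against a 3-component RGB space.
        System.out.println(matchesExpected(new float[] {1f, 0f, 0f, 1f}, 3));
        // Three components against the same space.
        System.out.println(matchesExpected(new float[] {1f, 0f, 0f}, 3));
    }
}
```

The boolean return value is only there to make the sketch easy to exercise; a real check could just log.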

> The pattern created with PDFBox shows inconsistent colors between Safari and 
> Adobe.
> ---
>
> Key: PDFBOX-5882
> URL: https://issues.apache.org/jira/browse/PDFBOX-5882
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 2.0.24, 2.0.32, 3.0.3 PDFBox
> Reporter: bai yuan
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: excel_pattern_fill-fixed.pdf, excel_pattern_fill.pdf, 
> image-2024-10-08-16-04-32-344.png, image-2024-10-08-16-04-49-033.png
>
>
> The pattern created with PDFBox shows inconsistent colors between Safari and 
> Adobe.
> It appears red in Adobe and Chrome, which is correct.
> It appears blue in Safari, which is incorrect.
> Here is the example code:
> {code:java}
> try (PDDocument document = new PDDocument()) {
>     PDPage page = new PDPage();
>     document.addPage(page);
>
>     try (PDPageContentStream contentStream = new PDPageContentStream(document, page)) {
>         PDTilingPattern pattern = new PDTilingPattern();
>         pattern.setBBox(new PDRectangle(3, 3));
>         pattern.setPaintType(PDTilingPattern.PAINT_UNCOLORED);
>         pattern.setTilingType(PDTilingPattern.TILING_CONSTANT_SPACING);
>         pattern.setXStep(3);
>         pattern.setYStep(3);
>         pattern.setMatrix(Matrix.getScaleInstance(1, 1).createAffineTransform());
>         try (PDPatternContentStream patternContentStream = new PDPatternContentStream(pattern)) {
>             patternContentStream.setLineWidth(0.4f);
>             patternContentStream.moveTo(0, 2);
>             patternContentStream.lineTo(0, 3);
>             patternContentStream.lineTo(2, 3);
>             patternContentStream.lineTo(2, 2);
>             patternContentStream.lineTo(3, 2);
>             patternContentStream.lineTo(3, 0);
>             patternContentStream.lineTo(2, 0);
>             patternContentStream.lineTo(2, 1);
>             patternContentStream.lineTo(1, 1);
>             patternContentStream.lineTo(1, 2);
>             patternContentStream.closePath();
>             patternContentStream.fill();
>         } catch (IOException e) {
>             throw new RuntimeException(e);
>         }
>         COSName patternName = page.getResources().add(pattern);
>         PDPattern pdPattern = new PDPattern(page.getResources(), PDDeviceRGB.INSTANCE);
>         PDColor pdColor = new PDColor(Color.RED.getComponents(null), patternName, pdPattern);
>         contentStream.setNonStrokingColor(pdColor);
>         contentStream.addRect(100, 500, 400, 200);
>         contentStream.fill();
>     }
>     document.save("excel_pattern_fill.pdf");
> }
> {code}
> **Safari:**
>  !image-2024-10-08-16-04-32-344.png! 
> Adobe:
>  !image-2024-10-08-16-04-49-033.png! 
> The exported pdf file : excel_pattern_fill.pdf



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

[jira] [Updated] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5882:

Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0


[jira] [Updated] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5882:

Component/s: PDModel


[jira] [Updated] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5882:

Affects Version/s: 3.0.3 PDFBox
   2.0.32


[jira] [Comment Edited] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.



[ 
https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887521#comment-17887521
 ] 

Tilman Hausherr edited comment on PDFBOX-5882 at 10/8/24 9:25 AM:
--

I'm testing adding a check in PDColor constructor.




[jira] [Commented] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.



[ 
https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887521#comment-17887521
 ] 

Tilman Hausherr commented on PDFBOX-5882:
-

I'm testing adding check in PDColor constructor.


[jira] [Updated] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5882:

Attachment: excel_pattern_fill-fixed.pdf


[jira] [Commented] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.



[ 
https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887517#comment-17887517
 ] 

Tilman Hausherr commented on PDFBOX-5882:
-

Here's a file generated with the correct code:  [^excel_pattern_fill-fixed.pdf] 


[jira] [Comment Edited] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.



[ 
https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887515#comment-17887515
 ] 

Tilman Hausherr edited comment on PDFBOX-5882 at 10/8/24 8:28 AM:
--

That's because {{Color.RED.getComponents(null)}} returns a 4-component array, 
which includes the alpha, so this appears as {{1 0 0 1 /p1 scn}} in the PDF (4 
components instead of 3). I suspect Safari uses the last 3 components (0 0 1, 
which is blue), while the others, including PDFBox, use the first 3 (1 0 0, 
which is red). Use {{new float[] \{1,0,0\}}} instead.
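
To make the component counts concrete, here is a small sketch that uses only java.awt.Color, so it runs without PDFBox. The class and method names are made up for this example. It contrasts getComponents(null), which appends alpha, with getRGBColorComponents(null), which returns RGB only and would be one way to get a 3-element array from an existing Color.

```java
import java.awt.Color;
import java.util.Arrays;

// Illustrative only: class and method names are made up for this example.
final class ColorComponentsDemo {

    /** Color components plus alpha: four values for an sRGB color. */
    static float[] withAlpha(Color color) {
        return color.getComponents(null);
    }

    /** Red, green, and blue only: three values. */
    static float[] rgbOnly(Color color) {
        return color.getRGBColorComponents(null);
    }

    public static void main(String[] args) {
        // Prints [1.0, 0.0, 0.0, 1.0]: the trailing 1.0 is the alpha value.
        System.out.println(Arrays.toString(withAlpha(Color.RED)));
        // Prints [1.0, 0.0, 0.0]: the right size for a DeviceRGB color.
        System.out.println(Arrays.toString(rgbOnly(Color.RED)));
    }
}
```

Passing the 3-element array (or a literal such as new float[] {1, 0, 0}) to the PDColor constructor avoids writing a fourth operand before scn.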




Re: Apache PDFBox Board Report October 2024 due

2024-10-07 Thread Tilman Hausherr


+1

Tilman

On 07.10.2024 17:37, Andreas Lehmkühler wrote:

Hi,

find attached a quick draft of the board report we're expected to 
submit this month. It's based upon the report wizard template which 
can be found at [1]


Any comments or additions are appreciated ...

Sorry for the short notice, but I wasn't able to prepare a report 
earlier due to some personal reasons.



## Description:
The mission of PDFBox is the creation and maintenance of software related to
Java library for working with PDF documents

## Project Status:
Current project status: Ongoing with moderate activity
Issues for the board: none

## Membership Data:
Apache PDFBox was founded 2009-10-21 (15 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Matthäus Mayer on 2017-10-16.
- No new committers. Last addition was Joerg O. Henne on 2017-10-09.

## Project Activity:
Recent releases:

    3.0.3 was released on 2024-08-08.
    2.0.32 was released on 2024-07-24.
    2.0.31 was released on 2024-03-24.

## Community Health:
- there is a steady stream of contributions, bug reports and questions on the
  mailing lists
- it was a quieter quarter due to the holiday season
- another 3.0.x and 2.0.x release will most likely be published before xmas


[jira] [Resolved] (PDFBOX-5881) CVE for Lucene libraries

2024-10-04 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5881.
-
Resolution: Fixed

> CVE for Lucene libraries
> 
>
> Key: PDFBOX-5881
> URL: https://issues.apache.org/jira/browse/PDFBOX-5881
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Minor
> Fix For: 2.0.33, 3.0.4 PDFBox
>
>
> It looks like Lucene won't make any older jar files that fix 
> CVE-2024-45772, so I'll add a suppression file.




[jira] [Created] (PDFBOX-5881) CVE for Lucene libraries

2024-10-04 Thread Tilman Hausherr (Jira)

Tilman Hausherr created PDFBOX-5881:
---

 Summary: CVE for Lucene libraries
 Key: PDFBOX-5881
 URL: https://issues.apache.org/jira/browse/PDFBOX-5881
 Project: PDFBox
 Issue Type: Bug
 Affects Versions: 3.0.3 PDFBox, 2.0.32
 Reporter: Tilman Hausherr
 Assignee: Tilman Hausherr
 Fix For: 2.0.33, 3.0.4 PDFBox


It looks like Lucene won't make any older jar files that fix CVE-2024-45772, 
so I'll add a suppression file.




build timeout fails

2024-10-04 Thread Tilman Hausherr


https://issues.apache.org/jira/browse/INFRA-26175



[jira] [Comment Edited] (PDFBOX-4718) OutOfMemoryError - during renderImageWithDPI

2024-10-03 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886737#comment-17886737
 ] 

Tilman Hausherr edited comment on PDFBOX-4718 at 10/3/24 5:39 PM:
--

Sadly some differences in rendering: PDFBOX-2557, PDFBOX-3182, PDFBOX-5842 (VW 
logo missing), PDFBOX-3116.pdf (half-circles bottom right)


was (Author: tilman):
Sadly some differences in rendering: PDFBOX-2557, PDFBOX-3182, PDFBOX-5842 (VW 
logo missing), PDFBOX-3116.pdf (circles bottom right)

> OutOfMemoryError - during renderImageWithDPI
> 
>
> Key: PDFBOX-4718
> URL: https://issues.apache.org/jira/browse/PDFBOX-4718
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.17, 3.0.3 PDFBox, 4.0.0
> Environment: macOS Mojave (10.14.6), Java 11.0.2 -Xmx10G -Xms10G
> Reporter: Serhii Kolesnyk
> Assignee: Andreas Lehmkühler
> Priority: Blocker
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: PDFBOX-4718-reduced.pdf, PDFBox4718Intersect.java, 
> example.pdf, image-2019-12-19-05-55-57-648.png
>
>
> While rendering a PDF we receive _java.lang.OutOfMemoryError: Java heap space_
> {code:java}
> Exception in thread "AWT-Shutdown" java.lang.OutOfMemoryError: Java heap space
>  at java.desktop/sun.awt.AppContext.getAppContexts(AppContext.java:167)
>  at java.desktop/sun.awt.AppContext.stopEventDispatchThreads(AppContext.java:610)
>  at java.desktop/sun.awt.AWTAutoShutdown.run(AWTAutoShutdown.java:322)
>  at java.base/java.lang.Thread.run(Thread.java:834)
> java.lang.OutOfMemoryError: Java heap space
>  at java.desktop/sun.awt.geom.AreaOp.pruneEdges(AreaOp.java:362)
>  at java.desktop/sun.awt.geom.AreaOp.calculate(AreaOp.java:159)
>  at java.desktop/java.awt.geom.Area.intersect(Area.java:293)
>  at org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState.intersectClippingPath(PDGraphicsState.java:618)
>  at org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState.intersectClippingPath(PDGraphicsState.java:597)
>  at org.apache.pdfbox.rendering.PageDrawer.endPath(PageDrawer.java:936)
>  at org.apache.pdfbox.contentstream.operator.graphics.EndPath.process(EndPath.java:35)
>  at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
>  at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
>  at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
>  at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
>  at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:262)
>  at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:314)
>  at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:243)
>  at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:229){code}
> We tried different MemoryUsageSetting options (TempFileOnly, MainMemoryOnly) 
> and different DPI settings.




[jira] [Commented] (PDFBOX-4718) OutOfMemoryError - during renderImageWithDPI

2024-10-03 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886737#comment-17886737
 ] 

Tilman Hausherr commented on PDFBOX-4718:
-

Sadly some differences in rendering: PDFBOX-2557, PDFBOX-3182, PDFBOX-5842 (VW 
logo missing), PDFBOX-3116.pdf (circles bottom right)

> OutOfMemoryError - during renderImageWithDPI
> 
>
> Key: PDFBOX-4718
> URL: https://issues.apache.org/jira/browse/PDFBOX-4718
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.17, 3.0.3 PDFBox, 4.0.0
> Environment: macOS Mojave (10.14.6)
> Java 11.0.2 -Xmx10G -Xms10G
>Reporter: Serhii Kolesnyk
>Assignee: Andreas Lehmkühler
>Priority: Blocker
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: PDFBOX-4718-reduced.pdf, PDFBox4718Intersect.java, 
> example.pdf, image-2019-12-19-05-55-57-648.png
>
>




[jira] [Commented] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected

2024-09-28 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885635#comment-17885635
 ] 

Tilman Hausherr commented on PDFBOX-5880:
-

Now it works!

> PDF render blank page: The end of the stream doesn't point to the correct 
> offset, using workaround to read the stream, stream start position: 196, 
> length: 0, expected end position: 196
> 
>
> Key: PDFBOX-5880
> URL: https://issues.apache.org/jira/browse/PDFBOX-5880
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Joseph Jezerinac
>Assignee: Andreas Lehmkühler
>Priority: Major
>  Labels: regression
> Fix For: 3.0.4 PDFBox
>
> Attachments: PDFBOX-1094-PDFBOX-269.pdf, test.pdf
>
>
> When rendering page one of the attached PDF the image does not render.
> In the logs, I see the following:
> {noformat}
> 2024-09-24 13:25:56:924 [main] WARN COSParser - The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected end position: 196
> 2024-09-24 13:25:56:930 [main] WARN PDFStreamEngine - Image stream is empty
> java.io.IOException: Image stream is empty
>   at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:182)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:477)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:438)
>   at org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1107)
> {noformat}
> I assume this is a bad PDF, but Acrobat, Chrome, etc., display it without an 
> issue.
> Here's the render code used:
> {code:java}
> File out = File.createTempFile("test-", ".png");
> PDDocument pdDocument = Loader.loadPDF(pdf);
> final PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
> ImageIO.write(pdfRenderer.renderImageWithDPI(0, 300), "png", out);
> {code}




[jira] [Updated] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected en

2024-09-27 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5880:

Attachment: PDFBOX-1094-PDFBOX-269.pdf

> PDF render blank page: The end of the stream doesn't point to the correct 
> offset, using workaround to read the stream, stream start position: 196, 
> length: 0, expected end position: 196
> 
>
> Key: PDFBOX-5880
> URL: https://issues.apache.org/jira/browse/PDFBOX-5880
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Joseph Jezerinac
>Assignee: Andreas Lehmkühler
>Priority: Major
>  Labels: regression
> Attachments: PDFBOX-1094-PDFBOX-269.pdf, test.pdf
>
>




[jira] [Commented] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected

2024-09-27 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885251#comment-17885251
 ] 

Tilman Hausherr commented on PDFBOX-5880:
-

Several differences, e.g. in [^PDFBOX-1094-PDFBOX-269.pdf] from page 2 onward 
the light background is different. The file from PDFBOX-1738 differs as well.

> PDF render blank page: The end of the stream doesn't point to the correct 
> offset, using workaround to read the stream, stream start position: 196, 
> length: 0, expected end position: 196
> 
>
> Key: PDFBOX-5880
> URL: https://issues.apache.org/jira/browse/PDFBOX-5880
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Joseph Jezerinac
>Assignee: Andreas Lehmkühler
>Priority: Major
>  Labels: regression
> Attachments: PDFBOX-1094-PDFBOX-269.pdf, test.pdf
>
>




[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading



[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884739#comment-17884739
 ] 

Tilman Hausherr commented on PDFBOX-5852:
-

All good now, thanks!

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 2.0.33, 3.0.3 PDFBox, 4.0.0
>
> Attachments: CIB-coonsmesh.pdf, minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to 
> images when the PDF contains type 4 shading.  This is especially noticeable 
> when the conversion is done with a high DPI.  Can this be improved?
>  
> Conversation from the PDFBox users mailing list follows
> Initial email:
> {quote}
> Hi CPU and memory usage when converting a PDF with type 4 shading
> Hello PDFBox users and maintainers,
> We have a PDF that causes performance problems when we use PDFBox to
> convert it to an image with renderImageWithDPI().  We're calling
> renderImageWithDPI()
> with 650 DPI.  I realize this is a very high value - we're using it for
> high fidelity original images that will later be downsampled.  On my work
> laptop which has fairly strong hardware, the conversion takes 25 minutes
> and consumes 20GB of memory.  CPU and memory usage is reduced if we use a
> lower DPI.
> The PDF is 1 page long.  It contains type 4 shading / Gouraud free form
> triangle meshes.  We've been aware of some performance issues with type 4
> shading for a little while now, but the PDFs that contained the type 4
> shading belonged to our customers and we were not authorized to share
> them.  We finally found a problem input document that is non-sensitive and
> that we are authorized to share.  I've attached a copy of the problem PDF
> to this email.
> I searched the archives for the users and the developers mailing list and I
> didn't find anything specifically about this issue.
> I searched through the PDFBox jira tickets and I found a couple of tickets
> that looked similar: PDFBOX-2901 & PDFBOX-4491.  PDFBOX-2901 seems to most
> closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
> and our issue still reproduces with PDFBox 2.0.28.
> Should I refer this issue over to the developers mailing list or create a
> PDFBox Jira ticket for this?
> Thanks and Regards,
> Larry Lynn {quote}
> Response:
> {quote}
> Hi,
> Yes shading can be very slow, especially at high dpi. The attachment 
> didn't get through, please upload to a sharehoster or create a ticket. 
> If you need to register then add a meaningful text, e.g. the subject of 
> this post so we know you're not a spammer. Also retry with 2.0.31 and 
> 3.0.2 just to be sure. However I'm pessimistic that this can be fixed.
> Tilman {quote}
>  
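As a rough check on the 650 DPI figure in the quoted email, here is a small sketch that estimates the size of the output raster alone. It assumes the default 72 user-space units per inch, a hypothetical US Letter page (612 x 792 pt, the real page size is not given in the report) and 4 bytes per pixel; the class and method names are made up for illustration and are not PDFBox API.

```java
public class RasterSizeEstimate {
    // Default PDF user space: 72 units per inch.
    static final double UNITS_PER_INCH = 72.0;

    // Pixel count along one edge for a length in points at a given DPI.
    static long pixels(double points, double dpi) {
        return (long) Math.ceil(points * dpi / UNITS_PER_INCH);
    }

    // Bytes for an uncompressed raster with the given bytes per pixel.
    static long rasterBytes(double widthPt, double heightPt, double dpi, int bytesPerPixel) {
        return pixels(widthPt, dpi) * pixels(heightPt, dpi) * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Hypothetical US Letter page; 4 bytes per pixel is an assumption as well.
        for (double dpi : new double[] {300, 650}) {
            System.out.printf("%.0f DPI: %d x %d px, %d bytes%n", dpi,
                    pixels(612, dpi), pixels(792, dpi), rasterBytes(612, 792, dpi, 4));
        }
    }
}
```

Under these assumptions the 650 DPI raster is about 158 MB, roughly 4.7 times the 300 DPI one, so the bitmap itself would explain only a small part of the reported 20 GB.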




[jira] [Commented] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected



[ 
https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884548#comment-17884548
 ] 

Tilman Hausherr commented on PDFBOX-5880:
-

The proposed change is to add {{stream.setLong(COSName.LENGTH, streamLength);}} 
or to change the foreach loop so that it doesn't overwrite the length entry.
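
The two options can be sketched without PDFBox on the classpath. This is only an illustration using a plain map in place of COSStream and COSDictionary; the class name, method names and the "Length" key are stand-ins, not the actual patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LengthFixSketch {
    static final String LENGTH = "Length"; // stand-in for COSName.LENGTH

    // Option 1: copy every entry, then set the known length again afterwards.
    static Map<String, Long> copyThenReset(Map<String, Long> dictionary, long streamLength) {
        Map<String, Long> stream = new LinkedHashMap<>();
        stream.put(LENGTH, streamLength);
        dictionary.forEach(stream::put);
        stream.put(LENGTH, streamLength); // restore the length the parser determined
        return stream;
    }

    // Option 2: skip the length entry while copying.
    static Map<String, Long> copyExceptLength(Map<String, Long> dictionary, long streamLength) {
        Map<String, Long> stream = new LinkedHashMap<>();
        stream.put(LENGTH, streamLength);
        dictionary.forEach((key, value) -> {
            if (!LENGTH.equals(key)) {
                stream.put(key, value);
            }
        });
        return stream;
    }

    public static void main(String[] args) {
        Map<String, Long> dictionary = new LinkedHashMap<>();
        dictionary.put(LENGTH, 0L);    // wrong length as stored in the file
        dictionary.put("Width", 100L);
        System.out.println(copyThenReset(dictionary, 4096L).get(LENGTH));
        System.out.println(copyExceptLength(dictionary, 4096L).get(LENGTH));
    }
}
```

Both variants keep the other dictionary entries and end up with the parser's length; which one fits better depends on whether anything else relies on the copy order.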

> PDF render blank page: The end of the stream doesn't point to the correct 
> offset, using workaround to read the stream, stream start position: 196, 
> length: 0, expected end position: 196
> 
>
> Key: PDFBOX-5880
> URL: https://issues.apache.org/jira/browse/PDFBOX-5880
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: regression
> Attachments: test.pdf
>
>




[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading



[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884540#comment-17884540
 ] 

Tilman Hausherr commented on PDFBOX-5852:
-

E.g. with this file: [^CIB-coonsmesh.pdf] 

ArrayIndexOutOfBoundsException: Index 400 out of bounds for length 400
  at org.apache.pdfbox.pdmodel.graphics.shading.PatchMeshesShadingContext.calcPixelTableArray(PatchMeshesShadingContext.java:67)
  at org.apache.pdfbox.pdmodel.graphics.shading.TriangleBasedShadingContext.createPixelTable(TriangleBasedShadingContext.java:67)
  at org.apache.pdfbox.pdmodel.graphics.shading.PatchMeshesShadingContext.<init>(PatchMeshesShadingContext.java:57)
  at org.apache.pdfbox.pdmodel.graphics.shading.Type6ShadingContext.<init>(Type6ShadingContext.java:45)
  at org.apache.pdfbox.pdmodel.graphics.shading.Type6ShadingPaint.createContext(Type6ShadingPaint.java:63)

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 2.0.33, 3.0.3 PDFBox, 4.0.0
>
> Attachments: CIB-coonsmesh.pdf, minimal.pdf
>
>




[jira] [Updated] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5852:

Attachment: CIB-coonsmesh.pdf

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 2.0.33, 3.0.3 PDFBox, 4.0.0
>
> Attachments: CIB-coonsmesh.pdf, minimal.pdf
>
>




[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading



[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884533#comment-17884533
 ] 

Tilman Hausherr commented on PDFBOX-5852:
-

Lots of regressions. I need to check whether this is because of another change 
I just made, or whether the first test ran without the new code activated.

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 2.0.33, 3.0.3 PDFBox, 4.0.0
>
> Attachments: minimal.pdf
>
>




[jira] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading



[ https://issues.apache.org/jira/browse/PDFBOX-5852 ]


Tilman Hausherr deleted comment on PDFBOX-5852:
-

was (Author: tilman):
No regressions 👍

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 2.0.33, 3.0.3 PDFBox, 4.0.0
>
> Attachments: minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to 
> images when the PDF contains type 4 shading.  This is especially noticeable 
> when the conversion is done with a high DPI.  Can this be improved?
>  
> Conversation from the PDFBox users mailing list follows
> Initial email:
> {quote}
> Hi CPU and memory usage when converting a PDF with type 4 shadingHello PDFBox 
> users and maintainers,
> We have a PDF that causes performance problems when we use PDFBox to
> convert it to an image with renderImageWithDPI().  We're calling
> renderImageWithDPI()
> with 650 DPI.  I realize this is a very high value - we're using it for
> high fidelity original images that will later be downsampled.  On my work
> laptop which has fairly strong hardware, the conversion takes 25 minutes
> and consumes 20GB of memory.  CPU and memory usage is reduced if we use a
> lower DPI.
> The PDF is 1 page long.  It contains type 4 shading / Gouraud free form
> triangle meshes.  We've been aware of some performance issues with type 4
> shading for a little while now, but the PDFs that contained the type 4
> shading belonged to our customers and we were not authorized to share
> them.  We finally found a problem input document that is non-sensitive and
> that we are authorized to share.  I've attached a copy of the problem PDF
> to this email.
> I searched the archives for the users and the developers mailing list and I
> didn't find anything specifically about this issue.
> I searched through the PDFBox jira tickets and I found a couple of tickets
> that looked similar: PDFBOX-2901 & PDFBOX-4491.  PDFBOX-2901 seems to most
> closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
> and our issue still reproduces with PDFBox 2.0.28.
> Should I refer this issue over to the developers mailing list or create a
> PDFBox Jira ticket for this?
> Thanks and Regards,
> Larry Lynn {quote}
> Response:
> {quote}
> Hi,
> Yes, shading can be very slow, especially at high DPI. The attachment 
> didn't get through; please upload it to a file-sharing host or create a ticket. 
> If you need to register, add some meaningful text, e.g. the subject of 
> this post, so we know you're not a spammer. Also retry with 2.0.31 and 
> 3.0.2 just to be sure. However, I'm pessimistic that this can be fixed.
> Tilman {quote}
>  
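
The DPI dependence described in the quoted email can be sized with simple arithmetic. The sketch below is only an illustration: the report does not state the page size, so a US Letter page (612 x 792 points) is assumed, and the class and method names are made up for this note. It relies on the fact that PDF user space has 72 units per inch, so the number of output pixels grows with the square of the DPI.

```java
// Back-of-the-envelope raster size; the page size is an assumption, not taken from the report.
final class RasterSizeSketch {

    // Pixels along one edge: PDF user space has 72 units per inch.
    static long pixels(double pointsAlongEdge, double dpi) {
        return Math.round(pointsAlongEdge / 72.0 * dpi);
    }

    // Bytes for an uncompressed raster with the given number of bytes per pixel.
    static long rasterBytes(double widthPt, double heightPt, double dpi, int bytesPerPixel) {
        return pixels(widthPt, dpi) * pixels(heightPt, dpi) * bytesPerPixel;
    }

    public static void main(String[] args) {
        double widthPt = 612;   // assumed US Letter width in points
        double heightPt = 792;  // assumed US Letter height in points
        for (double dpi : new double[] {72, 300, 650}) {
            System.out.printf("%.0f dpi: %d x %d px, %d bytes at 4 bytes/pixel%n",
                    dpi, pixels(widthPt, dpi), pixels(heightPt, dpi),
                    rasterBytes(widthPt, heightPt, dpi, 4));
        }
    }
}
```

Under these assumptions a 650 DPI page is 5525 x 7150 pixels, roughly 158 MB at 4 bytes per pixel. That only sizes the output raster; it says nothing about what the shading code allocates on top of it.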



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

[jira] [Commented] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected



[ 
https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884531#comment-17884531
 ] 

Tilman Hausherr commented on PDFBOX-5880:
-

The problem is here:
{code:java}
    // Creates a stream object backed by a view of the bytes at the given position.
    public COSStream createCOSStream(COSDictionary dictionary, long startPosition,
            long streamLength) throws IOException
    {
        COSStream stream = new COSStream(streamCache,
                parser.createRandomAccessReadView(startPosition, streamLength));
        // Copies every dictionary entry into the stream, including its length entry.
        dictionary.forEach(stream::setItem);
        stream.setKey(dictionary.getKey());
        return stream;
    }
{code}
The forEach loop overwrites the length. For some reason this didn't cause 
trouble in the past with wrong lengths, only this time, with a zero length that 
is an indirect object.
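
To make the effect concrete, here is a minimal model of the problem in plain Java. It does not use PDFBox classes: the maps stand in for the dictionary and the stream, and the key name "Length" and both helper names are assumptions for illustration only. It shows how copying every dictionary entry into a stream that already carries a measured length puts the bad value back, and one way a copy could skip that key. It is not the fix that was applied to PDFBox.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for the behaviour described above; not PDFBox code.
final class LengthOverwriteSketch {

    // Copies every entry, so a stale "Length" in the dictionary wins.
    static Map<String, Object> copyAll(Map<String, Object> dictionary, long actualLength) {
        Map<String, Object> stream = new LinkedHashMap<>();
        stream.put("Length", actualLength);   // length of the data that was really read
        dictionary.forEach(stream::put);      // overwrites "Length" with the dictionary's value
        return stream;
    }

    // Copies every entry except "Length", so the measured value survives.
    static Map<String, Object> copyKeepingLength(Map<String, Object> dictionary, long actualLength) {
        Map<String, Object> stream = new LinkedHashMap<>();
        stream.put("Length", actualLength);
        dictionary.forEach((key, value) -> {
            if (!"Length".equals(key)) {
                stream.put(key, value);
            }
        });
        return stream;
    }

    public static void main(String[] args) {
        Map<String, Object> dictionary = new LinkedHashMap<>();
        dictionary.put("Type", "XObject");
        dictionary.put("Length", 0L);         // wrong length, as in the reported file
        System.out.println(copyAll(dictionary, 695645L).get("Length"));
        System.out.println(copyKeepingLength(dictionary, 695645L).get("Length"));
    }
}
```

The value 695645 is the corrected length that 1.8.16 logged for this file; everything else in the sketch is hypothetical.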

> PDF render blank page: The end of the stream doesn't point to the correct 
> offset, using workaround to read the stream, stream start position: 196, 
> length: 0, expected end position: 196
> 
>
> Key: PDFBOX-5880
> URL: https://issues.apache.org/jira/browse/PDFBOX-5880
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: regression
> Attachments: test.pdf
>
>
> When rendering page one of the attached PDF the image does not render.
> In the logs, I see the following:
> {noformat}
> 2024-09-24 13:25:56:924 [main] WARN COSParser - The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected end position: 196
> 2024-09-24 13:25:56:930 [main] WARN PDFStreamEngine - Image stream is empty
> java.io.IOException: Image stream is empty
>   at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:182)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:477)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:438)
>   at org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1107)
> {noformat}
> I assume this is a bad PDF, but Acrobat, Chrome, etc., display it without an 
> issue.
> Here's the render code used:
> {code:java}
> File out = File.createTempFile("test-", ".png");
> PDDocument pdDocument = Loader.loadPDF(pdf);
> final PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
> ImageIO.write(pdfRenderer.renderImageWithDPI(0, 300), "png", out);
> {code}




[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading



[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884528#comment-17884528
 ] 

Tilman Hausherr commented on PDFBOX-5852:
-

No regressions 👍

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 2.0.33, 3.0.3 PDFBox, 4.0.0
>
> Attachments: minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to 
> images when the PDF contains type 4 shading.  This is especially noticeable 
> when the conversion is done with a high DPI.  Can this be improved?
>  
> Conversation from the PDFBox users mailing list follows
> Initial email:
> {quote}
> Hi CPU and memory usage when converting a PDF with type 4 shading
> Hello PDFBox users and maintainers,
> We have a PDF that causes performance problems when we use PDFBox to
> convert it to an image with renderImageWithDPI().  We're calling
> renderImageWithDPI()
> with 650 DPI.  I realize this is a very high value - we're using it for
> high fidelity original images that will later be downsampled.  On my work
> laptop which has fairly strong hardware, the conversion takes 25 minutes
> and consumes 20GB of memory.  CPU and memory usage is reduced if we use a
> lower DPI.
> The PDF is 1 page long.  It contains type 4 shading / Gouraud free form
> triangle meshes.  We've been aware of some performance issues with type 4
> shading for a little while now, but the PDFs that contained the type 4
> shading belonged to our customers and we were not authorized to share
> them.  We finally found a problem input document that is non-sensitive and
> that we are authorized to share.  I've attached a copy of the problem PDF
> to this email.
> I searched the archives for the users and the developers mailing list and I
> didn't find anything specifically about this issue.
> I searched through the PDFBox jira tickets and I found a couple of tickets
> that looked similar: PDFBOX-2901 & PDFBOX-4491.  PDFBOX-2901 seems to most
> closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
> and our issue still reproduces with PDFBox 2.0.28.
> Should I refer this issue over to the developers mailing list or create a
> PDFBox Jira ticket for this?
> Thanks and Regards,
> Larry Lynn {quote}
> Response:
> {quote}
> Hi,
> Yes, shading can be very slow, especially at high DPI. The attachment 
> didn't get through; please upload it to a file-sharing host or create a ticket. 
> If you need to register, add some meaningful text, e.g. the subject of 
> this post, so we know you're not a spammer. Also retry with 2.0.31 and 
> 3.0.2 just to be sure. However, I'm pessimistic that this can be fixed.
> Tilman {quote}
>  




[jira] [Comment Edited] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expe



[ 
https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884492#comment-17884492
 ] 

Tilman Hausherr edited comment on PDFBOX-5880 at 9/25/24 3:55 AM:
--

The PDF image stream has an (incorrect) length of 0. The workaround fails for 
some reason. Amusingly, this worked in 1.8.16, which displays the message 
"WARNUNG: /Length of COSObject{1, 0} corrected from 0 to 695645".


was (Author: tilman):
The image has an (incorrect) length of 0. The workaround fails for some reason. 
Amusingly, this worked in 1.8.16, which displays the message "WARNUNG: /Length 
of COSObject{1, 0} corrected from 0 to 695645".
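
The 1.8.16 message above shows a length being corrected from 0 to 695645. One general way to recover a stream length when the declared value is wrong is to scan forward from the start of the data for the endstream keyword. The sketch below illustrates that idea on a byte array. It is an assumption about the technique in general, not a description of what 1.8.16 or the current workaround does, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Illustrative recovery of a stream length by keyword search; not PDFBox's implementation.
final class StreamLengthRecoverySketch {

    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);

    // Returns the number of bytes between start and the next "endstream" keyword,
    // or -1 if the keyword is not found. One end-of-line marker (CR LF, LF, or CR)
    // directly before the keyword is not counted as stream data.
    static int recoverLength(byte[] data, int start) {
        for (int i = start; i + ENDSTREAM.length <= data.length; i++) {
            if (matchesAt(data, i)) {
                int end = i;
                if (end > start && data[end - 1] == '\n') {
                    end--;
                }
                if (end > start && data[end - 1] == '\r') {
                    end--;
                }
                return end - start;
            }
        }
        return -1;
    }

    // True if the keyword bytes appear in data starting at offset.
    private static boolean matchesAt(byte[] data, int offset) {
        for (int j = 0; j < ENDSTREAM.length; j++) {
            if (data[offset + j] != ENDSTREAM[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] file = "stream\nABCDEF\nendstream".getBytes(StandardCharsets.US_ASCII);
        System.out.println(recoverLength(file, 7));
    }
}
```

A plain keyword search like this can be fooled by binary data that happens to contain the same bytes, so it is a fallback at best.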

> PDF render blank page: The end of the stream doesn't point to the correct 
> offset, using workaround to read the stream, stream start position: 196, 
> length: 0, expected end position: 196
> 
>
> Key: PDFBOX-5880
> URL: https://issues.apache.org/jira/browse/PDFBOX-5880
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: regression
> Attachments: test.pdf
>
>
> When rendering page one of the attached PDF the image does not render.
> In the logs, I see the following:
> {noformat}
> 2024-09-24 13:25:56:924 [main] WARN COSParser - The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected end position: 196
> 2024-09-24 13:25:56:930 [main] WARN PDFStreamEngine - Image stream is empty
> java.io.IOException: Image stream is empty
>   at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:182)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:477)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:438)
>   at org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1107)
> {noformat}
> I assume this is a bad PDF, but Acrobat, Chrome, etc., display it without an 
> issue.
> Here's the render code used:
> {code:java}
> File out = File.createTempFile("test-", ".png");
> PDDocument pdDocument = Loader.loadPDF(pdf);
> final PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
> ImageIO.write(pdfRenderer.renderImageWithDPI(0, 300), "png", out);
> {code}




[jira] [Commented] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected



[ 
https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884492#comment-17884492
 ] 

Tilman Hausherr commented on PDFBOX-5880:
-

The image has an (incorrect) length of 0. The workaround fails for some reason. 
Amusingly, this worked in 1.8.16, which displays the message "WARNUNG: /Length 
of COSObject{1, 0} corrected from 0 to 695645".

> PDF render blank page: The end of the stream doesn't point to the correct 
> offset, using workaround to read the stream, stream start position: 196, 
> length: 0, expected end position: 196
> 
>
> Key: PDFBOX-5880
> URL: https://issues.apache.org/jira/browse/PDFBOX-5880
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: regression
> Attachments: test.pdf
>
>
> When rendering page one of the attached PDF the image does not render.
> In the logs, I see the following:
> {noformat}
> 2024-09-24 13:25:56:924 [main] WARN COSParser - The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected end position: 196
> 2024-09-24 13:25:56:930 [main] WARN PDFStreamEngine - Image stream is empty
> java.io.IOException: Image stream is empty
>   at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:182)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:477)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:438)
>   at org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1107)
> {noformat}
> I assume this is a bad PDF, but Acrobat, Chrome, etc., display it without an 
> issue.
> Here's the render code used:
> {code:java}
> File out = File.createTempFile("test-", ".png");
> PDDocument pdDocument = Loader.loadPDF(pdf);
> final PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
> ImageIO.write(pdfRenderer.renderImageWithDPI(0, 300), "png", out);
> {code}




[jira] [Updated] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected en



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5880:

Labels: regression  (was: )

> PDF render blank page: The end of the stream doesn't point to the correct 
> offset, using workaround to read the stream, stream start position: 196, 
> length: 0, expected end position: 196
> 
>
> Key: PDFBOX-5880
> URL: https://issues.apache.org/jira/browse/PDFBOX-5880
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: regression
> Attachments: test.pdf
>
>
> When rendering page one of the attached PDF the image does not render.
> In the logs, I see the following:
> {noformat}
> 2024-09-24 13:25:56:924 [main] WARN COSParser - The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected end position: 196
> 2024-09-24 13:25:56:930 [main] WARN PDFStreamEngine - Image stream is empty
> java.io.IOException: Image stream is empty
>   at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:182)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:477)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:438)
>   at org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1107)
> {noformat}
> I assume this is a bad PDF, but Acrobat, Chrome, etc., display it without an 
> issue.
> Here's the render code used:
> {code:java}
> File out = File.createTempFile("test-", ".png");
> PDDocument pdDocument = Loader.loadPDF(pdf);
> final PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
> ImageIO.write(pdfRenderer.renderImageWithDPI(0, 300), "png", out);
> {code}




[jira] [Updated] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected en



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5880:

Affects Version/s: 2.0.32

> PDF render blank page: The end of the stream doesn't point to the correct 
> offset, using workaround to read the stream, stream start position: 196, 
> length: 0, expected end position: 196
> 
>
> Key: PDFBOX-5880
> URL: https://issues.apache.org/jira/browse/PDFBOX-5880
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: test.pdf
>
>
> When rendering page one of the attached PDF the image does not render.
> In the logs, I see the following:
> {noformat}
> 2024-09-24 13:25:56:924 [main] WARN COSParser - The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected end position: 196
> 2024-09-24 13:25:56:930 [main] WARN PDFStreamEngine - Image stream is empty
> java.io.IOException: Image stream is empty
>   at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:182)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:477)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:438)
>   at org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1107)
> {noformat}
> I assume this is a bad PDF, but Acrobat, Chrome, etc., display it without an 
> issue.
> Here's the render code used:
> {code:java}
> File out = File.createTempFile("test-", ".png");
> PDDocument pdDocument = Loader.loadPDF(pdf);
> final PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
> ImageIO.write(pdfRenderer.renderImageWithDPI(0, 300), "png", out);
> {code}




[jira] [Updated] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected en



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5880:

Component/s: Parsing
 (was: Rendering)

> PDF render blank page: The end of the stream doesn't point to the correct 
> offset, using workaround to read the stream, stream start position: 196, 
> length: 0, expected end position: 196
> 
>
> Key: PDFBOX-5880
> URL: https://issues.apache.org/jira/browse/PDFBOX-5880
> Project: PDFBox
>  Issue Type: Bug
>  Components: Parsing
>Affects Versions: 3.0.3 PDFBox
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: test.pdf
>
>
> When rendering page one of the attached PDF the image does not render.
> In the logs, I see the following:
> {noformat}
> 2024-09-24 13:25:56:924 [main] WARN COSParser - The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected end position: 196
> 2024-09-24 13:25:56:930 [main] WARN PDFStreamEngine - Image stream is empty
> java.io.IOException: Image stream is empty
>   at org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:182)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:477)
>   at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:438)
>   at org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1107)
> {noformat}
> I assume this is a bad PDF, but Acrobat, Chrome, etc., display it without an 
> issue.
> Here's the render code used:
> {code:java}
> File out = File.createTempFile("test-", ".png");
> PDDocument pdDocument = Loader.loadPDF(pdf);
> final PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
> ImageIO.write(pdfRenderer.renderImageWithDPI(0, 300), "png", out);
> {code}




[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page



[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882327#comment-17882327
 ] 

Tilman Hausherr commented on PDFBOX-5879:
-

I added a simple test for the feature because it turns out we didn't have any. 
However, this isn't a test of the fixed bug; creating a file for that would have 
been more difficult, and there is no risk that this fix gets reverted anyway.

> Regression from PDFBOX-5841: Text extraction with rotation magic fails for 
> PDF with multiple content streams in a page
> --
>
> Key: PDFBOX-5879
> URL: https://issues.apache.org/jira/browse/PDFBOX-5879
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Gábor Stefanik
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: MVM_Aram_augusztus.pdf
>
>
> {code:java}
> java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic 
> -i="MVM_Aram_augusztus.pdf" {code}
> fails with the following error:
> {code:java}
> java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
>         at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>         at picocli.CommandLine.access$1500(CommandLine.java:148)
>         at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>         at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>         at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>         at picocli.CommandLine.execute(CommandLine.java:2174)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76) {code}
> The same command succeeds in 3.0.2.
> The triggering PDF can be downloaded from 
> [https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf]
> and is also attached.
> The root cause appears to be this change: 
> [https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2]
> from PDFBOX-5841.




[jira] [Comment Edited] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page



[ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882327#comment-17882327
 ] 

Tilman Hausherr edited comment on PDFBOX-5879 at 9/17/24 9:08 AM:
--

I added a simple test for the rotationMagic feature because it turns out we 
didn't have any. However, this isn't a test of the fixed bug; creating a file 
for that would have been more difficult, and there is no risk that this fix 
gets reverted anyway.


was (Author: tilman):
I added a simple test for the feature because it turns out we didn't have any. 
However, this isn't a test of the fixed bug; creating a file for that would have 
been more difficult, and there is no risk that this fix gets reverted anyway.

> Regression from PDFBOX-5841: Text extraction with rotation magic fails for 
> PDF with multiple content streams in a page
> --
>
> Key: PDFBOX-5879
> URL: https://issues.apache.org/jira/browse/PDFBOX-5879
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Gábor Stefanik
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: MVM_Aram_augusztus.pdf
>
>
> {code:java}
> java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic 
> -i="MVM_Aram_augusztus.pdf" {code}
> fails with the following error:
> {code:java}
> java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
>         at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>         at picocli.CommandLine.access$1500(CommandLine.java:148)
>         at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>         at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>         at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>         at picocli.CommandLine.execute(CommandLine.java:2174)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76) {code}
> The same command succeeds in 3.0.2.
> The triggering PDF can be downloaded from 
> [https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf]
> and is also attached.
> The root cause appears to be this change: 
> [https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2]
> from PDFBOX-5841.




[jira] [Resolved] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5879.
-
Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0
 Assignee: Tilman Hausherr
   Resolution: Fixed

Thank you. It's not the commit itself; it's poor programming that the commit 
exposed.
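
The ClassCastException quoted below (COSObject cannot be cast to COSArray) is the typical symptom of casting a value that may be an indirect reference without resolving it first. The sketch below shows that pattern with hypothetical stand-in classes (Base, Array, Ref); they are not the PDFBox classes, and this is not the actual change made for this issue, which is not shown in this thread.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for a COS-style object model; not the PDFBox classes.
final class DereferenceSketch {

    interface Base { }

    static final class Array implements Base {
        final List<Base> items;
        Array(List<Base> items) { this.items = items; }
    }

    // An indirect reference that wraps the object it points to.
    static final class Ref implements Base {
        private final Base target;
        Ref(Base target) { this.target = target; }
        Base resolve() { return target; }
    }

    // Follows indirect references until a direct object (or null) is reached.
    static Base dereference(Base value) {
        while (value instanceof Ref) {
            value = ((Ref) value).resolve();
        }
        return value;
    }

    // Returns the array behind the value, or null if it is not an array.
    static Array asArray(Base value) {
        Base direct = dereference(value);
        return direct instanceof Array ? (Array) direct : null;
    }

    public static void main(String[] args) {
        Base contents = new Ref(new Array(Collections.emptyList()));
        try {
            Array broken = (Array) contents;   // blind cast fails on the reference
            System.out.println(broken.items.size());
        } catch (ClassCastException e) {
            System.out.println("blind cast failed");
        }
        System.out.println(asArray(contents) != null);
    }
}
```

Checking the type after dereferencing, instead of casting directly, keeps the code working whether the value is stored directly or behind a reference.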

> Regression from PDFBOX-5841: Text extraction with rotation magic fails for 
> PDF with multiple content streams in a page
> --
>
> Key: PDFBOX-5879
> URL: https://issues.apache.org/jira/browse/PDFBOX-5879
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Gábor Stefanik
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: MVM_Aram_augusztus.pdf
>
>
> {code:java}
> java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic 
> -i="MVM_Aram_augusztus.pdf" {code}
> fails with the following error:
> {code:java}
> java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
>         at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>         at picocli.CommandLine.access$1500(CommandLine.java:148)
>         at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>         at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>         at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>         at picocli.CommandLine.execute(CommandLine.java:2174)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76) {code}
> The same command succeeds in 3.0.2.
> The triggering PDF can be downloaded from 
> [https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf]
> and is also attached.
> The root cause appears to be this change: 
> [https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2]
> from PDFBOX-5841.




[jira] [Updated] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5879:

Affects Version/s: 2.0.32

> Regression from PDFBOX-5841: Text extraction with rotation magic fails for 
> PDF with multiple content streams in a page
> --
>
> Key: PDFBOX-5879
> URL: https://issues.apache.org/jira/browse/PDFBOX-5879
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Gábor Stefanik
>Priority: Major
> Attachments: MVM_Aram_augusztus.pdf
>
>
> {code:java}
> java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic 
> -i="MVM_Aram_augusztus.pdf" {code}
> fails with the following error:
> {code:java}
> java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
>         at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
>         at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
>         at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>         at picocli.CommandLine.access$1500(CommandLine.java:148)
>         at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>         at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>         at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>         at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>         at picocli.CommandLine.execute(CommandLine.java:2174)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76) {code}
> The same command succeeds in 3.0.2.
> The triggering PDF can be downloaded from 
> [https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf]
> and is also attached.
> The root cause appears to be this change: 
> [https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2]
> from PDFBOX-5841.




[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-16 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882240#comment-17882240
 ] 

Tilman Hausherr commented on PDFBOX-5852:
-

Wow!

No regressions.

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
>  Issue Type: Wish
>  Components: Rendering
>Affects Versions: 2.0.28
>Reporter: Larry Lynn
>Assignee: Andreas Lehmkühler
>Priority: Major
> Fix For: 3.0.3 PDFBox, 4.0.0
>
> Attachments: minimal.pdf
>
>
> We've observed excessive CPU and memory consumption when converting a PDF to 
> images when the PDF contains type 4 shading.  This is especially noticeable 
> when the conversion is done with a high DPI.  Can this be improved?
>  
> Conversation from the PDFBox users mailing list follows
> Initial email:
> {quote}
> Hi CPU and memory usage when converting a PDF with type 4 shading
> Hello PDFBox users and maintainers,
> We have a PDF that causes performance problems when we use PDFBox to
> convert it to an image with renderImageWithDPI().  We're calling
> renderImageWithDPI()
> with 650 DPI.  I realize this is a very high value - we're using it for
> high fidelity original images that will later be downsampled.  On my work
> laptop which has fairly strong hardware, the conversion takes 25 minutes
> and consumes 20GB of memory.  CPU and memory usage is reduced if we use a
> lower DPI.
> The PDF is 1 page long.  It contains type 4 shading / Gouraud free form
> triangle meshes.  We've been aware of some performance issues with type 4
> shading for a little while now, but the PDFs that contained the type 4
> shading belonged to our customers and we were not authorized to share
> them.  We finally found a problem input document that is non-sensitive and
> that we are authorized to share.  I've attached a copy of the problem PDF
> to this email.
> I searched the archives for the users and the developers mailing list and I
> didn't find anything specifically about this issue.
> I searched through the PDFBox jira tickets and I found a couple of tickets
> that looked similar: PDFBOX-2901 & PDFBOX-4491.  PDFBOX-2901 seems to most
> closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
> and our issue still reproduces with PDFBox 2.0.28.
> Should I refer this issue over to the developers mailing list or create a
> PDFBox Jira ticket for this?
> Thanks and Regards,
> Larry Lynn {quote}
> Response:
> {quote}
> Hi,
> Yes shading can be very slow, especially at high dpi. The attachment 
> didn't get through, please upload to a sharehoster or create a ticket. 
> If you need to register then add a meaningful text, e.g. the subject of 
> this post so we know you're not a spammer. Also retry with 2.0.31 and 
> 3.0.2 just to be sure. However I'm pessimistic that this can be fixed.
> Tilman {quote}
>  
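
For a sense of scale at 650 DPI, here is a rough back-of-the-envelope sketch. The page size is not stated in the thread, so US Letter (612 x 792 pt) is purely an assumption, as is 4 bytes per pixel for an ARGB raster. `RasterEstimate` and its methods are made-up names for illustration and are not part of PDFBox. It only estimates one output image buffer and says nothing about what the shading code itself allocates.

```java
/** Hypothetical helper, not part of PDFBox: estimates the size of one rendered page raster. */
final class RasterEstimate {
    private static final double POINTS_PER_INCH = 72.0;

    /** Pixels along one side for a page dimension given in PDF points. */
    static long pixels(double points, double dpi) {
        return Math.round(points / POINTS_PER_INCH * dpi);
    }

    /** Bytes needed for a width x height raster at the given bytes per pixel. */
    static long bufferBytes(double widthPt, double heightPt, double dpi, int bytesPerPixel) {
        return pixels(widthPt, dpi) * pixels(heightPt, dpi) * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 612 x 792 pt (the thread does not give the real size).
        double widthPt = 612;
        double heightPt = 792;
        double dpi = 650;
        System.out.println(pixels(widthPt, dpi) + " x " + pixels(heightPt, dpi) + " px");
        System.out.println(bufferBytes(widthPt, heightPt, dpi, 4) / (1024 * 1024)
                + " MiB for one ARGB buffer");
    }
}
```

Under those assumptions a single buffer is on the order of 150 MiB, which would suggest that most of the reported 20GB is not the output image itself.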




[jira] [Updated] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading

2024-09-15 Thread Tilman Hausherr (Jira)

[
https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Tilman Hausherr updated PDFBOX-5852:

Description:
We've observed excessive CPU and memory consumption when converting a PDF to
images when the PDF contains type 4 shading. This is especially noticeable
when the conversion is done with a high DPI. Can this be improved?

Conversation from the PDFBox users mailing list follows

Initial email:
{quote}
Hi CPU and memory usage when converting a PDF with type 4 shading
Hello PDFBox users and maintainers,

We have a PDF that causes performance problems when we use PDFBox to
convert it to an image with renderImageWithDPI(). We're calling
renderImageWithDPI()
with 650 DPI. I realize this is a very high value - we're using it for
high fidelity original images that will later be downsampled. On my work
laptop which has fairly strong hardware, the conversion takes 25 minutes
and consumes 20GB of memory. CPU and memory usage is reduced if we use a
lower DPI.

The PDF is 1 page long. It contains type 4 shading / Gouraud free form
triangle meshes. We've been aware of some performance issues with type 4
shading for a little while now, but the PDFs that contained the type 4
shading belonged to our customers and we were not authorized to share
them. We finally found a problem input document that is non-sensitive and
that we are authorized to share. I've attached a copy of the problem PDF
to this email.

I searched the archives for the users and the developers mailing list and I
didn't find anything specifically about this issue.
I searched through the PDFBox jira tickets and I found a couple of tickets
that looked similar: PDFBOX-2901 & PDFBOX-4491. PDFBOX-2901 seems to most
closely describe what we're seeing, but that was closed in PDFBox 2.0.0,
and our issue still reproduces with PDFBox 2.0.28.

Should I refer this issue over to the developers mailing list or create a
PDFBox Jira ticket for this?

Thanks and Regards,
Larry Lynn {quote}
Response:
{quote}
Hi,

Yes shading can be very slow, especially at high dpi. The attachment
didn't get through, please upload to a sharehoster or create a ticket.
If you need to register then add a meaningful text, e.g. the subject of
this post so we know you're not a spammer. Also retry with 2.0.31 and
3.0.2 just to be sure. However I'm pessimistic that this can be fixed.

Tilman {quote}

was:
We've observed excessive CPU and memory consumption when converting a PDF to
images when the PDF contains type 4 shading. This is especially noticeable
when the conversion is done with a high DPI. Can this be improved?

Conversation from the PDFBox users mailing list follows

Initial email:
{code:java}
Hi CPU and memory usage when converting a PDF with type 4 shading
Hello PDFBox users and maintainers,

Should I refer this issue over to the developers mailing list or create a
PDFBox Jira ticket for this?

Thanks and Regards,
Larry Lynn {code}
Response:
{code:java}
Hi,

Tilman {code}

> Hi CPU and memory usage when converting a PDF with type 4 shading
> --

[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening



[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879832#comment-17879832
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 10:35 AM:
--

Here's what worked:
{code:java}
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (field instanceof PDVariableText)
        {
            for (PDAnnotationWidget widget : field.getWidgets())
            {
                widget.setAppearance(null);
            }
        }
    }
}
acroForm.refreshAppearances();
{code}
 [^PDFBox5878-flattened.pdf]
[^PDFBox5878-saved.pdf] 
The only problem left is that the second multiline field starts a bit too low, 
but IIRC there's another issue about that.


was (Author: tilman):
Here's what worked:
{code:java}
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (field instanceof PDVariableText)
        {
            for (PDAnnotationWidget widget : field.getWidgets())
            {
                widget.setAppearance(null);
            }
        }
    }
}
acroForm.refreshAppearances();
{code}
 [^PDFBox5878-flattened.pdf]
[^PDFBox5878-saved.pdf] 

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
> URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, 
> flattened.pdf
>
>
> After flattening a PDF AcroForm, the values of some fields get blurred.
> {code:java}
> PDDocument pdDocument = Loader.loadPDF(inFile, "");
> pdDocument.setResourceCache(new DefaultResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(outFile);
>     }
> }
> catch (Exception e) {}
> {code}
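
A side note on the snippet above: the empty catch block discards any exception from loading, flattening or saving, which makes reports like this harder to diagnose. Below is a minimal sketch, using only the JDK, of logging a failing step instead of hiding it. `LoggedStep` and `run` are made-up names for illustration and are not part of PDFBox. In the reporter's code the same idea would simply mean logging `e` inside the catch block.

```java
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical wrapper, not part of PDFBox: runs a step and logs any failure instead of hiding it. */
final class LoggedStep {
    private static final Logger LOG = Logger.getLogger(LoggedStep.class.getName());

    /** Returns true if the step completed, false if it threw; the exception is logged with its stack trace. */
    static boolean run(String name, Callable<?> step) {
        try {
            step.call();
            return true;
        } catch (Exception e) {
            LOG.log(Level.SEVERE, "Step failed: " + name, e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Simulated failure; a real caller would pass the flatten/save work here.
        boolean ok = run("flatten", () -> {
            throw new IllegalStateException("simulated failure");
        });
        System.out.println("completed: " + ok);
    }
}
```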




[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening



[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879832#comment-17879832
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 10:31 AM:
--

Here's what worked:
{code:java}
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (field instanceof PDVariableText)
        {
            for (PDAnnotationWidget widget : field.getWidgets())
            {
                widget.setAppearance(null);
            }
        }
    }
}
acroForm.refreshAppearances();
{code}
 [^PDFBox5878-flattened.pdf]
[^PDFBox5878-saved.pdf] 


was (Author: tilman):
Here's what worked:
{code:java}
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (field instanceof PDVariableText)
        {
            for (PDAnnotationWidget widget : field.getWidgets())
            {
                widget.setAppearance(null);
            }
        }
    }
    acroForm.refreshAppearances();
}
{code}
 [^PDFBox5878-flattened.pdf]
[^PDFBox5878-saved.pdf] 

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
> URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, 
> flattened.pdf
>
>
> After flattening a PDF AcroForm, the values of some fields get blurred.
> {code:java}
> PDDocument pdDocument = Loader.loadPDF(inFile, "");
> pdDocument.setResourceCache(new DefaultResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(outFile);
>     }
> }
> catch (Exception e) {}
> {code}




[jira] [Updated] (PDFBOX-5878) pdf form field text gets blurred after flattening



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5878:

Attachment: PDFBox5878-flattened.pdf
PDFBox5878-saved.pdf

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
> URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, 
> flattened.pdf
>
>
> After flattening a PDF AcroForm, the values of some fields get blurred.
> {code:java}
> PDDocument pdDocument = Loader.loadPDF(inFile, "");
> pdDocument.setResourceCache(new DefaultResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(outFile);
>     }
> }
> catch (Exception e) {}
> {code}




[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening



[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879822#comment-17879822
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 9:30 AM:
-

I added this for the missing fonts; it's just a guess that these are the 
correct fonts:
{code:java}
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
acroForm.setNeedAppearances(false);
PDFont font1 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/times.ttf"), false);
PDFont font2 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/timesbd.ttf"), false);
PDFont font3 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/arial.ttf"), false);
acroForm.getDefaultResources().put(COSName.getPDFName("TimesNewRomanPSMT"), font1);
acroForm.getDefaultResources().put(COSName.getPDFName("TimesNewRomanPS-BoldMT"), font2);
acroForm.getDefaultResources().put(COSName.getPDFName("Helvetica"), font3);
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDTextField)
    {
        if (((PDTextField) field).isMultiline())
        {
            field.setValue("XXX");
        }
    }
}
{code}
But when setting a value, this happens in 
AppearanceGeneratorHelper.setAppearanceContent():
{code}
if (bmcIndex == -1)
{
    // append to existing stream
    writer.writeTokens(tokens);
    writer.writeTokens(COSName.TX, BMC);
}
{code}
So it appends to the existing appearance stream. This is the result after 
calling setValue("XXX"):
{code}
q
Q
q
  9.613575 0.4609071 430.9062 41.31819 re
  W
  n
  q
0.9781767 0 0 -0.9781767 -87.43936 478.0107 cm
BT
  11 0 0 -11 102.2182 458.5622 Tm
  /TT21 1 Tf
  [ (N) -0.2 (a) 0.2 (m) 0.2 (e) 0.2 ( c) 0.2 (ha) 0.2 (nge) 0.2 (d 09/) 
0.2 (26/) 0.2 (2020) ] TJ
ET
  Q
Q
q
  6.43259 0.3084 434.0872 41.6232 re
  W
  n
  q
0.9853977 0 0 0.9853977 9.388783 29.51731 cm
BT
  11 0 0 11 0 0 Tm
  /TT18 1 Tf
  [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 
(ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 
111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 
(ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof 
of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ
ET
  Q
  q
0.9853977 0 0 0.9853977 9.388783 17.51355 cm
BT
  11 0 0 11 0 0 Tm
  /TT18 1 Tf
  [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 
(i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) ] TJ
ET
  Q
Q
q
  3.228123 0.1547671 437.2917 41.93047 re
  W
  n
  q
0.992672 0 0 0.992672 6.206139 29.5793 cm
BT
  11 0 0 11 0 0 Tm
  /TT19 1 Tf
  [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 
(ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 
111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 
(ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof 
of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ
ET
  Q
  q
0.992672 0 0 0.992672 6.206139 17.48693 cm
BT
  11 0 0 11 0 0 Tm
  /TT19 1 Tf
  [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 
(i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) ] TJ
ET
  Q
Q
q
  0 0 440.5198 42.24 re
  W
  n
  /Cs6 cs
  0 sc
  q
1 0 0 1 3 29.64175 cm
BT
  11 0 0 11 0 0 Tm
  /TT20 1 Tf
  [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 
(ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 
111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 
(ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof 
of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ
ET
  Q
  q
1 0 0 1 3 17.46011 cm
BT
  11 0 0 11 0 0 Tm
  /TT20 1 Tf
  [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 
(i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) ] TJ
ET
  Q
Q
/Tx BMC
  q
-2.252 1 441.7718 40.24 re
W
n
BT
  /TimesNewRomanPSMT 11 Tf
  /DeviceGray cs
  0 sc
  -1.252 25.4319 Td
  (\000;\000;\000;) Tj
ET
  Q
EMC
{code}
So the XXX is there, but also all the previous content.
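
To make the append behaviour concrete, here is a small model of it using plain string tokens. This is only a sketch and not the PDFBox implementation: `MarkedContentSketch` and `update` are made-up names, tokens are simplified to strings, and the branch for a stream that already has a `/Tx BMC ... EMC` section is an assumption about the counterpart case, which the excerpt above does not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of the behaviour described above; not the PDFBox implementation. */
final class MarkedContentSketch {

    /** Returns the token list after writing a fresh /Tx BMC ... EMC section. */
    static List<String> update(List<String> existing, List<String> newContent) {
        List<String> out = new ArrayList<>();
        int bmc = existing.indexOf("BMC");
        int emc = existing.indexOf("EMC");
        if (bmc == -1 || emc < bmc) {
            // No usable marked-content section: old tokens are kept, the new section is appended.
            out.addAll(existing);
            out.add("/Tx");
            out.add("BMC");
            out.addAll(newContent);
            out.add("EMC");
        } else {
            // Assumed counterpart branch: keep everything up to BMC, swap the body, keep the tail from EMC.
            out.addAll(existing.subList(0, bmc + 1));
            out.addAll(newContent);
            out.addAll(existing.subList(emc, existing.size()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> old = Arrays.asList("q", "(old text)", "Tj", "Q");
        List<String> fresh = Arrays.asList("(XXX)", "Tj");
        // Without a BMC marker the old text survives next to the new value.
        System.out.println(update(old, fresh));
    }
}
```

In this model, an appearance stream without the marker keeps all of its old text operators, which would be consistent with the `widget.setAppearance(null)` workaround reported elsewhere in this thread: with no old stream there is nothing to append to.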


was (Author: tilman):
I added this for the missing fonts, which is just a guess that it's the correct 
font
{code:java}
acroForm.setNeedAppearances(false);
PDFont font1 = PDType0Font.load(doc, new 
FileInputStream("c:/windows/fonts/times.ttf"), false);
PDFont font2

[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening



[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879796#comment-17879796
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 8:00 AM:
-

There are so many things wrong with this PDF that I don't see a specific 
solution. I'm doing this just for fun. I was able to fix some of the fields 
(e.g. Last1) but not yet all (e.g. the multiline fields and some others), for 
some unknown reason. (I added the missing fonts to the default resources) Not 
all appearances are redrawn. Either there's a bug in my code or there is 
something in our code that skips the recreation of the appearances and I forgot 
about it.

It's not even recreated when changing the value to something else?!


was (Author: tilman):
There are so many things wrong with this PDF that I don't see a specific 
solution. I'm doing this just for fun. I was able to fix some of the fields 
(e.g. Last1) but not yet all (e.g. the multiline fields and some others), for 
some unknown reason. (I added the missing fonts to the default resources) Not 
all appearances are redrawn. Either there's a bug in my code or there is 
something in our code that skips the recreation of the appearances and I forgot 
about it.

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
> URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> beforeFlattening.pdf, flattened.pdf
>
>
> After flattening a PDF AcroForm, the values of some fields get blurred.
> {code:java}
> PDDocument pdDocument = Loader.loadPDF(inFile, "");
> pdDocument.setResourceCache(new DefaultResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(outFile);
>     }
> }
> catch (Exception e) {}
> {code}




[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-05 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879753#comment-17879753
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 4:04 AM:
-

I could try to getValue() and setValue() on the text fields and see whether it 
looks better when PDFBox recreates the appearances. These fields have a value 
that makes sense. I'm just wondering whether this person will have legal 
disadvantages if the file is refused? (Although I doubt that the content of 
field {{Root/Pages/Kids/[0]/Annots/[7]/V}} will work for the petitioner). OTOH 
it's from 22.2 so it may already have been decided in some way.


was (Author: tilman):
I could try to getValue() and setValue() on the text fields and see whether it 
looks better when PDFBox recreates the appearances. These fields have a value 
that makes sense. I'm just wondering whether this person will have legal 
disadvantages if the file is refused? (Although I doubt that the content of 
field {{Root/Pages/Kids/[0]/Annots/[7]/V}} will work for the petitioner). OTOH 
it's from 22.2 so it may already have been processed.

> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
> URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> beforeFlattening.pdf, flattened.pdf
>
>
> After flattening a PDF AcroForm, the values of some fields get blurred.
> {code:java}
> PDDocument pdDocument = Loader.loadPDF(inFile, "");
> pdDocument.setResourceCache(new DefaultResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(outFile);
>     }
> }
> catch (Exception e) {}
> {code}




[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening

2024-09-05 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879480#comment-17879480
 ] 

Tilman Hausherr edited comment on PDFBOX-5878 at 9/5/24 8:16 AM:
-

{code}
q
Q
q
  9.469598 0.4248199 206.7517 18.55036 re
  W
  n
  q
0.9562042 0 0 -0.9562042 -55.6218 672.8725 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT21 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
q
  6.360067 0.2853218 209.8612 18.82936 re
  W
  n
  q
0.9705854 0 0 -0.9705854 -59.7103 682.8466 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT18 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
q
  3.203769 0.1437257 213.0175 19.11255 re
  W
  n
  q
0.9851829 0 0 -0.9851829 -63.86029 692.9707 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT19 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
q
  0 0 216.2213 19.4 re
  W
  n
  /Cs6 cs
  0 sc
  q
1 0 0 -1 -68.0727 703.247 cm
BT
  11 0 0 -11 71.0727 696.6206 Tm
  /TT20 1 Tf
  [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 
(nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ
ET
  Q
Q
{code}
The text appears 3 times at slightly different positions in this appearance 
stream.
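
Counting the show-text operators is a quick way to spot this kind of duplicated text in a decoded appearance stream. The sketch below is a deliberately naive scanner using only the JDK: `ShowTextCounter` is a made-up name and not part of PDFBox, and it ignores hex strings, comments and inline images, so treat it as an illustration rather than a parser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: naive scan of a decoded content stream. */
final class ShowTextCounter {

    /** Splits the stream into operand/operator tokens, skipping over (...) string literals. */
    static List<String> tokens(String stream) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int depth = 0; // nesting depth inside a (...) literal
        for (int i = 0; i < stream.length(); i++) {
            char c = stream.charAt(i);
            if (depth > 0) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                }
                continue;
            }
            if (c == '(') {
                flush(cur, out);
                depth = 1;
            } else if (Character.isWhitespace(c) || c == '[' || c == ']') {
                flush(cur, out);
            } else {
                cur.append(c);
            }
        }
        flush(cur, out);
        return out;
    }

    private static void flush(StringBuilder cur, List<String> out) {
        if (cur.length() > 0) {
            out.add(cur.toString());
            cur.setLength(0);
        }
    }

    /** Counts the Tj and TJ show-text operators. */
    static long showTextCount(String stream) {
        return tokens(stream).stream()
                .filter(t -> t.equals("Tj") || t.equals("TJ"))
                .count();
    }

    public static void main(String[] args) {
        String sample = "q BT /TT21 1 Tf [ (E) 0.2 (l) ] TJ ET Q\n"
                + "q BT /TT18 1 Tf [ (E) 0.2 (l) ] TJ ET Q\n";
        System.out.println("show-text operators: " + showTextCount(sample));
    }
}
```

A single-line text field would normally be expected to show its value once, so a count above one for the same string is a hint that stale copies are still in the stream.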



> pdf form field text gets blurred after flattening
> -
>
> Key: PDFBOX-5878
> URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
>  Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, 
> beforeFlattening.pdf, flattened.pdf
>
>
> After flattening a PDF AcroForm, the values of some fields get blurred.
> {code:java}
> PDDocument pdDocument = Loader.loadPDF(inFile, "");
> pdDocument.setResourceCache(new DefaultResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(outFile);
>     }
> }
> catch (Exception e) {}
> {code}




[jira] [Reopened] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.

2024-09-04 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reopened PDFBOX-5876:
-

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> PDF: [^jpeg2000.pdf]
> JVM: -Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> } {code}




[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content



[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878964#comment-17878964
 ] 

Tilman Hausherr commented on PDFBOX-5877:
-

Yeah!! There's a log message, so it means you also disabled or disregarded logs 
:-(

> After flattening a form pdf, the pdf loses content
> --
>
> Key: PDFBOX-5877
> URL: https://issues.apache.org/jira/browse/PDFBOX-5877
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: beforeFalttening.pdf, flattenedPdf.pdf
>
>
> After flattening, the PDF form content changes. Please take a look at the 
> before and after PDFs. Flattening works fine in 2.0.31. After upgrading to 
> 3.0.3 we started getting many issues with PDF forms after flattening. 
> The code used for flattening is as follows:
> {code}
> PDDocument pdDocument = Loader.loadPDF(file, "");
> pdDocument.setResourceCache(new PdfResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {      
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = 
> pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {       
>             pdForm.flatten();         
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(file);        
>     }
> }
> {code}




[jira] [Comment Edited] (PDFBOX-5877) After flattening a form pdf, the pdf loses content



[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878961#comment-17878961
 ] 

Tilman Hausherr edited comment on PDFBOX-5877 at 9/3/24 5:55 PM:
-

What's this?
{code}
pdDocument.setResourceCache(new PdfResourceCache())
{code}
We have no class {{PdfResourceCache}}.


was (Author: tilman):
What's this?

pdDocument.setResourceCache(new PdfResourceCache())



> After flattening a form pdf, the pdf loses content
> --
>
> Key: PDFBOX-5877
> URL: https://issues.apache.org/jira/browse/PDFBOX-5877
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: beforeFalttening.pdf, flattenedPdf.pdf
>
>
> After flattening, the PDF form content changes. Please take a look at the 
> before and after PDFs. Flattening works fine in 2.0.31. After upgrading to 
> 3.0.3 we started getting many issues with PDF forms after flattening. 
> The code used for flattening is as follows:
> {code}
> PDDocument pdDocument = Loader.loadPDF(file, "");
> pdDocument.setResourceCache(new PdfResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {      
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = 
> pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {       
>             pdForm.flatten();         
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(file);        
>     }
> }
> {code}




[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content



[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878961#comment-17878961
 ] 

Tilman Hausherr commented on PDFBOX-5877:
-

What's this?

pdDocument.setResourceCache(new PdfResourceCache())



> After flattening a form pdf, the pdf loses content
> --
>
> Key: PDFBOX-5877
> URL: https://issues.apache.org/jira/browse/PDFBOX-5877
> Project: PDFBox
>  Issue Type: Bug
>  Components: AcroForm
>Affects Versions: 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
>Reporter: Joseph Jezerinac
>Priority: Major
> Attachments: beforeFalttening.pdf, flattenedPdf.pdf
>
>
> After flattening, the PDF form content changes. Please take a look at the
> before and after PDFs. Flattening works fine in 2.0.31. After upgrading to
> 3.0.3 we started getting many issues with PDF forms after flattening.
> The code used for flattening is as follows:
> {code}
> PDDocument pdDocument = Loader.loadPDF(file, "");
> pdDocument.setResourceCache(new PdfResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(file);
>     }
> }
> {code}




[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content



[ 
https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878960#comment-17878960
 ] 

Tilman Hausherr commented on PDFBOX-5877:
-

Are you sure you used 3.0.3 and not 3.0.2? I just tried with the trunk and 
3.0.4-SNAPSHOT with our test and I got only invisible differences (yours are 
clearly visible and are because all fonts are lost in the PDF).





[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.



[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878879#comment-17878879
 ] 

Tilman Hausherr commented on PDFBOX-5876:
-

No... I used -Xmx4G for a production project.

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> pdf: [^jpeg2000.pdf]
> JVM: -Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> } {code}




[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.



[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878846#comment-17878846
 ] 

Tilman Hausherr commented on PDFBOX-5876:
-

Are you sure you are using the new version? You have to build yourself or wait 
until a new snapshot build is available. Instead of using PDFDebugger now I 
just tried your code as it is with a locally built 3.0.4-SNAPSHOT and it did 
work with -Xmx600m. (Also with 550, but not with 500)





[jira] [Resolved] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5876.
-
Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0
 Assignee: Tilman Hausherr
   Resolution: Fixed





[jira] [Updated] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5876:

Affects Version/s: 2.0.32

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Priority: Major
> Attachments: jpeg2000.pdf
>
>
> pdf: [^jpeg2000.pdf]
> JVM: -Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> } {code}




[jira] [Updated] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5876:

Component/s: Rendering

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.2 PDFBox
>Reporter: liu
>Priority: Major
> Attachments: jpeg2000.pdf
>
>
> pdf: [^jpeg2000.pdf]
> JVM: -Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> } {code}




[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.



[ 
https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878835#comment-17878835
 ] 

Tilman Hausherr commented on PDFBOX-5876:
-

The JPX image in that file is 7020 x 4964, which is quite big, and -Xmx600m is 
quite low. But I noticed that the subsampling parameter wasn't used when 
reading the JPX image the second time, which was the cause of the OOM. (JPX 
images have to be read twice because of some weirdness in the specification.) 
It should work now. I tried it with PDFDebugger, which doesn't allow setting a 
temp cache.
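
To put the numbers above in perspective, here is a rough back-of-the-envelope sketch of the raster size for a 7020 x 4964 image, with and without subsampling. It assumes 4 bytes per pixel (as for a packed ARGB raster); the class and method names are hypothetical, and this is not how PDFBox itself accounts for memory.

```java
/** Rough raster-size estimate; the names and the bytes-per-pixel figure are assumptions. */
final class JpxMemoryEstimate {

    /** Bytes needed for a width x height raster decoded with the given subsampling step. */
    static long rasterBytes(int width, int height, int subsampling, int bytesPerPixel) {
        if (subsampling < 1) {
            throw new IllegalArgumentException("subsampling must be >= 1");
        }
        // Subsampling keeps every n-th pixel per axis, so each dimension is rounded up.
        long w = (width + subsampling - 1) / subsampling;
        long h = (height + subsampling - 1) / subsampling;
        return w * h * bytesPerPixel;
    }

    public static void main(String[] args) {
        long full = rasterBytes(7020, 4964, 1, 4);
        long half = rasterBytes(7020, 4964, 2, 4);
        long maxHeap = Runtime.getRuntime().maxMemory(); // reflects the -Xmx setting
        System.out.printf("full size:     %d bytes%n", full);
        System.out.printf("subsampling 2: %d bytes%n", half);
        System.out.printf("max heap:      %d bytes%n", maxHeap);
    }
}
```

Under these assumptions a full-size decode needs roughly 139 MB per copy of the raster, while a decode at subsampling 2 needs roughly 35 MB, which suggests why skipping the subsampling parameter on the second read can matter with a small heap.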

> This jpeg2000 takes up a lot of memory, causing overflow.
> -
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 3.0.2 PDFBox
>Reporter: liu
>Priority: Major
> Attachments: jpeg2000.pdf
>
>
> pdf: [^jpeg2000.pdf]
> JVM: -Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> } {code}




[jira] [Updated] (PDFBOX-5875) using font data to process ligatures



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5875:

Fix Version/s: (was: 3.0.4 PDFBox)

> using font data to process ligatures
> 
>
> Key: PDFBOX-5875
> URL: https://issues.apache.org/jira/browse/PDFBOX-5875
> Project: PDFBox
>  Issue Type: New Feature
>  Components: Parsing, PDModel, Text extraction
>Affects Versions: 3.0.3 PDFBox
>Reporter: Manish S N
>Priority: Major
>  Labels: Asian, CIDFont, font, ligatures, unicodemapping
> Attachments: page.pdf
>
>
> To process ligatures from Asian languages (where a glyph is the combination 
> of two Unicode characters) using the data in embedded fonts.
>  
> *The problem:*
> Modern PDF creators currently put these ligatures in the /ActualText field, 
> which we only recently considered supporting in this issue. But this is not 
> the case in old PDFs with embedded CID fonts like [^page.pdf], where the 
> ligature glyphs lack a /ToUnicode character mapping because there is no single 
> Unicode code point for them: each is a combination of more than one Unicode 
> character. 
>  
> *The Potential Solution (if not perfect):* 
> I managed to extract the font files using PDFBox 
> ([code|https://gist.githubusercontent.com/incubated-geek-cc/640a74920b184274374af257cd1587bb/raw/c6fb02fa82f9883670d96b812bfe7f2f55b18125/Main.java])
>  and when I viewed the font files using FontForge I found the ligature data 
> intact in them. So we can use this data to map the ligature glyphs to the 
> Unicode values of their constituent glyphs.
>  
> *Problems:*
> In some cases the constituent glyphs may not be present in the cmap at all, 
> having been removed by a PDF optimiser because they are never used directly 
> in the PDF apart from in ligatures. Such glyphs are empty, with only a glyph 
> ID and no /ToUnicode mapping, even if that particular glyph has a 
> corresponding Unicode character.
>  
> *The Hope:*
> This is not a common problem in large PDFs, and basic spell checkers could 
> easily rectify it. Some comprehension is better than no comprehension when it 
> comes to dealing with data. This will greatly enhance the parsing of 
> non-Latin Asian languages.
>  
> (The PDF sample I attached is in Tamil.)
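
The proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions: `LigatureResolver` and its two maps are hypothetical stand-ins for data that would have to be read from the embedded font, none of it is existing PDFBox API, and the glyph IDs are made up. A constituent glyph without a mapping (the case described under "Problems") falls back to U+FFFD.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps a glyph to Unicode, expanding ligatures via their components. */
final class LigatureResolver {
    private static final String REPLACEMENT = "\uFFFD";

    private final Map<Integer, String> toUnicode;                 // glyph ID -> Unicode text
    private final Map<Integer, List<Integer>> ligatureComponents; // ligature glyph ID -> component glyph IDs

    LigatureResolver(Map<Integer, String> toUnicode,
                     Map<Integer, List<Integer>> ligatureComponents) {
        this.toUnicode = toUnicode;
        this.ligatureComponents = ligatureComponents;
    }

    String resolve(int glyphId) {
        // A direct mapping always wins.
        String direct = toUnicode.get(glyphId);
        if (direct != null) {
            return direct;
        }
        // Otherwise try to rebuild the text from the ligature's constituent glyphs.
        List<Integer> components = ligatureComponents.get(glyphId);
        if (components == null) {
            return REPLACEMENT;
        }
        StringBuilder sb = new StringBuilder();
        for (int component : components) {
            // A constituent may itself lack a mapping; mark it instead of failing.
            sb.append(toUnicode.getOrDefault(component, REPLACEMENT));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up glyph IDs for Tamil KA, VIRAMA and SSA; glyph 50 is their ligature.
        Map<Integer, String> toUnicode = Map.of(10, "\u0B95", 11, "\u0BCD", 12, "\u0BB7");
        Map<Integer, List<Integer>> ligatures = Map.of(50, List.of(10, 11, 12));
        LigatureResolver resolver = new LigatureResolver(toUnicode, ligatures);
        System.out.println(resolver.resolve(50));
    }
}
```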




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878089#comment-17878089
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Yes. But consider that Adobe didn't do it, and they're smarter than us; I just 
tried copy / paste and save as text. The ligature data in fonts is meant to 
be used when creating PDFs; I don't know if it would work in extraction.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>    Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> page.pdf, pdfbox_out.txt, poppler_out.txt, screenshot-1.png, 
> screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results.
>  * multilingual_test.pdf is the original PDF I made to test multilingual 
> text extraction.
>  * pdfbox_out.txt is the text file produced by PDFBox.
>  * adobe_out.txt is the text file created by Adobe Reader's "save as text" 
> feature.
>  
> Observation:
> As you can see in the attachments, the text file obtained from PDFBox shows 
> weird Unicode characters for Tamil and Bengali (for Hindi the characters are 
> extracted but not overlapped; Japanese seems fine to me). In contrast, the 
> text file obtained from Adobe Reader's "save as text" feature seems fine, and 
> copy-pasting the text from my document viewer (Evince) also works.
> Questions:
>  # Why are the outputs from PDFBox and Adobe different?
>  # What can I do to extract the text from a multilingual PDF correctly?
>  # Is there a way to apply pattern matching to text in a PDF file and declare 
> matches without extracting the text first? (say if the problem is with fonts 
> and glyphs)
> —
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using 
> Apache Tika for parsing documents. I noticed a problem with extracted PDF text 
> (other file types parse fine). I used the executable PDFBox jar to conclude 
> that the _problem is in PDFBox and not in Tika._ I tested with Adobe Reader's 
> extract text to confirm the problem is not with the PDF. I want to extract 
> this multilingual text to run pattern matching on it alone and do not need to 
> display the content, only whether the pattern is present or not (say if the 
> problem is with fonts and glyphs)
>  




[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878076#comment-17878076
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/30/24 11:50 AM:
---

Please create a new ticket for the file you just added because this is a 
different problem (only if you manage to extract this properly from Adobe 
Reader).


was (Author: tilman):
Please create a new ticket for the file you just added because this is a 
different problem.





[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878076#comment-17878076
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Please create a new ticket for the file you just added because this is a 
different problem.





[jira] [Resolved] (PDFBOX-5874) Change Loglevel from Warn to info when rebuilding font cache

2024-08-28 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5874.
-
  Assignee: Tilman Hausherr
Resolution: Fixed

Thank you, you're right, there's no need to warn about something that harmless.

> Change Loglevel from Warn to info when rebuilding font cache
> 
>
> Key: PDFBOX-5874
> URL: https://issues.apache.org/jira/browse/PDFBOX-5874
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Thomas Hoffmann
>Assignee: Tilman Hausherr
>Priority: Minor
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>
> We have a monitoring system for our logfiles and some people get notified 
> whenever there is an error or a warning in the logfiles.
> Due to OS updates, the fonts might be updated or changed. This triggers a 
> rebuild process within PDFBox. Unfortunately, the loglevel is set to Warning 
> and this triggers an alarm.
> The warnings occur in:
> org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java
> The logfile shows the following three entries:
> 2024-08-19T18:25:03.653+02:00 WARN FileSystemFontProvider: New fonts found, 
> font cache will be re-built
> 2024-08-19T18:25:03.654+02:00 WARN FileSystemFontProvider: Building on-disk 
> font cache, this may take a while
> 2024-08-19T18:25:04.105+02:00 WARN FileSystemFontProvider: Finished building 
> on-disk font cache, found 96 fonts
>  
> IMHO the message is more informational and not necessarily a warning. It just 
> gives me the information that the cache is getting rebuilt.
> It would be great if you could consider setting these messages to info level.
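
To illustrate why the level matters for log monitoring, here is a minimal sketch using `java.util.logging` from the JDK. PDFBox itself logs through a different facade; `AlertingHandler` and the logger name `demo.FontCache` are hypothetical. The handler stands in for a monitoring system that raises an alarm for anything at WARNING or above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical stand-in for a log monitor that alerts on WARNING and above. */
final class AlertingHandler extends Handler {
    final List<String> alerts = new ArrayList<>();

    @Override
    public void publish(LogRecord rec) {
        // Only records at WARNING or higher count as alarms.
        if (rec.getLevel().intValue() >= Level.WARNING.intValue()) {
            alerts.add(rec.getMessage());
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    @Override
    public void close() {
        // No resources to release.
    }

    /** Logs one message at WARNING and one at INFO, then returns the handler for inspection. */
    static AlertingHandler demo() {
        Logger log = Logger.getLogger("demo.FontCache");
        log.setUseParentHandlers(false); // keep the demo output off the console
        log.setLevel(Level.ALL);
        AlertingHandler handler = new AlertingHandler();
        handler.setLevel(Level.ALL);
        log.addHandler(handler);
        log.warning("New fonts found, font cache will be re-built");    // triggers an alarm
        log.info("Building on-disk font cache, this may take a while"); // does not
        log.removeHandler(handler);
        return handler;
    }

    public static void main(String[] args) {
        System.out.println("alarms raised: " + demo().alerts.size());
    }
}
```

In this sketch only the message logged at WARNING reaches the alert list; the same text logged at INFO passes through without raising an alarm.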




[jira] [Updated] (PDFBOX-5874) Change Loglevel from Warn to info when rebuilding font cache

2024-08-28 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5874:

Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0

> Change Loglevel from Warn to info when rebuilding font cache
> 
>
> Key: PDFBOX-5874
> URL: https://issues.apache.org/jira/browse/PDFBOX-5874
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Thomas Hoffmann
>Priority: Minor
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>
> We have a monitoring system for our logfiles and some people get notified 
> whenever there is an error or a warning in the logfiles.
> Due to OS updates, the fonts might be updated or changed. This triggers a 
> rebuild process within PDFBox. Unfortunately, the loglevel is set to Warning 
> and this triggers an alarm.
> The warnings occur in:
> org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java
> The logfile shows the following three entries:
> 2024-08-19T18:25:03.653+02:00 WARN FileSystemFontProvider: New fonts found, 
> font cache will be re-built
> 2024-08-19T18:25:03.654+02:00 WARN FileSystemFontProvider: Building on-disk 
> font cache, this may take a while
> 2024-08-19T18:25:04.105+02:00 WARN FileSystemFontProvider: Finished building 
> on-disk font cache, found 96 fonts
>  
> IMHO the message is more informational and not necessarily a warning. It just 
> gives me the information that the cache is getting rebuilt.
> It would be great if you could consider setting these messages to info level.




[jira] [Updated] (PDFBOX-5874) Change Loglevel from Warn to info when rebuilding font cache

2024-08-28 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5874:

Affects Version/s: 2.0.32

> Change Loglevel from Warn to info when rebuilding font cache
> 
>
> Key: PDFBOX-5874
> URL: https://issues.apache.org/jira/browse/PDFBOX-5874
> Project: PDFBox
>  Issue Type: Improvement
>  Components: PDModel
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Thomas Hoffmann
>Priority: Minor
>
> We have a monitoring system for our logfiles and some people get notified 
> whenever there is an error or a warning in the logfiles.
> Due to OS updates, the fonts might be updated or changed. This triggers a 
> rebuild process within PDFBox. Unfortunately, the loglevel is set to Warning 
> and this triggers an alarm.
> The warnings occur in:
> org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java
> The logfile shows the following three entries:
> 2024-08-19T18:25:03.653+02:00 WARN FileSystemFontProvider: New fonts found, 
> font cache will be re-built
> 2024-08-19T18:25:03.654+02:00 WARN FileSystemFontProvider: Building on-disk 
> font cache, this may take a while
> 2024-08-19T18:25:04.105+02:00 WARN FileSystemFontProvider: Finished building 
> on-disk font cache, found 96 fonts
>  
> IMHO the message is more informational and not necessarily a warning. It just 
> gives me the information that the cache is getting rebuilt.
> It would be great if you could consider setting these messages to info level.




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876692#comment-17876692
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

In the files I saw, /ActualText was often used only for a part of the text 
(although I see that one of the files I attached uses it for all). Using 
/ActualText only and disregarding the old text extraction was never in my 
thoughts. That's why a switch would mean we either have the improvement of this 
ticket, or work as before.
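
The behaviour of such a switch can be pictured with a small sketch. Everything here is hypothetical (the `Span` class, the `useActualText` flag, the sample strings) and it is not the PDFBox implementation: each span carries the text decoded from its glyphs and, optionally, an /ActualText replacement that covers only that span.

```java
import java.util.List;

/** Hypothetical model of text extraction with an optional /ActualText override per span. */
final class ActualTextSketch {

    /** A run of text: the glyph-decoded text plus an optional /ActualText (null if absent). */
    static final class Span {
        final String decoded;
        final String actualText;

        Span(String decoded, String actualText) {
            this.decoded = decoded;
            this.actualText = actualText;
        }
    }

    static String extract(List<Span> spans, boolean useActualText) {
        StringBuilder sb = new StringBuilder();
        for (Span span : spans) {
            // With the switch on, /ActualText replaces the decoded text only where present;
            // spans without it still go through the regular extraction.
            if (useActualText && span.actualText != null) {
                sb.append(span.actualText);
            } else {
                sb.append(span.decoded);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span("Hello ", null),
                new Span("??", "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"));
        System.out.println(extract(spans, true));  // switch on: /ActualText used where present
        System.out.println(extract(spans, false)); // switch off: works as before
    }
}
```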

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results.
>  * multilingual_test.pdf is the original PDF I made to test multilingual 
> text extraction.
>  * pdfbox_out.txt is the text file produced by PDFBox.
>  * adobe_out.txt is the text file created by Adobe Reader's "save as text" 
> feature.
>  
> Observation:
> As you can see in the attachments, the text file produced by PDFBox shows 
> weird Unicode characters for Tamil and Bengali (for Hindi the characters are 
> extracted but not overlapped; Japanese seems fine to me). In contrast, the 
> text file obtained from Adobe Reader's "save as text" feature seems fine, and 
> copy-pasting the text from my document viewer (Evince) also works.
> Questions:
>  # Why are the outputs from PDFBox and Adobe different?
>  # What can I do to extract the text from a multilingual PDF correctly?
>  # Is there a way to apply pattern matching to text in a PDF file and report 
> matches without extracting the text first (say, if the problem is with fonts 
> and glyphs)?
> —
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using 
> Apache Tika for parsing documents. I noticed a problem with the extracted PDF 
> text (other file types parse fine). I used the executable PDFBox jar to 
> conclude that the _problem is in PDFBox and not in Tika._ I tested with Adobe 
> Reader's text extraction to confirm that the problem is not with the PDF. I 
> want to extract this multilingual text only to run pattern matching on it; I 
> do not need to display the content, only to know whether the pattern is 
> present or not (say, if the problem is with fonts and glyphs).
>  




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876660#comment-17876660
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

I haven't resolved this ticket because of one question I've been asking 
myself, and now ask the users here: should I add a getter/setter that makes 
this ActualText handling optional? It should be active by default because I 
believe that it is useful in most cases.

e.g. ConsiderActualText / ActivateActualText / IncludeActualText  / whatever
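
To make the question concrete, here is a toy model of such a switch. It is not PDFBox code: the class ActualTextSwitchSketch, the Span type and the accessor names are hypothetical and only borrow one of the candidate names above. A span without /ActualText always falls back to the conventionally extracted text, which matches the case where /ActualText covers only part of a page.

```java
import java.util.List;

class ActualTextSwitchSketch {
    /** A run of text; actualText is null when no /ActualText applies to it. */
    static final class Span {
        final String extracted;   // what the conventional extraction yields
        final String actualText;  // replacement text, or null

        Span(String extracted, String actualText) {
            this.extracted = extracted;
            this.actualText = actualText;
        }
    }

    // Active by default, as proposed in the comment above.
    private boolean considerActualText = true;

    boolean isConsiderActualText() {
        return considerActualText;
    }

    void setConsiderActualText(boolean considerActualText) {
        this.considerActualText = considerActualText;
    }

    /** Concatenates the spans, preferring /ActualText only when the switch is on. */
    String extract(List<Span> spans) {
        StringBuilder sb = new StringBuilder();
        for (Span span : spans) {
            boolean useActual = considerActualText && span.actualText != null;
            sb.append(useActual ? span.actualText : span.extracted);
        }
        return sb.toString();
    }
}
```

With the switch off, the result is exactly what the old extraction produced, which is the "work as before" behaviour a caller would opt into.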


[jira] [Comment Edited] (PDFBOX-5657) SMaskInData not supported for JPX images



[ 
https://issues.apache.org/jira/browse/PDFBOX-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876632#comment-17876632
 ] 

Tilman Hausherr edited comment on PDFBOX-5657 at 8/26/24 8:53 AM:
--

This related issue
https://github.com/mozilla/pdf.js/issues/11306
won't look better because there's an exception in the JPEG2000 decoder, see
https://github.com/jai-imageio/jai-imageio-jpeg2000/issues/9


was (Author: tilman):
This related issue
https://github.com/mozilla/pdf.js/issues/11306
won't look better because there's an exception in the JPEG2000 decoder.

> SMaskInData not supported for JPX images
> 
>
> Key: PDFBOX-5657
> URL: https://issues.apache.org/jira/browse/PDFBOX-5657
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.29, 3.0.0 PDFBox, 4.0.0
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: JPEG2000, JPXDecode, JPXFilter
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: PDFJS-16782-SMaskInData.pdf
>
>
> JPX images can have transparency information; not only do we not support 
> that, but the images also look broken.
> For now, let's just return the opaque image until there's a good idea of what 
> to do. Maybe we have to return the mask in the DecodeResult. 
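
The idea of "returning the opaque image" can be illustrated with plain java.awt, independent of how the JPX filter does it internally. This is a sketch of the concept only, not the committed fix; OpaqueImageSketch and dropAlpha are names invented for the example.

```java
import java.awt.image.BufferedImage;

class OpaqueImageSketch {
    /** Copies the colour samples into an RGB image, discarding any alpha channel. */
    static BufferedImage dropAlpha(BufferedImage src) {
        BufferedImage dst = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                // TYPE_INT_RGB has no alpha band, so the alpha bits are ignored.
                dst.setRGB(x, y, src.getRGB(x, y));
            }
        }
        return dst;
    }
}
```

The discarded alpha samples are what would have to travel separately (for example as a mask) if transparency were to be supported later.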




[jira] [Resolved] (PDFBOX-5657) SMaskInData not supported for JPX images



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5657.
-
Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0
 Assignee: Tilman Hausherr
   Resolution: Fixed

This related issue
https://github.com/mozilla/pdf.js/issues/11306
won't look better because there's an exception in the JPEG2000 decoder.


[jira] [Updated] (PDFBOX-5872) Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding

2024-08-25 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5872:

Affects Version/s: 2.0.32

> Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding
> 
>
> Key: PDFBOX-5872
> URL: https://issues.apache.org/jira/browse/PDFBOX-5872
> Project: PDFBox
> Issue Type: Improvement
> Components: Rendering
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Gábor Stefanik
> Priority: Major
>
> [https://github.com/dbmdz/imageio-jnr] / 
> [https://mvnrepository.com/artifact/de.digitalcollections.imageio/imageio-openjpeg]
>  is an alternative JPEG2000 implementation for Java ImageIO that uses the 
> native OpenJPEG library as its backend.
> Unfortunately, it doesn't work out of the box because it doesn't implement 
> raster reading (canReadRaster not overridden, returns false), and PDFBox uses 
> canReadRaster() to validate image reader instances. However, it doesn't 
> appear that there is any real reliance on raster support in PDFBox (at least 
> in version 3) - if I patch the library to lie about raster support, it seems 
> to work perfectly.
> A further complication arises when the OpenJPEG native library cannot be 
> found: imageio-openjpeg returns null as the reader instance, which causes PDF 
> rendering to fail with an NPE, even if another JPEG2000 reader is available. 
> This can be remedied with a simple null check.
> [https://github.com/apache/pdfbox/pull/197] shows a possible solution. Until 
> then, [https://github.com/Googulator/imageio-jnr] can be used with PDFBox 
> 3.0.3 as a workaround, so long as the native library is correctly installed.
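
The two problems described above (the canReadRaster() validation and the null reader instance) can be sketched with the standard ImageIO API alone. This is an illustration under assumptions, not the code from the pull request: ReaderLookupSketch, findReader and the requireRaster parameter are hypothetical names.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

class ReaderLookupSketch {
    /**
     * Returns the first usable reader for the given format, or null if none is found.
     * requireRaster mirrors the canReadRaster() validation described in the report.
     */
    static ImageReader findReader(String formatName, boolean requireRaster) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext()) {
            ImageReader reader = readers.next();
            if (reader == null) {
                // A plugin that could not be instantiated may yield null;
                // skip it so that another registered reader can still be used.
                continue;
            }
            if (requireRaster && !reader.canReadRaster()) {
                reader.dispose();
                continue;
            }
            return reader;
        }
        return null;
    }
}
```

A JPX lookup would then call findReader("JPEG2000", false), while any code path that really depends on readRaster() can keep passing true.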




[jira] [Resolved] (PDFBOX-5872) Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding

2024-08-25 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5872.
-
Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0
 Assignee: Tilman Hausherr
   Resolution: Fixed

Done, thanks!

> Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding
> 
>
> Key: PDFBOX-5872
> URL: https://issues.apache.org/jira/browse/PDFBOX-5872
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Gábor Stefanik
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>

[jira] [Commented] (PDFBOX-5872) Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding

2024-08-22 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876045#comment-17876045
 ] 

Tilman Hausherr commented on PDFBOX-5872:
-

{quote}However, it doesn't appear that there is any real reliance on raster 
support in PDFBox (at least in version 3){quote}

{{readRaster()}} is called for CMYK images. Wouldn't it be better to have your 
modified method as a separate private method just for JPX?

> Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding
> 
>
> Key: PDFBOX-5872
> URL: https://issues.apache.org/jira/browse/PDFBOX-5872
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Rendering
>Affects Versions: 3.0.3 PDFBox
>Reporter: Gábor Stefanik
>Priority: Major
>

[jira] [Resolved] (PDFBOX-5869) Checkstyle

2024-08-21 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5869.
-
Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0
 Assignee: Tilman Hausherr
   Resolution: Fixed

That's it for now. It will only prevent the worst "transgressions".

> Checkstyle
> --
>
> Key: PDFBOX-5869
> URL: https://issues.apache.org/jira/browse/PDFBOX-5869
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Simon Steiner
>Assignee: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>
> Can you enforce via the CI that mvn checkstyle:check passes?
> Disable any rules in the config that you don't want to enforce.




[jira] [Updated] (PDFBOX-5869) Checkstyle

2024-08-21 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5869:

Affects Version/s: 3.0.3 PDFBox
   2.0.32

> Checkstyle
> --
>
> Key: PDFBOX-5869
> URL: https://issues.apache.org/jira/browse/PDFBOX-5869
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Simon Steiner
>Priority: Major
>
> Can you enforce via the CI that mvn checkstyle:check passes?
> Disable any rules in the config that you don't want to enforce.




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875214#comment-17875214
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Another thought I just had was to extend TextPosition, add the setter there, 
and pass that object to the base class implementation of 
processTextPosition(). However, TextPosition is final.


[jira] [Updated] (PDFBOX-5871) Rendering never finishes



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5871:

Affects Version/s: 3.0.3 PDFBox
   2.0.32

> Rendering never finishes
> 
>
> Key: PDFBOX-5871
> URL: https://issues.apache.org/jira/browse/PDFBOX-5871
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>    Reporter: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox
>
> Attachments: 2_42.pdf, image-2024-08-20-12-22-36-716.png
>
>
> Submitted by Patrycja Zaremba on the users mailing list. I can confirm that 
> it doesn't end even when running overnight 😡




[jira] [Updated] (PDFBOX-5871) Rendering never finishes



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5871:

Attachment: (was: screenshot-1.png)

> Rendering never finishes
> 
>
> Key: PDFBOX-5871
> URL: https://issues.apache.org/jira/browse/PDFBOX-5871
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>    Reporter: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox
>
> Attachments: 2_42.pdf, image-2024-08-20-12-22-36-716.png
>
>
> Submitted by Patrycja Zaremba on the users mailing list. I can confirm that 
> it doesn't end even when running overnight 😡




[jira] [Updated] (PDFBOX-5871) Rendering never finishes



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5871:

Attachment: screenshot-1.png

> Rendering never finishes
> 
>
> Key: PDFBOX-5871
> URL: https://issues.apache.org/jira/browse/PDFBOX-5871
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>    Reporter: Tilman Hausherr
>Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox
>
> Attachments: 2_42.pdf, screenshot-1.png
>
>
> Submitted by Patrycja Zaremba on the users mailing list. I can confirm that 
> it doesn't end even when running overnight 😡




[jira] [Created] (PDFBOX-5871) Rendering never finishes

Tilman Hausherr created PDFBOX-5871:
---

 Summary: Rendering never finishes
 Key: PDFBOX-5871
 URL: https://issues.apache.org/jira/browse/PDFBOX-5871
 Project: PDFBox
  Issue Type: Bug
  Components: Rendering
Reporter: Tilman Hausherr
 Fix For: 2.0.33, 3.0.4 PDFBox
 Attachments: 2_42.pdf

Submitted by Patrycja Zaremba on the users mailing list. I can confirm that it 
doesn't end even when running overnight 😡




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875100#comment-17875100
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Oops, no, it's not that easy. I forgot that we need 
{{TextPosition.setUnicode()}}, which doesn't exist in the released versions. 
And in the snapshot I've made it package-private to keep people from messing 
around with it.


[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874947#comment-17874947
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Yes, this could be possible. All the changes except one could be done in an 
extension of the stripper. The suppressDuplicateOverlappingText problem 
would have to be solved by saving the value when ActualText is active and 
restoring it afterwards.
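
The save-and-restore step could look roughly like the toy class below. It is not a PDFTextStripper subclass: SuppressFlagSketch, beginActualText() and endActualText() are hypothetical, and the sketch assumes that the flag has to be switched off while an ActualText span is being processed.

```java
class SuppressFlagSketch {
    private boolean suppressDuplicateOverlappingText;
    private boolean savedSuppress;   // value remembered while ActualText is active
    private boolean inActualText;

    void setSuppressDuplicateOverlappingText(boolean value) {
        suppressDuplicateOverlappingText = value;
    }

    boolean getSuppressDuplicateOverlappingText() {
        return suppressDuplicateOverlappingText;
    }

    /** Called when an ActualText span starts: remember the flag, then disable it. */
    void beginActualText() {
        if (inActualText) {
            return; // already saved; do not overwrite the remembered value
        }
        savedSuppress = suppressDuplicateOverlappingText;
        suppressDuplicateOverlappingText = false; // assumption: off during the span
        inActualText = true;
    }

    /** Called when the span ends: put the caller's setting back. */
    void endActualText() {
        if (!inActualText) {
            return;
        }
        suppressDuplicateOverlappingText = savedSuppress;
        inActualText = false;
    }
}
```

The guard in beginActualText() keeps a second call from saving the temporarily disabled value over the caller's original setting.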


[jira] [Updated] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5870:

Affects Version/s: 3.0.3 PDFBox
   2.0.32

> [PATCH] Detect CMYK image without relying on metadata
> -
>
> Key: PDFBOX-5870
> URL: https://issues.apache.org/jira/browse/PDFBOX-5870
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Simon Steiner
>Priority: Major
> Attachments: tmp.patch
>
>
> If getNumChannels returns an empty string, we should use a different method to 
> detect a CMYK image, so that the output image is not inverted.




[jira] [Resolved] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-5870.
-
Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0
 Assignee: Tilman Hausherr
   Resolution: Fixed

> [PATCH] Detect CMYK image without relying on metadata
> -
>
> Key: PDFBOX-5870
> URL: https://issues.apache.org/jira/browse/PDFBOX-5870
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Simon Steiner
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: CMYK
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: tmp.patch
>
>
> If getNumChannels returns an empty string, we should use a different method to 
> detect a CMYK image, so that the output image is not inverted.
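
The idea in the issue description can be sketched roughly as follows. This is not the attached tmp.patch and not PDFBox code: the class CmykDetectionSketch and the method looksLikeCmyk are hypothetical names, and falling back to the band count of the decoded raster is only an assumption about what "a different system" could look like.

```java
import java.awt.image.Raster;

// Hypothetical helper, not part of PDFBox.
final class CmykDetectionSketch {

    /**
     * @param numChannels channel count as read from the image metadata;
     *                    may be empty when the metadata is missing
     * @param raster      the decoded image data
     */
    static boolean looksLikeCmyk(String numChannels, Raster raster) {
        if (numChannels != null && !numChannels.trim().isEmpty()) {
            // Metadata is available: trust it.
            return "4".equals(numChannels.trim());
        }
        // Assumed fallback: a four-band raster suggests CMYK (or YCCK) data.
        // This is a heuristic; other four-band layouts would also match.
        return raster.getNumBands() == 4;
    }
}
```

With such a check, an image whose metadata yields no channel count would still be routed through the CMYK path when its raster has four bands, instead of being treated as RGB and ending up inverted.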




[jira] [Updated] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5870:

Labels: CMYK  (was: )

> [PATCH] Detect CMYK image without relying on metadata
> -
>
> Key: PDFBOX-5870
> URL: https://issues.apache.org/jira/browse/PDFBOX-5870
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Simon Steiner
>Priority: Major
>  Labels: CMYK
> Attachments: tmp.patch
>
>
> If getNumChannels returns an empty string, we should use a different method to 
> detect a CMYK image, so that the output image is not inverted.




[jira] [Updated] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5870:

Component/s: Rendering

> [PATCH] Detect CMYK image without relying on metadata
> -
>
> Key: PDFBOX-5870
> URL: https://issues.apache.org/jira/browse/PDFBOX-5870
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering
>Affects Versions: 2.0.32, 3.0.3 PDFBox
>Reporter: Simon Steiner
>Priority: Major
> Attachments: tmp.patch
>
>
> If getNumChannels returns an empty string, we should use a different method to 
> detect a CMYK image, so that the output image is not inverted.




[jira] [Commented] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata



[ 
https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874891#comment-17874891
 ] 

Tilman Hausherr commented on PDFBOX-5870:
-

Could you attach a PDF where this happens?

> [PATCH] Detect CMYK image without relying on metadata
> -
>
> Key: PDFBOX-5870
> URL: https://issues.apache.org/jira/browse/PDFBOX-5870
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Simon Steiner
>Priority: Major
> Attachments: tmp.patch
>
>
> If getNumChannels returns an empty string, we should use a different method to 
> detect a CMYK image, so that the output image is not inverted.




[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874801#comment-17874801
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/19/24 7:40 AM:
--

Here's the Excel file with the differences: 
[^content_diffs_with_exceptions-ActualText.xlsx]. This is from the Apache Tika 
project, which also uses PDFBox.

Look at columns U and W (in yellow) and compare them with V and X. Usually V and 
X look better. Empty content in the yellow columns means we "lost" something 
during the update. Also look at the column headers to understand what they 
mean. Surprisingly (for me, maybe less for you), the non-Latin texts are the 
ones that improved the most.


was (Author: tilman):
Here's the Excel file with the differences: 
[^content_diffs_with_exceptions-ActualText.xlsx] 

Look at columns U and W (in yellow) and compare them with V and X. Usually V and 
X look better. Empty content in the yellow columns means we "lost" something 
during the update. Also look at the column headers to understand what they 
mean. Surprisingly (for me, maybe less for you), the non-Latin texts are the 
ones that improved the most.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of pdfbox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results
>  * the multilingual_test.pdf is the original pdf i made to test multilingual 
> text extraction.
>  * the pdfbox_out.txt is the text file produced by pdfbox
>  * the adobe_out.txt is the text file created by adobe reader's save as text 
> feature
>  
> Observation:
> As you can see in the attachments, the text file produced by PDFBox shows weird 
> Unicode characters for Tamil and Bengali (for Hindi the characters are extracted but 
> not overlapped; Japanese seems fine to me). In contrast, the text file 
> obtained from Adobe Reader's save-as-text feature seems fine, and copy-pasting 
> the text from my document viewer (Evince) also works.
> Questions:
>  # why are the outputs from pdfbox and adobe different?
>  # what can i do to extract the text from a multilingual pdf correctly?
>  # Is there a way to apply pattern matching to text in pdf file and declare 
> matches without extracting the text first? (say if the problem is with fonts 
> and glyphs)
> —
> My Usecase fyi:
> I am trying to extract text from files and run pattern matching. I am using 
> Apache Tika for parsing documents. I noticed a problem with the extracted PDF text 
> (other file types parse fine). I used the executable PDFBox jar to conclude that the 
> _problem is in PDFBox and not in Tika._ I tested with Adobe Reader's text extraction 
> to confirm that the problem is not with the PDF. I want to extract this 
> multilingual text only to run pattern matching on it; I do not need to 
> display the content, only to know whether the pattern is present (say, if the 
> problem is with fonts and glyphs).
>  
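
For the use case described above (checking whether a pattern is present in multilingual text), a minimal sketch might look like this. It assumes the text has already been extracted correctly and uses only the Java standard library; the class MultilingualMatchSketch and the method containsWord are hypothetical names. Normalizing both sides to NFC is a precaution, since extracted text and search terms may encode the same characters differently.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper for matching a literal word in extracted text.
final class MultilingualMatchSketch {

    static boolean containsWord(String extractedText, String word) {
        // Bring both strings to the same Unicode normalization form.
        String text = Normalizer.normalize(extractedText, Normalizer.Form.NFC);
        String needle = Normalizer.normalize(word, Normalizer.Form.NFC);
        // Quote the needle so it is matched literally, not as a regex.
        Pattern pattern = Pattern.compile(Pattern.quote(needle),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return pattern.matcher(text).find();
    }
}
```

This only reports whether the word occurs; it does not address matching against a PDF without extracting the text first.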




[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874801#comment-17874801
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

Here's the Excel file with the differences: 
[^content_diffs_with_exceptions-ActualText.xlsx] 

Look at columns U and W (in yellow) and compare them with V and X. Usually V and 
X look better. Empty content in the yellow columns means we "lost" something 
during the update. Also look at the column headers to understand what they 
mean. Surprisingly (for me, maybe less for you), the non-Latin texts are the 
ones that improved the most.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>    Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>

[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5868:

Attachment: content_diffs_with_exceptions-ActualText.xlsx

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: EmptyActualText_poppler.txt, 
> EmptyActualText_reduced_poppler.txt, Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, 
> content_diffs_with_exceptions-ActualText.xlsx, 
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt, 
> pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png, 
> suppressDuplicateOverlapping_out.txt
>
>

[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

2024-08-18 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874700#comment-17874700
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

I ran a comparison on several 10 PDF files. While there were many 
improvements, I discovered that /ActualText is also used to PREVENT text 
extraction, as shown by these files:
[^PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf] 
[^PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf] 
[^PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf]
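
Why an empty /ActualText suppresses extraction can be shown with a small model. This is a self-contained sketch, not PDFBox code: the class ActualTextSketch, the nested Span type and the method extract are hypothetical, and real marked-content handling is considerably more involved. The assumption is simply that a span's /ActualText, when present, replaces the text of the glyphs inside that span.

```java
import java.util.List;

// Hypothetical model of marked-content spans; not part of PDFBox.
final class ActualTextSketch {

    /** Glyph text of a span plus its /ActualText (null when absent). */
    static final class Span {
        final String glyphText;
        final String actualText;

        Span(String glyphText, String actualText) {
            this.glyphText = glyphText;
            this.actualText = actualText;
        }
    }

    static String extract(List<Span> spans) {
        StringBuilder sb = new StringBuilder();
        for (Span span : spans) {
            // The replacement text wins whenever it is present, even if it
            // is empty; an empty replacement therefore hides the glyphs.
            sb.append(span.actualText != null ? span.actualText : span.glyphText);
        }
        return sb.toString();
    }
}
```

A span with glyph text "secret" and an empty /ActualText contributes nothing to the result, which is the effect seen in the attached files.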


> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>    Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, multilingual_test.pdf, 
> okular_out.txt, pdfbox_out.txt, poppler_out.txt, screenshot-1.png, 
> screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>

[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does

2024-08-18 Thread Tilman Hausherr (Jira)



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5868:

Attachment: 
PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf
PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf
PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, 
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf, 
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf, 
> Tilman's_solution_out.txt, adobe_out.txt, multilingual_test.pdf, 
> okular_out.txt, pdfbox_out.txt, poppler_out.txt, screenshot-1.png, 
> screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>

[jira] [Commented] (PDFBOX-5869) Checkstyle

2024-08-18 Thread Tilman Hausherr (Jira)



[ 
https://issues.apache.org/jira/browse/PDFBOX-5869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874647#comment-17874647
 ] 

Tilman Hausherr commented on PDFBOX-5869:
-

It should now work for the trunk, both with mvn checkstyle:check and for an 
ordinary build. It will prevent the "worst" things only. I didn't manage to 
create a regexp for all legal headers and mostly gave up on that one after 
failing with xmpbox, and maybe I shouldn't have bothered at all because we 
already have delegated that part to the "pedantic" build profile.
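
To give an idea of what a lenient header check could look like, here is a rough sketch in plain Java. It is not the Checkstyle configuration used in the build; the class LicenseHeaderSketch and the method hasApacheHeader are hypothetical, and the two phrases it looks for are an assumption about what all legal header variants have in common.

```java
import java.util.regex.Pattern;

// Hypothetical, lenient check for an Apache license header.
final class LicenseHeaderSketch {

    // Both phrases must appear, in this order, with flexible whitespace.
    private static final Pattern HEADER = Pattern.compile(
            "Licensed\\s+to\\s+the\\s+Apache\\s+Software\\s+Foundation"
            + ".*?Apache\\s+License,\\s+Version\\s+2\\.0",
            Pattern.DOTALL);

    static boolean hasApacheHeader(String source) {
        // Remove leading comment decoration (/*, *, //, #, <!--) on each
        // line so that Java, properties and XML files are treated alike.
        String normalized = source.replaceAll(
                "(?m)^[ \\t]*(\\*|//|#|<!--|/\\*)+[ \\t]?", "");
        return HEADER.matcher(normalized).find();
    }
}
```

A check this loose accepts many header layouts, at the price of not verifying the exact wording.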

> Checkstyle
> --
>
> Key: PDFBOX-5869
> URL: https://issues.apache.org/jira/browse/PDFBOX-5869
> Project: PDFBox
>  Issue Type: Bug
>Reporter: Simon Steiner
>Priority: Major
>
> Can you enforce via the CI that mvn checkstyle:check passes
> Disable any rules in the config you dont want to enforce




[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874501#comment-17874501
 ] 

Tilman Hausherr edited comment on PDFBOX-5868 at 8/17/24 12:57 PM:
---

It's already done elsewhere and makes sure that the logic isn't applied during 
an ActualText segment:
{code}
if (suppressDuplicateOverlappingText && actualText == null)
{code}
Your proposed change ends up setting {{suppressDuplicateOverlappingText}} to 
true even if it was set to false (it's an obscure option of the stripper).
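
The effect of that condition can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the PDFBox implementation: the class OverlapGuardSketch and its methods are hypothetical, and the duplicate test (same character at the same rounded position) is a simplification. It shows that the duplicate check is skipped while an ActualText segment is active, and that the option itself is never changed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a text stripper; not PDFBox code.
final class OverlapGuardSketch {
    // Mirrors the stripper option; it is only set by the caller.
    private final boolean suppressDuplicateOverlappingText;
    // Non-null while inside an ActualText segment (assumed representation).
    private String actualText;
    private final List<String> seen = new ArrayList<>();
    private final StringBuilder output = new StringBuilder();

    OverlapGuardSketch(boolean suppressDuplicateOverlappingText) {
        this.suppressDuplicateOverlappingText = suppressDuplicateOverlappingText;
    }

    void beginActualText(String text) { actualText = text; }

    void endActualText() { actualText = null; }

    void processGlyph(String unicode, float x, float y) {
        String key = unicode + "@" + Math.round(x) + "," + Math.round(y);
        // Same guard as in the comment above: no suppression inside ActualText.
        if (suppressDuplicateOverlappingText && actualText == null) {
            if (seen.contains(key)) {
                return; // drop the overlapping duplicate
            }
            seen.add(key);
        }
        output.append(unicode);
    }

    String text() { return output.toString(); }
}
```

With the option enabled, two identical glyphs at the same position outside an ActualText segment yield one character; inside a segment, or with the option disabled, both are kept.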


was (Author: tilman):
It's already done elsewhere:
{code}
if (suppressDuplicateOverlappingText && actualText == null)
{code}
Your proposed change ends up setting {{suppressDuplicateOverlappingText}} to 
true even if it was set to false (it's an obscure option of the stripper).

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt, 
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, 
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>

[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



[ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874501#comment-17874501
 ] 

Tilman Hausherr commented on PDFBOX-5868:
-

It's already done elsewhere:
{code}
if (suppressDuplicateOverlappingText && actualText == null)
{code}
Your proposed change ends up setting {{suppressDuplicateOverlappingText}} to 
true even if it was set to false (it's an obscure option of the stripper).

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt, 
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, 
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>

[jira] [Closed] (PDFBOX-2740) Text extraction failed on Korean PDF



 [ 
https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-2740.
---
Resolution: Not A Problem

The /ActualText problem was fixed in PDFBOX-5868. However, extraction of this 
file had already been improved before that.

> Text extraction failed on Korean PDF
> 
>
> Key: PDFBOX-2740
> URL: https://issues.apache.org/jira/browse/PDFBOX-2740
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.7, 1.8.8, 1.8.9, 2.0.0
>Reporter: Julien Ortega
>Assignee: John Hewson
>Priority: Major
>  Labels: ActualText
> Attachments: g_KO_201506-ReaderDC-cutAndPaste.txt, 
> g_KO_201506-ReaderDC-saveAsText.txt, g_KO_201506.pdf, g_KO_201506.txt
>
>
> Trying to extract text on a Korean PDF gives me a lot of warnings :
> WARNING: No Unicode mapping for US (33) in font 
> DVCAYA+WtKoBaeumMyungjoL063zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for NAK (33) in font 
> JYLDGG+WtKoBaeumMyungjoL053zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for RS (38) in font 
> WRYULE+WtKoBaeumMyungjoL013zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont 
> WARNING: Invalid ToUnicode CMap in font FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for DEL (33) in font 
> FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont 
> WARNING: Invalid ToUnicode CMap in font OOLNBG+WtKoBaeumGothicL0122b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for SOH (33) in font 
> OOLNBG+WtKoBaeumGothicL0122b4?Pw
> and the result is not readable. The PDF must contain the necessary 
> conversion table, because every PDF reader (desktop or mobile) lets me copy and 
> paste the text without problems.
>  




[jira] [Reopened] (PDFBOX-2740) Text extraction failed on Korean PDF



 [ 
https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reopened PDFBOX-2740:
-

> Text extraction failed on Korean PDF
> 
>
> Key: PDFBOX-2740
> URL: https://issues.apache.org/jira/browse/PDFBOX-2740
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.7, 1.8.8, 1.8.9, 2.0.0
>Reporter: Julien Ortega
>Assignee: John Hewson
>Priority: Major
> Attachments: g_KO_201506-ReaderDC-cutAndPaste.txt, 
> g_KO_201506-ReaderDC-saveAsText.txt, g_KO_201506.pdf, g_KO_201506.txt
>
>




[jira] [Updated] (PDFBOX-2740) Text extraction failed on Korean PDF



 [ 
https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2740:

Labels: ActualText  (was: )

> Text extraction failed on Korean PDF
> 
>
> Key: PDFBOX-2740
> URL: https://issues.apache.org/jira/browse/PDFBOX-2740
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.8.7, 1.8.8, 1.8.9, 2.0.0
>Reporter: Julien Ortega
>Assignee: John Hewson
>Priority: Major
>  Labels: ActualText
> Attachments: g_KO_201506-ReaderDC-cutAndPaste.txt, 
> g_KO_201506-ReaderDC-saveAsText.txt, g_KO_201506.pdf, g_KO_201506.txt
>
>




[jira] [Closed] (PDFBOX-4532) PDFTextStripper replacing the decimal with white space



 [ 
https://issues.apache.org/jira/browse/PDFBOX-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr closed PDFBOX-4532.
---
Resolution: Duplicate

Fixed in PDFBOX-5868

> PDFTextStripper replacing the decimal with white space
> --
>
> Key: PDFBOX-4532
> URL: https://issues.apache.org/jira/browse/PDFBOX-4532
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.15
>Reporter: Akash Gupta
>Priority: Major
>  Labels: ActualText
> Attachments: FSUSA00BDD.pdf, PDFBOX-4532-reduced.pdf, SO71723006.pdf, 
> code_textStripper.PNG, numbers_without_decimal.PNG
>
>
> I'm using PDFTextStripperByArea, to be specific, and trying to extract a 
> particular area from the document. 
> In the output, most of the numbers (all but one) have their decimal point 
> replaced by a white space. When I copy and paste the text using Adobe 
> Reader/Chrome, the decimal points are preserved.




[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5868:

Fix Version/s: 2.0.33
   3.0.4 PDFBox
   4.0.0

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt, 
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, 
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used 
> the export:text command line tool to obtain the results:
>  * multilingual_test.pdf is the original PDF I made to test multilingual 
> text extraction.
>  * pdfbox_out.txt is the text file produced by PDFBox.
>  * adobe_out.txt is the text file created by Adobe Reader's save as text 
> feature.
>  
> Observation:
> As you can see in the attachments, the text file produced by PDFBox shows odd 
> Unicode characters for Tamil and Bengali (for Hindi the characters are extracted but 
> not overlapped; Japanese seems fine to me). In contrast, the text file 
> obtained from Adobe Reader's save as text feature seems fine, and copy-pasting 
> the text from my document viewer (Evince) also works.
> Questions:
>  # Why are the outputs from PDFBox and Adobe different?
>  # What can I do to extract the text from a multilingual PDF correctly?
>  # Is there a way to apply pattern matching to text in a PDF file and declare 
> matches without extracting the text first? (say if the problem is with fonts 
> and glyphs)
> ---
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using 
> Apache Tika for parsing documents. I noticed a problem with extracted PDF text 
> (other file types parse fine). I used the executable PDFBox jar to conclude that the 
> _problem is in PDFBox and not in Tika._ I tested with Adobe Reader's extract 
> text to confirm the problem is not with the PDF. I want to extract this 
> multilingual text to run pattern matching on it alone and do not need to 
> display the content, only to know whether the pattern is present or not (say if the 
> problem is with fonts and glyphs)
>  
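For the pattern-matching use case described in the report, the matching step
itself can be sketched in plain Java once a correctly extracted string is
available. This is only an illustration under stated assumptions: the class
name ExtractedTextMatcher and the method containsNormalized are hypothetical,
the input string stands in for whatever the extraction step returns, and the
sketch does not touch PDFBox or the /ActualText fix. It normalizes both the
text and the search term to Unicode NFC so that composed and decomposed forms
of the same character compare equal.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class ExtractedTextMatcher {

    // Hypothetical helper: reports whether the search term occurs in the
    // already-extracted text. Both sides are normalized to NFC first, and the
    // term is quoted so it is matched literally rather than as a regex.
    static boolean containsNormalized(String extractedText, String term) {
        String haystack = Normalizer.normalize(extractedText, Normalizer.Form.NFC);
        String literal = Normalizer.normalize(term, Normalizer.Form.NFC);
        Pattern pattern = Pattern.compile(
                Pattern.quote(literal),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return pattern.matcher(haystack).find();
    }

    public static void main(String[] args) {
        // Assumed sample input: a decomposed accented letter followed by a
        // Tamil word, written with escapes to avoid source-encoding issues.
        String extracted = "cafe\u0301 \u0ba4\u0bae\u0bbf\u0bb4\u0bcd";

        // Search with the precomposed form of the accented letter.
        System.out.println(containsNormalized(extracted, "caf\u00e9"));
        // Search for the Tamil word itself.
        System.out.println(containsNormalized(extracted, "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"));
    }
}
```

Normalization only helps when the extracted code points are already correct.
If the extraction step returns wrong characters, as in the attached
pdfbox_out.txt, no amount of normalization will make the pattern match.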




[jira] [Assigned] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr reassigned PDFBOX-5868:
---

Assignee: Tilman Hausherr

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Assignee: Tilman Hausherr
>Priority: Major
>  Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt, 
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, 
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>




[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5868:

Affects Version/s: 2.0.32

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Priority: Major
>  Labels: ActualText
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt, 
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, 
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>




[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does



 [ 
https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-5868:

Labels: ActualText  (was: )

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly 
> but adobe reader's save as text does
> ---
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
>Reporter: Manish S N
>Priority: Major
>  Labels: ActualText
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt, 
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt, 
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
>




[jira] [Updated] (PDFBOX-3248) Unwanted spaces in text extraction (2)