[jira] [Resolved] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.
[ https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-5882.
    Assignee: Tilman Hausherr
  Resolution: Fixed

I've added a warning. My original idea was to throw an exception, but these might happen in rendering in some cases so the warning is a compromise.

> The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.
> ---
>
>                 Key: PDFBOX-5882
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5882
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 2.0.24, 2.0.32, 3.0.3 PDFBox
>            Reporter: bai yuan
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>         Attachments: excel_pattern_fill-fixed.pdf, excel_pattern_fill.pdf, image-2024-10-08-16-04-32-344.png, image-2024-10-08-16-04-49-033.png
>
>
> The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.
> It appears red in Adobe and Chrome, which is correct.
> It appears blue in Safari, which is incorrect.
> Here is the example code:
> {code:java}
> try (PDDocument document = new PDDocument()) {
>     PDPage page = new PDPage();
>     document.addPage(page);
>
>     try (PDPageContentStream contentStream = new PDPageContentStream(document, page)) {
>         PDTilingPattern pattern = new PDTilingPattern();
>         pattern.setBBox(new PDRectangle(3, 3));
>         pattern.setPaintType(PDTilingPattern.PAINT_UNCOLORED);
>         pattern.setTilingType(PDTilingPattern.TILING_CONSTANT_SPACING);
>         pattern.setXStep(3);
>         pattern.setYStep(3);
>         pattern.setMatrix(Matrix.getScaleInstance(1, 1).createAffineTransform());
>         try (PDPatternContentStream patternContentStream = new PDPatternContentStream(pattern)) {
>             patternContentStream.setLineWidth(0.4f);
>             patternContentStream.moveTo(0, 2);
>             patternContentStream.lineTo(0, 3);
>             patternContentStream.lineTo(2, 3);
>             patternContentStream.lineTo(2, 2);
>             patternContentStream.lineTo(3, 2);
>             patternContentStream.lineTo(3, 0);
>             patternContentStream.lineTo(2, 0);
>             patternContentStream.lineTo(2, 1);
>             patternContentStream.lineTo(1, 1);
>             patternContentStream.lineTo(1, 2);
>             patternContentStream.closePath();
>             patternContentStream.fill();
>         } catch (IOException e) {
>             throw new RuntimeException(e);
>         };
>         COSName patternName = page.getResources().add(pattern);
>         PDPattern pdPattern = new PDPattern(page.getResources(), PDDeviceRGB.INSTANCE);
>         PDColor pdColor = new PDColor(Color.RED.getComponents(null), patternName, pdPattern);
>         contentStream.setNonStrokingColor(pdColor);
>         contentStream.addRect(100, 500, 400, 200);
>         contentStream.fill();
>     }
>     document.save("excel_pattern_fill.pdf");
> }
> {code}
> **Safari:**
> !image-2024-10-08-16-04-32-344.png!
> Adobe:
> !image-2024-10-08-16-04-49-033.png!
> The exported pdf file : excel_pattern_fill.pdf

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
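The compromise described in the resolution message above (warn instead of throwing) could look roughly like the sketch below. This is not the actual PDFBox change, which is not shown in this thread: the class ComponentCountCheck, the method hasExpectedComponents and the message text are hypothetical, and java.util.logging merely stands in for whatever logging PDFBox uses.

```java
import java.util.logging.Logger;

// Hypothetical sketch only; not taken from the PDFBox sources.
final class ComponentCountCheck {

    private static final Logger LOG =
            Logger.getLogger(ComponentCountCheck.class.getName());

    // Returns true when the array has the expected number of components.
    // On a mismatch it logs a warning and returns false instead of throwing,
    // so a caller that is rendering an existing file can carry on.
    static boolean hasExpectedComponents(float[] components, int expected) {
        if (components.length == expected) {
            return true;
        }
        LOG.warning("Color has " + components.length
                + " components, but the color space expects " + expected);
        return false;
    }

    public static void main(String[] args) {
        // Three components for an RGB-based pattern color: no warning.
        System.out.println(hasExpectedComponents(new float[] {1f, 0f, 0f}, 3));
        // Four components (RGB plus alpha), as in the report: warning logged.
        System.out.println(hasExpectedComponents(new float[] {1f, 0f, 0f, 1f}, 3));
    }
}
```

The point of returning a flag rather than throwing is the one given in the resolution: a mismatch may also turn up while rendering, where aborting would be worse than a log entry.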
[jira] [Updated] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.
[ https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5882:
    Fix Version/s: 2.0.33
                   3.0.4 PDFBox
                   4.0.0
[jira] [Updated] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.
[ https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5882:
    Component/s: PDModel
[jira] [Updated] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.
[ https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5882:
    Affects Version/s: 3.0.3 PDFBox
                       2.0.32
[jira] [Comment Edited] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.
[ https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887521#comment-17887521 ]

Tilman Hausherr edited comment on PDFBOX-5882 at 10/8/24 9:25 AM:

I'm testing adding a check in PDColor constructor.
[jira] [Commented] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.
[ https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887521#comment-17887521 ]

Tilman Hausherr commented on PDFBOX-5882:

I'm testing adding check in PDColor constructor.
[jira] [Updated] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.
[ https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5882:
    Attachment: excel_pattern_fill-fixed.pdf
[jira] [Commented] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.
[ https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887517#comment-17887517 ]

Tilman Hausherr commented on PDFBOX-5882:

Here's a file generated with the correct code: [^excel_pattern_fill-fixed.pdf]
[jira] [Comment Edited] (PDFBOX-5882) The pattern created with PDFBox shows inconsistent colors between Safari and Adobe.
[ https://issues.apache.org/jira/browse/PDFBOX-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887515#comment-17887515 ]

Tilman Hausherr edited comment on PDFBOX-5882 at 10/8/24 8:28 AM:

That's because {{Color.RED.getComponents(null)}} returns a 4 component array, which includes the alpha, so this appears as {{1 0 0 1 /p1 scn}} in the PDF (4 components instead of 3). I suspect Safari uses the last 3 components, while the others including PDFBox use the first 3. Use {{new float[] \{1,0,0\}}} instead.
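A small, hedged illustration of the component-count point made in the comment above, using only java.awt.Color so that it runs without PDFBox. The class name PatternColorComponents and the helper rgbOnly are made up for this sketch. It assumes the pattern's underlying color space is DeviceRGB, as in the reported code, so exactly three components are wanted.

```java
import java.awt.Color;
import java.util.Arrays;

// Illustrative sketch; names are hypothetical and no PDFBox classes are used.
final class PatternColorComponents {

    // Returns only red, green and blue (no alpha) for an AWT color.
    static float[] rgbOnly(Color color) {
        return color.getRGBColorComponents(null);
    }

    public static void main(String[] args) {
        // getComponents(null) appends the alpha value: four entries for an sRGB color.
        float[] withAlpha = Color.RED.getComponents(null);
        // getRGBColorComponents(null) leaves alpha out: three entries.
        float[] rgb = rgbOnly(Color.RED);

        System.out.println(withAlpha.length + " components: " + Arrays.toString(withAlpha));
        System.out.println(rgb.length + " components: " + Arrays.toString(rgb));
    }
}
```

With PDFBox on the classpath, the constructor call from the report could then presumably be written as new PDColor(rgbOnly(Color.RED), patternName, pdPattern), which amounts to the same thing as passing new float[] {1, 0, 0} as suggested in the comment.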
Re: Apache PDFBox Board Report October 2024 due
+1

Tilman

On 07.10.2024 17:37, Andreas Lehmkühler wrote:

Hi,

find attached a quick draft of the board report we're expected to submit this month. It's based upon the report wizard template which can be found at [1]

Any comments or additions are appreciated ...

Sorry for the short notice, but I wasn't able to prepare a report earlier due to some personal reasons.

## Description:
The mission of PDFBox is the creation and maintenance of software related to Java library for working with PDF documents

## Project Status:
Current project status: Ongoing with moderate activity
Issues for the board: none

## Membership Data:
Apache PDFBox was founded 2009-10-21 (15 years ago)
There are currently 21 committers and 21 PMC members in this project.
The Committer-to-PMC ratio is 1:1.

Community changes, past quarter:
- No new PMC members. Last addition was Matthäus Mayer on 2017-10-16.
- No new committers. Last addition was Joerg O. Henne on 2017-10-09.

## Project Activity:
Recent releases:
3.0.3 was released on 2024-08-08.
2.0.32 was released on 2024-07-24.
2.0.31 was released on 2024-03-24.

## Community Health:
- there is a steady stream of contributions, bug reports and questions on the mailing lists
- it was a more quiet quarter due to the holiday season
- another 3.0.x and 2.0.x will most likely be released before xmas
[jira] [Resolved] (PDFBOX-5881) CVE for Lucene libraries
[ https://issues.apache.org/jira/browse/PDFBOX-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-5881.
    Resolution: Fixed

> CVE for Lucene libraries
>
>                 Key: PDFBOX-5881
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5881
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.32, 3.0.3 PDFBox
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>            Priority: Minor
>             Fix For: 2.0.33, 3.0.4 PDFBox
>
>
> It looks like Lucene won't make any older jar files that fix CVE-2024-45772, so I'll add a suppression file.
[jira] [Created] (PDFBOX-5881) CVE for Lucene libraries
Tilman Hausherr created PDFBOX-5881:

             Summary: CVE for Lucene libraries
                 Key: PDFBOX-5881
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5881
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 3.0.3 PDFBox, 2.0.32
            Reporter: Tilman Hausherr
            Assignee: Tilman Hausherr
             Fix For: 2.0.33, 3.0.4 PDFBox

It looks like Lucene won't make any older jar files that fix CVE-2024-45772, so I'll add a suppression file.
build timeout fails
https://issues.apache.org/jira/browse/INFRA-26175
[jira] [Comment Edited] (PDFBOX-4718) OutOfMemoryError - during renderImageWithDPI
[ https://issues.apache.org/jira/browse/PDFBOX-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886737#comment-17886737 ]

Tilman Hausherr edited comment on PDFBOX-4718 at 10/3/24 5:39 PM:

Sadly some differences in rendering: PDFBOX-2557, PDFBOX-3182, PDFBOX-5842 (VW logo missing), PDFBOX-3116.pdf (half-circles bottom right)

> OutOfMemoryError - during renderImageWithDPI
>
>                 Key: PDFBOX-4718
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4718
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering
>    Affects Versions: 2.0.17, 3.0.3 PDFBox, 4.0.0
>         Environment: macOS Mojave (10.14.6)
>                      Java 11.0.2 -Xmx10G -Xms10G
>            Reporter: Serhii Kolesnyk
>            Assignee: Andreas Lehmkühler
>            Priority: Blocker
>             Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>         Attachments: PDFBOX-4718-reduced.pdf, PDFBox4718Intersect.java, example.pdf, image-2019-12-19-05-55-57-648.png
>
>
> While rendering a PDF we receive _java.lang.OutOfMemoryError: Java heap space_
> {code:java}
> Exception in thread "AWT-Shutdown" java.lang.OutOfMemoryError: Java heap space
>     at java.desktop/sun.awt.AppContext.getAppContexts(AppContext.java:167)
>     at java.desktop/sun.awt.AppContext.stopEventDispatchThreads(AppContext.java:610)
>     at java.desktop/sun.awt.AWTAutoShutdown.run(AWTAutoShutdown.java:322)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> java.lang.OutOfMemoryError: Java heap space
>     at java.desktop/sun.awt.geom.AreaOp.pruneEdges(AreaOp.java:362)
>     at java.desktop/sun.awt.geom.AreaOp.calculate(AreaOp.java:159)
>     at java.desktop/java.awt.geom.Area.intersect(Area.java:293)
>     at org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState.intersectClippingPath(PDGraphicsState.java:618)
>     at org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState.intersectClippingPath(PDGraphicsState.java:597)
>     at org.apache.pdfbox.rendering.PageDrawer.endPath(PageDrawer.java:936)
>     at org.apache.pdfbox.contentstream.operator.graphics.EndPath.process(EndPath.java:35)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152)
>     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:262)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:314)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:243)
>     at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:229){code}
> We tried different MemoryUsageSetting values (TempFileOnly, MainMemoryOnly) and different DPI settings.
[jira] [Commented] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected
[ https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885635#comment-17885635 ] Tilman Hausherr commented on PDFBOX-5880: - Now it works! > PDF render blank page: The end of the stream doesn't point to the correct > offset, using workaround to read the stream, stream start position: 196, > length: 0, expected end position: 196 > > > Key: PDFBOX-5880 > URL: https://issues.apache.org/jira/browse/PDFBOX-5880 > Project: PDFBox > Issue Type: Bug > Components: Parsing >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Joseph Jezerinac >Assignee: Andreas Lehmkühler >Priority: Major > Labels: regression > Fix For: 3.0.4 PDFBox > > Attachments: PDFBOX-1094-PDFBOX-269.pdf, test.pdf > > > When rendering page one of the attached PDF the image does not render. > In the logs, I see the following: > {noformat} > 2024-09-24 13:25:56:924 [main] WARN COSParser - The end of the stream doesn't > point to the correct offset, using workaround to read the stream, stream > start position: 196, length: 0, expected end position: 196 > 2024-09-24 13:25:56:930 [main] WARN PDFStreamEngine - Image stream is empty > java.io.IOException: Image stream is empty > at > org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:182) > at > org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:477) > at > org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:438) > at > org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1107) > {noformat} > I assume this is a bad PDF, but Acrobat, Chrome, etc., display it without an > issue. 
> Here's the render code used: > {code:java} > File out = File.createTempFile("test-", ".png"); > PDDocument pdDocument = Loader.loadPDF(pdf); > final PDFRenderer pdfRenderer = new PDFRenderer(pdDocument); > ImageIO.write(pdfRenderer.renderImageWithDPI(0, 300), "png", out); > {code}
[jira] [Updated] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected en
[ https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5880: Attachment: PDFBOX-1094-PDFBOX-269.pdf > PDF render blank page: The end of the stream doesn't point to the correct > offset, using workaround to read the stream, stream start position: 196, > length: 0, expected end position: 196 > > > Key: PDFBOX-5880 > URL: https://issues.apache.org/jira/browse/PDFBOX-5880 > Project: PDFBox > Issue Type: Bug > Components: Parsing >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Joseph Jezerinac >Assignee: Andreas Lehmkühler >Priority: Major > Labels: regression > Attachments: PDFBOX-1094-PDFBOX-269.pdf, test.pdf > > > When rendering page one of the attached PDF the image does not render. > In the logs, I see the following: > {noformat} > 2024-09-24 13:25:56:924 [main] WARN COSParser - The end of the stream doesn't > point to the correct offset, using workaround to read the stream, stream > start position: 196, length: 0, expected end position: 196 > 2024-09-24 13:25:56:930 [main] WARN PDFStreamEngine - Image stream is empty > java.io.IOException: Image stream is empty > at > org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:182) > at > org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:477) > at > org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:438) > at > org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1107) > {noformat} > I assume this is a bad PDF, but Acrobat, Chrome, etc., display it without an > issue. 
> Here's the render code used: > {code:java} > File out = File.createTempFile("test-", ".png"); > PDDocument pdDocument = Loader.loadPDF(pdf); > final PDFRenderer pdfRenderer = new PDFRenderer(pdDocument); > ImageIO.write(pdfRenderer.renderImageWithDPI(0, 300), "png", out); > {code}
[jira] [Commented] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected
[ https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885251#comment-17885251 ] Tilman Hausherr commented on PDFBOX-5880: - Several differences, e.g. [^PDFBOX-1094-PDFBOX-269.pdf] page 2ff, the light background is different. Also the file of PDFBOX-1738. > PDF render blank page: The end of the stream doesn't point to the correct > offset, using workaround to read the stream, stream start position: 196, > length: 0, expected end position: 196 > > > Key: PDFBOX-5880 > URL: https://issues.apache.org/jira/browse/PDFBOX-5880 > Project: PDFBox > Issue Type: Bug > Components: Parsing >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Joseph Jezerinac >Assignee: Andreas Lehmkühler >Priority: Major > Labels: regression > Attachments: PDFBOX-1094-PDFBOX-269.pdf, test.pdf > > > When rendering page one of the attached PDF the image does not render. > In the logs, I see the following: > {noformat} > 2024-09-24 13:25:56:924 [main] WARN COSParser - The end of the stream doesn't > point to the correct offset, using workaround to read the stream, stream > start position: 196, length: 0, expected end position: 196 > 2024-09-24 13:25:56:930 [main] WARN PDFStreamEngine - Image stream is empty > java.io.IOException: Image stream is empty > at > org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:182) > at > org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:477) > at > org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:438) > at > org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1107) > {noformat} > I assume this is a bad PDF, but Acrobat, Chrome, etc., display it without an > issue. 
> Here's the render code used: > {code:java} > File out = File.createTempFile("test-", ".png"); > PDDocument pdDocument = Loader.loadPDF(pdf); > final PDFRenderer pdfRenderer = new PDFRenderer(pdDocument); > ImageIO.write(pdfRenderer.renderImageWithDPI(0, 300), "png", out); > {code}
[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884739#comment-17884739 ] Tilman Hausherr commented on PDFBOX-5852: - All good now, thanks! > Hi CPU and memory usage when converting a PDF with type 4 shading > - > > Key: PDFBOX-5852 > URL: https://issues.apache.org/jira/browse/PDFBOX-5852 > Project: PDFBox > Issue Type: Wish > Components: Rendering >Affects Versions: 2.0.28 >Reporter: Larry Lynn >Assignee: Andreas Lehmkühler >Priority: Major > Fix For: 2.0.33, 3.0.3 PDFBox, 4.0.0 > > Attachments: CIB-coonsmesh.pdf, minimal.pdf > > > We've observed excessive CPU and memory consumption when converting a PDF to > images when the PDF contains type 4 shading. This is especially noticeable > when the conversion is done with a high DPI. Can this be improved? > > Conversation from the PDFBox users mailing list follows > Initial email: > {quote} > Hi CPU and memory usage when converting a PDF with type 4 shadingHello PDFBox > users and maintainers, > We have a PDF that causes performance problems when we use PDFBox to > convert it to an image with renderImageWithDPI(). We're calling > renderImageWithDPI() > with 650 DPI. I realize this is a very high value - we're using it for > high fidelity original images that will later be downsampled. On my work > laptop which has fairly strong hardware, the conversion takes 25 minutes > and consumes 20GB of memory. CPU and memory usage is reduced if we use a > lower DPI. > The PDF is 1 page long. It contains type 4 shading / Gouraud free form > triangle meshes. We've been aware of some performance issues with type 4 > shading for a little while now, but the PDFs that contained the type 4 > shading belonged to our customers and we were not authorized to share > them. We finally found a problem input document that is non-sensitive and > that we are authorized to share. I've attached a copy of the problem PDF > to this email. 
> I searched the archives for the users and the developers mailing list and I > didn't find anything specifically about this issue. > I searched through the PDFBox jira tickets and I found a couple of tickets > that looked similar: PDFBOX-2901 & PDFBOX-4491. PDFBOX-2901 seems to most > closely describe what we're seeing, but that was closed in PDFBox 2.0.0, > and our issue still reproduces with PDFBox 2.0.28. > Should I refer this issue over to the developers mailing list or create a > PDFBox Jira ticket for this? > Thanks and Regards, > Larry Lynn {quote} > Response: > {quote} > Hi, > Yes shading can be very slow, especially at high dpi. The attachment > didn't get through, please upload to a sharehoster or create a ticket. > If you need to register then add a meaningful text, e.g. the subject of > this post so we know you're not a spammer. Also retry with 2.0.31 and > 3.0.2 just to be sure. However I'm pessimistic that this can be fixed. > Tilman {quote}
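The report that CPU and memory usage drop at a lower DPI can be put into rough numbers. The sketch below only estimates the size of the output bitmap. The 8.5 x 11 inch page size and the 4 bytes per pixel are assumptions, since the ticket states neither, and `DpiMemorySketch`, `pixelCount` and `rasterBytes` are hypothetical names, not PDFBox API.

```java
public class DpiMemorySketch {
    // Number of pixels in the rendered image for a page of the given size.
    // Assumption: width and height are simply scaled by the DPI and rounded.
    static long pixelCount(double widthInches, double heightInches, int dpi) {
        long width = Math.round(widthInches * dpi);
        long height = Math.round(heightInches * dpi);
        return width * height;
    }

    // Size of the output bitmap, assuming 4 bytes per pixel (e.g. an ARGB raster).
    static long rasterBytes(double widthInches, double heightInches, int dpi) {
        return pixelCount(widthInches, heightInches, dpi) * 4L;
    }

    public static void main(String[] args) {
        // Assumed page size: 8.5 x 11 inches. The ticket does not give the real one.
        for (int dpi : new int[] {72, 300, 650}) {
            System.out.println(dpi + " DPI: " + pixelCount(8.5, 11, dpi) + " pixels, "
                    + rasterBytes(8.5, 11, dpi) + " bytes");
        }
    }
}
```

The pixel count grows with the square of the DPI, so going from 300 to 650 DPI multiplies it by roughly 4.7. Under these assumptions the 650 DPI bitmap is about 158 MB, far below the reported 20GB, which suggests that most of the memory is spent elsewhere in rendering rather than on the output image itself.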
[jira] [Commented] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected
[ https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884548#comment-17884548 ] Tilman Hausherr commented on PDFBOX-5880: - The proposed change is either to add {{stream.setLong(COSName.LENGTH, streamLength);}} or to change the foreach loop so that it doesn't overwrite the length entry. > PDF render blank page: The end of the stream doesn't point to the correct > offset, using workaround to read the stream, stream start position: 196, > length: 0, expected end position: 196 > > > Key: PDFBOX-5880 > URL: https://issues.apache.org/jira/browse/PDFBOX-5880 > Project: PDFBox > Issue Type: Bug > Components: Parsing >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Joseph Jezerinac >Priority: Major > Labels: regression > Attachments: test.pdf > > > When rendering page one of the attached PDF the image does not render. > In the logs, I see the following: > {noformat} > 2024-09-24 13:25:56:924 [main] WARN COSParser - The end of the stream doesn't > point to the correct offset, using workaround to read the stream, stream > start position: 196, length: 0, expected end position: 196 > 2024-09-24 13:25:56:930 [main] WARN PDFStreamEngine - Image stream is empty > java.io.IOException: Image stream is empty > at > org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:182) > at > org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:477) > at > org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:438) > at > org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1107) > {noformat} > I assume this is a bad PDF, but Acrobat, Chrome, etc., display it without an > issue.
> Here's the render code used: > {code:java} > File out = File.createTempFile("test-", ".png"); > PDDocument pdDocument = Loader.loadPDF(pdf); > final PDFRenderer pdfRenderer = new PDFRenderer(pdDocument); > ImageIO.write(pdfRenderer.renderImageWithDPI(0, 300), "png", out); > {code}
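The two alternatives named in the comment above (set the length again, or keep the loop from overwriting it) can be sketched with plain maps standing in for the PDFBox objects. This is an illustration only, not the committed fix. `LengthFixSketch`, `copyThenRestore` and `copyExceptLength` are hypothetical names, and the sketch assumes the stream already holds the measured length before the dictionary entries are copied.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LengthFixSketch {
    static final String LENGTH = "Length";

    // Alternative 1: copy every entry, then write the measured length once more.
    static Map<String, Object> copyThenRestore(Map<String, Object> dictionary, long streamLength) {
        Map<String, Object> stream = new LinkedHashMap<>();
        stream.put(LENGTH, streamLength);  // assumed initial state of the stream
        dictionary.forEach(stream::put);   // may replace the length with a wrong value
        stream.put(LENGTH, streamLength);  // restore the measured length
        return stream;
    }

    // Alternative 2: skip the length entry while copying.
    static Map<String, Object> copyExceptLength(Map<String, Object> dictionary, long streamLength) {
        Map<String, Object> stream = new LinkedHashMap<>();
        stream.put(LENGTH, streamLength);  // assumed initial state of the stream
        dictionary.forEach((key, value) -> {
            if (!LENGTH.equals(key)) {
                stream.put(key, value);
            }
        });
        return stream;
    }

    public static void main(String[] args) {
        Map<String, Object> dictionary = new LinkedHashMap<>();
        dictionary.put("Subtype", "Image");
        dictionary.put(LENGTH, 0L);        // wrong length, as in the reported file
        System.out.println(copyThenRestore(dictionary, 1234L));
        System.out.println(copyExceptLength(dictionary, 1234L));
    }
}
```

Both variants leave the stream with the measured length while still copying the other entries. Which one fits PDFBox better depends on details of `COSStream` that this sketch does not model.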
[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884540#comment-17884540 ] Tilman Hausherr commented on PDFBOX-5852: - E.g. with this file: [^CIB-coonsmesh.pdf] ArrayIndexOutOfBoundsException: Index 400 out of bounds for length 400 org.apache.pdfbox.pdmodel.graphics.shading.PatchMeshesShadingContext.calcPixelTableArray(PatchMeshesShadingContext.java:67) org.apache.pdfbox.pdmodel.graphics.shading.TriangleBasedShadingContext.createPixelTable(TriangleBasedShadingContext.java:67) org.apache.pdfbox.pdmodel.graphics.shading.PatchMeshesShadingContext.<init>(PatchMeshesShadingContext.java:57) org.apache.pdfbox.pdmodel.graphics.shading.Type6ShadingContext.<init>(Type6ShadingContext.java:45) org.apache.pdfbox.pdmodel.graphics.shading.Type6ShadingPaint.createContext(Type6ShadingPaint.java:63) > Hi CPU and memory usage when converting a PDF with type 4 shading > - > > Key: PDFBOX-5852 > URL: https://issues.apache.org/jira/browse/PDFBOX-5852 > Project: PDFBox > Issue Type: Wish > Components: Rendering >Affects Versions: 2.0.28 >Reporter: Larry Lynn >Assignee: Andreas Lehmkühler >Priority: Major > Fix For: 2.0.33, 3.0.3 PDFBox, 4.0.0 > > Attachments: CIB-coonsmesh.pdf, minimal.pdf > > > We've observed excessive CPU and memory consumption when converting a PDF to > images when the PDF contains type 4 shading. This is especially noticeable > when the conversion is done with a high DPI. Can this be improved? > > Conversation from the PDFBox users mailing list follows > Initial email: > {quote} > Hi CPU and memory usage when converting a PDF with type 4 shadingHello PDFBox > users and maintainers, > We have a PDF that causes performance problems when we use PDFBox to > convert it to an image with renderImageWithDPI(). We're calling > renderImageWithDPI() > with 650 DPI. I realize this is a very high value - we're using it for > high fidelity original images that will later be downsampled.
On my work > laptop which has fairly strong hardware, the conversion takes 25 minutes > and consumes 20GB of memory. CPU and memory usage is reduced if we use a > lower DPI. > The PDF is 1 page long. It contains type 4 shading / Gouraud free form > triangle meshes. We've been aware of some performance issues with type 4 > shading for a little while now, but the PDFs that contained the type 4 > shading belonged to our customers and we were not authorized to share > them. We finally found a problem input document that is non-sensitive and > that we are authorized to share. I've attached a copy of the problem PDF > to this email. > I searched the archives for the users and the developers mailing list and I > didn't find anything specifically about this issue. > I searched through the PDFBox jira tickets and I found a couple of tickets > that looked similar: PDFBOX-2901 & PDFBOX-4491. PDFBOX-2901 seems to most > closely describe what we're seeing, but that was closed in PDFBox 2.0.0, > and our issue still reproduces with PDFBox 2.0.28. > Should I refer this issue over to the developers mailing list or create a > PDFBox Jira ticket for this? > Thanks and Regards, > Larry Lynn {quote} > Response: > {quote} > Hi, > Yes shading can be very slow, especially at high dpi. The attachment > didn't get through, please upload to a sharehoster or create a ticket. > If you need to register then add a meaningful text, e.g. the subject of > this post so we know you're not a spammer. Also retry with 2.0.31 and > 3.0.2 just to be sure. However I'm pessimistic that this can be fixed. > Tilman {quote}
[jira] [Updated] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5852: Attachment: CIB-coonsmesh.pdf > Hi CPU and memory usage when converting a PDF with type 4 shading > - > > Key: PDFBOX-5852 > URL: https://issues.apache.org/jira/browse/PDFBOX-5852 > Project: PDFBox > Issue Type: Wish > Components: Rendering >Affects Versions: 2.0.28 >Reporter: Larry Lynn >Assignee: Andreas Lehmkühler >Priority: Major > Fix For: 2.0.33, 3.0.3 PDFBox, 4.0.0 > > Attachments: CIB-coonsmesh.pdf, minimal.pdf > > > We've observed excessive CPU and memory consumption when converting a PDF to > images when the PDF contains type 4 shading. This is especially noticeable > when the conversion is done with a high DPI. Can this be improved? > > Conversation from the PDFBox users mailing list follows > Initial email: > {quote} > Hi CPU and memory usage when converting a PDF with type 4 shadingHello PDFBox > users and maintainers, > We have a PDF that causes performance problems when we use PDFBox to > convert it to an image with renderImageWithDPI(). We're calling > renderImageWithDPI() > with 650 DPI. I realize this is a very high value - we're using it for > high fidelity original images that will later be downsampled. On my work > laptop which has fairly strong hardware, the conversion takes 25 minutes > and consumes 20GB of memory. CPU and memory usage is reduced if we use a > lower DPI. > The PDF is 1 page long. It contains type 4 shading / Gouraud free form > triangle meshes. We've been aware of some performance issues with type 4 > shading for a little while now, but the PDFs that contained the type 4 > shading belonged to our customers and we were not authorized to share > them. We finally found a problem input document that is non-sensitive and > that we are authorized to share. I've attached a copy of the problem PDF > to this email. 
> I searched the archives for the users and the developers mailing list and I > didn't find anything specifically about this issue. > I searched through the PDFBox jira tickets and I found a couple of tickets > that looked similar: PDFBOX-2901 & PDFBOX-4491. PDFBOX-2901 seems to most > closely describe what we're seeing, but that was closed in PDFBox 2.0.0, > and our issue still reproduces with PDFBox 2.0.28. > Should I refer this issue over to the developers mailing list or create a > PDFBox Jira ticket for this? > Thanks and Regards, > Larry Lynn {quote} > Response: > {quote} > Hi, > Yes shading can be very slow, especially at high dpi. The attachment > didn't get through, please upload to a sharehoster or create a ticket. > If you need to register then add a meaningful text, e.g. the subject of > this post so we know you're not a spammer. Also retry with 2.0.31 and > 3.0.2 just to be sure. However I'm pessimistic that this can be fixed. > Tilman {quote}
[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884533#comment-17884533 ] Tilman Hausherr commented on PDFBOX-5852: - Lots of regressions, I need to check whether this is because of another change I just did, or if the first test didn't have the new code activated. > Hi CPU and memory usage when converting a PDF with type 4 shading > - > > Key: PDFBOX-5852 > URL: https://issues.apache.org/jira/browse/PDFBOX-5852 > Project: PDFBox > Issue Type: Wish > Components: Rendering >Affects Versions: 2.0.28 >Reporter: Larry Lynn >Assignee: Andreas Lehmkühler >Priority: Major > Fix For: 2.0.33, 3.0.3 PDFBox, 4.0.0 > > Attachments: minimal.pdf > > > We've observed excessive CPU and memory consumption when converting a PDF to > images when the PDF contains type 4 shading. This is especially noticeable > when the conversion is done with a high DPI. Can this be improved? > > Conversation from the PDFBox users mailing list follows > Initial email: > {quote} > Hi CPU and memory usage when converting a PDF with type 4 shadingHello PDFBox > users and maintainers, > We have a PDF that causes performance problems when we use PDFBox to > convert it to an image with renderImageWithDPI(). We're calling > renderImageWithDPI() > with 650 DPI. I realize this is a very high value - we're using it for > high fidelity original images that will later be downsampled. On my work > laptop which has fairly strong hardware, the conversion takes 25 minutes > and consumes 20GB of memory. CPU and memory usage is reduced if we use a > lower DPI. > The PDF is 1 page long. It contains type 4 shading / Gouraud free form > triangle meshes. We've been aware of some performance issues with type 4 > shading for a little while now, but the PDFs that contained the type 4 > shading belonged to our customers and we were not authorized to share > them. 
We finally found a problem input document that is non-sensitive and > that we are authorized to share. I've attached a copy of the problem PDF > to this email. > I searched the archives for the users and the developers mailing list and I > didn't find anything specifically about this issue. > I searched through the PDFBox jira tickets and I found a couple of tickets > that looked similar: PDFBOX-2901 & PDFBOX-4491. PDFBOX-2901 seems to most > closely describe what we're seeing, but that was closed in PDFBox 2.0.0, > and our issue still reproduces with PDFBox 2.0.28. > Should I refer this issue over to the developers mailing list or create a > PDFBox Jira ticket for this? > Thanks and Regards, > Larry Lynn {quote} > Response: > {quote} > Hi, > Yes shading can be very slow, especially at high dpi. The attachment > didn't get through, please upload to a sharehoster or create a ticket. > If you need to register then add a meaningful text, e.g. the subject of > this post so we know you're not a spammer. Also retry with 2.0.31 and > 3.0.2 just to be sure. However I'm pessimistic that this can be fixed. > Tilman {quote}
[jira] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852 ] Tilman Hausherr deleted comment on PDFBOX-5852: - was (Author: tilman): No regressions 👍 > Hi CPU and memory usage when converting a PDF with type 4 shading > - > > Key: PDFBOX-5852 > URL: https://issues.apache.org/jira/browse/PDFBOX-5852 > Project: PDFBox > Issue Type: Wish > Components: Rendering >Affects Versions: 2.0.28 >Reporter: Larry Lynn >Assignee: Andreas Lehmkühler >Priority: Major > Fix For: 2.0.33, 3.0.3 PDFBox, 4.0.0 > > Attachments: minimal.pdf > > > We've observed excessive CPU and memory consumption when converting a PDF to > images when the PDF contains type 4 shading. This is especially noticeable > when the conversion is done with a high DPI. Can this be improved? > > Conversation from the PDFBox users mailing list follows > Initial email: > {quote} > Hi CPU and memory usage when converting a PDF with type 4 shadingHello PDFBox > users and maintainers, > We have a PDF that causes performance problems when we use PDFBox to > convert it to an image with renderImageWithDPI(). We're calling > renderImageWithDPI() > with 650 DPI. I realize this is a very high value - we're using it for > high fidelity original images that will later be downsampled. On my work > laptop which has fairly strong hardware, the conversion takes 25 minutes > and consumes 20GB of memory. CPU and memory usage is reduced if we use a > lower DPI. > The PDF is 1 page long. It contains type 4 shading / Gouraud free form > triangle meshes. We've been aware of some performance issues with type 4 > shading for a little while now, but the PDFs that contained the type 4 > shading belonged to our customers and we were not authorized to share > them. We finally found a problem input document that is non-sensitive and > that we are authorized to share. I've attached a copy of the problem PDF > to this email. 
> I searched the archives for the users and the developers mailing list and I > didn't find anything specifically about this issue. > I searched through the PDFBox jira tickets and I found a couple of tickets > that looked similar: PDFBOX-2901 & PDFBOX-4491. PDFBOX-2901 seems to most > closely describe what we're seeing, but that was closed in PDFBox 2.0.0, > and our issue still reproduces with PDFBox 2.0.28. > Should I refer this issue over to the developers mailing list or create a > PDFBox Jira ticket for this? > Thanks and Regards, > Larry Lynn {quote} > Response: > {quote} > Hi, > Yes shading can be very slow, especially at high dpi. The attachment > didn't get through, please upload to a sharehoster or create a ticket. > If you need to register then add a meaningful text, e.g. the subject of > this post so we know you're not a spammer. Also retry with 2.0.31 and > 3.0.2 just to be sure. However I'm pessimistic that this can be fixed. > Tilman {quote}
[jira] [Commented] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected
[ https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884531#comment-17884531 ] Tilman Hausherr commented on PDFBOX-5880: - The problem is here: {code:java} public COSStream createCOSStream(COSDictionary dictionary, long startPosition, long streamLength) throws IOException { COSStream stream = new COSStream(streamCache, parser.createRandomAccessReadView(startPosition, streamLength)); dictionary.forEach(stream::setItem); stream.setKey(dictionary.getKey()); return stream; } {code} The foreach loop overwrites the length. For some reason this didn't cause trouble in the past with wrong lengths, only this time, with a zero length that is an indirect object. > PDF render blank page: The end of the stream doesn't point to the correct > offset, using workaround to read the stream, stream start position: 196, > length: 0, expected end position: 196 > > > Key: PDFBOX-5880 > URL: https://issues.apache.org/jira/browse/PDFBOX-5880 > Project: PDFBox > Issue Type: Bug > Components: Parsing >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Joseph Jezerinac >Priority: Major > Labels: regression > Attachments: test.pdf > > > When rendering page one of the attached PDF the image does not render.
> In the logs, I see the following: > {noformat} > 2024-09-24 13:25:56:924 [main] WARN COSParser - The end of the stream doesn't > point to the correct offset, using workaround to read the stream, stream > start position: 196, length: 0, expected end position: 196 > 2024-09-24 13:25:56:930 [main] WARN PDFStreamEngine - Image stream is empty > java.io.IOException: Image stream is empty > at > org.apache.pdfbox.pdmodel.graphics.image.SampledImageReader.getRGBImage(SampledImageReader.java:182) > at > org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:477) > at > org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.getImage(PDImageXObject.java:438) > at > org.apache.pdfbox.rendering.PageDrawer.drawImage(PageDrawer.java:1107) > {noformat} > I assume this is a bad PDF, but Acrobat, Chrome, etc., display it without an > issue. > Here's the render code used: > {code:java} > File out = File.createTempFile("test-", ".png"); > PDDocument pdDocument = Loader.loadPDF(pdf); > final PDFRenderer pdfRenderer = new PDFRenderer(pdDocument); > ImageIO.write(pdfRenderer.renderImageWithDPI(0, 300), "png", out); > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
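
The overwrite described in the comment on this issue can be reproduced without PDFBox. What follows is only a minimal sketch: plain java.util.Map objects stand in for COSDictionary and COSStream, it assumes the new stream already carries a corrected length before the copy, and the class name, method names and the "skip Length" guard are hypothetical. It is not the fix that was committed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: plain maps play the role of COSDictionary / COSStream.
class LengthOverwriteSketch
{
    // Same pattern as in the comment: every entry is copied, so a stale
    // "Length" from the parsed dictionary replaces the value set before.
    static Map<String, Object> copyAll(Map<String, Object> parsed, long repairedLength)
    {
        Map<String, Object> stream = new LinkedHashMap<>();
        stream.put("Length", repairedLength);
        parsed.forEach(stream::put);
        return stream;
    }

    // One possible guard (an assumption, not the committed fix): skip "Length".
    static Map<String, Object> copyAllButLength(Map<String, Object> parsed, long repairedLength)
    {
        Map<String, Object> stream = new LinkedHashMap<>();
        stream.put("Length", repairedLength);
        parsed.forEach((key, value) ->
        {
            if (!"Length".equals(key))
            {
                stream.put(key, value);
            }
        });
        return stream;
    }

    public static void main(String[] args)
    {
        Map<String, Object> parsed = new LinkedHashMap<>();
        parsed.put("Type", "XObject");
        parsed.put("Length", 0L); // wrong length as found in the file
        System.out.println(copyAll(parsed, 695645L).get("Length"));
        System.out.println(copyAllButLength(parsed, 695645L).get("Length"));
    }
}
```

With the values above, the first call ends up with a Length of 0 again and the second keeps 695645, the corrected length quoted from the 1.8.16 warning.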
[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884528#comment-17884528 ]

Tilman Hausherr commented on PDFBOX-5852:
-----------------------------------------

No regressions 👍

> Hi CPU and memory usage when converting a PDF with type 4 shading
> -----------------------------------------------------------------
>
> Key: PDFBOX-5852
> URL: https://issues.apache.org/jira/browse/PDFBOX-5852
> Project: PDFBox
> Issue Type: Wish
> Components: Rendering
> Affects Versions: 2.0.28
> Reporter: Larry Lynn
> Assignee: Andreas Lehmkühler
> Priority: Major
> Fix For: 2.0.33, 3.0.3 PDFBox, 4.0.0
> Attachments: minimal.pdf
>
> We've observed excessive CPU and memory consumption when converting a PDF to images when the PDF contains type 4 shading. This is especially noticeable when the conversion is done with a high DPI. Can this be improved?
>
> Conversation from the PDFBox users mailing list follows
> Initial email:
> {quote}
> Hi CPU and memory usage when converting a PDF with type 4 shading
> Hello PDFBox users and maintainers,
> We have a PDF that causes performance problems when we use PDFBox to convert it to an image with renderImageWithDPI(). We're calling renderImageWithDPI() with 650 DPI. I realize this is a very high value - we're using it for high fidelity original images that will later be downsampled. On my work laptop which has fairly strong hardware, the conversion takes 25 minutes and consumes 20GB of memory. CPU and memory usage is reduced if we use a lower DPI.
> The PDF is 1 page long. It contains type 4 shading / Gouraud free form triangle meshes. We've been aware of some performance issues with type 4 shading for a little while now, but the PDFs that contained the type 4 shading belonged to our customers and we were not authorized to share them. We finally found a problem input document that is non-sensitive and that we are authorized to share. I've attached a copy of the problem PDF to this email.
> I searched the archives for the users and the developers mailing list and I didn't find anything specifically about this issue.
> I searched through the PDFBox jira tickets and I found a couple of tickets that looked similar: PDFBOX-2901 & PDFBOX-4491. PDFBOX-2901 seems to most closely describe what we're seeing, but that was closed in PDFBox 2.0.0, and our issue still reproduces with PDFBox 2.0.28.
> Should I refer this issue over to the developers mailing list or create a PDFBox Jira ticket for this?
> Thanks and Regards,
> Larry Lynn
> {quote}
> Response:
> {quote}
> Hi,
> Yes shading can be very slow, especially at high dpi. The attachment didn't get through, please upload to a sharehoster or create a ticket. If you need to register then add a meaningful text, e.g. the subject of this post so we know you're not a spammer. Also retry with 2.0.31 and 3.0.2 just to be sure. However I'm pessimistic that this can be fixed.
> Tilman
> {quote}
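
The report says CPU and memory usage drop at a lower DPI. A rough sketch of one reason, using arithmetic only: the number of pixels to produce grows with the square of the DPI. The page size (US Letter, 8.5 x 11 in), the 4 bytes per pixel, and the class and method names are assumptions made for illustration; none of them come from the report, and the figure covers the output raster only, not whatever the shading code allocates internally.

```java
// Hypothetical helper: size of the output raster for a page at a given DPI.
class RasterSizeSketch
{
    // Width and height in pixels both scale linearly with the DPI,
    // so the byte count scales with its square.
    static long rasterBytes(double widthInches, double heightInches, int dpi, int bytesPerPixel)
    {
        long widthPx = Math.round(widthInches * dpi);
        long heightPx = Math.round(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }

    public static void main(String[] args)
    {
        for (int dpi : new int[] { 72, 300, 650 })
        {
            System.out.printf("%d DPI: %d bytes%n", dpi, rasterBytes(8.5, 11.0, dpi, 4));
        }
    }
}
```

Under these assumptions the raster is 33,660,000 bytes at 300 DPI and 158,015,000 bytes at 650 DPI, about 4.7 times as much. That is still far below the 20GB quoted in the report, so the output image by itself would not account for it.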
[jira] [Comment Edited] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expe
[ https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884492#comment-17884492 ]

Tilman Hausherr edited comment on PDFBOX-5880 at 9/25/24 3:55 AM:
------------------------------------------------------------------

The PDF image stream has an (incorrect) length of 0. The workaround fails for some reason. Amusingly, this worked in 1.8.16, which displays the message "WARNUNG: /Length of COSObject\{1, 0} corrected from 0 to 695645".

was (Author: tilman):
The image has an (incorrect) length of 0. The workaround fails for some reason. Amusingly, this worked in 1.8.16, which displays the message "WARNUNG: /Length of COSObject\{1, 0} corrected from 0 to 695645".
[jira] [Commented] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected
[ https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884492#comment-17884492 ]

Tilman Hausherr commented on PDFBOX-5880:
-----------------------------------------

The image has an (incorrect) length of 0. The workaround fails for some reason. Amusingly, this worked in 1.8.16, which displays the message "WARNUNG: /Length of COSObject\{1, 0} corrected from 0 to 695645".
[jira] [Updated] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected en
[ https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5880:
    Labels: regression (was: )
[jira] [Updated] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected en
[ https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5880:
    Affects Version/s: 2.0.32
[jira] [Updated] (PDFBOX-5880) PDF render blank page: The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 196, length: 0, expected en
[ https://issues.apache.org/jira/browse/PDFBOX-5880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5880:
    Component/s: Parsing (was: Rendering)
[jira] [Commented] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882327#comment-17882327 ]

Tilman Hausherr commented on PDFBOX-5879:
-----------------------------------------

I added a simple test for the feature because it turns out we didn't have any. However, this isn't a test of the fixed bug; creating a file for that would have been more difficult, and there is no risk that this fix gets reverted anyway.

> Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
> ----------------------------------------------------------------------
>
> Key: PDFBOX-5879
> URL: https://issues.apache.org/jira/browse/PDFBOX-5879
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Reporter: Gábor Stefanik
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
> Attachments: MVM_Aram_augusztus.pdf
>
> {code:java}
> java -jar pdfbox-app-3.0.3.jar export:text -console -rotationMagic -i="MVM_Aram_augusztus.pdf"
> {code}
> fails with the following error:
> {code:java}
> java.lang.ClassCastException: class org.apache.pdfbox.cos.COSObject cannot be cast to class org.apache.pdfbox.cos.COSArray (org.apache.pdfbox.cos.COSObject and org.apache.pdfbox.cos.COSArray are in unnamed module of loader 'app')
>     at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:336)
>     at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:225)
>     at org.apache.pdfbox.tools.ExtractText.call(ExtractText.java:62)
>     at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
>     at picocli.CommandLine.access$1500(CommandLine.java:148)
>     at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
>     at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
>     at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
>     at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
>     at picocli.CommandLine.execute(CommandLine.java:2174)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:76)
> {code}
> The same command succeeds in 3.0.2.
> The triggering PDF can be downloaded from [https://nagykorosiallatmentok.hu/wp-content/uploads/2023/09/MVM_Aram_augusztus.pdf], and is also attached.
> The root cause appears to be this change: [https://github.com/apache/pdfbox/commit/b03d12d56dd74e5c52d80cf0b80c5bfb1f3209b2] from PDFBOX-5841
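
The ClassCastException in this report is the kind of failure an unchecked cast produces when an entry may be stored either directly or behind an indirect reference. A minimal sketch with hypothetical stand-in types follows; `Base`, `Ref`, `Arr`, `uncheckedCast` and `asArray` are invented for illustration, are not PDFBox classes or methods, and this is not the fix that was committed. The idea is to resolve the reference and test the type before casting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for an indirect reference and an array object.
class IndirectCastSketch
{
    interface Base { }

    static final class Arr implements Base
    {
        final List<Base> items = new ArrayList<>();
    }

    static final class Ref implements Base
    {
        private final Base target;

        Ref(Base target)
        {
            this.target = target;
        }

        Base resolve()
        {
            return target;
        }
    }

    // Fragile: assumes the entry is always stored directly as an array.
    static Arr uncheckedCast(Base entry)
    {
        return (Arr) entry;
    }

    // Safer: follow one level of reference first, then test the type.
    // Returns null when the resolved entry is not an array.
    static Arr asArray(Base entry)
    {
        Base resolved = entry instanceof Ref ? ((Ref) entry).resolve() : entry;
        return resolved instanceof Arr ? (Arr) resolved : null;
    }

    public static void main(String[] args)
    {
        Base indirect = new Ref(new Arr());
        System.out.println(asArray(indirect) != null);
        try
        {
            uncheckedCast(indirect);
        }
        catch (ClassCastException e)
        {
            System.out.println("unchecked cast failed: " + e.getClass().getSimpleName());
        }
    }
}
```

A direct array passes through both methods; only the indirect one separates them, which would match a failure that shows up for some PDFs and not others.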
[jira] [Comment Edited] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882327#comment-17882327 ]

Tilman Hausherr edited comment on PDFBOX-5879 at 9/17/24 9:08 AM:
------------------------------------------------------------------

I added a simple test for the rotationMagic feature because it turns out we didn't have any. However this isn't a test of the fixed bug, that would have been more difficult to create a file, and there is no risk that this fix gets reverted anyway.

was (Author: tilman):
I added a simple test for the feature because it turns out we didn't have any. However this isn't a test of the fixed bug, that would have been more difficult to create a file, and there is no risk that this fix gets reverted anyway.
[jira] [Resolved] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-5879.
------------------------------------
    Fix Version/s: 2.0.33
                   3.0.4 PDFBox
                   4.0.0
    Assignee: Tilman Hausherr
    Resolution: Fixed

Thank you. It's not the commit, it's poor programming that got exposed because of the commit.
[jira] [Updated] (PDFBOX-5879) Regression from PDFBOX-5841: Text extraction with rotation magic fails for PDF with multiple content streams in a page
[ https://issues.apache.org/jira/browse/PDFBOX-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5879:
    Affects Version/s: 2.0.32
[jira] [Commented] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882240#comment-17882240 ]

Tilman Hausherr commented on PDFBOX-5852:
-----------------------------------------

Wow! No regressions.
[jira] [Updated] (PDFBOX-5852) Hi CPU and memory usage when converting a PDF with type 4 shading
[ https://issues.apache.org/jira/browse/PDFBOX-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5852:
------------------------------------
Description:
We've observed excessive CPU and memory consumption when converting a PDF to images when the PDF contains type 4 shading. This is especially noticeable when the conversion is done with a high DPI. Can this be improved?

Conversation from the PDFBox users mailing list follows

Initial email:
{quote}
Hi CPU and memory usage when converting a PDF with type 4 shading

Hello PDFBox users and maintainers,

We have a PDF that causes performance problems when we use PDFBox to convert it to an image with renderImageWithDPI(). We're calling renderImageWithDPI() with 650 DPI. I realize this is a very high value - we're using it for high fidelity original images that will later be downsampled. On my work laptop which has fairly strong hardware, the conversion takes 25 minutes and consumes 20GB of memory. CPU and memory usage is reduced if we use a lower DPI.

The PDF is 1 page long. It contains type 4 shading / Gouraud free form triangle meshes.

We've been aware of some performance issues with type 4 shading for a little while now, but the PDFs that contained the type 4 shading belonged to our customers and we were not authorized to share them. We finally found a problem input document that is non-sensitive and that we are authorized to share. I've attached a copy of the problem PDF to this email.

I searched the archives for the users and the developers mailing list and I didn't find anything specifically about this issue. I searched through the PDFBox jira tickets and I found a couple of tickets that looked similar: PDFBOX-2901 & PDFBOX-4491. PDFBOX-2901 seems to most closely describe what we're seeing, but that was closed in PDFBox 2.0.0, and our issue still reproduces with PDFBox 2.0.28.

Should I refer this issue over to the developers mailing list or create a PDFBox Jira ticket for this?

Thanks and Regards,
Larry Lynn
{quote}

Response:
{quote}
Hi,

Yes shading can be very slow, especially at high dpi. The attachment didn't get through, please upload to a sharehoster or create a ticket. If you need to register then add a meaningful text, e.g. the subject of this post so we know you're not a spammer. Also retry with 2.0.31 and 3.0.2 just to be sure. However I'm pessimistic that this can be fixed.

Tilman
{quote}

> Hi CPU and memory usage when converting a PDF with type 4 shading
> --
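To put the 650 DPI figure in perspective, the size of the output raster alone can be estimated from the page dimensions. The following is a minimal sketch, not PDFBox code, and it does not model whatever the shading renderer allocates internally. It assumes a US Letter page (8.5 x 11 in, which the report does not state) and 4 bytes per pixel; `RasterEstimate` and its methods are hypothetical names used only for illustration.

```java
/** Rough estimate of the size of a rendered page raster; not part of PDFBox. */
final class RasterEstimate {
    /** Assumption: one packed ARGB int per pixel. */
    static final int BYTES_PER_PIXEL = 4;

    /** Pixels along one side of the page at the given resolution. */
    static long pixels(double inches, int dpi) {
        return Math.round(inches * dpi);
    }

    /** Bytes needed for the full output image at the given resolution. */
    static long rasterBytes(double widthIn, double heightIn, int dpi) {
        return pixels(widthIn, dpi) * pixels(heightIn, dpi) * BYTES_PER_PIXEL;
    }

    public static void main(String[] args) {
        // Assumption: US Letter page; the actual page size is not given in the report.
        double w = 8.5;
        double h = 11.0;
        for (int dpi : new int[] {72, 300, 650}) {
            System.out.printf("%d DPI: %d x %d px, %.1f MiB%n",
                    dpi, pixels(w, dpi), pixels(h, dpi),
                    rasterBytes(w, h, dpi) / (1024.0 * 1024.0));
        }
    }
}
```

Under these assumptions the pixel count grows with the square of the DPI: 650 DPI gives roughly 81 times as many pixels as 72 DPI, so any work done per output pixel grows by the same factor, which is consistent with the observation that a lower DPI reduces CPU and memory usage.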
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879832#comment-17879832 ]

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 10:35 AM:
------------------------------------------------------------------

Here's what worked:
{code:java}
for (PDField field : acroForm.getFieldTree()) {
    if (field instanceof PDTextField) {
        if (field instanceof PDVariableText) {
            for (PDAnnotationWidget widget : field.getWidgets()) {
                widget.setAppearance(null);
            }
        }
    }
}
acroForm.refreshAppearances();
{code}
[^PDFBox5878-flattened.pdf]

[^PDFBox5878-saved.pdf]

The only problem left is that the second multiline field starts a bit too low, but IIRC there's another issue about that.

was (Author: tilman):
Here's what worked:
{code:java}
for (PDField field : acroForm.getFieldTree()) {
    if (field instanceof PDTextField) {
        if (field instanceof PDVariableText) {
            for (PDAnnotationWidget widget : field.getWidgets()) {
                widget.setAppearance(null);
            }
        }
    }
}
acroForm.refreshAppearances();
{code}
[^PDFBox5878-flattened.pdf]

[^PDFBox5878-saved.pdf]

> pdf form field text gets blurred after flattening
> -------------------------------------------------
>
> Key: PDFBOX-5878
> URL: https://issues.apache.org/jira/browse/PDFBOX-5878
> Project: PDFBox
> Issue Type: Bug
> Components: AcroForm
> Affects Versions: 2.0.28, 3.0.3 PDFBox
> Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9
> Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93
> Reporter: Joseph Jezerinac
> Priority: Major
> Labels: Appearance
> Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png,
> PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf,
> flattened.pdf
>
>
> After flattening a pdf acro form, value of some fields get blurred
> {code:java}
> PDDocument pdDocument = Loader.loadPDF(inFile, "");
> pdDocument.setResourceCache(new DefaultResourceCache());
> try {
>     boolean save = false;
>     if (pdDocument.isEncrypted()) {
>         pdDocument.setAllSecurityToBeRemoved(true);
>         save = true;
>     }
>     final PDDocumentCatalog pdDocumentCatalog = pdDocument.getDocumentCatalog();
>     if (pdDocumentCatalog != null) {
>         final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm();
>         if (pdForm != null) {
>             pdForm.flatten();
>             save = true;
>         }
>     }
>     if (save) {
>         pdDocument.save(outFile);
>     }
> }
> catch (Exception e) {}
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879832#comment-17879832 ] Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 10:31 AM: -- Here's what worked: {code:java} for (PDField field: acroForm.getFieldTree()) { if (field instanceof PDTextField) { if (field instanceof PDVariableText) { for (PDAnnotationWidget widget : field.getWidgets()) { widget.setAppearance(null); } } } } acroForm.refreshAppearances(); {code} [^PDFBox5878-flattened.pdf] [^PDFBox5878-saved.pdf] was (Author: tilman): Here's what worked: {code:java} for (PDField field: acroForm.getFieldTree()) { if (field instanceof PDTextField) { if (field instanceof PDVariableText) { for (PDAnnotationWidget widget : field.getWidgets()) { widget.setAppearance(null); } } } acroForm.refreshAppearances(); } {code} [^PDFBox5878-flattened.pdf] [^PDFBox5878-saved.pdf] > pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, > flattened.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = 
pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5878: Attachment: PDFBox5878-flattened.pdf PDFBox5878-saved.pdf > pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > PDFBox5878-flattened.pdf, PDFBox5878-saved.pdf, beforeFlattening.pdf, > flattened.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879822#comment-17879822 ]

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 9:30 AM:
-----------------------------------------------------------------

I added this for the missing fonts; it's just a guess that these are the correct fonts:
{code:java}
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
acroForm.setNeedAppearances(false);
PDFont font1 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/times.ttf"), false);
PDFont font2 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/timesbd.ttf"), false);
PDFont font3 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/arial.ttf"), false);
acroForm.getDefaultResources().put(COSName.getPDFName("TimesNewRomanPSMT"), font1);
acroForm.getDefaultResources().put(COSName.getPDFName("TimesNewRomanPS-BoldMT"), font2);
acroForm.getDefaultResources().put(COSName.getPDFName("Helvetica"), font3);
for (PDField field : acroForm.getFieldTree()) {
    if (field instanceof PDTextField) {
        if (((PDTextField) field).isMultiline()) {
            field.setValue("XXX");
        }
    }
}
{code}
But when setting a value, this happens in AppearanceGeneratorHelper.setAppearanceContent():
{code}
if (bmcIndex == -1) {
    // append to existing stream
    writer.writeTokens(tokens);
    writer.writeTokens(COSName.TX, BMC);
}
{code}
So it appends to the existing appearance stream.
This is the result after calling setValue("XXX"): {code} q Q q 9.613575 0.4609071 430.9062 41.31819 re W n q 0.9781767 0 0 -0.9781767 -87.43936 478.0107 cm BT 11 0 0 -11 102.2182 458.5622 Tm /TT21 1 Tf [ (N) -0.2 (a) 0.2 (m) 0.2 (e) 0.2 ( c) 0.2 (ha) 0.2 (nge) 0.2 (d 09/) 0.2 (26/) 0.2 (2020) ] TJ ET Q Q q 6.43259 0.3084 434.0872 41.6232 re W n q 0.9853977 0 0 0.9853977 9.388783 29.51731 cm BT 11 0 0 11 0 0 Tm /TT18 1 Tf [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 (ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 (ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ ET Q q 0.9853977 0 0 0.9853977 9.388783 17.51355 cm BT 11 0 0 11 0 0 Tm /TT18 1 Tf [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 (i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) ] TJ ET Q Q q 3.228123 0.1547671 437.2917 41.93047 re W n q 0.992672 0 0 0.992672 6.206139 29.5793 cm BT 11 0 0 11 0 0 Tm /TT19 1 Tf [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 (ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 (ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ ET Q q 0.992672 0 0 0.992672 6.206139 17.48693 cm BT 11 0 0 11 0 0 Tm /TT19 1 Tf [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 (i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) 
] TJ ET Q Q q 0 0 440.5198 42.24 re W n /Cs6 cs 0 sc q 1 0 0 1 3 29.64175 cm BT 11 0 0 11 0 0 Tm /TT20 1 Tf [ (M) -0.2 (y na) 0.2 (m) 0.2 (e) 0.2 ( w) -0.2 (a) 0.2 (s) -0.2 ( c) 0.2 (ha) 0.2 (nge) 0.2 (d on 10/) 0.2 (14/) 0.2 (2017 a) 0.2 (t) 0.2 ( ) 18.1 (W) 111 (A) 55 ( D) -0.2 (O) -0.2 (L) 37.3 ( i) 0.2 (n F) -0.2 (e) 0.2 (de) 0.2 (ra) 0.2 (l) 0.2 ( ) 18.1 (W) 80.2 (a) 0.2 (y w) -0.2 (i) 0.2 (t) 0.2 (h proof of P) -0.2 (hi) 0.2 (l) 0.2 (i) 0.2 (ppi) 0.2 (ne) 0.2 ( ) ] TJ ET Q q 1 0 0 1 3 17.46011 cm BT 11 0 0 11 0 0 Tm /TT20 1 Tf [ (m) 0.2 (a) 0.2 (rri) 0.2 (a) 0.2 (ge) 0.2 ( c) 0.2 (e) 0.2 (rt) 0.2 (i) 0.2 (fi) 0.2 (c) 0.2 (a) 0.2 (t) 0.2 (e) 0.2 (.) ] TJ ET Q Q /Tx BMC q -2.252 1 441.7718 40.24 re W n BT /TimesNewRomanPSMT 11 Tf /DeviceGray cs 0 sc -1.252 25.4319 Td (\000;\000;\000;) Tj ET Q EMC {code} So the XXX is there, but also all the previous content. was (Author: tilman): I added this for the missing fonts, which is just a guess that it's the correct font {code:java} acroForm.setNeedAppearances(false); PDFont font1 = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/times.ttf"), false); PDFont font2
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879796#comment-17879796 ]

Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 8:00 AM:
-----------------------------------------------------------------

There are so many things wrong with this PDF that I don't see a specific solution. I'm doing this just for fun. I was able to fix some of the fields (e.g. Last1) but not yet all (e.g. the multiline fields and some others), for some unknown reason. (I added the missing fonts to the default resources.)

Not all appearances are redrawn. Either there's a bug in my code or there is something in our code that skips the recreation of the appearances and I forgot about it. It's not even recreated when changing the value to something else?!

was (Author: tilman):
There are so many things wrong with this PDF that I don't see a specific solution. I'm doing this just for fun. I was able to fix some of the fields (e.g. Last1) but not yet all (e.g. the multiline fields and some others), for some unknown reason. (I added the missing fonts to the default resources.)

Not all appearances are redrawn. Either there's a bug in my code or there is something in our code that skips the recreation of the appearances and I forgot about it.
> pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > beforeFlattening.pdf, flattened.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879753#comment-17879753 ] Tilman Hausherr edited comment on PDFBOX-5878 at 9/6/24 4:04 AM: - I could try to getValue() and setValue() on the text fields and see whether it looks better when PDFBox recreates the appearances. These fields have a value that makes sense. I'm just wondering whether this person will have legal disadvantages if the file is refused? (Although I doubt that the content of field {{Root/Pages/Kids/[0]/Annots/[7]/V}} will work for the petitioner). OTOH it's from 22.2 so it may already have been decided in some way. was (Author: tilman): I could try to getValue() and setValue() on the text fields and see whether it looks better when PDFBox recreates the appearances. These fields have a value that makes sense. I'm just wondering whether this person will have legal disadvantages if the file is refused? (Although I doubt that the content of field {{Root/Pages/Kids/[0]/Annots/[7]/V}} will work for the petitioner). OTOH it's from 22.2 so it may already have been processed. 
> pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > beforeFlattening.pdf, flattened.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (PDFBOX-5878) pdf form field text gets blurred after flattening
[ https://issues.apache.org/jira/browse/PDFBOX-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879480#comment-17879480 ] Tilman Hausherr edited comment on PDFBOX-5878 at 9/5/24 8:16 AM: - {code} q Q q 9.469598 0.4248199 206.7517 18.55036 re W n q 0.9562042 0 0 -0.9562042 -55.6218 672.8725 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT21 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q q 6.360067 0.2853218 209.8612 18.82936 re W n q 0.9705854 0 0 -0.9705854 -59.7103 682.8466 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT18 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q q 3.203769 0.1437257 213.0175 19.11255 re W n q 0.9851829 0 0 -0.9851829 -63.86029 692.9707 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT19 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q q 0 0 216.2213 19.4 re W n /Cs6 cs 0 sc q 1 0 0 -1 -68.0727 703.247 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT20 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q {code} The text appears 3 times at slightly different positions in this appearance stream. 
was (Author: tilman): {code} q Q q 9.469598 0.4248199 206.7517 18.55036 re W n q 0.9562042 0 0 -0.9562042 -55.6218 672.8725 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT21 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q q 6.360067 0.2853218 209.8612 18.82936 re W n q 0.9705854 0 0 -0.9705854 -59.7103 682.8466 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT18 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q q 3.203769 0.1437257 213.0175 19.11255 re W n q 0.9851829 0 0 -0.9851829 -63.86029 692.9707 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT19 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q q 0 0 216.2213 19.4 re W n /Cs6 cs 0 sc q 1 0 0 -1 -68.0727 703.247 cm BT 11 0 0 -11 71.0727 696.6206 Tm /TT20 1 Tf [ (E) 0.2 (l) 0.2 (ys) -0.2 (i) 0.2 (a) 0.2 ( J) -0.2 (oy ) 55.2 (A) -0.2 (nc) 0.2 (he) 0.2 (t) 0.2 (a) 0.2 ( ) 18.1 (T) 70 (orl) 0.2 (a) 0.2 (o) ] TJ ET Q Q {code} The text appears 3 times. 
> pdf form field text gets blurred after flattening > - > > Key: PDFBOX-5878 > URL: https://issues.apache.org/jira/browse/PDFBOX-5878 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 2.0.28, 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Labels: Appearance > Attachments: Bildschirmfoto vom 2024-09-05 10-07-13.png, > beforeFlattening.pdf, flattened.pdf > > > After flattening a pdf acro form, value of some fields get blurred > {code:java} > PDDocument pdDocument = Loader.loadPDF(inFile, ""); > pdDocument.setResourceCache(new DefaultResourceCache()); > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(outFile); > } > } > catch (Exception e) {} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
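The observation that the same text is drawn several times at slightly different positions can be checked mechanically by counting the text objects (`BT` operators) in an appearance stream. The following is a minimal sketch under stated assumptions: it is plain Java rather than PDFBox code, `TextObjectCounter` is a hypothetical name, the sample stream in `main` is invented for illustration and is not the ticket's stream, and the scanner only skips literal strings in parentheses (comments, hex strings and inline images are ignored).

```java
/** Counts text objects (BT operators) in a PDF content stream; illustrative only. */
final class TextObjectCounter {

    static int countTextObjects(String content) {
        int count = 0;
        int depth = 0; // nesting depth of literal strings "( ... )"
        StringBuilder token = new StringBuilder();
        for (int i = 0; i <= content.length(); i++) {
            // A trailing space acts as a sentinel so the last token is flushed.
            char c = i < content.length() ? content.charAt(i) : ' ';
            if (depth > 0) {
                if (c == '\\') {
                    i++; // skip the escaped character inside a string
                } else if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                }
            } else if (c == '(' || c == '[' || c == ']' || Character.isWhitespace(c)) {
                // Delimiter outside a string: the pending token is complete.
                if (token.toString().equals("BT")) {
                    count++;
                }
                token.setLength(0);
                if (c == '(') {
                    depth = 1; // a literal string starts here
                }
            } else {
                token.append(c);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical stream: two text objects, plus "BT" inside strings that must not count.
        String sample = "q BT /F1 11 Tf (BT inside) Tj ET Q "
                + "q BT /F1 11 Tf [ (a) 0.2 (b\\) BT) ] TJ ET Q";
        System.out.println(countTextObjects(sample)); // prints 2
    }
}
```

A single-line text field would normally be expected to contain one text object, so a higher count is a hint that the appearance stream holds stacked copies of the value.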
[jira] [Reopened] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr reopened PDFBOX-5876:
-------------------------------------

> This jpeg2000 takes up a lot of memory, causing overflow.
> ---------------------------------------------------------
>
> Key: PDFBOX-5876
> URL: https://issues.apache.org/jira/browse/PDFBOX-5876
> Project: PDFBox
> Issue Type: Bug
> Components: Rendering
> Affects Versions: 2.0.32, 3.0.2 PDFBox
> Reporter: liu
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: jpeg2000.pdf
>
>
> pdf:[^jpeg2000.pdf]
> JVM:-Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> }
> {code}
[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878964#comment-17878964 ] Tilman Hausherr commented on PDFBOX-5877: - Yeah!! There's a log message, so it means you also disabled or disregarded logs :-( > After flattening a form pdf, the pdf loses content > -- > > Key: PDFBOX-5877 > URL: https://issues.apache.org/jira/browse/PDFBOX-5877 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Attachments: beforeFalttening.pdf, flattenedPdf.pdf > > > After flattening the pdf form content changes. Pls take a look at before and > after pdf. The flattening works fine in 2.0.31. After upgrading to 3.0.3 we > started getting many issues with pdf forms after flattening. > The code that used for flattening is as follows > {code} > PDDocument pdDocument = Loader.loadPDF(file, “”); > pdDocument.setResourceCache(new PdfResourceCache()) > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(file); > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878961#comment-17878961 ] Tilman Hausherr edited comment on PDFBOX-5877 at 9/3/24 5:55 PM: - What's this? {code} pdDocument.setResourceCache(new PdfResourceCache()) {code} We have no class {{PdfResourceCache}}. was (Author: tilman): What's this? pdDocument.setResourceCache(new PdfResourceCache()) > After flattening a form pdf, the pdf loses content > -- > > Key: PDFBOX-5877 > URL: https://issues.apache.org/jira/browse/PDFBOX-5877 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Attachments: beforeFalttening.pdf, flattenedPdf.pdf > > > After flattening the pdf form content changes. Pls take a look at before and > after pdf. The flattening works fine in 2.0.31. After upgrading to 3.0.3 we > started getting many issues with pdf forms after flattening. > The code that used for flattening is as follows > {code} > PDDocument pdDocument = Loader.loadPDF(file, “”); > pdDocument.setResourceCache(new PdfResourceCache()) > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(file); > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878961#comment-17878961 ] Tilman Hausherr commented on PDFBOX-5877: - What's this? pdDocument.setResourceCache(new PdfResourceCache()) > After flattening a form pdf, the pdf loses content > -- > > Key: PDFBOX-5877 > URL: https://issues.apache.org/jira/browse/PDFBOX-5877 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Attachments: beforeFalttening.pdf, flattenedPdf.pdf > > > After flattening the pdf form content changes. Pls take a look at before and > after pdf. The flattening works fine in 2.0.31. After upgrading to 3.0.3 we > started getting many issues with pdf forms after flattening. > The code that used for flattening is as follows > {code} > PDDocument pdDocument = Loader.loadPDF(file, “”); > pdDocument.setResourceCache(new PdfResourceCache()) > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(file); > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5877) After flattening a form pdf, the pdf loses content
[ https://issues.apache.org/jira/browse/PDFBOX-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878960#comment-17878960 ] Tilman Hausherr commented on PDFBOX-5877: - Are you sure you used 3.0.3 and not 3.0.2 ? I just tried with the trunk and 3.0.4-SNAPSHOT with our test and I got only invisible differences (yours are clearly visible and are because all fonts are lost in the PDF) > After flattening a form pdf, the pdf loses content > -- > > Key: PDFBOX-5877 > URL: https://issues.apache.org/jira/browse/PDFBOX-5877 > Project: PDFBox > Issue Type: Bug > Components: AcroForm >Affects Versions: 3.0.3 PDFBox > Environment: Mac Ventura, java 18 PDFBox 3.0.3, Tomcat 9 > Linux; version: 5.15.0-105-generic, java 17, Tomcat 9.0.93 >Reporter: Joseph Jezerinac >Priority: Major > Attachments: beforeFalttening.pdf, flattenedPdf.pdf > > > After flattening the pdf form content changes. Pls take a look at before and > after pdf. The flattening works fine in 2.0.31. After upgrading to 3.0.3 we > started getting many issues with pdf forms after flattening. > The code that used for flattening is as follows > {code} > PDDocument pdDocument = Loader.loadPDF(file, “”); > pdDocument.setResourceCache(new PdfResourceCache()) > try { > boolean save = false; > if (pdDocument.isEncrypted()) { > pdDocument.setAllSecurityToBeRemoved(true); > save = true; > } > final PDDocumentCatalog pdDocumentCatalog = > pdDocument.getDocumentCatalog(); > if (pdDocumentCatalog != null) { > final PDAcroForm pdForm = pdDocumentCatalog.getAcroForm(); > if (pdForm != null) { > pdForm.flatten(); > save = true; > } > } > if (save) { > pdDocument.save(file); > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878879#comment-17878879 ]

Tilman Hausherr commented on PDFBOX-5876:
-----------------------------------------

No... I used -Xmx4G for a production project.

> This jpeg2000 takes up a lot of memory, causing overflow.
> ---------------------------------------------------------
>
>                 Key: PDFBOX-5876
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5876
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering
>    Affects Versions: 2.0.32, 3.0.2 PDFBox
>            Reporter: liu
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>         Attachments: jpeg2000.pdf
>
> pdf: [^jpeg2000.pdf]
> JVM: -Xmx600m
> {code:java}
> // code placeholder
> public static void main(String[] args) throws IOException, InterruptedException {
>     File file = new File("C:\\Users\\LYCIT\\Downloads\\jpeg2000.pdf");
>     PDDocument pdf = Loader.loadPDF(file, IOUtils.createTempFileOnlyStreamCache());
>     PDFRenderer renderer = new PDFRenderer(pdf);
>     int numPages = 0;
>     renderer.setSubsamplingAllowed(true);
>     BufferedImage bi = renderer.renderImage(numPages, 0.5f);
>     pdf.close();
> }
> {code}
[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878846#comment-17878846 ]

Tilman Hausherr commented on PDFBOX-5876:
-----------------------------------------

Are you sure you are using the new version? You have to build yourself or wait until a new snapshot build is available. Instead of using PDFDebugger now I just tried your code as it is with a locally built 3.0.4-SNAPSHOT and it did work with -Xmx600m. (Also with 550, but not with 500)

> This jpeg2000 takes up a lot of memory, causing overflow.
[jira] [Resolved] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-5876.
-------------------------------------
    Fix Version/s: 2.0.33
                   3.0.4 PDFBox
                   4.0.0
         Assignee: Tilman Hausherr
       Resolution: Fixed

> This jpeg2000 takes up a lot of memory, causing overflow.
[jira] [Updated] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5876:
------------------------------------
    Affects Version/s: 2.0.32

> This jpeg2000 takes up a lot of memory, causing overflow.
[jira] [Updated] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5876:
------------------------------------
    Component/s: Rendering

> This jpeg2000 takes up a lot of memory, causing overflow.
[jira] [Commented] (PDFBOX-5876) This jpeg2000 takes up a lot of memory, causing overflow.
[ https://issues.apache.org/jira/browse/PDFBOX-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878835#comment-17878835 ]

Tilman Hausherr commented on PDFBOX-5876:
-----------------------------------------

The JPX image in that file is 7020 x 4964, which is quite big, and -Xmx600m is quite low. But I noticed that the subsampling parameter wasn't used when reading the JPX image the second time, which was the cause of the OOM. (JPX images have to be read twice because of some weirdness in the specification.) It should work now. I tried it with PDFDebugger, which doesn't allow setting a temp cache.

> This jpeg2000 takes up a lot of memory, causing overflow.
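To put the numbers in the comment above into perspective, here is a rough back-of-the-envelope sketch of how much heap one decoded copy of a 7020 x 4964 raster needs, with and without subsampling. It is not PDFBox code: the class name, the 4 bytes per pixel figure and the subsampling factor of 2 are assumptions chosen for illustration only, and a real decoder needs additional working memory on top.

```java
// Rough heap estimate for one decoded raster, with and without subsampling.
// Assumption: 4 bytes per pixel (e.g. a packed ARGB int); real decoders may differ.
final class ImageMemoryEstimate {

    /** Bytes needed for a width x height raster read with the given subsampling factor. */
    static long bytesFor(int width, int height, int subsampling, int bytesPerPixel) {
        // A subsampled read keeps every n-th pixel in each direction (rounded up).
        long w = (width + subsampling - 1) / subsampling;
        long h = (height + subsampling - 1) / subsampling;
        return w * h * bytesPerPixel;
    }

    /** Converts a byte count to mebibytes. */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long full = bytesFor(7020, 4964, 1, 4); // image size quoted in the comment
        long half = bytesFor(7020, 4964, 2, 4); // hypothetical subsampling factor of 2
        System.out.printf("full size:     %d bytes (%.1f MiB)%n", full, toMiB(full));
        System.out.printf("subsampling 2: %d bytes (%.1f MiB)%n", half, toMiB(half));
    }
}
```

Under these assumptions a full-size copy is about 133 MiB and a copy subsampled by 2 is about 33 MiB, so a single read that ignores subsampling takes up a large share of a 600 MB heap.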
[jira] [Updated] (PDFBOX-5875) using font data to process ligatures
[ https://issues.apache.org/jira/browse/PDFBOX-5875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5875:
------------------------------------
    Fix Version/s:     (was: 3.0.4 PDFBox)

> using font data to process ligatures
> ------------------------------------
>
>                 Key: PDFBOX-5875
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5875
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing, PDModel, Text extraction
>    Affects Versions: 3.0.3 PDFBox
>            Reporter: Manish S N
>            Priority: Major
>              Labels: Asian, CIDFont, font, ligatures, unicodemapping
>         Attachments: page.pdf
>
> To process ligatures from Asian languages (where a glyph is the combination
> of two Unicode characters) using the data in embedded fonts.
>
> *The problem:*
> Modern PDF creators currently put these ligatures in the /ActualText field,
> which we only recently considered supporting in this issue. But this is not
> the case in old PDFs with embedded CID fonts like [^page.pdf], where the
> ligature glyphs lack a /ToUnicode character mapping because there is no
> single Unicode code point for them: each is a combination of more than one
> Unicode character.
>
> *The Potential Solution (if not perfect):*
> I managed to extract the font files using pdfbox
> ([code|https://gist.githubusercontent.com/incubated-geek-cc/640a74920b184274374af257cd1587bb/raw/c6fb02fa82f9883670d96b812bfe7f2f55b18125/Main.java])
> and when I viewed the font files using FontForge I found the ligature data
> intact in them. So we can use this data to map the glyphs that are ligatures
> to the Unicode values of their constituent glyphs.
>
> *Problems:*
> In some cases the constituent glyphs may not be present in the cmap at all,
> having been removed by the PDF optimiser because they are never used
> directly in the PDF apart from in ligatures. Such glyphs are empty, with
> only a glyph ID and no /ToUnicode mapping, even if that particular glyph has
> a corresponding Unicode character.
>
> *The Hope:*
> This is not a common problem in large PDFs, and basic spell checkers could
> easily rectify it. Some comprehension is better than no comprehension when
> it comes to dealing with data. This will greatly enhance the parsing of
> non-Latin Asian languages.
>
> (The PDF sample I attached is in Tamil.)
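The mapping idea in the proposal above can be sketched in plain Java. This is only an illustration of the lookup logic, not PDFBox or FontBox code: the class, its methods and the glyph IDs are hypothetical, and the two maps stand in for data that would have to be read from the /ToUnicode CMap and from the embedded font's ligature tables.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: resolve a ligature glyph to text via the Unicode values of its constituents.
final class LigatureUnicodeSketch {

    // glyph ID -> Unicode text, standing in for a /ToUnicode CMap (hypothetical IDs)
    private final Map<Integer, String> toUnicode = new HashMap<>();
    // ligature glyph ID -> constituent glyph IDs, standing in for the font's ligature data
    private final Map<Integer, int[]> ligatures = new HashMap<>();

    void addUnicode(int glyphId, String text) {
        toUnicode.put(glyphId, text);
    }

    void addLigature(int ligatureGlyphId, int... componentGlyphIds) {
        ligatures.put(ligatureGlyphId, componentGlyphIds);
    }

    /** Returns the text for a glyph, or null when it cannot be resolved. */
    String resolve(int glyphId) {
        String direct = toUnicode.get(glyphId);
        if (direct != null) {
            return direct; // an ordinary mapping wins
        }
        int[] components = ligatures.get(glyphId);
        if (components == null) {
            return null; // neither mapped nor a known ligature
        }
        StringBuilder text = new StringBuilder();
        for (int component : components) {
            String part = toUnicode.get(component);
            if (part == null) {
                return null; // the "Problems" case: a constituent has no mapping
            }
            text.append(part);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        LigatureUnicodeSketch font = new LigatureUnicodeSketch();
        font.addUnicode(10, "\u0B95"); // TAMIL LETTER KA
        font.addUnicode(11, "\u0BCD"); // TAMIL SIGN VIRAMA
        font.addUnicode(12, "\u0BB7"); // TAMIL LETTER SSA
        font.addLigature(300, 10, 11, 12); // ligature built from the three glyphs above
        font.addLigature(301, 10, 99);     // glyph 99 has no Unicode mapping
        System.out.println("glyph 300 resolved: " + (font.resolve(300) != null));
        System.out.println("glyph 301 resolved: " + (font.resolve(301) != null));
    }
}
```

Returning null for a ligature with an unmapped constituent keeps the failure visible to the caller, which could then fall back to whatever the existing extraction produces for that glyph.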
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878089#comment-17878089 ]

Tilman Hausherr commented on PDFBOX-5868:
-----------------------------------------

Yes. But consider that Adobe didn't do it and they're smarter than us; I just tried copy / paste and save as text. The ligature data in fonts is meant to be used when creating PDFs, and I don't know if it would work in extraction.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly
> but adobe reader's save as text does
> ---------------------------------------------------------------------------
>
>                 Key: PDFBOX-5868
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5868
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.32, 3.0.3 PDFBox
>         Environment: Ubuntu 22.04.4 LTS x86_64
>            Reporter: Manish S N
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: ActualText
>             Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>         Attachments: EmptyActualText_poppler.txt,
> EmptyActualText_reduced_poppler.txt, Main.java,
> PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf,
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf,
> PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf,
> Tilman's_solution_out.txt, adobe_out.txt,
> content_diffs_with_exceptions-ActualText.xlsx,
> image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt,
> page.pdf, pdfbox_out.txt, poppler_out.txt, screenshot-1.png,
> screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
> I downloaded the latest executable jar of pdfbox (3.0.3) for testing and used
> the export:text command line tool to obtain the results.
> * multilingual_test.pdf is the original PDF I made to test multilingual
> text extraction.
> * pdfbox_out.txt is the text file produced by pdfbox.
> * adobe_out.txt is the text file created by Adobe Reader's "save as text"
> feature.
>
> Observation:
> As you can see in the attachments, the text file obtained from pdfbox shows
> weird Unicode characters for Tamil and Bengali (for Hindi the characters are
> extracted but not overlapped; Japanese seems fine to me). In contrast, the
> text file obtained from Adobe Reader's "save as text" feature seems fine, and
> copy-pasting the text from my document viewer (evince) also works.
> Questions:
> # Why are the outputs from pdfbox and Adobe different?
> # What can I do to extract the text from a multilingual PDF correctly?
> # Is there a way to apply pattern matching to text in a PDF file and declare
> matches without extracting the text first? (say, if the problem is with fonts
> and glyphs)
> ----
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using
> Apache Tika for parsing documents. I noticed a problem with extracted PDF
> text (other file types parse fine). I used the executable pdfbox jar to
> conclude that the _problem is in pdfbox and not in tika._ I tested with Adobe
> Reader's text extraction to confirm the problem is not with the PDF. I want
> to extract this multilingual text only to run pattern matching on it; I do
> not need to display the content, only to know whether the pattern is present
> or not (say, if the problem is with fonts and glyphs).
[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878076#comment-17878076 ]

Tilman Hausherr edited comment on PDFBOX-5868 at 8/30/24 11:50 AM:
-------------------------------------------------------------------

Please create a new ticket for the file you just added because this is a different problem (only if you manage to extract this properly from Adobe Reader).

was (Author: tilman):
Please create a new ticket for the file you just added because this is a different problem.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly
> but adobe reader's save as text does
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878076#comment-17878076 ]

Tilman Hausherr commented on PDFBOX-5868:
-----------------------------------------

Please create a new ticket for the file you just added because this is a different problem.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly
> but adobe reader's save as text does
[jira] [Resolved] (PDFBOX-5874) Change Loglevel from Warn to info when rebuilding font cache
[ https://issues.apache.org/jira/browse/PDFBOX-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-5874.
-------------------------------------
      Assignee: Tilman Hausherr
    Resolution: Fixed

Thank you, you're right, there's no need to warn about something that harmless.

> Change Loglevel from Warn to info when rebuilding font cache
> ------------------------------------------------------------
>
>                 Key: PDFBOX-5874
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5874
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>    Affects Versions: 2.0.32, 3.0.3 PDFBox
>            Reporter: Thomas Hoffmann
>            Assignee: Tilman Hausherr
>            Priority: Minor
>             Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> We have a monitoring system for our logfiles, and some people get notified
> whenever there is an error or a warning in the logfiles.
> Due to OS updates, the fonts might be updated or changed. This triggers a
> rebuild process within PDFBox. Unfortunately, the log level is set to Warning
> and this triggers an alarm.
> The warnings occur in:
> org/apache/pdfbox/pdmodel/font/FileSystemFontProvider.java
> The logfile shows the following three entries:
> 2024-08-19T18:25:03.653+02:00 WARN FileSystemFontProvider: New fonts found, font cache will be re-built
> 2024-08-19T18:25:03.654+02:00 WARN FileSystemFontProvider: Building on-disk font cache, this may take a while
> 2024-08-19T18:25:04.105+02:00 WARN FileSystemFontProvider: Finished building on-disk font cache, found 96 fonts
>
> Imho the message is informational and not necessarily a warning. It just
> tells me that the cache is getting rebuilt.
> It would be great if you could consider setting these messages to info level.
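For releases that still log these messages at warning level, one possible stopgap is to raise the threshold for that single logger in the application's own logging setup. The sketch below is an assumption-laden illustration, not something taken from the ticket: it presumes that PDFBox's logging facade ends up in java.util.logging in the deployment at hand and that the logger is named after the class mentioned in the report. The class QuietFontCacheLogging is hypothetical. With another backend (Log4j, Logback) the equivalent setting would go into that backend's configuration instead.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch: hide WARNING-level messages from one logger while keeping errors visible.
final class QuietFontCacheLogging {

    // Assumption: the logger is named after the class given in the report.
    static final String FONT_PROVIDER_LOGGER =
            "org.apache.pdfbox.pdmodel.font.FileSystemFontProvider";

    // Keep a strong reference: java.util.logging holds loggers only weakly,
    // so a level set on an unreferenced logger can be lost after garbage collection.
    private static final Logger FONT_PROVIDER = Logger.getLogger(FONT_PROVIDER_LOGGER);

    /** Raises the threshold so that only SEVERE messages from this logger get through. */
    static void silenceWarnings() {
        FONT_PROVIDER.setLevel(Level.SEVERE);
    }

    public static void main(String[] args) {
        silenceWarnings();
        System.out.println("WARNING loggable: " + FONT_PROVIDER.isLoggable(Level.WARNING));
        System.out.println("SEVERE loggable:  " + FONT_PROVIDER.isLoggable(Level.SEVERE));
    }
}
```

The drawback is that any genuine warning from the same class is hidden as well, which is why changing the level at the source, as done in this ticket, is the cleaner fix.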
[jira] [Updated] (PDFBOX-5874) Change Loglevel from Warn to info when rebuilding font cache
[ https://issues.apache.org/jira/browse/PDFBOX-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5874:
------------------------------------
    Fix Version/s: 2.0.33
                   3.0.4 PDFBox
                   4.0.0

> Change Loglevel from Warn to info when rebuilding font cache
[jira] [Updated] (PDFBOX-5874) Change Loglevel from Warn to info when rebuilding font cache
[ https://issues.apache.org/jira/browse/PDFBOX-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5874:
------------------------------------
    Affects Version/s: 2.0.32

> Change Loglevel from Warn to info when rebuilding font cache
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876692#comment-17876692 ]

Tilman Hausherr commented on PDFBOX-5868:
-----------------------------------------

In the files I saw, /ActualText was often used for only a part of the text (although I see that one of the files I attached uses it for all of it). Using only /ActualText and disregarding the old text extraction was never in my thoughts. That's why a switch would mean we either have the improvement of this ticket, or work as before.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly
> but adobe reader's save as text does
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876660#comment-17876660 ]

Tilman Hausherr commented on PDFBOX-5868:
-----------------------------------------

I haven't resolved this ticket because of one question I've been asking myself and now ask the users here: should I add a getter/setter that makes this ActualText handling optional? It should be active by default because I believe that it is useful in most cases. Possible names: ConsiderActualText / ActivateActualText / IncludeActualText / whatever.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly
> but adobe reader's save as text does
[jira] [Comment Edited] (PDFBOX-5657) SMaskInData not supported for JPX images
[ https://issues.apache.org/jira/browse/PDFBOX-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876632#comment-17876632 ] Tilman Hausherr edited comment on PDFBOX-5657 at 8/26/24 8:53 AM: -- This related issue https://github.com/mozilla/pdf.js/issues/11306 won't look better because there's an exception in the JPEG2000 decoder, see https://github.com/jai-imageio/jai-imageio-jpeg2000/issues/9 was (Author: tilman): This related issue https://github.com/mozilla/pdf.js/issues/11306 won't look better because there's an exception in the JPEG2000 decoder. > SMaskInData not supported for JPX images > > > Key: PDFBOX-5657 > URL: https://issues.apache.org/jira/browse/PDFBOX-5657 > Project: PDFBox > Issue Type: Bug > Components: Rendering >Affects Versions: 2.0.29, 3.0.0 PDFBox, 4.0.0 > Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Labels: JPEG2000, JPXDecode, JPXFilter > Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0 > > Attachments: PDFJS-16782-SMaskInData.pdf > > > JPX images can have transparency information and not only we don't support > that, but the images look broken. > For now, lets just return the opaque image until there's a good idea what to > do. Maybe we have to return the mask in the DecodeResult. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Resolved] (PDFBOX-5657) SMaskInData not supported for JPX images
[ https://issues.apache.org/jira/browse/PDFBOX-5657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved PDFBOX-5657. - Fix Version/s: 2.0.33 3.0.4 PDFBox 4.0.0 Assignee: Tilman Hausherr Resolution: Fixed This related issue https://github.com/mozilla/pdf.js/issues/11306 won't look better because there's an exception in the JPEG2000 decoder. > SMaskInData not supported for JPX images > > > Key: PDFBOX-5657 > URL: https://issues.apache.org/jira/browse/PDFBOX-5657 > Project: PDFBox > Issue Type: Bug > Components: Rendering >Affects Versions: 2.0.29, 3.0.0 PDFBox, 4.0.0 >Reporter: Tilman Hausherr >Assignee: Tilman Hausherr >Priority: Major > Labels: JPEG2000, JPXDecode, JPXFilter > Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0 > > Attachments: PDFJS-16782-SMaskInData.pdf > > > JPX images can have transparency information and not only we don't support > that, but the images look broken. > For now, lets just return the opaque image until there's a good idea what to > do. Maybe we have to return the mask in the DecodeResult. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5872) Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding
[ https://issues.apache.org/jira/browse/PDFBOX-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5872: Affects Version/s: 2.0.32 > Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding > > > Key: PDFBOX-5872 > URL: https://issues.apache.org/jira/browse/PDFBOX-5872 > Project: PDFBox > Issue Type: Improvement > Components: Rendering >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Gábor Stefanik >Priority: Major > > [https://github.com/dbmdz/imageio-jnr] / > [https://mvnrepository.com/artifact/de.digitalcollections.imageio/imageio-openjpeg] > is an alternative JPEG2000 implementation for Java ImageIO that uses the > native OpenJPEG library as its backend. > Unfortunately, it doesn't work out of the box because it doesn't implement > raster reading (canReadRaster not overridden, returns false), and PDFBox uses > canReadRaster() to validate image reader instances. However, it doesn't > appear that there is any real reliance on raster support in PDFBox (at least > in version 3) - if I patch the library to lie about raster support, it seems > to work perfectly. > A further complication arises when the OpenJPEG native library cannot be > found: imageio-openjpeg returns null as the reader instance, which causes PDF > rendering to fail with an NPE, even if another JPEG2000 reader is available. > This can be remedied with a simple null check. > [https://github.com/apache/pdfbox/pull/197] shows a possible solution. Until > then, [https://github.com/Googulator/imageio-jnr] can be used with PDFBox > 3.0.3 as a workaround, so long as the native library is correctly installed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Resolved] (PDFBOX-5872) Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding
[ https://issues.apache.org/jira/browse/PDFBOX-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved PDFBOX-5872. - Fix Version/s: 2.0.33 3.0.4 PDFBox 4.0.0 Assignee: Tilman Hausherr Resolution: Fixed Done, thanks! > Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding > > > Key: PDFBOX-5872 > URL: https://issues.apache.org/jira/browse/PDFBOX-5872 > Project: PDFBox > Issue Type: Improvement > Components: Rendering >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Gábor Stefanik >Assignee: Tilman Hausherr >Priority: Major > Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0 > > > [https://github.com/dbmdz/imageio-jnr] / > [https://mvnrepository.com/artifact/de.digitalcollections.imageio/imageio-openjpeg] > is an alternative JPEG2000 implementation for Java ImageIO that uses the > native OpenJPEG library as its backend. > Unfortunately, it doesn't work out of the box because it doesn't implement > raster reading (canReadRaster not overridden, returns false), and PDFBox uses > canReadRaster() to validate image reader instances. However, it doesn't > appear that there is any real reliance on raster support in PDFBox (at least > in version 3) - if I patch the library to lie about raster support, it seems > to work perfectly. > A further complication arises when the OpenJPEG native library cannot be > found: imageio-openjpeg returns null as the reader instance, which causes PDF > rendering to fail with an NPE, even if another JPEG2000 reader is available. > This can be remedied with a simple null check. > [https://github.com/apache/pdfbox/pull/197] shows a possible solution. Until > then, [https://github.com/Googulator/imageio-jnr] can be used with PDFBox > 3.0.3 as a workaround, so long as the native library is correctly installed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
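The two obstacles described in this issue (a reader that reports no raster support, and a plugin that hands back a null reader instance) can be illustrated with a small lookup loop. This is only a sketch using the standard javax.imageio API; the class name JpxReaderLookup, the method findReader and the requireRaster parameter are made up for illustration, and this is not the code from PDFBox or from the linked pull request.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

// Hypothetical helper, not part of PDFBox.
final class JpxReaderLookup {

    /**
     * Returns the first usable ImageReader registered for formatName, or null
     * if there is none. When requireRaster is true, readers whose
     * canReadRaster() returns false are skipped.
     */
    static ImageReader findReader(String formatName, boolean requireRaster) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext()) {
            ImageReader reader = readers.next();
            if (reader == null) {
                // A plugin may return null, e.g. when its native library is missing;
                // keep looking instead of failing with an NPE later.
                continue;
            }
            if (requireRaster && !reader.canReadRaster()) {
                reader.dispose();
                continue;
            }
            return reader;
        }
        return null;
    }

    public static void main(String[] args) {
        ImageReader jpx = findReader("JPEG2000", false);
        System.out.println(jpx == null
                ? "no JPEG2000 reader installed"
                : "found " + jpx.getClass().getName());
    }
}
```

Whether raster support can be relaxed depends on the caller: as noted in the comments on this issue, readRaster() is still needed for CMYK images, so a flag like requireRaster would have to be chosen per use.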
[jira] [Commented] (PDFBOX-5872) Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding
[ https://issues.apache.org/jira/browse/PDFBOX-5872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876045#comment-17876045 ] Tilman Hausherr commented on PDFBOX-5872: - {quote}However, it doesn't appear that there is any real reliance on raster support in PDFBox (at least in version 3){quote} {{readRaster()}} is called for CMYK images. Wouldn't it be better to have your modified method as a separate private method just for JPX? > Support imageio-jnr / imageio-openjpeg library for JPEG2000 decoding > > > Key: PDFBOX-5872 > URL: https://issues.apache.org/jira/browse/PDFBOX-5872 > Project: PDFBox > Issue Type: Improvement > Components: Rendering >Affects Versions: 3.0.3 PDFBox >Reporter: Gábor Stefanik >Priority: Major > > [https://github.com/dbmdz/imageio-jnr] / > [https://mvnrepository.com/artifact/de.digitalcollections.imageio/imageio-openjpeg] > is an alternative JPEG2000 implementation for Java ImageIO that uses the > native OpenJPEG library as its backend. > Unfortunately, it doesn't work out of the box because it doesn't implement > raster reading (canReadRaster not overridden, returns false), and PDFBox uses > canReadRaster() to validate image reader instances. However, it doesn't > appear that there is any real reliance on raster support in PDFBox (at least > in version 3) - if I patch the library to lie about raster support, it seems > to work perfectly. > A further complication arises when the OpenJPEG native library cannot be > found: imageio-openjpeg returns null as the reader instance, which causes PDF > rendering to fail with an NPE, even if another JPEG2000 reader is available. > This can be remedied with a simple null check. > [https://github.com/apache/pdfbox/pull/197] shows a possible solution. Until > then, [https://github.com/Googulator/imageio-jnr] can be used with PDFBox > 3.0.3 as a workaround, so long as the native library is correctly installed. 
[jira] [Resolved] (PDFBOX-5869) Checkstyle
[ https://issues.apache.org/jira/browse/PDFBOX-5869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved PDFBOX-5869. - Fix Version/s: 2.0.33 3.0.4 PDFBox 4.0.0 Assignee: Tilman Hausherr Resolution: Fixed That's it for now. It will only prevent the worst "transgressions". > Checkstyle > -- > > Key: PDFBOX-5869 > URL: https://issues.apache.org/jira/browse/PDFBOX-5869 > Project: PDFBox > Issue Type: Bug >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Simon Steiner >Assignee: Tilman Hausherr >Priority: Major > Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0 > > > Can you enforce via the CI that mvn checkstyle:check passes > Disable any rules in the config you dont want to enforce -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5869) Checkstyle
[ https://issues.apache.org/jira/browse/PDFBOX-5869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5869: Affects Version/s: 3.0.3 PDFBox 2.0.32 > Checkstyle > -- > > Key: PDFBOX-5869 > URL: https://issues.apache.org/jira/browse/PDFBOX-5869 > Project: PDFBox > Issue Type: Bug >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Simon Steiner >Priority: Major > > Can you enforce via the CI that mvn checkstyle:check passes > Disable any rules in the config you dont want to enforce -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875214#comment-17875214 ] Tilman Hausherr commented on PDFBOX-5868: - Another thought I just had was to extend TextPosition, add the setter there, and pass that object to the base class's processTextPosition() method; however, TextPosition is final.
[jira] [Updated] (PDFBOX-5871) Rendering never finishes
[ https://issues.apache.org/jira/browse/PDFBOX-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5871: Affects Version/s: 3.0.3 PDFBox 2.0.32 > Rendering never finishes > > > Key: PDFBOX-5871 > URL: https://issues.apache.org/jira/browse/PDFBOX-5871 > Project: PDFBox > Issue Type: Bug > Components: Rendering >Affects Versions: 2.0.32, 3.0.3 PDFBox > Reporter: Tilman Hausherr >Priority: Major > Fix For: 2.0.33, 3.0.4 PDFBox > > Attachments: 2_42.pdf, image-2024-08-20-12-22-36-716.png > > > Submitted by Patrycja Zaremba on the users mailing list. I can confirm that > it doesn't end even when running overnight 😡 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5871) Rendering never finishes
[ https://issues.apache.org/jira/browse/PDFBOX-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5871: Attachment: (was: screenshot-1.png) > Rendering never finishes > > > Key: PDFBOX-5871 > URL: https://issues.apache.org/jira/browse/PDFBOX-5871 > Project: PDFBox > Issue Type: Bug > Components: Rendering > Reporter: Tilman Hausherr >Priority: Major > Fix For: 2.0.33, 3.0.4 PDFBox > > Attachments: 2_42.pdf, image-2024-08-20-12-22-36-716.png > > > Submitted by Patrycja Zaremba on the users mailing list. I can confirm that > it doesn't end even when running overnight 😡 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5871) Rendering never finishes
[ https://issues.apache.org/jira/browse/PDFBOX-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5871: Attachment: screenshot-1.png > Rendering never finishes > > > Key: PDFBOX-5871 > URL: https://issues.apache.org/jira/browse/PDFBOX-5871 > Project: PDFBox > Issue Type: Bug > Components: Rendering > Reporter: Tilman Hausherr >Priority: Major > Fix For: 2.0.33, 3.0.4 PDFBox > > Attachments: 2_42.pdf, screenshot-1.png > > > Submitted by Patrycja Zaremba on the users mailing list. I can confirm that > it doesn't end even when running overnight 😡 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Created] (PDFBOX-5871) Rendering never finishes
Tilman Hausherr created PDFBOX-5871: --- Summary: Rendering never finishes Key: PDFBOX-5871 URL: https://issues.apache.org/jira/browse/PDFBOX-5871 Project: PDFBox Issue Type: Bug Components: Rendering Reporter: Tilman Hausherr Fix For: 2.0.33, 3.0.4 PDFBox Attachments: 2_42.pdf Submitted by Patrycja Zaremba on the users mailing list. I can confirm that it doesn't end even when running overnight 😡 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875100#comment-17875100 ] Tilman Hausherr commented on PDFBOX-5868: - Oops, no, it's not that easy. I forgot that we need {{TextPosition.setUnicode()}} which doesn't exist in the released versions. And in the snapshot I've made it package local to avoid people messing around.
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874947#comment-17874947 ] Tilman Hausherr commented on PDFBOX-5868: - Yes this could be possible. All the changes except one could be done by using an extension of the stripper. The suppressDuplicateOverlappingText problem would have to be solved by saving the value when ActualText is active and restoring it afterwards.
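The save-and-restore idea mentioned in the comment above can be sketched in isolation. Everything here is hypothetical: ActualTextFlagSketch does not extend the real PDFTextStripper, the beginActualText()/endActualText() hooks and the considerActualText switch are invented names, and the assumption that duplicate suppression is switched off while ActualText is active is mine, not something the ticket states.

```java
// Hypothetical stand-in for a text stripper; not the PDFBox implementation.
final class ActualTextFlagSketch {
    private boolean suppressDuplicateOverlappingText;
    private boolean considerActualText = true; // assumed to be on by default
    private boolean savedSuppress;
    private boolean inActualText;

    void setSuppressDuplicateOverlappingText(boolean value) {
        suppressDuplicateOverlappingText = value;
    }

    boolean getSuppressDuplicateOverlappingText() {
        return suppressDuplicateOverlappingText;
    }

    void setConsiderActualText(boolean value) {
        considerActualText = value;
    }

    /** Called when a marked-content sequence with ActualText starts. */
    void beginActualText() {
        if (!considerActualText || inActualText) {
            return;
        }
        savedSuppress = suppressDuplicateOverlappingText; // save the user's setting
        suppressDuplicateOverlappingText = false;          // assumption: off while active
        inActualText = true;
    }

    /** Called when that marked-content sequence ends. */
    void endActualText() {
        if (!inActualText) {
            return;
        }
        suppressDuplicateOverlappingText = savedSuppress; // restore the user's setting
        inActualText = false;
    }

    public static void main(String[] args) {
        ActualTextFlagSketch s = new ActualTextFlagSketch();
        s.setSuppressDuplicateOverlappingText(true);
        s.beginActualText();
        System.out.println("inside:  " + s.getSuppressDuplicateOverlappingText());
        s.endActualText();
        System.out.println("outside: " + s.getSuppressDuplicateOverlappingText());
    }
}
```

Keeping the saved value in a field rather than re-reading a default means a caller's own setting survives the ActualText section unchanged.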
[jira] [Updated] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata
[ https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5870: Affects Version/s: 3.0.3 PDFBox 2.0.32 > [PATCH] Detect CMYK image without relying on metadata > - > > Key: PDFBOX-5870 > URL: https://issues.apache.org/jira/browse/PDFBOX-5870 > Project: PDFBox > Issue Type: Bug >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Simon Steiner >Priority: Major > Attachments: tmp.patch > > > If getNumChannels returns empty string we should use a different system to > detect a cmyk image, so the output image is not inverted -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Resolved] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata
[ https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr resolved PDFBOX-5870. - Fix Version/s: 2.0.33 3.0.4 PDFBox 4.0.0 Assignee: Tilman Hausherr Resolution: Fixed > [PATCH] Detect CMYK image without relying on metadata > - > > Key: PDFBOX-5870 > URL: https://issues.apache.org/jira/browse/PDFBOX-5870 > Project: PDFBox > Issue Type: Bug > Components: Rendering >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Simon Steiner >Assignee: Tilman Hausherr >Priority: Major > Labels: CMYK > Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0 > > Attachments: tmp.patch > > > If getNumChannels returns empty string we should use a different system to > detect a cmyk image, so the output image is not inverted -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
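The issue does not say which "different system" the attached patch uses. One plausible fallback, shown here purely as an assumption, is to look at the number of bands in the decoded raster when the metadata value is empty. The class and method names below are hypothetical, and a 4-band raster is only a hint (it could also be YCCK or carry an alpha channel), so this is a sketch rather than the actual fix.

```java
import java.awt.image.DataBuffer;
import java.awt.image.Raster;

// Hypothetical helper, not the code from tmp.patch.
final class CmykDetectionSketch {

    /**
     * numChannels is the value read from the image metadata; it may be null
     * or empty when the metadata is missing or unreadable.
     */
    static boolean looksLikeCmyk(String numChannels, Raster raster) {
        if (numChannels != null && !numChannels.trim().isEmpty()) {
            // Metadata is available: trust it.
            return "4".equals(numChannels.trim());
        }
        // Metadata gave nothing: fall back to the band count of the raster.
        return raster != null && raster.getNumBands() == 4;
    }

    public static void main(String[] args) {
        Raster fourBands = Raster.createInterleavedRaster(DataBuffer.TYPE_BYTE, 2, 2, 4, null);
        System.out.println(looksLikeCmyk("", fourBands));
    }
}
```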
[jira] [Updated] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata
[ https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5870: Labels: CMYK (was: ) > [PATCH] Detect CMYK image without relying on metadata > - > > Key: PDFBOX-5870 > URL: https://issues.apache.org/jira/browse/PDFBOX-5870 > Project: PDFBox > Issue Type: Bug > Components: Rendering >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Simon Steiner >Priority: Major > Labels: CMYK > Attachments: tmp.patch > > > If getNumChannels returns empty string we should use a different system to > detect a cmyk image, so the output image is not inverted -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Updated] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata
[ https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated PDFBOX-5870: Component/s: Rendering > [PATCH] Detect CMYK image without relying on metadata > - > > Key: PDFBOX-5870 > URL: https://issues.apache.org/jira/browse/PDFBOX-5870 > Project: PDFBox > Issue Type: Bug > Components: Rendering >Affects Versions: 2.0.32, 3.0.3 PDFBox >Reporter: Simon Steiner >Priority: Major > Attachments: tmp.patch > > > If getNumChannels returns empty string we should use a different system to > detect a cmyk image, so the output image is not inverted -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5870) [PATCH] Detect CMYK image without relying on metadata
[ https://issues.apache.org/jira/browse/PDFBOX-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874891#comment-17874891 ] Tilman Hausherr commented on PDFBOX-5870: - Could you attach a PDF where this happens? > [PATCH] Detect CMYK image without relying on metadata > - > > Key: PDFBOX-5870 > URL: https://issues.apache.org/jira/browse/PDFBOX-5870 > Project: PDFBox > Issue Type: Bug >Reporter: Simon Steiner >Priority: Major > Attachments: tmp.patch > > > If getNumChannels returns empty string we should use a different system to > detect a cmyk image, so the output image is not inverted -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874801#comment-17874801 ]

Tilman Hausherr edited comment on PDFBOX-5868 at 8/19/24 7:40 AM:
--

Here's the Excel file with the differences: [^content_diffs_with_exceptions-ActualText.xlsx]. This is from the Apache Tika project, which also uses PDFBox. Look at columns U and W (in yellow) and compare them with V and X. Usually V and X look better. Empty content in the yellow columns means we "lost" something during the update. Also look at the header column names to understand what they mean. Surprisingly (for me, maybe less for you), the non-Latin texts are the ones that improved the most.

was (Author: tilman):
Here's the Excel file with the differences: [^content_diffs_with_exceptions-ActualText.xlsx] Look at columns U and W (in yellow) and compare them with V and X. Usually V and X look better. Empty content in the yellow columns means we "lost" something during the update. Also look at the header column names to understand what they mean. Surprisingly (for me, maybe less for you), the non-Latin texts are the ones that improved the most.

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
> ---
>
>                 Key: PDFBOX-5868
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5868
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.32, 3.0.3 PDFBox
>         Environment: Ubuntu 22.04.4 LTS x86_64
>            Reporter: Manish S N
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: ActualText
>             Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>         Attachments: EmptyActualText_poppler.txt, EmptyActualText_reduced_poppler.txt, Main.java,
>                      PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf,
>                      PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf,
>                      PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf,
>                      Tilman's_solution_out.txt, adobe_out.txt, content_diffs_with_exceptions-ActualText.xlsx,
>                      image-2024-08-19-10-38-13-472.png, multilingual_test.pdf, okular_out.txt,
>                      pdfbox_out.txt, poppler_out.txt, screenshot-1.png, screenshot-2.png,
>                      suppressDuplicateOverlapping_out.txt
>
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used the export:text command line tool to obtain the results.
> * multilingual_test.pdf is the original PDF I made to test multilingual text extraction.
> * pdfbox_out.txt is the text file produced by PDFBox.
> * adobe_out.txt is the text file created by Adobe Reader's "save as text" feature.
>
> Observation:
> As you can see in the attachments, the text file obtained from PDFBox shows weird Unicode characters for Tamil and Bengali (for Hindi the characters are extracted but not overlapped; Japanese seems fine to me). In contrast, the text file obtained from Adobe Reader's "save as text" feature seems fine, and copy-pasting the text from my document viewer (Evince) also works.
> Questions:
> # Why are the outputs from PDFBox and Adobe different?
> # What can I do to extract the text from a multilingual PDF correctly?
> # Is there a way to apply pattern matching to the text in a PDF file and report matches without extracting the text first? (say, if the problem is with fonts and glyphs)
> ----
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using Apache Tika for parsing documents. I noticed a problem with extracted PDF text (other file types parse fine). I used the executable PDFBox jar to conclude that the _problem is in PDFBox and not in Tika._ I tested with Adobe Reader's text extraction to confirm that the problem is not with the PDF. I want to extract this multilingual text only to run pattern matching on it; I do not need to display the content, only to know whether the pattern is present or not (say, if the problem is with fonts and glyphs).
[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5868:

    Attachment: content_diffs_with_exceptions-ActualText.xlsx
[jira] [Commented] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874700#comment-17874700 ]

Tilman Hausherr commented on PDFBOX-5868:
-

I ran a comparison on several 10 PDF files. While there were many improvements, I discovered that /ActualText is also used to PREVENT text extraction, as shown by these files:
[^PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf]
[^PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf]
[^PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf]
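
The observation in the comment above, that an empty /ActualText can suppress extraction, can be illustrated with a small model. This is only a sketch under stated assumptions: the class `ActualTextSketch`, its `Span` type and the `extract` method are hypothetical names invented for illustration and are not part of PDFBox. It assumes an extractor that prefers the /ActualText replacement over the decoded glyphs whenever a replacement is present, even an empty one.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative model only; these classes are hypothetical and not part of PDFBox. */
final class ActualTextSketch {

    /** A run of page text, optionally wrapped in marked content carrying /ActualText. */
    static final class Span {
        final String glyphText;  // what the glyphs themselves decode to
        final String actualText; // null when the span has no /ActualText entry

        Span(String glyphText, String actualText) {
            this.glyphText = glyphText;
            this.actualText = actualText;
        }
    }

    /** Prefers /ActualText over decoded glyphs; an empty replacement yields no text. */
    static String extract(List<Span> spans) {
        StringBuilder sb = new StringBuilder();
        for (Span span : spans) {
            sb.append(span.actualText != null ? span.actualText : span.glyphText);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Span> page = Arrays.asList(
                new Span("Hello ", null),  // ordinary text
                new Span("hidden", ""),    // empty /ActualText hides these glyphs
                new Span("world", null));
        System.out.println(extract(page)); // prints: Hello world
    }
}
```

In this model the middle span is visible on the page but contributes nothing to the extracted text, which matches the behaviour described for the EmptyActualText files. How PDFBox itself handles such spans after the fix is not shown here.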
[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5868:

    Attachment: PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText_reduced.pdf
                PDFBOX-5868-7FHMU2HNOUUPENUPKZDGD2V65YEVABRS-EmptyActualText.pdf
                PDFBOX-5868-SI5K4X4Z55SQAUPLAUP6QRRWT3UD3LAA-EmptyActualText.pdf
[jira] [Commented] (PDFBOX-5869) Checkstyle
[ https://issues.apache.org/jira/browse/PDFBOX-5869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874647#comment-17874647 ]

Tilman Hausherr commented on PDFBOX-5869:
-

It should now work for the trunk, both with mvn checkstyle:check and for an ordinary build. It will prevent only the "worst" things. I didn't manage to create a regexp for all legal headers and mostly gave up on that one after failing with xmpbox. Maybe I shouldn't have bothered at all, because we have already delegated that part to the "pedantic" build profile.

> Checkstyle
> --
>
>                 Key: PDFBOX-5869
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5869
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Simon Steiner
>            Priority: Major
>
> Can you enforce via the CI that mvn checkstyle:check passes?
> Disable any rules in the config that you don't want to enforce.
[jira] [Comment Edited] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17874501#comment-17874501 ]

Tilman Hausherr edited comment on PDFBOX-5868 at 8/17/24 12:57 PM:
---

It's already done elsewhere and makes sure that the logic isn't applied during an ActualText segment:
{code}
if (suppressDuplicateOverlappingText && actualText == null)
{code}
Your proposed change ends up setting {{suppressDuplicateOverlappingText}} to true even if it was set to false (it's an obscure option of the stripper).

was (Author: tilman):
It's already done elsewhere:
{code}
if (suppressDuplicateOverlappingText && actualText == null)
{code}
Your proposed change ends up setting {{suppressDuplicateOverlappingText}} to true even if it was set to false (it's an obscure option of the stripper).
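
The guard quoted in the comment above can be sketched in isolation. This is a simplified model, not PDFBox's implementation: the class `OverlapGuardSketch`, its methods and the position tolerance are hypothetical, and "overlapping duplicate" is reduced to "same glyph at the same x position". The sketch also emits glyphs unchanged inside an ActualText segment and does not model the replacement itself. What it is meant to show is that the user's option is only read, never overwritten, and that the duplicate check is skipped while an ActualText segment is open.

```java
/** Illustrative model only; this class is hypothetical and not PDFBox's stripper. */
final class OverlapGuardSketch {
    private static final float TOLERANCE = 0.01f; // assumed value, for illustration

    private final boolean suppressDuplicateOverlappingText; // caller's option, never changed here
    private String actualText;       // non-null only inside an /ActualText segment
    private String lastGlyph;        // previous glyph that was emitted
    private float lastX = Float.NaN; // x position of that glyph
    private final StringBuilder out = new StringBuilder();

    OverlapGuardSketch(boolean suppressDuplicateOverlappingText) {
        this.suppressDuplicateOverlappingText = suppressDuplicateOverlappingText;
    }

    void beginActualText(String replacement) {
        actualText = replacement;
    }

    void endActualText() {
        actualText = null;
    }

    void showGlyph(String glyph, float x) {
        // Same shape as the quoted condition: option enabled AND not inside ActualText.
        if (suppressDuplicateOverlappingText && actualText == null) {
            boolean samePlace = Math.abs(x - lastX) < TOLERANCE;
            if (samePlace && glyph.equals(lastGlyph)) {
                return; // drop the overlapping duplicate
            }
        }
        lastGlyph = glyph;
        lastX = x;
        out.append(glyph);
    }

    String text() {
        return out.toString();
    }

    public static void main(String[] args) {
        OverlapGuardSketch stripper = new OverlapGuardSketch(true);
        stripper.showGlyph("A", 10f);
        stripper.showGlyph("A", 10f); // suppressed: same glyph, same place
        stripper.beginActualText("AA");
        stripper.showGlyph("A", 20f);
        stripper.showGlyph("A", 20f); // kept: guard is off inside ActualText
        stripper.endActualText();
        System.out.println(stripper.text()); // prints: AAA
    }
}
```

Because the option is a final field that the guard only reads, a caller who set it to false keeps duplicate suppression disabled throughout, which is the property the comment says the proposed change would have broken.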
[jira] [Closed] (PDFBOX-2740) Text extraction failed on Korean PDF
[ https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-2740.
---
    Resolution: Not A Problem

The /ActualText problem was fixed in PDFBOX-5868. However, extraction of the file had already been improved before.

> Text extraction failed on Korean PDF
> 
>
>                 Key: PDFBOX-2740
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2740
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 1.8.8, 1.8.9, 2.0.0
>            Reporter: Julien Ortega
>            Assignee: John Hewson
>            Priority: Major
>              Labels: ActualText
>         Attachments: g_KO_201506-ReaderDC-cutAndPaste.txt, g_KO_201506-ReaderDC-saveAsText.txt, g_KO_201506.pdf, g_KO_201506.txt
>
>
> Trying to extract text from a Korean PDF gives me a lot of warnings:
> WARNING: No Unicode mapping for US (33) in font DVCAYA+WtKoBaeumMyungjoL063zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for NAK (33) in font JYLDGG+WtKoBaeumMyungjoL053zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for RS (38) in font WRYULE+WtKoBaeumMyungjoL013zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont
> WARNING: Invalid ToUnicode CMap in font FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for DEL (33) in font FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont
> WARNING: Invalid ToUnicode CMap in font OOLNBG+WtKoBaeumGothicL0122b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for SOH (33) in font OOLNBG+WtKoBaeumGothicL0122b4?Pw
> and the result is not readable. The PDF contains the necessary conversion table, because every PDF reader (desktop or mobile) lets me copy and paste the text without problems.
[jira] [Reopened] (PDFBOX-2740) Text extraction failed on Korean PDF
[ https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr reopened PDFBOX-2740:
-
[jira] [Updated] (PDFBOX-2740) Text extraction failed on Korean PDF
[ https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-2740:

    Labels: ActualText  (was: )
[jira] [Closed] (PDFBOX-4532) PDFTextStripper replacing the decimal with white space
[ https://issues.apache.org/jira/browse/PDFBOX-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-4532.
---
    Resolution: Duplicate

Fixed in PDFBOX-5868

> PDFTextStripper replacing the decimal with white space
> --
>
>                 Key: PDFBOX-4532
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4532
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.15
>            Reporter: Akash Gupta
>            Priority: Major
>              Labels: ActualText
>         Attachments: FSUSA00BDD.pdf, PDFBOX-4532-reduced.pdf, SO71723006.pdf, code_textStripper.PNG, numbers_without_decimal.PNG
>
>
> I'm using PDFTextStripperByArea, to be specific, and trying to extract a particular area from the document.
> In the output, most of the numbers (all but one) have their decimal point replaced by a white space. When I copy and paste the text using Adobe Reader or Chrome, the decimal points are preserved.
[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5868:
    Fix Version/s: 2.0.33
                   3.0.4 PDFBox
                   4.0.0

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly
> but adobe reader's save as text does
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt,
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt,
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt
>
> I downloaded the latest executable jar of PDFBox (3.0.3) for testing and used
> the export:text command line tool to obtain the results.
> * multilingual_test.pdf is the original PDF I made to test multilingual
> text extraction.
> * pdfbox_out.txt is the text file produced by PDFBox.
> * adobe_out.txt is the text file created by Adobe Reader's "save as text"
> feature.
>
> Observation:
> As you can see in the attachments, the text file obtained from PDFBox shows weird
> Unicode characters for Tamil and Bengali (for Hindi the characters are extracted but
> not overlapped; Japanese seems fine to me). In contrast, the text file
> obtained from Adobe Reader's "save as text" feature seems fine, and copy-pasting
> the text from my document viewer (Evince) also works.
> Questions:
> # Why are the outputs from PDFBox and Adobe different?
> # What can I do to extract the text from a multilingual PDF correctly?
> # Is there a way to apply pattern matching to text in a PDF file and declare
> matches without extracting the text first? (say, if the problem is with fonts
> and glyphs)
> ---
> My use case, FYI:
> I am trying to extract text from files and run pattern matching. I am using
> Apache Tika for parsing documents. I noticed a problem with extracted PDF text
> (other file types parse fine). I used the executable PDFBox jar to conclude that the
> _problem is in PDFBox and not in Tika._ I tested with Adobe Reader's text
> extraction to confirm the problem is not with the PDF. I want to extract this
> multilingual text only to run pattern matching on it; I do not need to
> display the content, only to know whether the pattern is present or not (say, if the
> problem is with fonts and glyphs)
[jira] [Assigned] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr reassigned PDFBOX-5868:
    Assignee: Tilman Hausherr

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly
> but adobe reader's save as text does
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt,
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt,
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt
[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5868:
    Affects Version/s: 2.0.32

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly
> but adobe reader's save as text does
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.32, 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Priority: Major
> Labels: ActualText
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt,
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt,
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt
[jira] [Updated] (PDFBOX-5868) PDFBox not extracting text of non-latin languages(tamil, bengali) properly but adobe reader's save as text does
[ https://issues.apache.org/jira/browse/PDFBOX-5868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-5868:
    Labels: ActualText  (was: )

> PDFBox not extracting text of non-latin languages(tamil, bengali) properly
> but adobe reader's save as text does
>
> Key: PDFBOX-5868
> URL: https://issues.apache.org/jira/browse/PDFBOX-5868
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 3.0.3 PDFBox
> Environment: Ubuntu 22.04.4 LTS x86_64
> Reporter: Manish S N
> Priority: Major
> Labels: ActualText
> Attachments: Main.java, Tilman's_solution_out.txt, adobe_out.txt,
> multilingual_test.pdf, okular_out.txt, pdfbox_out.txt, poppler_out.txt,
> screenshot-1.png, screenshot-2.png, suppressDuplicateOverlapping_out.txt
[jira] [Updated] (PDFBOX-3248) Unwanted spaces in text extraction (2)
[ https://issues.apache.org/jira/browse/PDFBOX-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3248:
    Labels: ActualText  (was: )

> Unwanted spaces in text extraction (2)
>
> Key: PDFBOX-3248
> URL: https://issues.apache.org/jira/browse/PDFBOX-3248
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.11, 2.0.0
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: ActualText
> Attachments: PDFBOX-3248-spaces.pdf
>
> The attached file provided by Francisco from the user mailing list has spaces
> in text extraction regardless of setting spacingTolerance or
> averageCharTolerance. I was unable to extract "Cada frasco ampolla" which
> looked straightforward in rendering, but it always appeared as "Ca da fras co
> ampo lla". Adobe Reader has no such problem.
> The content stream has this:
> {code}
> 6 0 1.058 6 122.0924 312.51 Tm
> (Ca) Tj
> /Span << /ActualText (\376\377\000\255) >> BDC
> ( ) Tj
> EMC
> [ (da ) -301 (fras) ] TJ
> /Span << /ActualText (\376\377\000\255) >> BDC
> ( ) Tj
> EMC
> [ (co ) -301 (ampo) ] TJ
> /Span << /ActualText (\376\377\000\255) >> BDC
> ( ) Tj
> EMC
> [ (lla ) -301 (con) ] TJ
> {code}
> So there are really spaces there, and we keep them. Adobe is smarter, and
> ignores them because they are overwritten thanks to the "-301" backwards
> positioning.
> Would /ActualText help? However it is always the same here...
> Would it help to ignore spaces and decide based on positions only, maybe as
> an option? I added these two lines below the first existing one:
> {code}
> // existing line
> String characterValue = position.getUnicode();
> // added: skip space glyphs so that word breaks are decided from positions only
> if (" ".equals(characterValue))
>     continue;
> {code}
> The output looks promising:
> {quote}
> F ó r m u l a :
> Cronopen® Balsámico Adultos:
> Cada frasco ampolla contiene: ampicilina (como ampicilina sódica)
> 100 mg; ampicilina (como ampicilina benzatínica) 500 mg.
> Cada ampolla solvente de 5 ml contiene: dipirona 1000 mg; guaife
> nesina 100 mg. Exc.: bisulfito de sodio; agua destilada.
> {quote}
> A complete test brings many differences; most are harmless or are
> improvements. Only one test case really fails, hello3.pdf. Original extract
> is "Hello محمد World.", new extract is "Hello .Worldمحمد".
> More from Francisco:
> {quote}
> As additional information, I've found 2 related posts (about other tools)
> on StackOverflow:
> http://stackoverflow.com/questions/34579824/itext-how-to-tweak-text-extraction
> http://stackoverflow.com/questions/22671974/itext-reading-pdf-1s-as-up-arrows-error/22688775#22688775
> {quote}
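A note on the /ActualText value quoted in the PDFBOX-3248 content stream: the literal string (\376\377\000\255) is four bytes written as octal escapes. The first two bytes, FE FF, are the UTF-16BE byte order mark used by PDF text strings, and the remaining 00 AD is U+00AD, SOFT HYPHEN. In other words, each marked space is tagged as a soft hyphen rather than as a real space. Below is a minimal sketch of that decoding using only the Java standard library. The class and method names (ActualTextDecoder, unescapeOctal, decodeTextString) are hypothetical and are not part of PDFBox; only octal escapes are handled, and the fallback for strings without a byte order mark uses ISO-8859-1 as a rough stand-in for PDFDocEncoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; this is not PDFBox API.
final class ActualTextDecoder {

    // Turns the inside of a PDF literal string, e.g. \376\377\000\255, into raw bytes.
    // Assumption: only octal escapes (1 to 3 digits) and plain characters occur.
    static byte[] unescapeOctal(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c != '\\') {
                out.write(c & 0xFF);
                i++;
                continue;
            }
            i++; // skip the backslash
            int value = 0;
            int digits = 0;
            while (i < literal.length() && digits < 3
                    && literal.charAt(i) >= '0' && literal.charAt(i) <= '7') {
                value = value * 8 + (literal.charAt(i) - '0');
                i++;
                digits++;
            }
            if (digits == 0) {
                throw new IllegalArgumentException("unsupported escape at index " + i);
            }
            out.write(value & 0xFF);
        }
        return out.toByteArray();
    }

    // Decodes a PDF text string: UTF-16BE if it starts with the FE FF byte order mark.
    // Simplification: everything else is read as ISO-8859-1, not true PDFDocEncoding.
    static String decodeTextString(byte[] bytes) {
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = decodeTextString(unescapeOctal("\\376\\377\\000\\255"));
        // Prints the code point of the single decoded character: U+00AD
        System.out.printf("U+%04X%n", (int) text.charAt(0));
    }
}
```

An extractor that honours /ActualText would see a soft hyphen in place of each marked space; whether it should then drop that character or treat it as a hyphenation point is a design decision this sketch does not make.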