[ https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847300#comment-17847300 ]
ASF subversion and git services commented on PDFBOX-5580:
---------------------------------------------------------

Commit 1917786 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1917786 ]

PDFBOX-5660, PDFBOX-5580: revert commit due to incompatibility with PDFTextStripperByArea

> PDFTextStripperByArea ignores text for overlapping areas (regions) when
> suppressing duplicate overlapping text
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-5580
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5580
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.27, 3.0.0 PDFBox
>            Reporter: Sebastian Holzki
>            Priority: Minor
>         Attachments: test.pdf
>
>
> h3. Problem
> Recently we encountered duplicate text in our clients' PDF documents.
> Applications typically create such duplicates to simulate bold text when
> no bold variant of a font is available. Fortunately, PDFBox's
> PDFTextStripperByArea has logic to ignore exact duplicates at the same
> position in these situations (inherited from the normal PDFTextStripper),
> so we changed from setSuppressDuplicateOverlappingText(false) to true.
> However, we found that text for multiple regions is not extracted
> correctly in this case when certain conditions are met:
> When multiple regions overlap each other and would yield exactly the same
> text, the text of the first region is extracted correctly, but any
> following region with the same text remains empty.
> We believe this is a bug caused by duplicate suppression not being handled
> correctly in PDFTextStripperByArea.
> h3. Possible cause
> While investigating this problem we found that PDFTextStripperByArea swaps
> charactersByArticle for multiple regions and interprets a single page
> multiple times (once for each region).
> In PDFTextStripper, a private HashMap, characterListMapping, keeps track
> of possible duplicate symbols and their positions. The HashMap is not
> reset after each region extraction, which leads to characters being
> ignored for subsequent areas.
> Since the HashMap is private, we were not able to subclass
> PDFTextStripperByArea and adjust its behavior to test this finding.
> h3. Workaround
> When regions are extracted one at a time for every page, everything works
> fine. We currently don't see any performance disadvantages.
> h3. Reproduction
> The attached PDF file does not actually include duplicate overlapping
> text, since this is not needed to reproduce the issue.
>
> {code:java}
> import java.awt.geom.Rectangle2D;
> import java.io.File;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripperByArea;
>
> // PDDocument.load is the 2.0.x API; on 3.0.x use Loader.loadPDF(File) instead.
> try (final PDDocument doc = PDDocument.load(new File("C:\\Source\\test.pdf"))) {
>     final PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>     stripper.setSuppressDuplicateOverlappingText(true);
>     stripper.setPageEnd("");
>     // Two overlapping regions that cover the same text
>     final Rectangle2D areaA = new Rectangle2D.Double(45, 319, 124, 19);
>     final Rectangle2D areaB = new Rectangle2D.Double(43, 319, 130, 19);
>     stripper.addRegion("A", areaA);
>     stripper.addRegion("B", areaB);
>     stripper.extractRegions(doc.getPage(0));
>     // Region A contains the text; region B is unexpectedly empty
>     System.out.println("A: " + stripper.getTextForRegion("A"));
>     System.out.println("B: " + stripper.getTextForRegion("B"));
> }
> {code}
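
The suspected mechanism from "Possible cause" can be illustrated with a small stand-alone model. This is only a sketch and not PDFBox code: the class DuplicateSuppressionSketch, the Glyph type, and the glyph coordinates are hypothetical, and the loop order simply follows the report's description (one pass over the page per region). Only the two region rectangles are taken from the reproduction above. It assumes that duplicate suppression boils down to remembering "text at position" keys in a collection that outlives a single region.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of position-based duplicate suppression. NOT PDFBox code:
// every class, field, and method name here is hypothetical.
final class DuplicateSuppressionSketch {

    /** Stand-in for a positioned glyph (PDFBox's TextPosition carries far more). */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }

        /** Key used to recognise "same text at the same position". */
        String key() {
            return text + "@" + x + "," + y;
        }
    }

    // Plays the role of the private duplicate-tracking map named in the report.
    private final Set<String> seen = new HashSet<>();
    private final boolean resetPerRegion;

    DuplicateSuppressionSketch(boolean resetPerRegion) {
        this.resetPerRegion = resetPerRegion;
    }

    Map<String, String> extract(List<Glyph> page, Map<String, Rectangle2D> regions) {
        Map<String, String> textByRegion = new LinkedHashMap<>();
        for (Map.Entry<String, Rectangle2D> region : regions.entrySet()) {
            if (resetPerRegion) {
                seen.clear(); // the reset that the report says is missing
            }
            StringBuilder text = new StringBuilder();
            for (Glyph glyph : page) { // one pass over the page per region
                boolean inside = region.getValue().contains(glyph.x, glyph.y);
                // Set.add returns false for a glyph that was already recorded.
                if (inside && seen.add(glyph.key())) {
                    text.append(glyph.text);
                }
            }
            textByRegion.put(region.getKey(), text.toString());
        }
        return textByRegion;
    }

    /** Runs the model on two made-up glyphs and the two regions from the report. */
    static Map<String, String> demo(boolean resetPerRegion) {
        List<Glyph> page = Arrays.asList(new Glyph("H", 50, 325), new Glyph("i", 58, 325));
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("A", new Rectangle2D.Double(45, 319, 124, 19));
        regions.put("B", new Rectangle2D.Double(43, 319, 130, 19));
        return new DuplicateSuppressionSketch(resetPerRegion).extract(page, regions);
    }

    public static void main(String[] args) {
        System.out.println("shared state:     " + demo(false)); // {A=Hi, B=}
        System.out.println("reset per region: " + demo(true));  // {A=Hi, B=Hi}
    }
}
```

With the shared set, region B comes back empty because every glyph it covers was already recorded while region A was processed. This matches the symptom described in the report. Clearing the set per region restores B's text in this model, but whether the same change is appropriate inside PDFBox is not something the sketch can show.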