[jira] [Commented] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

2024-05-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847536#comment-17847536
 ] 

ASF subversion and git services commented on PDFBOX-5580:
-

Commit 1917802 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1917802 ]

PDFBOX-5580: initialize currentPageNo to 1 and increment after the current page 
so we can restore reverted commit, as suggested by Andreas Lehmkühler

> PDFTextStripperByArea ignores text for overlapping areas (regions) when 
> suppressing duplicate overlapping text
> --
>
> Key: PDFBOX-5580
> URL: https://issues.apache.org/jira/browse/PDFBOX-5580
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.27, 3.0.0 PDFBox
> Reporter: Sebastian Holzki
> Priority: Minor
> Attachments: test.pdf
>
>
> h3. Problem
> Recently we encountered duplicate text in our clients' PDF documents. Such 
> duplicates are typically created by applications to simulate bold text when 
> no bold variant of a font is available. Fortunately, PDFBox's 
> PDFTextStripperByArea has logic to ignore exact duplicates at the same 
> positions in these situations (inherited from the normal 
> PDFTextStripper), so we changed 
> setSuppressDuplicateOverlappingText(false) to true.
> However, text for multiple regions is not extracted correctly in this case 
> when some special conditions are met:
> When multiple regions overlap each other and would yield exactly the same 
> text, the text of the first region is extracted correctly but any following 
> region with the same text remains empty.
> We believe this is a bug caused by duplicate suppression not being handled 
> correctly in PDFTextStripperByArea.
> h3. Possible cause
> While investigating this problem we found that PDFTextStripperByArea swaps 
> charactersByArticle for multiple regions and interprets a single page 
> multiple times (once for each region). In PDFTextStripper, a private HashMap, 
> characterListMapping, keeps track of possible duplicate symbols and their 
> positions. The HashMap is not reset after each region extraction, which 
> leads to characters being ignored for subsequent areas.
> Since the HashMap is private, we were not able to subclass 
> PDFTextStripperByArea and adjust its behavior to test this finding.
> h3. Workaround
> Extracting the regions one at a time for every page works fine. 
> We currently don't see any performance disadvantages.
> h3. Reproduction
> The attached PDF file does not actually include duplicate overlapping text 
> since this is not needed to reproduce the issue.
>  
> {code:java}
> try (final PDDocument doc = PDDocument.load(new File("C:\\Source\\test.pdf"))) {
>     final PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>     stripper.setSuppressDuplicateOverlappingText(true);
>     stripper.setPageEnd("");
>     final Rectangle2D areaA = new Rectangle2D.Double(45, 319, 124, 19);
>     final Rectangle2D areaB = new Rectangle2D.Double(43, 319, 130, 19);
>     stripper.addRegion("A", areaA);
>     stripper.addRegion("B", areaB);
>     stripper.extractRegions(doc.getPage(0));
>     System.out.println("A: " + stripper.getTextForRegion("A"));
>     System.out.println("B: " + stripper.getTextForRegion("B"));
> }
> {code}
>  
>  
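
The mechanism described under "Possible cause" in the quoted issue can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions, not PDFBox code: the class DuplicateSuppressionModel, its Glyph and Region types, and the resetPerRegion flag are all hypothetical names invented for illustration, and duplicates are matched on exact text and position only. It shows how duplicate-tracking state that survives from one region pass to the next can leave a second, overlapping region empty, and that clearing the state per region avoids this in the model. It does not claim to show how the commits referenced in this thread address the issue.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified model of duplicate suppression across regions; NOT PDFBox code. */
public class DuplicateSuppressionModel {

    /** Hypothetical glyph: a piece of text at an exact page position. */
    public static final class Glyph {
        final String text;
        final int x;
        final int y;

        public Glyph(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Hypothetical named rectangular region. */
    public static final class Region {
        final String name;
        final int x;
        final int y;
        final int width;
        final int height;

        public Region(String name, int x, int y, int width, int height) {
            this.name = name;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        boolean contains(Glyph g) {
            return g.x >= x && g.x < x + width && g.y >= y && g.y < y + height;
        }
    }

    /**
     * Walks the page once per region, as the issue describes. The "seen" set
     * plays the role of the duplicate-tracking map; resetPerRegion controls
     * whether it is cleared before each region pass (an assumption of this model).
     */
    public static Map<String, String> extract(List<Glyph> page, List<Region> regions,
                                              boolean resetPerRegion) {
        Set<String> seen = new HashSet<>();
        Map<String, String> textByRegion = new LinkedHashMap<>();
        for (Region region : regions) {
            if (resetPerRegion) {
                seen.clear();
            }
            StringBuilder text = new StringBuilder();
            for (Glyph glyph : page) {
                String key = glyph.text + "@" + glyph.x + "," + glyph.y;
                // Set.add returns false when the key is already present,
                // i.e. the glyph is treated as a duplicate and skipped.
                if (region.contains(glyph) && seen.add(key)) {
                    text.append(glyph.text);
                }
            }
            textByRegion.put(region.name, text.toString());
        }
        return textByRegion;
    }

    public static void main(String[] args) {
        List<Glyph> page = List.of(new Glyph("H", 50, 320), new Glyph("i", 60, 320));
        // Same rectangles as regions A and B in the reproduction snippet.
        List<Region> regions = List.of(
                new Region("A", 45, 319, 124, 19),
                new Region("B", 43, 319, 130, 19));
        System.out.println("shared state:     " + extract(page, regions, false));
        System.out.println("reset per region: " + extract(page, regions, true));
    }
}
```

With state shared across regions, the model returns the text for A and an empty string for B, which matches the behavior reported in the issue; with the state cleared before each region, both regions return the same text.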



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

2024-05-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847534#comment-17847534
 ] 

ASF subversion and git services commented on PDFBOX-5580:
-

Commit 1917800 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1917800 ]

PDFBOX-5580: initialize currentPageNo to 1 and increment after the current page 
so we can restore reverted commit, as suggested by Andreas Lehmkühler




[jira] [Commented] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

2024-05-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847535#comment-17847535
 ] 

ASF subversion and git services commented on PDFBOX-5580:
-

Commit 1917801 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1917801 ]

PDFBOX-5580: initialize currentPageNo to 1 and increment after the current page 
so we can restore reverted commit, as suggested by Andreas Lehmkühler




[jira] [Commented] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

2024-05-17 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847318#comment-17847318
 ] 

Tilman Hausherr commented on PDFBOX-5580:
-

Sorry for the noise, wrong issue.




[jira] [Commented] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

2024-05-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847303#comment-17847303
 ] 

ASF subversion and git services commented on PDFBOX-5580:
-

Commit 1917788 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1917788 ]

PDFBOX-5580: add test for PDFTextStripperByArea




[jira] [Commented] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

2024-05-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847304#comment-17847304
 ] 

ASF subversion and git services commented on PDFBOX-5580:
-

Commit 1917789 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1917789 ]

PDFBOX-5580: add test for PDFTextStripperByArea




[jira] [Commented] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

2024-05-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847305#comment-17847305
 ] 

ASF subversion and git services commented on PDFBOX-5580:
-

Commit 1917790 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1917790 ]

PDFBOX-5580: add test for PDFTextStripperByArea




[jira] [Commented] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

2024-05-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847302#comment-17847302
 ] 

ASF subversion and git services commented on PDFBOX-5580:
-

Commit 1917787 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1917787 ]

PDFBOX-5660, PDFBOX-5580: revert commit due to incompatibility with 
PDFTextStripperByArea




[jira] [Commented] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

2024-05-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847300#comment-17847300
 ] 

ASF subversion and git services commented on PDFBOX-5580:
-

Commit 1917786 from Tilman Hausherr in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1917786 ]

PDFBOX-5660, PDFBOX-5580: revert commit due to incompatibility with 
PDFTextStripperByArea

> PDFTextStripperByArea ignores text for overlapping areas (regions) when 
> suppressing duplicate overlapping text
> --
>
> Key: PDFBOX-5580
> URL: https://issues.apache.org/jira/browse/PDFBOX-5580
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.27, 3.0.0 PDFBox
>Reporter: Sebastian Holzki
>Priority: Minor
> Attachments: test.pdf
>
>
> h3. Problem
> Recently we encountered duplicate texts in our clients PDF documents which 
> are typically created by applications to simulate some kind of bold text when 
> no bold variant of a font is available. Fortunately, PDFBox's 
> PDFTextStripperByArea has some logic to ignore exact duplicates at the same 
> positions for these situations (which is inherited from the normal 
> PDFTextStripper). So we changed from 
> setSuppressDuplicateOverlappingText(false) to true.
> But we encountered that texts for multiple regions are not extracted 
> correctly in this case when some special conditions are met:
> When using multiple regions which overlap each other and would provide 
> exactly the same text, the first region text is extracted correctly but any 
> following region with same text remains empty.
> We believe this is a bug due to duplicate suppression not being respected 
> correctly in PDFTextStripperByArea.
> h3. Possible cause
> While investigating this problem we found that PDFTextStripperByArea swaps 
> charactersByArticle for multiple regions and interprets a single page 
> multiple times (once for each region). In PDFTextStripper a private HashMap 
> characterListMapping keeps track of possible duplicate symbols with their 
> positions. The HashMap is not being reset after each region extraction which 
> leads to characters being ignored for subsequent areas.
> Since the HashMap is private we were not able to subclass and customize 
> PDFTextStripperByArea with some adjusted behavior to test this finding.
> h3. Workaround
> When extracting regions one at a time for every page everything works fine. 
> We currently don't see any performance disadvantages.
> h3. Reproduction
> The attached PDF file does not actually include duplicate overlapping text 
> since this is not needed to reproduce the issue.
>  
> {code:java}
> try (final PDDocument doc = PDDocument.load(new File("C:\\Source\\test.pdf"))) {
>     final PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>     stripper.setSuppressDuplicateOverlappingText(true);
>     stripper.setPageEnd("");
>     // Two overlapping regions that cover the same text
>     final Rectangle2D areaA = new Rectangle2D.Double(45, 319, 124, 19);
>     final Rectangle2D areaB = new Rectangle2D.Double(43, 319, 130, 19);
>     stripper.addRegion("A", areaA);
>     stripper.addRegion("B", areaB);
>     stripper.extractRegions(doc.getPage(0));
>     System.out.println("A: " + stripper.getTextForRegion("A"));
>     System.out.println("B: " + stripper.getTextForRegion("B"));
> }
> {code}
>  
>  






[jira] [Commented] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

2024-05-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847298#comment-17847298
 ] 

ASF subversion and git services commented on PDFBOX-5580:
-

Commit 1917785 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1917785 ]

PDFBOX-5660, PDFBOX-5580: revert commit due to incompatibility with 
PDFTextStripperByArea







[jira] [Commented] (PDFBOX-5580) PDFTextStripperByArea ignores text for overlapping areas (regions) when suppressing duplicate overlapping text

2023-04-07 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17709630#comment-17709630
 ] 

Tilman Hausherr commented on PDFBOX-5580:
-

I had a look at PDFTextStripperByArea. From what I understand, the text isn't 
extracted for one region and then for the next; instead, each text 
position is processed by the extended {{processTextPosition()}} method. So we 
can't just reset the HashMap you mention. We would need separate HashMaps for 
each region and switch between them.
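
To illustrate the idea, here is a minimal, self-contained sketch in plain Java. It is not PDFBox code: the {{Glyph}} and {{Region}} classes, the {{extract()}} and {{demo()}} methods, and the exact-match duplicate key are all hypothetical simplifications made up for this example. It only models the situation described above, where every text position is offered to every region in a single pass, and compares one shared duplicate-tracking set with one set per region.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RegionDuplicateSketch {

    /** Hypothetical stand-in for a text position: one glyph at (x, y). */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }

        /** Simplified duplicate key: same text at exactly the same position. */
        String key() {
            return text + "@" + x + "," + y;
        }
    }

    /** Hypothetical stand-in for a named rectangular region. */
    static final class Region {
        final String name;
        final double x;
        final double y;
        final double w;
        final double h;

        Region(String name, double x, double y, double w, double h) {
            this.name = name;
            this.x = x;
            this.y = y;
            this.w = w;
            this.h = h;
        }

        boolean contains(Glyph g) {
            return g.x >= x && g.x <= x + w && g.y >= y && g.y <= y + h;
        }
    }

    /**
     * Offers every glyph to every region in one pass and suppresses duplicates.
     * With perRegionState == false a single set is shared by all regions;
     * with perRegionState == true each region has its own set.
     */
    static Map<String, String> extract(List<Glyph> glyphs, List<Region> regions,
                                       boolean perRegionState) {
        Map<String, StringBuilder> text = new LinkedHashMap<>();
        Map<String, Set<String>> seenByRegion = new HashMap<>();
        Set<String> sharedSeen = new HashSet<>();
        for (Region r : regions) {
            text.put(r.name, new StringBuilder());
            seenByRegion.put(r.name, new HashSet<>());
        }
        for (Glyph g : glyphs) {
            for (Region r : regions) {
                if (!r.contains(g)) {
                    continue;
                }
                // Pick the duplicate-tracking state that applies to this region.
                Set<String> seen = perRegionState ? seenByRegion.get(r.name) : sharedSeen;
                if (seen.add(g.key())) {
                    text.get(r.name).append(g.text);
                }
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : text.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }

    /** Sample data: "H" drawn twice at the same spot (fake bold), then "i". */
    static Map<String, String> demo(boolean perRegionState) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("H", 50, 325),
                new Glyph("H", 50, 325),
                new Glyph("i", 56, 325));
        List<Region> regions = Arrays.asList(
                new Region("A", 45, 319, 124, 19),
                new Region("B", 43, 319, 130, 19));
        return extract(glyphs, regions, perRegionState);
    }

    public static void main(String[] args) {
        System.out.println("shared state:     " + demo(false));
        System.out.println("per-region state: " + demo(true));
    }
}
```

With the shared set, region B stays empty because region A has already registered every glyph; with one set per region, both regions yield "Hi" and the doubled "H" is still suppressed within each region.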



