[jira] [Comment Edited] (PDFBOX-6020) mix of subscript and superscript can lead to unnecessary new lines during text extraction

Tilman Hausherr (Jira) Sat, 14 Jun 2025 07:28:02 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17975478#comment-17975478
 ]


Tilman Hausherr edited comment on PDFBOX-6020 at 6/14/25 2:25 PM:
------------------------------------------------------------------

I did the regression tests anyway because the machine looked unused, and there 
were many differences, e.g. for this file:  
[^NAKON57LBMRU2ACXUKRCTU5FCBPX4AN2.pdf]. All files that I looked into had 
diagonal parts. There is an option for these in text extraction (not in the 
stripper but in applications) but then the tests take twice as long.


was (Author: tilman):
I did the regression tests anyway because the machine looked unused, and there 
were many differences, e.g. for this file:  
[^NAKON57LBMRU2ACXUKRCTU5FCBPX4AN2.pdf]. All files that I looked into had 
diagonal parts.

> mix of subscript and superscript can lead to unnecessary new lines during 
> text extraction
> -----------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6020
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6020
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.5 PDFBox
>            Reporter: Markus Seifert
>            Priority: Minor
>         Attachments: DE102016007628A1-Page2.pdf, 
> NAKON57LBMRU2ACXUKRCTU5FCBPX4AN2.pdf, 
> PDFBOX-6020-DE102016007628A1-p2-219.txt, 
> PDFBOX-6020-DE102016007628A1-p2-305.txt, PDFBOX-6020-DE102016007628A1-p2.pdf, 
> image-2025-06-13-11-08-06-836.png, image-2025-06-13-11-10-27-514.png
>
>
> I'm currently upgrading our usage of PDFBox from version 2.0.19 to 3.0.5. 
> This has worked fine so far, but one of our JUnit tests failed with the new 
> version. During text extraction with PDFTextStripper, unnecessary line 
> feeds were created for a line that contained subscript as well as 
> superscript text. While debugging the issue I found some changes that were 
> made in the method PDFTextStripper.writePage(). I think maxYForLine, 
> maxHeightForLine and minYTopForLine, which are used for the overlap check, 
> are reset too often.
>  
> There's a check made with the value of 'Math.abs(position.getX() - 
> lastPosition.getTextPosition().getX())'. But I think it might have to be 
> changed to 'Math.abs(position.getX() - (lastPosition.getTextPosition().getX() 
> + lastPosition.getTextPosition().getWidth()))' to find relevant gaps.
>  
> An example PDF can be downloaded from 
> [here|https://patentimages.storage.googleapis.com/57/b2/2f/3b5ffe86d83ef5/DE102016007628A1.pdf]
>  
> The text line we had problems with was on page 2: 'gin-Anion 
> (12-Wolframato-1-phosphat) kann durch die Summenformel [P(W12O40)]3- 
> beschrieben werden oder'. 
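
To illustrate the two distance measures discussed in the issue description, here is a minimal, self-contained sketch. It does not use PDFBox: the Glyph class is a hypothetical stand-in for the x position and width of a TextPosition, and the class and method names are made up for this example. How writePage() compares the value against a threshold is not shown in the report and is not modelled here.

```java
public class GapCheckSketch {

    // Hypothetical stand-in for the x position and width of a text position.
    static final class Glyph {
        final float x;
        final float width;

        Glyph(float x, float width) {
            this.x = x;
            this.width = width;
        }
    }

    // The measure quoted in the report: distance between the left edges
    // of the previous and the current glyph.
    static float startToStart(Glyph last, Glyph current) {
        return Math.abs(current.x - last.x);
    }

    // The measure proposed in the report: distance from the right edge of
    // the previous glyph to the left edge of the current glyph.
    static float endToStart(Glyph last, Glyph current) {
        return Math.abs(current.x - (last.x + last.width));
    }

    public static void main(String[] args) {
        // A wide glyph followed closely by a narrow one (values are invented).
        Glyph last = new Glyph(100f, 12f);
        Glyph current = new Glyph(112.5f, 4f);

        System.out.println("start-to-start: " + startToStart(last, current));
        System.out.println("end-to-start:   " + endToStart(last, current));
    }
}
```

With these invented values the first measure is 12.5 while the second is 0.5: the width of the previous glyph dominates the first value, so it says little about the actual space between the two glyphs.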



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
