[jira] [Commented] (PDFBOX-6020) mix of subscript and superscript can lead to unnecessary new lines during text extraction

Markus Seifert (Jira) Fri, 13 Jun 2025 00:48:16 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17968811#comment-17968811
 ]


Markus Seifert commented on PDFBOX-6020:
----------------------------------------

I just noticed that the PDF is split into single pages and some processing is 
done on it (e.g. merging of separated words, detection of sub/sup/bold/italic). 
Attached is a PDF with 're-embedded' text information, and with that one the 
line feed problem can be reproduced. Sorry for initially pointing to the 
'wrong' PDF. I used the following code to test it:

 

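import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// The class name is arbitrary; only main() comes from the test itself.
public class ExtractPage2 {
    public static void main(String[] args) throws IOException {
        // Load the attached single-page PDF and print the extracted text.
        try (PDDocument pdfDoc = Loader.loadPDF(new File("DE102016007628A1-Page2.pdf"))) {
            System.out.println(new PDFTextStripper().getText(pdfDoc));
        }
    }
}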

 

With lines 681-682 in PDFTextStripper as

    if (Math.abs(position.getX() - lastPosition.getTextPosition().getX())
            > (wordSpacing + deltaSpace))

we get a line feed in front of '3–'. When changing the condition to

    if (Math.abs(position.getX() - (lastPosition.getTextPosition().getX()
            + lastPosition.getTextPosition().getWidth()))
            > (wordSpacing + deltaSpace))

the line feed disappears.
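
To make the difference between the two conditions concrete, here is a minimal, 
self-contained sketch that does not use PDFBox at all. The class and method 
names are hypothetical and the coordinates and threshold are made-up numbers, 
not values taken from the attached PDF. It only shows that measuring from the 
start of the previous glyph includes that glyph's own width in the "gap", 
whereas measuring from its end does not.

```java
public class GapCheckSketch {

    // Variant 1: distance between the start of the previous glyph and the
    // start of the current one (the form of the existing condition).
    static boolean gapFromStart(float lastX, float currentX, float threshold) {
        return Math.abs(currentX - lastX) > threshold;
    }

    // Variant 2: distance between the END of the previous glyph
    // (x + width) and the start of the current one (the proposed form).
    static boolean gapFromEnd(float lastX, float lastWidth, float currentX, float threshold) {
        return Math.abs(currentX - (lastX + lastWidth)) > threshold;
    }

    public static void main(String[] args) {
        // Made-up numbers: a 12-unit-wide glyph at x=100, followed almost
        // immediately by the next glyph at x=112.5.
        float lastX = 100f;
        float lastWidth = 12f;
        float currentX = 112.5f;
        float threshold = 5f; // stands in for wordSpacing + deltaSpace

        System.out.println(gapFromStart(lastX, currentX, threshold));          // true
        System.out.println(gapFromEnd(lastX, lastWidth, currentX, threshold)); // false
    }
}
```

With these illustrative numbers the first variant reports a gap although the 
two glyphs are only 0.5 units apart, simply because the previous glyph is wider 
than the threshold.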

> mix of subscript and superscript can lead to unnecessary new lines during 
> text extraction
> -----------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6020
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6020
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.5 PDFBox
>            Reporter: Markus Seifert
>            Priority: Minor
>         Attachments: DE102016007628A1-Page2.pdf, 
> PDFBOX-6020-DE102016007628A1-p2-219.txt, 
> PDFBOX-6020-DE102016007628A1-p2-305.txt, PDFBOX-6020-DE102016007628A1-p2.pdf
>
>
> I'm currently upgrading our usage of PDFBox from version 2.0.19 to 3.0.5. 
> This has worked fine so far, but one of our JUnit tests failed with the new 
> version. During text extraction with PDFTextStripper, unnecessary line 
> feeds were created for a line that contained subscript as well as 
> superscript text. While debugging the issue I found some changes that were 
> made in the method PDFTextStripper.writePage(). I think maxYForLine, 
> maxHeightForLine and minYTopForLine, which are used for the overlap check, 
> are reset too often.
>  
> There's a check made with the value of 'Math.abs(position.getX() - 
> lastPosition.getTextPosition().getX())'. But I think it might have to be 
> changed to 'Math.abs(position.getX() - (lastPosition.getTextPosition().getX() 
> + lastPosition.getTextPosition().getWidth()))' to find relevant gaps.
>  
> An example PDF can be downloaded from 
> [here|https://patentimages.storage.googleapis.com/57/b2/2f/3b5ffe86d83ef5/DE102016007628A1.pdf]
>  
> The line of text we had problems with was on page 2: 'gin-Anion 
> (12-Wolframato-1-phosphat) kann durch die Summenformel [P(W12O40)]3- 
> beschrieben werden oder'. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

