[
https://issues.apache.org/jira/browse/PDFBOX-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr updated PDFBOX-6020:
------------------------------------
Description:
I'm currently upgrading our usage of PDFBox from version 2.0.19 to 3.0.5.
This has worked out fine so far, but one of our JUnit tests failed with the new version.
During text extraction with PDFTextStripper, unnecessary line feeds were
created for a line that contained both subscript and superscript text. While
debugging the issue I found some changes that were made in the method
PDFTextStripper.writePage(). I think maxYForLine, maxHeightForLine and
minYTopForLine, which are used for the overlap check, are reset too often.
There's a check made with the value of 'Math.abs(position.getX() -
lastPosition.getTextPosition().getX())'. But I think it might have to be
changed to 'Math.abs(position.getX() - (lastPosition.getTextPosition().getX() +
lastPosition.getTextPosition().getWidth()))' to find relevant gaps.
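
To illustrate the difference between the two expressions, here is a minimal, self-contained sketch. It does not use PDFBox: the Glyph class, the coordinates and the threshold are hypothetical stand-ins for the x and width values read from TextPosition, chosen only to show how a wide previous glyph can look like a gap when only the start positions are compared.

```java
public class GapCheckSketch {

    // Hypothetical stand-in for the two values the check reads from a TextPosition.
    public static final class Glyph {
        final float x;      // left edge of the glyph
        final float width;  // horizontal extent of the glyph

        public Glyph(float x, float width) {
            this.x = x;
            this.width = width;
        }
    }

    // Distance between the left edges of both glyphs (the expression quoted first).
    public static float startToStart(Glyph last, Glyph current) {
        return Math.abs(current.x - last.x);
    }

    // Distance from the right edge of the previous glyph to the left edge
    // of the current one (the expression proposed as a replacement).
    public static float endToStart(Glyph last, Glyph current) {
        return Math.abs(current.x - (last.x + last.width));
    }

    public static void main(String[] args) {
        // Assumed example values: a wide glyph directly followed by a narrow one.
        Glyph last = new Glyph(100f, 12f);
        Glyph current = new Glyph(112.5f, 4f);
        float threshold = 5f; // hypothetical gap threshold, not PDFBox's actual value

        System.out.println("start-to-start gap: " + startToStart(last, current)
                + ", exceeds threshold: " + (startToStart(last, current) > threshold));
        System.out.println("end-to-start gap:   " + endToStart(last, current)
                + ", exceeds threshold: " + (endToStart(last, current) > threshold));
    }
}
```

With these assumed values the start-to-start distance is 12.5 and exceeds the threshold, while the end-to-start distance is only 0.5 and does not, although the two glyphs are almost adjacent.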
An example PDF can be downloaded from
[here|https://patentimages.storage.googleapis.com/57/b2/2f/3b5ffe86d83ef5/DE102016007628A1.pdf]
The text line we had problems with is on page 2: 'gin-Anion
(12-Wolframato-1-phosphat) kann durch die Summenformel [P(W12O40)]3-
beschrieben werden oder'.
> mix of subscript and superscript can lead to unnecessary new lines during
> text extraction
> -----------------------------------------------------------------------------------------
>
> Key: PDFBOX-6020
> URL: https://issues.apache.org/jira/browse/PDFBOX-6020
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 3.0.5 PDFBox
> Reporter: Markus Seifert
> Priority: Minor