Tilman Hausherr closed PDFBOX-6046.
-----------------------------------
Resolution: Not A Bug
> PDFTextStripper: Sorting issue with overlaying text
> ---------------------------------------------------
>
> Key: PDFBOX-6046
> URL: https://issues.apache.org/jira/browse/PDFBOX-6046
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Reporter: Oliver Schmidtmer
> Priority: Major
> Attachments: 10600601393673.ANF - 20.03.2025, 08_57_48.pdf,
> PDFBOX-6046-reduced.pdf, image-2025-07-28-20-24-32-787.png
>
>
> We found an issue with PDFTextStripper when text is "layered", in this
> case with some spaces used as placeholders.
> The PDFs in question are order templates, which are filled with data in
> a second step.
> If the text is extracted in order of occurrence in the PDF source, the first
> half consists of the field labels and the second half of the field values.
> So we need sorting by rendered position with
> PDFTextStripper#setSortByPosition(true).
> With sorting enabled, the first field in the file, which should read
> "Auftraggeber: NAGEL-GROUP",
> is extracted as
> "Auftraggeber: N AGEL-GROUP", with a space after the "N".
> !image-2025-07-28-20-24-32-787.png|width=440,height=62!
> This is caused by spaces after "Auftraggeber: " that serve as a placeholder
> in the template and overlap the first glyph of the field value.
>
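The effect described in the report can be reproduced with a simplified model that does not use PDFBox at all. The sketch below is an illustration only: the Glyph class, the layout and sortedByX helpers, and all coordinates are hypothetical, and ordering glyphs purely by their x coordinate is an assumption made for the demo, not PDFTextStripper's actual sorting algorithm. It shows how a placeholder space that overlaps the first glyph of the field value ends up between "N" and "A" once glyphs are ordered by position.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OverlapSortDemo {
    /** Hypothetical stand-in for a positioned glyph; this is not a PDFBox class. */
    static final class Glyph {
        final String text;
        final double x;

        Glyph(String text, double x) {
            this.text = text;
            this.x = x;
        }
    }

    /** Places each character of s left to right, starting at startX, with a fixed advance. */
    static List<Glyph> layout(String s, double startX, double advance) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            glyphs.add(new Glyph(String.valueOf(s.charAt(i)), startX + i * advance));
        }
        return glyphs;
    }

    /** Merges template and value glyphs and concatenates them in ascending x order. */
    static String sortedByX(List<Glyph> template, List<Glyph> values) {
        List<Glyph> all = new ArrayList<>(template);
        all.addAll(values);
        all.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : all) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Template run: the label plus two placeholder spaces, at x = 65 and x = 70.
        List<Glyph> template = layout("Auftraggeber:  ", 0, 5);
        // Value run drawn later: "N" at x = 68 sits between the two placeholder spaces.
        List<Glyph> values = layout("NAGEL-GROUP", 68, 6);
        System.out.println(sortedByX(template, values));
        // prints "Auftraggeber: N AGEL-GROUP"
    }
}
```

In this model the second placeholder space (x = 70) sorts after "N" (x = 68) and before "A" (x = 74), which matches the extracted text quoted in the report.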