Oliver Schmidtmer created PDFBOX-6046:
-----------------------------------------
Summary: PDFTextStripper: Sorting issue with overlaying text
Key: PDFBOX-6046
URL: https://issues.apache.org/jira/browse/PDFBOX-6046
Project: PDFBox
Issue Type: Bug
Reporter: Oliver Schmidtmer
Attachments: 10600601393673.ANF - 20.03.2025, 08_57_48.pdf,
image-2025-07-28-20-24-32-787.png
We found an issue with PDFTextStripper when text is "layered", in this case
with some spaces used as placeholders.
The PDFs in question are templates for orders, which are filled with data in a
second step.
If the text is extracted in order of occurrence in the PDF source, the first
half consists of the field labels and the second half of the field values. So
we need sorting by rendered position with
PDFTextStripper#setSortByPosition(true).
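To illustrate why sorting by position is needed for such templates, here is a
minimal, library-free sketch. It is not PDFBox code: the Chunk class, the
coordinates, and the second label/value pair ("Label2:", "VALUE2") are
hypothetical, and y is assumed to grow downward for simplicity. It only models
the difference between reading chunks in content-stream order and reading them
in rendered order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortOrderSketch {
    // Hypothetical text chunk: a string plus the position where it is rendered.
    static final class Chunk {
        final String text;
        final double x;
        final double y;
        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Chunks in the order they occur in the (imagined) content stream:
    // all labels from the template first, then all filled-in values.
    static List<Chunk> streamOrder() {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Auftraggeber:", 0, 0));
        chunks.add(new Chunk("Label2:", 0, 20));      // hypothetical label
        chunks.add(new Chunk("NAGEL-GROUP", 150, 0));
        chunks.add(new Chunk("VALUE2", 150, 20));     // hypothetical value
        return chunks;
    }

    // Joins the chunk texts with single spaces, in list order.
    static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    // Text as it would come out without any sorting.
    static String inStreamOrder() {
        return join(streamOrder());
    }

    // Text after sorting top-to-bottom, then left-to-right.
    static String byPosition() {
        List<Chunk> sorted = new ArrayList<>(streamOrder());
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return join(sorted);
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder());
        System.out.println(byPosition());
    }
}
```

In this model only the position-sorted output keeps each label next to its
value, which is the behaviour we rely on setSortByPosition(true) for.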
Taking the first example in the attached file, what should be
"Auftraggeber: NAGEL-GROUP"
is extracted as
"Auftraggeber: N AGEL-GROUP", with an extra space after the "N".
!image-2025-07-28-20-24-32-787.png|width=440,height=62!
This is caused by spaces after "Auftraggeber: " as a placeholder in the
template, which overlap with the first glyph of the field value.
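The effect can be reproduced with a simplified model. This is only a sketch
under assumptions, not the actual PDFTextStripper algorithm: the Glyph class,
the glyph spacing of 10 units, the three placeholder spaces, and the start
position of the value are all hypothetical. Glyphs are sorted purely by their
left edge and runs of spaces are collapsed into one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OverlapSketch {
    // Hypothetical glyph: one character and the x position of its left edge.
    static final class Glyph {
        final char ch;
        final double x;
        Glyph(char ch, double x) {
            this.ch = ch;
            this.x = x;
        }
    }

    // Adds one glyph per character, starting at startX and advancing by step.
    static void addRun(List<Glyph> out, String text, double startX, double step) {
        for (int i = 0; i < text.length(); i++) {
            out.add(new Glyph(text.charAt(i), startX + i * step));
        }
    }

    // Sorts glyphs by left edge, concatenates them, and collapses space runs.
    static String extractSorted(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.ch);
        }
        return sb.toString().replaceAll(" +", " ");
    }

    static String demo() {
        List<Glyph> glyphs = new ArrayList<>();
        // Template layer: the label followed by placeholder spaces.
        addRun(glyphs, "Auftraggeber:", 0, 10);  // x = 0 .. 120
        addRun(glyphs, "   ", 130, 10);          // spaces at x = 130, 140, 150
        // Data layer: the value starts inside the placeholder area, so the
        // last space (x = 150) lies between "N" (x = 145) and "A" (x = 155).
        addRun(glyphs, "NAGEL-GROUP", 145, 10);
        return extractSorted(glyphs);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With these assumed positions, the last placeholder space sorts between the "N"
and the "A" of the value, giving the same kind of output as reported above.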