Nirmal Tandel created PDFBOX-6188:
-------------------------------------
Summary: PDFTextStripper misses text occurrences in PDFs with
out-of-order character drawing when setSortByPosition(false)
Key: PDFBOX-6188
URL: https://issues.apache.org/jira/browse/PDFBOX-6188
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 3.0.7 PDFBox, 2.0.29
Reporter: Nirmal Tandel
Attachments: A151_src.pdf, A403_ref.pdf
When using {{PDFTextStripper}} to search for text in a vector PDF, not all
occurrences of the search string are found. The root cause is that the PDF
content stream draws characters in an order that differs from their
left-to-right visual order. With {{setSortByPosition(false)}} (the default),
PDFBox follows the drawing order and produces garbled token groupings, so text
searches miss valid matches. {{setSortByPosition(true)}} fixes those cases but
breaks extraction of PDFs containing rotated (e.g. 45-degree) text, where
diagonal glyphs are incorrectly grouped with horizontal ones.
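
To make the drawing-order effect concrete, here is a minimal, self-contained sketch that does not use PDFBox at all. The {{Glyph}} class, the coordinates and the helper names are hypothetical stand-ins chosen for illustration; they are not taken from the attached PDFs or from the PDFBox API. It models two occurrences of {{A403}} on one line, the second drawn out of visual order, and counts matches in drawing order versus left-to-right order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DrawingOrderDemo {
    /** Hypothetical stand-in for a positioned glyph; not a PDFBox class. */
    static final class Glyph {
        final char ch;
        final float x;
        Glyph(char ch, float x) { this.ch = ch; this.x = x; }
    }

    /** Concatenates glyphs in list order (models reading in drawing order). */
    static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) sb.append(g.ch);
        return sb.toString();
    }

    /** Returns a copy sorted left to right (models position sorting on one line). */
    static List<Glyph> sortedByX(List<Glyph> glyphs) {
        List<Glyph> copy = new ArrayList<>(glyphs);
        copy.sort(Comparator.comparingDouble(g -> g.x));
        return copy;
    }

    /** Counts non-overlapping occurrences of needle in text. */
    static int count(String text, String needle) {
        int n = 0;
        for (int i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + needle.length())) n++;
        return n;
    }

    /** Made-up data: first "A403" drawn in order, second drawn as 0, A, 3, 4. */
    static List<Glyph> sample() {
        List<Glyph> g = new ArrayList<>();
        g.add(new Glyph('A', 0f));   g.add(new Glyph('4', 10f));
        g.add(new Glyph('0', 20f));  g.add(new Glyph('3', 30f));
        g.add(new Glyph('0', 120f)); g.add(new Glyph('A', 100f));
        g.add(new Glyph('3', 130f)); g.add(new Glyph('4', 110f));
        return g;
    }

    public static void main(String[] args) {
        String drawn = join(sample());
        String sorted = join(sortedByX(sample()));
        System.out.println(drawn + " -> " + count(drawn, "A403"));   // A4030A34 -> 1
        System.out.println(sorted + " -> " + count(sorted, "A403")); // A403A403 -> 2
    }
}
```

In this toy model the drawing-order string hides one of the two occurrences, which is the same kind of miss as the 2-of-4 result reported here.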
h3. Steps to Reproduce
# Open the affected PDF page ({{A151}}) using {{PDDocument.load(...)}} on
2.0.x or {{Loader.loadPDF(...)}} on 3.0.x.
# Use {{PDFTextStripper}} to extract text or locate all occurrences of the
string {{A403}} via {{PDFTextStripperByArea}} or a custom subclass.
# With {{setSortByPosition(false)}} (default): only *2 of the 4* actual
occurrences of {{A403}} on the page are found.
# With {{setSortByPosition(true)}}: more occurrences are found on this
page, but other PDFs whose content streams contain 45-degree (diagonal) text
are broken: PDFBox merges diagonal glyphs with horizontal glyphs, producing
incorrect word groupings.
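
The rotated-text failure in the last step can be illustrated with a second simplified model, again without PDFBox. Everything in it is an assumption made for illustration: the {{Glyph}} class, the 10-unit line bucket, the coordinates and the idea of bucketing by rotation angle are hypothetical and do not describe how PDFBox sorts or how a fix should be implemented. It only shows why a sort that looks at position alone can interleave a diagonal run with a horizontal one, and that separating glyphs by angle first keeps {{A403}} intact.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RotatedTextDemo {
    /** Hypothetical positioned glyph with a rotation angle; not a PDFBox class. */
    static final class Glyph {
        final char ch; final float x; final float y; final int angleDeg;
        Glyph(char ch, float x, float y, int angleDeg) {
            this.ch = ch; this.x = x; this.y = y; this.angleDeg = angleDeg;
        }
    }

    /** Sorts by a coarse line bucket (y / 10, rounded), then by x, and joins. */
    static String readInPositionOrder(List<Glyph> glyphs) {
        List<Glyph> copy = new ArrayList<>(glyphs);
        copy.sort(Comparator.<Glyph>comparingInt(g -> Math.round(g.y / 10f))
                .thenComparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : copy) sb.append(g.ch);
        return sb.toString();
    }

    /** Splits glyphs by rotation angle, then position-sorts each group alone. */
    static Map<Integer, String> readByAngle(List<Glyph> glyphs) {
        Map<Integer, List<Glyph>> groups = new LinkedHashMap<>();
        for (Glyph g : glyphs) {
            groups.computeIfAbsent(g.angleDeg, k -> new ArrayList<>()).add(g);
        }
        Map<Integer, String> out = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<Glyph>> e : groups.entrySet()) {
            out.put(e.getKey(), readInPositionOrder(e.getValue()));
        }
        return out;
    }

    /** Made-up data: horizontal "A403" crossed by a 45-degree run "XY". */
    static List<Glyph> sample() {
        List<Glyph> g = new ArrayList<>();
        g.add(new Glyph('A', 0f, 100f, 0));
        g.add(new Glyph('4', 10f, 100f, 0));
        g.add(new Glyph('X', 15f, 100f, 45));
        g.add(new Glyph('0', 20f, 100f, 0));
        g.add(new Glyph('Y', 22f, 107f, 45));
        g.add(new Glyph('3', 30f, 100f, 0));
        return g;
    }

    public static void main(String[] args) {
        System.out.println(readInPositionOrder(sample())); // A4X03Y
        System.out.println(readByAngle(sample()));         // {0=A403, 45=XY}
    }
}
```

In the position-only reading the diagonal {{X}} lands inside {{A403}}, so a search for that string fails even though all four glyphs are present.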