[
https://issues.apache.org/jira/browse/PDFBOX-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18071437#comment-18071437
]
Tilman Hausherr commented on PDFBOX-6188:
-----------------------------------------
This is a known problem. You could try the rotationMagic option in the command
line tool ExtractText.java, whose source code you can find here:
https://github.com/apache/pdfbox/blob/3.0/tools/src/main/java/org/apache/pdfbox/tools/ExtractText.java
In that file, look for {{rotationMagic}}. Here's how the extraction looks:
[^A151_src.txt]
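
To illustrate the general idea of handling mixed text rotations, here is a minimal sketch that buckets glyphs by text angle so that each direction can be processed on its own. It does not use PDFBox and is not the code from ExtractText.java; {{AngledGlyph}} and {{AngleGroupingSketch}} are hypothetical names, and the angle is assumed to be already known for each glyph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a positioned glyph; this is not a PDFBox class.
final class AngledGlyph {
    final String text;
    final double x;
    final double y;
    final int angleDegrees; // text direction, assumed rounded to whole degrees

    AngledGlyph(String text, double x, double y, int angleDegrees) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.angleDegrees = angleDegrees;
    }
}

final class AngleGroupingSketch {
    // Buckets glyphs by angle, keeping the drawing order inside each bucket.
    static Map<Integer, List<AngledGlyph>> groupByAngle(List<AngledGlyph> glyphs) {
        Map<Integer, List<AngledGlyph>> groups = new TreeMap<>();
        for (AngledGlyph g : glyphs) {
            groups.computeIfAbsent(g.angleDegrees, k -> new ArrayList<>()).add(g);
        }
        return groups;
    }

    // Concatenates the glyph texts of one bucket in the order they are stored.
    static String textOf(List<AngledGlyph> group) {
        StringBuilder sb = new StringBuilder();
        for (AngledGlyph g : group) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

With horizontal and 45-degree glyphs interleaved in the input, each bucket then contains only glyphs of one direction, so diagonal text can no longer be merged into a horizontal line.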
> PDFTextStripper misses text occurrences in PDFs with out-of-order character
> drawing when setSortByPosition(false)
> -----------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-6188
> URL: https://issues.apache.org/jira/browse/PDFBOX-6188
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.29, 3.0.7 PDFBox
> Reporter: Nirmal Tandel
> Priority: Blocker
> Attachments: A151_src.pdf, A151_src.txt, A403_ref.pdf,
> image-2026-04-06-09-22-14-452.png
>
>
> When using {{PDFTextStripper}} to search for text in a vector PDF, not all
> occurrences of the search string are found. The root cause is that the PDF
> content stream draws characters in non-left-to-right visual order. With
> {{setSortByPosition(false)}} (the default), PDFBox respects drawing order and
> produces garbled token groupings, causing text searches to miss valid
> matches. With {{setSortByPosition(true)}}, PDFBox fixes those cases but
> breaks extraction of PDFs containing rotated (e.g. 45-degree) text, where it
> groups diagonal glyphs with horizontal ones incorrectly.
> Another example is as follows:
> PDF extractors do not extract the text by visual position when
> {{setSortByPosition(false)}} is set. They extract in drawing order (the order
> text commands appear in the PDF stream), and in this PDF the drawing order is
> wrong: “04” is drawn after “A11.10”, even though visually it's on the left.
> The PDF content stream is producing characters with irregular (not
> consistently increasing) X coordinates, such as:
> x = 1110 (digit 0)
> x = 1071 (next digit 0 — jumps backward ~40 units)
> x = 1075 (digit 4 — still out of order)
> This violates the PDF specification’s expectation that text in the same line
> flows left-to-right. This causes all text extraction libraries, including
> PDFBox, to misorder text.
> !image-2026-04-06-09-22-14-452.png!
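
The backward jumps described above can be detected with a simple scan over the X coordinates in drawing order. This is only a sketch on plain numbers, not PDFBox code; {{BackwardJumpSketch}} is a hypothetical name, and the sample coordinates are the three values quoted above.

```java
import java.util.ArrayList;
import java.util.List;

final class BackwardJumpSketch {
    // Returns the indices (in drawing order) where X is smaller than the
    // X of the glyph drawn immediately before it.
    static List<Integer> backwardJumps(double[] xInDrawingOrder) {
        List<Integer> jumps = new ArrayList<>();
        for (int i = 1; i < xInDrawingOrder.length; i++) {
            if (xInDrawingOrder[i] < xInDrawingOrder[i - 1]) {
                jumps.add(i);
            }
        }
        return jumps;
    }
}
```

For the sequence 1110, 1071, 1075 the scan reports one backward jump, at index 1 (1110 to 1071); the step from 1071 to 1075 moves forward again.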
> h3. Steps to Reproduce
> # Open the affected PDF page ({{A151}}) using
> {{PDDocument.load(...)}}.
> # Use {{PDFTextStripper}} to extract text or locate all occurrences of the
> string {{A403}} via {{PDFTextStripperByArea}} or a custom subclass.
> # With {{setSortByPosition(false)}} (default): only *2 of the 4* actual
> occurrences of {{A403}} on the page are found.
> # With {{setSortByPosition(true)}}: more occurrences are found on this
> page, but other PDFs whose content streams contain 45-degree / diagonal text
> are broken — PDFBox merges diagonal glyphs with horizontal glyphs, producing
> incorrect word groupings.
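
The difference between the two modes in the steps above can be reproduced on a toy model without PDFBox. The sketch below joins hypothetical glyphs once in drawing order and once sorted by position, then counts matches of the search string. {{PlacedGlyph}} and {{SearchOrderSketch}} are made-up names, lines are assumed to share an exact Y value, and PDFBox's real position sorting is more involved than this comparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical glyph with a position; this is not a PDFBox class.
final class PlacedGlyph {
    final String text;
    final double x;
    final double y;

    PlacedGlyph(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class SearchOrderSketch {
    // Returns a copy sorted by Y, then by X (toy stand-in for position sorting).
    static List<PlacedGlyph> sortedByPosition(List<PlacedGlyph> glyphs) {
        List<PlacedGlyph> copy = new ArrayList<>(glyphs);
        copy.sort(Comparator.comparingDouble((PlacedGlyph g) -> g.y)
                .thenComparingDouble(g -> g.x));
        return copy;
    }

    // Joins glyph texts in list order, starting a new line when Y changes.
    static String join(List<PlacedGlyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < glyphs.size(); i++) {
            if (i > 0 && glyphs.get(i).y != glyphs.get(i - 1).y) {
                sb.append('\n');
            }
            sb.append(glyphs.get(i).text);
        }
        return sb.toString();
    }

    // Counts occurrences of needle in haystack.
    static int count(String haystack, String needle) {
        int n = 0;
        for (int i = haystack.indexOf(needle); i >= 0; i = haystack.indexOf(needle, i + 1)) {
            n++;
        }
        return n;
    }
}
```

If one line is drawn as A, 4, 0, 3 and another as 0, 3, A, 4 (with X positions that still spell A403 visually), searching the drawing-order text finds one match, while searching the position-sorted text finds both.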