[
https://issues.apache.org/jira/browse/PDFBOX-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18076894#comment-18076894
]
Nirmal Tandel edited comment on PDFBOX-6188 at 4/28/26 3:43 PM:
----------------------------------------------------------------
Hi [~tilman], the issue is not related to rotated text. The problem is that
the {*}visual text order differs from the internal text order{*}.
The drawing order is incorrect. For example, text such as {{04}} is drawn after
{{A11.10}} even though visually it appears to the left. The content stream also
contains irregular X-coordinate progression within what should be a single
left-to-right text run, for example:
* x = 1110 (digit 0)
* x = 1071 (next digit 0, jumps backward by about 40 units)
* x = 1075 (digit 4, still out of order)
This violates the normal left-to-right expectation for text on the same line,
so PDFBox misorders the extracted text. As a result, the scanner finds only
9 of the 22 places where sheet {{A11.10}} is linked on sheet {{A02.09}}.
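To make the backward jump concrete, here is a minimal plain-Java sketch that does not use PDFBox. The class and method names are hypothetical, and the input is only the three sample X values quoted above, not the full run from the PDF. It reports the index of the first glyph that is drawn to the left of its predecessor.

```java
public class XProgressionCheck {
    /**
     * Returns the index of the first glyph whose X coordinate is smaller
     * than its predecessor's, or -1 if X never decreases.
     */
    public static int firstBackwardJump(double[] xs) {
        for (int i = 1; i < xs.length; i++) {
            if (xs[i] < xs[i - 1]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Sample X values quoted in this comment (drawing order).
        double[] sample = {1110, 1071, 1075};
        System.out.println(firstBackwardJump(sample)); // prints 1
    }
}
```

A check like this only flags a run as suspicious; it is an illustration of the symptom, not a proposed fix inside PDFBox.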
Some references to *A11.10* were found in *A209* because their visual text
order and internal drawing order are the same. The remaining references were
not found because their visual text order and internal drawing order differ.
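The effect on searching can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the {{Glyph}} class, the helper names, and all coordinates are hypothetical and are not taken from the attached PDFs or from the PDFBox API. One run is drawn left to right; the other has the same glyphs at the same positions but is drawn out of order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SearchOrderDemo {
    /** Hypothetical positioned glyph; not a PDFBox type. */
    public static final class Glyph {
        final double x;
        final char c;

        public Glyph(double x, char c) {
            this.x = x;
            this.c = c;
        }
    }

    /** Builds the text of one line, either in drawing order or sorted by X. */
    public static String text(List<Glyph> glyphs, boolean sortByX) {
        List<Glyph> ordered = new ArrayList<>(glyphs);
        if (sortByX) {
            ordered.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        }
        StringBuilder sb = new StringBuilder();
        for (Glyph g : ordered) {
            sb.append(g.c);
        }
        return sb.toString();
    }

    /** Glyphs drawn left to right: drawing order equals visual order. */
    public static List<Glyph> inOrderRun() {
        return Arrays.asList(
                new Glyph(100, 'A'), new Glyph(110, '1'), new Glyph(120, '1'),
                new Glyph(130, '.'), new Glyph(140, '1'), new Glyph(150, '0'));
    }

    /** Same glyphs and positions, but the last two are drawn first. */
    public static List<Glyph> outOfOrderRun() {
        return Arrays.asList(
                new Glyph(140, '1'), new Glyph(150, '0'), new Glyph(100, 'A'),
                new Glyph(110, '1'), new Glyph(120, '1'), new Glyph(130, '.'));
    }

    public static void main(String[] args) {
        String needle = "A11.10";
        System.out.println(text(inOrderRun(), false).contains(needle));    // true
        System.out.println(text(outOfOrderRun(), false).contains(needle)); // false
        System.out.println(text(outOfOrderRun(), true).contains(needle));  // true
    }
}
```

In drawing order the second run reads {{10A11.}}, so a plain substring search misses it even though the page looks identical.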
!image-2026-04-28-11-41-23-979.png|width=502,height=255!
We can work around it with the following approach, but it causes a major
performance problem.
Use a two-pass scan and merge the results:
# Pass 1: {{setSortByPosition(false)}}. This preserves drawing order and
continues to work for PDFs containing diagonal or 45-degree text.
# Pass 2: {{setSortByPosition(true)}}. This handles PDFs whose content stream
has out-of-order character positions, such as the A02.09 / A11.10 case.
# Merge both result sets. Combine the results from both passes so that neither
PDF type is penalized.
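The merge step can be sketched roughly as below. This is a simplified illustration, not our production code: the {{Hit}} class is hypothetical, and the {{scanner}} function is a stand-in for one full search run with {{setSortByPosition}} set to the given flag. It also assumes that the same occurrence is reported at nearly the same position in both passes, so that rounding the coordinates is enough to de-duplicate.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.function.Function;

public class TwoPassMerge {
    /** Hypothetical search hit: page index plus position rounded to whole units. */
    public static final class Hit {
        final int page;
        final long x;
        final long y;

        public Hit(int page, double x, double y) {
            this.page = page;
            this.x = Math.round(x);
            this.y = Math.round(y);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Hit)) {
                return false;
            }
            Hit other = (Hit) o;
            return page == other.page && x == other.x && y == other.y;
        }

        @Override
        public int hashCode() {
            return Objects.hash(page, x, y);
        }
    }

    /**
     * Runs the scanner twice (flag false, then true) and merges the hits,
     * keeping first-seen order and dropping duplicates.
     */
    public static List<Hit> scanAndMerge(Function<Boolean, List<Hit>> scanner) {
        Set<Hit> merged = new LinkedHashSet<>();
        merged.addAll(scanner.apply(false)); // pass 1: drawing order
        merged.addAll(scanner.apply(true));  // pass 2: sorted by position
        return new ArrayList<>(merged);
    }
}
```

Because each call to the scanner is a complete extraction of the document, the cost is roughly double that of a single pass, which is the performance problem described above.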
What I am looking for is a way to handle this
{*}without requiring two passes{*}. Is that possible?
I have shared the actual PDFs that cause the issue:
# Source file: Src_A209.pdf
# Reference file: Ref_A1110.pdf
> PDFTextStripper misses text occurrences in PDFs with out-of-order character
> drawing when setSortByPosition(false)
> -----------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-6188
> URL: https://issues.apache.org/jira/browse/PDFBOX-6188
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.29, 3.0.7 PDFBox
> Reporter: Nirmal Tandel
> Priority: Blocker
> Attachments: A151_src.pdf, A151_src.txt, A403_ref.pdf, Ref_A1110.pdf,
> Src_A0209.pdf, image-2026-04-06-09-22-14-452.png,
> image-2026-04-28-11-37-57-266.png, image-2026-04-28-11-41-23-979.png
>
>
> When using {{PDFTextStripper}} to search for text in a vector PDF, not all
> occurrences of the search string are found. The root cause is that the PDF
> content stream draws characters in non-left-to-right visual order. With
> {{setSortByPosition(false)}} (the default), PDFBox respects drawing order and
> produces garbled token groupings, causing text searches to miss valid
> matches. With {{setSortByPosition(true)}}, PDFBox fixes those cases but
> breaks extraction of PDFs containing rotated (e.g. 45-degree) text, where it
> groups diagonal glyphs with horizontal ones incorrectly.
> Another example:
> With {{setSortByPosition(false)}}, PDFBox does not extract text by visual
> position. It extracts in drawing order (the order in which text commands
> appear in the PDF content stream), and in this PDF the drawing order does not
> match the visual order: “04” is drawn after “A11.10”, even though visually it
> is on the left.
> The PDF content stream is producing characters with irregular (not
> consistently increasing) X coordinates, such as:
> x = 1110 (digit 0)
> x = 1071 (next digit 0 — jumps backward ~40 units)
> x = 1075 (digit 4 — still out of order)
> This breaks the usual expectation that text on the same line flows
> left to right, and it causes text extraction libraries such as PDFBox to
> misorder the text.
> !image-2026-04-06-09-22-14-452.png!
> h3. Steps to Reproduce
> # Open the affected PDF page ({{A151}}) using
> {{PDDocument.load(...)}}.
> # Use {{PDFTextStripper}} to extract text or locate all occurrences of the
> string {{A403}} via {{PDFTextStripperByArea}} or a custom subclass.
> # With {{setSortByPosition(false)}} (default): only *2 of the 4* actual
> occurrences of {{A403}} on the page are found.
> # With {{setSortByPosition(true)}}: more occurrences are found on this
> page, but other PDFs whose content streams contain 45-degree / diagonal text
> are broken: PDFBox merges diagonal glyphs with horizontal glyphs, producing
> incorrect word groupings.
>
> We work around the issue in our application with two passes. *Pass 1*,
> {{setSortByPosition(false)}}: preserves stream order; correct for
> diagonal/rotated-text PDFs. *Pass 2*, {{setSortByPosition(true)}}:
> corrects for out-of-order character drawing. However, running both passes
> causes a major performance issue.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)