[jira] [Comment Edited] (PDFBOX-6188) PDFTextStripper misses text occurrences in PDFs with out-of-order character drawing when setSortByPosition(false)

Nirmal Tandel (Jira) Tue, 28 Apr 2026 08:44:05 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18076894#comment-18076894
 ]


Nirmal Tandel edited comment on PDFBOX-6188 at 4/28/26 3:43 PM:
----------------------------------------------------------------

Hi [~tilman], the issue is not related to rotated text. The problem is that 
the {*}visual text order differs from the internal text order{*}.

The drawing order is incorrect. For example, text such as {{04}} is drawn after 
{{A11.10}} even though visually it appears to the left. The content stream also 
contains irregular X-coordinate progression within what should be a single 
left-to-right text run, for example:
 * x = 1110 (digit 0)

 * x = 1071 (next digit 0, jumps backward by about 40 units)

 * x = 1075 (digit 4, still out of order)

This violates the normal left-to-right expectation for text on the same line, 
so PDFBox misorders the extracted text. Because of this, the scanner only finds 
9 of the 22 places where sheet {{A11.10}} is linked on sheet {{A02.09}}.
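
To make the X-coordinate observation concrete, here is a minimal sketch that counts backward jumps in a run of glyph X positions given in drawing order. The class and method names are hypothetical and the code does not use PDFBox; it only assumes the coordinates have already been collected.

```java
import java.util.List;

/** Hypothetical helper: detects non-monotonic X progression in one text run. */
final class XOrderCheck {

    /**
     * @param xs glyph X coordinates in drawing (content stream) order
     * @return how many glyphs start to the left of their predecessor
     */
    static int countBackwardJumps(List<Double> xs) {
        int jumps = 0;
        for (int i = 1; i < xs.size(); i++) {
            if (xs.get(i) < xs.get(i - 1)) {
                jumps++;
            }
        }
        return jumps;
    }

    public static void main(String[] args) {
        // The three X values quoted above, in the order they are drawn.
        List<Double> drawingOrder = List.of(1110.0, 1071.0, 1075.0);
        System.out.println(countBackwardJumps(drawingOrder));
    }
}
```

A run with at least one backward jump is a candidate for the misordering described here; a run that only moves to the right is not.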

 
Some references to *A11.10* were found in *A209* because their visual text 
order and internal drawing order are the same. The remaining references were 
not found because their visual text order and internal drawing order differ:
!image-2026-04-28-11-41-23-979.png|width=502,height=255!

 

We can work around it with the following approach, but it causes a major 
performance problem.

Use a two-pass scan and merge the results:
 # Pass 1: {{setSortByPosition(false)}}. This preserves drawing order and 
continues to work for PDFs containing diagonal or 45-degree text.

 # Pass 2: {{setSortByPosition(true)}}. This handles PDFs whose content stream 
has out-of-order character positions, such as the A02.09 / A11.10 case.

 # Merge both result sets, so that neither PDF type is penalized.
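
The merge step could look roughly like the sketch below. It assumes each pass yields a list of hits (page number plus the x/y position of the match) and treats two hits as the same occurrence when they lie within a small tolerance of each other. The {{Hit}} type, the tolerance, and the method names are assumptions made for illustration; the two PDFBox passes themselves are not shown.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical merge of the hits found by the two extraction passes. */
final class HitMerger {

    /** One located occurrence of the search string (illustrative type). */
    static final class Hit {
        final int page;
        final double x;
        final double y;

        Hit(int page, double x, double y) {
            this.page = page;
            this.x = x;
            this.y = y;
        }
    }

    /** Same page and within tol units on both axes counts as the same hit. */
    static boolean sameSpot(Hit a, Hit b, double tol) {
        return a.page == b.page
                && Math.abs(a.x - b.x) <= tol
                && Math.abs(a.y - b.y) <= tol;
    }

    /** Keeps every pass-1 hit and adds pass-2 hits that are not already known. */
    static List<Hit> merge(List<Hit> pass1, List<Hit> pass2, double tol) {
        List<Hit> merged = new ArrayList<>(pass1);
        for (Hit candidate : pass2) {
            boolean known = false;
            for (Hit h : merged) {
                if (sameSpot(h, candidate, tol)) {
                    known = true;
                    break;
                }
            }
            if (!known) {
                merged.add(candidate);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Hit> pass1 = List.of(new Hit(1, 100.0, 200.0));
        List<Hit> pass2 = List.of(new Hit(1, 100.5, 200.2), new Hit(1, 300.0, 400.0));
        System.out.println(merge(pass1, pass2, 1.0).size());
    }
}
```

The merge itself is cheap; the cost we are concerned about comes from running the text extraction twice per page.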

 
Is there a way to handle this {*}without requiring two passes{*}?
I have shared the actual PDFs that cause the issue:
 # Source file: Src_A209.pdf
 # Reference file: Ref_A1110.pdf
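
On the single-pass question, one idea (a sketch only, not an existing PDFBox feature) would be to collect the glyphs once and sort them locally: group glyphs by rotation angle and by their offset perpendicular to the text direction, then sort each group along the text direction. Diagonal glyphs then never share a group with horizontal ones. The {{Glyph}} type and all names below are hypothetical. In PDFBox the data would presumably come from {{TextPosition}} objects gathered in an override of {{writeString}} or {{processTextPosition}}, and the angle would have to be derived from the text matrix; that wiring is not shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical single-pass regrouping of glyphs by angle and baseline. */
final class LocalSortSketch {

    /** Illustrative glyph record: text, position, rotation in degrees. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;
        final double angleDeg;

        Glyph(String text, double x, double y, double angleDeg) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.angleDeg = angleDeg;
        }

        /** Position projected onto the text direction. */
        double along() {
            double t = Math.toRadians(angleDeg);
            return x * Math.cos(t) + y * Math.sin(t);
        }

        /** Position projected onto the normal of the text direction. */
        double across() {
            double t = Math.toRadians(angleDeg);
            return -x * Math.sin(t) + y * Math.cos(t);
        }
    }

    /**
     * Returns one string per (angle, baseline band) group, in order of first
     * appearance. Banding by rounding is crude: glyphs near a band boundary
     * can be split, so a real implementation would need proper clustering.
     */
    static List<String> lines(List<Glyph> glyphs, double baselineTol) {
        Map<String, List<Glyph>> groups = new LinkedHashMap<>();
        for (Glyph g : glyphs) {
            long band = Math.round(g.across() / baselineTol);
            String key = Math.round(g.angleDeg) + ":" + band;
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(g);
        }
        List<String> out = new ArrayList<>();
        for (List<Glyph> group : groups.values()) {
            group.sort(Comparator.comparingDouble(Glyph::along));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : group) {
                sb.append(g.text);
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

Searching the regrouped lines for {{A11.10}} would then take a single extraction pass, at the price of maintaining the grouping logic ourselves.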



> PDFTextStripper misses text occurrences in PDFs with out-of-order character 
> drawing when setSortByPosition(false)
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6188
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6188
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.29, 3.0.7 PDFBox
>            Reporter: Nirmal Tandel
>            Priority: Blocker
>         Attachments: A151_src.pdf, A151_src.txt, A403_ref.pdf, Ref_A1110.pdf, 
> Src_A0209.pdf, image-2026-04-06-09-22-14-452.png, 
> image-2026-04-28-11-37-57-266.png, image-2026-04-28-11-41-23-979.png
>
>
> When using {{PDFTextStripper}} to search for text in a vector PDF, not all 
> occurrences of the search string are found. The root cause is that the PDF 
> content stream draws characters in non-left-to-right visual order. With 
> {{setSortByPosition(false)}} (the default), PDFBox respects drawing order and 
> produces garbled token groupings, causing text searches to miss valid 
> matches. With {{setSortByPosition(true)}}, PDFBox fixes those cases but 
> breaks extraction of PDFs containing rotated (e.g. 45-degree) text, where it 
> groups diagonal glyphs with horizontal ones incorrectly.
> Another example:
> PDF extractors do not extract text by visual position when 
> {{setSortByPosition(false)}} is set. They extract in drawing order (the order 
> text commands appear in the PDF stream), and in this PDF the drawing order is 
> wrong: “04” is drawn after “A11.10”, even though visually it is on the left.
> The PDF content stream produces characters with irregular (not 
> consistently increasing) X coordinates, such as:
> x = 1110 (digit 0)
> x = 1071 (next digit 0, jumps backward by about 40 units)
> x = 1075 (digit 4, still out of order)
> This violates the normal expectation that text on the same line 
> flows left-to-right, and it causes text extraction libraries such as PDFBox 
> to misorder text.
> !image-2026-04-06-09-22-14-452.png!
> h3. Steps to Reproduce
>  # Open the affected PDF page ({{A151}}) using 
> {{PDDocument.load(...)}}.
>  # Use {{PDFTextStripper}} to extract text or locate all occurrences of the 
> string {{A403}} via {{PDFTextStripperByArea}} or a custom subclass.
>  # With {{setSortByPosition(false)}} (default): only *2 of the 4* actual 
> occurrences of {{A403}} on the page are found.
>  # With {{setSortByPosition(true)}}: more occurrences are found on this 
> page, but other PDFs whose content streams contain 45-degree / diagonal text 
> are broken: PDFBox merges diagonal glyphs with horizontal glyphs, producing 
> incorrect word groupings.
>  
> We work around the issue in our application with two passes. *Pass 1*, 
> {{setSortByPosition(false)}}: preserves stream order; correct for 
> diagonal/rotated-text PDFs. *Pass 2*, {{setSortByPosition(true)}}: 
> corrects for out-of-order character drawing. Running both passes, however, 
> causes a major performance issue. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
