[ 
https://issues.apache.org/jira/browse/PDFBOX-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18076911#comment-18076911
 ] 

Michael Klink edited comment on PDFBOX-6188 at 4/28/26 5:02 PM:
----------------------------------------------------------------

You can save some time by parsing the page only once and collecting all the 
{{TextPosition}} objects, and then analyzing that {{TextPosition}} collection 
in different ways, e.g. once with sorting and once without, or by 
partitioning the collection by angle and then sorting each partition by 
coordinates rotated by that angle, or...

The PDFBox text extractor unfortunately has those steps closely coupled, so you 
must do some programming yourself (or at least some copying and pasting...).
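
To illustrate the "partition by angle, then sort by rotated coordinates" idea, 
here is a minimal sketch in plain Java. It is not PDFBox code: {{Glyph}} is a 
hypothetical stand-in for whatever values you copy out of each collected 
{{TextPosition}} (text, position, rotation angle), and {{textByAngle}} is an 
illustrative name. It assumes y grows upward and uses crude rounding to decide 
which glyphs share a line, so treat it as a starting point only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AngleSortSketch {

    /** Hypothetical stand-in for the data copied out of one TextPosition. */
    public static final class Glyph {
        final String text;
        final double x;            // assumed: x grows to the right
        final double y;            // assumed: y grows upward
        final double angleDegrees; // counterclockwise rotation of the text

        public Glyph(String text, double x, double y, double angleDegrees) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.angleDegrees = angleDegrees;
        }
    }

    /** Partitions glyphs by rounded angle, then sorts each partition in its own rotated frame. */
    public static Map<Integer, String> textByAngle(List<Glyph> glyphs) {
        Map<Integer, List<Glyph>> partitions = new TreeMap<>();
        for (Glyph g : glyphs) {
            int key = (int) Math.round(g.angleDegrees);
            partitions.computeIfAbsent(key, k -> new ArrayList<>()).add(g);
        }

        Map<Integer, String> result = new TreeMap<>();
        for (Map.Entry<Integer, List<Glyph>> entry : partitions.entrySet()) {
            double rad = Math.toRadians(entry.getKey());
            double cos = Math.cos(rad);
            double sin = Math.sin(rad);
            // u runs along the baseline, v is perpendicular to it.
            // Sort by line (v, top first, rounded to whole units), then along the line (u).
            Comparator<Glyph> readingOrder = Comparator
                    .comparingLong((Glyph g) -> -Math.round(-g.x * sin + g.y * cos))
                    .thenComparingDouble(g -> g.x * cos + g.y * sin);

            List<Glyph> sorted = new ArrayList<>(entry.getValue());
            sorted.sort(readingOrder);

            StringBuilder line = new StringBuilder();
            for (Glyph g : sorted) {
                line.append(g.text);
            }
            result.put(entry.getKey(), line.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Horizontal glyphs in scrambled drawing order, mixed with 45-degree glyphs.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("A", 10, 100, 0), new Glyph("0", 30, 100, 0),
                new Glyph("G", 36.21, 116.21, 45), new Glyph("3", 40, 100, 0),
                new Glyph("D", 15, 95, 45), new Glyph("4", 20, 100, 0),
                new Glyph("A", 29.14, 109.14, 45), new Glyph("I", 22.07, 102.07, 45));
        System.out.println(textByAngle(glyphs));
    }
}
```

Because each partition is sorted in its own rotated frame, the diagonal glyphs 
are never interleaved with the horizontal ones, and the page still has to be 
parsed only once to fill the glyph list.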


was (Author: mkl):
You can save a bit by parsing the page only once and collecting all the 
{{TextPosition}} objects, and then analyzing that {{TextPosition}} collection 
in different manners, e.g. once with sorting and once without, or by 
partitioning the collection by angle and then by coordinates rotated by that 
angle, or... 

The PDFBox text extractor unfortunately has those steps closely coupled, so you 
must do some programming yourself (or at least some copying and pasting...).

> PDFTextStripper misses text occurrences in PDFs with out-of-order character 
> drawing when setSortByPosition(false)
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6188
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6188
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.29, 3.0.7 PDFBox
>            Reporter: Nirmal Tandel
>            Priority: Blocker
>         Attachments: A151_src.pdf, A151_src.txt, A403_ref.pdf, Ref_A1110.pdf, 
> Src_A0209.pdf, image-2026-04-06-09-22-14-452.png, 
> image-2026-04-28-11-37-57-266.png, image-2026-04-28-11-41-23-979.png
>
>
> When using {{PDFTextStripper}} to search for text in a vector PDF, not all 
> occurrences of the search string are found. The root cause is that the PDF 
> content stream draws characters in non-left-to-right visual order. With 
> {{setSortByPosition(false)}} (the default), PDFBox respects drawing order and 
> produces garbled token groupings, causing text searches to miss valid 
> matches. With {{setSortByPosition(true)}}, PDFBox fixes those cases but 
> breaks extraction of PDFs containing rotated (e.g. 45-degree) text, where it 
> groups diagonal glyphs with horizontal ones incorrectly.
> Another example is as follows:
> PDF extractors do not extract the text by visual position when 
> {{setSortByPosition(false)}} is set. They extract in drawing order (the order 
> text commands appear in the PDF stream) and in this PDF, the drawing order is 
> wrong: “04” is drawn after “A11.10”, even though visually it's on the left.
> The PDF content stream is producing characters with irregular (not 
> consistently increasing) X coordinates, such as:
> x = 1110 (digit 0)
> x = 1071 (next digit 0 — jumps backward ~40 units)
> x = 1075 (digit 4 — still out of order)
> This violates the PDF specification’s expectation that text in the same line 
> flows left-to-right. This causes all text extraction libraries, including 
> PDFBox, to misorder text.
> !image-2026-04-06-09-22-14-452.png!
> h3. Steps to Reproduce
>  # Open the affected PDF page ({{A151}}) using {{PDDocument.load(...)}}.
>  # Use {{PDFTextStripper}} to extract text or locate all occurrences of the 
> string {{A403}} via {{PDFTextStripperByArea}} or a custom subclass.
>  # With {{setSortByPosition(false)}} (default): only *2 of the 4* actual 
> occurrences of {{A403}} on the page are found.
>  # With {{setSortByPosition(true)}}: more occurrences are found on this 
> page, but other PDFs whose content streams contain 45-degree / diagonal text 
> are broken — PDFBox merges diagonal glyphs with horizontal glyphs, producing 
> incorrect word groupings.
>  
> We work around the issue in our application with two passes. *Pass 1*, 
> {{setSortByPosition(false)}}: preserves stream order; correct for 
> diagonal/rotated-text PDFs. *Pass 2*, {{setSortByPosition(true)}}: 
> corrects for out-of-order character drawing. Running both passes, however, 
> causes a major performance issue. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
