[ 
https://issues.apache.org/jira/browse/PDFBOX-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-6188:
------------------------------------
    Attachment: A151_src.txt

> PDFTextStripper misses text occurrences in PDFs with out-of-order character 
> drawing when setSortByPosition(false)
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6188
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6188
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.29, 3.0.7 PDFBox
>            Reporter: Nirmal Tandel
>            Priority: Blocker
>         Attachments: A151_src.pdf, A151_src.txt, A403_ref.pdf, 
> image-2026-04-06-09-22-14-452.png
>
>
> When using {{PDFTextStripper}} to search for text in a vector PDF, not all 
> occurrences of the search string are found. The root cause is that the PDF 
> content stream draws characters in non-left-to-right visual order. With 
> {{setSortByPosition(false)}} (the default), PDFBox respects drawing order and 
> produces garbled token groupings, causing text searches to miss valid 
> matches. With {{setSortByPosition(true)}}, PDFBox fixes those cases but 
> breaks extraction of PDFs containing rotated (e.g. 45-degree) text, where it 
> groups diagonal glyphs with horizontal ones incorrectly.
> Another example:
> With {{setSortByPosition(false)}}, PDFBox does not extract text by visual 
> position. It extracts in drawing order (the order in which text commands 
> appear in the PDF content stream), and in this PDF the drawing order does 
> not match the visual order: “04” is drawn after “A11.10”, even though 
> visually it is on the left.
> The PDF content stream is producing characters with irregular (not 
> consistently increasing) X coordinates, such as:
> x = 1110 (digit 0)
> x = 1071 (next digit 0 — jumps backward ~40 units)
> x = 1075 (digit 4 — still out of order)
> This breaks the assumption that text on the same line is drawn left to 
> right (the PDF specification does not actually require this), so text 
> extraction libraries such as PDFBox misorder the text.
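
A minimal sketch of how such backward jumps could be flagged, using only the three X values quoted above. The class BackwardJumpDetector and its method are hypothetical helpers, not part of PDFBox. In a real PDFTextStripper subclass the X values would presumably come from TextPosition.getXDirAdj() inside an overridden writeString(String, List<TextPosition>); that wiring is assumed here and not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: flags glyphs whose X position
// is smaller than that of the glyph drawn immediately before them.
class BackwardJumpDetector {

    // Returns the indices (in drawing order) at which X decreases.
    static List<Integer> backwardJumps(float[] xs) {
        List<Integer> jumps = new ArrayList<>();
        for (int i = 1; i < xs.length; i++) {
            if (xs[i] < xs[i - 1]) {
                jumps.add(i);
            }
        }
        return jumps;
    }

    public static void main(String[] args) {
        // X values quoted in the report, in drawing order.
        float[] xs = {1110f, 1071f, 1075f};
        System.out.println(backwardJumps(xs)); // prints [1]
    }
}
```

Only index 1 is reported because 1075 is greater than 1071; the third glyph is still left of the first one, so a stricter check might compare against the running maximum instead.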
> !image-2026-04-06-09-22-14-452.png!
> h3. Steps to Reproduce
>  # Open the affected PDF page ({{A151}}) using {{PDDocument.load(...)}} 
> (2.0.x) or {{Loader.loadPDF(...)}} (3.0.x).
>  # Use {{PDFTextStripper}} to extract text or locate all occurrences of the 
> string {{A403}} via {{PDFTextStripperByArea}} or a custom subclass.
>  # With {{setSortByPosition(false)}} (default): only *2 of the 4* actual 
> occurrences of {{A403}} on the page are found.
>  # With {{setSortByPosition(true)}}: more occurrences are found on this 
> page, but other PDFs whose content streams contain 45-degree / diagonal text 
> are broken: PDFBox merges diagonal glyphs with horizontal glyphs, producing 
> incorrect word groupings.
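
The effect described in steps 3 and 4 can be illustrated without PDFBox. The sketch below is a simplified model, not the library's algorithm: Glyph is a hypothetical stand-in for a positioned character, and the coordinates are invented for illustration (they are not taken from A151_src.pdf). It shows how a search string can be missed when glyphs are concatenated in drawing order and found once they are sorted by X.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative model only: Glyph is a hypothetical stand-in for a
// positioned character; it is not a PDFBox class.
class DrawingOrderDemo {

    static final class Glyph {
        final char ch;
        final float x;
        Glyph(char ch, float x) { this.ch = ch; this.x = x; }
    }

    // Concatenates glyphs in the order given (models drawing order).
    static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    // Concatenates glyphs after sorting by X (models position sorting
    // within a single horizontal line).
    static String joinSortedByX(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        return join(sorted);
    }

    // Counts (possibly overlapping) occurrences of needle in text.
    static int count(String text, String needle) {
        int n = 0;
        for (int i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + 1)) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Invented data: "A403" laid out left to right, but drawn as "03" then "A4".
        List<Glyph> drawn = Arrays.asList(
                new Glyph('0', 20f), new Glyph('3', 30f),
                new Glyph('A', 0f), new Glyph('4', 10f));
        System.out.println(count(join(drawn), "A403"));          // prints 0
        System.out.println(count(joinSortedByX(drawn), "A403")); // prints 1
    }
}
```

Sorting by X alone is only meaningful for glyphs on one horizontal baseline. Rotated text would need to be separated first (for example by text direction), which is the case that step 4 reports as broken.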



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
