[ 
https://issues.apache.org/jira/browse/PDFBOX-4481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-4481:
------------------------------------
    Labels: Thai  (was: )

> Text extraction error with Thai combined glyph depending on space after it
> --------------------------------------------------------------------------
>
>                 Key: PDFBOX-4481
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4481
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.14
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: Thai
>         Attachments: SO54981236-reduced.pdf, SO54981236-reduced.txt, 
> SO54981236.pdf
>
>
> In the text extracted from the first line of the reduced file, the "accent" 
> (somebody please correct me on what that mark is called) comes out separated 
> from its base glyph. In the second line it is in the proper place. Content stream:
> {code}
> BT
>   1 0 0 1 67.3 756.98 Tm
>   [ (\000\203\000\227\000q) ] TJ
>   1 0 0 1 77.5 756.98 Tm
>   [ (\000\003) ] TJ
>   1 0 0 1 67.3 730 Tm
>   [ (\000\203\000\227\000q\000\003) ] TJ
> ET
> {code}
> The weird thing is that "\000\003" is just a space. So when the space is in 
> the same string as the Thai glyphs, the extraction works, and when it is shown 
> by a separate TJ at its own position, it doesn't.
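
For readers who want to see which glyph codes the strings above carry, here is a minimal sketch in plain Java (no PDFBox involved). It assumes the font uses two-byte, big-endian character codes, which the "\000" prefix on every glyph suggests but which the issue text does not state. The class and method names (PdfStringCodes, unescape, toTwoByteCodes) are made up for this illustration and are not part of any library.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical helper for illustration only; not PDFBox API.
final class PdfStringCodes {

    /** Unescapes the body of a PDF literal string (the text between the parentheses). */
    static byte[] unescape(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i++);
            if (c != '\\' || i == body.length()) {
                out.write(c); // ordinary character (or a trailing lone backslash)
                continue;
            }
            char e = body.charAt(i);
            if (e >= '0' && e <= '7') {
                // Octal escape: one to three octal digits.
                int value = 0;
                int digits = 0;
                while (i < body.length() && digits < 3
                        && body.charAt(i) >= '0' && body.charAt(i) <= '7') {
                    value = value * 8 + (body.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF);
            } else {
                i++;
                switch (e) {
                    case 'n': out.write('\n'); break;
                    case 'r': out.write('\r'); break;
                    case 't': out.write('\t'); break;
                    case 'b': out.write('\b'); break;
                    case 'f': out.write('\f'); break;
                    default:  out.write(e);    break; // \( \) \\ and anything else
                }
            }
        }
        return out.toByteArray();
    }

    /** Pairs bytes into big-endian two-byte codes (ASSUMPTION: the font uses 2-byte codes). */
    static int[] toTwoByteCodes(byte[] bytes) {
        if (bytes.length % 2 != 0) {
            throw new IllegalArgumentException("odd number of bytes: " + bytes.length);
        }
        int[] codes = new int[bytes.length / 2];
        for (int k = 0; k < codes.length; k++) {
            codes[k] = ((bytes[2 * k] & 0xFF) << 8) | (bytes[2 * k + 1] & 0xFF);
        }
        return codes;
    }

    public static void main(String[] args) {
        // The three string operands from the content stream in the issue.
        String[] operands = {
            "\\000\\203\\000\\227\\000q",
            "\\000\\003",
            "\\000\\203\\000\\227\\000q\\000\\003"
        };
        for (String operand : operands) {
            System.out.println(Arrays.toString(toTwoByteCodes(unescape(operand))));
        }
    }
}
```

Under that two-byte assumption, the first TJ carries codes 131, 151 and 113, the second carries only code 3 (the space), and the third carries the same three codes followed by 3. The glyph data is therefore identical on both lines; only the way the space is delivered differs.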



