[ 
https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089588#comment-14089588
 ] 

John Hewson commented on PDFBOX-2259:
-------------------------------------

No, the semi-space character isn't part of the text embedded in the PDF file. 
The PDF contains additional "marked content" for accessibility (screen readers, 
etc.), and that marked content does contain the semi-space. Only 
PDFMarkedContentExtractor has access to that character.

However, it seems like there may be a bug in PDFMarkedContentExtractor so 
you're still getting the wrong result. I'll take a look soon.
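
To check whether a given extraction result actually contains the character, 
something like the following may help. It is only a minimal sketch: it assumes 
the semi-space in question is U+200C ZERO WIDTH NON-JOINER (the issue does not 
state the code point), it uses a made-up sample word rather than anything from 
test.pdf, and the class and method names are hypothetical. It doesn't call 
PDFBox at all; pass it whatever string the extractor returned.

```java
public class SemiSpaceCheck {
    // Assumption: the "semi-space" is U+200C ZERO WIDTH NON-JOINER.
    static final char SEMI_SPACE = '\u200C';

    // True if the text contains at least one semi-space character.
    static boolean containsSemiSpace(String text) {
        return text.indexOf(SEMI_SPACE) >= 0;
    }

    // Renders every code point as U+XXXX so invisible characters show up.
    static String toCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: a compound word with and without the semi-space.
        String withSemiSpace = "\u0645\u06CC\u200C\u0631\u0648\u0645";
        String concatenated = "\u0645\u06CC\u0631\u0648\u0645";

        System.out.println(containsSemiSpace(withSemiSpace) + "  " + toCodePoints(withSemiSpace));
        System.out.println(containsSemiSpace(concatenated) + "  " + toCodePoints(concatenated));
    }
}
```

If the output for the extracted string shows no U+200C between the two halves 
of the word, the character was lost during extraction rather than during 
display.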

> PDFTextStripper has problem with semi-space characters
> ------------------------------------------------------
>
>                 Key: PDFBOX-2259
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2259
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6
>            Reporter: Amir
>         Attachments: test.pdf
>
>
> In some right-to-left languages, the parts of a compound word are separated 
> by a "semi-space" (see the list of Unicode spaces: 
> https://www.cs.tut.fi/~jkorpela/chars/spaces.html). When the input document 
> contains such words, PDFTextStripper drops the semi-space character and 
> concatenates the parts together. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
