[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters

John Hewson (JIRA) Wed, 06 Aug 2014 14:25:40 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088283#comment-14088283
 ]


John Hewson commented on PDFBOX-2259:
-------------------------------------

OK, I see the problem. The PDF text itself doesn't contain the joining space 
character that is needed, so copying and pasting gives the wrong result too.
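
If you want to confirm this on your side, a small check along these lines 
could show whether a string you got back from extraction still contains the 
joining character. This is only a sketch. It assumes the "semi-space" is 
U+200C (ZERO WIDTH NON-JOINER), which the issue does not actually state, and 
JoinerCheck and joinerPositions are made-up names, not part of PDFBox. The 
sample word is illustrative and is not taken from test.pdf.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
final class JoinerCheck {
    // Assumption: the "semi-space" in the issue is U+200C (ZERO WIDTH NON-JOINER).
    static final char ZWNJ = '\u200C';

    // Returns the char indices at which the assumed joining character occurs.
    static List<Integer> joinerPositions(String text) {
        List<Integer> positions = new ArrayList<Integer>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == ZWNJ) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Illustrative compound word written with the joiner between its parts.
        String withJoiner = "\u0645\u06CC\u200C\u0631\u0648\u0645";
        // The same letters with the joiner missing, i.e. the parts run together.
        String withoutJoiner = "\u0645\u06CC\u0631\u0648\u0645";

        System.out.println("with joiner:    " + joinerPositions(withJoiner));
        System.out.println("without joiner: " + joinerPositions(withoutJoiner));
    }
}
```

An empty list for text taken from the PDF would be consistent with the 
character simply not being present in the page's text.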

The text you want is in the PDF, though, as "marked content", which is used 
for accessibility. The PDFMarkedContentExtractor class can be used to extract 
this text. Perhaps try that.

> PDFTextStripper has problem with semi-space characters
> ------------------------------------------------------
>
>                 Key: PDFBOX-2259
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2259
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6
>            Reporter: Amir
>         Attachments: test.pdf
>
>
> In some right-to-left languages, compound words are separated using a 
> "semi-space" (please take a look at the Unicode spaces: 
> https://www.cs.tut.fi/~jkorpela/chars/spaces.html). When the input document 
> contains these words, PDFTextStripper drops the semi-space character and 
> runs the words together. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
