[ https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089588#comment-14089588 ]
John Hewson commented on PDFBOX-2259:
-------------------------------------

No, the semi-space character isn't part of the text embedded in the PDF file. The PDF contains additional "marked content" for accessibility, screen readers, etc., and that marked content does contain the semi-space. Only PDFMarkedContentExtractor has access to that character.

However, it seems there may be a bug in PDFMarkedContentExtractor, so you're still getting the wrong result. I'll take a look soon.

> PDFTextStripper has problem with semi-space characters
> ------------------------------------------------------
>
>                 Key: PDFBOX-2259
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2259
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6
>            Reporter: Amir
>         Attachments: test.pdf
>
>
> In some right-to-left languages, compound words are separated using a
> "semi-space" (please take a look at Unicode spaces:
> https://www.cs.tut.fi/~jkorpela/chars/spaces.html). When the input document
> contains these words, PDFTextStripper ignores the semi-space character and
> concatenates the words together.



--
This message was sent by Atlassian JIRA
(v6.2#6252)