Josh Burchard created PDFBOX-4795:
-------------------------------------
Summary: Hebrew words are extracted with no whitespace between
Key: PDFBOX-4795
URL: https://issues.apache.org/jira/browse/PDFBOX-4795
Project: PDFBox
Issue Type: Bug
Affects Versions: 2.0.19
Environment: Windows 10
Reporter: Josh Burchard
Attachments: hebrew_newsletter.pdf
When I extract Hebrew text from the attached PDF, the whitespace delimiting the
words is missing from the output.
Example string of text as it appears in the PDF:
מאיר שמגר. ״ההלכות
And the string as PDFBox extracts it:
״ההלכותשמגר.מאיר
The words themselves are presented LTR instead of RTL. It would be nice to
have them RTL, but in my particular use case that doesn't matter, as I'm
creating an index. The spaces between the words matter a lot, however.
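To illustrate why the missing spaces matter for an index, here is a minimal sketch that does not use PDFBox at all. It assumes index terms are produced by splitting on whitespace, which may not match the actual indexing code; the class name HebrewTokenCheck and the method tokenize are hypothetical. The two sample strings are the ones quoted above, written with Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class HebrewTokenCheck {

    // Splits text into candidate index terms on runs of whitespace.
    // Whitespace splitting is an assumed indexing strategy, not PDFBox behavior.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // The phrase as it appears in the PDF (three words, two spaces).
        String expected = "\u05DE\u05D0\u05D9\u05E8 \u05E9\u05DE\u05D2\u05E8. "
                + "\u05F4\u05D4\u05D4\u05DC\u05DB\u05D5\u05EA";
        // The phrase as reported from extraction (no spaces at all).
        String extracted = "\u05F4\u05D4\u05D4\u05DC\u05DB\u05D5\u05EA"
                + "\u05E9\u05DE\u05D2\u05E8.\u05DE\u05D0\u05D9\u05E8";

        System.out.println(tokenize(expected).size());  // prints 3
        System.out.println(tokenize(extracted).size()); // prints 1
    }
}
```

With the spaces intact the phrase yields three separate terms; without them the whole phrase collapses into a single term, so none of the individual words can be looked up.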