[ https://issues.apache.org/jira/browse/PDFBOX-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17908878#comment-17908878 ]
Tilman Hausherr commented on PDFBOX-1652:
-----------------------------------------

This ticket should be reopened, but with an example so that we can compare the behaviour with Adobe's.

> TextPosition: Japanese alphabetic characters 30fc and 3005 treated as diacritics
> --------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1652
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1652
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.1
>            Reporter: Christian Kohlschütter
>            Priority: Major
>              Labels: PatchAvailable
>         Attachments: PDFBOX-1652.patch
>
>
> For the purpose of determining the position in text, the Japanese characters
> U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK) and U+3005 (IDEOGRAPHIC
> ITERATION MARK) are currently regarded as "simple" diacritics. Apparently, they
> are fully-fledged characters in terms of text positioning.
> As a result, some characters can come out reversed when text is extracted
> (in particular, ーン can become ンー).
> A patch to fix this is attached.
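To make the classification point in the quoted report concrete, here is a minimal Java sketch that uses only java.lang.Character. In the JDK's Unicode data, U+30FC and U+3005 are modifier letters (general category Lm), not combining marks (Mn, Mc, Me). The class and helper names below are hypothetical and this is not PDFBox code. That a check as broad as the second helper is what led to the reported behaviour is an assumption; this message does not say so.

```java
class DiacriticSketch {

    // Hypothetical helper: true only for Unicode combining marks (Mn, Mc, Me),
    // i.e. characters that attach to a preceding base character.
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Hypothetical broader check that also accepts modifier letters (Lm).
    // ASSUMPTION: shown only to illustrate how U+30FC and U+3005 could end up
    // being handled like diacritics; not taken from the PDFBox sources.
    static boolean isCombiningOrModifierLetter(int codePoint) {
        return isCombiningMark(codePoint)
                || Character.getType(codePoint) == Character.MODIFIER_LETTER;
    }

    public static void main(String[] args) {
        // The two characters from the report, plus two real combining marks
        // (U+0301 COMBINING ACUTE ACCENT, U+3099 combining voiced sound mark).
        int[] samples = {0x30FC, 0x3005, 0x0301, 0x3099};
        for (int cp : samples) {
            System.out.printf("U+%04X %s: narrow=%b broad=%b%n",
                    cp, Character.getName(cp),
                    isCombiningMark(cp), isCombiningOrModifierLetter(cp));
        }
    }
}
```

With the narrow check, the two Japanese characters are treated as ordinary characters with their own position, while U+0301 and U+3099 are still recognised as marks. Whether that matches Adobe's extraction order would still need the example PDF requested above.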