[ https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240282#comment-14240282 ]
Glen Peterson commented on PDFBOX-1242:
---------------------------------------

If I remember correctly, the PDF file format uses its own very special 14-bit character encoding. If you use anything outside of what the PDF spec calls WinAnsi, you may have to embed a font that handles those characters in the PDF file to ensure readability.

I have not submitted any patches, nor am I likely to any time soon. What I did submit was a very partial work-around. The mangled code above is now publicly available under the Apache 2.0 license on GitHub, where it should be much more readable.

There is a Unicode to WinAnsi translation table here (I'll explain in a moment):
https://github.com/GlenKPeterson/PdfLayoutManager/blob/master/src/main/java/com/planbase/pdf/layoutmanager/PdfLayoutMgr.java#L651

The code that uses that table is here:
https://github.com/GlenKPeterson/PdfLayoutManager/blob/master/src/main/java/com/planbase/pdf/layoutmanager/PdfLayoutMgr.java#L972

High-level overview for each input character:

1. The characters up to 127 are the same in UTF-16 and ISO-8859-1, so it leaves them unchanged.
2. If an input UTF-16 character above 127 has an ISO-8859-1 equivalent, it is converted directly/exactly.
3. If the input character is Cyrillic, there are somewhat standard "Romanized" transliterations, where you can substitute one or more Roman characters that have a similar phonetic sound to the Cyrillic character. This lets us support an additional set of languages (Russian in particular) without embedding any fonts or otherwise dealing with the root issue.
4. If the above rules do not cover the character in question, a bullet is written to the output stream, so that the end user can see that there is a character there that didn't print.

OK, so I lied. The "while" loop at line 1006 doesn't actually work one character at a time. It finds instances of characters that need to be substituted. Then it copies what chunks of raw input it can to the output unchanged.
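The four rules above can be sketched roughly as follows. This is a simplified, strictly character-by-character illustration and not the code in PdfLayoutMgr: the class name WinAnsiSketch, the method toWinAnsiSafe, and the four-entry Cyrillic table are hypothetical stand-ins, and the sketch assumes that only U+00A0 through U+00FF count as "has an ISO-8859-1 equivalent" (the C1 control range is treated as unmappable).

```java
import java.util.HashMap;
import java.util.Map;

final class WinAnsiSketch {
    // Hypothetical, tiny transliteration table (capital/small De-like and
    // Zhe letters only); the real table on GitHub is far larger.
    private static final Map<Character, String> CYRILLIC = new HashMap<>();
    static {
        CYRILLIC.put('\u0414', "D");
        CYRILLIC.put('\u0430', "a");
        CYRILLIC.put('\u0416', "Zh");
        CYRILLIC.put('\u0436', "zh");
    }

    // U+2022 BULLET marks a character that could not be mapped (rule 4).
    private static final char BULLET = '\u2022';

    static String toWinAnsiSafe(String in) {
        StringBuilder out = new StringBuilder(in.length());
        // Iterates over UTF-16 code units, so a character outside the BMP
        // (a surrogate pair) produces two bullets in this sketch.
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            String translit = CYRILLIC.get(c);
            if (c <= 0x7F) {
                out.append(c);            // rule 1: ASCII passes through
            } else if (c >= 0xA0 && c <= 0xFF) {
                out.append(c);            // rule 2: same code point in ISO-8859-1
            } else if (translit != null) {
                out.append(translit);     // rule 3: Romanized transliteration
            } else {
                out.append(BULLET);       // rule 4: visible placeholder
            }
        }
        return out.toString();
    }
}
```

Calling WinAnsiSketch.toWinAnsiSafe on a plain English string returns it unchanged, which is the property the chunk-copying loop in the real code exploits.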
That loop only drops to a character-by-character algorithm when it finds a character that actually needs to be substituted. This means that a string of any length made up only of modern English characters will pass through unchanged.

Most of that is in comments in the code on GitHub, but it is probably easier to read knowing this overview. I hope that helps.

> Handle non ISO-8859-1 chars with drawString
> -------------------------------------------
>
>                 Key: PDFBOX-1242
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1242
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Writing
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Peter Andersen
>            Assignee: John Hewson
>             Fix For: 2.0.0
>
>
> The PDPageContentStream.drawString method takes a String as argument and
> constructs a COSString from the input.
> If the input contains chars above 255, the COSString is prefixed with 0xFE, 0xFF
> and the bytes are taken from the input as "UTF-16BE" encoded.
> Back in the drawString method, this UTF-16 encoded COSString is appended as
> "ISO-8859-1":
> appendRawCommands( new String( buffer.toByteArray(), "ISO-8859-1"));
>
> The result of this is that a line with UTF-16 chars is shown prefixed with þÿ,
> and with double space between the other chars.
> The chars above 255 are shown as the two corresponding ISO-8859-1 characters.
> As a side question to this observation, is there an alternative way to use
> PDFBox to support UTF-16?
>
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)