[ 
https://issues.apache.org/jira/browse/PDFBOX-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240282#comment-14240282
 ] 

Glen Peterson commented on PDFBOX-1242:
---------------------------------------

If I remember correctly, the PDF file format uses its own special single-byte 
character encodings with the 14 standard fonts.  If you use anything outside 
of what the PDF spec calls WinAnsi you may have to embed a font that handles 
those characters in the PDF file to ensure readability.

I have not submitted any patches, nor am I likely to any time soon.  What I did 
submit was a very partial work-around.  The mangled code above is now publicly 
available under the Apache 2.0 license on GitHub where it should be much more 
readable.  There is a Unicode to WinAnsi translation table here (I'll explain 
in a moment):
https://github.com/GlenKPeterson/PdfLayoutManager/blob/master/src/main/java/com/planbase/pdf/layoutmanager/PdfLayoutMgr.java#L651

The code that uses that table is here:
https://github.com/GlenKPeterson/PdfLayoutManager/blob/master/src/main/java/com/planbase/pdf/layoutmanager/PdfLayoutMgr.java#L972

High-level overview for each input character:

1. The characters up to 127 are the same in UTF-16 and ISO-8859-1, so the code 
leaves them unchanged.

2. If an input UTF-16 character above 127 has an ISO-8859-1 equivalent, it is 
converted directly/exactly.

3. If the input character is Cyrillic, there are somewhat standard, "Romanized" 
transliterations, where you can substitute one or more Roman characters that 
have a similar phonetic sound to the Cyrillic character.  So this lets us 
support an additional set of languages (Russian in particular) without 
embedding any fonts or otherwise dealing with the root issue.

4. If the above rules do not cover the character in question, a bullet is 
written to the output stream, so that the end user can see that there is a 
character there that didn't print.
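
The four rules above can be sketched in Java roughly as follows.  This is a 
minimal illustration, not the code from PdfLayoutMgr.java: the class name 
TransliterationSketch, the four-entry Cyrillic map, and the choice to treat 
U+00A0 through U+00FF as the characters with an ISO-8859-1 equivalent are all 
assumptions made for the example.  It also returns a Java String instead of 
writing encoded bytes to a PDF content stream.

```java
import java.util.HashMap;
import java.util.Map;

class TransliterationSketch {
    // Hypothetical, tiny subset of a Cyrillic-to-Roman table.
    // The real table linked above is much larger.
    private static final Map<Character, String> CYRILLIC = new HashMap<>();
    static {
        CYRILLIC.put('\u0416', "Zh"); // Cyrillic capital ZHE
        CYRILLIC.put('\u0436', "zh"); // Cyrillic small zhe
        CYRILLIC.put('\u0414', "D");  // Cyrillic capital DE
        CYRILLIC.put('\u0430', "a");  // Cyrillic small a
    }

    // Placeholder written when no rule applies.
    private static final String BULLET = "\u2022";

    // Applies rules 1 to 4 to a single UTF-16 code unit.
    static String convertChar(char c) {
        if (c <= 127) {
            return String.valueOf(c);      // rule 1: unchanged
        }
        if (c >= 0xA0 && c <= 0xFF) {
            return String.valueOf(c);      // rule 2: assumed ISO-8859-1 range
        }
        String roman = CYRILLIC.get(c);
        if (roman != null) {
            return roman;                  // rule 3: Romanized transliteration
        }
        return BULLET;                     // rule 4: visible placeholder
    }

    // Converts a whole string one code unit at a time.
    static String convert(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            out.append(convertChar(in.charAt(i)));
        }
        return out.toString();
    }
}
```

Calling TransliterationSketch.convert(...) on plain English text returns it 
unchanged, while a character with no mapping comes back as a bullet.  Because 
the sketch works on UTF-16 code units, a surrogate pair would produce two 
bullets.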

OK, so I lied.  The "while" loop at line 1006 doesn't actually work one 
character at a time.  It finds instances of characters that need to be 
substituted.  Then it copies what chunks of raw input it can to the output 
unchanged.  It only drops to a character-by-character algorithm when it finds a 
character that actually needs to be substituted.  This means that any length 
string of modern English characters will pass through unchanged.
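
The chunked copying described above might look something like this.  It is 
only a sketch under stated assumptions, not the loop from the GitHub source: 
the class ChunkedCopySketch, the regular expression used to find characters 
that need attention, and the substitute() helper (which here always returns a 
bullet) are hypothetical stand-ins.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChunkedCopySketch {
    // Assumption: anything above U+00FF needs substitution.
    private static final Pattern NEEDS_SUBSTITUTION =
            Pattern.compile("[^\\u0000-\\u00FF]+");

    // Hypothetical per-character fallback; always a bullet in this sketch.
    static String substitute(char c) {
        return "\u2022";
    }

    static String convert(String in) {
        StringBuilder out = new StringBuilder(in.length());
        Matcher m = NEEDS_SUBSTITUTION.matcher(in);
        int copiedUpTo = 0;
        while (m.find()) {
            // Copy the untouched run before the match in one chunk.
            out.append(in, copiedUpTo, m.start());
            // Only the matched run is handled character by character.
            for (int i = m.start(); i < m.end(); i++) {
                out.append(substitute(in.charAt(i)));
            }
            copiedUpTo = m.end();
        }
        // Copy whatever follows the last match (or the whole input).
        out.append(in, copiedUpTo, in.length());
        return out.toString();
    }
}
```

With this shape, an input that contains no matches never enters the loop 
body and is appended to the output in a single call.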

Most of that is in comments in the code on GitHub, but is probably easier to 
read knowing this overview. I hope that helps.

> Handle non ISO-8859-1 chars with drawString
> -------------------------------------------
>
>                 Key: PDFBOX-1242
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1242
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Writing
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Peter Andersen
>            Assignee: John Hewson
>             Fix For: 2.0.0
>
>
> PDPageContentStream.drawString takes a String as its argument and constructs 
> a COSString from the input.
> If the input contains chars above 255, the COSString is prefixed with 0xFE, 
> 0xFF and the bytes are taken from the input encoded as "UTF-16BE".
> Back in the drawString method, this UTF-16 encoded COSString is appended as 
> an "ISO-8859-1" string:
>       appendRawCommands( new String( buffer.toByteArray(), "ISO-8859-1"));
>  
> The result is that a line with UTF-16 chars is shown prefixed with þÿ, 
> and with a double space between the other chars.
> The chars above 255 are shown as the two corresponding ISO-8859-1 characters.
> As a side question to this observation, is there an alternative way to use 
> PDFBox that supports UTF-16?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
