[jira] [Updated] (PDFBOX-1282) Unicode characters displayed with wrong glyps because of interpretation as 8 bit strings

Daniel Schwinn (Updated) (JIRA) Fri, 06 Apr 2012 14:54:41 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Daniel Schwinn updated PDFBOX-1282:
-----------------------------------

    Attachment: Characters_Arial_Modified.pdf

some modifications to illustrate the issue
                
> Unicode characters displayed with wrong glyps because of interpretation as 8 
> bit strings
> ----------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1282
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1282
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDFReader
>    Affects Versions: 1.6.0
>            Reporter: Daniel Schwinn
>         Attachments: Characters_Arial.pdf, Characters_Arial_Modified.pdf
>
>
> the file Characters_Arial.pdf  shows that some unicode values are displayed 
> with wrong glyphs, for example the u2020 which is displayed as two spaces. 
> Another Issue is that invalid unicode characters are not handled correctly. 
> They should display
>  the invalid character box or something like that. This is demonstrated with 
> a modified version
>  of the file.
> The method processEncodedText is called when the texts of the document are 
> printed 
>       int codeLength = 1;
>         for( int i=0; i<string.length; i+=codeLength)
>         {
>             // Decode the value to a Unicode character
>             codeLength = 1;
>             String c = font.encode( string, i, codeLength );
>             if( c == null && i+1<string.length)
>             {
>                 //maybe a multibyte encoding
>                 codeLength++;
>                 c = font.encode( string, i, codeLength );
>             }
> This code tries to determine if the values in variable 'string' are 8 or 16 
> bit values or even a mixture of both types of values <lol>. 
> Everything works fine when variable 'string' contains 8 bit values, in most 
> cases. If there is an invalid  8 bit value this character may be dropped 
> together with the following character.
> The real problem occurs when the data in variable 'string' is encoded as 16 
> bit values. For many characters this works fine as the first byte is usually 
> not a valid character: 
> for example u0041 is first tried as char 00 with codeLength=1 an as there is 
> no entry for unicode 0 in the font it will be re-tried with codeLength=2 and 
> then interpreted as u0041.
> But what happens if the first byte of the 16 bit code is also a valid 
> character code?
> to check this I created the file Characters_Arial_Changed.pdf where I simply 
> changed the 16-bit string <0041> which displays 'A' to <4141> which is an 
> invalid character in this font. I Also changed a 8-bit string nearby from 
> (0041) to the value <4141>.
> Note that there are now two strings with the same value <4141> which have to 
> be displayed in a different way.
> Acrobat Reader then shows the invalid character box for the 16 bit string and 
> 'AA' for the 8 bit string above. PDFBox shows 'AA' for both strings.
> Problems are occuring with valid unicode character codes too: Unicode u2020 
> will be shown as two nice spaces in PDFBox where Adobe Reader shows the 
> correct character.
> To guess that it is a 16 bit character when the first byte is an invalid 
> character in the current font is the wrong way to handle the string values. 
> If the variable 'string' contains 8 or 16 bit values can't be detected by 
> analysing the values as the example shows.
> processEncodedText has to handle the data in variable 'string' as 16 bit 
> values when the font which is used has an (unicode-)encoding which uses more 
> than 256 characters, in all other cases it should be interpreted as 8 bit 
> values!!! 
> With an Unicode Font <4343> or (CC) should show the invalid character box, 
> with an 8 bit font both values should show the text 'CC'. I have included 
> this example in the file too.
> The Adobe documentation says about 8 or 16 bit values in strings for example:
> "When the current font is a Type 0 font whose Encoding entry is Identity-H or 
> Identity-V, the string to be shown shall contain pairs of bytes representing 
> CIDs, high-order byte first. When the current font is a CIDFont, the string 
> to be shown shall contain pairs of bytes representing CIDs, high-order byte 
> first. When the current font is a Type 2 CIDFont in which the CIDToGIDMap 
> entry is Identity and if the TrueType font is embedded in the PDF file, the 
> 2-byte CID values shall be identical glyph indices for the glyph descriptions 
> in the TrueType font program."
> I guess depending on this information it has to be determined if the string 
> is 8 or 16 bits!
> In my example pdf files the type 0 font has always the Indentity-H set as 
> encoding and so the strings have to be en-/decoded as pure 16 bit strings.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PDFBOX-1282) Unicode characters displayed with wrong glyps because of interpretation as 8 bit strings

Reply via email to