[ https://issues.apache.org/jira/browse/PDFBOX-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Schwinn updated PDFBOX-1282:
-----------------------------------
    Attachment: Characters_Arial_Modified.pdf

some modifications to illustrate the issue

> Unicode characters displayed with wrong glyps because of interpretation as 8 bit strings
> ----------------------------------------------------------------------------------------
>
> Key: PDFBOX-1282
> URL: https://issues.apache.org/jira/browse/PDFBOX-1282
> Project: PDFBox
> Issue Type: Bug
> Components: PDFReader
> Affects Versions: 1.6.0
> Reporter: Daniel Schwinn
> Attachments: Characters_Arial.pdf, Characters_Arial_Modified.pdf
>
>
> The file Characters_Arial.pdf shows that some Unicode values are displayed
> with the wrong glyphs; for example, u2020 is displayed as two spaces.
> Another issue is that invalid Unicode characters are not handled correctly:
> they should be displayed as the invalid character box or something similar.
> This is demonstrated with a modified version of the file.
>
> The method processEncodedText is called when the text of the document is
> printed:
>
>     int codeLength = 1;
>     for( int i=0; i<string.length; i+=codeLength)
>     {
>         // Decode the value to a Unicode character
>         codeLength = 1;
>         String c = font.encode( string, i, codeLength );
>         if( c == null && i+1<string.length)
>         {
>             //maybe a multibyte encoding
>             codeLength++;
>             c = font.encode( string, i, codeLength );
>         }
>
> This code tries to determine whether the values in the variable 'string' are
> 8-bit or 16-bit values, or even a mixture of both <lol>.
>
> In most cases everything works fine when 'string' contains 8-bit values. If
> there is an invalid 8-bit value, however, that character may be dropped
> together with the following character.
>
> The real problem occurs when the data in the variable 'string' is encoded as
> 16-bit values.
> For many characters this works fine, as the first byte is usually not a
> valid character: for example, u0041 is first tried as char 00 with
> codeLength=1, and as there is no entry for Unicode 0 in the font it is
> retried with codeLength=2 and then interpreted as u0041.
>
> But what happens if the first byte of the 16-bit code is also a valid
> character code?
>
> To check this I created the file Characters_Arial_Changed.pdf, where I simply
> changed the 16-bit string <0041>, which displays 'A', to <4141>, which is an
> invalid character in this font. I also changed an 8-bit string nearby from
> (0041) to the value <4141>.
>
> Note that there are now two strings with the same value <4141> which have to
> be displayed differently.
>
> Acrobat Reader then shows the invalid character box for the 16-bit string and
> 'AA' for the 8-bit string above. PDFBox shows 'AA' for both strings.
>
> Problems occur with valid Unicode character codes too: u2020 is shown as two
> spaces in PDFBox, whereas Adobe Reader shows the correct character.
>
> Guessing that a value is a 16-bit character because its first byte is an
> invalid character in the current font is the wrong way to handle the string
> values. Whether the variable 'string' contains 8-bit or 16-bit values cannot
> be detected by analysing the values, as the example shows.
>
> processEncodedText has to handle the data in 'string' as 16-bit values when
> the font in use has a (Unicode) encoding with more than 256 characters; in
> all other cases the data should be interpreted as 8-bit values.
>
> With a Unicode font, <4343> or (CC) should show the invalid character box;
> with an 8-bit font both values should show the text 'CC'. I have included
> this example in the file too.
>
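
The difference between the two strategies described above can be sketched in a
few lines of Java. This is only an illustration under stated assumptions, not
PDFBox code: SketchFont, its twoByteCodes flag, decodeGuessing and decodeByFont
are hypothetical names, the font is reduced to a simple code-to-text map, and
U+FFFD stands in for the invalid character box.

```java
import java.util.HashMap;
import java.util.Map;

final class CodeLengthSketch {

    /** Stand-in for the "invalid character box". */
    static final String MISSING = "\uFFFD";

    /** Hypothetical, heavily simplified font: a map from character code to text. */
    static final class SketchFont {
        final boolean twoByteCodes; // assumption: true for e.g. a Type 0 font with Identity-H
        final Map<Integer, String> codeToText = new HashMap<>();

        SketchFont(boolean twoByteCodes) {
            this.twoByteCodes = twoByteCodes;
        }

        /** Returns the text for the big-endian code of the given length, or null if unmapped. */
        String encode(byte[] data, int offset, int length) {
            int code = 0;
            for (int i = 0; i < length; i++) {
                code = (code << 8) | (data[offset + i] & 0xFF);
            }
            return codeToText.get(code);
        }
    }

    /** Mirrors the quoted loop: try one byte, fall back to two bytes on a miss. */
    static String decodeGuessing(SketchFont font, byte[] string) {
        StringBuilder out = new StringBuilder();
        int codeLength = 1;
        for (int i = 0; i < string.length; i += codeLength) {
            codeLength = 1;
            String c = font.encode(string, i, codeLength);
            if (c == null && i + 1 < string.length) {
                codeLength++; // maybe a multibyte encoding
                c = font.encode(string, i, codeLength);
            }
            if (c != null) {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Proposed behaviour: the font alone decides the code length. */
    static String decodeByFont(SketchFont font, byte[] string) {
        int codeLength = font.twoByteCodes ? 2 : 1;
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < string.length; i += codeLength) {
            if (i + codeLength > string.length) {
                out.append(MISSING); // truncated trailing code
                break;
            }
            String c = font.encode(string, i, codeLength);
            out.append(c != null ? c : MISSING);
        }
        return out.toString();
    }

    static SketchFont unicodeFont() {
        SketchFont f = new SketchFont(true);
        f.codeToText.put(0x0020, " ");
        f.codeToText.put(0x0041, "A");
        f.codeToText.put(0x2020, "\u2020");
        return f;
    }

    static SketchFont eightBitFont() {
        SketchFont f = new SketchFont(false);
        f.codeToText.put(0x41, "A");
        f.codeToText.put(0x43, "C");
        return f;
    }

    public static void main(String[] args) {
        byte[] s4141 = {0x41, 0x41};
        byte[] s2020 = {0x20, 0x20};
        System.out.println("[" + decodeGuessing(unicodeFont(), s4141) + "]");
        System.out.println("[" + decodeByFont(unicodeFont(), s4141) + "]");
        System.out.println("[" + decodeGuessing(unicodeFont(), s2020) + "]");
        System.out.println("[" + decodeByFont(unicodeFont(), s2020) + "]");
    }
}
```

With the two-byte font, the guessing loop turns <4141> into 'AA' and <2020> into
two spaces, while the font-driven loop yields the invalid character marker and
u2020 respectively. With the 8-bit font, the font-driven loop decodes <4343> as
'CC'.
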
> The Adobe documentation says the following about 8-bit or 16-bit values in
> strings, for example:
>
> "When the current font is a Type 0 font whose Encoding entry is Identity-H or
> Identity-V, the string to be shown shall contain pairs of bytes representing
> CIDs, high-order byte first. When the current font is a CIDFont, the string
> to be shown shall contain pairs of bytes representing CIDs, high-order byte
> first. When the current font is a Type 2 CIDFont in which the CIDToGIDMap
> entry is Identity and if the TrueType font is embedded in the PDF file, the
> 2-byte CID values shall be identical glyph indices for the glyph descriptions
> in the TrueType font program."
>
> I guess it has to be determined from this information whether the string is
> 8 or 16 bits.
>
> In my example PDF files the Type 0 font always has Identity-H set as its
> encoding, so the strings have to be encoded and decoded as pure 16-bit
> strings.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira