Andrew Douglas Pitonyak wrote:
> On 12/03/2013 11:27 AM, Herbert Duerr wrote:
>> If you have an ASCII string then you can directly print it in an UTF-8
>> locale. No conversion needed. Also the inverse is true: if that string
>> was encoded as UTF-8 then you can print it directly in an ASCII
>> compatible locale. No conversion needed for the output. The result
>> would be exactly the same.
>> [...]
> I would have said that the ASCII values from 0 to 127 are the same for
> UTF-8, but, ASCII values greater than 127 are a problem.

As ASCII only defines codepoints from 0 to 127, any codepoint greater
than or equal to 128 would be a problem. It is undefined and "illegal".
There are many other byte encodings that are compatible with ASCII
because they share ASCII's codepoints 0 to 127. UTF-8 is one of them.

But the example Andre gave was for the specific RTL_TEXTENCODING_ASCII
encoding, which means pure ASCII: that encoding target limits a
conversion to ASCII's real 0..127 core and maps any codepoint greater
than 127 to a replacement character, which defaults to a question mark.

> I recently had a problem with that when a document contained ASCII 160, a
> non-breaking space.

"no-break space" is defined as codepoint 160 == 0x00A0 in some encodings
such as iso-8859-*, CP1252, etc. [1], but it is not available in pure
ASCII. Using the RTL_TEXTENCODING_ASCII mapping on a string containing it
would have eliminated it too (replaced by the question mark).

[1] http://en.wikipedia.org/wiki/Non-breaking_space#Encodings

> I became aware of it when I was asked "hey, why does
> this file look different after it was converted to UTF-8?"

The real problem apparently was that the original file was not ASCII.

Herbert
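
For anyone who wants to see the three points above in action, here is a
minimal sketch in Python. It is only an analogue: Python's built-in
"ascii" codec with errors="replace" stands in for a conversion to
RTL_TEXTENCODING_ASCII, and it is an assumption that the two behave alike
in every detail. The helper name to_pure_ascii is made up for this
illustration.

```python
# Illustration only: Python's built-in codecs stand in for the
# RTL_TEXTENCODING_ASCII conversion discussed above (an assumption,
# not the OpenOffice converter itself).

def to_pure_ascii(text: str) -> bytes:
    """Encode to 7-bit ASCII, replacing any codepoint above 127 with '?'."""
    return text.encode("ascii", errors="replace")

# 1. Pure ASCII text has identical bytes in ASCII and in UTF-8.
plain = "hello"
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

# 2. Byte 160 (0xA0) is a no-break space in ISO-8859-1, not in ASCII.
nbsp = b"\xa0".decode("iso-8859-1")   # gives U+00A0
nbsp_utf8 = nbsp.encode("utf-8")      # two bytes in UTF-8: 0xC2 0xA0

# 3. Forcing that character into pure ASCII loses it.
lossy = to_pure_ascii("a" + nbsp + "b")

print(same_bytes, nbsp_utf8, lossy)   # True b'\xc2\xa0' b'a?b'
```

So a file holding byte 160 is not ASCII to begin with. Read as
ISO-8859-1 and written as UTF-8, that byte becomes two bytes, which
would explain a file "looking different" after the conversion.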
