Re: Improvements of OUString

Herbert Duerr Tue, 03 Dec 2013 23:05:55 -0800

Andrew Douglas Pitonyak wrote:
> On 12/03/2013 11:27 AM, Herbert Duerr wrote:
>> If you have an ASCII string then you can directly print it in a UTF-8
>> locale. No conversion needed. Also the inverse is true: if that string
>> was encoded as UTF-8 then you can print it directly in an ASCII
>> compatible locale. No conversion needed for the output. The result
>> would be exactly the same.
>> [...]
> I would have said that the ASCII values from 0 to 127 are the same for
> UTF-8, but, ASCII values greater than 127 are a problem.


As ASCII only defines codepoints from 0 to 127, any codepoint greater
than or equal to 128 would be a problem. Such a value is undefined in
ASCII and therefore "illegal".

There are many other byte encodings that are compatible with ASCII
because they share ASCII's codepoints 0 to 127. UTF-8 is one of them.
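
As a rough illustration of that compatibility, here is a small Python
sketch. Python's built-in codecs only stand in for the encodings
mentioned here; the helper name and the sample list are made up for
this example and have nothing to do with the OUString API.

```python
# A small, non-exhaustive sample of encodings that share ASCII's 0..127.
ASCII_COMPATIBLE = ["ascii", "utf-8", "iso-8859-1", "cp1252"]

def decodes_identically(data: bytes) -> bool:
    """Return True if data decodes to the same text under every sample encoding.

    Raises UnicodeDecodeError if data contains a byte that one of the
    encodings (for example pure ASCII) does not define.
    """
    results = {data.decode(enc) for enc in ASCII_COMPATIBLE}
    return len(results) == 1

all_ascii = bytes(range(128))          # every byte value that ASCII defines
print(decodes_identically(all_ascii))  # True
```

For bytes limited to 0..127 the choice of decoder makes no difference,
which is why no conversion is needed in either direction.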

But the example Andre gave was for the specific RTL_TEXTENCODING_ASCII
encoding, which means pure ASCII: that encoding target limits a
conversion to ASCII's real 0..127 core and maps any codepoint greater
than 127 to a replacement character, which defaults to a question mark.
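
The same kind of lossy mapping can be sketched in Python. This is only
an analogy and not the rtl text conversion code: Python's "ascii" codec
with errors="replace" happens to use a question mark as its replacement
as well, and the function name is hypothetical.

```python
def to_pure_ascii(text: str) -> bytes:
    """Encode text as 7-bit ASCII; every codepoint above 127 becomes b'?'."""
    return text.encode("ascii", errors="replace")

print(to_pure_ascii("na\u00efve"))  # b'na?ve' (U+00EF is not ASCII)
print(to_pure_ascii("plain"))       # b'plain' (unchanged)
```

Text that is already within 0..127 passes through untouched; anything
else is lost and cannot be recovered from the result.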

> I recently had a problem with that when a document contained ASCII 160, a
> non-breaking space.

"no-break space" is defined as codepoint 160==0x00A0 in some encodings
such as iso-8859-*, CP1252, etc. [1], but it is not available in pure
ASCII. Using RTL_TEXTENCODING_ASCII mapping on a string containing it
would have eliminated it too.

[1] http://en.wikipedia.org/wiki/Non-breaking_space#Encodings
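
To make the byte 160 case concrete, here is a short Python sketch. The
sample bytes are invented, and it assumes the original data was
iso-8859-1, which may not match the actual file in question.

```python
raw = b"10\xa0kg"                    # byte 160 between "10" and "kg"

text = raw.decode("iso-8859-1")      # byte 0xA0 becomes U+00A0 NO-BREAK SPACE
as_utf8 = text.encode("utf-8")       # U+00A0 needs two bytes in UTF-8
as_ascii = text.encode("ascii", errors="replace")  # not representable in ASCII

print(as_utf8)   # b'10\xc2\xa0kg'
print(as_ascii)  # b'10?kg'
```

The character survives a conversion to UTF-8, although as a different
byte sequence, while a pure ASCII target cannot keep it.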

> I became aware of it when I was asked "hey, why does
> this file look different after it was converted to UTF-8?"

The real problem apparently was that the original file was not ASCII.
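
A quick way to check such a file is to look for bytes above 127. A
minimal Python sketch follows; the function name is made up for
illustration.

```python
def non_ascii_offsets(data: bytes) -> list:
    """Return the offsets of all bytes that pure ASCII does not define."""
    return [i for i, byte in enumerate(data) if byte > 127]

print(non_ascii_offsets(b"10\xa0kg"))  # [2]
print(non_ascii_offsets(b"plain"))     # []
```

An empty result means the data really is ASCII and can be treated as
UTF-8 without any conversion.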

Herbert
