Re: [Pharo-project] String input not in UTF-8

Henrik Johansen Fri, 05 Aug 2011 07:23:58 -0700

On Aug 5, 2011, at 3:41 54PM, Hilaire Fernandes wrote:

> On 05/08/2011 13:28, Henrik Johansen wrote:
>> 
>> On Aug 5, 2011, at 1:14 35PM, Hilaire Fernandes wrote:
>> 
>>> It seems that accented characters, when typed in, are not in UTF-8 by
>>> default.
>>> Is that the case with Pharo 1.3?
>>> 
>>> Hilaire
>>> 
>>> 
>>> -- 
>>> Education 0.2 -- http://blog.ofset.org/hilaire
>> 
>> I'm not sure what you mean.
>> When in image, all the way from InputEvents to String representation, you 
>> only deal with Unicode codePoints.
> 
> It seems they are 8-bit chars; when exported through XMLParser, the result
> is an 8-bit string. I need to investigate further.
> 
> Hilaire
It is an 8-bit character, since the codePoint fits in one byte (see a).
An accented character like é can be represented as either:
a) One Unicode codepoint: U+00E9 (decimal 233), small e with acute.
b) Two Unicode codepoints: U+0065 (decimal 101), small e, followed by
U+0301 (decimal 769), combining acute accent.


Internally, you'd see strings whose character values correspond to the
decimal values listed above, i.e. the Unicode codePoints.
b) would be a WideString, as 769 does not fit in a byte.
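
To make the "fits in a byte" point concrete, here is a minimal sketch in
Python (used purely for illustration, it is not the Pharo API; the helper
names code_points and fits_in_bytes are made up for this example). It lists
the codePoints of both forms and checks whether one byte per character would
be enough:

```python
# Two ways to write "é": precomposed vs. base letter plus combining accent.
precomposed = "\u00e9"   # form a): a single code point
decomposed = "e\u0301"   # form b): small e followed by combining acute


def code_points(text):
    """Return the decimal Unicode code point of each character."""
    return [ord(ch) for ch in text]


def fits_in_bytes(text):
    """True if every code point is below 256, so one byte per character suffices."""
    return all(ord(ch) < 256 for ch in text)


print(code_points(precomposed), fits_in_bytes(precomposed))
print(code_points(decomposed), fits_in_bytes(decomposed))
```

Form a) stays within the byte range, while 769 in form b) is what would
force the wider string representation.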

However, if correctly converted to UTF-8, their representations should be:
a) 2 bytes: 16r C3A9
b) 3 bytes: 16r 65 CC81 (the e as 65 first, then U+0301 encoded as CC 81)
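
The byte sequences can be checked independently of Pharo. A small Python
sketch (illustration only; utf8_hex is a hypothetical helper name) that
encodes both forms to UTF-8 and prints the bytes as hex:

```python
import unicodedata

precomposed = "\u00e9"
# NFD normalization splits é into small e followed by the combining acute accent.
decomposed = unicodedata.normalize("NFD", precomposed)


def utf8_hex(text):
    """Encode text as UTF-8 and return the bytes as an uppercase hex string."""
    return text.encode("utf-8").hex().upper()


print(utf8_hex(precomposed))  # C3A9
print(utf8_hex(decomposed))   # 65CC81
```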

I.e., it seems XMLParser does not encode it properly to UTF-8 when exporting.
Note: single-byte output is perfectly legal if the document contains an
encoding declaration specifying a one-byte encoding like iso-8859-1 or
windows-1252
(i.e. it starts with <?xml version="1.0" encoding="windows-1252" ?> or some such).
Absent such a declaration, or a BOM indicating another Unicode encoding,
it is a bug.
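
A rough way to see why the declaration matters, sketched in Python with its
standard xml.etree.ElementTree parser (not Pharo's XMLParser; the element
name a and the helper parsed_text are assumptions for this example). The
same single byte E9 is accepted when iso-8859-1 is declared and rejected
when the parser has to assume UTF-8:

```python
import xml.etree.ElementTree as ET

LATIN1_E_ACUTE = b"\xe9"  # é as a single byte in iso-8859-1

# Same payload, with and without an encoding declaration.
declared = (b'<?xml version="1.0" encoding="iso-8859-1"?><a>'
            + LATIN1_E_ACUTE + b"</a>")
undeclared = b"<a>" + LATIN1_E_ACUTE + b"</a>"
# Properly UTF-8 encoded payload without a declaration (UTF-8 is the default).
utf8_doc = b"<a>" + "\u00e9".encode("utf-8") + b"</a>"


def parsed_text(document):
    """Return the root element's text, or None if the bytes do not parse."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None


for doc in (declared, undeclared, utf8_doc):
    print(parsed_text(doc))
```

Only the undeclared single-byte document fails to parse, which matches the
case described as a bug above.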

Cheers,
Henry
