On Aug 5, 2011, at 3:41 54PM, Hilaire Fernandes wrote:

> On 05/08/2011 13:28, Henrik Johansen wrote:
>>
>> On Aug 5, 2011, at 1:14 35PM, Hilaire Fernandes wrote:
>>
>>> It seems that when inputting accented characters, they are not in
>>> UTF-8 by default.
>>> Is that the case with Pharo 1.3?
>>>
>>> Hilaire
>>>
>>> --
>>> Education 0.2 -- http://blog.ofset.org/hilaire
>>
>> I'm not sure what you mean.
>> When in image, all the way from InputEvents to String representation, you
>> only deal with Unicode codePoints.
>
> It seems it is 8-bit chars; when exported through XMLParser, it is an
> 8-bit string. I need to investigate further.
>
> Hilaire

It is an 8-bit character, since the codePoint fits in one byte (see a).

Accented characters like é could be either:
a) One Unicode codepoint: U+00E9 (decimal 233), small e with acute.
b) Two Unicode codepoints: U+0065 (decimal 101), small e, followed by U+0301 (decimal 769), combining acute accent.

Internally, you'd see strings with character values corresponding to those listed as decimal, i.e. the Unicode codePoints. b) would be a WideString, as 769 does not fit in a byte.

However, if correctly converted to UTF-8, their representations should be:
a) 2 bytes: 16r C3 A9
b) 3 bytes: 16r 65 CC 81

I.e., it seems XMLParser does not encode it properly to UTF-8 when exporting.

Note: This is perfectly legal if the document contains an encoding attribute specifying a one-byte encoding like iso-8859-1 or windows-1252 (i.e. it starts with <?xml version="1.0" encoding="windows-1252" ?> or some such). Absent such an attribute, or a BOM indicating another Unicode encoding, though, it is a bug.

Cheers,
Henry
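
For anyone who wants to cross-check the code points and byte sequences discussed above outside the image, here is a minimal sketch in Python (used only for illustration; it says nothing about how Pharo or XMLParser do the conversion). The variable names are made up for this example.

```python
import unicodedata

# a) one code point: U+00E9, small e with acute
precomposed = "\u00e9"
# b) two code points: U+0065 (small e) followed by U+0301 (combining acute)
decomposed = "e\u0301"

def describe(text):
    """Return (decimal code points, UTF-8 bytes as spaced hex) for text."""
    code_points = [ord(ch) for ch in text]
    utf8_hex = " ".join(f"{b:02X}" for b in text.encode("utf-8"))
    return code_points, utf8_hex

print(describe(precomposed))  # ([233], 'C3 A9')
print(describe(decomposed))   # ([101, 769], '65 CC 81')

# In a one-byte encoding such as iso-8859-1, form a) is the single byte E9,
# which is why an "8-bit string" is legal only when the XML declaration
# names such an encoding.
latin1_bytes = precomposed.encode("iso-8859-1")
print(latin1_bytes.hex().upper())  # E9

# Both forms are canonically equivalent: NFC composes b) into a).
composed = unicodedata.normalize("NFC", decomposed)
print(composed == precomposed)  # True
```

If an exported file contains the lone byte E9 for é and declares no one-byte encoding, that matches the bug described above.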
