Re: [Pharo-project] String input not in UTF-8

Stéphane Ducasse Sat, 06 Aug 2011 02:43:50 -0700

On Aug 5, 2011, at 4:41 PM, Hilaire Fernandes wrote:

> I had a look at the latest XMLParser, but the API is different and a lot
> of my code is now broken. Is XMLWriter class>>on: obsolete? It nags me
> about that, but the class and method are still there; is it a Monticello
> trick I have forgotten about?
> I don't even know how to port to the new API. Is there a porting guide?
> I guess this is for the better, but still frustrating and distracting
> from the main task...


Indeed.
We should really invest in some of the main packages.
For example, I worked on SOUP to add comments and new tests.
Right now we (the core) do not have the energy to work on both the core and
external packages.
I hope that will change once the core is fixed.

> 
> 
> 
> Le 05/08/2011 16:23, Henrik Johansen a écrit :
>> 
>> On Aug 5, 2011, at 3:41 54PM, Hilaire Fernandes wrote:
>> 
>>> Le 05/08/2011 13:28, Henrik Johansen a écrit :
>>>> 
>>>> On Aug 5, 2011, at 1:14 35PM, Hilaire Fernandes wrote:
>>>> 
>>>>> It seems that accented characters are not in UTF-8 by default when
>>>>> they are input.
>>>>> Is that the case with Pharo 1.3?
>>>>> 
>>>>> Hilaire
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Education 0.2 -- http://blog.ofset.org/hilaire
>>>> 
>>>> I'm not sure what you mean.
>>>> Inside the image, all the way from InputEvents to the String
>>>> representation, you only deal with Unicode codePoints.
>>> 
>>> It seems they are 8-bit chars; when exported through XMLParser, the
>>> result is an 8-bit string. I need to investigate further.
>>> 
>>> Hilaire
>> It is an 8-bit character, since the codePoint fits in one byte (see a).
>> Accented characters like é could be either:
>> a) One Unicode codepoint (U+00E9 (decimal 233), small e with acute)
>> b) Two Unicode codepoints (U+0065 (decimal 101), small e, followed by
>> U+0301 (decimal 769), combining acute accent).
>> 
>> Internally, you'd see strings with character values corresponding to those 
>> listed as decimal, i.e. the Unicode codePoints.
>> b) would be a WideString, as 769 does not fit in a byte.
>> 
>> However, if correctly converted to UTF-8, their representations should be:
>> a) 2 bytes: 16r C3A9
>> b) 3 bytes: 16r 65 CC81
>> 
>> I.e. it seems XMLParser does not encode it properly as UTF-8 when exporting.
>> Note: This is perfectly legal if the document contains an encoding attribute 
>> specifying a one-byte encoding like iso-8859-1 or windows-1252.
>> (starts with <?xml version="1.0" encoding="windows-1252" ?> or some such)
>> Absent such an attribute, or a BOM indicating another Unicode encoding 
>> though, it is a bug.
>> 
>> Cheers,
>> Henry
>> 
>> 
>> 
> 
> 
> -- 
> Education 0.2 -- http://blog.ofset.org/hilaire
> 
>
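
For anyone who wants to check the byte counts discussed above outside the
image, here is a minimal sketch in Python (not Pharo), using only the standard
library. The helper names utf8_hex and fits_in_byte_string are made up for
this illustration; the second one only approximates the ByteString versus
WideString distinction by testing for code points below 256.

```python
import unicodedata

# Precomposed form: a single code point, U+00E9.
precomposed = "\u00e9"
# Decomposed form: base letter U+0065 followed by combining acute U+0301.
decomposed = "e\u0301"


def utf8_hex(text):
    """Return the UTF-8 encoding of text as an uppercase hex string."""
    return text.encode("utf-8").hex().upper()


def fits_in_byte_string(text):
    """True if every code point is below 256, so each fits in one byte."""
    return all(ord(ch) < 256 for ch in text)


for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
    code_points = [ord(ch) for ch in text]
    print(label, code_points, utf8_hex(text), fits_in_byte_string(text))

# A one-byte encoding such as ISO-8859-1 writes U+00E9 as the single byte E9.
# Finding E9 instead of C3 A9 in a file that claims to be UTF-8 is the
# symptom described in the thread.
print(precomposed.encode("latin-1").hex().upper())

# The two forms are canonically equivalent: NFC composes them to U+00E9.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

If the exported file contains a lone E9 byte for é and has no encoding
declaration naming a one-byte encoding, the string was written out without
being converted to UTF-8.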
