On 25 Sep 2014, at 8:55, Alain Rastoul <alf.mmm....@gmail.com> wrote:

> On 25/09/2014 07:23, Sven Van Caekenberghe wrote:
>> 
>> On 25 Sep 2014, at 01:04, Alain Rastoul <alf.mmm....@gmail.com> wrote:
>> 
>>> On 25/09/2014 00:06, Sven Van Caekenberghe wrote:
>>>> Alain,
>>> 
>>>> The character encoding situation in Pharo is pretty good actually. The 
>>>> only problem is that there is some old school code left that encodes 
>>>> strings into strings, but today you can easily write much better and 
>>>> conceptually correct code.
>>>> 
>>>> You could have a look at this draft chapter of the upcoming 'Enterprise 
>>>> Pharo' book that I am currently writing:
>>>> 
>>>>   http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/
>>>> 
>>>> Concerning file system paths, FilePathEncoder and FilePluginPrimitives 
>>>> already do the right thing.
>>>> 
>>>> Now, your idea about using UTF-8 to represent internal Strings is 
>>>> something that has been discussed before and in many other languages as 
>>>> well. The short answer is that due to it being variable length, the 
>>>> inefficiency is (probably) just too high. Simple indexed access becomes a 
>>>> problem, let alone more complex string manipulations. I am not saying that 
>>>> it cannot be done, I think it is just not worth the trouble. The current 
>>>> solution in Pharo with ByteString and WideString is quite nice (check the 
>>>> chapter I mentioned before).
>>>> 
>>>> Sven
>>>> 
>>> Very interesting!
>>> It seems that most of what I was saying is already here :)
>>> I was not saying that Pharo should use UTF-8 (I mentioned UTF-8 because it 
>>> is a standard, but I find the variable-length encoding very weird); I was 
>>> rather talking about using WideString in UTF-16 or UTF-32, and that's done.
>>> I saw asWideString but didn't know about automatic conversion, the codepoint 
>>> selector, or internal wide string support.
>>> Does it mean that Greek Pharo users (for example) use WideString for 
>>> Strings without having to specify it or make explicit conversions (except 
>>> of course when dealing with bytes if they want to)?
>>> If yes, very good, the job is almost done :)
>>> (Personally, I would also deprecate ByteString and get rid of it, just my 
>>> opinion.)
>>> Thanks for the link, another good chapter.
>>> 
>>> Regards,
>>> 
>>> Alain
>> 
>> ByteString is important because it is an optimization of the most common 
>> case.
> 
> I understand the point here: memory/data footprint, CPU cache and so on (not 
> talking about encoding/decoding).
> I think that's why Microsoft chose UTF-16 (formerly UCS-2) as a middle 
> solution, because it covers most character sets with 2 bytes.

It used to be a middle solution, back when UCS-2 could encode the entire defined 
Unicode set.
Nowadays it's just the worst of both worlds: you waste memory for most normal 
text, *and* you don't have constant-time indexed code point access, since code 
points outside the BMP take two 16-bit units.
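
To make the indexing point concrete, here is a minimal Python sketch (not Pharo code; the helper names utf16_units and code_point_at are made up for illustration). It shows that a string with one character outside the BMP has more 16-bit units than code points, so finding the n-th code point means scanning from the start:

```python
def utf16_units(text):
    """Return the 16-bit code units of `text` encoded as UTF-16 (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_point_at(units, index):
    """Return code point number `index` by walking the units from the start.

    This is linear in the length of the string: a surrogate pair occupies
    two units, so the unit offset of a code point cannot be computed directly.
    """
    pos = 0
    seen = 0
    while pos < len(units):
        unit = units[pos]
        if 0xD800 <= unit <= 0xDBFF and pos + 1 < len(units):
            # High surrogate followed by a low surrogate: one code point, two units.
            low = units[pos + 1]
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            width = 2
        else:
            code_point = unit
            width = 1
        if seen == index:
            return code_point
        seen += 1
        pos += width
    raise IndexError(index)


units = utf16_units("a\U0001F600b")   # 3 code points, 4 code units
third = code_point_at(units, 2)       # the "b", found only after scanning
```

Indexing units[2] directly would land on the second half of the surrogate pair rather than on the third character.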

The duality we have in Pharo is an attempt to achieve the *best* of both 
worlds, wasting little memory for the "normal" case (latin1) while maintaining 
constant-time indexed access in all cases.
The ultimate solution for this approach would be a trio of string classes 
with slot sizes 8 - 16 - 32, expanding / contracting as needed, but we don't 
have classes with variable short slots. (They are currently planned for the new 
Cog, if I've understood Eliot's new object format correctly.)
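
As a rough illustration of the 8 - 16 - 32 idea, here is a small Python sketch (again not Pharo code, and not how ByteString or WideString are implemented; slot_width, pack and slot_at are hypothetical names). It picks the narrowest fixed slot size that fits every code point in a string, so indexed access stays a simple offset calculation:

```python
def slot_width(text):
    """Return the narrowest slot size in bits (8, 16 or 32) that fits every code point."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 8
    if highest <= 0xFFFF:
        return 16
    return 32


def pack(text):
    """Store `text` as fixed-width slots; returns (width_in_bits, raw_bytes)."""
    width = slot_width(text)
    size = width // 8
    return width, b"".join(ord(ch).to_bytes(size, "little") for ch in text)


def slot_at(packed, index):
    """Constant-time indexed access: the byte offset is just index * slot size."""
    width, data = packed
    size = width // 8
    if not 0 <= index < len(data) // size:
        raise IndexError(index)
    return int.from_bytes(data[index * size:(index + 1) * size], "little")


latin = pack("hello")            # 8-bit slots, 5 bytes
greek = pack("αβγ")              # 16-bit slots, 6 bytes
wide = pack("a\U0001F600b")      # 32-bit slots, 12 bytes
```

"Expanding" would then mean repacking into wider slots when a character that does not fit is stored, which this sketch leaves out.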

Cheers,
Henry
