On 25/09/2014 07:23, Sven Van Caekenberghe wrote:

On 25 Sep 2014, at 01:04, Alain Rastoul <alf.mmm....@gmail.com> wrote:

On 25/09/2014 00:06, Sven Van Caekenberghe wrote:
Alain,

The character encoding situation in Pharo is pretty good actually. The only 
problem is that there is some old school code left that encodes strings into 
strings, but today you can easily write much better and conceptually correct 
code.

You could have a look at this draft chapter of the upcoming 'Enterprise Pharo' 
book that I am currently writing:

   http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/

Concerning file system paths, FilePathEncoder and FilePluginPrimitives already 
do the right thing.

Now, your idea about using UTF-8 to represent internal Strings is something 
that has been discussed before and in many other languages as well. The short 
answer is that due to it being variable length, the inefficiency is (probably) 
just too high. Simple indexed access becomes a problem, let alone more complex 
string manipulations. I am not saying that it cannot be done, I think it is 
just not worth the trouble. The current solution in Pharo with ByteString and 
WideString is quite nice (check the chapter I mentioned before).
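
To make the indexed-access point concrete, here is a minimal sketch in Python (not Pharo, and not how Pharo implements its strings). The helper names utf8_index and utf32_index are made up for this illustration. With a variable-length encoding, finding the n-th character means scanning from the start; with a fixed-width encoding it is a simple offset computation.

```python
# Illustration only: Python stands in for Pharo; helper names are hypothetical.

def utf8_index(data: bytes, index: int) -> str:
    """Return the character at `index` by scanning UTF-8 bytes from the start (O(n))."""
    if index < 0:
        raise IndexError(index)
    count = -1      # index of the code point currently being scanned
    start = None    # byte offset where the wanted code point begins
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a new code point starts here
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")  # wanted code point was the last one


def utf32_index(data: bytes, index: int) -> str:
    """Return the character at `index` by direct offset: 4 bytes per character (O(1))."""
    if not 0 <= index < len(data) // 4:
        raise IndexError(index)
    return data[4 * index:4 * index + 4].decode("utf-32-le")


text = "a\u03bb\u8a9e!"            # 'a', Greek lambda, a CJK character, '!'
utf8 = text.encode("utf-8")        # 1 + 2 + 3 + 1 = 7 bytes
utf32 = text.encode("utf-32-le")   # 4 * 4 = 16 bytes

print(utf8_index(utf8, 2), utf32_index(utf32, 2))
```

Both functions return the same character; the difference is that the UTF-8 version has to walk the bytes to find it.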

Sven

Very interesting!
It seems that most of what I was saying is already here :)
I was not saying that Pharo should use UTF-8 (I mentioned UTF-8 because it is a 
standard, but I find the variable-length encoding very weird); I was rather 
talking of using WideString in UTF-16 or UTF-32, and that's done.
I saw asWideString but didn't know about automatic conversion, the codepoint 
selector, or internal wide string support.
Does it mean that Greek Pharo users (for example) use WideString for Strings 
without having to specify it or make explicit conversions (except of course 
when dealing with bytes if they want to)?
If yes, very good, the job is almost done :)
(Personally I would also deprecate ByteString and get rid of it, just my 
opinion.)
Thanks for the link, another good chapter.

Regards,

Alain

ByteString is important because it is an optimization of the most common case.

I understand the point here: memory/data footprint, CPU cache and so on (not talking of encoding/decoding). I think that's why Microsoft chose UTF-16 (formerly UCS-2) as a middle solution, because it covers most character sets with 2 bytes. Maybe I'm going too far, but I have my reasons: I once had to debug a French program used in China by a Chinese user who was seeing "weird" characters on a (weird-to-me) Chinese Windows XP ... a missing WideString and a great moment of loneliness :) As a normal user you should only think of abstract Strings and never use #asByteString (but use proper encoding).
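
The footprint trade-off can be sketched roughly as follows, again in Python rather than Pharo. Latin-1 is used here only as a loose stand-in for a one-byte-per-character string like ByteString, and UTF-32 as a stand-in for 32-bit-per-character storage like WideString; the function name encoded_sizes is made up for this illustration.

```python
# Illustration only: codecs are stand-ins, not Pharo's actual string classes.

def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` per codec, or None if it cannot be encoded."""
    sizes = {}
    for codec in ("latin-1", "utf-16-le", "utf-32-le"):
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            sizes[codec] = None  # not representable at one byte per character
    return sizes


french = "d\u00e9j\u00e0 vu"   # 7 characters, all within Latin-1
chinese = "\u4e2d\u6587"       # 2 characters, outside Latin-1
emoji = "\U0001F600"           # 1 character outside the Basic Multilingual Plane

for sample in (french, chinese, emoji):
    print(repr(sample), encoded_sizes(sample))
```

The French sample fits in 7 bytes at one byte per character but takes 28 bytes at four bytes per character, which is the common case the compact representation optimizes. The Chinese sample cannot be stored at one byte per character at all. The last sample takes 4 bytes in UTF-16 as well, because characters outside the Basic Multilingual Plane need a surrogate pair, so UTF-16 is not truly fixed-width either.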

Feedback on the chapter is always welcome.

Sven

Agree.
Your chapter is excellent, I played a bit with Zn encoders.
I look forward to Pharo for the enterprise on Lulu.

However, I'm wondering: WideString being a variableWordSubclass: with 32-bit words on a 32-bit VM, what will it become on a 64-bit VM? 32-bit words or 64-bit words? Immediate characters (as seen on Clément Bera's blog about Spur and the new object format)?

Alain

