On 25/09/2014 07:23, Sven Van Caekenberghe wrote:
On 25 Sep 2014, at 01:04, Alain Rastoul <alf.mmm....@gmail.com> wrote:
On 25/09/2014 00:06, Sven Van Caekenberghe wrote:
Alain,
The character encoding situation in Pharo is pretty good actually. The only
problem is that there is some old school code left that encodes strings into
strings, but today you can easily write much better and conceptually correct
code.
You could have a look at this draft chapter of the upcoming 'Enterprise Pharo'
book that I am currently writing:
http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/
Concerning file system paths, FilePathEncoder and FilePluginPrimitives already
do the right thing.
Now, your idea about using UTF-8 to represent internal Strings is something
that has been discussed before and in many other languages as well. The short
answer is that due to it being variable length, the inefficiency is (probably)
just too high. Simple indexed access becomes a problem, let alone more complex
string manipulations. I am not saying that it cannot be done, I think it is
just not worth the trouble. The current solution in Pharo with ByteString and
WideString is quite nice (check the chapter I mentioned before).
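
To illustrate the indexed-access point, here is a minimal sketch in Python (not Pharo code, and not how any Pharo class is implemented). The helper names utf8_width and utf8_char_at are hypothetical. It shows that finding the n-th character in UTF-8 bytes means scanning from the start, because each character takes 1 to 4 bytes:

```python
def utf8_width(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1          # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2          # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3          # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4          # 11110xxx
    raise ValueError("not a UTF-8 lead byte")


def utf8_char_at(data: bytes, index: int) -> str:
    """Return the character at a zero-based index by walking the bytes: O(n)."""
    pos = 0
    for _ in range(index):
        pos += utf8_width(data[pos])   # skip one whole character
    width = utf8_width(data[pos])
    return data[pos:pos + width].decode("utf-8")


# Five characters, encoded in 1, 2, 3, 4 and 1 bytes respectively.
data = "a\u00e9\u20ac\U0001F600z".encode("utf-8")
print(len(data), utf8_char_at(data, 3) == "\U0001F600")
```

With a fixed-width representation (one byte or one 32-bit word per character), the same lookup is a single array access, which is the trade-off described above.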
Sven
Very interesting!
It seems that most of what I was saying is already here :)
I was not saying that Pharo should use UTF-8 (I mentioned UTF-8 because it is a
standard, but I find the variable-length encoding very weird); I was rather
talking about using WideString in UTF-16 or UTF-32, and that's done.
I saw asWideString but didn't know about the automatic conversion, the codepoint
selector, or the internal wide string support.
Does it mean that Pharo Greek users (for example) use WideString for Strings
without having to specify it or make explicit conversions (except of course
when dealing with bytes, if they want to)?
If yes, very good, the job is almost done :)
(Personally I would also deprecate ByteString and get rid of it, just my
opinion.)
Thanks for the link, another good chapter.
Regards,
Alain
ByteString is important because it is an optimization of the most common case.
I understand the point here: memory/data footprint, CPU cache and so on
(not to mention encoding/decoding).
I think that's why Microsoft chose UTF-16 (formerly UCS-2) as a middle
ground, because it covers most character sets with 2 bytes.
Maybe I'm excessive, but I have reasons: I once had to debug a French
program used in China by a Chinese user who was seeing "weird"
characters on a (weird-to-me) Chinese Windows XP ... a missing
WideString and a great moment of loneliness :)
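
The footprint trade-off mentioned above can be made concrete with a small Python sketch (again an illustration, not Pharo code; the function name footprint is hypothetical). It measures how many bytes the same text takes in a one-byte encoding, UTF-8, UTF-16 and UTF-32, and records None where the encoding cannot represent the text:

```python
def footprint(text: str) -> dict:
    """Byte size of text per encoding; None if the encoding cannot hold it."""
    sizes = {}
    for codec in ("latin-1", "utf-8", "utf-16-le", "utf-32-le"):
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            sizes[codec] = None   # e.g. Greek does not fit in Latin-1
    return sizes


ascii_text = "hello"
# An 8-letter Greek word, written with escapes to avoid ambiguity.
greek_text = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"
emoji_text = "\U0001F600"   # outside the Basic Multilingual Plane

for sample in (ascii_text, greek_text, emoji_text):
    print(footprint(sample))
```

Plain ASCII text is four times larger in 32-bit words than in bytes, which is the case a byte-sized string class optimizes. The last sample needs 4 bytes even in UTF-16 (a surrogate pair), so two bytes cover most characters but not all of them.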
As a normal user you should only think of abstract Strings and never use
#asByteString (but use proper encoding).
Feedback on the chapter is always welcome.
Sven
Agree.
Your chapter is excellent; I played a bit with the Zn encoders.
I look forward to Pharo for the enterprise on Lulu.
However, I'm wondering: WideString being a variableWordSubclass: with
32-bit words on a 32-bit VM, what will it become on a 64-bit VM? 32-bit
words or 64-bit words? Immediate characters (seen on Clément
Bera's blog about Spur and the new object format)?
Alain