Alain,

On 24 Sep 2014, at 23:00, Alain Rastoul <alf.mmm....@gmail.com> wrote:

> Le 24/09/2014 19:09, Benjamin Pollack a écrit :
> 
>> If Pharo used ByteArrays to represent paths, with convenience methods for 
>> working with
>> UTF-8 (since I do agree that's the most likely thing for a user/dev to
>> want), then you'd be able to work with all files no matter what, *and*
>> have a convenient way of doing so for the common case.
> Hi Ben,
> I strongly disagree with you on this point: using byte arrays (or byte 
> strings) is a pain in an international context.
> The OS knows about its encoding: locale for unix, code page for windows.
> Windows code pages depend on the country: for English Windows it is 1252 (similar to 
> iso-8859-1), for other European countries, other variations of 8859-xx... 
> (welcome to ISO soup); same for Unix.
> 
> Java uses UTF8 strings and dotNet uses UTF16 strings (don't know for Python) 
> where chars are not bytes and they are not used as byte arrays but as 
> Character arrays.
> Both do conversions from OS character set encoding to internal encoding for 
> strings (paths and whatever).
> 
> There is already UTF8 and UTF16 encoding support in Pharo, but the
> standard String class uses bytes, and a lot of file, directory and
> system methods use the ByteString class, and that is the problem here.
> UTF8 encoding in Pharo encodes to a variable length ByteString, which is not 
> the same as a (hypothetical) Utf8String where all (variable length) chars 
> would be utf8 encoded.
> Using a new UTF8 or UTF16 string class could be a major rework,
> but taking a decision about internal string encoding is needed.
> As Sven says, there is no emergency and you have a workaround, but
> perhaps using the existing WideString encoded as UTF16 (or UTF32?) in
> some well defined classes/methods could be a good start for this rework?
> IMHO the workaround of using utf8 encoded byte strings is not a good way to 
> deal with this problem and should not be taken as "the solution".

The character encoding situation in Pharo is pretty good actually. The only 
problem is that there is some old school code left that encodes strings into 
strings, but today you can easily write much better and conceptually correct 
code.
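
To make the "strings into strings" point concrete, here is a small sketch. It is written in Python rather than Pharo, purely as an illustration, and the function names are hypothetical. The old-school style stores each encoded byte as a character of another string; the conceptually correct style encodes a string into bytes and decodes bytes back into a string.

```python
def encode_to_string(text: str) -> str:
    """Old-school style: the UTF-8 bytes are stored as characters of a string.

    The result looks like text but is really encoded data, so it is easy
    to display it, compare it, or encode it a second time by mistake.
    """
    return text.encode("utf-8").decode("latin-1")


def encode_to_bytes(text: str) -> bytes:
    """Conceptually correct style: encoding turns a string into bytes."""
    return text.encode("utf-8")


def decode_from_bytes(data: bytes) -> str:
    """The inverse operation: decoding turns bytes back into a string."""
    return data.decode("utf-8")


if __name__ == "__main__":
    print(repr(encode_to_string("é")))   # a 2-character string of mojibake
    print(repr(encode_to_bytes("é")))    # a 2-byte byte sequence
    print(decode_from_bytes(encode_to_bytes("é")) == "é")
```

With distinct types for text and for encoded data, the type itself tells you which one you are holding, which is the distinction the old string-to-string code blurs.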

You could have a look at this draft chapter of the upcoming 'Enterprise Pharo' 
book that I am currently writing:

  http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/

Concerning file system paths, FilePathEncoder and FilePluginPrimitives already 
do the right thing.

Now, your idea about using UTF-8 to represent internal Strings is something 
that has been discussed before and in many other languages as well. The short 
answer is that due to it being variable length, the inefficiency is (probably) 
just too high. Simple indexed access becomes a problem, let alone more complex 
string manipulations. I am not saying that it cannot be done, I think it is 
just not worth the trouble. The current solution in Pharo with ByteString and 
WideString is quite nice (check the chapter I mentioned before).
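
The indexed-access cost can be sketched as follows. This is an illustrative Python function (the name `utf8_char_at` is hypothetical, and this is not how any particular language implements its strings). It fetches the character at a given position from UTF-8 encoded bytes, and it has to walk over every preceding character because each one occupies between 1 and 4 bytes.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the character at 0-based position `index` in UTF-8 `data`.

    There is no way to jump straight to the right byte offset: the width
    of every earlier character must be read first, so access is O(n).
    """
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    pos = 0    # current byte offset
    seen = 0   # number of characters skipped so far
    while pos < len(data):
        lead = data[pos]
        # The lead byte tells us how many bytes this character occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if seen == index:
            return data[pos:pos + width].decode("utf-8")
        pos += width
        seen += 1
    raise IndexError("character index out of range")


if __name__ == "__main__":
    encoded = "aé€z".encode("utf-8")   # 1 + 2 + 3 + 1 = 7 bytes
    print(utf8_char_at(encoded, 2))
```

With a fixed-width representation, the same lookup is a single offset computation, which is the trade-off described above.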

Sven

