Re: [Pharo-users] Ridiculous we are

2014-09-26 Thread Henrik Johansen

On 25 Sep 2014, at 8:55 , Alain Rastoul alf.mmm@gmail.com wrote:

 On 25/09/2014 07:23, Sven Van Caekenberghe wrote:
 
 On 25 Sep 2014, at 01:04, Alain Rastoul alf.mmm@gmail.com wrote:
 
 On 25/09/2014 00:06, Sven Van Caekenberghe wrote:
 Alain,
 
 The character encoding situation in Pharo is pretty good actually. The 
 only problem is that there is some old school code left that encodes 
 strings into strings, but today you can easily write much better and 
 conceptually correct code.
 
 You could have a look at this draft chapter of the upcoming 'Enterprise 
 Pharo' book that I am currently writing:
 
   http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/
 
 Concerning file system paths, FilePathEncoder and FilePluginPrimitives 
 already do the right thing.
 
 Now, your idea about using UTF-8 to represent internal Strings is 
 something that has been discussed before and in many other languages as 
 well. The short answer is that due to it being variable length, the 
 inefficiency is (probably) just too high. Simple indexed access becomes a 
 problem, let alone more complex string manipulations. I am not saying that 
 it cannot be done, I think it is just not worth the trouble. The current 
 solution in Pharo with ByteString and WideString is quite nice (check the 
 chapter I mentioned before).
 
 Sven
 
 Very interesting!
 It seems that most of what I was saying is already here :)
 I was not saying that Pharo should use UTF-8 (I mentioned UTF-8 because it
 is a standard, but I find the variable-length encoding very weird); I was
 rather talking about using WideString in UTF-16 or UTF-32, and that's done.
 I saw asWideString but didn't know about automatic conversion, the codepoint
 selector, and internal wide string support.
 Does it mean that Greek Pharo users (for example) use WideString for
 Strings without having to specify it or make explicit conversions (except
 of course when dealing with bytes if they want to)?
 If yes, very good, the job is almost done :)
 (Personally I would also deprecate ByteString and get rid of it, just my
 opinion.)
 Thanks for the link, another good chapter.
 
 Regards,
 
 Alain
 
 ByteString is important because it is an optimization of the most common
 case.
 
 I understand the point here: memory/data footprint, CPU cache and so on (not
 talking about encoding/decoding).
 I think that's why Microsoft chose UTF-16 (formerly UCS-2) as a middle
 solution, because it covers most character sets with 2 bytes.

It used to be a middle solution, back when UCS-2 could encode the entire defined
Unicode set.
Nowadays it's just the worst of both worlds; you waste memory for most normal
text, *and* you don't have constant-time indexed code point access.

The duality we have in Pharo is an attempt to achieve the *best* of both
worlds: wasting little memory for the normal case (Latin-1), while maintaining
constant-time indexed access in all cases.
The ultimate solution for this approach would have a trio of string classes
with slot sizes 8 - 16 - 32, expanding / contracting as needed, but we don't
have classes with variable short slots. (Currently, they're planned in the new
Cog, if I've understood Eliot's new object format correctly.)

Cheers,
Henry
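
The trade-off Henry describes can be sketched outside Pharo. What follows is a minimal Python illustration, not Pharo code; `slot_width` and `utf16_units` are hypothetical helper names. It picks the narrowest fixed slot size (8, 16 or 32 bits) that keeps every code point in one slot, and counts UTF-16 code units to show where UTF-16 stops being one unit per code point.

```python
def slot_width(text: str) -> int:
    """Narrowest fixed slot size, in bits, that holds every code point of text."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:        # Latin-1 range, the ByteString case
        return 8
    if highest <= 0xFFFF:      # Basic Multilingual Plane
        return 16
    return 32                  # anything beyond the BMP


def utf16_units(text: str) -> int:
    """Number of 16-bit code units text occupies when encoded as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


samples = ("ete", "\u00e9t\u00e9", "\u0395\u03bb\u03bb", "\U0001F600")
for sample in samples:
    # columns: slot width in bits, number of code points, number of UTF-16 units
    print(slot_width(sample), len(sample), utf16_units(sample))
```

The last sample is a single code point outside the BMP: it needs a 32-bit slot, and in UTF-16 it takes two code units, which is why indexing UTF-16 by code unit is not the same as indexing by code point.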




Re: [Pharo-users] Ridiculous we are

2014-09-26 Thread stepharo

I'm not an expert and I would like to know what people think.
But I think that we should consider

- the impact of the new Spur object format. I would like to have
Unicode and clean up the leadChar


Stef



Re: [Pharo-users] Ridiculous we are

2014-09-26 Thread stepharo

Sven I love this chapter.
I will read it calmly now.

Stef

On 25/9/14 07:23, Sven Van Caekenberghe wrote:

On 25 Sep 2014, at 01:04, Alain Rastoul alf.mmm@gmail.com wrote:


On 25/09/2014 00:06, Sven Van Caekenberghe wrote:

Alain,
The character encoding situation in Pharo is pretty good actually. The only 
problem is that there is some old school code left that encodes strings into 
strings, but today you can easily write much better and conceptually correct 
code.

You could have a look at this draft chapter of the upcoming 'Enterprise Pharo' 
book that I am currently writing:

   http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/

Concerning file system paths, FilePathEncoder and FilePluginPrimitives already 
do the right thing.

Now, your idea about using UTF-8 to represent internal Strings is something 
that has been discussed before and in many other languages as well. The short 
answer is that due to it being variable length, the inefficiency is (probably) 
just too high. Simple indexed access becomes a problem, let alone more complex 
string manipulations. I am not saying that it cannot be done, I think it is 
just not worth the trouble. The current solution in Pharo with ByteString and 
WideString is quite nice (check the chapter I mentioned before).

Sven


Very interesting!
It seems that most of what I was saying is already here :)
I was not saying that Pharo should use UTF-8 (I mentioned UTF-8 because it is a
standard, but I find the variable-length encoding very weird); I was rather
talking about using WideString in UTF-16 or UTF-32, and that's done.
I saw asWideString but didn't know about automatic conversion, the codepoint
selector, and internal wide string support.
Does it mean that Greek Pharo users (for example) use WideString for Strings
without having to specify it or make explicit conversions (except of course
when dealing with bytes if they want to)?
If yes, very good, the job is almost done :)
(Personally I would also deprecate ByteString and get rid of it, just my
opinion.)
Thanks for the link, another good chapter.

Regards,

Alain

Yes, the Greek users won't notice a difference; it is all transparent.
ByteString is important because it is an optimization of the most common
case. As a normal user you should only think of abstract Strings and never use
#asByteString (but use proper encoding).

Feedback on the chapter is always welcome.

Sven






Re: [Pharo-users] Ridiculous we are

2014-09-26 Thread Hilaire
On 26/09/2014 21:00, p...@highoctane.be wrote:
 I'd love another title for this thread.
 
 It depresses me.

Yes, me too.

Hilaire

-- 
Dr. Geo - http://drgeo.eu
iStoa - http://istoa.drgeo.eu




Re: [Pharo-users] Ridiculous we are

2014-09-25 Thread Henrik Johansen

On 25 Sep 2014, at 5:00 , Hilaire Fernandes hila...@drgeo.eu wrote:

 On 24/09/2014 18:48, Benjamin Pollack wrote:
 On Tue, 23 Sep 2014 08:51:54 -0400, Hilaire hila...@drgeo.eu wrote:
 
 On 23/09/2014 14:09, Damien Cassou wrote:
 I recently read documents about utf-8 encoding. In all of them, the
 author says that pathnames should be kept as is because you never know
 which encoding the filesystem uses. So, a filename should probably be
 a bytearray.
 
 
 yes, but a #é should be encoded in two bytes.
 
 As noted in my previous message, é could be represented as either
 one or two Unicode code points, and these in turn could validly be
 either two or three bytes in UTF-8.  My gut says that $é should be
 U+00E9, because otherwise you should have to use two Characters ($e
 and $´), but you could legitimately argue otherwise as well, and at
 any rate, #é could definitely be either.  This is likely the core of
 the issue you're hitting.
 As I understand it, #é should be encoded in two bytes, and only two bytes.
 Only ASCII is encoded as 1 byte in UTF-8.
 See ref. on Wikipedia

Hilaire: Benjamin is talking about which Unicode normalization form é should be
represented in, which is orthogonal to the encoding. See:
http://en.wikipedia.org/wiki/Unicode_equivalence#Combining_and_precomposed_characters
So é can indeed be encoded in two different ways in UTF-8 (as in any other
encoding): as #[c3 a9] (encoding U+00E9, Latin small letter e with acute),
or as #[65 cc 81] (encoding U+0065, Latin small letter e, followed by U+0301,
combining acute accent).

Benjamin: Since the base path that contains the problematic character
originates from a filesystem primitive, we can safely assume it's already in a
canonical form*; Pharo does no automatic normalization. (That is, if the path
had been e + ´, the internal string would have two separate characters
as well.)

Cheers,
Henry

* Only Mac OS X defines a canonical form for its paths anyway; the others don't
care.
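
The two representations above are easy to reproduce. A minimal sketch in Python (an illustration only; Pharo itself does no normalization, as noted above), using the standard `unicodedata` module to convert between the precomposed (NFC) and decomposed (NFD) forms:

```python
import unicodedata

precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed.encode("utf-8").hex(" "))   # c3 a9
print(decomposed.encode("utf-8").hex(" "))    # 65 cc 81

# The two strings render the same but do not compare equal...
print(precomposed == decomposed)              # False

# ...unless both are brought to the same normalization form first.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
```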




Re: [Pharo-users] Ridiculous we are

2014-09-25 Thread Henrik Johansen

On 22 Sep 2014, at 10:07 , Hilaire hila...@drgeo.eu wrote:

 
 However font path seems ok:
 File @ /home/hilaire/Téléchargements/DrGeo.app/Contents/Resources.
 Inspecting this path, it looks like 'Téléchargements' is 8 bits, but it
 should be utf-8, right?
 
 I think there are issues on Windows, as some users reported to me.


The fun thing about plugins calling external libraries is that you have to
find out what the library does to know which encoding its char*
parameters are expected to be in...

In the case of FreeType, after some digging*, it seems to me it ends up calling
fopen on all platforms, which on Windows... *drumroll*
... resolves to the legacy ANSI version** of the Windows file libraries.
Hence, the correct encoding to use on Windows would be the locale's legacy code
page.
It also means that, on Windows, you *cannot* load fonts from a directory whose
name is not encodable in the current code page, no matter what we do in Pharo
(short of submitting a bug fix to the FreeType project).

Cheers,
Henry

*FT_New_Face 
(http://git.savannah.gnu.org/cgit/freetype/freetype2.git/tree/src/base/ftobjs.c)
 calls...
FT_Open_Face  (same) which calls...
FT_Stream_New  (same) which calls...
FT_Stream_Open 
(http://git.savannah.gnu.org/cgit/freetype/freetype2.git/tree/src/base/ftsystem.c)
 which calls...
ft_fopen 
(http://git.savannah.gnu.org/cgit/freetype/freetype2.git/tree/include/config/ftstdlib.h)
 which resolves to
fopen.

** http://msdn.microsoft.com/en-us/library/yeby3zcb.aspx , don't be fooled, the 
Unicode support section is about contents written/read to/from file, not the 
path parameter.
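
Whether a given directory name can be passed through an ANSI code page at all is easy to check. A minimal Python sketch follows; `encodable_in` is a hypothetical helper, and cp1252 / cp1253 are just example code pages (Western European and Greek), not a claim about what any particular machine uses:

```python
def encodable_in(path: str, code_page: str) -> bool:
    """True if every character of path has a representation in code_page."""
    try:
        path.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


font_dir = "T\u00e9l\u00e9chargements"                # Téléchargements
greek_dir = "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1"    # Ελλάδα

print(encodable_in(font_dir, "cp1252"))    # True
print(encodable_in(greek_dir, "cp1252"))   # False
print(encodable_in(greek_dir, "cp1253"))   # True
```

A path for which this returns False under the current code page is one that a library going through the ANSI file functions cannot be handed, whatever the caller does.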




Re: [Pharo-users] Ridiculous we are

2014-09-25 Thread Alain Rastoul

On 25/09/2014 07:23, Sven Van Caekenberghe wrote:


On 25 Sep 2014, at 01:04, Alain Rastoul alf.mmm@gmail.com wrote:


On 25/09/2014 00:06, Sven Van Caekenberghe wrote:

Alain,



The character encoding situation in Pharo is pretty good actually. The only 
problem is that there is some old school code left that encodes strings into 
strings, but today you can easily write much better and conceptually correct 
code.

You could have a look at this draft chapter of the upcoming 'Enterprise Pharo' 
book that I am currently writing:

   http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/

Concerning file system paths, FilePathEncoder and FilePluginPrimitives already 
do the right thing.

Now, your idea about using UTF-8 to represent internal Strings is something 
that has been discussed before and in many other languages as well. The short 
answer is that due to it being variable length, the inefficiency is (probably) 
just too high. Simple indexed access becomes a problem, let alone more complex 
string manipulations. I am not saying that it cannot be done, I think it is 
just not worth the trouble. The current solution in Pharo with ByteString and 
WideString is quite nice (check the chapter I mentioned before).

Sven


Very interesting!
It seems that most of what I was saying is already here :)
I was not saying that Pharo should use UTF-8 (I mentioned UTF-8 because it is a
standard, but I find the variable-length encoding very weird); I was rather
talking about using WideString in UTF-16 or UTF-32, and that's done.
I saw asWideString but didn't know about automatic conversion, the codepoint
selector, and internal wide string support.
Does it mean that Greek Pharo users (for example) use WideString for Strings
without having to specify it or make explicit conversions (except of course
when dealing with bytes if they want to)?
If yes, very good, the job is almost done :)
(Personally I would also deprecate ByteString and get rid of it, just my
opinion.)
Thanks for the link, another good chapter.

Regards,

Alain


ByteString is important because it is an optimization of the most common case.


I understand the point here: memory/data footprint, CPU cache and so on
(not talking about encoding/decoding).
I think that's why Microsoft chose UTF-16 (formerly UCS-2) as a middle
solution, because it covers most character sets with 2 bytes.
Maybe I'm being excessive, but I have my reasons: I once had to debug a French
program used in China by a Chinese user who was seeing weird
characters on a (weird-to-me) Chinese Windows XP ... a missing
WideString and a great moment of loneliness :)
As a normal user you should only think of abstract Strings and never use 
#asByteString (but use proper encoding).


Feedback on the chapter is always welcome.

Sven


Agree.
Your chapter is excellent, I played a bit with Zn encoders.
I look forward to Pharo for the enterprise on Lulu.

However, I'm wondering: WideString being a variableWordSubclass: with
32-bit words on a 32-bit VM, what will it become on a 64-bit VM? 32-bit
words or 64-bit words? Immediate characters (seen on Clément
Bera's blog about Spur and the new object format)?


Alain




Re: [Pharo-users] Ridiculous we are

2014-09-24 Thread Benjamin Pollack

On Tue, 23 Sep 2014 08:51:54 -0400, Hilaire hila...@drgeo.eu wrote:


On 23/09/2014 14:09, Damien Cassou wrote:

I recently read documents about utf-8 encoding. In all of them, the
author says that pathnames should be kept as is because you never know
which encoding the filesystem uses. So, a filename should probably be
a bytearray.



yes, but a #é should be encoded in two bytes.


As noted in my previous message, é could be represented as either one or  
two Unicode code points, and these in turn could validly be either two or  
three bytes in UTF-8.  My gut says that $é should be U+00E9, because  
otherwise you should have to use two Characters ($e and $´), but you could  
legitimately argue otherwise as well, and at any rate, #é could definitely  
be either.  This is likely the core of the issue you're hitting.




Re: [Pharo-users] Ridiculous we are

2014-09-24 Thread Benjamin Pollack
On Mon, 22 Sep 2014 17:58:41 -0400, Sven Van Caekenberghe s...@stfx.eu  
wrote:


I also find the way some problems are reported quite disturbing. How  
much testing did you do ? On which platforms ?


I can do this (in Pharo 3) without any problems (we're talking about  
arbitrary Unicode characters in path names):


('/tmp' asFileReference / 'été') ensureCreateDirectory.
'/tmp/été' asFileReference exists.
('/tmp/été' asFileReference / 'Ελλάδα.txt') writeStreamDo: [ :out |
  out << 'What about Greece ?' ].
('/tmp/été' asFileReference / 'Ελλάδα.txt') exists.
('/tmp/été' asFileReference / 'Ελλάδα.txt') contents.

And in a terminal, I get:

$ ls /tmp/été/Ελλάδα.txt
/tmp/été/Ελλάδα.txt

$ cat !$
cat /tmp/été/Ελλάδα.txt
What about Greece ?

This is on Mac OS X.

So this part fundamentally works in the image and on one VM. There might  
of course be problems in how paths are used in certain places or on  
certain VM/platforms.




Focusing purely on Unicode itself (not the encoding systems), a letter  
like é can be represented as U+00E9 (LATIN SMALL LETTER E WITH ACUTE), or  
as U+0065 (LATIN SMALL LETTER E) followed by U+0301 (combining acute  
accent).  These will appear identical to the user, but are emphatically  
*not* identical for most software.  The way you're testing here, you will  
not hit any error relating to this concept, ever, because you're using  
Pharo for both generating and consuming the strings.  At the very least,  
we'd need to generate a file named été with both forms explicitly and  
see what happens.


Things get even more exciting, though, because Unix says that file names  
are simply arbitrary byte patterns that do not contain the null byte.*   
Thus, you can trivially create a file named été using Latin-1 encoding,  
and again using UTF-8 encoding, and again using UTF-7 encoding, and these  
might all be shown to the user as identically named, but I guarantee you  
that Pharo will not act sanely with all four of these.  Even on Windows,  
where things are a bit saner (NTFS mandates UTF-16), and where an explicit  
normalization form is preferred (NFC), I just explicitly verified that I  
can trivially inject other normalization forms into the file system.   
Thus, you can still have two files named été that nevertheless have  
different names as far as the OS is concerned.
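
To make the point concrete, here is a small Python sketch (an illustration only, not how Pharo builds paths) that spells the same visible name, été, as four different byte sequences. On a file system that treats names as opaque bytes, each one is a distinct file name:

```python
import unicodedata

name = "\u00e9t\u00e9"   # "été", precomposed

variants = {
    "latin-1": name.encode("latin-1"),
    "utf-8 (NFC)": unicodedata.normalize("NFC", name).encode("utf-8"),
    "utf-8 (NFD)": unicodedata.normalize("NFD", name).encode("utf-8"),
    "utf-7": name.encode("utf-7"),
}

for label, raw in variants.items():
    print(f"{label:12} {raw!r}")

# Number of distinct byte sequences behind the one visible name.
print(len(set(variants.values())))
```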


In this case, as far as I can tell, Pharo assumes that all path names are
Unicode, and does not do any work to convert strings to or from the
various normalization schemes (looking in Path class>>canonicalizeElements:,
Path class>>from:delimiter:, and FileSystemStore>>pathFromString: here).


There's therefore a pretty straightforward fix that Pharo could do:

  1. Path would use ByteArrays as the actual canonical store, and
 provide convenience methods to see what the array decodes to
 in various encodings.  The developer and application can make
 decisions about what encoding system they want to use.
  2. The VM likely needs to be modified to handle this (didn't check)
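
A rough sketch of what item 1 could look like, written in Python rather than Pharo and with entirely hypothetical names (`BytePath`, `decoded`): the raw bytes are the identity of the path element, and text is only a view obtained by choosing an encoding.

```python
from typing import Optional


class BytePath:
    """Hypothetical path element: raw bytes are the identity, text is a view."""

    def __init__(self, raw: bytes) -> None:
        self.raw = raw

    def decoded(self, encoding: str = "utf-8") -> Optional[str]:
        """Text under the given encoding, or None if the bytes are invalid in it."""
        try:
            return self.raw.decode(encoding)
        except UnicodeDecodeError:
            return None

    def __eq__(self, other: object) -> bool:
        return isinstance(other, BytePath) and self.raw == other.raw

    def __hash__(self) -> int:
        return hash(self.raw)


legacy = BytePath(b"\xe9t\xe9")            # "été" written by a Latin-1 program
modern = BytePath(b"\xc3\xa9t\xc3\xa9")    # "été" written by a UTF-8 program

print(legacy.decoded("utf-8"))                                 # None
print(legacy.decoded("latin-1") == modern.decoded("utf-8"))    # True
print(legacy == modern)                                        # False
```

The design choice is that comparison and hashing never decode anything, so two names are equal only if the OS would consider them equal; which encoding to display them in stays the application's decision.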

As much as I wish Hilaire provided more details in his bug report, it's  
worth keeping in mind that not all users, or even all programmers,  
understand the full implications of things like how various Unicode  
normalization and encoding schemes interact in practice with Unix's very  
vague concept of what a file name actually is, so I usually try to  
approach these bug reports carefully and with an open mind.


--Benjamin

* On OS X, HFS+ uses UTF-16 with an Apple-specific variant of NFD, whereas  
I do not believe this holds for e.g. UFS or FUSE-backed file systems, so  
things are a bit subtler there, but the general rule holds.




Re: [Pharo-users] Ridiculous we are

2014-09-24 Thread Sven Van Caekenberghe

On 24 Sep 2014, at 18:48, Benjamin Pollack benja...@bitquabit.com wrote:

 On Tue, 23 Sep 2014 08:51:54 -0400, Hilaire hila...@drgeo.eu wrote:
 
 On 23/09/2014 14:09, Damien Cassou wrote:
 I recently read documents about utf-8 encoding. In all of them, the
 author says that pathnames should be kept as is because you never know
 which encoding the filesystem uses. So, a filename should probably be
 a bytearray.
 
 
 yes, but a #é should be encoded in two bytes.
 
 As noted in my previous message, é could be represented as either one or 
 two Unicode code points, and these in turn could validly be either two or 
 three bytes in UTF-8.  My gut says that $é should be U+00E9, because 
 otherwise you should have to use two Characters ($e and $´), but you could 
 legitimately argue otherwise as well, and at any rate, #é could definitely be 
 either.  This is likely the core of the issue you're hitting.

Did you read the actual conversation in the issue ?

 
https://pharo.fogbugz.com/f/cases/14054/Issue-with-path-with-accented-characters

It has been renamed and there is a fix (as a change set, not as a slice, yet). 
Basically, there was a primitive call into a plugin that failed to do encoding.

Now regarding the issues you raised. Pharo does not do Unicode canonicalisation 
or any of that other fancy stuff (like categorisation, proper ordering and so 
on). This is another orthogonal and way more general issue.

Regarding the pathnames encoding: if the OS itself does not know it, how can we 
? I think that the current approach (assuming UTF-8) makes (the most) sense for 
a system that runs on multiple platforms.

Sven




Re: [Pharo-users] Ridiculous we are

2014-09-24 Thread Benjamin Pollack
On Wed, 24 Sep 2014 13:03:57 -0400, Sven Van Caekenberghe s...@stfx.eu  
wrote:




Did you read the actual conversation in the issue ?

 
https://pharo.fogbugz.com/f/cases/14054/Issue-with-path-with-accented-characters

It has been renamed and there is a fix (as a change set, not as a slice,  
yet). Basically, there was a primitive call into a plugin that failed to  
do encoding.




No, I apologize; I missed the bug link.  Thanks for reposting it.

Now regarding the issues you raised. Pharo does not do Unicode  
canonicalisation or any of that other fancy stuff (like categorisation,  
proper ordering and so on). This is another orthogonal and way more  
general issue.


Regarding the pathnames encoding: if the OS itself does not know it, how  
can we ?


That's actually the argument *against* using UTF-8 as the standard Pharo  
way to represent filenames--at least on Unix systems.  If Pharo used  
ByteArrays to represent paths, with convenience methods for working with  
UTF-8 (since I do agree that's the most likely thing for a user/dev to  
want), then you'd be able to work with all files no matter what, *and*  
have a convenient way of doing so for the common case.


This is an old discussion, and I do see both sides of it.  In terms of  
SCMs, Mercurial and Git both just say it's a collection of bytes,  
whereas Subversion says it's Unicode code points.  This has some  
uncomfortable implications for both systems when working on multiple  
platforms.


--Benjamin



Re: [Pharo-users] Ridiculous we are

2014-09-24 Thread Sven Van Caekenberghe

On 24 Sep 2014, at 19:09, Benjamin Pollack benja...@bitquabit.com wrote:

 On Wed, 24 Sep 2014 13:03:57 -0400, Sven Van Caekenberghe s...@stfx.eu 
 wrote:
 
 
 Did you read the actual conversation in the issue ?
 
 https://pharo.fogbugz.com/f/cases/14054/Issue-with-path-with-accented-characters
 
 It has been renamed and there is a fix (as a change set, not as a slice, 
 yet). Basically, there was a primitive call into a plugin that failed to do 
 encoding.
 
 
 No, I apologize; I missed the bug link.  Thanks for reposting it.
 
 Now regarding the issues you raised. Pharo does not do Unicode 
 canonicalisation or any of that other fancy stuff (like categorisation, 
 proper ordering and so on). This is another orthogonal and way more general 
 issue.
 
 Regarding the pathnames encoding: if the OS itself does not know it, how can 
 we ?
 
 That's actually the argument *against* using UTF-8 as the standard Pharo way 
 to represent filenames--at least on Unix systems.  If Pharo used ByteArrays 
 to represent paths, with convenience methods for working with UTF-8 (since I 
 do agree that's the most likely thing for a user/dev to want), then you'd be 
 able to work with all files no matter what, *and* have a convenient way of 
 doing so for the common case.
 
 This is an old discussion, and I do see both sides of it.  In terms of SCMs, 
 Mercurial and Git both just say it's a collection of bytes, whereas 
 Subversion says it's Unicode code points.  This has some uncomfortable 
 implications for both systems when working on multiple platforms.

Benjamin,

I think I understand the concern / situation that you describe. But I fail to 
see how not-interpreting it and interpreting it in different encodings can work 
in practice, especially since your point seems to be that there is no meta 
information that gives a definitive answer. 

I would guess that other languages, say Java or Python, have some approach to 
handle this problem ?

Also, since we are living with the current approach without many problems, I
think the issue is not terribly pressing.

Sven
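
For reference, Python 3's approach on POSIX systems is to decode file names with the file system encoding plus the `surrogateescape` error handler (PEP 383), so that bytes which do not decode survive a round trip instead of raising an error. A minimal sketch, calling the codec explicitly rather than going through `os.fsdecode` so that the result does not depend on the machine's locale:

```python
raw = b"\xe9t\xe9.txt"   # "été.txt" as Latin-1 bytes, which is not valid UTF-8

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))       # '\udce9t\udce9.txt'

# Encoding with the same handler restores the original bytes exactly.
print(text.encode("utf-8", errors="surrogateescape") == raw)   # True
```

The string is not meaningful text for the escaped positions, but it can be passed around and handed back to the OS unchanged, which sits somewhere between the "always bytes" and "always UTF-8" positions in this thread.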




Re: [Pharo-users] Ridiculous we are

2014-09-24 Thread Alain Rastoul



On 24/09/2014 19:09, Benjamin Pollack wrote:


If Pharo used  ByteArrays to represent paths, with convenience methods for 
working with
UTF-8 (since I do agree that's the most likely thing for a user/dev to
want), then you'd be able to work with all files no matter what, *and*
have a convenient way of doing so for the common case.

Hi Ben,
I strongly disagree with you on this point: using byte arrays (or byte 
strings) is a pain in an international context.

The OS knows about its encoding: locale for Unix, code page for Windows.
Windows code pages depend on the country: 1252 for English Windows (similar
to ISO-8859-1), and other variations of 8859-xx for other European
countries... (welcome to ISO soup); same for Unix.


Java and .NET both use UTF-16 strings (don't know about
Python), where chars are not bytes and they are not used as byte arrays
but as Character arrays.
Both do conversions from OS character set encoding  to internal encoding 
for strings (paths and whatever).
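
The "ISO soup" is easy to see with a single byte. A minimal Python sketch, decoding the same byte under a few legacy code pages (the selection of code pages is only an example):

```python
raw = b"\xe9"

# One byte, four different characters depending on the assumed code page.
for code_page in ("latin-1", "cp1252", "cp1253", "cp1251"):
    print(code_page, raw.decode(code_page))

# cp1252 and ISO-8859-1 agree on most bytes, but not on all of them:
print(b"\x80".decode("cp1252"))            # the euro sign
print(ascii(b"\x80".decode("latin-1")))    # '\x80', a C1 control character
```

This is why a conversion from the OS encoding to the internal string representation has to know which code page or locale is in effect; the bytes alone do not say.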


There is already UTF-8 and UTF-16 encoding support in Pharo, but the
standard String class uses bytes, and a lot of files, directories and
system methods use the ByteString class, and that is the problem here.
UTF-8 encoding in Pharo encodes to a variable-length ByteString, which is
not the same as a (hypothetical) Utf8String where all (variable-length)
chars would be UTF-8 encoded.

Using a new UTF-8 or UTF-16 string class could be a major rework,
but a decision about the internal string encoding is needed.
As Sven says, there is no emergency and you have a workaround, but
perhaps using the existing WideString encoded as UTF-16 (or UTF-32?) in
some well-defined classes/methods could be a good start for this rework?
IMHO the workaround of using UTF-8 encoded byte strings is not a good way
to deal with this problem and should not be accepted as the solution.





Re: [Pharo-users] Ridiculous we are

2014-09-24 Thread Sven Van Caekenberghe
Alain,

On 24 Sep 2014, at 23:00, Alain Rastoul alf.mmm@gmail.com wrote:

 On 24/09/2014 19:09, Benjamin Pollack wrote:
 
 If Pharo used  ByteArrays to represent paths, with convenience methods for 
 working with
 UTF-8 (since I do agree that's the most likely thing for a user/dev to
 want), then you'd be able to work with all files no matter what, *and*
 have a convenient way of doing so for the common case.
 Hi Ben,
 I strongly disagree with you on this point: using byte arrays (or byte 
 strings) is a pain in an international context.
 The OS knows about its encoding: locale for Unix, code page for Windows.
 Windows code pages depend on the country: 1252 for English Windows (similar to
 ISO-8859-1), and other variations of 8859-xx for other European countries...
 (welcome to ISO soup); same for Unix.
 
 Java and .NET both use UTF-16 strings (don't know about Python),
 where chars are not bytes and they are not used as byte arrays but as
 Character arrays.
 Both do conversions from OS character set encoding to internal encoding for
 strings (paths and whatever).
 
 There is already UTF-8 and UTF-16 encoding support in Pharo, but the
 standard String class uses bytes, and a lot of files, directories and
 system methods use the ByteString class, and that is the problem here.
 UTF-8 encoding in Pharo encodes to a variable-length ByteString, which is not
 the same as a (hypothetical) Utf8String where all (variable-length) chars
 would be UTF-8 encoded.
 Using a new UTF-8 or UTF-16 string class could be a major rework,
 but a decision about the internal string encoding is needed.
 As Sven says, there is no emergency and you have a workaround, but
 perhaps using the existing WideString encoded as UTF-16 (or UTF-32?) in
 some well-defined classes/methods could be a good start for this rework?
 IMHO the workaround of using UTF-8 encoded byte strings is not a good way to
 deal with this problem and should not be accepted as the solution.

The character encoding situation in Pharo is pretty good actually. The only 
problem is that there is some old school code left that encodes strings into 
strings, but today you can easily write much better and conceptually correct 
code.

You could have a look at this draft chapter of the upcoming 'Enterprise Pharo' 
book that I am currently writing:

  http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/

Concerning file system paths, FilePathEncoder and FilePluginPrimitives already 
do the right thing.

Now, your idea about using UTF-8 to represent internal Strings is something 
that has been discussed before and in many other languages as well. The short 
answer is that due to it being variable length, the inefficiency is (probably) 
just too high. Simple indexed access becomes a problem, let alone more complex 
string manipulations. I am not saying that it cannot be done, I think it is 
just not worth the trouble. The current solution in Pharo with ByteString and 
WideString is quite nice (check the chapter I mentioned before).

Sven
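
The indexed-access cost mentioned above can be shown in a few lines. A minimal Python sketch, with a hypothetical helper `code_point_at`, that finds the n-th code point in UTF-8 bytes; because each code point takes between one and four bytes, the only general way is to scan from the start:

```python
def code_point_at(utf8: bytes, index: int) -> str:
    """Return the index-th code point (0-based) by scanning from the start."""
    seen = -1
    for offset, byte in enumerate(utf8):
        if byte & 0xC0 != 0x80:      # not a continuation byte: a code point starts here
            seen += 1
            if seen == index:
                end = offset + 1
                # consume the continuation bytes that belong to this code point
                while end < len(utf8) and utf8[end] & 0xC0 == 0x80:
                    end += 1
                return utf8[offset:end].decode("utf-8")
    raise IndexError(index)


data = "\u00e9t\u00e9".encode("utf-8")   # "été": 3 code points, 5 bytes
print(len(data))                         # 5
print(code_point_at(data, 1))            # t
print(code_point_at(data, 2))            # é
```

With fixed-width slots, as in ByteString and WideString, the same lookup is a single array access; here it is linear in the position requested.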




Re: [Pharo-users] Ridiculous we are

2014-09-24 Thread Alain Rastoul

On 25/09/2014 00:06, Sven Van Caekenberghe wrote:

Alain,



The character encoding situation in Pharo is pretty good actually. The only 
problem is that there is some old school code left that encodes strings into 
strings, but today you can easily write much better and conceptually correct 
code.

You could have a look at this draft chapter of the upcoming 'Enterprise Pharo' 
book that I am currently writing:

   http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/

Concerning file system paths, FilePathEncoder and FilePluginPrimitives already 
do the right thing.

Now, your idea about using UTF-8 to represent internal Strings is something 
that has been discussed before and in many other languages as well. The short 
answer is that due to it being variable length, the inefficiency is (probably) 
just too high. Simple indexed access becomes a problem, let alone more complex 
string manipulations. I am not saying that it cannot be done, I think it is 
just not worth the trouble. The current solution in Pharo with ByteString and 
WideString is quite nice (check the chapter I mentioned before).

Sven


Very interesting!
It seems that most of what I was saying is already here :)
I was not saying that Pharo should use UTF-8 (I mentioned UTF-8 because
it is a standard, but I find the variable-length encoding very weird); I
was rather talking about using WideString in UTF-16 or UTF-32, and that's done.
I saw asWideString but didn't know about automatic conversion, the
codepoint selector, and internal wide string support.
Does it mean that Greek Pharo users (for example) use WideString for
Strings without having to specify it or make explicit conversions
(except of course when dealing with bytes if they want to)?

If yes, very good, the job is almost done :)
(Personally I would also deprecate ByteString and get rid of it, just
my opinion.)

Thanks for the link, another good chapter.

Regards,

Alain





Re: [Pharo-users] Ridiculous we are

2014-09-24 Thread Sven Van Caekenberghe

On 25 Sep 2014, at 01:04, Alain Rastoul alf.mmm@gmail.com wrote:

 On 25/09/2014 00:06, Sven Van Caekenberghe wrote:
 Alain,
 
 The character encoding situation in Pharo is pretty good actually. The only 
 problem is that there is some old school code left that encodes strings into 
 strings, but today you can easily write much better and conceptually correct 
 code.
 
 You could have a look at this draft chapter of the upcoming 'Enterprise 
 Pharo' book that I am currently writing:
 
   http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/
 
 Concerning file system paths, FilePathEncoder and FilePluginPrimitives 
 already do the right thing.
 
 Now, your idea about using UTF-8 to represent internal Strings is something 
 that has been discussed before and in many other languages as well. The 
 short answer is that due to it being variable length, the inefficiency is 
 (probably) just too high. Simple indexed access becomes a problem, let alone 
 more complex string manipulations. I am not saying that it cannot be done, I 
 think it is just not worth the trouble. The current solution in Pharo with 
 ByteString and WideString is quite nice (check the chapter I mentioned 
 before).
 
 Sven
 
 Very interesting !
 It seems that most of what I was saying is already here :)
 I was not saying that Pharo should use UTF-8 (I mentioned UTF-8 because it is 
 a standard, but I find the variable-length encoding very weird); I was rather 
 talking about using WideString in UTF-16 or UTF-32, and that's done.
 I saw asWideString but didn't know about the automatic conversion, the 
 codepoint selector, or the internal wide string support.
 Does it mean that Greek Pharo users (for example) use WideString for Strings 
 without having to specify it or make explicit conversions (except of course 
 when dealing with bytes, if they want to)?
 If yes, very good, the job is almost done :)
 (Personally I would also deprecate ByteString and get rid of it; just my 
 opinion.)
 Thanks for the link, another good chapter.
 
 Regards,
 
 Alain

Yes, Greek users won't notice a difference; it is all transparent. 
ByteString is important because it is an optimization of the most common 
case. As a normal user you should only think of abstract Strings and never use 
#asByteString (use a proper encoding instead).
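
A minimal workspace sketch of that transparency, assuming an image in
which the Zinc convenience method utf8Encoded is available (evaluate each
line separately; the comments show the result one would expect):

```smalltalk
"Text in the Latin-1 range is stored compactly, one byte per character."
'été' class.                     "ByteString"
"A literal containing Greek letters is created as a WideString."
'Ελλάδα' class.                  "WideString"
"Mixing the two needs no explicit conversion; the result is widened."
('été ' , 'Ελλάδα') class.       "WideString"
"Indexing is by character, not by byte."
('Ελλάδα' at: 1) codePoint.      "917, i.e. U+0395 GREEK CAPITAL LETTER EPSILON"
"Bytes only appear when the string is encoded explicitly."
'λ' utf8Encoded.                 "#[206 187]"
```

The point of the sketch is only that the ByteString/WideString split is an
internal storage choice; user code sends the same messages to both.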

Feedback on the chapter is always welcome.

Sven


Re: [Pharo-users] Ridiculous we are

2014-09-23 Thread Damien Cassou
On Mon, Sep 22, 2014 at 10:07 PM, Hilaire hila...@drgeo.eu wrote:
 However font path seems ok:
  File @ /home/hilaire/Téléchargements/DrGeo.app/Contents/Resources.
 Inspecting this path, it looks like 'Téléchargements' is 8 bits, but it
 should be utf-8, right?


I recently read documents about utf-8 encoding. In all of them, the
author says that pathnames should be kept as is because you never know
which encoding the filesystem uses. So, a filename should probably be
a bytearray.

-- 
Damien Cassou
http://damiencassou.seasidehosting.st

Success is the ability to go from one failure to another without
losing enthusiasm.
Winston Churchill



Re: [Pharo-users] Ridiculous we are

2014-09-23 Thread Hilaire
Le 23/09/2014 14:09, Damien Cassou a écrit :
 I recently read documents about utf-8 encoding. In all of them, the
 author says that pathnames should be kept as is because you never know
 which encoding the filesystem uses. So, a filename should probably be
 a bytearray.


Yes, but in UTF-8 a #é should be encoded in two bytes.
Although it looks strange, I am not sure this is the exact problem,
because I can use accented file names for sketches; the problem arises
only when loading a font. So it may be the font-loading code (cf. my bug
report).
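
To make the "two bytes" point concrete, here is a small workspace sketch.
It assumes the Zinc convenience methods utf8Encoded and utf8Decoded are
present in the image, and that é is typed as the precomposed character
U+00E9. Inside the image 'é' is a single character that fits in a
ByteString, which may be why the inspected path looks like 8 bits; the
two-byte form only exists once the string is encoded.

```smalltalk
'é' size.                              "1 character in the image"
$é codePoint.                          "233, fits in a ByteString"
'é' utf8Encoded.                       "#[195 169], two bytes once encoded"
'Téléchargements' size.                "15 characters"
'Téléchargements' utf8Encoded size.    "17 bytes, each é takes two"
"Round trip: decoding the bytes gives back the original string."
'Téléchargements' utf8Encoded utf8Decoded = 'Téléchargements'.   "true"
```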

Hilaire

-- 
Dr. Geo - http://drgeo.eu
iStoa - http://istoa.drgeo.eu




[Pharo-users] Ridiculous we are

2014-09-22 Thread Hilaire
Hello,

Tested on Linux: when I move the DrGeo.app folder into a directory tree
with accented characters (for example, /home/hilaire/Téléchargement/),
loading the font does not work.

However font path seems ok:
 File @ /home/hilaire/Téléchargements/DrGeo.app/Contents/Resources.
Inspecting this path, it looks like 'Téléchargements' is 8 bits, but it
should be utf-8, right?

I think there are issues on Windows too, as some users have reported to me.

Holy shit.

Hilaire

-- 
Dr. Geo - http://drgeo.eu
iStoa - http://istao.drgeo.eu




Re: [Pharo-users] Ridiculous we are

2014-09-22 Thread Alexandre Bergel
:-(

I fear I will soon face the same problem, when I start my lecture…

Alexandre
-- 
_,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:
Alexandre Bergel  http://www.bergel.eu
^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;._,.;:~^~:;.






Re: [Pharo-users] Ridiculous we are

2014-09-22 Thread Juraj Kubelka
Can you create an issue? I am cleaning the fonts and in that case I could 
consider this issue. If it is a problem only on Windows, I will need someone’s 
assistance.

Cheers,
Juraj





Re: [Pharo-users] Ridiculous we are

2014-09-22 Thread Hilaire
You can use screenshots.

But back to the issue: in other parts of DrGeo, when saving/loading a
sketch, paths or filenames with accents or spaces are OK.
So I am not sure what's going on.

Hilaire

Le 22/09/2014 22:15, Alexandre Bergel a écrit :
 :-(
 
 I will soon face the same problem I fear, when I will start my lecture…
 
 Alexandre
 


-- 
Dr. Geo - http://drgeo.eu
iStoa - http://istao.drgeo.eu




Re: [Pharo-users] Ridiculous we are

2014-09-22 Thread stepharo

Hilaire,

For two days now, since I upgraded my iPhone, the recovery process has
been crashing. After two days of trying I finally succeeded in uploading
my recovery to my iPhone, and now my iPhone crashes continuously at boot
time: I get a nice sepia screenshot and it restarts. I will have to send
my iPhone to Apple for a real check.
Just because I did an update!

So I do not accept the title of your email. I simply cannot.

Can you imagine the billions injected into the iPhone? The iPhone is
probably one order of magnitude more complex than Pharo, but the money
injected into Pharo is our collective time, and it is smaller than
several billions by far more than one order of magnitude.

Stef









Re: [Pharo-users] Ridiculous we are

2014-09-22 Thread Hilaire
The issue is already there
https://pharo.fogbugz.com/f/cases/14054/Issue-with-path-with-accented-characters

I am trying to document it, but it is odd, because in some other parts of
DrGeo I don't have any issue with accented paths.
But shouldn't the path be UTF-8 encoded? Or is my fresh Linux Mint box
using non-UTF-8 filenames? No, it can't be.

Hilaire

Le 22/09/2014 22:20, Juraj Kubelka a écrit :
 Can you create an issue? I am cleaning the fonts and in some case I could 
 consider this issue. If it is problem only on Windows, I will need someone’s 
 assistance.
 


-- 
Dr. Geo - http://drgeo.eu
iStoa - http://istao.drgeo.eu




Re: [Pharo-users] Ridiculous we are

2014-09-22 Thread Hilaire
Le 22/09/2014 22:35, stepharo a écrit :
 So I do not accept the title of your email. Simply I cannot.

Don't worry, it is a temporary cry/yell of frustration.

-- 
Dr. Geo - http://drgeo.eu
iStoa - http://istao.drgeo.eu




Re: [Pharo-users] Ridiculous we are

2014-09-22 Thread p...@highoctane.be
Also, sometimes things do look like Téléchargement but are still
Downloads under the hood as the OS translates the UI.

Phil












Re: [Pharo-users] Ridiculous we are

2014-09-22 Thread Hilaire
Le 22/09/2014 23:14, p...@highoctane.be a écrit :
 Also, sometimes things do look like Téléchargement but are still
 Downloads under the hood as the OS translates the UI.

Yes, I checked with another path of my own, like 'été': still the same
issue. What is strange is that I have no issue searching for sketch files
with accents, only when loading the font.
Hilaire


-- 
Dr. Geo - http://drgeo.eu
iStoa - http://istoa.drgeo.eu




Re: [Pharo-users] Ridiculous we are

2014-09-22 Thread Nicolai Hess
There is a similar issue for Windows:

13127 https://pharo.fogbugz.com/default.asp?13127
can not (always) read permissions for directoryentries on a path with
nonascii characters


2014-09-22 23:21 GMT+02:00 Hilaire hila...@drgeo.eu:

 Le 22/09/2014 23:14, p...@highoctane.be a écrit :
  Also, sometimes things do look like Téléchargement but are still
  Downloads under the hood as the OS translates the UI.

 Yes, I check within another path of my own like 'été', still same issue.
 Strange is I have no issue to search for sketch file with accent. Only
 when loading the font.
 Hilaire


 --
 Dr. Geo - http://drgeo.eu
 iStoa - http://istoa.drgeo.eu





Re: [Pharo-users] Ridiculous we are

2014-09-22 Thread Sven Van Caekenberghe
I also find the way some problems are reported quite disturbing. How much 
testing did you do ? On which platforms ?

I can do this (in Pharo 3) without any problems (we're talking about arbitrary 
Unicode characters in path names):

('/tmp' asFileReference / 'été') ensureCreateDirectory. 
'/tmp/été' asFileReference exists.
('/tmp/été' asFileReference / 'Ελλάδα.txt') writeStreamDo: [ :out |
  out << 'What about Greece ?' ].
('/tmp/été' asFileReference / 'Ελλάδα.txt') exists.
('/tmp/été' asFileReference / 'Ελλάδα.txt') contents.

And in a terminal, I get:

$ ls /tmp/été/Ελλάδα.txt 
/tmp/été/Ελλάδα.txt

$ cat !$
cat /tmp/été/Ελλάδα.txt
What about Greece ?

This is on Mac OS X.

So this part fundamentally works in the image and on one VM. There might of 
course be problems in how paths are used in certain places or on certain 
VM/platforms.
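
One way such a path problem can show up, sketched under assumptions (an
illustration only, not a diagnosis of the DrGeo font issue): if UTF-8
bytes coming from the file system are turned into a String byte by byte
instead of being decoded, every accented character becomes two wrong
characters. utf8Encoded and utf8Decoded are the Zinc convenience methods,
assumed to be present in the image.

```smalltalk
| bytes |
bytes := 'été' utf8Encoded.      "#[195 169 116 195 169]"
bytes asString.                  "'Ã©tÃ©', each byte read as one character"
bytes utf8Decoded.               "'été', the bytes decoded properly"
bytes asString = 'été'.          "false, so a lookup under this name would fail"
```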

Sven





Re: [Pharo-users] Ridiculous we are

2014-09-22 Thread Robert Shiplett
So I stay with my 8 GB iTouch on iOS 3; with no prospect of an upgrade, I am
sorta worry-free.

If only it were also a phone ...

 Don't dial ... DO ! 

;-)

[ this msg was last seen in my default font ]

