Re: [Pharo-users] Class name with diacritic character and Pharo

Sven Van Caekenberghe Sun, 27 Jan 2019 08:05:24 -0800

Hi Dominique,

> On 27 Jan 2019, at 11:40, Dominique Dartois <d...@dartois.org> wrote:
> 
> Hello all. 
> If a use french diacritic character in a class name, the code runs but I 
> can’t fileout the package nor save it with Monticello. 
> For example, the C cedilla in the class name drive me to an 
> ‘ZnInvalidUTF8:Illegal byte for utf-8 encoding' when filing out.
> 
> Is it a bug or a feature?
> Thank you
> 
> <image.png>--- 
> Dominique Dartois


Thanks for reporting this. This is most definitely a bug, I can confirm its 
occurrence.

I'm CC pharo-dev as this is quite important. This will be a long mail.


This is one manifestation of a problem that has been present for quite a while.

I'll start by describing what I did, what went well and where/how this fails, 
some generic points, and two conceptual solutions (that need further 
verification).

Like you, I created a new subclass:

Object subclass: #ClasseFrançaise
        instanceVariableNames: ''
        classVariableNames: ''
        package: '_UnpackagedPackage'

With comment:

I am ClasseFrançaise.

Try:

        ClasseFrançaise new élève.
        ClasseFrançaise new euro.

And two methods (in the 'test' protocol):

élève
        ^ 'élève'

euro
        ^ '€'

I added the euro sign (because that is encoded in UTF-8 with 3 bytes, not 2 
like ç).
Like you said, the system can cope with such class and method names and seems 
to function fine.

Looking at the .changes file, the correct source code was appended:

----SNAPSHOT----2019-01-26T23:36:18.548555+01:00 work.image priorSource: 339848!

Object subclass: #ClasseFrançaise
        instanceVariableNames: ''
        classVariableNames: ''
        package: '_UnpackagedPackage'!
!ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:25' prior: 0!
I am ClasseFrançaise.!
!ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 
12:26'!
élève
        ^ 'élève'! !
!ClasseFrançaise commentStamp: 'SvenVanCaekenberghe 1/27/2019 12:27' prior: 
33898360!
I am ClasseFrançaise.

Try:

        ClasseFrançaise new élève.
        ClasseFrançaise new euro.
!
!ClasseFrançaise methodsFor: 'test' stamp: 'SvenVanCaekenberghe 1/27/2019 
12:27'!
euro
        ^ '€'! !


Doing a file out (or otherwise saving the source code) fails. The reason is an 
incorrect manipulation of this source file while looking for what is called the 
method preamble, in SourcFileArray>>#getPreambleFrom:at: position

An programmatic way to invoke the same error is by doing

(ClasseFrançaise>>#élève) timeStamp.
(ClasseFrançaise>>#élève) author.

Both fail with the same error.


The source code of methods is (currently) stored in a .sources or .changes 
file. CompiledMethods know their source pointer, an offset in one of these 
files. Right before the place where the source starts is a preamble that 
contains some meta information (including the author and timestamp). To access 
that preamble, the source code pointer is moved backwards to the beginning of 
the preamble (which begins and ends with a !).


The current approach fails in the presence of non-ASCII characters. More 
specifically because of a mixup between the concept of byte position and 
character position when using UTF-8, a variable length encoding (both the 
.changes and the .sources are UTF-8 encoded).

For example, consider

'à partir de 10 €' size. "16"
'à partir de 10 €' utf8Encoded size. "19"

So although the string contains 16 characters, it is encoded as 19 bytes, à 
using 2 bytes and € using 3 bytes. In general, moving backwards or forwards in 
UTF-8 encoded bytes cannot be done without understanding UTF-8 itself.

ZnUTF8Encoder can do both (moving forward is #nextFromStream: while moving 
backwards is #backOnStream:). However, ZnUTF8Encoder is also strict: it will 
signal an error when forced to operate in between encoded characters, which is 
what happens here.

It is thus not possible to move to arbitrary bytes positions and assume/hope to 
always arrive on the correct character boundaries and it is also wrong to take 
the difference between two byte positions as the count of characters present 
(since their encoding is of variable length).

SourcFileArray>>#getPreambleFrom:at: is doing both of these wrong (but gets 
away with it in 99.99% of all cases since very few people name their classes 
like that).

There are two solutions: operate mostly on the byte level or operate correctly 
on the character level. Here are two conceptual solutions (you must execute 
either solution 1 or 2, not both), with two different inputs.


src := '!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 
1/27/2019 12:27''!
euro
        ^ ''€''! !'.

"startPosition := 83"

str := ZnCharacterReadStream on: (src utf8Encoded readStream).
str position: 83. "at start of euro, the methods source string"
str upToEnd.

str position: (83 - 3). "before ! before euro"

"find the previous ! before position"
position := str position.
binary := str wrappedStream.
encoder := str encoder.

"solution 1"
[ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] 
whileTrue: [ position := position - 1 ].
position.
encoder decodeBytes: (binary next: 80 - position).

"solution 2"
count := 0.
[ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [ 
        encoder backOnStream: binary; backOnStream: binary. count := count + 1 
].
str position.
str next: count.


Same code, different input (and starting position):


src := '!ABC!ClasseFrançaise methodsFor: ''test'' stamp: ''SvenVanCaekenberghe 
1/27/2019 12:27''!
euro
        ^ ''€''! !'.

"startPosition := 87"

str := ZnCharacterReadStream on: (src utf8Encoded readStream).
str position: 87. "at start of euro, the methods source string"
str upToEnd.

str position: (87 - 3). "before ! before euro"

"find the previous ! before position"
position := str position.
binary := str wrappedStream.
encoder := str encoder.

"solution 1"
[ position >= 0 and: [ binary position: position. binary next ~= 33 ] ] 
whileTrue: [ position := position - 1 ].
position.
encoder decodeBytes: (binary next: 80 - position).

"solution 2"
count := 0.
[ str position >= 0 and: [ str next ~= $! ] ] whileTrue: [ 
        encoder backOnStream: binary; backOnStream: binary. count := count + 1 
].
str position.
str next: count.


Solution 1 (at the byte level) works because $! (ASCII code 33) is also encoded 
as such in UTF-8 and this byte pattern cannot be part of a 2, 3 or 4 byte UTF-8 
encoded character. The final byte segment must then be given to the proper 
encoder to turn it into characters.

Solution 2 (at the character level) works because it essentially counts 
characters while moving backwards correctly (it moves back 2 steps each time 
because #next moves 1 step forward, while #peek would not help).

Note that in Solution 1 the moving position is held in an external variable, 
while in Solution 2 it is held inside the stream.


Implementing either solution requires more internal access of the streams 
(#wrappedStream), so SourcFileArray>>#getPreambleFrom:at: should at least be 
moved to SourceFile, IMHO.

Obviously this is a sensitive/critical area to touch.

We will also need a regression unit test.

On a higher design level, I think that CompiledMethod>>#timeStamp and 
CompiledMethod>>#author should be cached at the instance level (a unix 
timestamp and a symbol would cost very little).

Sven

Re: [Pharo-users] Class name with diacritic character and Pharo

Reply via email to