Re: [Pharo-dev] Unicode Support

2015-12-13 Thread stepharo



I am pretty sure that this whole discussion does more harm than good for most 
people's understanding of Unicode.

It is best and (mostly) correct to think of a Unicode string as a sequence of 
Unicode characters, each defined/identified by a code point (out of the tens of thousands 
covering all languages). That is what we have today in Pharo (with the 
distinction between ByteString and WideString as mostly invisible 
implementation details).

To encode Unicode for external representation as bytes, we use UTF-8 like the 
rest of the modern world.

So far, so good.

Why all the confusion ? Because the world is a complex place and the Unicode 
standard tries to cover all possible things. Citing all these exceptions and 
special cases will make people crazy and give up. I am sure that most stopped 
reading this thread.



like me ;)
I will wait for a conclusion with code :)

Stef



Why then is there confusion about the seemingly simple concept of a character ? 
Because Unicode allows different ways to say the same thing. The simplest 
example in a common language is the French letter é:

LATIN SMALL LETTER E WITH ACUTE [U+00E9]

which can also be written as

LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301]

The former being a composed normal form, the latter a decomposed normal form. 
(And yes, it is even much more complicated than that, it goes on for 1000s of 
pages).

In the above example, the concept of character/string is indeed fuzzy.
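Sven's two spellings of é can be checked mechanically; here is a sketch using Python's standard unicodedata module (an illustration of the Unicode normal forms, not Pharo code):

```python
import unicodedata

composed   = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"       # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# The two strings render identically but differ code point by code point:
print(composed == decomposed)                                  # False
# Normalization maps each spelling onto the other:
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```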

HTH,

Sven








Re: [Pharo-dev] Unicode Support

2015-12-11 Thread Eliot Miranda
Hi Todd,

> On Dec 11, 2015, at 12:57 PM, Todd Blanchard  wrote:
> 
> 
>> On Dec 11, 2015, at 12:19, EuanM  wrote:
>> 
>> "If it hasn't already been said, please do not conflate Unicode and
>> UTF-8. I think that would be a recipe for
>> a high P.I.T.A. factor."  --Richard Sargent
> 
> Well, yes. But  I think you guys are making this way too hard.
> 
> A unicode character is an abstract idea - for instance the letter 'a'.
> The letter 'a' has a code point - it's the number 97.  How the number 97 is 
> represented in the computer is irrelevant.
> 
> Now we get to transfer encodings.  These are UTF8, UTF16, etc  A transfer 
> encoding specifies the binary representation of the sequence of code points.
> 
> UTF8 is a variable length byte encoding.  You read it one byte at a time, 
> aggregating byte sequences to 'code points'.  ByteArray would be an excellent 
> choice as a superclass, but it must be understood that #at: or #at:put: refers 
> to a byte, not a character.  If you want characters, you have to start at the 
> beginning and process it sequentially, like a stream (if working in the ASCII 
> domain - you can generally 'cheat' this a bit).  A C representation would be 
> char utf8[]
> 
> UTF16 is also a variable length encoding of two byte quantities - what C used 
> to call a 'short int'.  You process it in two byte chunks instead of one byte 
> chunks.  Like UTF8, you must read it sequentially to interpret the 
> characters.  #at: and #at:put: would necessarily refer to byte pairs and not 
> characters.  A C representation would be short utf16[]; it would also be 50% 
> space-inefficient for ASCII - which is normally the bulk of your text.
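Todd's description of the two transfer encodings can be illustrated in Python (an illustration only; the C declarations above are his):

```python
# UTF-8 is variable length: one to four bytes per code point.
s = "a\u00e9\u20ac"   # 'a', 'é', '€': 1-, 2- and 3-byte sequences in UTF-8
print([len(c.encode("utf-8")) for c in s])   # [1, 2, 3]

# UTF-16 uses two-byte code units, hence the 50% overhead for ASCII text:
print(len("hello".encode("utf-8")))          # 5
print(len("hello".encode("utf-16-le")))      # 10
```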
> 
> Realistically, you need exactly one in-memory format and stream 
> readers/writers that can convert (these are typically table driven state 
> machines).  My choice would be UTF8 for the internal memory format and the 
> ability to read and write from UTF8 to UTF16.  
> 
> But I stress again...strings don't really need indexability as much as you 
> think and neither UTF8 nor UTF16 provide this property anyhow as they are 
> variable length encodings.  I don't see any sensible reason to have more than 
> one in-memory binary format in the image.

The only reasons are space and time.  If a string only contains code points in 
the range 0-255 there's no point in squandering 4 bytes per code point (same 
goes for 0-65535).  Further, if in some application interchange is more 
important than random access, it may make sense on performance grounds to use 
utf-8 directly.
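Eliot's space argument is what Pharo's ByteString/WideString split implements. A hypothetical sketch of the idea in Python (the function name and thresholds are mine, not Pharo's):

```python
def bytes_per_code_point(s: str) -> int:
    """Narrowest per-character storage able to hold every code point in s
    (hypothetical helper illustrating the ByteString/WideString idea)."""
    m = max(map(ord, s), default=0)
    if m < 0x100:
        return 1   # ByteString-like: every code point fits in one byte
    if m < 0x10000:
        return 2   # every code point fits in two bytes
    return 4       # WideString-like: full Unicode range

print(bytes_per_code_point("hello"))        # 1
print(bytes_per_code_point("h\u00e9llo"))   # 1: é is U+00E9, still below 256
print(bytes_per_code_point("h\u20acllo"))   # 2: € is U+20AC
print(bytes_per_code_point("h\U0001F600i")) # 4: an emoji lies beyond the BMP
```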

Again, Smalltalk's dynamic typing makes it easy to have one's cake and eat it 
too.

> My $0.02c

_,,,^..^,,,_ (phone)

> 
>> I agree. :-)
>> 
>> Regarding UTF-16, I just want to be able to export to, and receive
>> from, Windows (and any other platforms using UTF-16 as their native
>> character representation).
>> 
>> Windows will always be able to accept UTF-16.  All Windows apps *might
>> well* export UTF-16.  There may be other platforms which use UTF-16 as
>> their native format.  I'd just like to be able to cope with those
>> situations.  Nothing more.
>> 
>> All this requires is a Utf16String class that has an asUtf8String
>> method (and any other required conversion methods). 
> 


Re: [Pharo-dev] Unicode Support

2015-12-11 Thread Richard Sargent
EuanM wrote
> ...
> all ISO-8859-1 maps 1:1 to Unicode UTF-8
> ...

I am late coming in to this conversation. If it hasn't already been said,
please do not conflate Unicode and UTF-8. I think that would be a recipe for
a high P.I.T.A. factor.

Unicode defines the meaning of the code points.
UTF-8 (and -16) define an interchange mechanism.

In other words, when you write the code points to an external medium
(socket, file, whatever), encode them via UTF-whatever. Read UTF-whatever
from an external medium and re-instantiate the code points.
(Personally, I see no use for UTF-16 as an interchange mechanism. Others may
have justification for it. I don't.)

Having characters be a consistent size in their object representation makes
everything easier. #at:, #indexOf:, #includes: ... no one wants to be
scanning through bytes representing variable sized characters.

Model Unicode strings using classes such as e.g. Unicode7, Unicode16, and
Unicode32, with automatic coercion to the larger character width.




--
View this message in context: 
http://forum.world.st/Unicode-Support-tp4865139p4866610.html
Sent from the Pharo Smalltalk Developers mailing list archive at Nabble.com.



Re: [Pharo-dev] Unicode Support

2015-12-10 Thread Ben Coman
On Wed, Dec 9, 2015 at 5:35 PM, Guillermo Polito
 wrote:
>
>> On 8 dic 2015, at 10:07 p.m., EuanM  wrote:
>>
>> "No. a codepoint is the numerical value assigned to a character. An
>> "encoded character" is the way a codepoint is represented in bytes
>> using a given encoding."
>>
>> No.
>>
>> A codepoint may represent a component part of an abstract character,
>> or may represent an abstract character, or it may do both (but not
>> always at the same time).
>>
>> Codepoints represent a single encoding of a single concept.
>>
>> Sometimes that concept represents a whole abstract character.
>> Sometimes it represents part of an abstract character.
>
> Well. I do not agree with this. I agree with the quote.
>
> Can you explain a bit more about what you mean by abstract character and 
> concept?

This seems to be what Swift is doing, where Strings are composed not
of codepoints but of graphemes.

>>> "Every instance of Swift’s Character type represents a single extended 
>>> grapheme cluster. An extended grapheme cluster is a sequence** of one or 
>>> more Unicode scalars that (when combined) produce a single human-readable 
>>> character. [1]

** i.e. not an array

>>> Here’s an example. The letter é can be represented as the single Unicode 
>>> scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same 
>>> letter can also be represented as a pair of scalars—a standard letter e 
>>> (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT 
>>> scalar (U+0301). The COMBINING ACUTE ACCENT scalar is graphically applied to 
>>> the scalar that precedes it, turning an e into an é when it is rendered by a 
>>> Unicode-aware text-rendering system. [1]

>>> In both cases, the letter é is represented as a single Swift Character 
>>> value that represents an extended grapheme cluster. In the first case, the 
>>> cluster contains a single scalar; in the second case, it is a cluster of 
>>> two scalars:" [1]

>>> Swiftʼs string implementation makes working with Unicode easier and 
>>> significantly less error-prone. As a programmer, you still have to be aware 
>>> of possible edge cases, but this probably cannot be avoided completely 
>>> considering the characteristics of Unicode. [2]
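Pharo's strings, like Python's, are sequences of code points rather than grapheme clusters, which makes the Swift behaviour easy to contrast. A quick Python illustration (NFC only approximates a grapheme count, and fails for clusters with no precomposed form):

```python
import unicodedata

s = "e\u0301"   # decomposed é: two code points, one user-perceived character

print(len(s))   # 2: Python counts code points, as Pharo's String does
# Swift's Character type would report a count of 1 here.
# NFC normalization approximates a grapheme count for this example:
print(len(unicodedata.normalize("NFC", s)))   # 1
```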

Indeed, I've searched for what problems it causes and got a null
result.  So I read *all good* things about Swift's unicode
implementation reducing common errors dealing with Unicode.  Can
anyone point to complaints about Swift's unicode implementation?
Maybe this...

>>> An argument could be made that the implementation of String as a sequence 
>>> that requires iterating over characters from the beginning of the string 
>>> for many operations poses a significant performance problem but I do not 
>>> think so. My guess is that Appleʼs engineers have considered the 
>>> implications of their implementation and apps that do not deal with 
>>> enormous amounts of text will be fine. Moreover, the idea that you could 
>>> get away with an implementation that supports random access of characters 
>>> is an illusion given the complexity of Unicode. [2]

Considering our common pattern -- Make it work, Make it right, Make it
fast -- maybe Strings as arrays are a premature optimisation: the right
choice in the past, prior to Unicode, but considering Moore's Law
versus programmer time, not the best choice now.
Should we at least start with a UnicodeString and UnicodeCharacter
that operate like Swift, and over time *maybe* move the tools to use
them?

[1] 
https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
[2] http://oleb.net/blog/2014/07/swift-strings/

cheers -ben

>
>>
>> This is the key difference between Unicode and most character encodings.
>>
>> A codepoint does not always represent a whole character.
>>
>> On 7 December 2015 at 13:06, Henrik Johansen



Re: [Pharo-dev] Unicode Support // e acute example --> decomposition in Pharo?

2015-12-10 Thread H. Hirzel
Hello Sven

On 12/9/15, Sven Van Caekenberghe  wrote:

> The simplest example in a common language is the French letter é:
>
> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>
> which can also be written as
>
> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301]
>
> The former being a composed normal form, the latter a decomposed normal
> form. (And yes, it is even much more complicated than that, it goes on for
> 1000s of pages).
>
> In the above example, the concept of character/string is indeed fuzzy.
>
> HTH,
>
> Sven

Thanks for this example. I have created a wiki page with it

I wonder what the Pharo equivalent is of the following Squeak expression

$é asString asDecomposedUnicode

Regards

Hannes



Re: [Pharo-dev] Unicode Support

2015-12-09 Thread Sven Van Caekenberghe

> On 09 Dec 2015, at 14:16, EuanM  wrote:
> 
> "To encode Unicode for external representation as bytes, we use UTF-8
> like the rest of the modern world.
> 
> So far, so good.
> 
> Why all the confusion ?"

That was a rhetorical question.

I know that we lack normalization; we don't need another encoding or 
representation.

Sorting/collation can also be done regardless of encoding or representation.

These are orthogonal concerns to the working situation that we have today.

> The confusion arises because simply providing *a* valid UTF-8 encoding
> does not ensure sortability, nor equivalence testability.
> 
> It might provide sortable strings. It might not.
> 
> It might provide a string that can be compared to another string
> successfully.  It might not.
> 
> So being able to perform valid UTF-8 encoding is *necessary*, but *not
> sufficient*.
> 
> i.e. the confusion arises because UTF-8 can provide for several
> competing, non-sortable encodings of even a single character.  This
> means that *valid* UTF-8 cannot be relied upon to provide these
> facilities *unless* all the UTF-8 strings can be relied upon to have
> been encoded to UTF-8 by the same specification of process.  i.e.
> *unless* it has gone through a process of being converted by *a
> specific* valid method of encoding to UTF-8.
> 
> Understanding the concept of abstract character is, imo key to
> understanding the differences between the various valid UTF-8 forms of
> a given abstract character.
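EuanM's point, that two valid UTF-8 byte sequences can encode the same abstract character yet compare unequal, can be demonstrated in Python (an illustration, not Pharo code):

```python
import unicodedata

# Two *valid* UTF-8 encodings of the abstract character é:
nfc_bytes = "\u00e9".encode("utf-8")     # b'\xc3\xa9'  (composed form)
nfd_bytes = "e\u0301".encode("utf-8")    # b'e\xcc\x81' (decomposed form)

print(nfc_bytes == nfd_bytes)            # False: byte-wise comparison fails

# Equivalence testing needs decoding plus agreement on one normal form:
same = (unicodedata.normalize("NFC", nfc_bytes.decode("utf-8")) ==
        unicodedata.normalize("NFC", nfd_bytes.decode("utf-8")))
print(same)                              # True
```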
> 
> 
> Cheers,
>Euan
> 
> On 9 December 2015 at 10:45, Sven Van Caekenberghe  wrote:
>> 
>>> On 09 Dec 2015, at 10:35, Guillermo Polito  
>>> wrote:
>>> 
>>> 
 On 8 dic 2015, at 10:07 p.m., EuanM  wrote:
 
 "No. a codepoint is the numerical value assigned to a character. An
 "encoded character" is the way a codepoint is represented in bytes
 using a given encoding."
 
 No.
 
 A codepoint may represent a component part of an abstract character,
 or may represent an abstract character, or it may do both (but not
 always at the same time).
 
 Codepoints represent a single encoding of a single concept.
 
 Sometimes that concept represents a whole abstract character.
 Sometimes it represents part of an abstract character.
>>> 
>>> Well. I do not agree with this. I agree with the quote.
>>> 
>>> Can you explain a bit more about what you mean by abstract character and 
>>> concept?
>> 
>> I am pretty sure that this whole discussion does more harm than good for 
>> most people's understanding of Unicode.
>> 
>> It is best and (mostly) correct to think of a Unicode string as a sequence 
>> of Unicode characters, each defined/identified by a code point (out of 
>> the tens of thousands covering all languages). That is what we have today in Pharo (with 
>> the distinction between ByteString and WideString as mostly invisible 
>> implementation details).
>> 
>> To encode Unicode for external representation as bytes, we use UTF-8 like 
>> the rest of the modern world.
>> 
>> So far, so good.
>> 
>> Why all the confusion ? Because the world is a complex place and the Unicode 
>> standard tries to cover all possible things. Citing all these exceptions and 
>> special cases will make people crazy and give up. I am sure that most 
>> stopped reading this thread.
>> 
>> Why then is there confusion about the seemingly simple concept of a 
>> character ? Because Unicode allows different ways to say the same thing. The 
>> simplest example in a common language is the French letter é:
>> 
>> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>> 
>> which can also be written as
>> 
>> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301]
>> 
>> The former being a composed normal form, the latter a decomposed normal 
>> form. (And yes, it is even much more complicated than that, it goes on for 
>> 1000s of pages).
>> 
>> In the above example, the concept of character/string is indeed fuzzy.
>> 
>> HTH,
>> 
>> Sven
>> 
>> 
> 




Re: [Pharo-dev] Unicode Support

2015-12-09 Thread EuanM
"Well. I do not agree with this. I agree with the quote.

Can you explain a bit more about what you mean by abstract character
and concept?"

Guillermo

The problem with the quote is that, *while true*, it *does not
disambiguate* between:
either
compatibility character and abstract character;
or
character as composable component of an abstract character and
character as the entire embodiment of an abstract character.

Abstract character is the key concept of Unicode.  Differentiation
between abstract character and codepoints is the key distinction between
the Unicode approach and most previous approaches to character
encoding, e.g. ASCII, EBCDIC, ISO Latin 1, etc.

Please see my previous posts which use the example of Angstrom,
Capital A with circle (or whatever the canonical name is) and the
composed sequence of "Capital A" and "circle above a letter" for a
fuller explanation of the concept of "abstract character".



On 9 December 2015 at 09:35, Guillermo Polito  wrote:
>
>> On 8 dic 2015, at 10:07 p.m., EuanM  wrote:
>>
>> "No. a codepoint is the numerical value assigned to a character. An
>> "encoded character" is the way a codepoint is represented in bytes
>> using a given encoding."
>>
>> No.
>>
>> A codepoint may represent a component part of an abstract character,
>> or may represent an abstract character, or it may do both (but not
>> always at the same time).
>>
>> Codepoints represent a single encoding of a single concept.
>>
>> Sometimes that concept represents a whole abstract character.
>> Sometimes it represents part of an abstract character.
>
> Well. I do not agree with this. I agree with the quote.
>
> Can you explain a bit more about what you mean by abstract character and 
> concept?
>
>>
>> This is the key difference between Unicode and most character encodings.
>>
>> A codepoint does not always represent a whole character.
>>
>> On 7 December 2015 at 13:06, Henrik Johansen
>>  wrote:
>>>
>>> On 07 Dec 2015, at 1:05 , EuanM  wrote:
>>>
>>> Hi Henry,
>>>
>>> To be honest, at some point I'm going to long for the much
>>> more succinct semantics of healthcare systems and sports scoring and
>>> administration systems again.  :-)
>>>
>>> codepoints are any of *either*
>>> - the representation of a component of an abstract character, *or*
>>> eg. "A" #(0041) as a component of
>>> - the sole representation of the whole of an abstract character *or* of
>>> -  a representation of an abstract character provided for backwards
>>> compatibility which is more properly represented by a series of
>>> codepoints representing a composed character
>>>
>>> e.g.
>>>
>>> The "A" #(0041) as a codepoint can be:
>>> the sole representation of the whole of an abstract character "A" #(0041)
>>>
>>> The representation of a component of the composed (i.e. preferred)
>>> version of the abstract character Å #(0041 030a)
>>>
>>> Å (#00C5) represents one valid compatibility form of the abstract
>>> character Å which is most properly represented by #(0041 030a).
>>>
>>> Å (#212b) also represents one valid compatibility form of the abstract
>>> character Å which is most properly represented by #(0041 030a).
>>>
>>> With any luck, this satisfies both our semantic understandings of the
>>> concept of "codepoint"
>>>
>>> Would you agree with that?
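EuanM's Å example can be checked with Python's unicodedata (an illustration; as noted elsewhere in the thread, Pharo itself lacks these normalization primitives):

```python
import unicodedata

composed   = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
angstrom   = "\u212b"    # ANGSTROM SIGN, a compatibility code point

# NFD maps both single-code-point forms to the decomposed sequence:
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
print(unicodedata.normalize("NFD", angstrom) == decomposed)   # True
# NFC recomposes to U+00C5; U+212B never comes back:
print(unicodedata.normalize("NFC", angstrom) == composed)     # True
```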
>>>
>>> In Unicode, codepoints are *NOT* an abstract numerical representation
>>> of a text character.
>>>
>>> At least not as we generally understand the term "text character" from
>>> our experience of non-Unicode character mappings.
>>>
>>>
>>> I agree, they are numerical representations of what Unicode refers to as
>>> characters.
>>>
>>>
>>> codepoints represent "*encoded characters*"
>>>
>>>
>>> No. a codepoint is the numerical value assigned to a character. An "encoded
>>> character" is the way a codepoint is represented in bytes using a given
>>> encoding.
>>>
>>> and "a *text element* ...
>>> is represented by a sequence of one or more codepoints".  (And the
>>> term "text element" is deliberately left undefined in the Unicode
>>> standard)
>>>
>>> Individual codepoints are very often *not* the encoded form of an
>>> abstract character that we are interested in.  Unless we are
>>> communicating to or from another system  (Which in some cases is the
>>> Smalltalk ByteString class)
>>>
>>>
>>>
>>>
>>> i.e. in other words
>>>
>>> *Some* individual codepoints *may* be a representation of a specific
>>> *abstract character*, but only in special cases.
>>>
>>> The general case in Unicode is that Unicode defines (a)
>>> representation(s) of a Unicode *abstract character*.
>>>
>>> The Unicode standard representation of an abstract character is a
>>> composed sequence of codepoints, where in some cases that sequence is
>>> as short as 1 codepoint.
>>>
>>> In other cases, Unicode has a compatibility alias of a single
>>> codepoint which is *also* a representation of an abstract character
>>>
>>> There are some cases where an abstract character can be represented by
>>> more t

Re: [Pharo-dev] Unicode Support

2015-12-09 Thread H. Hirzel
See example with ANGSTROM

Abstract Characters (Unicode)
http://wiki.squeak.org/squeak/6256



On 12/9/15, Guillermo Polito  wrote:
>
>> On 8 dic 2015, at 10:07 p.m., EuanM  wrote:
>>
>> "No. a codepoint is the numerical value assigned to a character. An
>> "encoded character" is the way a codepoint is represented in bytes
>> using a given encoding."
>>
>> No.
>>
>> A codepoint may represent a component part of an abstract character,
>> or may represent an abstract character, or it may do both (but not
>> always at the same time).
>>
>> Codepoints represent a single encoding of a single concept.
>>
>> Sometimes that concept represents a whole abstract character.
>> Sometimes it represents part of an abstract character.
>
> Well. I do not agree with this. I agree with the quote.
>
> Can you explain a bit more about what you mean by abstract character and
> concept?
>
>>
>> This is the key difference between Unicode and most character encodings.
>>
>> A codepoint does not always represent a whole character.
>>
>> On 7 December 2015 at 13:06, Henrik Johansen
>>  wrote:
>>>
>>> On 07 Dec 2015, at 1:05 , EuanM  wrote:
>>>
>>> Hi Henry,
>>>
>>> To be honest, at some point I'm going to long for the much
>>> more succinct semantics of healthcare systems and sports scoring and
>>> administration systems again.  :-)
>>>
>>> codepoints are any of *either*
>>> - the representation of a component of an abstract character, *or*
>>> eg. "A" #(0041) as a component of
>>> - the sole representation of the whole of an abstract character *or* of
>>> -  a representation of an abstract character provided for backwards
>>> compatibility which is more properly represented by a series of
>>> codepoints representing a composed character
>>>
>>> e.g.
>>>
>>> The "A" #(0041) as a codepoint can be:
>>> the sole representation of the whole of an abstract character "A"
>>> #(0041)
>>>
>>> The representation of a component of the composed (i.e. preferred)
>>> version of the abstract character Å #(0041 030a)
>>>
>>> Å (#00C5) represents one valid compatibility form of the abstract
>>> character Å which is most properly represented by #(0041 030a).
>>>
>>> Å (#212b) also represents one valid compatibility form of the abstract
>>> character Å which is most properly represented by #(0041 030a).
>>>
>>> With any luck, this satisfies both our semantic understandings of the
>>> concept of "codepoint"
>>>
>>> Would you agree with that?
>>>
>>> In Unicode, codepoints are *NOT* an abstract numerical representation
>>> of a text character.
>>>
>>> At least not as we generally understand the term "text character" from
>>> our experience of non-Unicode character mappings.
>>>
>>>
>>> I agree, they are numerical representations of what Unicode refers to as
>>> characters.
>>>
>>>
>>> codepoints represent "*encoded characters*"
>>>
>>>
>>> No. a codepoint is the numerical value assigned to a character. An
>>> "encoded
>>> character" is the way a codepoint is represented in bytes using a given
>>> encoding.
>>>
>>> and "a *text element* ...
>>> is represented by a sequence of one or more codepoints".  (And the
>>> term "text element" is deliberately left undefined in the Unicode
>>> standard)
>>>
>>> Individual codepoints are very often *not* the encoded form of an
>>> abstract character that we are interested in.  Unless we are
>>> communicating to or from another system  (Which in some cases is the
>>> Smalltalk ByteString class)
>>>
>>>
>>>
>>>
>>> i.e. in other words
>>>
>>> *Some* individual codepoints *may* be a representation of a specific
>>> *abstract character*, but only in special cases.
>>>
>>> The general case in Unicode is that Unicode defines (a)
>>> representation(s) of a Unicode *abstract character*.
>>>
>>> The Unicode standard representation of an abstract character is a
>>> composed sequence of codepoints, where in some cases that sequence is
>>> as short as 1 codepoint.
>>>
>>> In other cases, Unicode has a compatibility alias of a single
>>> codepoint which is *also* a representation of an abstract character
>>>
>>> There are some cases where an abstract character can be represented by
>>> more than one single-codepoint compatibility codepoint.
>>>
>>> Cheers,
>>> Euan
>>>
>>>
>>> I agree you have a good grasp of the distinction between an abstract
>>> character (characters and character sequences which should be treated
>>> equivalent wrt, equality / sorting / display, etc.) and a character
>>> (which
>>> each have a code point assigned).
>>> That is besides the point both Sven and I tried to get through, which is
>>> the
>>> difference between a code point and the encoded form(s) of said code
>>> point.
>>> When you write:
>>> "and therefore encodable in UTF-8 as compatibility codepoint e9 hex
>>> and as the composed character #(0065 00b4) (all in hex) and as the
>>> same composed character as both
>>> #(feff 0065 00b4) and #(ffef 0065 00b4) when endianness markers are
>>> included"
>>>
>>> It's quite clear you confuse the 

Re: [Pharo-dev] Unicode Support

2015-12-09 Thread Sven Van Caekenberghe

> On 09 Dec 2015, at 10:35, Guillermo Polito  wrote:
> 
> 
>> On 8 dic 2015, at 10:07 p.m., EuanM  wrote:
>> 
>> "No. a codepoint is the numerical value assigned to a character. An
>> "encoded character" is the way a codepoint is represented in bytes
>> using a given encoding."
>> 
>> No.
>> 
>> A codepoint may represent a component part of an abstract character,
>> or may represent an abstract character, or it may do both (but not
>> always at the same time).
>> 
>> Codepoints represent a single encoding of a single concept.
>> 
>> Sometimes that concept represents a whole abstract character.
>> Sometimes it represents part of an abstract character.
> 
> Well. I do not agree with this. I agree with the quote.
> 
> Can you explain a bit more about what you mean by abstract character and 
> concept?

I am pretty sure that this whole discussion does more harm than good for most 
people's understanding of Unicode. 

It is best and (mostly) correct to think of a Unicode string as a sequence of 
Unicode characters, each defined/identified by a code point (out of the tens of thousands 
covering all languages). That is what we have today in Pharo (with the 
distinction between ByteString and WideString as mostly invisible 
implementation details).

To encode Unicode for external representation as bytes, we use UTF-8 like the 
rest of the modern world. 

So far, so good.

Why all the confusion ? Because the world is a complex place and the Unicode 
standard tries to cover all possible things. Citing all these exceptions and 
special cases will make people crazy and give up. I am sure that most stopped 
reading this thread.

Why then is there confusion about the seemingly simple concept of a character ? 
Because Unicode allows different ways to say the same thing. The simplest 
example in a common language is the French letter é:

LATIN SMALL LETTER E WITH ACUTE [U+00E9]

which can also be written as

LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301]

The former being a composed normal form, the latter a decomposed normal form. 
(And yes, it is even much more complicated than that, it goes on for 1000s of 
pages).

In the above example, the concept of character/string is indeed fuzzy.

HTH,

Sven




Re: [Pharo-dev] Unicode Support

2015-12-09 Thread Guillermo Polito

> On 8 dic 2015, at 10:07 p.m., EuanM  wrote:
> 
> "No. a codepoint is the numerical value assigned to a character. An
> "encoded character" is the way a codepoint is represented in bytes
> using a given encoding."
> 
> No.
> 
> A codepoint may represent a component part of an abstract character,
> or may represent an abstract character, or it may do both (but not
> always at the same time).
> 
> Codepoints represent a single encoding of a single concept.
> 
> Sometimes that concept represents a whole abstract character.
> Sometimes it represents part of an abstract character.

Well. I do not agree with this. I agree with the quote.

Can you explain a bit more about what you mean by abstract character and 
concept?

> 
> This is the key difference between Unicode and most character encodings.
> 
> A codepoint does not always represent a whole character.
> 
> On 7 December 2015 at 13:06, Henrik Johansen
>  wrote:
>> 
>> On 07 Dec 2015, at 1:05 , EuanM  wrote:
>> 
>> Hi Henry,
>> 
>> To be honest, at some point I'm going to long for the much
>> more succinct semantics of healthcare systems and sports scoring and
>> administration systems again.  :-)
>> 
>> codepoints are any of *either*
>> - the representation of a component of an abstract character, *or*
>> eg. "A" #(0041) as a component of
>> - the sole representation of the whole of an abstract character *or* of
>> -  a representation of an abstract character provided for backwards
>> compatibility which is more properly represented by a series of
>> codepoints representing a composed character
>> 
>> e.g.
>> 
>> The "A" #(0041) as a codepoint can be:
>> the sole representation of the whole of an abstract character "A" #(0041)
>> 
>> The representation of a component of the composed (i.e. preferred)
>> version of the abstract character Å #(0041 030a)
>> 
>> Å (#00C5) represents one valid compatibility form of the abstract
>> character Å which is most properly represented by #(0041 030a).
>> 
>> Å (#212b) also represents one valid compatibility form of the abstract
>> character Å which is most properly represented by #(0041 030a).
>> 
>> With any luck, this satisfies both our semantic understandings of the
>> concept of "codepoint"
>> 
>> Would you agree with that?
>> 
>> In Unicode, codepoints are *NOT* an abstract numerical representation
>> of a text character.
>> 
>> At least not as we generally understand the term "text character" from
>> our experience of non-Unicode character mappings.
>> 
>> 
>> I agree, they are numerical representations of what Unicode refers to as
>> characters.
>> 
>> 
>> codepoints represent "*encoded characters*"
>> 
>> 
>> No. a codepoint is the numerical value assigned to a character. An "encoded
>> character" is the way a codepoint is represented in bytes using a given
>> encoding.
>> 
>> and "a *text element* ...
>> is represented by a sequence of one or more codepoints".  (And the
>> term "text element" is deliberately left undefined in the Unicode
>> standard)
>> 
>> Individual codepoints are very often *not* the encoded form of an
>> abstract character that we are interested in.  Unless we are
>> communicating to or from another system  (Which in some cases is the
>> Smalltalk ByteString class)
>> 
>> 
>> 
>> 
>> i.e. in other words
>> 
>> *Some* individual codepoints *may* be a representation of a specific
>> *abstract character*, but only in special cases.
>> 
>> The general case in Unicode is that Unicode defines (a)
>> representation(s) of a Unicode *abstract character*.
>> 
>> The Unicode standard representation of an abstract character is a
>> composed sequence of codepoints, where in some cases that sequence is
>> as short as 1 codepoint.
>> 
>> In other cases, Unicode has a compatibility alias of a single
>> codepoint which is *also* a representation of an abstract character
>> 
>> There are some cases where an abstract character can be represented by
>> more than one single-codepoint compatibility codepoint.
>> 
>> Cheers,
>> Euan
>> 
>> 
>> I agree you have a good grasp of the distinction between an abstract
>> character (characters and character sequences which should be treated
>> equivalent wrt, equality / sorting / display, etc.) and a character (which
>> each have a code point assigned).
>> That is besides the point both Sven and I tried to get through, which is the
>> difference between a code point and the encoded form(s) of said code point.
>> When you write:
>> "and therefore encodable in UTF-8 as compatibility codepoint e9 hex
>> and as the composed character #(0065 00b4) (all in hex) and as the
>> same composed character as both
>> #(feff 0065 00b4) and #(ffef 0065 00b4) when endianness markers are
>> included"
>> 
>> It's quite clear you are confusing the two. 0xFEFF is the codepoint of the
>> character used as a BOM.
>> When you state that it can be written ffef (I assume you meant FFFE), you
>> are again confusing the code point and its encoded value (an encoded value
>> which only occurs in UTF1

Re: [Pharo-dev] Unicode Support

2015-12-08 Thread EuanM
"No. a codepoint is the numerical value assigned to a character. An
"encoded character" is the way a codepoint is represented in bytes
using a given encoding."

No.

A codepoint may represent a component part of an abstract character,
or may represent an abstract character, or it may do both (but not
always at the same time).

Codepoints represent a single encoding of a single concept.

Sometimes that concept represents a whole abstract character.
Sometimes it represents part of an abstract character.

This is the key difference between Unicode and most character encodings.

A codepoint does not always represent a whole character.
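
EuanM's point here — that a single codepoint is not always a whole character — can be illustrated concretely. A sketch in Python (used only because it exposes the Unicode database directly; the same distinction applies to Pharo strings):

```python
import unicodedata

composed = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE: one code point
decomposed = "\u0065\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT: two

# Two different code point sequences, one abstract character:
assert len(composed) == 1 and len(decomposed) == 2
assert composed != decomposed                 # naive equality fails

# Normalization maps between the composed (NFC) and decomposed (NFD) forms:
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```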

On 7 December 2015 at 13:06, Henrik Johansen
 wrote:
>
> On 07 Dec 2015, at 1:05 , EuanM  wrote:
>
> Hi Henry,
>
> To be honest, at some point I'm going to long for the much
> more succinct semantics of healthcare systems and sports scoring and
> administration systems again.  :-)
>
> codepoints are any of *either*
>  - the representation of a component of an abstract character, *or*
> eg. "A" #(0041) as a component of
>  - the sole representation of the whole of an abstract character *or* of
> -  a representation of an abstract character provided for backwards
> compatibility which is more properly represented by a series of
> codepoints representing a composed character
>
> e.g.
>
> The "A" #(0041) as a codepoint can be:
> the sole representation of the whole of an abstract character "A" #(0041)
>
> The representation of a component of the composed (i.e. preferred)
> version of the abstract character Å #(0041 030a)
>
> Å (#00C5) represents one valid compatibility form of the abstract
> character Å which is most properly represented by #(0041 030a).
>
> Å (#212b) also represents one valid compatibility form of the abstract
> character Å which is most properly represented by #(0041 030a).
>
> With any luck, this satisfies both our semantic understandings of the
> concept of "codepoint"
>
> Would you agree with that?
>
> In Unicode, codepoints are *NOT* an abstract numerical representation
> of a text character.
>
> At least not as we generally understand the term "text character" from
> our experience of non-Unicode character mappings.
>
>
> I agree, they are numerical representations of what Unicode refers to as
> characters.
>
>
> codepoints represent "*encoded characters*"
>
>
> No. a codepoint is the numerical value assigned to a character. An "encoded
> character" is the way a codepoint is represented in bytes using a given
> encoding.
>
> and "a *text element* ...
> is represented by a sequence of one or more codepoints".  (And the
> term "text element" is deliberately left undefined in the Unicode
> standard)
>
> Individual codepoints are very often *not* the encoded form of an
> abstract character that we are interested in.  Unless we are
> communicating to or from another system  (Which in some cases is the
> Smalltalk ByteString class)
>
>
>
>
> i.e. in other words
>
> *Some* individual codepoints *may* be a representation of a specific
> *abstract character*, but only in special cases.
>
> The general case in Unicode is that Unicode defines (a)
> representation(s) of a Unicode *abstract character*.
>
> The Unicode standard representation of an abstract character is a
> composed sequence of codepoints, where in some cases that sequence is
> as short as 1 codepoint.
>
> In other cases, Unicode has a compatibility alias of a single
> codepoint which is *also* a representation of an abstract character
>
> There are some cases where an abstract character can be represented by
> more than one single-codepoint compatibility codepoint.
>
> Cheers,
>  Euan
>
>
> I agree you have a good grasp of the distinction between an abstract
> character (characters and character sequences which should be treated as
> equivalent w.r.t. equality / sorting / display, etc.) and a character (which
> each have a code point assigned).
> That is beside the point both Sven and I tried to get through, which is the
> difference between a code point and the encoded form(s) of said code point.
> When you write:
> "and therefore encodable in UTF-8 as compatibility codepoint e9 hex
> and as the composed character #(0065 00b4) (all in hex) and as the
> same composed character as both
> #(feff 0065 00b4) and #(ffef 0065 00b4) when endianness markers are
> included"
>
> It's quite clear you are confusing the two. 0xFEFF is the codepoint of the
> character used as a BOM.
> When you state that it can be written ffef (I assume you meant FFFE), you
> are again confusing the code point and its encoded value (an encoded value
> which only occurs in UTF16/32, no less).
>
> When this distinction is clear, it might be easier to see that value in that
> Strings are kept as Unicode code points arrays, and converted to encoded
> forms when entering/exiting the system.
>
> Cheers,
> Henry
>
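
Henrik's closing point above — keep strings as code point arrays internally, and convert to encoded forms only when entering or exiting the system — can be sketched in Python for illustration (Python's str, like Pharo's String, behaves as a sequence of code points):

```python
# Inside the image/runtime: a string is a sequence of abstract code points.
text = "Les élèves"
codepoints = [ord(c) for c in text]   # numbers, not bytes
assert codepoints[4] == 0xE9          # é is code point U+00E9

# Exiting the system: choose an encoding and produce bytes.
wire = text.encode("utf-8")
assert wire[4:6] == b"\xc3\xa9"       # the single code point becomes two bytes

# Entering the system: decode the bytes back into code points.
assert wire.decode("utf-8") == text
```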



Re: [Pharo-dev] Unicode Support

2015-12-07 Thread Henrik Johansen

> On 07 Dec 2015, at 1:05 , EuanM  wrote:
> 
> Hi Henry,
> 
> To be honest, at some point I'm going to long for the much
> more succinct semantics of healthcare systems and sports scoring and
> administration systems again.  :-)
> 
> codepoints are any of *either*
>  - the representation of a component of an abstract character, *or*
> eg. "A" #(0041) as a component of
>  - the sole representation of the whole of an abstract character *or* of
> -  a representation of an abstract character provided for backwards
> compatibility which is more properly represented by a series of
> codepoints representing a composed character
> 
> e.g.
> 
> The "A" #(0041) as a codepoint can be:
> the sole representation of the whole of an abstract character "A" #(0041)
> 
> The representation of a component of the composed (i.e. preferred)
> version of the abstract character Å #(0041 030a)
> 
> Å (#00C5) represents one valid compatibility form of the abstract
> character Å which is most properly represented by #(0041 030a).
> 
> Å (#212b) also represents one valid compatibility form of the abstract
> character Å which is most properly represented by #(0041 030a).
> 
> With any luck, this satisfies both our semantic understandings of the
> concept of "codepoint"
> 
> Would you agree with that?
> 
> In Unicode, codepoints are *NOT* an abstract numerical representation
> of a text character.
> 
> At least not as we generally understand the term "text character" from
> our experience of non-Unicode character mappings.

I agree, they are numerical representations of what Unicode refers to as 
characters.

> 
> codepoints represent "*encoded characters*"

No. a codepoint is the numerical value assigned to a character. An "encoded 
character" is the way a codepoint is represented in bytes using a given 
encoding.

> and "a *text element* ...
> is represented by a sequence of one or more codepoints".  (And the
> term "text element" is deliberately left undefined in the Unicode
> standard)
> 
> Individual codepoints are very often *not* the encoded form of an
> abstract character that we are interested in.  Unless we are
> communicating to or from another system  (Which in some cases is the
> Smalltalk ByteString class)


> 
> i.e. in other words
> 
> *Some* individual codepoints *may* be a representation of a specific
> *abstract character*, but only in special cases.
> 
> The general case in Unicode is that Unicode defines (a)
> representation(s) of a Unicode *abstract character*.
> 
> The Unicode standard representation of an abstract character is a
> composed sequence of codepoints, where in some cases that sequence is
> as short as 1 codepoint.
> 
> In other cases, Unicode has a compatibility alias of a single
> codepoint which is *also* a representation of an abstract character
> 
> There are some cases where an abstract character can be represented by
> more than one single-codepoint compatibility codepoint.
> 
> Cheers,
>  Euan

I agree you have a good grasp of the distinction between an abstract character 
(characters and character sequences which should be treated as equivalent w.r.t.
equality / sorting / display, etc.) and a character (which each have a code 
point assigned).
That is beside the point both Sven and I tried to get through, which is the
difference between a code point and the encoded form(s) of said code point.
When you write:
"and therefore encodable in UTF-8 as compatibility codepoint e9 hex
and as the composed character #(0065 00b4) (all in hex) and as the
same composed character as both
#(feff 0065 00b4) and #(ffef 0065 00b4) when endianness markers are included"

It's quite clear you are confusing the two. 0xFEFF is the codepoint of the character
used as a BOM.
When you state that it can be written ffef (I assume you meant FFFE), you are 
again confusing the code point and its encoded value (an encoded value which 
only occurs in UTF16/32, no less).

When this distinction is clear, it might be easier to see that value in that 
Strings are kept as Unicode code points arrays, and converted to encoded forms 
when entering/exiting the system.

Cheers,
Henry
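
Henrik's BOM correction can be made concrete in Python (illustrative only): U+FEFF is a single code point, and the byte orders FE FF vs FF FE are artifacts of the UTF-16 encoded form, not distinct code points.

```python
bom = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, the code point used as a byte-order mark

assert ord(bom) == 0xFEFF                       # one code point, one number
assert bom.encode("utf-16-be") == b"\xfe\xff"   # big-endian encoded form
assert bom.encode("utf-16-le") == b"\xff\xfe"   # little-endian encoded form
assert bom.encode("utf-8") == b"\xef\xbb\xbf"   # UTF-8 has a single fixed form

# FF FE is not a second BOM code point; decoded, both byte orders
# yield the same single code point U+FEFF:
assert b"\xff\xfe".decode("utf-16-le") == b"\xfe\xff".decode("utf-16-be") == bom
```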





Re: [Pharo-dev] Unicode Support

2015-12-07 Thread Henrik Johansen

> On 07 Dec 2015, at 11:51 , EuanM  wrote:
> 
> And indeed, in principle.
> 
> On 7 December 2015 at 10:51, EuanM  wrote:
>> Verifying assumptions is the key reason why you should send documents like
>> this out for review.
>> 
>> Sven -
>> 
>> I'm confident I understand the use of UTF-8 in principal.

I can only second Sven's sentiment that you need to better differentiate code 
points (an abstract numerical representation of a character, where a set of 
such mappings
define a charset, such as Unicode), and character encoding forms. (which are 
how code points are represented in bytes by a defined process such as UTF-8, 
UTF-16 etc).

I know you'll probably think I'm arguing semantics again, but these are 
*important* semantics ;)

Cheers,
Henry
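
Henrik's distinction — code points belong to the charset, encoding forms describe how those code points become bytes — can be shown with one code point and its three encoded forms. A Python sketch for illustration:

```python
cp = 0x1F600   # one Unicode code point (GRINNING FACE), outside the BMP
ch = chr(cp)

# One code point, three encoding forms, three different byte sequences:
assert ch.encode("utf-8")     == b"\xf0\x9f\x98\x80"   # four bytes
assert ch.encode("utf-16-be") == b"\xd8\x3d\xde\x00"   # a surrogate pair
assert ch.encode("utf-32-be") == b"\x00\x01\xf6\x00"   # the raw code point value
```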




Re: [Pharo-dev] Unicode Support

2015-12-07 Thread Sven Van Caekenberghe

> On 07 Dec 2015, at 11:51, EuanM  wrote:
> 
> Verifying assumptions is the key reason why you should send documents like
> this out for review.

Fair enough, discussion can only help.

> Sven -
> 
> Cuis is encoded with ISO 8859-15  (aka ISO Latin 9)
> 
> Sven, this is *NOT* as you state, ISO 99591, (and not as I stated, 8859-1).

Ah, that was a typo, I meant, of course (and sorry for the confusion):

'Les élèves Français' encodeWith: #iso88591. 

"#[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]"

'Les élèves Français' utf8Encoded  

"#[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 97 
105 115]"

Or shorter, $é is encoded in ISO-8859-1 as #[233], but as #[195 169] in UTF-8.

That Cuis chose ISO-8859-15 makes no real difference.

The thing is: you started talking about UTF-8 encoded strings in the image, and 
then the difference between code point and encoding is really important. 

Only in ASCII is the encoding identical, not for anything else.
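
Sven's byte sequences can be reproduced independently; a Python check for illustration (the Pharo expressions above are the thread's own):

```python
s = "Les élèves Français"

assert list(s.encode("iso-8859-1")) == [
    76, 101, 115, 32, 233, 108, 232, 118, 101, 115,
    32, 70, 114, 97, 110, 231, 97, 105, 115]
assert list(s.encode("utf-8")) == [
    76, 101, 115, 32, 195, 169, 108, 195, 168, 118, 101, 115,
    32, 70, 114, 97, 110, 195, 167, 97, 105, 115]

# The short version: the code point of é equals its Latin-1 byte value,
# but not its UTF-8 bytes.
assert ord("é") == 0xE9
assert "é".encode("iso-8859-1") == b"\xe9"
assert "é".encode("utf-8") == b"\xc3\xa9"
```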

> We caught the right specification bug for the wrong reason.
> 
> Juan: "Cuis: Chose not to use Squeak approach. Chose to make the base
> image include and use only 1-byte strings. Chose to use ISO-8859-15"
> 
> I have double-checked - each character encoded in ISO Latin 9 (ISO
> 8859-15) is exactly the character represented by the corresponding
> 1-byte codepoint in Unicode 0000 to 00FF,
> 
> with the following exceptions:
> 
> codepoint 20ac - Euro symbol
> character code a4 (replaces codepoint 00a4 generic currency symbol)
> 
> codepoint 0160 Latin Upper Case S with Caron
> character code a6  (replaces codepoint 00A6 was | Unix pipe character)
> 
> codepoint 0161 Latin Lower Case s with Caron
> character code a8 (replaces codepoint 00A8 was diaeresis)
> 
> codepoint 017d Latin Upper Case Z with Caron
> character code b4 (replaces codepoint 00b4 was Acute accent)
> 
> codepoint 017e Latin Lower Case Z with Caron
> character code b8 (replaces codepoint 00b8 was cedilla)
> 
> codepoint 0152 Upper Case OE ligature = Ethel
> character code bc (replaces codepoint 00bc was 1/4 symbol)
> 
> codepoint 0153 Lower Case oe ligature = ethel
> character code bd (replaces codepoint 00bd was 1/2 symbol)
> 
> codepoint 0178 Upper Case Y diaeresis
> character code be (replaces codepoint 00be was 3/4 symbol)
> 
> Juan - I don't suppose we could persuade you to change to ISO  Latin-1
> from ISO Latin-9 ?
> 
> It means we could run the same 1 byte string encoding across  Cuis,
> Squeak, Pharo, and, as far as I can make out so far, Dolphin Smalltalk
> and Gnu Smalltalk.
> 
> The downside would be that users of French Y diaeresis would lose that
> character, along with users of oe, OE, and s, S, z, Z with
> caron.  Along with the Euro.
> 
> https://en.wikipedia.org/wiki/ISO/IEC_8859-15.
> 
> I'm confident I understand the use of UTF-8 in principal.
> 
> 
> On 7 December 2015 at 08:27, Sven Van Caekenberghe  wrote:
>> I am sorry but one of your basic assumptions is completely wrong:
>> 
>> 'Les élèves Français' encodeWith: #iso99591.
>> 
>> => #[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]
>> 
>> 'Les élèves Français' utf8Encoded.
>> 
>> => #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 
>> 97 105 115]
>> 
>> ISO-9959-1 (~Latin1) is NOT AT ALL identical to UTF-8 in its upper, non-ASCII
>> part !!
>> 
>> Or shorter, $é is encoded in ISO-9959-1 as #[233], but as #[195 169] in 
>> UTF-8.
>> 
>> So more than half the points you make, or the facts that you state, are thus 
>> plain wrong.
>> 
>> The only thing that is correct is that the code points are equal, but that 
>> is not the same as the encoding !
>> 
>> From this I am inclined to conclude that you do not fundamentally understand 
>> how UTF-8 works, which does not strike me as good basis to design something 
>> called a UTF8String.
>> 
>> Sorry.
>> 
>> PS: Note also that Cuis' choice to use ISO-9959-1 only is pretty limiting in 
>> a Unicode world.
>> 
>>> On 07 Dec 2015, at 04:21, EuanM  wrote:
>>> 
>>> This is a long email.  A *lot* of it is encapsulated in the Venn diagram at both:
>>> http://smalltalk.uk.to/unicode-utf8.html
>>> and my Smalltalk in Small Steps blog at:
>>> http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html
>>> 
>>> My current thinking, and understanding.
>>> ==
>>> 
>>> 0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
>>>  b) UTF-8 can encode all of those characters in 1 byte, but can
>>> prefer some of them to be encoded as sequences of multiple bytes.  And
>>> can encode additional characters as sequences of multiple bytes.
>>> 
>>> 1) Smalltalk has long had multiple String classes.
>>> 
>>> 2) Any UTF16 Unicode codepoint which has a codepoint of 00nn hex
>>>  is encoded as a UTF-8 codepoint of nn hex.
>>> 
>>> 3) All valid ISO-8859-1 characters have a character code between 20
>>> hex and 7E hex, or between A0 hex and FF

Re: [Pharo-dev] Unicode Support

2015-12-07 Thread EuanM
And indeed, in principle.

On 7 December 2015 at 10:51, EuanM  wrote:
> Verifying assumptions is the key reason why you should send documents like
> this out for review.
>
> Sven -
>
> Cuis is encoded with ISO 8859-15  (aka ISO Latin 9)
>
> Sven, this is *NOT* as you state, ISO 99591, (and not as I stated, 8859-1).
>
> We caught the right specification bug for the wrong reason.
>
> Juan: "Cuis: Chose not to use Squeak approach. Chose to make the base
> image include and use only 1-byte strings. Chose to use ISO-8859-15"
>
> I have double-checked - each character encoded in ISO Latin 9 (ISO
> 8859-15) is exactly the character represented by the corresponding
> 1-byte codepoint in Unicode 0000 to 00FF,
>
> with the following exceptions:
>
> codepoint 20ac - Euro symbol
> character code a4 (replaces codepoint 00a4 generic currency symbol)
>
> codepoint 0160 Latin Upper Case S with Caron
> character code a6  (replaces codepoint 00A6 was | Unix pipe character)
>
> codepoint 0161 Latin Lower Case s with Caron
> character code a8 (replaces codepoint 00A8 was diaeresis)
>
> codepoint 017d Latin Upper Case Z with Caron
> character code b4 (replaces codepoint 00b4 was Acute accent)
>
> codepoint 017e Latin Lower Case Z with Caron
> character code b8 (replaces codepoint 00b8 was cedilla)
>
> codepoint 0152 Upper Case OE ligature = Ethel
> character code bc (replaces codepoint 00bc was 1/4 symbol)
>
> codepoint 0153 Lower Case oe ligature = ethel
> character code bd (replaces codepoint 00bd was 1/2 symbol)
>
> codepoint 0178 Upper Case Y diaeresis
> character code be (replaces codepoint 00be was 3/4 symbol)
>
> Juan - I don't suppose we could persuade you to change to ISO  Latin-1
> from ISO Latin-9 ?
>
> It means we could run the same 1 byte string encoding across  Cuis,
> Squeak, Pharo, and, as far as I can make out so far, Dolphin Smalltalk
> and Gnu Smalltalk.
>
> The downside would be that users of French Y diaeresis would lose that
> character, along with users of oe, OE, and s, S, z, Z with
> caron.  Along with the Euro.
>
> https://en.wikipedia.org/wiki/ISO/IEC_8859-15.
>
> I'm confident I understand the use of UTF-8 in principal.
>
>
> On 7 December 2015 at 08:27, Sven Van Caekenberghe  wrote:
>> I am sorry but one of your basic assumptions is completely wrong:
>>
>> 'Les élèves Français' encodeWith: #iso99591.
>>
>> => #[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]
>>
>> 'Les élèves Français' utf8Encoded.
>>
>> => #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 
>> 97 105 115]
>>
>> ISO-9959-1 (~Latin1) is NOT AT ALL identical to UTF-8 in its upper, non-ASCII
>> part !!
>>
>> Or shorter, $é is encoded in ISO-9959-1 as #[233], but as #[195 169] in 
>> UTF-8.
>>
>> So more than half the points you make, or the facts that you state, are thus 
>> plain wrong.
>>
>> The only thing that is correct is that the code points are equal, but that 
>> is not the same as the encoding !
>>
>> From this I am inclined to conclude that you do not fundamentally understand 
>> how UTF-8 works, which does not strike me as good basis to design something 
>> called a UTF8String.
>>
>> Sorry.
>>
>> PS: Note also that Cuis' choice to use ISO-9959-1 only is pretty limiting in 
>> a Unicode world.
>>
>>> On 07 Dec 2015, at 04:21, EuanM  wrote:
>>>
>>> This is a long email.  A *lot* of it is encapsulated in the Venn diagram at both:
>>> http://smalltalk.uk.to/unicode-utf8.html
>>> and my Smalltalk in Small Steps blog at:
>>> http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html
>>>
>>> My current thinking, and understanding.
>>> ==
>>>
>>> 0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
>>>   b) UTF-8 can encode all of those characters in 1 byte, but can
>>> prefer some of them to be encoded as sequences of multiple bytes.  And
>>> can encode additional characters as sequences of multiple bytes.
>>>
>>> 1) Smalltalk has long had multiple String classes.
>>>
>>> 2) Any UTF16 Unicode codepoint which has a codepoint of 00nn hex
>>>   is encoded as a UTF-8 codepoint of nn hex.
>>>
>>> 3) All valid ISO-8859-1 characters have a character code between 20
>>> hex and 7E hex, or between A0 hex and FF hex.
>>> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>>>
>>> 4) All valid ASCII characters have a character code between 00 hex and 7E 
>>> hex.
>>> https://en.wikipedia.org/wiki/ASCII
>>>
>>>
>>> 5) a) All character codes which are defined within ISO-8859-1 and also
>>> defined within ASCII.  (i.e. character codes 20 hex to 7E hex) are
>>> defined identically in both.
>>>
>>> b) All printable ASCII characters are defined identically in both
>>> ASCII and ISO-8859-1
>>>
>>> 6) All character codes defined in ASCII  (00 hex to 7E hex) are
>>> defined identically in Unicode UTF-8.
>>>
>>> 7) All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex
>>> - FF hex ) are defined identically in UTF-8.
>>>
>>> 8) =>

Re: [Pharo-dev] Unicode Support

2015-12-07 Thread EuanM
Verifying assumptions is the key reason why you should send documents like
this out for review.

Sven -

Cuis is encoded with ISO 8859-15  (aka ISO Latin 9)

Sven, this is *NOT* as you state, ISO 99591, (and not as I stated, 8859-1).

We caught the right specification bug for the wrong reason.

Juan: "Cuis: Chose not to use Squeak approach. Chose to make the base
image include and use only 1-byte strings. Chose to use ISO-8859-15"

I have double-checked - each character encoded in ISO Latin 9 (ISO
8859-15) is exactly the character represented by the corresponding
1-byte codepoint in Unicode 0000 to 00FF,

with the following exceptions:

codepoint 20ac - Euro symbol
character code a4 (replaces codepoint 00a4 generic currency symbol)

codepoint 0160 Latin Upper Case S with Caron
character code a6  (replaces codepoint 00A6 was | Unix pipe character)

codepoint 0161 Latin Lower Case s with Caron
character code a8 (replaces codepoint 00A8 was diaeresis)

codepoint 017d Latin Upper Case Z with Caron
character code b4 (replaces codepoint 00b4 was Acute accent)

codepoint 017e Latin Lower Case Z with Caron
character code b8 (replaces codepoint 00b8 was cedilla)

codepoint 0152 Upper Case OE ligature = Ethel
character code bc (replaces codepoint 00bc was 1/4 symbol)

codepoint 0153 Lower Case oe ligature = ethel
character code bd (replaces codepoint 00bd was 1/2 symbol)

codepoint 0178 Upper Case Y diaeresis
character code be (replaces codepoint 00be was 3/4 symbol)

Juan - I don't suppose we could persuade you to change to ISO  Latin-1
from ISO Latin-9 ?

It means we could run the same 1 byte string encoding across  Cuis,
Squeak, Pharo, and, as far as I can make out so far, Dolphin Smalltalk
and Gnu Smalltalk.

The downside would be that users of French Y diaeresis would lose that
character, along with users of oe, OE, and s, S, z, Z with
caron.  Along with the Euro.

https://en.wikipedia.org/wiki/ISO/IEC_8859-15.
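
The eight differences listed above between ISO-8859-1 and ISO-8859-15 can be checked mechanically. A Python sketch, for illustration, using the standard library codecs:

```python
# Find every byte value where ISO-8859-15 departs from ISO-8859-1
# (all differences lie in the A0-FF range):
diffs = {}
for byte in range(0xA0, 0x100):
    b = bytes([byte])
    latin1 = b.decode("iso-8859-1")
    latin9 = b.decode("iso-8859-15")
    if latin1 != latin9:
        diffs[byte] = (latin1, latin9)

# Exactly the eight character codes named in the list above:
assert sorted(diffs) == [0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE]
assert diffs[0xA4] == ("\u00a4", "\u20ac")   # currency sign -> Euro sign
assert diffs[0xBE] == ("\u00be", "\u0178")   # 3/4 symbol -> Y with diaeresis
```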

I'm confident I understand the use of UTF-8 in principal.


On 7 December 2015 at 08:27, Sven Van Caekenberghe  wrote:
> I am sorry but one of your basic assumptions is completely wrong:
>
> 'Les élèves Français' encodeWith: #iso99591.
>
> => #[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]
>
> 'Les élèves Français' utf8Encoded.
>
> => #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 
> 97 105 115]
>
> ISO-9959-1 (~Latin1) is NOT AT ALL identical to UTF-8 in its upper, non-ASCII
> part !!
>
> Or shorter, $é is encoded in ISO-9959-1 as #[233], but as #[195 169] in UTF-8.
>
> So more than half the points you make, or the facts that you state, are thus 
> plain wrong.
>
> The only thing that is correct is that the code points are equal, but that is 
> not the same as the encoding !
>
> From this I am inclined to conclude that you do not fundamentally understand 
> how UTF-8 works, which does not strike me as good basis to design something 
> called a UTF8String.
>
> Sorry.
>
> PS: Note also that Cuis' choice to use ISO-9959-1 only is pretty limiting in 
> a Unicode world.
>
>> On 07 Dec 2015, at 04:21, EuanM  wrote:
>>
>> This is a long email.  A *lot* of it is encapsulated in the Venn diagram at both:
>> http://smalltalk.uk.to/unicode-utf8.html
>> and my Smalltalk in Small Steps blog at:
>> http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html
>>
>> My current thinking, and understanding.
>> ==
>>
>> 0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
>>   b) UTF-8 can encode all of those characters in 1 byte, but can
>> prefer some of them to be encoded as sequences of multiple bytes.  And
>> can encode additional characters as sequences of multiple bytes.
>>
>> 1) Smalltalk has long had multiple String classes.
>>
>> 2) Any UTF16 Unicode codepoint which has a codepoint of 00nn hex
>>   is encoded as a UTF-8 codepoint of nn hex.
>>
>> 3) All valid ISO-8859-1 characters have a character code between 20
>> hex and 7E hex, or between A0 hex and FF hex.
>> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>>
>> 4) All valid ASCII characters have a character code between 00 hex and 7E 
>> hex.
>> https://en.wikipedia.org/wiki/ASCII
>>
>>
>> 5) a) All character codes which are defined within ISO-8859-1 and also
>> defined within ASCII.  (i.e. character codes 20 hex to 7E hex) are
>> defined identically in both.
>>
>> b) All printable ASCII characters are defined identically in both
>> ASCII and ISO-8859-1
>>
>> 6) All character codes defined in ASCII  (00 hex to 7E hex) are
>> defined identically in Unicode UTF-8.
>>
>> 7) All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex
>> - FF hex ) are defined identically in UTF-8.
>>
>> 8) => some Unicode codepoints map to both ASCII and ISO-8859-1.
>>all ASCII maps 1:1 to Unicode UTF-8
>>all ISO-8859-1 maps 1:1 to Unicode UTF-8
>>
>> 9) All ByteStrings elements which are either a valid ISO-8859-1
>> character  or a val

Re: [Pharo-dev] Unicode Support

2015-12-07 Thread Sven Van Caekenberghe
I am sorry but one of your basic assumptions is completely wrong:

'Les élèves Français' encodeWith: #iso99591.

=> #[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]

'Les élèves Français' utf8Encoded.  

=> #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 97 
105 115]

ISO-9959-1 (~Latin1) is NOT AT ALL identical to UTF-8 in its upper, non-ASCII
part !!

Or shorter, $é is encoded in ISO-9959-1 as #[233], but as #[195 169] in UTF-8.

So more than half the points you make, or the facts that you state, are thus 
plain wrong.

The only thing that is correct is that the code points are equal, but that is 
not the same as the encoding !
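
Sven's point — equal code points are not equal encodings — in a short Python illustration, including what goes wrong when the two are conflated:

```python
s = "é"
assert ord(s) == 0xE9                     # the code point equals the Latin-1 byte value...
assert s.encode("iso-8859-1") == b"\xe9"  # ...so Latin-1 needs one byte
assert s.encode("utf-8") == b"\xc3\xa9"   # ...but UTF-8 needs two

# Conflating the two encodings corrupts the text (classic mojibake):
assert s.encode("utf-8").decode("iso-8859-1") == "Ã©"
```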

From this I am inclined to conclude that you do not fundamentally understand 
how UTF-8 works, which does not strike me as good basis to design something 
called a UTF8String.

Sorry.

PS: Note also that Cuis' choice to use ISO-9959-1 only is pretty limiting in a 
Unicode world.

> On 07 Dec 2015, at 04:21, EuanM  wrote:
> 
> This is a long email.  A *lot* of it is encapsulated in the Venn diagram at both:
> http://smalltalk.uk.to/unicode-utf8.html
> and my Smalltalk in Small Steps blog at:
> http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html
> 
> My current thinking, and understanding.
> ==
> 
> 0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
>   b) UTF-8 can encode all of those characters in 1 byte, but can
> prefer some of them to be encoded as sequences of multiple bytes.  And
> can encode additional characters as sequences of multiple bytes.
> 
> 1) Smalltalk has long had multiple String classes.
> 
> 2) Any UTF16 Unicode codepoint which has a codepoint of 00nn hex
>   is encoded as a UTF-8 codepoint of nn hex.
> 
> 3) All valid ISO-8859-1 characters have a character code between 20
> hex and 7E hex, or between A0 hex and FF hex.
> https://en.wikipedia.org/wiki/ISO/IEC_8859-1
> 
> 4) All valid ASCII characters have a character code between 00 hex and 7E hex.
> https://en.wikipedia.org/wiki/ASCII
> 
> 
> 5) a) All character codes which are defined within ISO-8859-1 and also
> defined within ASCII.  (i.e. character codes 20 hex to 7E hex) are
> defined identically in both.
> 
> b) All printable ASCII characters are defined identically in both
> ASCII and ISO-8859-1
> 
> 6) All character codes defined in ASCII  (00 hex to 7E hex) are
> defined identically in Unicode UTF-8.
> 
> 7) All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex
> - FF hex ) are defined identically in UTF-8.
> 
> 8) => some Unicode codepoints map to both ASCII and ISO-8859-1.
>all ASCII maps 1:1 to Unicode UTF-8
>all ISO-8859-1 maps 1:1 to Unicode UTF-8
> 
> 9) All ByteStrings elements which are either a valid ISO-8859-1
> character  or a valid ASCII character are *also* a valid UTF-8
> character.
> 
> 10) ISO-8859-1 characters representing a character with a diacritic,
> or a two-character ligature, have no ASCII equivalent.  In Unicode
> UTF-8, those character codes which are representing compound glyphs,
> are called "compatibility codepoints".
> 
> 11) The preferred Unicode representation of the characters which have
> compatibility codepoints is as a  a short set of codepoints
> representing the characters which are combined together to form the
> glyph of the convenience codepoint, as a sequence of bytes
> representing the component characters.
> 
> 
> 12) Some concrete examples:
> 
> A - aka Upper Case A
> In ASCII, in ISO 8859-1
> ASCII A - 41 hex
> ISO-8859-1 A - 41 hex
> UTF-8 A - 41 hex
> 
> BEL (a bell sound, often invoked by a Ctrl-g keyboard chord)
> In ASCII, not in ISO 8859-1
> ASCII : BEL  - 07 hex
> ISO-8859-1 : 07 hex is not a valid character code
> UTF-8 : BEL - 07 hex
> 
> £ (GBP currency symbol)
> In ISO-8859-1, not in ASCII
> ASCII : A3 hex is not a valid ASCII code
> UTF-8: £ - A3 hex
> ISO-8859-1: £ - A3 hex
> 
> Upper Case C cedilla
> In ISO-8859-1, not in ASCII, in UTF-8 as a compatibility codepoint
> *and* a composed set of codepoints
> ASCII : C7 hex is not a valid ASCII character code
> ISO-8859-1 : Upper Case C cedilla - C7 hex
> UTF-8 : Upper Case C cedilla (compatibility codepoint) - C7 hex
> Unicode preferred Upper Case C cedilla  (composed set of codepoints)
>  Upper case C 0043 hex (Upper case C)
>  followed by
>  cedilla 00B8 hex (cedilla)
> 
> 13) For any valid ASCII string *and* for any valid ISO-8859-1 string,
> aByteString is completely adequate for editing and display.
> 
> 14) When sorting any valid ASCII string *or* any valid ISO-8859-1
> string, upper and lower case versions of the same character will be
> treated differently.
> 
> 15) When sorting any valid ISO-8859-1 string containing
> letter+diacritic combination glyphs or ligature combination glyphs,
> the glyphs in combination will be treated differently to a "plain" glyph
> of the character
> i.e. "C" and "C cedilla" will be treated very differentl

Re: [Pharo-dev] Unicode Support

2015-12-06 Thread EuanM
Steph - I'll dig out the French phone-book ordering from wherever it was
I read about it!

I thought I had it to hand, but I haven't found it tonight. It can't
be far away.
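
For the French phone-book ("backward accent") ordering: base letters are compared first, and accents are then compared from the *end* of the word, giving cote < côte < coté < côté. A minimal Python sketch, illustrative only — real collation would use the Unicode Collation Algorithm with a French tailoring:

```python
import unicodedata

def french_key(word):
    nfd = unicodedata.normalize("NFD", word)
    # Primary weight: the letters with all accents stripped.
    base = "".join(c for c in nfd if not unicodedata.combining(c))
    # Secondary weight: the accent on each letter, compared from the
    # end of the word, as in French dictionaries and phone books.
    accents = ["".join(unicodedata.normalize("NFD", ch)[1:]) for ch in word]
    return (base, list(reversed(accents)))

words = ["côté", "cote", "coté", "côte"]
assert sorted(words, key=french_key) == ["cote", "côte", "coté", "côté"]
```

A plain `sorted(words)` by code point would instead put "cote" and "coté" before both circumflexed forms, which is not the phone-book order.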

On 5 December 2015 at 13:08, stepharo  wrote:
> Hi EuanM
>
> On 4/12/15 12:42, EuanM wrote:
>>
>> I'm currently groping my way to seeing how feature-complete our
>> Unicode support is.  I am doing this to establish what still needs to
>> be done to provide full Unicode support.
>
>
> This is great. Thanks for pushing this. I wrote and collected some roadmaps
> (analyses on different topics)
> on the Pharo GitHub project; feel free to add this one there.
>>
>>
>> This seems to me to be an area where it would be best to write it
>> once, and then have the same codebase incorporated into the Smalltalks
>> that most share a common ancestry.
>>
>> I am keen to get: equality-testing for strings; sortability for
>> strings which have ligatures and diacritic characters; and correct
>> round-tripping of data.
>
> Go!
> My suggestion is
> start small
> make steady progress
> write tests
> commit often :)
>
> Stef
>
> What is the French phone-book ordering? This is the first time I have heard
> of it.
>
>>
>> Call to action:
>> ==
>>
>> If you have comments on these proposals - such as "but we already have
>> that facility" or "the reason we do not have these facilities is
>> because they are dog-slow" - please let me know them.
>>
>> If you would like to help out, please let me know.
>>
>> If you have Unicode experience and expertise, and would like to be, or
>> would be willing to be, in the  'council of experts' for this project,
>> please let me know.
>>
>> If you have comments or ideas on anything mentioned in this email
>>
>> In the first instance, the initiative's website will be:
>> http://smalltalk.uk.to/unicode.html
>>
>> I have created a SqueakSource.com project called UnicodeSupport
>>
>> I want to avoid re-inventing any facilities which already exist.
>> Except where they prevent us reaching the goals of:
>>- sortable UTF8 strings
>>- sortable UTF16 strings
>>- equivalence testing of 2 UTF8 strings
>>- equivalence testing of 2 UTF16 strings
>>- round-tripping UTF8 strings through Smalltalk
>>- roundtripping UTF16 strings through Smalltalk.
>> As I understand it, we have limited Unicode support atm.
>>
>> Current state of play
>> ===
>> ByteString gets converted to WideString when need is automagically
>> detected.
>>
>> Is there anything else that currently exists?
>>
>> Definition of Terms
>> ==
>> A quick definition of terms before I go any further:
>>
>> Standard terms from the Unicode standard
>> ===
>> a compatibility character : an additional encoding of a *normal*
>> character, for compatibility and round-trip conversion purposes.  For
>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>
>> Made-up terms
>> 
>> a convenience codepoint :  a single codepoint which represents an item
>> that is also encoded as a string of codepoints.
>>
>> (I tend to use the terms compatibility character and compatibility
>> codepoint interchangeably.  The standard only refers to them as
>> compatibility characters.  However, the standard is determined to
>> emphasise that characters are abstract and that codepoints are
>> concrete.  So I think it is often more useful and productive to think
>> of compatibility or convenience codepoints).
>>
>> a composed character :  a character made up of several codepoints
>>
>> Unicode encoding explained
>> =
>> A convenience codepoint can therefore be thought of as a code point
>> used for a character which also has a composed form.
>>
>> The way Unicode works is that sometimes you can encode a character in
>> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>> sometimes not.
>>
>> You can therefore have a long stream of ASCII which is single-byte
>> Unicode.  If there is an occasional Cyrillic or Greek character in the
>> stream, it would be represented either by a compatibility character or
>> by a multi-byte combination.
>>
>> Using compatibility characters can prevent proper sorting and
>> equivalence testing.
>>
>> Using "pure" Unicode, ie. "normal encodings", can cause compatibility
>> and round-tripping problems.  Although avoiding them can *also* cause
>> compatibility issues and round-tripping problems.
>>
>> Currently my thinking is:
>>
>> a Utf8String class
>> an Ordered collection, with 1 byte characters as the modal element,
>> but short arrays of wider strings where necessary
>> a Utf16String class
>> an Ordered collection, with 2 byte characters as the modal element,
>> but short arrays of wider strings
>> beginning with a 2-byte endianness indicator.
>>
>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>> compatible.
>>
>> So my thinking is that Utf8String will contain convenience codepoints,
>> for rou

Re: [Pharo-dev] Unicode Support

2015-12-06 Thread EuanM
Todd, As long as others are using it, it's useful to be able to send
UTF16, and to successfully import it.

I like systems that play well with others. :-)

On 5 December 2015 at 16:35, Todd Blanchard  wrote:
> I would suggest that the only worthwhile encoding is UTF8 - the rest are
> distractions except for being able to read and convert from other encodings
> to UTF8. UTF16 is a complete waste of time.
>
> Read http://utf8everywhere.org/
>
> I have extensive Unicode chops from around 1999 to 2004 and my experience
> leads me to strongly agree with the views on that site.
>
>
> Sent from the road
>
> On Dec 5, 2015, at 05:08, stepharo  wrote:
>
> Hi EuanM
>
> On 4/12/15 12:42, EuanM wrote:
>
> I'm currently groping my way to seeing how feature-complete our
>
> Unicode support is.  I am doing this to establish what still needs to
>
> be done to provide full Unicode support.
>
>
> this is great. Thanks for pushing this. I wrote and collected some roadmap
> (analyses on different topics)
> on the pharo github project feel free to add this one there.
>
>
> This seems to me to be an area where it would be best to write it
>
> once, and then have the same codebase incorporated into the Smalltalks
>
> that most share a common ancestry.
>
>
> I am keen to get: equality-testing for strings; sortability for
>
> strings which have ligatures and diacritic characters; and correct
>
> round-tripping of data.
>
> Go!
> My suggestion is
>start small
>make steady progress
>write tests
>commit often :)
>
> Stef
>
> What is the french phoneBook ordering because this is the first time I hear
> about it.
>
>
> Call to action:
>
> ==
>
>
> If you have comments on these proposals - such as "but we already have
>
> that facility" or "the reason we do not have these facilities is
>
> because they are dog-slow" - please let me know them.
>
>
> If you would like to help out, please let me know.
>
>
> If you have Unicode experience and expertise, and would like to be, or
>
> would be willing to be, in the  'council of experts' for this project,
>
> please let me know.
>
>
> If you have comments or ideas on anything mentioned in this email
>
>
> In the first instance, the initiative's website will be:
>
> http://smalltalk.uk.to/unicode.html
>
>
> I have created a SqueakSource.com project called UnicodeSupport
>
>
> I want to avoid re-inventing any facilities which already exist.
>
> Except where they prevent us reaching the goals of:
>
>   - sortable UTF8 strings
>
>   - sortable UTF16 strings
>
>   - equivalence testing of 2 UTF8 strings
>
>   - equivalence testing of 2 UTF16 strings
>
>   - round-tripping UTF8 strings through Smalltalk
>
>   - roundtripping UTF16 strings through Smalltalk.
>
> As I understand it, we have limited Unicode support atm.
>
>
> Current state of play
>
> ===
>
> ByteString gets converted to WideString when need is automagically detected.
>
>
> Is there anything else that currently exists?
>
>
> Definition of Terms
>
> ==
>
> A quick definition of terms before I go any further:
>
>
> Standard terms from the Unicode standard
>
> ===
>
> a compatibility character : an additional encoding of a *normal*
>
> character, for compatibility and round-trip conversion purposes.  For
>
> instance, a 1-byte encoding of a Latin character with a diacritic.
>
>
> Made-up terms
>
> 
>
> a convenience codepoint :  a single codepoint which represents an item
>
> that is also encoded as a string of codepoints.
>
>
> (I tend to use the terms compatibility character and compatibility
>
> codepoint interchangeably.  The standard only refers to them as
>
> compatibility characters.  However, the standard is determined to
>
> emphasise that characters are abstract and that codepoints are
>
> concrete.  So I think it is often more useful and productive to think
>
> of compatibility or convenience codepoints).
>
>
> a composed character :  a character made up of several codepoints
>
>
> Unicode encoding explained
>
> =
>
> A convenience codepoint can therefore be thought of as a code point
>
> used for a character which also has a composed form.
>
>
> The way Unicode works is that sometimes you can encode a character in
>
> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>
> sometimes not.
>
>
> You can therefore have a long stream of ASCII which is single-byte
>
> Unicode.  If there is an occasional Cyrillic or Greek character in the
>
> stream, it would be represented either by a compatibility character or
>
> by a multi-byte combination.
>
>
> Using compatibility characters can prevent proper sorting and
>
> equivalence testing.
>
>
> Using "pure" Unicode, ie. "normal encodings", can cause compatibility
>
> and round-tripping problems.  Although avoiding them can *also* cause
>
> compatibility issues and round-tripping problems.
>
>
> Currently my thinking is:
>
>
> a Utf8String class
>
> an Ordered colle

Re: [Pharo-dev] Unicode Support

2015-12-06 Thread EuanM
Thanks for those pointers, Steph.  I'll make sure they are on my
reading list.  (I have a limited weekly time-budget for Unicode work,
but I expect this is a long-term project).

I'll keep in touch with Steph, so any new facilities can be
immediately useful to Pharo, and someone can guide them to a proper
home in Pharo's Class hierarchy.

For now, I've stuck stuff on my blog,
http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html
in an email here
and at smalltalk.uk.to/unicode-utf.html


On 5 December 2015 at 13:08, stepharo  wrote:
> Hi EuanM
>
> On 4/12/15 12:42, EuanM wrote:
>>
>> I'm currently groping my way to seeing how feature-complete our
>> Unicode support is.  I am doing this to establish what still needs to
>> be done to provide full Unicode support.
>
>
> this is great. Thanks for pushing this. I wrote and collected some roadmap
> (analyses on different topics)
> on the pharo github project feel free to add this one there.
>>
>>
>> This seems to me to be an area where it would be best to write it
>> once, and then have the same codebase incorporated into the Smalltalks
>> that most share a common ancestry.
>>
>> I am keen to get: equality-testing for strings; sortability for
>> strings which have ligatures and diacritic characters; and correct
>> round-tripping of data.
>
> Go!
> My suggestion is
> start small
> make steady progress
> write tests
> commit often :)
>
> Stef
>
> What is the french phoneBook ordering because this is the first time I hear
> about it.
>
>>
>> Call to action:
>> ==
>>
>> If you have comments on these proposals - such as "but we already have
>> that facility" or "the reason we do not have these facilities is
>> because they are dog-slow" - please let me know them.
>>
>> If you would like to help out, please let me know.
>>
>> If you have Unicode experience and expertise, and would like to be, or
>> would be willing to be, in the  'council of experts' for this project,
>> please let me know.
>>
>> If you have comments or ideas on anything mentioned in this email
>>
>> In the first instance, the initiative's website will be:
>> http://smalltalk.uk.to/unicode.html
>>
>> I have created a SqueakSource.com project called UnicodeSupport
>>
>> I want to avoid re-inventing any facilities which already exist.
>> Except where they prevent us reaching the goals of:
>>- sortable UTF8 strings
>>- sortable UTF16 strings
>>- equivalence testing of 2 UTF8 strings
>>- equivalence testing of 2 UTF16 strings
>>- round-tripping UTF8 strings through Smalltalk
>>- roundtripping UTF16 strings through Smalltalk.
>> As I understand it, we have limited Unicode support atm.
>>
>> Current state of play
>> ===
>> ByteString gets converted to WideString when need is automagically
>> detected.
>>
>> Is there anything else that currently exists?
>>
>> Definition of Terms
>> ==
>> A quick definition of terms before I go any further:
>>
>> Standard terms from the Unicode standard
>> ===
>> a compatibility character : an additional encoding of a *normal*
>> character, for compatibility and round-trip conversion purposes.  For
>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>
>> Made-up terms
>> 
>> a convenience codepoint :  a single codepoint which represents an item
>> that is also encoded as a string of codepoints.
>>
>> (I tend to use the terms compatibility character and compatibility
>> codepoint interchangeably.  The standard only refers to them as
>> compatibility characters.  However, the standard is determined to
>> emphasise that characters are abstract and that codepoints are
>> concrete.  So I think it is often more useful and productive to think
>> of compatibility or convenience codepoints).
>>
>> a composed character :  a character made up of several codepoints
>>
>> Unicode encoding explained
>> =
>> A convenience codepoint can therefore be thought of as a code point
>> used for a character which also has a composed form.
>>
>> The way Unicode works is that sometimes you can encode a character in
>> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>> sometimes not.
>>
>> You can therefore have a long stream of ASCII which is single-byte
>> Unicode.  If there is an occasional Cyrillic or Greek character in the
>> stream, it would be represented either by a compatibility character or
>> by a multi-byte combination.
>>
>> Using compatibility characters can prevent proper sorting and
>> equivalence testing.
>>
>> Using "pure" Unicode, ie. "normal encodings", can cause compatibility
>> and round-tripping problems.  Although avoiding them can *also* cause
>> compatibility issues and round-tripping problems.
>>
>> Currently my thinking is:
>>
>> a Utf8String class
>> an Ordered collection, with 1 byte characters as the modal element,
>> but short arrays of wider strings where necessary
>

Re: [Pharo-dev] Unicode Support

2015-12-06 Thread Max Leske

> On 06 Dec 2015, at 18:44, Sven Van Caekenberghe  wrote:
> 
> 
>> On 05 Dec 2015, at 17:35, Todd Blanchard  wrote:
>> 
>> I would suggest that the only worthwhile encoding is UTF8 - the rest are 
>> distractions except for being able to read and convert from other encodings 
>> to UTF8. UTF16 is a complete waste of time. 
>> 
>> Read http://utf8everywhere.org/
>> 
>> I have extensive Unicode chops from around 1999 to 2004 and my experience 
>> leads me to strongly agree with the views on that site.
> 
> Well, I read the page/document/site as well. It was very interesting indeed, 
> thanks for sharing it.
> 
> In some sense it made me reconsider my aversion to in-image utf-8 
> encoding, maybe it could have some value. Absolute storage is more efficient, 
> some processing might also be more efficient, i/o conversions to/from utf-8 
> become a no-op. What I found nice is the suggestion that most structured 
> parsing (XML, JSON, CSV, STON, ...) could actually ignore the encoding for a 
> large part and just assume it's ASCII, which would/could be nice for 
> performance. Also the fact that a lot of strings are (or should be) treated 
> as opaque makes a lot of sense.
> 
> What I did not like is that much of the argumentation is based on issues in 
> the Windows world; take all that away and the document shrinks by half. I 
> would have liked a bit more fundamental CS arguments.
> 
> Canonicalisation and sorting issues are hardly discussed.
> 
> In one place, the fact that a lot of special characters can have multiple 
> representations is a big argument, while it is not mentioned how just 
> treating things like a byte sequence would solve this (it doesn't AFAIU). 
> Like how do you search for $e or $é if you know that it is possible to 
> represent $é as just $é and as $e + $´ ?

That’s what normalization is for: http://unicode.org/faq/normalization.html. It 
will produce the same codepoint sequence for two strings where one contains the 
combining character and the other the precomposed “single character”.
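That point can be sketched in Python, whose `unicodedata` module exposes the normalization forms (the `contains` helper below is a hypothetical illustration of a normalization-aware search, not an existing API):

```python
import unicodedata

precomposed = "\u00E9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 e + U+0301 COMBINING ACUTE ACCENT

# The raw codepoint sequences differ, but both normal forms agree:
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# A normalization-aware substring search (hypothetical helper):
def contains(haystack: str, needle: str) -> bool:
    nfc = lambda s: unicodedata.normalize("NFC", s)
    return nfc(needle) in nfc(haystack)

assert contains("caf" + "e\u0301", "caf\u00E9")   # finds é in either form
```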

> 
> Sven
> 
>> Sent from the road
>> 
>> On Dec 5, 2015, at 05:08, stepharo  wrote:
>> 
>>> Hi EuanM
>>> 
>>> On 4/12/15 12:42, EuanM wrote:
 I'm currently groping my way to seeing how feature-complete our
 Unicode support is.  I am doing this to establish what still needs to
 be done to provide full Unicode support.
>>> 
>>> this is great. Thanks for pushing this. I wrote and collected some roadmap 
>>> (analyses on different topics)
>>> on the pharo github project feel free to add this one there.
 
 This seems to me to be an area where it would be best to write it
 once, and then have the same codebase incorporated into the Smalltalks
 that most share a common ancestry.
 
 I am keen to get: equality-testing for strings; sortability for
 strings which have ligatures and diacritic characters; and correct
 round-tripping of data.
>>> Go!
>>> My suggestion is
>>>   start small
>>>   make steady progress
>>>   write tests
>>>   commit often :)
>>> 
>>> Stef
>>> 
>>> What is the french phoneBook ordering because this is the first time I hear 
>>> about it.
 
 Call to action:
 ==
 
 If you have comments on these proposals - such as "but we already have
 that facility" or "the reason we do not have these facilities is
 because they are dog-slow" - please let me know them.
 
 If you would like to help out, please let me know.
 
 If you have Unicode experience and expertise, and would like to be, or
 would be willing to be, in the  'council of experts' for this project,
 please let me know.
 
 If you have comments or ideas on anything mentioned in this email
 
 In the first instance, the initiative's website will be:
 http://smalltalk.uk.to/unicode.html
 
 I have created a SqueakSource.com project called UnicodeSupport
 
 I want to avoid re-inventing any facilities which already exist.
 Except where they prevent us reaching the goals of:
  - sortable UTF8 strings
  - sortable UTF16 strings
  - equivalence testing of 2 UTF8 strings
  - equivalence testing of 2 UTF16 strings
  - round-tripping UTF8 strings through Smalltalk
  - roundtripping UTF16 strings through Smalltalk.
 As I understand it, we have limited Unicode support atm.
 
 Current state of play
 ===
 ByteString gets converted to WideString when need is automagically 
 detected.
 
 Is there anything else that currently exists?
 
 Definition of Terms
 ==
 A quick definition of terms before I go any further:
 
 Standard terms from the Unicode standard
 ===
 a compatibility character : an additional encoding of a *normal*
 character, for compatibility and round-trip conversion purposes.  For
 instance, a 1-byte encoding of a Latin character with a diacritic.
 
 Made-up terms
>

Re: [Pharo-dev] Unicode Support

2015-12-06 Thread Sven Van Caekenberghe

> On 05 Dec 2015, at 17:35, Todd Blanchard  wrote:
> 
> I would suggest that the only worthwhile encoding is UTF8 - the rest are 
> distractions except for being able to read and convert from other encodings 
> to UTF8. UTF16 is a complete waste of time. 
> 
> Read http://utf8everywhere.org/
> 
> I have extensive Unicode chops from around 1999 to 2004 and my experience 
> leads me to strongly agree with the views on that site.

Well, I read the page/document/site as well. It was very interesting indeed, 
thanks for sharing it.

In some sense it made me reconsider my aversion to in-image utf-8 
encoding, maybe it could have some value. Absolute storage is more efficient, 
some processing might also be more efficient, i/o conversions to/from utf-8 
become a no-op. What I found nice is the suggestion that most structured 
parsing (XML, JSON, CSV, STON, ...) could actually ignore the encoding for a 
large part and just assume it's ASCII, which would/could be nice for 
performance. Also the fact that a lot of strings are (or should be) treated as 
opaque makes a lot of sense.
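The ASCII-transparency property behind that parsing argument can be checked directly (a Python sketch; the invariant itself is part of the UTF-8 design, where every byte of a multibyte sequence has the high bit set):

```python
data = "aé,б,漢".encode("utf-8")

# Splitting on the ASCII comma at the byte level is safe: no byte of a
# multibyte UTF-8 sequence can collide with an ASCII delimiter.
fields = [f.decode("utf-8") for f in data.split(b",")]
assert fields == ["aé", "б", "漢"]

# The invariant: every byte encoding a non-ASCII character is >= 0x80,
# so a CSV/JSON/XML-style scanner can look for ASCII delimiters without
# decoding the text first.
assert all(b >= 0x80 for b in "é漢б".encode("utf-8"))
```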

What I did not like is that much of the argumentation is based on issues in the 
Windows world; take all that away and the document shrinks by half. I would 
have liked a bit more fundamental CS arguments.

Canonicalisation and sorting issues are hardly discussed.

In one place, the fact that a lot of special characters can have multiple 
representations is a big argument, while it is not mentioned how just treating 
things like a byte sequence would solve this (it doesn't AFAIU). Like how do 
you search for $e or $é if you know that it is possible to represent $é as just 
$é and as $e + $´ ?

Sven
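For the storage side of the UTF-8 vs UTF-16 debate, a quick measurement (a sketch only, not an argument for either side):

```python
# Bytes needed per encoding: UTF-8 wins for ASCII-heavy text, UTF-16 wins
# for e.g. CJK text, and both need 4 bytes for characters outside the
# Basic Multilingual Plane (UTF-16 via a surrogate pair).
for s in ["hello world", "日本語のテキスト", "\U0001F600"]:
    print(repr(s), len(s.encode("utf-8")), len(s.encode("utf-16-le")))

# "hello world"     -> 11 bytes UTF-8, 22 bytes UTF-16
# "日本語のテキスト" -> 24 bytes UTF-8, 16 bytes UTF-16
# U+1F600 (emoji)   ->  4 bytes UTF-8,  4 bytes UTF-16
```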

> Sent from the road
> 
> On Dec 5, 2015, at 05:08, stepharo  wrote:
> 
>> Hi EuanM
>> 
>> On 4/12/15 12:42, EuanM wrote:
>>> I'm currently groping my way to seeing how feature-complete our
>>> Unicode support is.  I am doing this to establish what still needs to
>>> be done to provide full Unicode support.
>> 
>> this is great. Thanks for pushing this. I wrote and collected some roadmap 
>> (analyses on different topics)
>> on the pharo github project feel free to add this one there.
>>> 
>>> This seems to me to be an area where it would be best to write it
>>> once, and then have the same codebase incorporated into the Smalltalks
>>> that most share a common ancestry.
>>> 
>>> I am keen to get: equality-testing for strings; sortability for
>>> strings which have ligatures and diacritic characters; and correct
>>> round-tripping of data.
>> Go!
>> My suggestion is
>>start small
>>make steady progress
>>write tests
>>commit often :)
>> 
>> Stef
>> 
>> What is the french phoneBook ordering because this is the first time I hear 
>> about it.
>>> 
>>> Call to action:
>>> ==
>>> 
>>> If you have comments on these proposals - such as "but we already have
>>> that facility" or "the reason we do not have these facilities is
>>> because they are dog-slow" - please let me know them.
>>> 
>>> If you would like to help out, please let me know.
>>> 
>>> If you have Unicode experience and expertise, and would like to be, or
>>> would be willing to be, in the  'council of experts' for this project,
>>> please let me know.
>>> 
>>> If you have comments or ideas on anything mentioned in this email
>>> 
>>> In the first instance, the initiative's website will be:
>>> http://smalltalk.uk.to/unicode.html
>>> 
>>> I have created a SqueakSource.com project called UnicodeSupport
>>> 
>>> I want to avoid re-inventing any facilities which already exist.
>>> Except where they prevent us reaching the goals of:
>>>   - sortable UTF8 strings
>>>   - sortable UTF16 strings
>>>   - equivalence testing of 2 UTF8 strings
>>>   - equivalence testing of 2 UTF16 strings
>>>   - round-tripping UTF8 strings through Smalltalk
>>>   - roundtripping UTF16 strings through Smalltalk.
>>> As I understand it, we have limited Unicode support atm.
>>> 
>>> Current state of play
>>> ===
>>> ByteString gets converted to WideString when need is automagically detected.
>>> 
>>> Is there anything else that currently exists?
>>> 
>>> Definition of Terms
>>> ==
>>> A quick definition of terms before I go any further:
>>> 
>>> Standard terms from the Unicode standard
>>> ===
>>> a compatibility character : an additional encoding of a *normal*
>>> character, for compatibility and round-trip conversion purposes.  For
>>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>> 
>>> Made-up terms
>>> 
>>> a convenience codepoint :  a single codepoint which represents an item
>>> that is also encoded as a string of codepoints.
>>> 
>>> (I tend to use the terms compatibility character and compatibility
>> codepoint interchangeably.  The standard only refers to them as
>>> compatibility characters.  However, the standard is determined to
>>> emphasise that characters are abstract and that codepoints a

Re: [Pharo-dev] Unicode Support

2015-12-05 Thread stepharo

Hi todd

thanks for the link.
It looks really interesting.

Stef

On 5/12/15 17:35, Todd Blanchard wrote:
I would suggest that the only worthwhile encoding is UTF8 - the rest are 
distractions except for being able to read and convert from other 
encodings to UTF8. UTF16 is a complete waste of time.


Read http://utf8everywhere.org/

I have extensive Unicode chops from around 1999 to 2004 and my 
experience leads me to strongly agree with the views on that site.



Sent from the road

On Dec 5, 2015, at 05:08, stepharo wrote:



Hi EuanM

On 4/12/15 12:42, EuanM wrote:

I'm currently groping my way to seeing how feature-complete our
Unicode support is.  I am doing this to establish what still needs to
be done to provide full Unicode support.


this is great. Thanks for pushing this. I wrote and collected some 
roadmap (analyses on different topics)

on the pharo github project feel free to add this one there.


This seems to me to be an area where it would be best to write it
once, and then have the same codebase incorporated into the Smalltalks
that most share a common ancestry.

I am keen to get: equality-testing for strings; sortability for
strings which have ligatures and diacritic characters; and correct
round-tripping of data.

Go!
My suggestion is
   start small
   make steady progress
   write tests
   commit often :)

Stef

What is the french phoneBook ordering because this is the first time 
I hear about it.


Call to action:
==

If you have comments on these proposals - such as "but we already have
that facility" or "the reason we do not have these facilities is
because they are dog-slow" - please let me know them.

If you would like to help out, please let me know.

If you have Unicode experience and expertise, and would like to be, or
would be willing to be, in the  'council of experts' for this project,
please let me know.

If you have comments or ideas on anything mentioned in this email

In the first instance, the initiative's website will be:
http://smalltalk.uk.to/unicode.html

I have created a SqueakSource.com project called UnicodeSupport


I want to avoid re-inventing any facilities which already exist.
Except where they prevent us reaching the goals of:
  - sortable UTF8 strings
  - sortable UTF16 strings
  - equivalence testing of 2 UTF8 strings
  - equivalence testing of 2 UTF16 strings
  - round-tripping UTF8 strings through Smalltalk
  - roundtripping UTF16 strings through Smalltalk.
As I understand it, we have limited Unicode support atm.

Current state of play
===
ByteString gets converted to WideString when need is automagically 
detected.


Is there anything else that currently exists?

Definition of Terms
==
A quick definition of terms before I go any further:

Standard terms from the Unicode standard
===
a compatibility character : an additional encoding of a *normal*
character, for compatibility and round-trip conversion purposes.  For
instance, a 1-byte encoding of a Latin character with a diacritic.

Made-up terms

a convenience codepoint :  a single codepoint which represents an item
that is also encoded as a string of codepoints.

(I tend to use the terms compatibility character and compatibility
codepoint interchangably.  The standard only refers to them as
compatibility characters.  However, the standard is determined to
emphasise that characters are abstract and that codepoints are
concrete.  So I think it is often more useful and productive to think
of compatibility or convenience codepoints).

a composed character :  a character made up of several codepoints

Unicode encoding explained
=
A convenience codepoint can therefore be thought of as a code point
used for a character which also has a composed form.

The way Unicode works is that sometimes you can encode a character in
one byte, sometimes not.  Sometimes you can encode it in two bytes,
sometimes not.

You can therefore have a long stream of ASCII which is single-byte
Unicode.  If there is an occasional Cyrillic or Greek character in the
stream, it would be represented either by a compatibility character or
by a multi-byte combination.

Using compatibility characters can prevent proper sorting and
equivalence testing.

Using "pure" Unicode, ie. "normal encodings", can cause compatibility
and round-tripping problems.  Although avoiding them can *also* cause
compatibility issues and round-tripping problems.

Currently my thinking is:

a Utf8String class
an Ordered collection, with 1 byte characters as the modal element,
but short arrays of wider strings where necessary
a Utf16String class
an Ordered collection, with 2 byte characters as the modal element,
but short arrays of wider strings
beginning with a 2-byte endianness indicator.

Utf8Strings sometimes need to be sortable, and sometimes need to be 
compatible.


So my thinking is that Utf8String will contain 

Re: [Pharo-dev] Unicode Support

2015-12-05 Thread Todd Blanchard
I would suggest that the only worthwhile encoding is UTF8 - the rest are 
distractions except for being able to read and convert from other encodings to 
UTF8. UTF16 is a complete waste of time. 

Read http://utf8everywhere.org/

I have extensive Unicode chops from around 1999 to 2004 and my experience leads 
me to strongly agree with the views on that site.


Sent from the road

> On Dec 5, 2015, at 05:08, stepharo  wrote:
> 
> Hi EuanM
> 
> On 4/12/15 12:42, EuanM wrote:
>> I'm currently groping my way to seeing how feature-complete our
>> Unicode support is.  I am doing this to establish what still needs to
>> be done to provide full Unicode support.
> 
> this is great. Thanks for pushing this. I wrote and collected some roadmap 
> (analyses on different topics)
> on the pharo github project feel free to add this one there.
>> 
>> This seems to me to be an area where it would be best to write it
>> once, and then have the same codebase incorporated into the Smalltalks
>> that most share a common ancestry.
>> 
>> I am keen to get: equality-testing for strings; sortability for
>> strings which have ligatures and diacritic characters; and correct
>> round-tripping of data.
> Go!
> My suggestion is
>start small
>make steady progress
>write tests
>commit often :)
> 
> Stef
> 
> What is the french phoneBook ordering because this is the first time I hear 
> about it.
>> 
>> Call to action:
>> ==
>> 
>> If you have comments on these proposals - such as "but we already have
>> that facility" or "the reason we do not have these facilities is
>> because they are dog-slow" - please let me know them.
>> 
>> If you would like to help out, please let me know.
>> 
>> If you have Unicode experience and expertise, and would like to be, or
>> would be willing to be, in the  'council of experts' for this project,
>> please let me know.
>> 
>> If you have comments or ideas on anything mentioned in this email
>> 
>> In the first instance, the initiative's website will be:
>> http://smalltalk.uk.to/unicode.html
>> 
>> I have created a SqueakSource.com project called UnicodeSupport
>> 
>> I want to avoid re-inventing any facilities which already exist.
>> Except where they prevent us reaching the goals of:
>>   - sortable UTF8 strings
>>   - sortable UTF16 strings
>>   - equivalence testing of 2 UTF8 strings
>>   - equivalence testing of 2 UTF16 strings
>>   - round-tripping UTF8 strings through Smalltalk
>>   - roundtripping UTF16 strings through Smalltalk.
>> As I understand it, we have limited Unicode support atm.
>> 
>> Current state of play
>> ===
>> ByteString gets converted to WideString when need is automagically detected.
>> 
>> Is there anything else that currently exists?
>> 
>> Definition of Terms
>> ==
>> A quick definition of terms before I go any further:
>> 
>> Standard terms from the Unicode standard
>> ===
>> a compatibility character : an additional encoding of a *normal*
>> character, for compatibility and round-trip conversion purposes.  For
>> instance, a 1-byte encoding of a Latin character with a diacritic.
>> 
>> Made-up terms
>> 
>> a convenience codepoint :  a single codepoint which represents an item
>> that is also encoded as a string of codepoints.
>> 
>> (I tend to use the terms compatibility character and compatibility
>> codepoint interchangeably.  The standard only refers to them as
>> compatibility characters.  However, the standard is determined to
>> emphasise that characters are abstract and that codepoints are
>> concrete.  So I think it is often more useful and productive to think
>> of compatibility or convenience codepoints).
>> 
>> a composed character :  a character made up of several codepoints
>> 
>> Unicode encoding explained
>> =
>> A convenience codepoint can therefore be thought of as a code point
>> used for a character which also has a composed form.
>> 
>> The way Unicode works is that sometimes you can encode a character in
>> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>> sometimes not.
>> 
>> You can therefore have a long stream of ASCII which is single-byte
>> Unicode.  If there is an occasional Cyrillic or Greek character in the
>> stream, it would be represented either by a compatibility character or
>> by a multi-byte combination.
>> 
>> Using compatibility characters can prevent proper sorting and
>> equivalence testing.
>> 
>> Using "pure" Unicode, ie. "normal encodings", can cause compatibility
>> and round-tripping problems.  Although avoiding them can *also* cause
>> compatibility issues and round-tripping problems.
>> 
>> Currently my thinking is:
>> 
>> a Utf8String class
>> an Ordered collection, with 1 byte characters as the modal element,
>> but short arrays of wider strings where necessary
>> a Utf16String class
>> an Ordered collection, with 2 byte characters as the modal element,
>> but short arrays of wider strings
>> beginning with a 2-byte endianness indicator.

Re: [Pharo-dev] Unicode Support

2015-12-05 Thread Todd Blanchard


Sent from the road

> On Dec 5, 2015, at 05:08, stepharo  wrote:
> 
> Hi EuanM
> 
> On 4/12/15 12:42, EuanM wrote:
>> I'm currently groping my way to seeing how feature-complete our
>> Unicode support is.  I am doing this to establish what still needs to
>> be done to provide full Unicode support.
> 
> this is great. Thanks for pushing this. I wrote and collected some roadmaps
> (analyses on different topics) on the Pharo GitHub project; feel free to add
> this one there.
>> 
>> This seems to me to be an area where it would be best to write it
>> once, and then have the same codebase incorporated into the Smalltalks
>> that most share a common ancestry.
>> 
>> I am keen to get: equality-testing for strings; sortability for
>> strings which have ligatures and diacritic characters; and correct
>> round-tripping of data.
> Go!
> My suggestion is
>start small
>make steady progress
>write tests
>commit often :)
> 
> Stef
> 
> What is the French phone book ordering? This is the first time I have heard
> of it.
>> 
>> Call to action:
>> ==
>> 
>> If you have comments on these proposals - such as "but we already have
>> that facility" or "the reason we do not have these facilities is
>> because they are dog-slow" - please let me know them.
>> 
>> If you would like to help out, please let me know.
>> 
>> If you have Unicode experience and expertise, and would like to be, or
>> would be willing to be, in the  'council of experts' for this project,
>> please let me know.
>> 
>> If you have comments or ideas on anything mentioned in this email
>> 
>> In the first instance, the initiative's website will be:
>> http://smalltalk.uk.to/unicode.html
>> 
>> I have created a SqueakSource.com project called UnicodeSupport
>> 
>> I want to avoid re-inventing any facilities which already exist.
>> Except where they prevent us reaching the goals of:
>>   - sortable UTF8 strings
>>   - sortable UTF16 strings
>>   - equivalence testing of 2 UTF8 strings
>>   - equivalence testing of 2 UTF16 strings
>>   - round-tripping UTF8 strings through Smalltalk
>>   - roundtripping UTF16 strings through Smalltalk.
>> As I understand it, we have limited Unicode support atm.
>> 
>> Current state of play
>> ===
>> ByteString gets converted to WideString when need is automagically detected.
>> 
>> Is there anything else that currently exists?
>> 
>> Definition of Terms
>> ==
>> A quick definition of terms before I go any further:
>> 
>> Standard terms from the Unicode standard
>> ===
>> a compatibility character : an additional encoding of a *normal*
>> character, for compatibility and round-trip conversion purposes.  For
>> instance, a 1-byte encoding of a Latin character with a diacritic.
>> 
>> Made-up terms
>> 
>> a convenience codepoint :  a single codepoint which represents an item
>> that is also encoded as a string of codepoints.
>> 
>> (I tend to use the terms compatibility character and compatibility
>> codepoint interchangeably.  The standard only refers to them as
>> compatibility characters.  However, the standard is determined to
>> emphasise that characters are abstract and that codepoints are
>> concrete.  So I think it is often more useful and productive to think
>> of compatibility or convenience codepoints).
>> 
>> a composed character :  a character made up of several codepoints
>> 
>> Unicode encoding explained
>> =
>> A convenience codepoint can therefore be thought of as a code point
>> used for a character which also has a composed form.
>> 
>> The way Unicode works is that sometimes you can encode a character in
>> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>> sometimes not.
>> 
>> You can therefore have a long stream of ASCII which is single-byte
>> Unicode.  If there is an occasional Cyrillic or Greek character in the
>> stream, it would be represented either by a compatibility character or
>> by a multi-byte combination.
>> 
>> Using compatibility characters can prevent proper sorting and
>> equivalence testing.
>> 
>> Using "pure" Unicode, ie. "normal encodings", can cause compatibility
>> and round-tripping problems.  Although avoiding them can *also* cause
>> compatibility issues and round-tripping problems.
>> 
>> Currently my thinking is:
>> 
>> a Utf8String class
>> an Ordered collection, with 1 byte characters as the modal element,
>> but short arrays of wider strings where necessary
>> a Utf16String class
>> an Ordered collection, with 2 byte characters as the modal element,
>> but short arrays of wider strings
>> beginning with a 2-byte endianness indicator.
>> 
>> Utf8Strings sometimes need to be sortable, and sometimes need to be 
>> compatible.
>> 
>> So my thinking is that Utf8String will contain convenience codepoints,
>> for round-tripping.  And where there are multiple convenience
>> codepoints for a character, that it standardises on one.
>> 
>> And that there is a Utf8SortableString which uses *only* normal characters.

Re: [Pharo-dev] Unicode Support

2015-12-05 Thread stepharo

Hi EuanM

On 4/12/15 12:42, EuanM wrote:

I'm currently groping my way to seeing how feature-complete our
Unicode support is.  I am doing this to establish what still needs to
be done to provide full Unicode support.


this is great. Thanks for pushing this. I wrote and collected some roadmaps 
(analyses on different topics)
on the Pharo GitHub project; feel free to add this one there.


This seems to me to be an area where it would be best to write it
once, and then have the same codebase incorporated into the Smalltalks
that most share a common ancestry.

I am keen to get: equality-testing for strings; sortability for
strings which have ligatures and diacritic characters; and correct
round-tripping of data.

Go!
My suggestion is
start small
make steady progress
write tests
commit often :)

Stef

What is the French phone book ordering? This is the first time I have 
heard of it.


Call to action:
==

If you have comments on these proposals - such as "but we already have
that facility" or "the reason we do not have these facilities is
because they are dog-slow" - please let me know them.

If you would like to help out, please let me know.

If you have Unicode experience and expertise, and would like to be, or
would be willing to be, in the  'council of experts' for this project,
please let me know.

If you have comments or ideas on anything mentioned in this email

In the first instance, the initiative's website will be:
http://smalltalk.uk.to/unicode.html

I have created a SqueakSource.com project called UnicodeSupport

I want to avoid re-inventing any facilities which already exist.
Except where they prevent us reaching the goals of:
   - sortable UTF8 strings
   - sortable UTF16 strings
   - equivalence testing of 2 UTF8 strings
   - equivalence testing of 2 UTF16 strings
   - round-tripping UTF8 strings through Smalltalk
   - roundtripping UTF16 strings through Smalltalk.
As I understand it, we have limited Unicode support atm.
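The proposed Utf8String and Utf16String classes do not exist yet, so as a concrete illustration of the round-tripping goal, here is the same requirement expressed in Python, whose built-in `str` plays the role of the in-image string: decoding external UTF-8 bytes and re-encoding them must reproduce the original bytes exactly.

```python
# Round-tripping requirement, illustrated with Python's codecs:
# external UTF-8 bytes -> internal string -> external UTF-8 bytes
# must be lossless.
original_bytes = "naïve Ж".encode("utf-8")

decoded = original_bytes.decode("utf-8")   # bytes in, string out
reencoded = decoded.encode("utf-8")        # string in, bytes out

print(reencoded == original_bytes)  # True: a lossless round trip
```

The same property is what the sortable/compatible string split has to preserve: whatever internal form is chosen, encoding back out must yield the bytes that came in.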

Current state of play
===
ByteString gets converted to WideString when need is automagically detected.

Is there anything else that currently exists?

Definition of Terms
==
A quick definition of terms before I go any further:

Standard terms from the Unicode standard
===
a compatibility character : an additional encoding of a *normal*
character, for compatibility and round-trip conversion purposes.  For
instance, a 1-byte encoding of a Latin character with a diacritic.

Made-up terms

a convenience codepoint :  a single codepoint which represents an item
that is also encoded as a string of codepoints.

(I tend to use the terms compatibility character and compatibility
codepoint interchangeably.  The standard only refers to them as
compatibility characters.  However, the standard is determined to
emphasise that characters are abstract and that codepoints are
concrete.  So I think it is often more useful and productive to think
of compatibility or convenience codepoints).

a composed character :  a character made up of several codepoints

Unicode encoding explained
=
A convenience codepoint can therefore be thought of as a code point
used for a character which also has a composed form.

The way Unicode works is that sometimes you can encode a character in
one byte, sometimes not.  Sometimes you can encode it in two bytes,
sometimes not.

You can therefore have a long stream of ASCII which is single-byte
Unicode.  If there is an occasional Cyrillic or Greek character in the
stream, it would be represented either by a compatibility character or
by a multi-byte combination.
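The variable-width behaviour described above is easy to see from any UTF-8 encoder; a quick Python check (purely illustrative, not tied to any proposed Pharo class):

```python
# UTF-8 is variable width: ASCII characters encode to one byte,
# while Cyrillic and Greek characters encode to two bytes each.
for ch in ("a", "Ж", "α"):           # ASCII, Cyrillic, Greek
    print(ch, len(ch.encode("utf-8")))
# a -> 1 byte, Ж -> 2 bytes, α -> 2 bytes
```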

Using compatibility characters can prevent proper sorting and
equivalence testing.

Using "pure" Unicode, ie. "normal encodings", can cause compatibility
and round-tripping problems.  Although avoiding them can *also* cause
compatibility issues and round-tripping problems.
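The sorting and equivalence problem comes from the two spellings of the same character (Sven's 'é' example elsewhere in this thread). A Python sketch with the standard `unicodedata` module shows why naive equality fails and how normalisation fixes it:

```python
import unicodedata

# 'é' has two Unicode spellings:
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (precomposed)
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Raw codepoint comparison says they differ, even though they are
# canonically equivalent:
print(composed == decomposed)  # False

# Normalising both to the same form (NFC or NFD) makes equality work:
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
```

Any equivalence test for the proposed string classes would need to normalise (or compare under normalisation) in essentially this way.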

Currently my thinking is:

a Utf8String class
an Ordered collection, with 1 byte characters as the modal element,
but short arrays of wider strings where necessary
a Utf16String class
an Ordered collection, with 2 byte characters as the modal element,
but short arrays of wider strings
beginning with a 2-byte endianness indicator.

Utf8Strings sometimes need to be sortable, and sometimes need to be compatible.

So my thinking is that Utf8String will contain convenience codepoints,
for round-tripping.  And where there are multiple convenience
codepoints for a character, that it standardises on one.

And that there is a Utf8SortableString which uses *only* normal characters.

We then need methods to convert between the two.

aUtf8String asUtf8SortableString

and

aUtf8SortableString asUtf8String


Sort orders are culture and context dependent - Sweden and Germany
have different sort orders for the same diacritic-ed characters.  Some
countries have one order in general usage, and another for specific
usages, such as phone directories.
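To make the sortability problem concrete: sorting by raw codepoint puts 'é' (U+00E9) after 'z', which no human ordering wants. A crude Python sketch of a sortable key (decompose, drop combining marks, keep the original string as a tie-breaker) shows the idea; note this is a naive approximation, not real locale-aware collation, which (as noted above) differs between e.g. Sweden and Germany and is what ICU's collation algorithms provide:

```python
import unicodedata

words = ["f", "é", "e"]

# Raw codepoint order: 'é' (U+00E9 = 233) sorts after 'f' (102).
print(sorted(words))  # ['e', 'f', 'é']

def sort_key(s):
    """Naive sortable key: strip combining marks (category Mn) after
    NFD decomposition; keep the original string as a tie-breaker."""
    stripped = "".join(c for c in unicodedata.normalize("NFD", s)
                       if unicodedata.category(c) != "Mn")
    return (stripped, s)

# With the key, 'é' sorts next to 'e', as a reader would expect.
print(sorted(words, key=sort_key))  # ['e', 'é', 'f']
```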

Re: [Pharo-dev] Unicode Support

2015-12-04 Thread Sven Van Caekenberghe

> On 04 Dec 2015, at 17:00, Max Leske  wrote:
> 
> Hi Euan
> 
> I think it’s great that you’re trying this. I hope you know what you’re 
> getting yourself into :)
> 
> 
> I’m no Unicode expert but I want to add two points to your list (although 
> you’ve probably already thought of them):
> - Normalisation and conversion (http://unicode.org/faq/normalization.html).
>   Unicode / ICU provide libraries (libuconv / libiconv) that handle this 
> stuff. Specifically normalisation conversions
>   aren’t trivial and I think it wouldn’t make much sense to reimplement 
> those algorithms. I do think however, that
>   having them available is important (where I work we’re currently 
> writing a VM plugin for access to libiconv through
>   primitives so that we can clean out combining characters through 
> normalisation. And we’ll obviously get nice sorting
>   properties and speeds for free)
> - Sorting and comparison.
>   Basically the same point as above. libuconv / libiconv provide 
> algorithms for this. Do we need our own implementation?

These 2 are indeed missing and it would be good to add them.

We already have UTF8/UTF16 encoding/decoding, even 2 implementations. See 
http://files.pharo.org/books/enterprisepharo/book/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html
 for the modern version. 

But IMHO it would not be a good idea to try to implement functionality on 
in-image strings with those representations; it would be too slow.

But of course, if you want to try to implement something and show us, go for it.

> Cheers,
> Max
> 
> 
>> On 04 Dec 2015, at 12:42, EuanM  wrote:
>> 
>> I'm currently groping my way to seeing how feature-complete our
>> Unicode support is.  I am doing this to establish what still needs to
>> be done to provide full Unicode support.
>> 
>> This seems to me to be an area where it would be best to write it
>> once, and then have the same codebase incorporated into the Smalltalks
>> that most share a common ancestry.
>> 
>> I am keen to get: equality-testing for strings; sortability for
>> strings which have ligatures and diacritic characters; and correct
>> round-tripping of data.
>> 
>> Call to action:
>> ==
>> 
>> If you have comments on these proposals - such as "but we already have
>> that facility" or "the reason we do not have these facilities is
>> because they are dog-slow" - please let me know them.
>> 
>> If you would like to help out, please let me know.
>> 
>> If you have Unicode experience and expertise, and would like to be, or
>> would be willing to be, in the  'council of experts' for this project,
>> please let me know.
>> 
>> If you have comments or ideas on anything mentioned in this email
>> 
>> In the first instance, the initiative's website will be:
>> http://smalltalk.uk.to/unicode.html
>> 
>> I have created a SqueakSource.com project called UnicodeSupport
>> 
>> I want to avoid re-inventing any facilities which already exist.
>> Except where they prevent us reaching the goals of:
>> - sortable UTF8 strings
>> - sortable UTF16 strings
>> - equivalence testing of 2 UTF8 strings
>> - equivalence testing of 2 UTF16 strings
>> - round-tripping UTF8 strings through Smalltalk
>> - roundtripping UTF16 strings through Smalltalk.
>> As I understand it, we have limited Unicode support atm.
>> 
>> Current state of play
>> ===
>> ByteString gets converted to WideString when need is automagically detected.
>> 
>> Is there anything else that currently exists?
>> 
>> Definition of Terms
>> ==
>> A quick definition of terms before I go any further:
>> 
>> Standard terms from the Unicode standard
>> ===
>> a compatibility character : an additional encoding of a *normal*
>> character, for compatibility and round-trip conversion purposes.  For
>> instance, a 1-byte encoding of a Latin character with a diacritic.
>> 
>> Made-up terms
>> 
>> a convenience codepoint :  a single codepoint which represents an item
>> that is also encoded as a string of codepoints.
>> 
>> (I tend to use the terms compatibility character and compatibility
>> codepoint interchangeably.  The standard only refers to them as
>> compatibility characters.  However, the standard is determined to
>> emphasise that characters are abstract and that codepoints are
>> concrete.  So I think it is often more useful and productive to think
>> of compatibility or convenience codepoints).
>> 
>> a composed character :  a character made up of several codepoints
>> 
>> Unicode encoding explained
>> =
>> A convenience codepoint can therefore be thought of as a code point
>> used for a character which also has a composed form.
>> 
>> The way Unicode works is that sometimes you can encode a character in
>> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>> sometimes not.
>> 
>> You can therefore have a long stream of ASCII which is single-byte
>> Unicode.  If there is an occasional Cyrillic or Greek character in the
>> stream, it would be represented either by a compatibility character or
>> by a multi-byte combination.

Re: [Pharo-dev] Unicode Support

2015-12-04 Thread Max Leske
Hi Euan

I think it’s great that you’re trying this. I hope you know what you’re getting 
yourself into :)


I’m no Unicode expert but I want to add two points to your list (although 
you’ve probably already thought of them):
- Normalisation and conversion (http://unicode.org/faq/normalization.html).
  Unicode / ICU provide libraries (libuconv / libiconv) that handle this stuff.
  Specifically, normalisation conversions aren’t trivial and I think it
  wouldn’t make much sense to reimplement those algorithms. I do think,
  however, that having them available is important. (Where I work we’re
  currently writing a VM plugin for access to libiconv through primitives, so
  that we can clean out combining characters through normalisation. And we’ll
  obviously get nice sorting properties and speeds for free.)
- Sorting and comparison.
  Basically the same point as above. libuconv / libiconv provide algorithms
  for this. Do we need our own implementation?
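The kind of compatibility normalisation a library like ICU provides can be sketched in Python with the standard `unicodedata` module; for example, NFKC folds a compatibility character such as the 'ﬁ' ligature (U+FB01), mentioned earlier in this thread as a problem for equality testing, into its ordinary letters:

```python
import unicodedata

# Equality fails on the raw strings because 'ﬁ' is a single
# compatibility codepoint, not the two letters 'f' + 'i':
print("ﬁle" == "file")  # False

# NFKC (compatibility composition) folds the ligature away:
print(unicodedata.normalize("NFKC", "ﬁle") == "file")  # True
```

This is only the equivalence side; the sorting and speed benefits Max mentions would come from binding the real ICU algorithms, not from reimplementing them in-image.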

Cheers,
Max


> On 04 Dec 2015, at 12:42, EuanM  wrote:
> 
> I'm currently groping my way to seeing how feature-complete our
> Unicode support is.  I am doing this to establish what still needs to
> be done to provide full Unicode support.
> 
> This seems to me to be an area where it would be best to write it
> once, and then have the same codebase incorporated into the Smalltalks
> that most share a common ancestry.
> 
> I am keen to get: equality-testing for strings; sortability for
> strings which have ligatures and diacritic characters; and correct
> round-tripping of data.
> 
> Call to action:
> ==
> 
> If you have comments on these proposals - such as "but we already have
> that facility" or "the reason we do not have these facilities is
> because they are dog-slow" - please let me know them.
> 
> If you would like to help out, please let me know.
> 
> If you have Unicode experience and expertise, and would like to be, or
> would be willing to be, in the  'council of experts' for this project,
> please let me know.
> 
> If you have comments or ideas on anything mentioned in this email
> 
> In the first instance, the initiative's website will be:
> http://smalltalk.uk.to/unicode.html
> 
> I have created a SqueakSource.com project called UnicodeSupport
> 
> I want to avoid re-inventing any facilities which already exist.
> Except where they prevent us reaching the goals of:
>  - sortable UTF8 strings
>  - sortable UTF16 strings
>  - equivalence testing of 2 UTF8 strings
>  - equivalence testing of 2 UTF16 strings
>  - round-tripping UTF8 strings through Smalltalk
>  - roundtripping UTF16 strings through Smalltalk.
> As I understand it, we have limited Unicode support atm.
> 
> Current state of play
> ===
> ByteString gets converted to WideString when need is automagically detected.
> 
> Is there anything else that currently exists?
> 
> Definition of Terms
> ==
> A quick definition of terms before I go any further:
> 
> Standard terms from the Unicode standard
> ===
> a compatibility character : an additional encoding of a *normal*
> character, for compatibility and round-trip conversion purposes.  For
> instance, a 1-byte encoding of a Latin character with a diacritic.
> 
> Made-up terms
> 
> a convenience codepoint :  a single codepoint which represents an item
> that is also encoded as a string of codepoints.
> 
> (I tend to use the terms compatibility character and compatibility
> codepoint interchangeably.  The standard only refers to them as
> compatibility characters.  However, the standard is determined to
> emphasise that characters are abstract and that codepoints are
> concrete.  So I think it is often more useful and productive to think
> of compatibility or convenience codepoints).
> 
> a composed character :  a character made up of several codepoints
> 
> Unicode encoding explained
> =
> A convenience codepoint can therefore be thought of as a code point
> used for a character which also has a composed form.
> 
> The way Unicode works is that sometimes you can encode a character in
> one byte, sometimes not.  Sometimes you can encode it in two bytes,
> sometimes not.
> 
> You can therefore have a long stream of ASCII which is single-byte
> Unicode.  If there is an occasional Cyrillic or Greek character in the
> stream, it would be represented either by a compatibility character or
> by a multi-byte combination.
> 
> Using compatibility characters can prevent proper sorting and
> equivalence testing.
> 
> Using "pure" Unicode, ie. "normal encodings", can cause compatibility
> and round-tripping problems.  Although avoiding them can *also* cause
> compatibility issues and round-tripping problems.
> 
> Currently my thinking is:
> 
> a Utf8String class
> an Ordered collection, with 1 byte characters as the modal element,
> but short arrays of wider strings where necessary
> a Utf16String class
> an Ordered collection, with 2 byte characters as the modal element,