I would suggest that the only worthwhile encoding is UTF8 - the rest are 
distractions, except for being able to read other encodings and convert them to 
UTF8. UTF16 is a complete waste of time. 

Read http://utf8everywhere.org/

I have extensive Unicode chops from around 1999 to 2004 and my experience leads 
me to strongly agree with the views on that site.


Sent from the road

> On Dec 5, 2015, at 05:08, stepharo <steph...@free.fr> wrote:
> 
> Hi EuanM
> 
> On 4/12/15 12:42, EuanM wrote:
>> I'm currently groping my way to seeing how feature-complete our
>> Unicode support is.  I am doing this to establish what still needs to
>> be done to provide full Unicode support.
> 
> this is great. Thanks for pushing this. I wrote and collected some roadmaps 
> (analyses of different topics)
> on the Pharo GitHub project; feel free to add this one there.
>> 
>> This seems to me to be an area where it would be best to write it
>> once, and then have the same codebase incorporated into the Smalltalks
>> that most share a common ancestry.
>> 
>> I am keen to get: equality-testing for strings; sortability for
>> strings which have ligatures and diacritic characters; and correct
>> round-tripping of data.
> Go!
> My suggestion is
>    start small
>    make steady progress
>    write tests
>    commit often :)
> 
> Stef
> 
> What is the French phone book ordering? This is the first time I have heard 
> about it.
>> 
>> Call to action:
>> ==========
>> 
>> If you have comments on these proposals - such as "but we already have
>> that facility" or "the reason we do not have these facilities is
>> because they are dog-slow" - please let me know them.
>> 
>> If you would like to help out, please let me know.
>> 
>> If you have Unicode experience and expertise, and would like to be, or
>> would be willing to be, in the  'council of experts' for this project,
>> please let me know.
>> 
>> If you have comments or ideas on anything mentioned in this email, please
>> let me know.
>> 
>> In the first instance, the initiative's website will be:
>> http://smalltalk.uk.to/unicode.html
>> 
>> I have created a SqueakSource.com project called UnicodeSupport
>> 
>> I want to avoid re-inventing any facilities which already exist.
>> Except where they prevent us reaching the goals of:
>>   - sortable UTF8 strings
>>   - sortable UTF16 strings
>>   - equivalence testing of 2 UTF8 strings
>>   - equivalence testing of 2 UTF16 strings
>>   - round-tripping UTF8 strings through Smalltalk
>>   - round-tripping UTF16 strings through Smalltalk.
>> As I understand it, we have limited Unicode support atm.
>> 
>> Current state of play
>> ===============
>> A ByteString gets converted to a WideString automagically, when the need is detected.
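>>
>> As a quick illustration of that automagic conversion (a workspace snippet;
>> this is how current Pharo and Squeak images behave, as far as I can tell):
>>
>> | s |
>> s := String new: 3.
>> s class.                                    "ByteString"
>> s at: 1 put: (Character value: 16r263A).    "store a codepoint above 255"
>> s class.                                    "WideString - converted in place"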
>> 
>> Is there anything else that currently exists?
>> 
>> Definition of Terms
>> ==============
>> A quick definition of terms before I go any further:
>> 
>> Standard terms from the Unicode standard
>> ===============================
>> a compatibility character : an additional encoding of a *normal*
>> character, for compatibility and round-trip conversion purposes.  For
>> instance, a 1-byte encoding of a Latin character with a diacritic.
>> 
>> Made-up terms
>> ============
>> a convenience codepoint :  a single codepoint which represents an item
>> that is also encoded as a string of codepoints.
>> 
>> (I tend to use the terms compatibility character and compatibility
>> codepoint interchangeably.  The standard only refers to them as
>> compatibility characters.  However, the standard is determined to
>> emphasise that characters are abstract and that codepoints are
>> concrete.  So I think it is often more useful and productive to think
>> of compatibility or convenience codepoints).
>> 
>> a composed character :  a character made up of several codepoints
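>>
>> For instance, é exists both as the single codepoint U+00E9 and as the
>> composed pair U+0065 U+0301 (e followed by a combining acute).  In a
>> workspace:
>>
>> | single composed |
>> single := WideString with: (Character value: 16rE9).
>> composed := WideString with: $e with: (Character value: 16r0301).
>> single asArray collect: [:c | c value].      "#(233)"
>> composed asArray collect: [:c | c value].    "#(101 769)"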
>> 
>> Unicode encoding explained
>> =====================
>> A convenience codepoint can therefore be thought of as a code point
>> used for a character which also has a composed form.
>> 
>> The way Unicode works is that sometimes you can encode a character in
>> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>> sometimes not.
>> 
>> You can therefore have a long stream of ASCII which is single-byte
>> Unicode.  If there is an occasional Cyrillic or Greek character in the
>> stream, it would be represented either by a compatibility character or
>> by a multi-byte combination.
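>>
>> To put byte counts on that (this uses Pharo's utf8Encoded, which answers
>> a ByteArray - adjust for your dialect's converter):
>>
>> 'abc' utf8Encoded size.    "3 - ASCII stays at one byte per character"
>> 'Д' utf8Encoded size.      "2 - a Cyrillic letter needs two bytes"
>> '€' utf8Encoded size.      "3 - the euro sign needs three"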
>> 
>> Using compatibility characters can prevent proper sorting and
>> equivalence testing.
>> 
>> Using "pure" Unicode, ie. "normal encodings", can cause compatibility
>> and round-tripping problems.  Although avoiding them can *also* cause
>> compatibility issues and round-tripping problems.
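>>
>> The equivalence problem in miniature: today's string equality compares
>> codepoint by codepoint, so the two spellings of é from above test as
>> different even though they denote the same character:
>>
>> (WideString with: (Character value: 16rE9))
>>     = (WideString with: $e with: (Character value: 16r0301)).    "false"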
>> 
>> Currently my thinking is:
>> 
>> a Utf8String class:
>> an OrderedCollection, with 1-byte characters as the modal element,
>> but short arrays of wider strings where necessary
>> 
>> a Utf16String class:
>> an OrderedCollection, with 2-byte characters as the modal element,
>> but short arrays of wider strings,
>> beginning with a 2-byte endianness indicator.
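>>
>> As a bare sketch of the class shapes (all names provisional; the category
>> just matches the SqueakSource project mentioned above):
>>
>> Object subclass: #Utf8String
>>     instanceVariableNames: 'elements'
>>     classVariableNames: ''
>>     poolDictionaries: ''
>>     category: 'UnicodeSupport'
>>
>> Object subclass: #Utf16String
>>     instanceVariableNames: 'elements'
>>     classVariableNames: ''
>>     poolDictionaries: ''
>>     category: 'UnicodeSupport'
>>
>> where elements holds mostly narrow characters, with the occasional short
>> run of wider ones.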
>> 
>> Utf8Strings sometimes need to be sortable, and sometimes need to be 
>> compatible.
>> 
>> So my thinking is that Utf8String will contain convenience codepoints,
>> for round-tripping.  And where there are multiple convenience
>> codepoints for a character, that it standardises on one.
>> 
>> And that there is a Utf8SortableString which uses *only* normal characters.
>> 
>> We then need methods to convert between the two.
>> 
>> aUtf8String asUtf8SortableString
>> 
>> and
>> 
>> aUtf8SortableString asUtf8String
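>>
>> In use I imagine something along these lines (everything here is
>> provisional; fromByteArray: is just a stand-in for whatever the
>> construction message ends up being):
>>
>> | raw sortable |
>> raw := Utf8String fromByteArray: 'résumé' utf8Encoded.
>>     "keeps convenience codepoints, for round-tripping"
>> sortable := raw asUtf8SortableString.
>>     "normal codepoints only, fit for sorting and equivalence testing"
>> sortable asUtf8String = raw.
>>     "should hold, if the two conversions round-trip as intended"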
>> 
>> 
>> Sort orders are culture and context dependent - Sweden and Germany
>> have different sort orders for the same diacritic-ed characters.  Some
>> countries have one order in general usage, and another for specific
>> usages, such as phone directories (e.g. UK and France).
>> 
>> Similarly for Utf16 :  Utf16String and Utf16SortableString and
>> conversion methods
>> 
>> A list of sorted words would be a SortedCollection, and there could be
>> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>> seOrder, ukOrder, etc
>> 
>> along the lines of
>> aListOfWords := SortedCollection sortBlock: deOrder
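>>
>> A rough-and-ready deOrder can already be built with nothing more than a
>> key function inside the sortBlock.  This one just folds umlauts to their
>> two-letter spellings (one of the common German conventions) - purely
>> illustrative, a real deOrder would use proper collation keys:
>>
>> | deKey deOrder words |
>> deKey := [:w |
>>     (((w copyReplaceAll: 'ä' with: 'ae')
>>         copyReplaceAll: 'ö' with: 'oe')
>>         copyReplaceAll: 'ü' with: 'ue')
>>         copyReplaceAll: 'ß' with: 'ss'].
>> deOrder := [:a :b | (deKey value: a) <= (deKey value: b)].
>> words := SortedCollection sortBlock: deOrder.
>> words addAll: #('zebra' 'äpfel' 'apfel').
>> words asArray.    "#('äpfel' 'apfel' 'zebra') - ä sorts as ae, not after z"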
>> 
>> If a word is either a Utf8SortableString, or a well-formed Utf8String,
>> then we can perform equivalence testing on them trivially.
>> 
>> To make sure a Utf8String is well-formed, we would need to have a way
>> of cleaning up any convenience codepoints which were valid, but which
>> were for a character which has multiple equally-valid alternative
>> convenience codepoints, and for which the string currently had the
>> "wrong" convenience codepoint.  (I.e. for any character with valid
>> alternative convenience codepoints, we would choose one to be in the
>> well-formed Utf8String, and we would need a method for cleaning the
>> alternative convenience codepoints out of the string and replacing
>> them with the chosen, approved convenience codepoint.)
>> 
>> aUtf8String cleanUtf8String
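>>
>> One possible shape for that method, assuming the Utf8String sketched
>> above, codePoints / fromCodePoints: accessors, and a class-side
>> Dictionary approvedAlternatives mapping each alternative codepoint to
>> the chosen one - all of it hypothetical:
>>
>> Utf8String >> cleanUtf8String
>>     "Answer a copy in which every alternative convenience codepoint
>>      has been replaced by the approved one; all other codepoints are
>>      left untouched."
>>     ^ self class fromCodePoints:
>>         (self codePoints collect: [:cp |
>>             self class approvedAlternatives at: cp ifAbsent: [cp]])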
>> 
>> With WideString, a lot of the issues disappear - except round-tripping
>> (although I'm sure I have seen something recently about 4-byte strings
>> that also have an additional bit, which would make some Unicode
>> characters 5 bytes long).
>> 
>> 
>> (I'm starting to zone out now - if I've overlooked anything - obvious,
>> subtle, or somewhere in between, please let me know)
>> 
>> Cheers,
>>     Euan
> 
> 
