Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

2015-12-15 Thread H. Hirzel
On 12/15/15, Ben Coman  wrote:
> On Thu, Dec 10, 2015 at 12:37 AM, Todd Blanchard 
> wrote:
>> They are practically the same thing.
>>
>> ICU was developed by Taligent which was a joint venture between Apple and
>> IBM.  Makes sense that NSString and ICU's UnicodeString are pretty close
>> in implementation.  ICU was also ported to Java for Sun by IBM.  The point
>> is - this is a very elaborate chunk of code with far reach. If ICU is
>> wrong on some point - it is universally wrong and thus likely to be taken
>> as "right" as it is at least consistent.  I think re-implementing it is
>> folly TBH.  Just use it.
>
> Apple seem to have moved on from NSString to support Unicode in a
> different way in Swift...

Could you please give some more details?

I have read
https://www.objc.io/issues/9-strings/unicode/#nsstring-and-unicode
so far.

It says that an NSString object actually represents an array of
UTF-16-encoded code units.

This is in contrast to Squeak / Pharo, where a String is an
ArrayedCollection of 21-bit Unicode code points (transparently
optimizing to a ByteString if the string only contains values of the
first code page).
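
To make the difference concrete, here is a minimal sketch in Ruby (not
Smalltalk or Objective-C; the sample string is my own choice) that counts
the same text once as code points and once as UTF-16 code units:

```ruby
# "a", e-acute, and U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP).
s = "a\u00E9\u{1D11E}"

# One Integer per Unicode code point (the Squeak / Pharo view of a String).
code_points = s.codepoints

# One Integer per 16-bit code unit (the NSString view of a string).
utf16_units = s.encode("UTF-16BE").unpack("n*")

puts code_points.length   # 3
puts utf16_units.length   # 4, because the clef needs a surrogate pair

# The largest code point, U+10FFFF, fits in 21 bits.
puts 0x10FFFF.bit_length  # 21
```

So indexing by code unit and indexing by code point only agree as long as
the text stays inside the Basic Multilingual Plane.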


>>
>>> On Dec 8, 2015, at 15:52, EuanM  wrote:
>>>
>>> Equally old are the NeXTSTEP Objective-C functions which are now embodied
>>> within Mac OS X.
>>>
>>
>>
>
>



Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

2015-12-15 Thread Ben Coman
On Thu, Dec 10, 2015 at 12:37 AM, Todd Blanchard  wrote:
> They are practically the same thing.
>
> ICU was developed by Taligent which was a joint venture between Apple and 
> IBM.  Makes sense that NSString and ICU's UnicodeString are pretty close in 
> implementation.  ICU was also ported to Java for Sun by IBM.  The point is - 
> this is a very elaborate chunk of code with far reach. If ICU is wrong on 
> some point - it is universally wrong and thus likely to be taken as "right" 
> as it is at least consistent.  I think re-implementing it is folly TBH.  Just 
> use it.

Apple seem to have moved on from NSString to support Unicode in a
different way in Swift...

>
>> On Dec 8, 2015, at 15:52, EuanM  wrote:
>>
>> Equally old are the NeXTSTEP Objective-C functions which are now embodied
>> within Mac OS X.
>>
>
>



Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

2015-12-15 Thread Eliot Miranda
Hi All,

> On Dec 14, 2015, at 7:23 AM, Richard Sargent 
>  wrote:
> 
> EuanM wrote
>> Hi Todd, it's taken me til now to put my thoughts into words on this
>> issue.
>> 
>> I think we should make it work first.  This will allow us to gain more
>> insight into the issues, and create documentation about the process
>> that we, as a community, understand.
>> 
>> If ICU is the right way to go, we can *then* "make it right".
> 
> One way to cross the Grand Canyon is to climb down the wall, assemble the
> materials to build a bridge, build the bridge, cross it, then clamber up the
> wall. One would learn a lot about climbing and bridge building, worthwhile
> if the goal is to learn about those subjects.
> 
> Alternatively, you could use the highway bridge that was built long ago and
> just cross to the other side. This is definitely my inclination.
> 

Using ICU is a disaster, at least for the VM.  It means we cannot simulate an 
important part of the system without being able to call ICU from within the 
simulator.  It means bugs in Unicode can potentially cause hard crashes in C++ 
code, destroying the perfectly safe VM simulation environment /and/ the 
perfectly safe Smalltalk string library.  Goodbye debugger, hello gdb.

It means having a dependency on a large C++ library for a core part of the 
system, with all that entails about more complex builds, portability, etc.  

It means opacity in a core part of the system.

This move will be a terrible mistake.


>> On 8 December 2015 at 22:36, Todd Blanchard wrote:
>>> I just want to second Dale's endorsement of the ICU library.  It has been
>>> around a long time (originally developed by Taligent) and it provides the
>>> base unicode capabilities for an awful lot of software.
>>> 
>>> I think it would make more sense to bring icu into Smalltalk as a
>>> NativeBoost library than to spend resources reimplementing and
>>> maintaining
>>> it.
>>> 
>>> -Todd Blanchard
>>> 
>>> On Dec 8, 2015, at 11:20, Dale Henrichs wrote:
>>> 
>>> On 12/07/2015 11:31 PM, H. Hirzel wrote:
>>> 
>>> Dale
>>> 
>>> Thank you for your answer with links to the ICU library and the notes
>>> about classes in Gemstone. Noteworthy that you have a class Utf8 as a
>>> subclass of ByteArray.
>>> 
>>> I understand that Gemstone uses the ICU library and thus does not
>>> implement the algorithms in Smalltalk.
>>> 
>>> I am currently looking into what the  ICU  library provides.
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://forum.world.st/Sorting-Unicode-strings-Re-Unicode-collation-sequences-Re-squeak-dev-Unicode-Support-tp4865876p4866992.html
> Sent from the Pharo Smalltalk Developers mailing list archive at Nabble.com.
> 



Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

2015-12-15 Thread Eliot Miranda


> On Dec 14, 2015, at 7:23 AM, Richard Sargent 
>  wrote:
> 
> EuanM wrote
>> Hi Todd, it's taken me til now to put my thoughts into words on this
>> issue.
>> 
>> I think we should make it work first.  This will allow us to gain more
>> insight into the issues, and create documentation about the process
>> that we, as a community, understand.
>> 
>> If ICU is the right way to go, we can *then* "make it right".
> 
> One way to cross the Grand Canyon is to climb down the wall, assemble the
> materials to build a bridge, build the bridge, cross it, then clamber up the
> wall. One would learn a lot about climbing and bridge building, worthwhile
> if the goal is to learn about those subjects.
> 
> Alternatively, you could use the highway bridge that was built long ago and
> just cross to the other side. This is definitely my inclination.

And given that translating C++ to Smalltalk is doable since Smalltalk is a more 
powerful language, whereas going the other way is decidedly non-trivial, the 
right way to use the highway bridge is to point Moose at it, extract the plans 
for the bridge, and reconstruct it out of quality materials.

> 
> 
>> On 8 December 2015 at 22:36, Todd Blanchard wrote:
>>> I just want to second Dale's endorsement of the ICU library.  It has been
>>> around a long time (originally developed by Taligent) and it provides the
>>> base unicode capabilities for an awful lot of software.
>>> 
>>> I think it would make more sense to bring icu into Smalltalk as a
>>> NativeBoost library than to spend resources reimplementing and
>>> maintaining
>>> it.
>>> 
>>> -Todd Blanchard
>>> 
>>> On Dec 8, 2015, at 11:20, Dale Henrichs wrote:
>>> 
>>> On 12/07/2015 11:31 PM, H. Hirzel wrote:
>>> 
>>> Dale
>>> 
>>> Thank you for your answer with links to the ICU library and the notes
>>> about classes in Gemstone. Noteworthy that you have a class Utf8 as a
>>> subclass of ByteArray.
>>> 
>>> I understand that Gemstone uses the ICU library and thus does not
>>> implement the algorithms in Smalltalk.
>>> 
>>> I am currently looking into what the  ICU  library provides.



Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

2015-12-15 Thread Todd Blanchard
I wouldn't say that necessarily. There's a whole lot of API that is dependent 
on Objective-C - I suspect they have toll-free bridging in there, where 
Swift/Objective-C strings share enough common protocol to be fully 
substitutable, like with CFString.

Anyhow, Swift's step away from Smalltalk ideals and the added complexity of 
language bridging have me avoiding it.

Sent from the road

> On Dec 15, 2015, at 00:49, Ben Coman  wrote:
> 
> Apple seem to have moved on from NSString to support Unicode in a
> different way in Swift...



Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

2015-12-14 Thread Richard Sargent
EuanM wrote
> Hi Todd, it's taken me til now to put my thoughts into words on this
> issue.
> 
> I think we should make it work first.  This will allow us to gain more
> insight into the issues, and create documentation about the process
> that we, as a community, understand.
> 
> If ICU is the right way to go, we can *then* "make it right".

One way to cross the Grand Canyon is to climb down the wall, assemble the
materials to build a bridge, build the bridge, cross it, then clamber up the
wall. One would learn a lot about climbing and bridge building, worthwhile
if the goal is to learn about those subjects.

Alternatively, you could use the highway bridge that was built long ago and
just cross to the other side. This is definitely my inclination.



> On 8 December 2015 at 22:36, Todd Blanchard wrote:
>> I just want to second Dale's endorsement of the ICU library.  It has been
>> around a long time (originally developed by Taligent) and it provides the
>> base unicode capabilities for an awful lot of software.
>>
>> I think it would make more sense to bring icu into Smalltalk as a
>> NativeBoost library than to spend resources reimplementing and
>> maintaining
>> it.
>>
>> -Todd Blanchard
>>
>> On Dec 8, 2015, at 11:20, Dale Henrichs wrote:
>>
>> On 12/07/2015 11:31 PM, H. Hirzel wrote:
>>
>> Dale
>>
>> Thank you for your answer with links to the ICU library and the notes
>> about classes in Gemstone. Noteworthy that you have a class Utf8 as a
>> subclass of ByteArray.
>>
>> I understand that Gemstone uses the ICU library and thus does not
>> implement the algorithms in Smalltalk.
>>
>> I am currently looking into what the  ICU  library provides.
>>
>>








Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

2015-12-09 Thread Todd Blanchard
So we should start with that.

FWIW - ICU also provides time and date format internationalization and some 
other utilities. It is a very comprehensive library.

Sent from the road

> On Dec 9, 2015, at 07:03, H. Hirzel  wrote:
> 
>> On 12/8/15, Todd Blanchard  wrote:
>> I just want to second Dale's endorsement of the ICU library.  It has been
>> around a long time (originally developed by Taligent) and it provides the
>> base unicode capabilities for an awful lot of software.
>> 
>> I think it would make more sense to bring icu into Smalltalk as a
>> NativeBoost library than to spend resources reimplementing and maintaining
>> it.
>> 
>> -Todd Blanchard
> 
> ICU already seems to be available for Smalltalk accessed through wrappers
> 
> http://wiki.squeak.org/squeak/6234
> 
> --Hannes
> 
>>> On Dec 8, 2015, at 11:20, Dale Henrichs 
>>> wrote:
>>> 
>>>> On 12/07/2015 11:31 PM, H. Hirzel wrote:
>>>> Dale
>>>> 
>>>> Thank you for your answer with links to the ICU library and the notes
>>>> about classes in Gemstone. Noteworthy that you have a class Utf8 as a
>>>> subclass of ByteArray.
>>>> 
>>>> I understand that Gemstone uses the ICU library and thus does not
>>>> implement the algorithms in Smalltalk.
>>>> 
>>>> I am currently looking into what the ICU library provides.
>> 
>> 



Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

2015-12-08 Thread EuanM
Dale - is it that you can't depend on the value of a code point
*unless the string is either in fully-composed form
(or has just been fully decomposed from a fully-composed form)*?

Or are there circumstances where even those two cases cannot be relied upon?

On 8 December 2015 at 19:20, Dale Henrichs
 wrote:
>
>
> On 12/07/2015 11:31 PM, H. Hirzel wrote:
>>
>> Dale
>>
>> Thank you for your answer with links to the ICU library and the notes
>> about classes in Gemstone. Noteworthy that you have a class Utf8 as a
>> subclass of ByteArray.
>>
>> I understand that Gemstone uses the ICU library and thus does not
>> implement the algorithms in Smalltalk.
>>
>> I am currently looking into what the  ICU  library provides.
>>
>> I found as well a Ruby library [2] which implements CLDR [3]
>>
>> It has methods like this
>>
>> "Alphabetize a list using regular Ruby sort:"
>>
>> $> ["Art", "Wasa", "Älg", "Ved"].sort
>> $> ["Art", "Ved", "Wasa", "Älg"]
>>
>> Alphabetize a list using TwitterCLDR’s locale-aware sort:
>>
>> $> ["Art", "Wasa", "Älg", "Ved"].localize(:de).sort.to_a
>> $> ["Älg", "Art", "Ved", "Wasa"]
>>
>> I hope that given such an example it would not be too difficult to
>> reimplement a similar sort algorithm in Squeak/Cuis/Pharo. Currently
>> the interest is in getting sorting done in a cross-dialect-way.
>>
>
> I think that the issue (from a performance perspective) is that you can't
> depend upon the value of the code point when doing collation --- the main
> algorithm[5] is pretty much table based --- In addition to the different
> sort orders based on characters there are even more arcane sort rules where
> characters at the end of a word can affect the sort order of the word (for
> more info see[4]).
>
> It is worth looking at the Conformance section of the Unicode spec[1] as
> there are different levels of collation conformance.
>
> ICU conforms[2] to UTS #10[3], the highest level of conformance ...
>
> It looks like TwitterCLDR[6] uses the Main Algorithm[5] with tailoring[7].
> They don't claim to be conformant to the Unicode Collation Algorithm[3], but
> they are covering a big chunk of the standard use cases.
>
> Dale
>
> [1] http://unicode.org/reports/tr10/#Conformance
> [2] http://userguide.icu-project.org/collation
> [3] http://www.unicode.org/reports/tr10/
> [4] http://www.unicode.org/reports/tr10/#Introduction
> [5] http://www.unicode.org/reports/tr10/#Main_Algorithm
> [6]
> https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
> [7] http://unicode.org/reports/tr10/#Tailoring



Re: [Pharo-dev] [Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

2015-12-08 Thread Dale Henrichs

Euan,

What I meant is that you can't _always_ use the code point for 
collation, i.e., sorting based on the value of code points is not always 
correct[1].
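
A minimal sketch of the problem, using the word list from Hannes' Ruby
example plus two words of my own ("apple", "Zebra"). It sorts purely by
code point value, with no collation table involved:

```ruby
words = ["Art", "Wasa", "Älg", "Ved"]

# Compare strings purely by the numeric values of their code points.
by_code_point = words.sort_by { |w| w.codepoints }
p by_code_point          # ["Art", "Ved", "Wasa", "Älg"]

# U+00C4 (Ä) is 196, larger than any ASCII letter, so "Älg" lands last
# even though a German reader expects it next to "Art".
p "Ä".ord                # 196

# Case is a problem too: every uppercase ASCII letter precedes every
# lowercase one, so "Zebra" sorts before "apple".
p ["apple", "Zebra"].sort_by { |w| w.codepoints }   # ["Zebra", "apple"]
```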


If I'm not mistaken, the fully-composed and fully-decomposed forms can 
only be used for testing the equivalence of two strings[2] ...
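
As a sketch of what I mean (Ruby again, using its built-in
String#unicode_normalize; the variable names are just for illustration):
normalization makes the two spellings of "Älg" compare equal, but it does
nothing for the sort order.

```ruby
composed   = "\u00C4lg"    # Ä as the single code point U+00C4
decomposed = "A\u0308lg"   # A followed by U+0308 COMBINING DIAERESIS

# The raw strings differ even though they render identically.
p composed == decomposed                 # false

# After normalizing both to the same form they are equal.
nfc_equal = composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)
p nfc_equal                              # true

# The decomposed form is A (65), diaeresis (776), l (108), g (103).
decomposed_points = composed.unicode_normalize(:nfd).codepoints
p decomposed_points                      # [65, 776, 108, 103]

# Code point order is still wrong after normalization: the diaeresis (776)
# is compared against "r" (114), so "Älg" sorts after "Art".
nfd_vs_art = composed.unicode_normalize(:nfd) <=> "Art"
p nfd_vs_art                             # 1
```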


The Main Algorithm[3] starts by producing a normalized form of the 
string, but the subsequent steps (produce array, form sort key and 
compare) involve table lookups, among other things.


Once you've produced a sort key for a string, collating does use 
"binary comparison" of the sort keys, which is a byte-by-byte numeric 
comparison ...
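
Here is a toy sketch of those steps in Ruby. To be clear about the
assumptions: the weight table below is made up for illustration (it is not
the DUCET data and handles only ASCII letters plus the combining
diaeresis), it uses two levels instead of the usual three or more, and the
names weights and sort_key are hypothetical. It only shows the shape of
"normalize, look up weights, form key, compare bytes":

```ruby
# Toy collation element table: returns [primary, secondary] for a character.
# These weights are invented for this sketch, not taken from Unicode data.
def weights(char)
  case char
  when /\A[A-Za-z]\z/
    [char.downcase.ord - "a".ord + 1, 1]  # primary = position in alphabet
  when "\u0308"
    [0, 2]                                # diaeresis: ignored at primary level
  else
    raise ArgumentError, "no weights for #{char.inspect} in this toy table"
  end
end

def sort_key(str)
  # Step 1: normalize, so Ä becomes A + combining diaeresis.
  elements = str.unicode_normalize(:nfd).each_char.map { |c| weights(c) }
  # Step 2: collect nonzero primary weights, then secondary weights.
  primary   = elements.map(&:first).reject(&:zero?)
  secondary = elements.map(&:last)
  # Step 3: join the levels with a 0 separator and pack into a byte string.
  (primary + [0] + secondary).pack("C*")
end

# Step 4: sort keys are compared byte by byte (String#<=> on binary strings).
sorted = ["Art", "Wasa", "Älg", "Ved"].sort_by { |w| sort_key(w) }
p sorted   # ["Älg", "Art", "Ved", "Wasa"]
```

The diaeresis only matters at the secondary level, so "Älg" sorts with the
A words and is separated from a plain "Alg" only as a tie-breaker.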


Dale

[1] http://www.unicode.org/reports/tr10/#Common_Misperceptions
[2] http://unicode.org/reports/tr15/pdtr15.html
[3] http://www.unicode.org/reports/tr10/#Main_Algorithm

On 12/08/2015 12:22 PM, EuanM wrote:

> Dale - is it that you can't depend on the value of a code point
> *unless the string is either in fully-composed form
> (or has just been fully decomposed from a fully-composed form)*?
>
> Or are there circumstances where even those two cases cannot be relied upon?
>
> On 8 December 2015 at 19:20, Dale Henrichs wrote:
>>
>> On 12/07/2015 11:31 PM, H. Hirzel wrote:
>>>
>>> Dale
>>>
>>> Thank you for your answer with links to the ICU library and the notes
>>> about classes in Gemstone. Noteworthy that you have a class Utf8 as a
>>> subclass of ByteArray.
>>>
>>> I understand that Gemstone uses the ICU library and thus does not
>>> implement the algorithms in Smalltalk.
>>>
>>> I am currently looking into what the ICU library provides.
>>>
>>> I found as well a Ruby library [2] which implements CLDR [3]
>>>
>>> It has methods like this
>>>
>>> "Alphabetize a list using regular Ruby sort:"
>>>
>>> $> ["Art", "Wasa", "Älg", "Ved"].sort
>>> $> ["Art", "Ved", "Wasa", "Älg"]
>>>
>>> Alphabetize a list using TwitterCLDR’s locale-aware sort:
>>>
>>> $> ["Art", "Wasa", "Älg", "Ved"].localize(:de).sort.to_a
>>> $> ["Älg", "Art", "Ved", "Wasa"]
>>>
>>> I hope that given such an example it would not be too difficult to
>>> reimplement a similar sort algorithm in Squeak/Cuis/Pharo. Currently
>>> the interest is in getting sorting done in a cross-dialect-way.
>>>
>>
>> I think that the issue (from a performance perspective) is that you can't
>> depend upon the value of the code point when doing collation --- the main
>> algorithm[5] is pretty much table based --- In addition to the different
>> sort orders based on characters there are even more arcane sort rules where
>> characters at the end of a word can affect the sort order of the word (for
>> more info see[4]).
>>
>> It is worth looking at the Conformance section of the Unicode spec[1] as
>> there are different levels of collation conformance.
>>
>> ICU conforms[2] to UTS #10[3], the highest level of conformance ...
>>
>> It looks like TwitterCLDR[6] uses the Main Algorithm[5] with tailoring[7].
>> They don't claim to be conformant to the Unicode Collation Algorithm[3], but
>> they are covering a big chunk of the standard use cases.
>>
>> Dale
>>
>> [1] http://unicode.org/reports/tr10/#Conformance
>> [2] http://userguide.icu-project.org/collation
>> [3] http://www.unicode.org/reports/tr10/
>> [4] http://www.unicode.org/reports/tr10/#Introduction
>> [5] http://www.unicode.org/reports/tr10/#Main_Algorithm
>> [6] https://blog.twitter.com/2012/twittercldr-improving-internationalization-support-in-ruby
>> [7] http://unicode.org/reports/tr10/#Tailoring