Collation (was RE: [OT] o-circumflex)

2001-09-13 Thread Edward Cherlin

English and several other languages have dozens of collations. Compare telephone 
books, library catalogs, book indexes (sic), and other sorted data. Knuth vol. 3 
Sorting and Searching gives an example of a set of library sorting rules that runs to 
more than a page, and suggests programming it as an exercise. ;-) Among the rules are 
to spell out numbers. 
For example,

1984 (Nineteen Eighty Four)
1066 and all that (Ten Sixty Six)
3001 (Three Thousand One)
2050 (Twenty Fifty)
2010 (Twenty Ten)
2001, A Space Odyssey (Two Thousand One)

Bell Labs invented a whole programming language, Snobol, to deal with telephone 
listing conversions, matches, and sorts. Many phone books sort Mc- and Mac- together, 
others one after the other but separate from other names.
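
To make the Mc-/Mac- rule concrete, here is a minimal sketch of the "sort them together" variant. The class name, the sortKey helper, and the sample surnames are all made up for illustration, and a real directory has many more rules than this one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class PhoneBookSort {
    // Hypothetical sort key: lower-case the name and fold a leading "Mc"
    // into "Mac" so that both spellings file in the same place.
    static String sortKey(String surname) {
        String lower = surname.toLowerCase(Locale.ROOT);
        if (lower.startsWith("mc")) {
            return "mac" + lower.substring(2);
        }
        return lower;
    }

    // Returns a sorted copy; ties on the key fall back to plain string order.
    static List<String> sortSurnames(List<String> names) {
        Comparator<String> byKey = Comparator.comparing(PhoneBookSort::sortKey);
        Comparator<String> order = byKey.thenComparing(Comparator.naturalOrder());
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(order);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "Mackenzie", "McAdam", "Mabry", "MacDonald", "McDonald", "Madison");
        // prints [Mabry, McAdam, MacDonald, McDonald, Mackenzie, Madison]
        System.out.println(sortSurnames(names));
    }
}
```

The other convention mentioned above (Mc- and Mac- one after the other, apart from the remaining names) would need a different key, which is exactly why one language ends up with several collations.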

Edward Cherlin
Generalist
"A knot! Oh, do let me help to undo it."
Alice in Wonderland


> -----Original Message-----
> Behalf Of Michael (michka) Kaplan
> Sent: Mon, September 10, 2001 8:36 AM
>
> From: Mark Davis [EMAIL PROTECTED]
>
> > Michael, that isn't the point. There is a problem even when you stick to
> > one language.
>
> By that time, many languages may have TWO collations, since users have been
> expecting something else for the last few decades?
>
> MichKa
>
> Michael Kaplan
> Trigeminal Software, Inc.
> http://www.trigeminal.com/
 
 
 





Collation (was RE: [OT] o-circumflex)

2001-09-13 Thread てんどうりゅうじ

Whoever invented English number words, then, had a very sick sense of humour. Why 
doesn't the word for "one" start with "a", the word for "two" with "b", etc.?


じゅういっちゃん (Juuitchan)
Well, I guess what you say is true,
I could never be the right kind of girl for you,
I could never be your woman
  - White Town







Re: PDUTR #26 posted

2001-09-13 Thread Marcin 'Qrczak' Kowalczyk

On Wed, 12 Sep 2001 11:08:41 -0700, Julie Doll Allen [EMAIL PROTECTED] wrote:

> Proposed Draft Unicode Technical Report #26: Compatibility Encoding
> Scheme for UTF-16: 8-Bit (CESU-8) is now available at:
> http://www.unicode.org/unicode/reports/tr26/

IMHO Unicode would have been a better standard if UTF-16
hadn't existed. Just UTF-8 and UTF-32, code points in the range
U+..7FFF, no surrogates, no confusion about "how many bits is
Unicode", an ASCII-compatible encoding in most external transmissions,
uniform width for internal processing, and practically no byte
ordering issues. Much simpler.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK





Re: Collation (was RE: [OT] o-circumflex)

2001-09-13 Thread David Gallardo

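Java's collation class has a rule-based collator that is in effect
programmable using a little language. Here is an example from Sun's API
doc for Norwegian:

String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J" +
                   "< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T" +
                   "< u,U< v,V< w,W< x,X< y,Y< z,Z" +
                   "< \u00E5=a\u030A,\u00C5=A\u030A" +      // å, Å and their decomposed forms
                   ";aa,AA< \u00E6,\u00C6< \u00F8,\u00D8";  // aa, AA, then æ, Æ and ø, Ø
// The constructor throws java.text.ParseException if the rules are malformed.
RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);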

There is also syntax for things such as specifying reverse order (for French
accents for example), contraction and expansion.
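
To show how such a rule string is used once it is built, here is a minimal sketch with a deliberately tiny rule set. The three-letter rules, the class name, and the sample words are invented for illustration; only RuleBasedCollator itself comes from java.text.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TinyRuleCollation {
    // Hypothetical rule set: '<' introduces a primary difference, so these
    // rules deliberately order c before b before a.
    static final String RULES = "< c < b < a";

    // Returns a copy of the words sorted with the custom collator.
    static List<String> sortWithRules(List<String> words) {
        RuleBasedCollator collator;
        try {
            collator = new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            // Malformed rules are a programming error in this sketch.
            throw new IllegalArgumentException("bad collation rules", e);
        }
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator); // a Collator is a Comparator<Object>
        return sorted;
    }

    public static void main(String[] args) {
        // prints [ca, bc, ab]
        System.out.println(sortWithRules(Arrays.asList("ab", "ca", "bc")));
    }
}
```

Because the collator is an ordinary Comparator, it plugs into List.sort, TreeMap, and the like without further glue.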

- David Gallardo









Re: Collation (was RE: [OT] o-circumflex)

2001-09-13 Thread Mark Davis

In the latest ICU, we took the work we did for Java collation and extended
it substantially (and made it many times faster). It also allows arbitrary
customization at runtime.

I happen to be giving a presentation on it in a few hours at the conference.
For more information, see the draft collation chapter in the User guide, at
http://oss.software.ibm.com/icu/. The presentation (a slightly older draft)
is on my site at www.macchiato.com

Mark
—

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Όμήρου Μαργίτῃ
[http://www.macchiato.com]
 
 









Re: What code point is assigned for the Newton unit?

2001-09-13 Thread Asmus Freytag

Your letter makes clear that Unicode needs to do a better job of 
identifying the preferred character code for many situations. The 
information is there to a large extent, but buried in the fine print or in 
data tables.

You will see that there is a canonical decomposition from U+212B to U+00C5.
This means that once people use Normalization in a widespread fashion, it 
will become practically impossible to maintain a distinction between these 
two codes.
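
The effect is easy to see with a normalizer. The following is a small sketch using java.text.Normalizer from the JDK; the class and helper names are made up for illustration.

```java
import java.text.Normalizer;

public class AngstromNormalization {
    // Canonical composition (NFC).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Canonical decomposition (NFD).
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String angstromSign = "\u212B"; // ANGSTROM SIGN
        String aWithRing = "\u00C5";    // LATIN CAPITAL LETTER A WITH RING ABOVE

        // The raw code points differ.
        System.out.println(angstromSign.equals(aWithRing));      // prints false
        // After NFC the Angstrom sign has become U+00C5.
        System.out.println(nfc(angstromSign).equals(aWithRing)); // prints true
        // After NFD both are A followed by U+030A COMBINING RING ABOVE.
        System.out.println(nfd(angstromSign).equals("A\u030A")); // prints true
        System.out.println(nfd(aWithRing).equals("A\u030A"));    // prints true
    }
}
```

Any text that passes through either form therefore loses the distinction between the two codes.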

The inclusion of the U+212B is due to historic reasons.

Many other characters have been included in Unicode over the years for 
legitimate purposes as compatibility characters (to allow round trip 
conversion to/from important legacy character sets).

These have all been given compatibility decompositions.

Unfortunately, many characters that have legitimate uses in a legacy-free 
environment, have also been given compatibility mappings at some time. This 
makes it very hard to use this information in its current form to identify 
cases when a distinction between characters should be kept or when not.

There is some very explicit guidance, however, in Unicode TR#20 (Unicode and
XML). The information there is readily applicable to other environments, if 
you pay attention to the rationale for each recommendation and evaluate 
whether it applies in your specific case.

A./

PS:

> Ångström is spelled wrong on the code charts at Unicode's home page, BTW.

Can you cite the page number and approximate location on the page? (Please 
send this information to me and [EMAIL PROTECTED], not to the whole list.)





Re: PDUTR #26 posted

2001-09-13 Thread Asmus Freytag

At 11:42 AM 9/13/01 +, Marcin 'Qrczak' Kowalczyk wrote:
> IMHO Unicode would have been a better standard if UTF-16
> hadn't existed.

Decidedly not. In fact, Unicode would not be widely implemented today.

> Just UTF-8 and UTF-32, code points in the range
> U+..7FFF, no surrogates, no confusion about "how many bits is
> Unicode", an ASCII-compatible encoding in most external transmissions,
> uniform width for internal processing, and practically no byte
> ordering issues. Much simpler.

UTF-32 does have the same byte order issues as UTF-16, except that byte 
order is recognizable without a BOM.

The reason that is possible is the reason why UTF-16 has its place. 1/4 
of all bytes in UTF-32 are always and redundantly 0x00. To make matters 
worse, the next 1/4 of all bytes is redundantly 0x00 as well, except for a 
minuscule portion of all data (granted, this proportion can be higher for 
some specific documents or corpora).

Since you speak of internal processing: One software architect I spoke with 
brought this to a nice point: "With UTF-16 I can put twice the data in my 
in-memory hash table and have *on average* the same 1:1 character code:code 
point characteristics for processing. That's a win-win."

Using UTF-32 the same system would have to use double the memory, or face 
twice the rate of memory-fault page operations, and still, because of the 
way scripts work, there are many operations that need to look at more than 
one character code at a time even in UTF-32.

UTF-8, while even more compressed for European data (it's 50% larger than 
UTF-16 for ideographs), uses multi-code element encoding for all but ASCII, 
which is why it's useful primarily for external data that are rich in ASCII 
(like HTML etc.). Since most operations are perforce exposed to its 
variable length, unlike UTF-16 processing, which can be optimized for the 
much more frequent 1-unit case, UTF-8 cannot as readily be used as an 
internal format.
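
The size trade-offs above can be checked with a few lines of code. This is only a sketch: the class name and the two sample strings are made up, and the UTF-32 size is computed as four bytes per code point rather than by calling an encoder.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Bytes needed in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed in UTF-16 (big-endian, so no BOM is written).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Bytes needed in UTF-32: always four per code point.
    static int utf32Bytes(String s) {
        return 4 * s.codePointCount(0, s.length());
    }

    static void report(String label, String s) {
        System.out.println(label + ": UTF-8=" + utf8Bytes(s)
            + " UTF-16=" + utf16Bytes(s) + " UTF-32=" + utf32Bytes(s));
    }

    public static void main(String[] args) {
        // prints ASCII: UTF-8=7 UTF-16=14 UTF-32=28
        report("ASCII", "Unicode");
        // Three ideographs (U+65E5 U+672C U+8A9E).
        // prints CJK: UTF-8=9 UTF-16=6 UTF-32=12
        report("CJK", "\u65E5\u672C\u8A9E");
    }
}
```

For the ideographic sample UTF-8 is 50% larger than UTF-16, and UTF-32 is twice the size of UTF-16 in both samples, which is the memory argument made above.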

Unicode limited to UTF-8 and UTF-32 would be a lot less attractive and you 
would not have seen it implemented in Windows, Office and other high volume 
platforms as early and as widely as it has been.

A./




Re: Alternative sorting for digraphs (Was Re: [OT] o-circumflex)

2001-09-13 Thread Roozbeh Pournader

On Mon, 10 Sep 2001, Mark Davis wrote:

> A ZWNJ will break ligatures and cursive connections. While probably safe in
> Danish or Dutch, it is unclear to me that that is safe in all languages
> where this situation occurs. There are digraphs in Urdu, for example. While
> I don't know their sorting order, if they do sort separately then ZWNJ can't
> be used to express the alternative sorting, since it would give the wrong
> rendering.

:'-(

I would like to ask that we stop overusing ZWNJ. I once loved that
character... What about *renaming* the character to Zero Width
All-Purpose Everything Breaker?

roozbeh