Collation (was RE: [OT] o-circumflex)
English and several other languages have dozens of collations. Compare telephone books, library catalogs, book indexes (sic), and other sorted data. Knuth, vol. 3, Sorting and Searching, gives an example of a set of library sorting rules that runs to more than a page, and suggests programming it as an exercise. ;-)

Among the rules is one to spell out numbers. For example,

1984 (Nineteen Eighty Four)
1066 and all that (Ten Sixty Six)
3001 (Three Thousand One)
2050 (Twenty Fifty)
2010 (Twenty Ten)
2001, A Space Odyssey (Two Thousand One)

Bell Labs invented a whole programming language, Snobol, to deal with telephone listing conversions, matches, and sorts. Many phone books sort Mc- and Mac- together; others sort them one after the other but separate from other names.

Edward Cherlin
Generalist
"A knot! Oh, do let me help to undo it."
Alice in Wonderland

-----Original Message-----
Behalf Of Michael (michka) Kaplan
Sent: Mon, September 10, 2001 8:36 AM

From: "Mark Davis" [EMAIL PROTECTED]

> Michael, that isn't the point. There is a problem even when you stick to one language.

By that time, many languages may have TWO collations, since users have been expecting something else for the last few decades?

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/
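The two filing rules mentioned in this message (spelling out numbers, and sorting Mc- with Mac-) can be sketched as sort keys. This is only a minimal illustration, not the rule set from Knuth: the lookup table `SPOKEN` and the function names `filing_key` and `name_key` are made up for this example, and the table holds only the spoken forms given in the message.

```python
import re

# Spoken forms copied from the examples in the message. A real rule set
# would need a number-to-words routine instead of a lookup table.
SPOKEN = {
    "1984": "Nineteen Eighty Four",
    "1066": "Ten Sixty Six",
    "3001": "Three Thousand One",
    "2050": "Twenty Fifty",
    "2010": "Twenty Ten",
    "2001": "Two Thousand One",
}


def filing_key(title):
    """Sort key that replaces a leading number with its spoken form."""
    match = re.match(r"\d+", title)
    if match and match.group() in SPOKEN:
        title = SPOKEN[match.group()] + title[match.end():]
    return title.casefold()


def name_key(surname):
    """Sort key that files a leading 'Mc' as if it were spelled 'Mac'."""
    return re.sub(r"^Mc", "Mac", surname).casefold()


titles = ["2001, A Space Odyssey", "2010", "1066 and all that",
          "3001", "1984", "2050"]
filed_titles = sorted(titles, key=filing_key)

names = ["Madison", "McDonald", "Mabry", "MacArthur"]
filed_names = sorted(names, key=name_key)
```

With these keys the titles come out in the same order as the list in the message, and McDonald files between MacArthur and Madison rather than after it.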
Collation (was RE: [OT] o-circumflex)
Whoever invented English number words, then, had a very sick sense of humour. Why doesn't the word for "one" start with "a", the word for "two" with "b", etc.?

じゅういっちゃん (Juuitchan)

Well, I guess what you say is true, I could never be the right kind of girl for you, I could never be your woman - White Town

--- Original Message ---
From: Edward Cherlin [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc:
Date: 01/09/13 7:40
Subject: Collation (was RE: [OT] o-circumflex)
Re: PDUTR #26 posted
Wed, 12 Sep 2001 11:08:41 -0700, Julie Doll Allen [EMAIL PROTECTED] writes:

> Proposed Draft Unicode Technical Report #26: Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is now available at:
> http://www.unicode.org/unicode/reports/tr26/

IMHO Unicode would have been a better standard if UTF-16 hadn't existed. Just UTF-8 and UTF-32, code points in the range U+..7FFF, no surrogates, no confusion about "how many bits is Unicode", an ASCII-compatible encoding in most external transmissions, uniform width for internal processing, and practically no byte ordering issues. Much simpler.

--
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^   SYGNATURA ZASTĘPCZA
QRCZAK
Re: Collation (was RE: [OT] o-circumflex)
Java's collation class has a rule-based collator that is in effect programmable using a little language. Here is an example from Sun's API doc for Norwegian:

    // å is given as a + combining ring above (\u030A); aa files with it.
    String Norwegian =
        "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J" +
        "< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T" +
        "< u,U< v,V< w,W< x,X< y,Y< z,Z" +
        "< \u00E5=a\u030A,\u00C5=A\u030A" +
        ";aa,AA< \u00E6,\u00C6< \u00F8,\u00D8";
    // The constructor throws java.text.ParseException on malformed rules.
    RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);

There is also syntax for things such as specifying reverse order (for French accents, for example), contraction, and expansion.

- David Gallardo

----- Original Message -----
From: Edward Cherlin [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, September 13, 2001 3:40 AM
Subject: Collation (was RE: [OT] o-circumflex)
Re: Collation (was RE: [OT] o-circumflex)
In the latest ICU, we took the work we did for Java collation and extended it substantially (and made it many times faster). It also allows arbitrary customization at runtime. I happen to be giving a presentation on it in a few hours at the conference.

For more information, see the draft collation chapter in the User Guide at http://oss.software.ibm.com/icu/. The presentation (a slightly older draft) is on my site at www.macchiato.com.

Mark

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα - Όμήρου Μαργίτῃ
[http://www.macchiato.com]

----- Original Message -----
From: David Gallardo [EMAIL PROTECTED]
To: Edward Cherlin [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Thursday, September 13, 2001 8:35 AM
Subject: Re: Collation (was RE: [OT] o-circumflex)
Re: What code point is assigned for the Newton unit?
Your letter makes clear that Unicode needs to do a better job of identifying the preferred character code for many situations. The information is there to a large extent, but buried in the fine print or in data tables.

You will see that there is a canonical decomposition from U+212B to U+00C5. This means that once people use Normalization in a widespread fashion, it will become practically impossible to maintain a distinction between these two codes. The inclusion of U+212B is due to historical reasons.

Many other characters have been included in Unicode over the years for legitimate purposes as compatibility characters (to allow round-trip conversion to and from important legacy character sets). These have all been given compatibility decompositions. Unfortunately, many characters that have legitimate uses in a legacy-free environment have also been given compatibility mappings at some time. This makes it very hard to use this information in its current form to decide when a distinction between characters should be kept and when it should not.

There is some very explicit guidance, however, in Unicode TR#20 (Unicode and XML). The information there is readily applicable to other environments, if you pay attention to the rationale for each recommendation and evaluate whether it applies in your specific case.

A./

> PS: Ångström is spelled wrong on the code charts at Unicode's home page, BTW.

Can you cite the page number and approximate location on the page? (Please send this information to me and [EMAIL PROTECTED], not to the whole list.)
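The effect of the canonical decomposition from U+212B to U+00C5 can be checked with Python's standard `unicodedata` module. This is a small sketch, not part of the original message; the helper name `survives_normalization` is made up here, and U+FB01 (the "fi" ligature) is chosen only as a familiar example of a compatibility decomposition.

```python
import unicodedata

ANGSTROM_SIGN = "\u212b"   # ANGSTROM SIGN
A_WITH_RING = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
FI_LIGATURE = "\ufb01"     # LATIN SMALL LIGATURE FI (compatibility character)


def survives_normalization(ch, form="NFC"):
    """True if normalizing `ch` to `form` leaves it unchanged."""
    return unicodedata.normalize(form, ch) == ch


# Canonical mapping: the data field has no <tag>, just the target code point.
canonical_mapping = unicodedata.decomposition(ANGSTROM_SIGN)   # "00C5"

# Compatibility mapping: the field starts with a tag such as <compat>.
compat_mapping = unicodedata.decomposition(FI_LIGATURE)        # "<compat> 0066 0069"

# After NFC the Angstrom sign is indistinguishable from U+00C5.
normalized = unicodedata.normalize("NFC", ANGSTROM_SIGN)
```

Because the mapping is canonical, every normalization form replaces U+212B, whereas a compatibility character such as U+FB01 survives NFC and NFD and is only folded away by NFKC and NFKD.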
Re: PDUTR #26 posted
At 11:42 AM 9/13/01 +, Marcin 'Qrczak' Kowalczyk wrote:

> IMHO Unicode would have been a better standard if UTF-16 hadn't existed.

Decidedly not. In fact, Unicode would not be widely implemented today.

> Just UTF-8 and UTF-32, code points in the range U+..7FFF, no surrogates, no confusion about how many bits is Unicode, an ASCII-compatible encoding in most external transmissions, uniform width for internal processing, and practically no byte ordering issues. Much simpler.

UTF-32 does have the same byte order issues as UTF-16, except that byte order is recognizable without a BOM. The reason that is possible is also the reason why UTF-16 has its place: 1/4 of all bytes in UTF-32 are always and redundantly 0x00. To make matters worse, the next 1/4 of all bytes is redundantly 0x00 as well, except for a minuscule portion of all data (granted, this proportion can be higher for some specific documents or corpora).

Since you speak of internal processing: one software architect I spoke with brought this to a nice point: With UTF-16 I can put twice the data in my in-memory hash table and have *on average* the same 1:1 character code:code point characteristics for processing. That's a win-win. Using UTF-32, the same system would have to use double the memory, or face twice the rate of memory-fault page operations. And still, because of the way scripts work, there are many operations that need to look at more than one character code at a time even in UTF-32.

UTF-8, while even more compressed for European data (it's 50% larger than UTF-16 for ideographs), uses multi-code-element encoding for all but ASCII, which is why it's useful primarily for external data that are rich in ASCII (like HTML etc.). Most operations on UTF-8 are perforce exposed to its variable length, whereas UTF-16 processing can be optimized for the much more frequent 1-unit case, so UTF-8 cannot as readily be used as an internal format.

Unicode limited to UTF-8 and UTF-32 would be a lot less attractive, and you would not have seen it implemented in Windows, Office and other high-volume platforms as early and as widely as it has been.

A./
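The size arguments in this message are easy to check with a few lines of Python. The sketch below is an illustration only: the function names `encoded_sizes` and `guess_utf32_byte_order` and the sample strings are made up for this example, and the byte-order guess is a heuristic built on the always-zero top byte described above, not a standard algorithm.

```python
def encoded_sizes(text):
    """Byte length of `text` in each encoding form, without a BOM."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-be")),
        "UTF-32": len(text.encode("utf-32-be")),
    }


def guess_utf32_byte_order(data):
    """Guess the byte order of BOM-less UTF-32 from its zero top byte.

    Returns "big", "little", or None when the data is ambiguous.
    """
    if not data or len(data) % 4:
        return None
    first_zero = not any(data[0::4])  # big-endian: top byte comes first
    last_zero = not any(data[3::4])   # little-endian: top byte comes last
    if first_zero and not last_zero:
        return "big"
    if last_zero and not first_zero:
        return "little"
    return None


ascii_sizes = encoded_sizes("hello")
# Three CJK ideographs: U+65E5 U+672C U+8A9E
ideograph_sizes = encoded_sizes("\u65e5\u672c\u8a9e")
# One supplementary character, which needs a surrogate pair in UTF-16
supplementary_sizes = encoded_sizes("\U00010400")
```

For the three ideographs this gives 9 bytes in UTF-8 against 6 in UTF-16 (the 50% figure above) and 12 in UTF-32, half of which are zero.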
Re: Alternative sorting for digraphs (Was Re: [OT] o-circumflex)
On Mon, 10 Sep 2001, Mark Davis wrote:

> A ZWNJ will break ligatures and cursive connections. While probably safe in Danish or Dutch, it is unclear to me that that is safe in all languages where this situation occurs. There are digraphs in Urdu, for example. While I don't know their sorting order, if they do sort separately then ZWNJ can't be used to express the alternative sorting, since it would give the wrong rendering.

:'-(

I would like to ask that we stop overusing ZWNJ. I once loved that character...

What about *renaming* the character to Zero Width All-Purpose Everything Breaker?

roozbeh
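For readers who have not followed the whole thread, the proposal being objected to can be sketched as a sort key in which a ZWNJ between two letters stops them from being filed as a digraph. Everything here is an assumption made for illustration: the function name `digraph_key`, the alphabet order (a to z, then æ, ø, å), and the rule that `aa` files with å (as in the Norwegian rules quoted earlier in this thread). The sketch shows only the sorting side; the rendering side effects that the message complains about are exactly what it does not capture.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Assumed alphabet order for this sketch: a-z followed by ae, o-slash, a-ring.
ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5"
A_RING_RANK = ALPHABET.index("\u00e5")


def digraph_key(word):
    """Sort key that files 'aa' with a-ring unless a ZWNJ separates the letters."""
    text = word.casefold()
    key = []
    i = 0
    while i < len(text):
        if text.startswith("aa", i):
            key.append(A_RING_RANK)   # digraph: treat as a single letter
            i += 2
        elif text[i] == ZWNJ:
            i += 1                    # the ZWNJ itself carries no sort weight
        else:
            rank = ALPHABET.find(text[i])
            # Characters outside the assumed alphabet sort after it.
            key.append(rank if rank >= 0 else len(ALPHABET) + ord(text[i]))
            i += 1
    return tuple(key)


words = ["zebra", "aa", "a" + ZWNJ + "a", "ab"]
filed = sorted(words, key=digraph_key)
```

Here the plain `aa` files after `zebra`, while the ZWNJ-separated form files as two ordinary letters before `ab`.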