----- Original Message -----
From: "Dylan Thurston" <[EMAIL PROTECTED]>
To: "Andrew J Bromage" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Friday, October 05, 2001 6:00 PM
Subject: Re: UniCode
> On Fri, Oct 05, 2001 at 11:23:50PM +1000, Andrew J Bromage wrote:
> > G'day all.
> >
> > On Fri, Oct 05, 2001 at 02:29:51AM -0700, Krasimir Angelov wrote:
> >
> > > Why is Char 32 bits? Unicode characters are 16 bits.
> >
> > It's not quite as simple as that. There is a set of one million
> > (more correctly, 1M) Unicode characters which are only accessible
> > using surrogate pairs (i.e. two UTF-16 codes). There are currently
> > none of these codes assigned, and when they are, they'll be extremely
> > rare. So rare, in fact, that having strings take up twice the
> > space that they currently do simply isn't worth the cost.
>
> This is no longer true, as of Unicode 3.1. Almost half of all
> characters currently assigned are outside of the BMP (i.e., require
> surrogate pairs in the UTF-16 encoding), including many Chinese
> characters. In current usage, these characters probably occur mainly
> in names, and are rare, but obviously important for the people
> involved.
In plane 2 (one of the supplementary planes, reached through surrogate
pairs in UTF-16) there are about 41000 Han characters, in addition to
the about 27000 Han characters in the BMP. More are expected to be
encoded. However, IIRC, only about 6000-7000 of them are in modern use.
Some people also like the mathematical alphanumeric characters in
plane 1. I don't really want to push for them, since I think they are a
major design mistake. More likable (IMHO) are the musical characters,
also in plane 1 ("western", though that attribute was removed, and
Byzantine!). (You cannot set a musical score in Unicode plain text;
Unicode just encodes the characters that you can use IN a musical score.)
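
To make the surrogate-pair point concrete, here is a minimal sketch of the UTF-16 encoding of a single Char. The name toUtf16 is made up for illustration and is not part of any library; the sketch assumes the argument is a valid scalar value (not itself a surrogate code point).

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Hypothetical helper: encode one Char as UTF-16 code units.
-- Code points below U+10000 (the BMP) take one unit; everything
-- above takes a high/low surrogate pair.
toUtf16 :: Char -> [Word16]
toUtf16 c
  | cp < 0x10000 = [fromIntegral cp]
  | otherwise    = [hi, lo]
  where
    cp = ord c
    v  = cp - 0x10000                                -- 20-bit offset
    hi = fromIntegral (0xD800 + (v `shiftR` 10))     -- top 10 bits
    lo = fromIntegral (0xDC00 + (v .&. 0x3FF))       -- bottom 10 bits
```

With this, a BMP character such as 'A' yields one code unit, while U+1D11E (MUSICAL SYMBOL G CLEF, plane 1) and U+20000 (the first plane 2 ideograph) each yield two.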
...
> isAscii, isLatin1 - OK
Yes, but why do (or, rather, did) you want them, isLatin1 in particular?
Then what about "isCP1252" (THE most common encoding today),
"isShiftJis", etc., for several hundred encodings? (I'm not proposing to
remove isAscii, but isLatin1 is dubious.)
> isControl - I don't know about this.
Why do (did) you want it? There are several "kinds" of "control" characters
in Unicode: the traditional C0 and (less used) C1 ones, format control
characters (NO, they do NOT control FORMATTING, though they do control
FORMAT, like cursive connections), ...
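
As a rough illustration of why a single isControl predicate is too coarse, the sketch below separates the kinds mentioned above by Unicode general category. It assumes a Data.Char that exposes generalCategory (as the current base library does); ControlKind and controlKind are hypothetical names invented for this example.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, ord)

-- Hypothetical classification, for illustration only.
data ControlKind = C0Control | C1Control | FormatControl | NotControl
  deriving (Eq, Show)

controlKind :: Char -> ControlKind
controlKind c = case generalCategory c of
  Control
    | ord c < 0x80 -> C0Control   -- U+0000..U+001F, plus DEL (U+007F)
    | otherwise    -> C1Control   -- U+0080..U+009F
  Format -> FormatControl         -- e.g. ZERO WIDTH JOINER (U+200D)
  _      -> NotControl
```

A caller that only wants to strip C0 controls can then leave format characters such as the joiners alone.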
> isPrint - Dubious. Is a non-spacing accent a printable character?
A combining character is most definitely "printable". (Non-spacing and
combining are different things: many combining characters are
non-spacing, but not all of them are.)
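
The non-spacing versus combining distinction shows up directly in the general categories. A small sketch, again assuming a Data.Char with generalCategory; the function names are hypothetical:

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Combining marks come in three general categories; only the
-- first of them is "non-spacing".
isCombiningMark :: Char -> Bool
isCombiningMark c =
  generalCategory c `elem`
    [NonSpacingMark, SpacingCombiningMark, EnclosingMark]

isNonSpacingMark :: Char -> Bool
isNonSpacingMark c = generalCategory c == NonSpacingMark
```

For example, COMBINING ACUTE ACCENT (U+0301) is both combining and non-spacing, while DEVANAGARI SIGN VISARGA (U+0903) is combining but spacing.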
> isSpace - OK, by the comment in the report: "The isSpace function
> recognizes only white characters in the Latin-1 range".
Sigh. There are several others, most importantly: LINE SEPARATOR,
PARAGRAPH SEPARATOR, and IDEOGRAPHIC SPACE. And the
NEL in the C1 range.
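
A sketch of what a wider whitespace test could look like, covering the characters listed above. It assumes a Data.Char with generalCategory; isUnicodeSpace is a hypothetical name, and the choice of which controls to count as white space is an assumption of this example, not something the report specifies.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical predicate: white-space controls (including NEL, U+0085)
-- plus everything in the space and separator categories.
isUnicodeSpace :: Char -> Bool
isUnicodeSpace c =
  c `elem` "\t\n\v\f\r\x85"
    || generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]
```

This accepts LINE SEPARATOR (U+2028), PARAGRAPH SEPARATOR (U+2029), and IDEOGRAPHIC SPACE (U+3000) in addition to the Latin-1 white characters.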
> isUpper, isLower - Maybe OK.
This is property interrogation. There are many other properties of interest.
> toUpper, toLower - Not OK. There are cases where upper casing a
> character yields two characters.
See my other e-mail.
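
To illustrate the quoted point that upper casing can yield two characters: a Char -> Char function cannot express it, so a sketch has to work at the String level. The two special cases below (sharp s and the fi ligature) are only examples, not a complete table, and upperChar and upperString are hypothetical names.

```haskell
import Data.Char (toUpper)

-- Illustrative only: a real implementation would be driven by the
-- Unicode special-casing data rather than a hand-written table.
upperChar :: Char -> String
upperChar '\xDF'   = "SS"          -- LATIN SMALL LETTER SHARP S
upperChar '\xFB01' = "FI"          -- LATIN SMALL LIGATURE FI
upperChar c        = [toUpper c]   -- fall back to the 1:1 mapping

upperString :: String -> String
upperString = concatMap upperChar
```

Note that the result can be longer than the input, which is exactly what the per-character interface hides.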
> etc. Any program using this library is bound to get confused on
> Unicode strings. Even before Unicode, there is much functionality
> missing; for instance, I don't see any way to compare strings using
> a localized order.
>
> Is anyone working on honest support for Unicode, in the form of a real
> Unicode library with an interface at the correct level?
Well, IBM's ICU, for one, ... But they only do it for C/C++/Java, not for Haskell...
Kind regards
/kent k
_______________________________________________
Glasgow-haskell-users mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users