----- Original Message -----
From: "Dylan Thurston" <[EMAIL PROTECTED]>
To: "Andrew J Bromage" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Friday, October 05, 2001 6:00 PM
Subject: Re: UniCode


> On Fri, Oct 05, 2001 at 11:23:50PM +1000, Andrew J Bromage wrote:
> > G'day all.
> >
> > On Fri, Oct 05, 2001 at 02:29:51AM -0700, Krasimir Angelov wrote:
> >
> > > Why Char is 32 bit. UniCode characters is 16 bit.
> >
> > It's not quite as simple as that.  There is a set of one million
> > (more correctly, 1M) Unicode characters which are only accessible
> > using surrogate pairs (i.e. two UTF-16 codes).  There are currently
> > none of these codes assigned, and when they are, they'll be extremely
> > rare.  So rare, in fact, that the cost of strings taking up twice the
> > space that they currently do simply isn't worth the cost.
>
> This is no longer true, as of Unicode 3.1.  Almost half of all
> characters currently assigned are outside of the BMP (i.e., require
> surrogate pairs in the UTF-16 encoding), including many Chinese
> characters.  In current usage, these characters probably occur mainly
> in names, and are rare, but obviously important for the people
> involved.

In plane 2 (one of the surrogate planes) there are about 41000
Hàn characters, in addition to the about 27000 Hàn characters
in the BMP.  And more are expected to be encoded.  However,
IIRC, only about 6000-7000 of them are in modern use.
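
To make the surrogate-pair arithmetic concrete, here is a minimal Haskell
sketch.  The name toUtf16 is made up for illustration and is not part of any
Haskell library; the sketch assumes Char can hold code points above U+FFFF.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Encode one character as UTF-16 code units (hypothetical helper).
-- BMP characters take one unit; characters above U+FFFF take a
-- surrogate pair.  Lone surrogate code points are passed through unchanged.
toUtf16 :: Char -> [Word16]
toUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [hi, lo]
  where
    n  = ord c
    v  = n - 0x10000                            -- 20-bit offset past the BMP
    hi = fromIntegral (0xD800 + (v `shiftR` 10)) -- high surrogate: top 10 bits
    lo = fromIntegral (0xDC00 + (v .&. 0x3FF))   -- low surrogate: bottom 10 bits
```

For example, the first plane 2 code point, U+20000, comes out as the two
units D840 DC00, while any BMP character stays a single unit.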

Plane 1 contains the mathematical alphanumeric characters.  I don't really
want to push for them (I think they are a major design mistake), but some
people like them.  There are also the more likable (IMHO) musical
characters in plane 1 ("western", though that attribute was removed, and
Byzantine!).  (You cannot set a musical score in Unicode plain text; it
just encodes the characters that you can use IN a musical score.)

...
>   isAscii, isLatin1 - OK
Yes, but why do (or, rather, did) you want them, isLatin1 in particular?
Then what about "isCP1252" (THE most common encoding today),
"isShiftJis", etc., for several hundred encodings? (I'm not proposing to
remove isAscii, but isLatin1 is dubious.)

>   isControl - I don't know about this.
Why do (did) you want it? There are several "kinds" of "control" characters
in Unicode: the traditional C0 and (less used) C1 ones, format control
characters (NO, they do NOT control FORMATTING, though they do control
FORMAT, like cursive connections), ...
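
The distinction between these kinds can be sketched with general categories.
This assumes a Data.Char that exposes generalCategory (as GHC's does today);
the ControlKind type and controlKind function are invented for illustration.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, ord)

-- Hypothetical classification of the "kinds" of control character.
data ControlKind = C0 | C1 | FormatControl | NotControl
  deriving (Eq, Show)

controlKind :: Char -> ControlKind
controlKind c = case generalCategory c of
  Control
    -- DEL (U+007F) is grouped with C0 here purely for simplicity.
    | ord c < 0x20 || ord c == 0x7F -> C0
    | otherwise                     -> C1   -- U+0080..U+009F
  Format -> FormatControl                   -- e.g. ZERO WIDTH JOINER
  _      -> NotControl
```

A single Boolean isControl has to pick one of these groupings, which is why
the question of what it is wanted for matters.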

>   isPrint - Dubious.  Is a non-spacing accent a printable character?
A combining character is most definitely "printable". (There is a difference
between non-spacing and combining: many combining characters are
non-spacing, but not all of them are.)
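
A rough sketch of that difference, approximating "combining" by the three
mark general categories (an assumption; the Unicode definition goes through
the combining class as well).  The function names are made up, and
generalCategory is assumed to be available from Data.Char.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Approximation: any mark (Mn, Mc, Me) counts as a combining character.
isCombiningMark :: Char -> Bool
isCombiningMark c =
  generalCategory c `elem` [NonSpacingMark, SpacingCombiningMark, EnclosingMark]

-- Only the Mn category is non-spacing.
isNonSpacing :: Char -> Bool
isNonSpacing c = generalCategory c == NonSpacingMark
```

COMBINING ACUTE ACCENT (U+0301) satisfies both predicates, whereas
DEVANAGARI SIGN VISARGA (U+0903) is combining but spacing.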

>   isSpace - OK, by the comment in the report: "The isSpace function
>             recognizes only white characters in the Latin-1 range".
Sigh. There are several others, most importantly: LINE SEPARATOR,
PARAGRAPH SEPARATOR, and IDEOGRAPHIC SPACE.  And the
NEL in the C1 range.
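
A wider predicate could look something like the following sketch.  The name
isUnicodeSpace is hypothetical, generalCategory is assumed to be available
from Data.Char, and the exact set of characters to accept is a judgment
call rather than anything the report specifies.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical white-space test that also accepts the characters
-- named above.
isUnicodeSpace :: Char -> Bool
isUnicodeSpace c =
  c `elem` "\t\n\v\f\r\x85"   -- traditional controls plus NEL (U+0085)
    || generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]
```

With this, LINE SEPARATOR (U+2028), PARAGRAPH SEPARATOR (U+2029),
IDEOGRAPHIC SPACE (U+3000) and NEL are all treated as white space.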

>   isUpper, isLower - Maybe OK.
This is property interrogation. There are many other properties of interest.

>   toUpper, toLower - Not OK.  There are cases where upper casing a
>      character yields two characters.
See my other e-mail.
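
For illustration only, a Char -> String mapping is one way to let a single
character upper-case to several.  The names toUpperFull and upperString are
made up, and only two special cases are listed; a real table would have to
come from Unicode's SpecialCasing.txt.

```haskell
import Data.Char (toUpper)

-- Upper-case one character to a String, so that one input character
-- can yield more than one output character (hypothetical helper).
toUpperFull :: Char -> String
toUpperFull '\xDF'   = "SS"          -- LATIN SMALL LETTER SHARP S
toUpperFull '\xFB01' = "FI"          -- LATIN SMALL LIGATURE FI
toUpperFull c        = [toUpper c]   -- fall back to the one-to-one mapping

-- Upper-case a whole string; the result may be longer than the input.
upperString :: String -> String
upperString = concatMap toUpperFull
```

This still ignores locale- and context-dependent rules, so it is a sketch of
the interface shape rather than of correct case conversion.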

> etc.  Any program using this library is bound to get confused on
> Unicode strings.  Even before Unicode, there is much functionality
> missing; for instance, I don't see any way to compare strings using
> a localized order.
>
> Is anyone working on honest support for Unicode, in the form of a real
> Unicode library with an interface at the correct level?

Well, IBM's ICU, for one, ...  But they only do it for C/C++/Java, not for Haskell...

        Kind regards
        /kent k



_______________________________________________
Glasgow-haskell-users mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
