Haskell 1.4 and Unicode

Carl R. Witty 07 Nov 1997 03:18:01 -0800
I have some questions regarding Haskell 1.4 and Unicode.  My source
materials for these questions are "The Haskell 1.4 Report" and the
files

ftp://ftp.unicode.org/Public/2.0-Update/ReadMe-2.0.14.txt   
  and
ftp://ftp.unicode.org/Public/2.0-Update/UnicodeData-2.0.14.txt

It's possible that question 2 below would be resolved if I actually
read the Unicode book; if so, I apologize in advance.

1) I assume that layout processing occurs after Unicode preprocessing;
otherwise, you can't even find the lexemes.  If so, are all Unicode
characters assumed to be the same width?

2) The Report uses the following classes of characters:
uniWhite -> any Unicode character defined as whitespace
nonbrkspc ???
UNIsmall -> any Unicode lowercase letter
UNIlarge -> any uppercase or titlecase Unicode letter
UNIsymbol -> any Unicode symbol or punctuation
UNIdigit -> any Unicode numeric

The file ReadMe-2.0.14.txt above defines the following classes of
characters:

Normative
    Mn = Mark, Non-Spacing
    Mc = Mark, Spacing Combining
    Me = Mark, Enclosing

    Nd = Number, Decimal Digit
    Nl = Number, Letter
    No = Number, Other

    Zs = Separator, Space
    Zl = Separator, Line
    Zp = Separator, Paragraph

    Cc = Other, Control
    Cf = Other, Format
    Cs = Other, Surrogate
    Co = Other, Private Use
    Cn = Other, Not Assigned

Informative
    Lu = Letter, Uppercase
    Ll = Letter, Lowercase
    Lt = Letter, Titlecase
    Lm = Letter, Modifier
    Lo = Letter, Other

    Pc = Punctuation, Connector
    Pd = Punctuation, Dash
    Ps = Punctuation, Open
    Pe = Punctuation, Close
    Po = Punctuation, Other

    Sm = Symbol, Math
    Sc = Symbol, Currency
    Sk = Symbol, Modifier
    So = Symbol, Other

It's not obvious how the Unicode-defined classes map onto the classes
in the Report.  My guess is:

uniWhite == classes Zs, Zl, Zp
UNIsmall == class Ll
UNIlarge == classes Lu, Lt
UNIsymbol == classes Sm, Sc, Sk, So
UNIdigit == classes Nd, Nl, No
nonbrkspc == "NO-BREAK SPACE" (\x00a0)

However, it would also seem quite reasonable to include class Lo
(which includes things like "Hebrew letter Alef") in UNIsmall or
UNIlarge; and to include some of the Punctuation classes in UNIsymbol.
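To make the guessed mapping concrete, here is a minimal sketch of it as a classifier. It is not taken from the Report: it assumes a present-day GHC, whose Data.Char module exposes generalCategory, and the ReportClass type and reportClass function are hypothetical names invented for this illustration. The category data comes from whatever Unicode version the compiler ships, not from UnicodeData-2.0.14.txt.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical names standing in for the Report's character classes.
data ReportClass
  = UniWhite
  | UniSmall
  | UniLarge
  | UniSymbol
  | UniDigit
  | Unclassified -- everything the guessed mapping leaves out
  deriving (Eq, Show)

-- The guessed mapping from Unicode general categories to Report classes.
reportClass :: Char -> ReportClass
reportClass c = case generalCategory c of
  Space              -> UniWhite  -- Zs
  LineSeparator      -> UniWhite  -- Zl
  ParagraphSeparator -> UniWhite  -- Zp
  LowercaseLetter    -> UniSmall  -- Ll
  UppercaseLetter    -> UniLarge  -- Lu
  TitlecaseLetter    -> UniLarge  -- Lt
  MathSymbol         -> UniSymbol -- Sm
  CurrencySymbol     -> UniSymbol -- Sc
  ModifierSymbol     -> UniSymbol -- Sk
  OtherSymbol        -> UniSymbol -- So
  DecimalNumber      -> UniDigit  -- Nd
  LetterNumber       -> UniDigit  -- Nl
  OtherNumber        -> UniDigit  -- No
  _                  -> Unclassified
```

Under this guess, reportClass '\x05D0' (Hebrew letter Alef, class Lo) and reportClass '!' (class Po) both come out as Unclassified, which is exactly the gap described above.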

3) What does it mean that Char can include any Unicode character?

If I compile and run the following program on my vanilla American UNIX
box:

        main = putChar '\x2473' {- print a "circled number twenty" -}

to get a program "ctwenty", and I run

        ./ctwenty | od -c

(od prints out each byte of output), what will I see?

Will the following program

        main = getChar >>= (print . fromEnum)

ever print out a number greater than 256?

If the answers to the above questions are "implementation dependent",
what are some of the behaviors that implementations might plausibly
have?
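As an illustration of what "implementation dependent" could mean for the putChar example, the sketch below computes the bytes that three conceivable output conventions would produce for a 16-bit character. None of these is claimed to be what any actual Haskell implementation does; the function names truncateLow8, ucs2BE and utf8 are made up for this example, and the code assumes a present-day GHC with Data.Bits and Data.Word.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Convention A (assumed): keep only the low 8 bits of the character code.
truncateLow8 :: Char -> [Word8]
truncateLow8 c = [fromIntegral (ord c .&. 0xFF)]

-- Convention B (assumed): write the 16-bit code as two bytes, high byte first.
ucs2BE :: Char -> [Word8]
ucs2BE c = [fromIntegral ((n `shiftR` 8) .&. 0xFF), fromIntegral (n .&. 0xFF)]
  where
    n = ord c

-- Convention C (assumed): UTF-8. This sketch only handles codes up to
-- 0xFFFF, i.e. the 16-bit range; larger codes are not encoded correctly.
utf8 :: Char -> [Word8]
utf8 c
  | n < 0x80  = [fromIntegral n]
  | n < 0x800 = [ fromIntegral (0xC0 .|. (n `shiftR` 6))
                , fromIntegral (0x80 .|. (n .&. 0x3F)) ]
  | otherwise = [ fromIntegral (0xE0 .|. (n `shiftR` 12))
                , fromIntegral (0x80 .|. ((n `shiftR` 6) .&. 0x3F))
                , fromIntegral (0x80 .|. (n .&. 0x3F)) ]
  where
    n = ord c
```

For '\x2473' these give one byte (0x73, the letter "s"), two bytes (0x24 0x73) and three bytes (0xE2 0x91 0xB3) respectively, so the od output would differ in both length and content depending on the convention.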

Carl Witty
[EMAIL PROTECTED]