Wed, 27 Sep 2000 20:00:09 +1100, Manuel M. T. Chakravarty <[EMAIL PROTECTED]> wrote:

> But this reinforces my point: we need an _efficient_ way of
> getting at the Unicode ranges for certain character classes.

IMHO usually the best representation of such a subset is a function
Char->Bool, probably composed of predicates like isAlpha.

The structure of such an expression should be simpler and more compact
than an explicit dispatch table over individual characters, and it
is more general: it allows using arbitrary predicates available in
that form.

It should be a natural choice in a functional language :-)
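
To make the idea concrete, here is a minimal sketch of character
classes as composable Char->Bool functions. The names CharClass,
<||>, <&&>, complement, oneOf and identChar are made up for this
illustration; only isAlpha and isDigit come from the standard library.

```haskell
import Data.Char (isAlpha, isDigit)

-- A character class is just a predicate on characters.
type CharClass = Char -> Bool

-- Union and intersection of two classes (hypothetical helper names).
(<||>), (<&&>) :: CharClass -> CharClass -> CharClass
(p <||> q) c = p c || q c
(p <&&> q) c = p c && q c
infixr 2 <||>
infixr 3 <&&>

-- Complement of a class.
complement :: CharClass -> CharClass
complement p = not . p

-- A class given by an explicit list of characters.
oneOf :: [Char] -> CharClass
oneOf cs = (`elem` cs)

-- Example: characters allowed inside an identifier.
identChar :: CharClass
identChar = isAlpha <||> isDigit <||> oneOf "_'"
```

A lexer could then use identChar with span or takeWhile, and any new
predicate with the type Char->Bool fits in without touching a table.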

> H98 seems to be lacking some features for practical use of Unicode
> - the header to the standard library `Char' actually admits that
> 
>   This module offers only a limited view of the full Unicode
>   character set; the full set of Unicode character
>   attributes is not accessible in this library.

I am working on a fuller replacement for module Char, checking the
details with people from the unicode and linux-utf8 mailing lists.
It's in <http://qrczak.ids.net.pl/qforeign-0.60.tar.gz> (already a
bit out of date).

A problem is that it is not Haskell98. Not only because Haskell98 has
a limited set of predicates, but also because it specifies the behavior
of some predicates in a way considered heretical and unfair by Unicode
people (isSpace works only for ISO-8859-1 and accepts '\xA0', isDigit
works only for ASCII, letters from alphabets without cases are all
considered uppercase). So I am temporarily forgetting about Haskell98,
sorry.

Predicates are of course based on character categories from the
Unicode character database. Categories are exposed directly too.

One question about the interface. There are 30 categories, denoted by
two-letter abbreviations. Of course we could have a flat enumeration
of all 30. But perhaps it would be better to divide them according
to their structure (which corresponds to the first and second letter
of their abbreviations):

data Category
    = Letter      !Letter
    | Mark        !Mark
    | Number      !Number
    | Separator   !Separator
    | Other       !Other
    | Punctuation !Punctuation
    | Symbol      !Symbol

data Letter = Uppercase | Lowercase | Titlecase | ModifierLetter | OtherLetter
data Mark = NonSpacing | Spacing | Enclosing
data Number = Decimal | LetterNumber | OtherNumber
data Separator = Space | Line | Paragraph
data Other = Control | Format | Surrogate | PrivateUse | NotAssigned
data Punctuation = Connector | Dash | Open | Close | Initial | Final | OtherPunctuation
data Symbol = Math | Currency | ModifierSymbol | OtherSymbol

This leads to simpler predicates, e.g. isAlphaNum checks only whether
the outer constructor is Letter or Number, instead of enumerating
eight categories, so I guess that it will be similarly simpler for
somebody wanting to check categories directly. It would also be more
robust if a subcategory is added in a future Unicode standard.
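
A small sketch of what such predicates could look like on the nested
type. The types are the ones proposed above (with deriving clauses
added); the function names isAlphaNumCat and isUpperCat are invented
for this example and work on Category values rather than on Char.

```haskell
data Category
    = Letter      !Letter
    | Mark        !Mark
    | Number      !Number
    | Separator   !Separator
    | Other       !Other
    | Punctuation !Punctuation
    | Symbol      !Symbol
    deriving (Eq, Show)

data Letter = Uppercase | Lowercase | Titlecase | ModifierLetter | OtherLetter
    deriving (Eq, Show)
data Mark = NonSpacing | Spacing | Enclosing
    deriving (Eq, Show)
data Number = Decimal | LetterNumber | OtherNumber
    deriving (Eq, Show)
data Separator = Space | Line | Paragraph
    deriving (Eq, Show)
data Other = Control | Format | Surrogate | PrivateUse | NotAssigned
    deriving (Eq, Show)
data Punctuation = Connector | Dash | Open | Close | Initial | Final | OtherPunctuation
    deriving (Eq, Show)
data Symbol = Math | Currency | ModifierSymbol | OtherSymbol
    deriving (Eq, Show)

-- Looks only at the outer constructor, so it keeps working unchanged
-- if a new Letter or Number subcategory is added later.
isAlphaNumCat :: Category -> Bool
isAlphaNumCat (Letter _) = True
isAlphaNumCat (Number _) = True
isAlphaNumCat _          = False

-- A finer test matches on the inner constructor as well.
isUpperCat :: Category -> Bool
isUpperCat (Letter Uppercase) = True
isUpperCat _                  = False
```

With a flat enumeration the first function would have to list all
five letter and three number categories explicitly.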

But it makes the structure of the Category type more complex.
I'm not sure if this is a good idea.

It happens that GHC nicely optimizes some compound predicates.
For example isPunct ch || isSymbol ch compiles into code like
    case category ch of
        Punctuation _ -> True
        Symbol      _ -> True
        _             -> False
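
For anyone who wants to experiment with this, below is a rough sketch
of a category function and the two predicates. It is not the code
from the module mentioned above: it assumes a base library that
provides Data.Char.generalCategory with a flat 30-constructor
GeneralCategory type, and simply regroups that into the nested type.
The name isPunctOrSymbol is made up for the example.

```haskell
import qualified Data.Char as C

data Category
    = Letter !Letter | Mark !Mark | Number !Number | Separator !Separator
    | Other !Other | Punctuation !Punctuation | Symbol !Symbol
    deriving (Eq, Show)

data Letter = Uppercase | Lowercase | Titlecase | ModifierLetter | OtherLetter
    deriving (Eq, Show)
data Mark = NonSpacing | Spacing | Enclosing deriving (Eq, Show)
data Number = Decimal | LetterNumber | OtherNumber deriving (Eq, Show)
data Separator = Space | Line | Paragraph deriving (Eq, Show)
data Other = Control | Format | Surrogate | PrivateUse | NotAssigned
    deriving (Eq, Show)
data Punctuation = Connector | Dash | Open | Close | Initial | Final | OtherPunctuation
    deriving (Eq, Show)
data Symbol = Math | Currency | ModifierSymbol | OtherSymbol deriving (Eq, Show)

-- Assumption: regroup the flat categories from Data.Char into the nested type.
category :: Char -> Category
category ch = case C.generalCategory ch of
    C.UppercaseLetter      -> Letter Uppercase
    C.LowercaseLetter      -> Letter Lowercase
    C.TitlecaseLetter      -> Letter Titlecase
    C.ModifierLetter       -> Letter ModifierLetter
    C.OtherLetter          -> Letter OtherLetter
    C.NonSpacingMark       -> Mark NonSpacing
    C.SpacingCombiningMark -> Mark Spacing
    C.EnclosingMark        -> Mark Enclosing
    C.DecimalNumber        -> Number Decimal
    C.LetterNumber         -> Number LetterNumber
    C.OtherNumber          -> Number OtherNumber
    C.ConnectorPunctuation -> Punctuation Connector
    C.DashPunctuation      -> Punctuation Dash
    C.OpenPunctuation      -> Punctuation Open
    C.ClosePunctuation     -> Punctuation Close
    C.InitialQuote         -> Punctuation Initial
    C.FinalQuote           -> Punctuation Final
    C.OtherPunctuation     -> Punctuation OtherPunctuation
    C.MathSymbol           -> Symbol Math
    C.CurrencySymbol       -> Symbol Currency
    C.ModifierSymbol       -> Symbol ModifierSymbol
    C.OtherSymbol          -> Symbol OtherSymbol
    C.Space                -> Separator Space
    C.LineSeparator        -> Separator Line
    C.ParagraphSeparator   -> Separator Paragraph
    C.Control              -> Other Control
    C.Format               -> Other Format
    C.Surrogate            -> Other Surrogate
    C.PrivateUse           -> Other PrivateUse
    C.NotAssigned          -> Other NotAssigned

-- Each predicate inspects only the outer constructor.
isPunct, isSymbol :: Char -> Bool
isPunct  ch = case category ch of { Punctuation _ -> True; _ -> False }
isSymbol ch = case category ch of { Symbol _      -> True; _ -> False }

-- The compound predicate discussed above.
isPunctOrSymbol :: Char -> Bool
isPunctOrSymbol ch = isPunct ch || isSymbol ch
```

Whether a given compiler version merges the two case expressions into
one can be checked by looking at its intermediate output; the sketch
itself does not depend on that.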

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^                      SUBSTITUTE SIGNATURE
QRCZAK

