Re: character properties

Marcin 'Qrczak' Kowalczyk Mon, 25 Sep 2000 13:31:23 -0700
Mon, 25 Sep 2000 17:07:27 +0200 (CEST), Bruno Haible <[EMAIL PROTECTED]> writes:

> Here I would add:  category is one of [Zl,Zp]
> because the Line/Paragraph Separators behave like LineFeed.

Markus Kuhn says otherwise so I'm leaving this for now.

> > isPrint    c = category is other than [Zl,Zp,Cc,Cf,Cs,Co]
> 
> I think Cf (Format Control) and Co (Private Use) should be counted as
> printable.

Co - OK.
Cf - really? I checked which characters these are, and they don't look
very printable; they look more like control characters...

> > isSpace    c = one of "\t\n\r\f\v" || category is one of [Zs,Zl,Zp]
> 
> From that, please exclude those characters of category Zs (Space)
> which have "noBreak" mentioned in their UnicodeData line.

Hmm, glibc-2.1.3 says that iswspace(160) is true. Haskell 98 says that
isSpace '\160' is True as well (but it is heretical in other cases).

Perhaps you are right that splitting a line into words should not split
on U+00A0. Line breaking on characters satisfying isSpace seems to be
correct then (even if it does not find all opportunities to break). OK,
I am making these exceptions.
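
A minimal sketch of the rule with those exceptions, assuming the
category lookup is available as generalCategory from today's Data.Char
(not the interface being designed in this thread). The names
noBreakSpaces and isSpaceSketch are hypothetical; the list holds the Zs
characters that, as far as I can tell, carry <noBreak> in UnicodeData:
U+00A0, U+2007 and U+202F.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Assumption: the Zs characters marked <noBreak> in UnicodeData.
noBreakSpaces :: [Char]
noBreakSpaces = "\x00A0\x2007\x202F"

-- isSpace as proposed above, minus the no-break spaces.
isSpaceSketch :: Char -> Bool
isSpaceSketch c
  | c `elem` "\t\n\r\f\v" = True
  | c `elem` noBreakSpaces = False
  | otherwise =
      generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]
```

With this, splitting a line into words on isSpaceSketch keeps text
joined by U+00A0 together.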

> > isPunct    c = isGraph c && not (isAlphaNum c)
> 
> This is traditional Unix semantics of "punctuation". Unicode has a
> more restricted notion of "punctuation" (category P).

OK.

In a Haskell source there is a significant class described as
'any Unicode symbol or punctuation', explicitly including also
    !#$%&*+./<=>?@\^|-~
These are characters that operators may consist of. It would make
sense to provide an unambiguous recognition for this class (modulo
reserved characters). So I'm adding isSymbol (the operator class is
then isSymbol ch || isPunct ch).
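
A small sketch of that class, using isSymbol and isPunctuation from
the current Data.Char as stand-ins for the isSymbol and the
Unicode-restricted isPunct discussed here. The name isOperatorChar is
hypothetical, and the sketch does not exclude the reserved characters.

```haskell
import Data.Char (isPunctuation, isSymbol)

-- Any Unicode symbol (category S*) or punctuation (category P*).
isOperatorChar :: Char -> Bool
isOperatorChar ch = isSymbol ch || isPunctuation ch

-- The ASCII operator characters listed above.
asciiOperatorChars :: String
asciiOperatorChars = "!#$%&*+./<=>?@\\^|-~"
```

Every character of asciiOperatorChars falls into a P or S category, so
all isOperatorChar asciiOperatorChars holds without listing them
explicitly.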

> > isDigit    c = c >= '0' && c <= '9'
> 
> I'd prefer: category is Nd

OK, I was convinced on the Unicode list. But since many formats require
only ASCII digits, I think I will add isAsciiDigit too. Category Nd
does not fit isOctDigit and isHexDigit, which stay as before; it would
be strange to have an ASCII hex predicate but no ASCII decimal one.
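
The two digit predicates could look roughly like this. It is a sketch
that assumes generalCategory from the current Data.Char; isDigitNd is a
hypothetical name for the Nd-based isDigit, chosen so it does not clash
with the library's own isDigit.

```haskell
import Data.Char (GeneralCategory (DecimalNumber), generalCategory)

-- Digit in the Unicode sense: general category Nd.
isDigitNd :: Char -> Bool
isDigitNd c = generalCategory c == DecimalNumber

-- Digit in the sense most file formats require: '0'..'9' only.
isAsciiDigit :: Char -> Bool
isAsciiDigit c = c >= '0' && c <= '9'
```

For example, U+0663 (ARABIC-INDIC DIGIT THREE) satisfies isDigitNd but
not isAsciiDigit.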

> > isUpper    c = category is one of [Lu,Lt]
> > isLower    c = category is Ll
> 
> The isUpper/isLower categorization should take the toUpper/toLower
> mappings into account.

What do you mean?

> > But perhaps it's enough to have toTitle in addition to toUpper and
> > toLower, because what could isTitle be used for?
> 
> IMO an 'isTitle' function doesn't make sense.

IMHO isUpper variants may be used for two things:
- whether a word starts with uppercase (more important),
- whether all characters are uppercase (less important).

isUpper meaning [Lu,Lt] is OK for the first. It's not enough for
the second. Fortunately the second can be done by checking whether
all characters satisfy isUpper and toUpper does not change them,
so perhaps the other variant is unnecessary.
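
Both uses can be sketched on top of the existing functions. This
assumes isUpper and toUpper from Data.Char; startsUpper and allUpper
are hypothetical names.

```haskell
import Data.Char (isUpper, toUpper)

-- First use: does the word start with an uppercase (or titlecase) letter?
startsUpper :: String -> Bool
startsUpper (c : _) = isUpper c
startsUpper [] = False

-- Second use: every character is uppercase and toUpper leaves it alone,
-- which rules out titlecase letters.
allUpper :: String -> Bool
allUpper = all (\c -> isUpper c && toUpper c == c)
```

Note that allUpper follows the rule literally, so a string containing a
space or a digit is rejected, and the empty string is accepted.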

> But toTitle is important (as a function String -> String,
> not char -> char).

toUpper has the same problem with ß. Unfortunately Haskell has
them as Char -> Char so I'm afraid they must stay. They must be
locale-independent anyway, so in a yet-nonexistent locale framework
there will probably be more correct String -> String case changing
functions.
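
To illustrate why Char -> Char is too narrow: a String -> String
version can expand one character into several. The sketch below is
purely hypothetical, handles only ß, and is not a proposal for the
real interface.

```haskell
import Data.Char (toUpper)

-- Uppercase a string, expanding ß (U+00DF, '\223') to "SS".
-- All other characters go through the ordinary Char -> Char toUpper.
toUpperString :: String -> String
toUpperString = concatMap upper1
  where
    upper1 '\223' = "SS"
    upper1 c = [toUpper c]
```

A real implementation would need the full special-casing data and,
for some languages, locale information.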

                        *       *       *

I think I will export the interface to character categories directly too.

This leads to the problem of naming the categories. Of course I could
have a flat enumeration of 30 categories. But perhaps it would be
better to divide them:

data Category
    = Letter      Letter
    | Mark        Mark
    | Number      Number
    | Separator   Separator
    | Other       Other
    | Punctuation Punctuation
    | Symbol      Symbol

data Letter = Uppercase | Lowercase | Titlecase | ModifierLetter | OtherLetter
data Mark = NonSpacing | Spacing | Enclosing
data Number = Decimal | LetterNumber | OtherNumber
data Separator = Space | Line | Paragraph
data Other = Control | Format | Surrogate | PrivateUse | NotAssigned
data Punctuation = Connector | Dash | Open | Close | Initial | Final | OtherPunctuation
data Symbol = Math | Currency | ModifierSymbol | OtherSymbol

so e.g. Lu is "Letter Uppercase". This leads to simpler predicates,
e.g. isAlphaNum checks only the outer constructor being Letter or
Number, instead of enumerating eight categories, so I guess that it
will be similarly simpler for somebody wanting to check categories
directly.
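
As a sketch of how that would read, here are the declarations from
above again (with a deriving clause added so values can be compared and
shown) together with a predicate on the outer constructor. The name
isAlphaNumCat is hypothetical, and the function from Char to Category
is left out.

```haskell
data Category
  = Letter Letter
  | Mark Mark
  | Number Number
  | Separator Separator
  | Other Other
  | Punctuation Punctuation
  | Symbol Symbol
  deriving (Eq, Show)

data Letter = Uppercase | Lowercase | Titlecase | ModifierLetter | OtherLetter
  deriving (Eq, Show)
data Mark = NonSpacing | Spacing | Enclosing
  deriving (Eq, Show)
data Number = Decimal | LetterNumber | OtherNumber
  deriving (Eq, Show)
data Separator = Space | Line | Paragraph
  deriving (Eq, Show)
data Other = Control | Format | Surrogate | PrivateUse | NotAssigned
  deriving (Eq, Show)
data Punctuation
  = Connector | Dash | Open | Close | Initial | Final | OtherPunctuation
  deriving (Eq, Show)
data Symbol = Math | Currency | ModifierSymbol | OtherSymbol
  deriving (Eq, Show)

-- Only the outer constructor matters: any Letter or any Number.
isAlphaNumCat :: Category -> Bool
isAlphaNumCat (Letter _) = True
isAlphaNumCat (Number _) = True
isAlphaNumCat _ = False
```

A check for one exact category stays possible through Eq, e.g.
cat == Letter Uppercase for Lu.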

I'm not sure if this is a good idea.

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^                      SYGNATURA ZASTĘPCZA
QRCZAK

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/