Useful task -- Character properties

Patrick R. Michaud Wed, 04 May 2005 09:03:08 -0700

On Tue, May 03, 2005 at 09:22:11PM +0100, Nicholas Clark wrote:
> 
> Whilst I confess that it's unlikely to be me here, if anyone has the time
> to contribute some help, do you have a list of useful self-contained tasks
> that people might be able to take on?


Actually, overnight I realized there's a relatively good-sized
project that needs figuring out -- identifying character properties
such as isalpha, islower, isprint, etc.  Here I'll briefly sketch
how I'd like it to work, and maybe someone enterprising can take 
things from there for us.

Currently Parrot offers quite a few ops for character properties --
namely "is_whitespace", "is_wordchar", "is_digit", etc. and their
"find_XXX" counterparts.  While these are useful, the set is also
incomplete -- at the moment I haven't found anything that lets
us find alphabetic, uppercase, lowercase, etc. properties.  (If I've
just overlooked something, please point it out!)

I suppose Parrot could add a bunch of new "is_alpha", "is_upper", 
"is_lower", etc. ops, but having a separate opcode for every 
property complicates the design of PGE a fair bit and also
leaves Parrot with a lot of very function-specific opcodes.  
What would *really* be useful would be to have three basic opcodes:

    is_cclass(out INT, in INT, in STR, in INT)
        Set $1 to 1 if the codepoint of $3 at position $4 is in
        the character class(es) given by $2.

    find_cclass(out INT, in INT, in STR, in INT, in INT)
        Set $1 to the offset of the first codepoint matching
        the character class(es) given by $2 in string $3, starting
        at offset $4 for up to $5 codepoints.  If no matching
        character is found, set $1 to -1.

    find_not_cclass(out INT, in INT, in STR, in INT, in INT)
        Set $1 to the offset of the first codepoint not matching
        the character class(es) given by $2 in string $3, starting
        at offset $4 for up to $5 codepoints.  If the substring
        consists entirely of matching characters, set $1 to -1.
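
To make the offset and count handling of the two find ops concrete, here
is a rough C sketch of the semantics described above.  It is not Parrot
code: it walks a NUL-terminated byte string rather than codepoints, it
treats "in the class(es)" as "any bit of the mask matches", and it
returns an offset measured from the start of the string; all three are
my assumptions.  The names cclass_of_fn, scan_cclass, demo_cclass_of and
the DEMO_* constants are hypothetical and exist only for illustration.

```c
#include <string.h>

/* Hypothetical callback: returns the class bitmask for one character. */
typedef unsigned int (*cclass_of_fn)(unsigned char ch);

/* Two illustrative class bits, using the values from the table in this post. */
enum { DEMO_DIGIT = 0x0008, DEMO_SPACE = 0x0020 };

/* Toy classifier that knows only ASCII digits and SPC/TAB. */
unsigned int demo_cclass_of(unsigned char ch)
{
    unsigned int m = 0;
    if (ch >= '0' && ch <= '9') m |= DEMO_DIGIT;
    if (ch == ' ' || ch == '\t') m |= DEMO_SPACE;
    return m;
}

/* Shared scan: want_match = 1 finds the first character in the class(es),
 * want_match = 0 finds the first character outside them.
 * Looks at most `count` characters starting at `offset`; returns -1 if
 * nothing qualifies or the arguments are out of range. */
static int scan_cclass(unsigned int mask, const char *s, int offset,
                       int count, cclass_of_fn cclass_of, int want_match)
{
    int len = (int)strlen(s);
    int end, i;

    if (offset < 0 || count < 0 || offset > len)
        return -1;
    /* Clamp the scan window to the end of the string. */
    end = (count > len - offset) ? len : offset + count;

    for (i = offset; i < end; i++) {
        int hit = (cclass_of((unsigned char)s[i]) & mask) != 0;
        if (hit == want_match)
            return i;   /* assumption: offset is absolute, not relative */
    }
    return -1;
}

int find_cclass(unsigned int mask, const char *s, int offset, int count,
                cclass_of_fn cclass_of)
{
    return scan_cclass(mask, s, offset, count, cclass_of, 1);
}

int find_not_cclass(unsigned int mask, const char *s, int offset, int count,
                    cclass_of_fn cclass_of)
{
    return scan_cclass(mask, s, offset, count, cclass_of, 0);
}
```

Under these assumptions, find_cclass(DEMO_DIGIT, "ab 12", 0, 5, ...)
yields 3, and the same call limited to 3 characters yields -1.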

The character classes in $2 above are given by an integer bitmask,
defined according to the following table (or something like it --
I took this table from ctype.h on my system, then added a "newline" 
class):

     0x0001 - uppercase char
     0x0002 - lowercase char
     0x0004 - alphabetic char
     0x0008 - numeric character
     0x0010 - hexadecimal digit
     0x0020 - whitespace
     0x0040 - printing
     0x0080 - graphical
     0x0100 - blank (i.e., SPC and TAB)
     0x0200 - control character
     0x0400 - punctuation character
     0x0800 - alphanumeric character
     0x1000 - newline character
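
As a rough illustration of how the table could be used, here is a small
C sketch that maps the bits onto the standard ctype.h predicates and
implements the is_cclass test on top of them.  The CCLASS_* names and
the cclass_of helper are hypothetical, the sketch handles single bytes
in the C locale rather than real codepoints, and treating only LF as
"newline", matching on any bit of the mask, and returning 0 for an
out-of-range position are my assumptions, not settled design.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical names for the bits in the table above. */
enum {
    CCLASS_UPPER   = 0x0001,
    CCLASS_LOWER   = 0x0002,
    CCLASS_ALPHA   = 0x0004,
    CCLASS_NUMERIC = 0x0008,
    CCLASS_XDIGIT  = 0x0010,
    CCLASS_SPACE   = 0x0020,
    CCLASS_PRINT   = 0x0040,
    CCLASS_GRAPH   = 0x0080,
    CCLASS_BLANK   = 0x0100,
    CCLASS_CNTRL   = 0x0200,
    CCLASS_PUNCT   = 0x0400,
    CCLASS_ALNUM   = 0x0800,
    CCLASS_NEWLINE = 0x1000
};

/* Build the full class bitmask for one byte using ctype.h. */
unsigned int cclass_of(unsigned char ch)
{
    unsigned int m = 0;
    if (isupper(ch))  m |= CCLASS_UPPER;
    if (islower(ch))  m |= CCLASS_LOWER;
    if (isalpha(ch))  m |= CCLASS_ALPHA;
    if (isdigit(ch))  m |= CCLASS_NUMERIC;
    if (isxdigit(ch)) m |= CCLASS_XDIGIT;
    if (isspace(ch))  m |= CCLASS_SPACE;
    if (isprint(ch))  m |= CCLASS_PRINT;
    if (isgraph(ch))  m |= CCLASS_GRAPH;
    if (isblank(ch))  m |= CCLASS_BLANK;   /* C99: SPC and TAB */
    if (iscntrl(ch))  m |= CCLASS_CNTRL;
    if (ispunct(ch))  m |= CCLASS_PUNCT;
    if (isalnum(ch))  m |= CCLASS_ALNUM;
    if (ch == '\n')   m |= CCLASS_NEWLINE; /* assumption: LF only */
    return m;
}

/* Return 1 if the character at `pos` is in any class named by `mask`,
 * 0 otherwise (including when `pos` is outside the string). */
int is_cclass(unsigned int mask, const char *s, int pos)
{
    if (pos < 0 || (size_t)pos >= strlen(s))
        return 0;
    return (cclass_of((unsigned char)s[pos]) & mask) != 0;
}
```

Because the classes are bits, a caller can test several at once, e.g.
is_cclass(CCLASS_UPPER | CCLASS_NUMERIC, "a1B", 1) is true since '1'
is numeric.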

We have 32 bits available, so we could extend this table as needed.
And EVENTUALLY we'll probably need a more general interface 
to handle Unicode properties as well as character class compositions, 
but I speculate that we can do those either in a library, or
(if speed is needed) we can build a "character class" PMC type 
optimized for charsets and have:

    is_cclass(out INT, in PMC, in STR, in INT)
    find_cclass(out INT, in PMC, in STR, in INT, in INT)
    find_not_cclass(out INT, in PMC, in STR, in INT, in INT)

But for now the integer representation of character classes
ought to be sufficient.

Anyway, that's another very useful self-contained task that 
I'd be glad to have a volunteer for.

Pm
