Patrick R. Michaud wrote:
[ see below for some more ]
Actually, overnight I realized there's a relatively good-sized
project that needs figuring out -- identifying character properties
such as isalpha, islower, isprint, etc. Here I'll briefly sketch
how I'd like it to work, and maybe someone enterprising can take things from there for us.
Currently Parrot offers quite a few ops for character properties -- namely "is_whitespace", "is_wordchar", "is_digit", etc. and their "find_XXX" counterparts. While these are useful, the set is also incomplete -- at the moment I haven't found anything that lets us find alphabetic, uppercase, lowercase, etc. properties. (If I've just overlooked something, please point it out!)
I suppose Parrot could add a bunch of new "is_alpha", "is_upper", "is_lower", etc. ops, but having separate opcodes for every property actually complicates the design of PGE a fair bit
and leaves us with a lot of very function-specific opcodes. What would *really* be useful is three basic opcodes:
    is_cclass(out INT, in INT, in STR, in INT)
        Set $1 to 1 if the codepoint of $3 at position $4 is in the
        character class(es) given by $2.

    find_cclass(out INT, in INT, in STR, in INT, in INT)
        Set $1 to the offset of the first codepoint matching the
        character class(es) given by $2 in string $3, starting at
        offset $4 for up to $5 codepoints. If no matching character
        is found, set $1 to -1.

    find_not_cclass(out INT, in INT, in STR, in INT, in INT)
        Set $1 to the offset of the first codepoint not matching the
        character class(es) given by $2 in string $3, starting at
        offset $4 for up to $5 codepoints. If the substring consists
        entirely of matching characters, set $1 to -1.
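
To make the intended semantics concrete, here is a rough sketch of the three operations as plain C functions over a byte string. Everything in it is an assumption for illustration: the function signatures, the CCLASS_* names, the restriction to ASCII via <ctype.h>, the reading of "in the character class(es)" as "in any of the given classes", and the choice to return absolute offsets. It is not how Parrot implements or would implement the ops.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical class bits; values follow the proposed bitmask table. */
enum {
    CCLASS_UPPER   = 0x0001,
    CCLASS_LOWER   = 0x0002,
    CCLASS_NUMERIC = 0x0008
};

/* Classes that byte c belongs to (ASCII / "C" locale only). */
static unsigned cclass_flags(unsigned char c)
{
    unsigned f = 0;
    if (isupper(c)) f |= CCLASS_UPPER;
    if (islower(c)) f |= CCLASS_LOWER;
    if (isdigit(c)) f |= CCLASS_NUMERIC;
    return f;
}

/* 1 if s[pos] is in ANY of the classes in flags, else 0. */
int is_cclass(unsigned flags, const char *s, size_t len, size_t pos)
{
    assert(s != NULL);
    if (pos >= len)
        return 0;
    return (cclass_flags((unsigned char)s[pos]) & flags) != 0;
}

/* Offset of the first match in s[pos .. pos+count), or -1 if none. */
long find_cclass(unsigned flags, const char *s, size_t len,
                 size_t pos, size_t count)
{
    assert(s != NULL);
    for (size_t i = pos; i < len && i - pos < count; i++)
        if (is_cclass(flags, s, len, i))
            return (long)i;
    return -1;
}

/* Offset of the first non-match in s[pos .. pos+count), or -1 if none. */
long find_not_cclass(unsigned flags, const char *s, size_t len,
                     size_t pos, size_t count)
{
    assert(s != NULL);
    for (size_t i = pos; i < len && i - pos < count; i++)
        if (!is_cclass(flags, s, len, i))
            return (long)i;
    return -1;
}
```

With the string "abc123XYZ", find_cclass for CCLASS_NUMERIC starting at offset 0 gives 3, and find_not_cclass for CCLASS_LOWER over only the first three characters gives -1, since that substring consists entirely of matching characters.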
The character classes in $2 above are given by an integer bitmask,
defined according to the following table (or something like it --
I took this table from ctype.h on my system, then added a "newline" class):
    0x0001 - uppercase char
    0x0002 - lowercase char
    0x0004 - alphabetic char
    0x0008 - numeric character
    0x0010 - hexadecimal digit
    0x0020 - whitespace
    0x0040 - printing
    0x0080 - graphical
    0x0100 - blank (i.e., SPC and TAB)
    0x0200 - control character
    0x0400 - punctuation character
    0x0800 - alphanumeric character
    0x1000 - newline character
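
As an illustration of how the table might look in C, here is a sketch of an enum carrying those bit values plus a helper that computes the full class mask for one ASCII character using <ctype.h>. The CCLASS_* names and the cclass_of() helper are hypothetical, and treating only '\n' as the "newline" class is an assumption; a real charset would define these per encoding.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

/* Hypothetical names; the bit values are the ones from the table. */
enum cclass_bits {
    CCLASS_UPPER      = 0x0001,
    CCLASS_LOWER      = 0x0002,
    CCLASS_ALPHA      = 0x0004,
    CCLASS_NUMERIC    = 0x0008,
    CCLASS_XDIGIT     = 0x0010,
    CCLASS_WHITESPACE = 0x0020,
    CCLASS_PRINT      = 0x0040,
    CCLASS_GRAPH      = 0x0080,
    CCLASS_BLANK      = 0x0100,
    CCLASS_CONTROL    = 0x0200,
    CCLASS_PUNCT      = 0x0400,
    CCLASS_ALNUM      = 0x0800,
    CCLASS_NEWLINE    = 0x1000
};

/* Bitmask of every class that character c belongs to ("C" locale). */
unsigned cclass_of(int c)
{
    unsigned m = 0;
    /* ctype functions are only defined for unsigned char values */
    assert(c >= 0 && c <= UCHAR_MAX);
    if (isupper(c))  m |= CCLASS_UPPER;
    if (islower(c))  m |= CCLASS_LOWER;
    if (isalpha(c))  m |= CCLASS_ALPHA;
    if (isdigit(c))  m |= CCLASS_NUMERIC;
    if (isxdigit(c)) m |= CCLASS_XDIGIT;
    if (isspace(c))  m |= CCLASS_WHITESPACE;
    if (isprint(c))  m |= CCLASS_PRINT;
    if (isgraph(c))  m |= CCLASS_GRAPH;
    if (isblank(c))  m |= CCLASS_BLANK;
    if (iscntrl(c))  m |= CCLASS_CONTROL;
    if (ispunct(c))  m |= CCLASS_PUNCT;
    if (isalnum(c))  m |= CCLASS_ALNUM;
    if (c == '\n')   m |= CCLASS_NEWLINE;  /* assumption: '\n' only */
    return m;
}
```

A caller can then test several classes at once, e.g. (cclass_of(c) & (CCLASS_ALPHA | CCLASS_NUMERIC)) != 0, which is the point of using a bitmask rather than one opcode per property.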
We have 32 bits available, so we could extend this table as needed.
And EVENTUALLY we'll probably need a more general interface to handle Unicode properties as well as character class compositions, but I speculate that we can do those either in a library, or
(if speed is needed) we can build a "character class" PMC type optimized for charsets and have:
    is_cclass(out INT, in PMC, in STR, in INT)
    find_cclass(out INT, in PMC, in STR, in INT, in INT)
    find_not_cclass(out INT, in PMC, in STR, in INT, in INT)
But for now the integer representation of character classes ought to be sufficient.
For hysterical raisins we actually already have two char class interfaces (partially) implemented, e.g.
src/string.c:
Parrot_string_is_digit(Interp *interpreter, STRING *s, INTVAL offset)
src/string_primitives.c:
Parrot_char_is_digit(Interp *interpreter, UINTVAL character)
The former is covered by an opcode in ops/string.ops and is the more useful form, taking a string and an offset. The latter, OTOH, can call the ICU function if ICU is present.
To clean up that mess, we stick to Patrick's plan, which implies, in no specific order:
- implement the new opcodes, first in experimental.ops
- create an enum of the char classes in charset.h
- create the general API in that header too
- convert existing charset classifying tables to the new bits
- move the ICU functions to charset/unicode.c
- deprecate existing opcodes and APIs
- clean up string_primitives.*
- convert existing tests
- write new tests
- write more new tests
- all I've forgotten to list
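
For the "convert existing charset classifying tables to the new bits" step, the idea could look roughly like the sketch below: a per-charset table holding one bitmask per character, so that a class test becomes a single lookup and mask. The names (cc_table, cc_lookup, CC_*) are made up for illustration, the table is filled from <ctype.h> only to keep the example short, and nothing here reflects the actual layout of typetable[] in charset/*.c.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical subset of the class bits. */
enum { CC_ALPHA = 0x0004, CC_NUMERIC = 0x0008, CC_WHITESPACE = 0x0020 };

/* One bitmask per ASCII character, filled on first use. */
static unsigned short cc_table[128];
static int cc_table_ready = 0;

static void cc_table_init(void)
{
    for (int c = 0; c < 128; c++) {
        unsigned short f = 0;
        if (isalpha(c)) f |= CC_ALPHA;
        if (isdigit(c)) f |= CC_NUMERIC;
        if (isspace(c)) f |= CC_WHITESPACE;
        cc_table[c] = f;
    }
    cc_table_ready = 1;
}

/* 1 if character c is in any class in flags, else 0. */
int cc_lookup(unsigned flags, int c)
{
    assert(c >= 0);
    if (!cc_table_ready)
        cc_table_init();
    if (c >= 128)
        return 0;  /* outside the table: no class known in this sketch */
    return (cc_table[c] & flags) != 0;
}
```

A real charset would ship the table as static data rather than build it at run time; the lazy initialization is only there to keep the sketch self-contained.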
See also:

  src/
      string.c
      string_primitives.c
  include/parrot/
      charset.h
      string_primitives.h
      string_funcs.h
  charset/
      *.c *.h [1]
  ops/
      string.ops
  t/
      op/string_cs.t

[1] especially char typetable[] and usage of it
Anyway, that's another very useful self-contained task that I'd be glad to have a volunteer for.
Yep.
Pm
leo