Dan Sugalski <[EMAIL PROTECTED]> writes:

> 
> We probably also ought to answer the question "How accommodating to 
> non-latin writing systems are we going to be?" It's an uncomfortable 
> question, but one that needs asking. Answering by Larry, probably, but 
> definitely asking. Perl's not really language-neutral now (If you think so, 
> go wave locales at Jarkko and see what happens... :) but all our biases are 
> sort of implicit and un (or under) stated. I'd rather they be explicit, 
> though I know that's got problems in and of itself.

Perl came from ASCII-centric roots, so it's likely that most of our
biases are ASCII-centric.  And for a couple of reasons, it's going to
be hard to deal with that:

1. Backwards compatibility with existing Perl practice,

and

2. To do language-neutral right is -really- hard; look at locales and
Unicode as examples.

As such, instead of trying to make Perl work for all languages out of
the box, why not make Perl's language handling extensible from within
the language and have it be as language-free as possible (except for
backwards compatibility stuff) out of the box?

Examples of what we can do:

I. Make ranges work on Unicode code-points (if they don't already).
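As a sketch of what that buys you (in Python, since any new Perl syntax
here is still hypothetical): a range built directly from code points
naturally covers a whole Unicode block.

```python
# Build the hiragana block U+3041..U+3094 as a plain code-point range.
hiragana = [chr(cp) for cp in range(0x3041, 0x3095)]
```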

II. Make POSIX-style character classes (e.g. [:space:])
user-definable and modifiable.  That way, a Unicode::Japanese module
could do something like:

[:hiragana:] = /[\x{3041}-\x{3094}]/;
[:katakana:] = /[\x{30A1}-\x{30F4}]/;
[:kana:] = [:hiragana:] + [:katakana:];

and then each of those three classes could be used in RE's when needed.
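To make the idea concrete, here is roughly what such user-defined
classes amount to, sketched in Python (the Unicode::Japanese module and
the [:hiragana:] syntax above are hypothetical; this just splices
named class definitions into an ordinary regex):

```python
import re

# User-definable character classes, kept as raw range strings so they
# can be composed and then dropped into a bracketed class.
classes = {
    "hiragana": "\u3041-\u3094",
    "katakana": "\u30A1-\u30F4",
}
classes["kana"] = classes["hiragana"] + classes["katakana"]

# Use the composed "kana" class in a regular expression.
kana_re = re.compile("[%s]+" % classes["kana"])
m = kana_re.search("romaji then \u304B\u306A\u30AB\u30CA here")
```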

III. Allow for character equivalence tables to be user-definable.
This would allow for the /i behavior of RE's to be generalized.

As an example, consider the following code:

$kanainsensitive = td/[:hiragana:]/[:katakana:]/;

if ($japanesetext =~ m/$japanesepattern/i{$kanainsensitive}) {
   print "$japanesetext matched $japanesepattern\n";
}

The new td// construct would create a character equivalence table that
could be used with a generalized /i option to indicate that hiragana
and katakana should be treated equivalently.
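The td// construct is hypothetical, but the underlying mechanism can be
sketched in Python: build a translation table mapping each katakana
letter to its hiragana counterpart (the two blocks are a constant 0x60
apart), and fold both sides through it before matching, much as /i
folds case on both sides.

```python
# Equivalence table: katakana U+30A1..U+30F4 -> hiragana U+3041..U+3094.
KATA_TO_HIRA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F5)}

def kana_fold(s):
    """Map any katakana in s to the equivalent hiragana."""
    return s.translate(KATA_TO_HIRA)

def kana_insensitive_match(text, pattern):
    # Fold both sides, so hiragana and katakana are treated equivalently.
    return kana_fold(pattern) in kana_fold(text)
```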

A more sophisticated example could be:

$vowelsoptional = td/aeiouAEIOU//;

which would make vowels equivalent to no characters at all.
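An equivalence to "no characters at all" is just deletion before
comparison; a minimal Python sketch of that vowels-optional idea:

```python
import re

def vowel_blind(s):
    """Fold s by deleting vowels, so e.g. British/American spellings match."""
    return re.sub("[aeiouAEIOU]", "", s)
```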

For certain applications, it would be useful to allow matches of more
than one character:

$kanainsensitive +=   td/\x{304C}\x{3042}/\x{30AC}\x{30FC}/r
                    + td/\x{304D}\x{3044}/\x{30AD}\x{30FC}/r
                    + ... ;

In this case, it represents the fact that long vowels are represented
by one form in hiragana (HIRAGANA LETTER KA + HIRAGANA LETTER A), and
a different form in katakana (KATAKANA LETTER KA + KATAKANA-HIRAGANA
PROLONGED SOUND MARK).

I used a /r there to indicate that the two parts of the td/// are
regular expressions which are to be treated as equivalent.  That
would allow both of those lines above to be written as:

$kanainsensitive +=  td/([\x{304C}\x{304D}])\x{3042}/\1\x{30FC}/r;

It would also allow people to deal with combining forms, although
there are probably better ways than this.
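A crude Python sketch of that multi-character equivalence: canonicalize
the katakana "kana + prolonged sound mark" long-vowel spellings to their
hiragana "kana + vowel" spellings before comparing (only the two rows
from the example above; a real table would cover every row):

```python
# Pairs of (hiragana spelling, katakana spelling) for long vowels.
LONG_VOWELS = [
    ("\u304C\u3042", "\u30AC\u30FC"),  # GA + A  ==  GA + prolonged mark
    ("\u304D\u3044", "\u30AD\u30FC"),  # KI + I  ==  KI + prolonged mark
]

def fold_long_vowels(s):
    """Rewrite katakana long-vowel forms to the hiragana spelling."""
    for hira, kata in LONG_VOWELS:
        s = s.replace(kata, hira)
    return s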

IV.  Make the character class switches be redefinable, but default to
the current set.  That would allow someone who is doing lots of work
in Japanese to be able to use \w to mean kanji, hiragana, and katakana
instead of the default of [0-9A-Za-z_].
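Sketched by hand in Python, such a redefined \w is just a different
character class (assuming, for illustration, that "word character"
means the CJK unified ideographs plus both kana blocks):

```python
import re

# A hand-rolled "Japanese \w": kanji + hiragana + katakana.
JA_WORD = "[\u3041-\u3094\u30A1-\u30F4\u4E00-\u9FFF]+"
m = re.search(JA_WORD, "see \u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8 here")
```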

There are probably lots of things I overlooked, but if it can be done
cheaply, abstracting out the existing biases and making them
user-expandable/definable would probably go a long way towards getting
rid of language bias.

> 
>                                       Dan
> 
> --------------------------------------"it's like this"-------------------
> Dan Sugalski                          even samurai
> [EMAIL PROTECTED]                         have teddy bears and even
>                                       teddy bears get drunk
