Re: More character matching bits

Bryan C. Warnock Tue, 12 Jun 2001 20:53:37 -0700
On Tuesday 12 June 2001 10:58 pm, Bryan C. Warnock wrote:
> On Tuesday 12 June 2001 09:16 pm, Simon Cozens wrote:
> > On Tue, Jun 12, 2001 at 05:41:40PM -0700, Hong Zhang wrote:
> > > We should let external collator to handle all these fancy features.
> >
> > Phew, I've been saying this all along. :)
>
> I think we've *all* been saying that.  We just need to determine how.

We're going to need hooks.  And superglue.  And some of them little 
sparklies.

Here's a rough cut of what does and doesn't need to be externalized for locale 
handling.  (Most of this could probably be done using the regular 'custom 
regex' hack As Seen On TV.  Except you miss interpolated variables that way. 
But the hack is still good, and a good argument for *not* hooking some 
potentially hairy code in.)

RE Feature     Override       Create New
----------------------------------------
switches       'i' only       yes
anchors        no             no 
escape         no             no
alternation    no             no
grouping       no             no
classes        no*            yes
quantifiers    no             no
backreferences no             no
standard escs  no             no
ext. escs      yes            yes
tag classes    yes            yes
ext syntax     no             no
ranges         yes            no


- Switches.  Both the m//imsgo and (?imsg) kind.  Most of the switches have 
a fairly non-specialized meaning, so it doesn't make sense to allow 
overloading them.  (I couldn't think of any separate context for a "line")  
'g', 'o', 'e', 'c', 'd', 's', 'm' all seem relatively straightforward.  'i' 
could be considered locale dependent, and potentially overridden.  Since 
switches, either appended or embedded, really just alter the parameters the 
regex is being handled under, it should also be straightforward to create 
new ones.  (They would be ignored (or warn under -w, or die under a strict 
re?) if the current locale doesn't implement them.)
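
To make the ignore/warn/die choice concrete, here is a minimal sketch.  It is 
Python standing in for whatever the real hook would look like, and everything 
in it is an assumption for illustration: the name usable_switches, the 
policy strings, and the idea that a locale simply hands over the set of extra 
switches it implements.

```python
import warnings

# Assumption: the switches every regex understands, whatever the locale.
BUILTIN_SWITCHES = frozenset("imsgo")

def usable_switches(switches, locale_switches=(), policy="warn"):
    """Return the switches that can be honoured; treat the rest per policy.

    policy is "ignore", "warn" (like -w) or "die" (like a strict re).
    """
    known = BUILTIN_SWITCHES | set(locale_switches)
    unknown = [s for s in switches if s not in known]
    if unknown:
        message = "switch(es) not implemented by this locale: " + "".join(unknown)
        if policy == "die":
            raise ValueError(message)
        if policy == "warn":
            warnings.warn(message)
    # Only the switches somebody actually implements reach the regex engine.
    return "".join(s for s in switches if s in known)
```

With a locale that registers 'z', usable_switches("iz", locale_switches="z") 
passes both switches through; without it, the 'z' is dropped and handled 
according to the policy.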

- Anchors.  ^, $, \A, \Z, \z, \b, \G.  Since the definition of a line (see 'm' 
and 's' above) isn't changing, the line oriented assertions don't need 
adjusting.  String anchors are straightforward, as is \G.  \b *could* be 
locale-defined independent of \w\W or \W\w, but I think it'd be better to 
stick with that definition, and allow the locale to override \w, instead.  I 
think adding new anchor positions may be more trouble than any gains are 
worth.
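
A small sketch of why overriding \w is enough.  This is Python, not Perl, and 
the helper names (boundaries, hyphen_is_word) are made up for illustration.  
It computes \b positions purely from a "is this a word character" predicate, 
so swapping the predicate moves the boundaries without touching the anchor 
itself.

```python
import re

def boundaries(text, is_word):
    # A boundary is any position where exactly one neighbour is a word
    # character: the \w\W or \W\w rule, with string ends counting as \W.
    flags = [False] + [is_word(ch) for ch in text] + [False]
    return [i for i in range(len(text) + 1) if flags[i] != flags[i + 1]]

def default_is_word(ch):
    # The engine's stock idea of \w.
    return re.fullmatch(r"\w", ch) is not None

def hyphen_is_word(ch):
    # Hypothetical locale override: '-' also counts as a word character.
    return ch == "-" or default_is_word(ch)

text = "well-known fact"
plain = boundaries(text, default_is_word)    # same positions as the stock \b
custom = boundaries(text, hyphen_is_word)    # "well-known" is now one word
```

With the stock predicate the positions agree with what the engine's own \b 
reports; with the override, the boundaries around the hyphen disappear.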

- The Escape character.  '\'.  It works.  Why change it?

- The Alternation character.  '|'.  Ditto.

- Grouping.  Another ditto.  Overriding parentheses would also really screw 
up the extended regex instructions.

- Classes.  The externally-defined classes (and I'll conveniently omit the 
non-[] character classes \w \s \d and their inverses) should be static, 
since they aren't ours to adjust.  (Like POSIX definitions, or Unicode 
definitions.)  Perlish character classes should be allowed to be redefined, 
and new ones should be allowed to be created.  (One could think of the 
Perlish character classes as custom created character classes.)

- Quantifiers.  Leave them alone.  They've never hurt anybody.  Any new 
quantifiers would really just be syntactic shortcuts for something like "an 
odd number of these."

- Backreferences.  Leave these be, too.

- Standard escape sequences.  The escape sequences that equate to one 
character (either classically predefined, or as a data reference).  '\t', 
'\a', '\nnn', '\c', '\x', '\N', etc.  It doesn't make sense to override 
these, and if we could create them, then they wouldn't be standard.  ;-)

- Extended escape sequences.  '\u' and '\l', for instance.  They're already 
tied to locales, so we should probably allow the locales to handle them.  
And if they can handle them, that should be enough of a hook to allow 
creating new ones.  But then again, that may cause a strain in processing, 
since you could drastically change the behavior of the regexen.  These are 
almost like the switches, and with a little imagination, work the same way, 
except for what they affect:
   m//i - the whole string, except where negated
   (?i) - to end of string (or negation)
   ((?i)    ) - within outer parentheses
   \u  - next character
   \U - till \E
Larry was already suggesting that \Q .. \E would be changed (but after Apo 
2, he was reconsidering something).  Perhaps the rest of the range modifiers 
(\U \L) could become switches - it works much the same way - so that you've 
simplified syntax to:
   m//switches - the whole string, except where negated
   (?switches) - to end of string (or negation)
   ((?switches)   ) - within outer parentheses   
That's three rules (or four rules, if you consider that the standard escape 
sequences are reflexive), reduced to two.  With quotemeta up in the air.
   m/$hello/u - match uc $hello?
   m/(?u)$hello/ - same thing
Oh, well.  Diversionary.  But interesting nonetheless...
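
For what the switch scopes look like in practice, here is a rough Python 
sketch.  Python's re module has a whole-pattern flag and a scoped (?i:...) 
group, which are close enough to m//i and ((?i) ...) to show the idea; it is 
not a claim about how Perl would do it.  The match_u helper is hypothetical 
and only imitates the imagined 'u' switch by uppercasing the interpolated 
variable before matching.

```python
import re

# Whole-pattern switch, comparable to m//i.
whole = re.compile(r"hello world", re.IGNORECASE)

# Scoped switch, comparable to ((?i) ... ): only "hello" ignores case.
scoped = re.compile(r"(?i:hello) world")

def match_u(variable, subject):
    # Hypothetical 'u' switch: match the uppercased value of the variable.
    return re.search(re.escape(variable.upper()), subject) is not None
```

So whole matches "HELLO WORLD", scoped matches "HELLO world" but not 
"hello WORLD", and match_u("hello", ...) only finds "HELLO".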

- Tag Classes.  The escape sequences that equate to Perl character classes.  
\w \s \d.  Locales could override those, and possibly create new ones.  (But 
if they create new ones, how do you differentiate between reflexive 
customizations - \v = [aeiouy] - from the extended escapes above, where you 
affect the next character?)
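
The reflexive kind, at least, can be sketched as a plain rewrite step before 
the regex is compiled.  The following is Python for illustration only; 
LOCALE_CLASSES and expand_tag_classes are invented names, and the table is 
the \v = [aeiouy] example from above.  It does nothing for escapes that 
affect the next character, and it does not handle a tag class that appears 
inside a [] class.

```python
import re

# Hypothetical locale table: tag-class letter -> the class it stands for.
LOCALE_CLASSES = {"v": "[aeiouy]", "V": "[^aeiouy]"}

def expand_tag_classes(pattern, table=LOCALE_CLASSES):
    # Walk the pattern one backslash escape at a time, so that an escaped
    # backslash followed by 'v' is left alone.
    def swap(match):
        return table.get(match.group(1), match.group(0))
    return re.sub(r"\\(.)", swap, pattern, flags=re.DOTALL)

vowel_run = re.compile(expand_tag_classes(r"\v+"))
```

Escapes the locale doesn't define, such as \d, pass through untouched, so the 
stock classes keep working alongside the new ones.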

- Extended Syntax.  Leave it alone.  The (?... space is sacred.  And it's 
probably too much work to allow a locale to define brand new functionality 
like that.

- Ranges.  A-Za-z for ASCII, EBCDIC, and UNICODE.  Locales could translate 
ranges to their particular collation order.  Or choose to slap the user, 
which would please Jarkko.
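
Translating a range might look something like this sketch.  Again Python, 
again hypothetical: COLLATION is a toy ordering I made up, where a real one 
would come from the external collator, and range_to_class is an invented 
name.  The range is expanded to every character that collates between the 
endpoints, and a reversed range raises an error, which is one way to slap the 
user.

```python
import re

# Hypothetical collation sequence for a toy locale.
COLLATION = "aäbcdefghijklmnoöpqrstuüvwxyz"

def range_to_class(first, last, order=COLLATION):
    # Find both endpoints in the locale's order, not in code point order.
    lo, hi = order.index(first), order.index(last)
    if lo > hi:
        raise ValueError(f"range {first}-{last} is reversed in this collation order")
    # Spell the range out as an explicit class the engine can use as-is.
    return "[" + "".join(re.escape(ch) for ch in order[lo:hi + 1]) + "]"
```

Under this toy ordering, range_to_class("a", "c") matches "ä", which a 
code-point [a-c] would not.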

Did I miss anything?  Like the boat?

-- 
Bryan C. Warnock
[EMAIL PROTECTED]