Mike Lambert:
(a bunch of stuff about regexes)
No offense intended, but I had trouble understanding that, and I helped
come up with the thing. :^) So, I'll try to interpret.
In Perl 5, we came up against the problem of simply running out of
characters in regexes. To deal with this, Larry came up with the
(?_regex) syntax, where _ is some character. Although a clever use of
an otherwise impossible sequence, it's also gratuitously ugly.
Consider the many roles (?_) plays:
Non-capturing parentheses: (?:)
Look(ahead|behind)s: (?=), (?!), (?<=), (?<!)
Inline code: (?{}), (??{})
Inline modifiers: (?imsx-imsx), (?imsx-imsx:)
Conditionals: (?()), (?()|)
Comments: (?#)
Non-backtracking: (?>)
Obviously, this is getting out of hand--using more than one or two of
those constructs makes your regex much harder to read.
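For anyone who hasn't met them all, here is a rough illustration of a few of these constructs in action. It uses Python's re module rather than Perl, only because re accepts the same (?_) spellings for these particular constructs; the sample strings are made up for the example.

```python
import re

# Non-capturing parentheses: group for repetition without creating a capture.
m = re.match(r"(?:ab)+(c)", "ababc")
groups = m.groups()  # only the (c) group is captured

# Lookahead and negative lookahead: assert what follows without consuming it.
ahead = re.findall(r"foo(?=bar)", "foobar foobaz")
not_ahead = re.findall(r"foo(?!bar)\w*", "foobar foobaz")

# Lookbehind and negative lookbehind: assert what precedes the match.
behind = re.findall(r"(?<=\$)\d+", "$10 and 20")
not_behind = re.findall(r"(?<!\$)\b\d+", "$10 and 20")
```

Even in these tiny patterns, the (?_ prefixes take up a good share of the characters.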
Let's first tackle non-capturing parentheses and lookarounds. If we
think about what metacharacters are around, we can realize that {} is
only legal with numbers inside it. [0] That means that we can probably
reuse it. If we think about it, we can derive a few basic categories:
-consuming (_) or not (|) [1]
Reasoning: _ is fat, | is skinny
-positive (=) or negative (!)
Reasoning: same as in Perl 5
-forwards (>) or backwards (<)
Reasoning: same as in Perl 5
The characters in parentheses are prefix characters that indicate which
is to be used. A simple mapping of the five things this section covers
follows:
Perl 5        Perl 6
------        ------
(?:regex)     {_=>regex}
(?=regex)     {|=>regex}
(?!regex)     {|!>regex}
(?<=regex)    {|=<regex} [2]
(?<!regex)    {|!<regex}
Obviously, that's a bit much to type. But if we define some reasonable
defaults, it becomes more manageable. By default, the specifier is _=>.
So here's a map of what you're more likely to see in a regex:
Perl 5        Perl 6
------        ------
(?:regex)     {regex}
(?=regex)     {|regex}
(?!regex)     {|!regex}
(?<=regex)    {|<regex}
(?<!regex)    {|!<regex}
However, the sharp reader might have noticed that there were three
possibilities missing from the above tables. That's right--we get free
features too!
{_!>regex} -- Nonsensical.
{_=<regex} -- Match backwards. [3]
{_!<regex} -- Nonsensical.
Well, one free feature--we end up with reversed regexes from this deal.
The final table looks like this:
Perl 5        Perl 6
------        ------
(?:regex)     {regex}
N/A           {<regex}
(?=regex)     {|regex}
(?!regex)     {|!regex}
(?<=regex)    {|<regex}
(?<!regex)    {|!<regex}
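The defaulting rule behind these tables can be sketched in a few lines of Python. This is only an illustration of the scheme as described: the names specifier and group and their parameters are hypothetical, and it assumes the short form simply drops every character that equals its default (_, =, or >).

```python
# Defaults for the three slots: consuming, positive, forwards.
DEFAULTS = ("_", "=", ">")

def specifier(consuming=True, positive=True, forwards=True, short=True):
    """Build the prefix for a {...} group (hypothetical helper)."""
    chars = (
        "_" if consuming else "|",
        "=" if positive else "!",
        ">" if forwards else "<",
    )
    if short:
        # Assumed rule: omit each character that matches its default.
        chars = tuple(c for c, d in zip(chars, DEFAULTS) if c != d)
    return "".join(chars)

def group(regex, **flags):
    """Wrap a regex in braces with the chosen specifier."""
    return "{" + specifier(**flags) + regex + "}"
```

For instance, group("regex", consuming=False, forwards=False) gives the lookbehind form {|<regex}, and passing short=False gives the long form {|=<regex}.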
He then went on to describe something I didn't understand at all.
Sorry.
--- BEGIN MY THOUGHTS ---
The only major drawback I can see to that is the naive user might type
{<b>.*?</b>}+ expecting a bunch of text in bold tags and getting a
lookbehind instead--so it may be wise to leave the | and _ specifiers
out of this altogether, and come up with a better way. I'll address
that point shortly.
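To make the pitfall concrete, here is a small Python sketch of how a parser might peel the specifier off the front of a {...} body. Nothing in the proposal pins down the parsing rule, so this is a guess: split_specifier is a hypothetical name, and it assumes the three slots are read in order with each one optional.

```python
def split_specifier(body):
    """Split a {...} body into (specifier, pattern) -- hypothetical parser."""
    i = 0
    # Slots in order: consuming, polarity, direction. Each may be omitted.
    for allowed in ("_|", "=!", "><"):
        if i < len(body) and body[i] in allowed:
            i += 1
    return body[:i], body[i:]

# The bold-tag example: the leading "<" is swallowed as a specifier.
spec, pattern = split_specifier("<b>.*?</b>")
```

Under these assumptions, spec comes out as "<" and the pattern that actually gets matched is "b>.*?</b>", which is not what the user meant.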
In the meantime, let's consider some of the other syntaxes. The inline
code things are a good opportunity for improvement--and they have a good
alternative. In Perl 5, ({ ought not to be legal, but it is--it's
hacked in to be the same as (\{. So, we can drop a question mark from
each of the block forms, getting ({code}) and (?{code}). However, we can
go even further by combining the two.
Here's how it works:
-If the code returns undef, we backtrack.
-If the code returns the empty string, we move on.
-If the code returns anything else, we interpolate that into the
regex.
So, we now just have ({}).
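The three rules can be sketched as a single dispatch on the block's return value. This is a Python mock-up, not a regex engine: run_block is a hypothetical name, None stands in for undef, and "interpolate" is assumed to mean "match the returned string as a regex at the current position".

```python
import re

def run_block(code, text, pos):
    """Apply the three rules to one ({...}) block at position pos.

    Returns the new position, or None to tell the caller to backtrack.
    """
    result = code()
    if result is None:
        # undef: the block fails, so the engine backtracks.
        return None
    if result == "":
        # Empty string: succeed without consuming anything.
        return pos
    # Anything else: treat it as a regex and match it right here.
    m = re.compile(result).match(text, pos)
    return m.end() if m else None
```

So a block returning "b+" at position 1 of "abbbc" would move the match on to position 4, while a block returning undef would send the engine back to try something else.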
Comments can go, since Larry has said that /x will be on by default
anyway.
That leaves conditionals, non-backtracking sections, inline modifiers,
and (maybe) non-capturing parens. We now have three characters that
aren't valid in these places: *, +, and ?.
My suggestion is this:
Thing               Syntax          Logic
-----               ------          -----
Conditionals        (?()|)          The question mark makes sense
                                    for a conditional.
Inline Modifiers    (?imsx-imsx)    Might as well be a
                                    little bit compatible.
Non-backtracking    (+)             + requires more
                                    than * does.
Non-capturing       (*)             Suggestions welcome.
                                    :^)
So, my final suggestions are:
Perl 5           Perl 6
------           ------
(?:)             (*)
(?=)             {}
(?!)             {!}
(?<=)            {<}
(?<!)            {<!} [4]
(?())            (?())
(?()|)           (?()|)
(?imsx-imsx)     (?imsx-imsx)
(?imsx-imsx:)    (?imsx-imsx:)
(?>)             (+)
(?{})            ({}) returning empty string
(??{})           ({}) returning a string or regex
(?#)             N/A--obsolete
Please feel free to comment on these.
[0] Perl won't be the first tool to take advantage of this--lex uses
something similar for named subexpressions.
[1] Neither of these characters is ideal, however. | looks like !, and
_ might reasonably be at the beginning of this sort of thing anyway.
Better suggestions are welcome.
[2] Mike originally had all the backwards matches as sexegers. I think
this is a bad idea, but feel obligated to mention that.
[3] This seems a bit useless to me too. It's probably more useful to
have a /r modifier on the entire regex.
[4] I changed the ordering for this one to avoid an ambiguity.
--Brent Dax <[EMAIL PROTECTED]>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)
#define private public
--Spotted in a C++ program just before a #include