Re: is \1 vs $1 a necessary distinction?
From: "Dave Storrs" <[EMAIL PROTECTED]>

> Both \1 and $1 refer to what is matched by the first set of parens in a
> regex. AFAIK, the only difference between these two notations is that \1
> is used within the regex itself and $1 is used outside of the regex. Is
> there any reason not to standardize these down to one notation (i.e.,
> eliminate one or the other)?

\1 came from sed and friends. I think an early driving force was
maintaining familiarity with things like awk and sed. Even today there are
still people who switch to and from other reg-ex languages. Emacs is the
most common for me (though I still dabble with awk). I don't see a real
advantage in taking out \1; it is very likely to needlessly break legacy
code, and additionally confuse the various developers who have a habit of
using \1.

On the other hand, the use of $1 with substitutions is important for
consistency. When you write s/.../.../e, you're going to need to use a
substitution variable; "\1" just doesn't fit.

    s/(...)/pre\1post/;   # works fine
    s/(...)/pre$1post/;   # is the question

I tend to use the latter only because I sometimes switch to:

    s/(...)/func() . "$1post"/e;

for various reasons. I just try to standardize on $1, but that's just me.

Additionally, the use of $1 in the matching reg-ex is ambiguous, as in:

    m/(...).*?$1/;

Does it refer to the internal set of (...), or does it mean the previous
value of $1 before this match? This becomes non-obvious to the observer in
the following case:

    m/($keyword).*?$1/;

Here, our mindset is substitution of external variables; the casual
(non-seasoned) observer might not understand that it really means:

    m/($keyword).*?\1/;

My argument is that both \1 and $1 have their places, and limiting to one
type can be troublesome. Plus, TMTOWTDI. :)

-Michael
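The inside/outside split discussed here survives in other Perl-influenced engines too. A minimal sketch in Ruby (chosen only because its regex semantics mirror Perl's on this point; the variable names are mine):

```ruby
# \1 inside the pattern is a backreference to capture group 1;
# $1 outside the pattern holds what group 1 matched.
text = "hello hello world"

doubled = nil
if text =~ /\b(\w+) \1\b/   # \1 must equal whatever (\w+) just matched
  doubled = $1              # $1 is only meaningful after the match
end

# On the replacement side of a substitution, the backslash form is used,
# much as Perl allows pre$1post in s///.
swapped = "abc".sub(/(b)/, '[\1]')
```

After this runs, `doubled` is the repeated word and `swapped` is the input with the group bracketed.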
Re: RFC 308 (v1) Ban Perl hooks into regexes
> There is, but as MJD wrote: "it ain't pretty". Now, semantic checks or
> assertions would be the only reason why I'd expect to be able to execute
> perl code every time a part of a regex is successfully parsed. Simply
> look at RFC 197: a syntactic extension to regexes just to check if a
> number is within a range! That is absurd, isn't it? Would a simple way
> to include localized tests, *any* test, make more sense?

I'm trying to stick to a general philosophy of what's in a reg-ex, and I
can almost justify assertions since, as you say, \d, ^, $, (?=), etc. are
these very sort of things. I've been avoiding most of this discussion
because it's been so odd; I can't believe they'll ultimately get accepted.

Given the argument that it's unlikely that (?{code}) has been implemented
in production, I can almost see changing its semantics. From what I
understand, the point would be to run some sort of perl code and return
defined / undefined, where undefined forces a back-track. As you said, we
shouldn't encourage full-fledged execution (since core dumps are common).

I can definitely see simple optimizations such as (?{$1 op const}), though
other interesting things such as (?{exists $keywords{ $1 }}) might
proliferate. That would expand to the general purpose (?{ isKeyword( $1 ) }),
which then allows function calls within the reg-ex, which is just asking
for trouble. One restriction might be to disallow various op-codes within
the reg-ex assertion: namely user-function calls, reg-ex's, and most OS or
IO operations.

A very common thing could be an optimal /(?>\d+)(?{MIN < $1 && $1 < MAX})/,
where MIN and MAX are constants.

-Michael
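The (?{exists $keywords{ $1 }}) idea has no embedded-code counterpart in most engines, but its effect can be had by letting the engine propose candidates and applying the Perl-code assertion as an ordinary predicate. A sketch in Ruby (the keyword set and function name are mine, purely illustrative):

```ruby
# Hypothetical keyword table standing in for %keywords in the post.
KEYWORDS = { "if" => true, "while" => true, "sub" => true }

# Emulate /(\w+)(?{ exists $keywords{$1} })/: the engine finds each word,
# and a false predicate plays the role of the forced back-track, moving
# on to the next candidate.
def first_keyword(src)
  src.scan(/\w+/).find { |w| KEYWORDS.key?(w) }
end
```

So `first_keyword("foo bar while baz")` skips the non-keywords and lands on the first table hit.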
Re: RFC 308 (v1) Ban Perl hooks into regexes
> On 25 Sep 2000 20:14:52 -, Perl6 RFC Librarian wrote:
>
> >Remove C, C and friends.
>
> I'm putting the finishing touches on an RFC to drop (?{...}) and replace
> it with something far more localized, hence cleaner: assertions, also in
> Perl code. That way,
>
> /(?
>
> would only match integers between 0 and 255.
>
> Communications between Perl code snippets inside a regex would be
> strongly discouraged.

I can't believe that there currently isn't a means of killing a back-track
based on perl code. Looking through perlre, it seems like you're right. I'm
not really crazy about breaking backward compatibility like this, though.
It shouldn't be too hard to find another character sequence to perform your
above job. Beyond that, there's a growing rift between reg-ex extenders and
purifiers.

I assume the functionality you're trying to produce above is to find the
first bare number that is less than 256 (your above would match the 25 in
256). Easily fixed by inserting (?!\d) between the second and third
aggregates. If you were to be more strict, you could more simply apply
\b(\d+)\b... In any case, the above is not very intuitive to the casual
observer, as might be:

    while ( /(\d+)/g ) {
        if ( $1 < 256 ) {
            $answer = $1;
            last;
        }
    }

Likewise, complex matching tokens are the realm of a parser (I'm almost
getting tired of saying that). Please be kind to your local maintainer;
don't proliferate n'th order code complexities such as recursive or
conditional reg-ex's. Yes, I can mandate that my work doesn't use them, but
it doesn't mean that CPAN won't (and I often have to reverse engineer CPAN
modules to figure out why something isn't working).

That said, nobody should touch the various relative reg-ex operators. I
look at reg-ex as a tokenizer, and things like (?>...) which optimizes
reading, and (?
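The while loop sketched in this post translates almost verbatim to other languages; here it is in Ruby (function and variable names are mine), which makes the "match, then test in plain code" alternative concrete:

```ruby
# Find the first bare number under 256, trying each match in turn
# instead of encoding the numeric test inside the pattern itself.
def first_under_256(line)
  answer = nil
  line.scan(/(\d+)/) do |(num)|
    if num.to_i < 256
      answer = num
      break
    end
  end
  answer
end
```

Note that because \d+ is greedy, "256" is consumed whole and correctly rejected, rather than yielding a spurious "25" — the very pitfall the post points out.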
Re: RFC 308 (v1) Ban Perl hooks into regexes
From: "Simon Cozens" <[EMAIL PROTECTED]>

> > A lot of what is trying to happen in (?{..}) and friends is parsing.
>
> That's not the problem that I'm trying to solve. The problem I'm trying
> to solve is interdependence. Parsing is neither here nor there.

Well, I recognize that your focus was not on parsing. However, I don't
feel that perl-abstractness is a key deliverable of perl. My comment was
primarily on how the world might be a better place with reg-ex's not
getting into algorithms that are better solved elsewhere. I just thought
it might help your cause if you expanded your rationale.

-Michael
Re: RFC 308 (v1) Ban Perl hooks into regexes
From: "Hugo" <[EMAIL PROTECTED]>

> :Remove C, C and friends.
>
> Whoops, I missed this bit - what 'friends' do you mean?

Going by the topic, I would assume it involves (?(cond) true-exp |
false-exp). There's also $^R, or whatever it was, that holds the result of
(?{ }). Basically, the code-like operations found in perl 5.005 and 5.6's
perlre.

-Michael
Re: RFC 308 (v1) Ban Perl hooks into regexes
> Ban Perl hooks into regexes
>
> =head1 ABSTRACT
>
> Remove C, C and friends.

At first, I thought you were crazy; then I read:

> It would be preferable to keep the regular expression engine as
> self-contained as possible, if nothing else to enable it to be used
> either outside Perl or inside standalone translated Perl programs
> without a Perl runtime.

Which makes a lot of sense in the development field. Tom has mentioned
that the reg-ex engine is getting really out of hand; it's hard enough to
document clearly, much less be understandable to the maintainer (or even
the debugger).

A lot of what is trying to happen in (?{..}) and friends is parsing. To
quote Star Trek VI: The Undiscovered Country, "Just because we can do a
thing, doesn't mean we should." Tom and I have commented that parsing
should be done in a PARSER, not a lexer (like our beloved reg-ex engine).
RecDescent and Yacc do a wonderful job of providing parsing power within
perl.

I'd suggest you modify your RFC to summarize the above: that (?{}) and
friends are parsers, and we already have RecDescent / etc., which are much
easier to understand and don't require too much additional overhead.

Other than the inherent coolness of having hooks into the reg-ex code, I
don't really see much real use for it other than debugging, e.g.
(?{ print "Still here\n" }). I could go either way on the topic, but I'm
definitely of the opinion that we shouldn't continue down this dark path
any further.

-Michael
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
- Original Message -
From: "Jonathan Scott Duff" <[EMAIL PROTECTED]>
Subject: Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145
(alternate approach))

> How about qy() for Quote Yacc :-) This stuff is starting to look
> more and more like we're trying to fold lex and yacc into perl. We
> already have lex through (?{code}) in REs, but we have to hand-write
> our own yacc-a-likes.

Though you can do cool stuff in (?{code}), I wouldn't quite call it lex.
First off, we're dealing with an NFA instead of a DFA, and at the very
least, that gives you back-tracking. True, locals allow you to preserve
state to some degree. But the following is as close as I can consider
(?{code}) a lexer:

    sub lex_init {
        my $str = shift;
        our @tokens;
        $str =~ /
            \G
            (?{ local @tokens; })
            (?:
                TokenDelim(\d+) (?{ push @tokens, [ 'digit', $1 ] })
              | TokenDelim(\w+) (?{ push @tokens, [ 'word',  $1 ] })
            )
        /gx;
    }

    sub getNextToken { shift @tokens; }

I'm not even suggesting this is a good design; just showing how awkward it
is. Another problem with lexing in perl is that you pretty much need the
entire string before you begin processing, while a good lexer only needs
the next character. Ideally, this is a character stream. Already we're
talking about a lot of alteration and work here... not something I'd be
crazy about putting into the core.

-Michael
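For contrast, the same token loop written against a conventional left-to-right scanner, rather than side effects buried in one regex. A sketch using Ruby's standard-library StringScanner (the token names and skip rule are mine):

```ruby
require 'strscan'

# A minimal lexer that consumes input one token at a time, keeping the
# regexes as plain tokenizers and the bookkeeping in ordinary code.
def lex(str)
  s = StringScanner.new(str)
  tokens = []
  until s.eos?
    if s.scan(/\d+/)
      tokens << [:digit, s.matched]
    elsif s.scan(/\w+/)
      tokens << [:word, s.matched]
    else
      s.getch   # skip whitespace / anything unrecognized
    end
  end
  tokens
end
```

Each call to `scan` advances the cursor only on a match at the current position, which is the "only needs the next character" behavior the post asks for.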
Re: RFC 145 (alternate approach)
- Original Message -
From: "Richard Proctor" <[EMAIL PROTECTED]>
Sent: Tuesday, September 05, 2000 1:49 PM
Subject: Re: RFC 145 (alternate approach)

> On Tue 05 Sep, David Corbin wrote:
> > Nathan Wiger wrote:
> > > But, how about a new ?m operator?
> > > /(?m<<|[).*?(?M>>|])/;
>
> There already is a (?m
>
> Current Use in perl5
>
> (?#           comment
> (?imsx        flags
> (?-imsx       flags
> (?:           subexpression without bracket capture
> (?=           zero-width positive look ahead
> (?!           zero-width negative look ahead
> (?<=          zero-width positive look behind
> (?<!          zero-width negative look behind
> (?{code}      Execute code
> (??{code}     Execute code and use result as pattern
> (?>           Independent subexpression
> (?(condition)yes-pattern
> (?(condition)yes-pattern|no-pattern
>
> Suggested in RFCs either current or in development
>
> (?$foo=       suggested for assignment (RFC 112)
> (?%foo=       suggested for hash assignment (RFC 150?)
> (?@foo        suggested list expansion (?:$foo[0] | $foo[1] | ...) ? (RFC 166)
> (?Q@foo)      Quote each item of lists (RFC 166)
> (?^pattern)   matches anything that does not match pattern
>               (RFC 166 but will be somewhere else on next rewrite [1])
> (?F           Failure tokens (RFC in development by me [1])
> (?r),(?f)     Suggested in Direction Control RFC 1
> (?&           Boolean regexes (RFC in development [1])
> (?*{code})    Execute code with pass/fail result (RFC in development [1])
>
> a,b,c,d,e, ,g,h, ,j,k,l, ,n,o,p,q, , ,t,u,v,w,x,y,z
> A,B,C,D,E, ,G,H,I,J,K,L,M,N,O,P, ,R,S,T,U,V,W,X,Y,Z
> 0,1,2,3,4,5,6,7,8,9
> `_,."+[];'~)

Ok, I've read through some of the archives, and thought this was a good
starting point. I haven't seen any discussion on an obvious solution
(though in another email, I suggested that this approach should be
foregone in favor of a parsing approach... but one thing at a time).

There are two general problems as I see it. First, you have to be able to
specify exactly what you're matching. Obviously, generically matching
"[<(`" etc. is going to be upset if your nesting has simple things like
" a < 5 " or "I'm going home, it's hot". A design goal, therefore, should
be to explicitly state the matching characters. Second, you need to be
able to apply additional expression syntax to match inside the nesting.

An additional problem occurs when you suggest using pragmas to specify
delimiters. It could be a performance hit, if not a developer's nightmare.
When I run eval, must I always set the pragma, just in case there is some
weird scoping problem? Same problem as when using all global variables
(and the 'local' keyword. God I hate that thing).

Therefore, I suggest a commonly used form:

    /(?N [ { ] . )/x

Note that I use N, which stands for nesting, instead of the redundant
'M'atch. I don't know how well character-based op-codes will be accepted;
as pointed out above, the symbol-space is shrinking fast. The dots
describe further matching / capturing within the delimiters. Thus:

    /A (?N [ { ] ) B/x

will match 'A' followed by a bracket grouping (anything therein is fine),
then followed by 'B'.

    /A (?N [ { ] ( .* ) ) B/x

does the same as above, but captures the internal contents (excluding the
delimiters).

    /A ( (?N [ { ] ) ) B/x

will capture all the contents, including the delimiters.

    /A (?N [ [ ( ] ( .* ) ) B/x

Same as before, but with squares and parentheses. Note that delim
specifiers can obey the same rules as normal character classes; thus
[ [ ( { < ] means collect the entire group. POSIX classes can be used for
all of them, as in [=open_braces=] (don't care what the phrase actually
is). The reason I chose this is because we are essentially doing a
character class, so we might as well explicitly use one; it makes more
logical sense.

Note that to make emacs happy, you should be able to escape all the
one-way delimiters, as in [ \[ \( \{ \< ]. That might also make it easier
to read, explicitly showing that these are being treated as characters,
and not as actual operators.

As for special operations such as (/* ... */ ), I would recommend the
usage of named character classes: [=c_comment=], for example. I'm not sure
how those classes are defined, but this obviously requires the system to
be extensible (RFC anyone?). Of course, this violates my issue of using
pragmas to alter the operation of reg-ex's; most likely only built-in
types should work.

Another feature could be to treat the end of a matching brace as an
end-of-line. Thus the above .* will properly exit. If this turns out to
not work, then .* can necessarily be replaced by .*?. The advantage of
this is in nested expressions, as in:

    $r_kw        = qr/Keyword \s* .* /x;
    $r_lisp_expr = qr/ (?N [ ( ] $r_kw ) /x;
    $line = <>;
    $line =~ $r_lisp_expr;

But this would also have worked with:

    $r_kw = qr/Keyword \s* .* $/x;

since '$' would treat ')' as '\n'.

The main advantages of this approach are: That you can still pre-compile
an expression and guarantee that it won't need recompiling, and that it'll
always act the same. That you can nest the puppies with complete lack of
ambiguity, and littl
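As it happened, Perl-family engines later solved the nesting problem with recursive subpatterns rather than a dedicated (?N op. A sketch in Ruby, whose \g<name> subexpression call plays that role (the pattern and names are mine, not part of the proposal):

```ruby
# Match a balanced {...} group, however deeply nested -- roughly the job
# the proposed /A (?N [ { ] ( .* ) ) B/x form is after. The named group
# "group" calls itself via \g<group> for each nested level.
BRACED = /(?<group>\{(?:[^{}]|\g<group>)*\})/

m = "A { x { y } z } B".match(BRACED)
whole    = m[:group]      # delimiters and everything between them
interior = whole[1..-2]   # strip the outer braces
```

This gives both capture styles the post distinguishes: the full group including delimiters, and the interior excluding them.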
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
- Original Message -
From: "Jonathan Scott Duff" <[EMAIL PROTECTED]>
Subject: Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145
(alternate approach))

> On Wed, Sep 06, 2000 at 08:40:37AM -0700, Nathan Wiger wrote:
> > What if we added special XML/HTML-parsing ?< and ?> operators?
>
> What if we just provided deep enough hooks into the RE engine that
> specialized parsing constructs like these could easily be added by
> those who need them?
>
> -Scott

Ok, I've avoided this thread for a while, but I'll make my comment now.

I've played with several ideas of reg-ex extensions that would allow
arbitrary "parsing". My first goal was to be able to parse perl-like text,
then later simple nested parentheses, then later nested xml as with this
thread. I have been able to solve these problems using perl 5.6's
recursive reg-ex's and inserted procedure code. Unfortunately this isn't
very safe, nor is it 'pretty' to figure out by a non-perl-guru. What's
more, what I'm attempting to do with these nested parens and xml is to
_parse_ the data. Well, guess what guys, we've had decades of research
into the area of parsing, and we came out with yacc and lex.

My point is that I think we're approaching this the wrong way. We're
trying to apply more and more parser power into what classically has been
the lexer / tokenizer, namely our beloved regular-expression engine. A
great deal of string processing is possible with perl's enhanced NFA
engine, but at some point we're looking at perl code that is inside out:
all code embedded within a reg-ex. That, boys and girls, is a parser, and
I'm not convinced it's the right approach for rapid design, and definitely
not for large-scale robust design.

As for XML, we already have lovely c-modules that take care of that. You
even get your choice: call per tag, or generate a tree (where you can
search for sub-trees). What else could you want? (Ok, stupid question, but
you could still accomplish it via a customized parser.)

My suggestion, therefore, would be to discuss a method of incorporating
more powerful and convenient parsing within _perl_; not necessarily
directly within the reg-ex engine, and most likely not within a reg-ex
statement. I know we have Yacc and Parser modules. But try this out for
size: Perl's very name is about extraction and reporting. Reg-ex's are
fundamental to this, but for complex jobs, so is parsing.

After I think about this some more, I'm going to make an RFC for it. If
anyone has any hardened opinions on the matter, I'd like to hear from you
while my brain churns.

-Michael
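The lexer-feeds-parser split argued for here takes only a few lines to make concrete. A minimal recursive-descent sketch (Ruby; the grammar and names are mine) in which a regex does only the lexing and a recursive function does the parsing:

```ruby
# Parse "(a (b c) d)"-style input into nested arrays: the regex is a pure
# tokenizer, and the tree-building -- the parser's job -- lives in code.
def parse(src)
  tokens = src.scan(/[()]|[^()\s]+/)   # lexer: parens or bare words
  read_list = lambda do
    node = []
    while (t = tokens.shift)
      case t
      when "(" then node << read_list.call   # descend one nesting level
      when ")" then return node              # close the current level
      else          node << t                # plain atom
      end
    end
    node
  end
  read_list.call
end
```

So `parse("(a (b c) d)")` yields a nested-array tree, a result no single flat regex can hand back directly.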
Re: RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()
> > Simple solution.
> >
> > If you want to require formats such as m/.../ (which I actually think
> > is a good idea), then make it part of -w, -W, -ww, or -WW, which would
> > be a perl6 enhancement of strictness.
>
> That's like having "use strict" enable mandatory perlstyle compliance
> checks, and rejecting the program otherwise. Doesn't seem sensible.
>
> --tom

Well, use strict refuses soft-links, and -w refuses use of undefined
values. It may be that these are easy to check for at the low level, and
therefore are candidates for flag-based operations. But for style, I don't
see why the interpreter can't also check for various non-obscure syntaxes
/ styles. I doubt that requiring m/.../ really would help parsing
performance any, though.

Compatibility is going to have to be maintained somehow. And we can either
have some sort of perl6 designator (such as the pragma) to designate
incompatible (and otherwise ambiguous) code, or we're going to have to
continue tacking on syntactic sugar to legacy code.

-Michael
Re: RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()
> If you want to change STUPID behaviour that should be avoided by current
> programs (such as empty regexes) fine.

Simple solution.

If you want to require formats such as m/.../ (which I actually think is
a good idea), then make it part of -w, -W, -ww, or -WW, which would be a
perl6 enhancement of strictness. Likewise, things like legacy Formats
would not be allowed in -WW. This gives flexibility to the programmer, and
can help the interpreter make optimizations where necessary.

If you needed legacy module compatibility, then maybe we should use
pragmas like the following:

    use 6.0;

or

    use 6.0 ':no-compat';

Programs and modules could assign themselves to a compatibility contract
(lacking the statement defaults to perl5 compat). The reason for having
'use' instead of 'require' is that the interpreter can turn on
compile-time warnings / optimizations as it goes from module to module.

Maybe this should be an RFC.

-Michael