Re: Perl 5's "non-greedy" matching can be TOO greedy!
>Nice summary, but I'm not buying what you're selling in the elaboration.

Then you lose, because I am not allowed to disagree with you anymore.
And everyone else has already written you off.

And the answer to "what breaks if minimal matching is overall but
maximal matching is local"--or even, "if we change it all"--is a
zillion programs, including just about any progressive match:

    while (/.*?(\w+)=(\S+)/g) {
        push @{ $h{$1} }, $2;
    }

I can't wait for that to match the rightmost one and then fail.

Bah.  >>/dev/null

--tom
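A minimal sketch of the progressive match above, for anyone who wants to watch it run. The input string and the printing loop are assumptions added for illustration; only the while loop comes from the post.

```perl
use strict;
use warnings;

# Hypothetical input; the original post shows only the loop itself.
my $line = "color=red size=10 color=blue";
my %h;

# Each /g iteration resumes where the previous match ended.  The minimal
# .*? skips as little as possible, so the pairs are found left to right.
while ($line =~ /.*?(\w+)=(\S+)/g) {
    push @{ $h{$1} }, $2;
}

print "$_: @{ $h{$_} }\n" for sort keys %h;
```

Under current semantics every pair is collected in order. If the .*? were instead resolved by some overall criterion, the first iteration could jump to a later pair and the loop would skip the earlier ones.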
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>Take. It. To. Private. Email. Please. I'm going to do better. I'm taking it to /dev/null. It's not worth my wasting my life over. Nobody agrees with this guy, so it doesn't matter. --tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!
And while I'm at it, consider /(.*)(.*)(.*)/, which we'll call /ABC/.
You need to be able to say all of these independently and in
conjunction with one another:

    whether segment A   is longest or shortest overall
    whether segment B   is longest or shortest overall
    whether segment C   is longest or shortest overall
    whether segment AB  is longest or shortest overall
    whether segment BC  is longest or shortest overall
    whether segment ABC is longest or shortest overall

Imagine wanting, in /ABC/, A and B to be minimal, C to be maximal,
AB to be maximal, BC to be minimal, and ABC to be maximal.

Does this not strike fear into your heart? The very notation we'd
have to devise should itself be plenty sufficient to give you serious
pause--and that's not even considering the heat-death problem of
guaranteed worst-case behavior that the word "overall" mandates.

Be very afraid.

--tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>At worst, this should take no more than double the amount of time that the
>single pass did, probably less. Hardly a cause to concern ourselves with
>the heat death of the universe.

Oh really?

We have shown that for the kind of global overall analysis that you
are asking for, in the general case, all possible paths must be
taken. You cannot short-circuit, because you must first consider all
possibilities and then weigh each valid result against each other
valid result.

Consider something like /.*/ or /.*?/. For a string of length N,
there are

    (N+1)(N+2) / 2

substrings that it can match. That means that an 80-byte string has
some 3321 possible substrings, all of which must be considered. In
the short-circuiting version, the Engine need consider but one single
solitary case for each of those. 3321 is not the double of 1.

Consider now something like /(.*)(.*)/ or /(.*?)(.*?)/ or /(.*)(.*?)/
or /(.*?)(.*)/. You now have

    ((N+1)(N+2))**2 / 4

cases to consider, or, in the case of an 80-byte string, some
11,029,041 possible choices. And with the current, normal, standard,
short-circuiting system, the Engine has to consider, hm... could it be
just one possibility?

And that's just with two wildcards. People are often writing more
than that.

Can you now see why this would be a problem? And how even in the
cases where it didn't actually break old programs (many of which it
would!) it would cause many of them to apparently hang, racing
for electron death?

--tom
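The counts above are easy to check. This sketch assumes nothing beyond the figures quoted in the message; the variable names are mine. It counts the (start, length) pairs a lone .* could cover in an N-character string, empty matches included, and compares that with the closed form.

```perl
use strict;
use warnings;

my $n = 80;    # string length used in the message

# Brute force: a substring can start at offsets 0..N, and from offset
# $start it can have any length from 0 to N - $start.
my $brute = 0;
for my $start (0 .. $n) {
    $brute += $n - $start + 1;
}

# Closed form for one wildcard, and its square for two independent ones.
my $one_wildcard  = ($n + 1) * ($n + 2) / 2;
my $two_wildcards = $one_wildcard ** 2;

print "one wildcard:  $one_wildcard (brute force: $brute)\n";
print "two wildcards: $two_wildcards\n";
```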
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>That would be a strange regexp, but I never suggested it. I suggested the
>regexp /b.*?d/ and pointed out that I believe "bd" is a more intuitive
>match than "d". That was the matching text, not the regexp, sorry
>if I didn't make that clear.

Fine. What you said is

    first find a b
    then find any non-newline, repeated 0 to N times
    then find a d

What part of "first find a b" do you expect a randomizing solution to?
That's very clear.

--tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>You can't explain why "d" matches without making reference to the
>absolute priority of the leftmost rule. "bd" would still make sense
>(locally) without reference to that rule.

Nope. Nope, nope, and nope. This /d/ thing, which is completely
unrealistic and non-real-worldly, says:

    find a b
    such that this is immediately followed by b
    such that this is immediately followed by b
    such that this is immediately followed by b
    such that this is immediately followed by c
    such that this is immediately followed by c
    such that this is immediately followed by c
    such that this is immediately followed by c
    such that this is immediately followed by d

If you think that people expect "find a b" to suddenly mean something
stochastic, you know different people than I do.

--tom
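For concreteness, here is a small sketch of the /b.*?d/ case under discussion. The subject string is an assumption of mine (the archive has lost the original strings); it merely has the shape the list above implies, with some leading filler.

```perl
use strict;
use warnings;

# Hypothetical subject string: filler, four b's, four c's, one d.
my $text = "aaabbbbccccd";

# "First find a b": the Engine commits to the leftmost b it can start at,
# and only then does the minimal .*? decide how far to extend.
my ($match) = $text =~ /(b.*?d)/;
my $offset  = $-[0];

print "matched '$match' starting at offset $offset\n";
```

The match begins at the first b, not at the last one, even though starting later would give a shorter string. The minimal quantifier only chooses among matches that share that starting point.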
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>On Fri, 15 Dec 2000, Tom Christiansen wrote:
>> >As for special-case rules, I believe that my proposed modification would
>> >REMOVE a special-case semantic rule, at the cost of added complexity at the
>> >implementation level.
>>
>> What is this alleged "special-case rule" you are talking about?
>> There is no such thing. None. When you write /pat/, it means to
>> find the first such pattern. There is no special case here.

>The special case is "as long as it has the earliest starting position".
>There may be many, many possible matches for a regexp in a given string,
>especially with an expression as inclusive as ".*".

You want to change things from "find a match", which has the obviously
deterministic semantics of finding the first match, and alter that to
mean "find all possible matches; now, amongst those...". This is
much more complicated, at many levels.

You have yet to address my long mail to you.
You have yet to read MRE.

>So, you have to apply some disambiguating rules to identify which matches
>are "interesting" enough to be worth paying attention to.

There is no ambiguity. Short-circuiting is not ambiguity. Stopping
when you have an answer is not ambiguity. You are mistaken.

--tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>We may have to "agree to disagree". I shan't be doing that. >I'm understand why people believe in >the current semantics, but I've seen no indication that anyone else >understands why I believe in these alternative semantics, or has tried. >(Disagreeing with my conclusion doesn't preclude understanding where I'm >coming from, but nobody seems to.) You have not addressed the heat death of the universe as I and others have illustrated. Finding all possible matches is very often completely infeasible. Please solve the electron decay problem before continuing. >Well, obviously we could. Maybe we shouldn't, but we could do it. Many, >many existing programs depended on Perl 4's magic behavior with @'s in >double-quoted strings, yet Perl 5 broke them all with a fatal error during >the compile phase. People survived. They adapted and moved on. Red herring. >Unlike >that incompatibility, this one would probably affect few programs. You're wrong. Incredibly wrong. --tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>Really? I haven't taken a survey, but I did ask one co-worker for his
>first impression of what the regexp (from my example) would match. Not
>being an experienced Perl programmer, but being familiar with regular
>expressions, he believed he understood the idea of non-greedy matching.
>His expectation? That would match "bd", not "d".

I'm sure you invalidated the test results by giving the wrong setup.

Listen very closely: PERL DOES NOT HAVE GREEDY MATCHING. Got that?
Neither does it have stingy matching. Only the quantifiers have
such a property. NOT THE MATCH ITSELF.

Wait, let me say it again: PERL DOES NOT HAVE GREEDY MATCHING.

There is no global greed, only local greed. And greed is a
misleading term.

--tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>Have you thought it through NOW, on a purely semantic level (in isolation >from implementation issues and historical precedent), I've said it before, and I'll say it again: you keep using the word "semantic", but I do not think you know what that word means. --tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>> More generally, it seems to me that you're hung up on the description
>> of "*?" as "shortest possible match". That's an ambiguous

>Yup, that's a bit confusing. It's really "start matching as soon as
>possible, and stop matching as soon as possible". (The usual greedy
>one is, of course, "keep matching as long as possible".) The initial
>invariant part, "start as soon as possible", is the de facto and de
>jure (at least POSIX 1003.2, but probably also Single Unix)
>definition, and therefore rather non-negotiable.

It's like people who write /^.*fred/ instead of /.*fred/. They are
forgetting something critical: where the Engine starts the search.

--tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>Actually, I'm not sure -- it's conceivable that the ending point would ALSO >move inward for a different starting point within the original match. But >the ending point should NEVER be advanced further -- that's where the >"leftmost over nongreedy" rule should apply instead... Please show us your implementation for a pattern matching engine that lets the current end-point vary. This is very exciting, because now you can relax the restriction that lookbehinds must be constant width. --tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>I want the maximum level of OVERALL consistency for regular expressions as

We're there, thank you very much. "Find a match" is the over-riding
sentiment, the principal semantic. It is completely consistent with
this. You've got greed/nongreed very wrong.

>a whole, rather than immutable adherence to the "leftmost trumps nongreedy"
>rule currently in place. Most of the time, I agree with the precedence of
>leftmost over nongreedy. The example I gave is a case where I believe the
>strict adherence to the leftmost rule actually introduces complexity and
>makes the regular expression system less self-consistent.

You have yet to provide a concrete, real-world example of this
allegation. To the contrary, you give unrealworld examples.

--tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>I meant that I've never seen
>a concrete, realistic example where the current behavior is more beneficial
>to the programmer than my proposed behavior.

Absence of evidence is hardly evidence of absence.

    `cat /vmunix` =~ /\w+/

I just love guaranteed worst-case behavior. NOT.

It is good to short circuit. Very good.

>(I imagine in most cases, it
>will be a moot point, since the match will usually be the same.)

Then why the bloody blazes are you arguing about this so vociferously?

>Strange argument. Greedy matching was once considered fundamental to the
>design of regex, and the "leftmost" behavior is 100% consistent with greedy
>matching.

Nope. These are orthogonal, unrelated concepts.

>Yet Perl 5 added non-greedy modifiers, changing a fundamental
>aspect of every preceding regex system, and still called it a regex...

Whether a match should be minimal or maximal in no way changes
whether the language is to be deemed "regular" by the proper
definition of that term. Back-references, which have been in Perl
since its inception, suffice to disqualify the language from that
category, but minimal and maximal alternation do not.

But this doesn't matter.

--tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>As for special-case rules, I believe that my proposed modification would >REMOVE a special-case semantic rule, at the cost of added complexity at the >implementation level. What is this alleged "special-case rule" you are talking about? There is no such thing. None. When you write /pat/, it means to find the first such pattern. There is no special case here. --tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>I made a mistake in phrasing it this way, because it seemed to suggest that
>I thought it was an implementation bug that it returns "d" instead
>of "bd". I didn't make it clear that I was trying to approach this as
>a purely SEMANTIC question, considered in isolation from the implementation
>of the system.

You keep using "semantic". However, I do not think that that word
means what you think it means.

>The question is, "what interpretation makes the most sense,
>at a high level", not "why does the current behavior make sense".

These are all three of them different things.

>It's not that there aren't justifications for the current behavior. It's a
>question of perspective -- from one perspective (mine), "bd" makes more
>sense semantically.

No, sir. You cannot use the S word for that. Here are the
*SEMANTICS* of pattern matching in Perl:

    When there's more than one match, the first match found
    (that is, the leftmost) is the winner, with ties being
    resolved in favor of the longer string for maximal matches
    and the shorter string for minimal matches.

This is *not* an "implementational detail". These *are* the
semantics. You are asking for *different* semantics.

What you are doing is simply an attempt to impose a sloppy
English-language description on the behavior of the code. Just
because you should happen to understand the English does not mean
that this describes the code. It's like people thinking /<.*?>/
will find a tag because they are thinking in English, not Perl.
Of course it won't.

>I believe it it more intuitive, at the highest level.

"Intuitive" is another one of those words frequently bandied about
that is nearly always misapplied.

    WRONG:      The frobnitz interface is more intuitive.

    RIGHT:      The nipple is the only intuitive human interface.

    CORRECTION: From my own historical experiences and resulting
                biases, the frobnitz interface would have been more
                what I personally without regard to anyone else
                would have been expecting.
>From a different (more implementation-oriented) perspective, the current

No, this is not "implementation-oriented". It is merely the semantics.

>Hopefully, we can have a rational discussion about whether this semantic
>anomaly is real or imagined, what impact "fixing" it would have on the
>implementation (if it's deemed real), and whether it's worth "fixing".

I do not expect you to be rational, because I do not think we can
agree to your terms. There is no semantic anomaly, any more than
thinking that <.*> or <.*?> finds an HTML tag is some sort of
"semantic anomaly". It is the result of your mistranslating between
English and code.

>Here's where I see the disconnect happening. I'm approaching this from a
>semantic perspective, asking myself "what should this match (ideally)?"

No, you're not. Please stop abusing the S word. It places you on no
moral high ground whatsoever.

--tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>No question that's how it's been implemented. But WHY would anyone want
>such behavior? When is it beneficial?

It is beneficial because this is how it's always been, because it
is faster, because it is more expressive, because it is more powerful,
because it is more intuitive, and because it is more perlian.

In elaboration:

0) All NFAs before POSIX acted this way. It is historically
   consistent and perfectly expected.

1) It is obviously faster to come to an answer earlier on in the
   execution than it would be to come to an answer later. It's
   like an expression whose evaluation short-circuits. Also, when
   the matching semantics permit back tracking and back references,
   the combinatoric possibilities can easily explode into virtual
   unsolvability as the 2**N algorithm loses its race to the heat
   death of the universe. Yes, if Perl did overall-longest or
   overall-shortest, this would produce a more predictable time;
   however, as we see with DFAs and POSIX NFAs, this prediction
   plays out as guaranteed *WORST-CASE* time. It is not acceptable
   to make everyone pay the worst-case time. Never penalize the
   whole world for the needs or desires of the few.

2) Consider the simple case, /A|B/. In your overall longest/shortest,
   guaranteed worst-case time, both submatch A and submatch B must
   be calculated, and then the lengths of their matches both be
   compared. Perl, fortunately, does not do that. Rather, the
   first one in that sequence wins. That means that under the
   current scheme, the patterns /A|B/ and /B|A/ have different
   semantics. Under your worst-case scheme, they do not. Because
   /A|B/ and /B|A/ mean something different, more expressivity is
   provided. This is the same scenario, albeit expressed slightly
   differently, as your situation. The issues manifest in both
   are equivalent.

3) This leads to increased power. It's like the difference between
   a short-circuiting "or" and one that blindly plods ahead trying
   to figure something out even when all is for naught.
   Compare A&&B with A&B, for example. If A is 0, then B need not
   be computed, yet in the second version, one runs subexpression B
   nevertheless. If according to the rules of one particular system,
   patX and patY mean different things, whereas in a second system,
   they are completely interchangeable, then the first system can
   express nuances that the second one cannot. When you have more
   nuances, more expressivity, then you have more power, because you
   can say things you could not otherwise say. Why do C and its
   derivatives such as Perl have short-circuiting Boolean operators?
   Because in older languages, such as Fortran and Pascal, where you
   did not have them, one quickly found that this was cumbersome and
   annoying.

4) It is more intuitive to the reader and the writer to minimize
   strange action at a distance. It's more to remember; or, perhaps
   better phrased, more to forget. That's why we don't like variables
   set in one place magically affecting innocent code elsewhere.
   Maybe it's more applicable here to say that that's why having
   mixed precedences and associativities confuses people. If in an
   expression like A->B->C->D, you had to know a priori when
   evaluating A that D was going to be coming up, it would require
   greater look-ahead, more mental storage. Even if a computer could
   do it, people would find it harder. That's why we don't write

       &{&{$fnctbl{expr}}(arg1)}(arg2)

   when we can simply write

       $fnctbl{expr}->(arg1)->(arg2)

   It is not intuitive to people to have to do too much look-ahead,
   or too much storage. Having distant items interact with one
   another is confusing, and we've already got that situation with
   backreferences, as in /(\w+)(\w+)\s+\2(\w+)/, which depending on
   how you start weighting those +'s into +?'s, can really move
   matters around. Let's not exacerbate the counterintuitiveness.

5) It is more Perlian because of the principle that things that
   look different should actually *be* different. /A|B/ and /B|A/
   look quite different.
Thus, they should likewise *be* different. >I didn't need the long-winded explanation, and I don't need help with >understanding how that regexp matches what it does. I understand it >perfectly well already. I'm no neophyte with regular expressions, even if >Perl 5 does offer some regexp features I've never bothered to exploit... All NFAs prior to POSIX behaved in the fashion that Perl's continue to behave in. I am surprised that over the long course of your experiences with regexes, that you never noticed this fundamental principle before. >My point is that the current behavior, while reasonable, isn't quite right. You're wrong. Don't call it "not right". It's perfectly correct and consistent. It follows directly from historical behavior of these things, and quite simply, it's in the rules. It's
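Point 2 of the elaboration, that /A|B/ and /B|A/ mean different things, can be seen in a few lines. This is only an illustrative sketch; the subject string and the two alternatives are my own choices, not taken from the thread.

```perl
use strict;
use warnings;

my $text = "foobar";    # hypothetical subject string

# Alternatives are tried left to right at each starting point, and the
# first one that lets the whole pattern succeed wins.
my ($short_first) = $text =~ /(foo|foobar)/;
my ($long_first)  = $text =~ /(foobar|foo)/;

print "/(foo|foobar)/ matched '$short_first'\n";
print "/(foobar|foo)/ matched '$long_first'\n";
```

Both alternatives can match at offset 0, so the order in which they are written decides the result. Under an overall-longest rule both patterns would have to return the same string, and that distinction would no longer be expressible.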
Re: Perl 5's "non-greedy" matching can be TOO greedy!
>Does anyone disagree with the premise, and believe that "d" is the >CORRECT match for the non-greedy regexp above? Yes. The Camel's regex chapter reads: You might say that eagerness holds priority over greed (or thrift). >For what it's worth, here's a quote from a Perl 5.005_03 "perlre" manpage: > By default, a quantified subpattern is "greedy", that is, it will > match as many times as possible (given a particular starting > location) while still allowing the rest of the pattern to match. > If you want it to match the minimum number of times possible, > follow the quantifier with a "?". Note that the meanings don't > change, just the "greediness": >I don't believe that ".*?" matching "bbb" above qualifies as "to match >the minimum number of times possible", when it is possible only to match >the "" and still match the full regexp. Since the documentation makes >no mention of earliest-match in this paragraph, I can only assume this is >unintended behavior, but I'm asking to check my assumptions. Any devil's >advocates out there who want to argue for the current behavior? The simple story is this: Rule 1: Given two matches at *different* starting points, the one that occurs earlier wins. *OTHERWISE* Rule 2: Given two matches at the *same* starting points, the one that is longer wins. Or, more lengthly: Given the opportunity to match something a variable number of times, maximal quantifiers will elect to maximize the repeat count. So when we say "as many times as you'd like", the greedy quantifier interprets this to mean "as many times as you can possibly get away with", constrained only by the requirement that this not cause specifications later in the match to fail. If a pattern contains two open-ended quantifiers, then obviously both cannot consume the entire string: characters used by one part of the match are no longer available to a later part. Each quantifier is greedy at the expense of those that follow it, reading the pattern left to right. 
That's the traditional behavior of quantifiers in regular expressions.
However, Perl permits you to reform the behavior of its quantifiers:
by placing a C<?> after that quantifier, you change it from maximal to
minimal. That doesn't mean that a minimal quantifier will always match
the smallest number of repetitions allowed by its range, any more than
a maximal quantifier must always match the greatest number allowed in
its range. The overall match must still succeed, and the minimal match
will take as much as it needs to succeed, and no more. (Minimal
quantifiers value contentment over greed.)

For example, in the match:

    "exasperate" =~ /e(.*)e/    # $1 now "xasperat"

the C<.*> matches "C<xasperat>", the longest possible way for it to
match. (It also stores that value in C<$1>, as described below under
"Capturing and Clustering".) Although there was a shorter match
available, a greedy match doesn't care. Given two choices at the same
starting point, it always returns the I<longest> of the two.

Contrast this with this:

    "exasperate" =~ /e(.*?)e/   # $1 now "xasp"

Here, the minimal matching version, C<.*?>, is used. Adding the C<?>
to C<*> makes C<*?> take on the opposite behavior: Now given two
choices at the same starting point, it always returns the I<shortest>
of the two.

Although you could read C<*?> as saying to match zero or more of
something but preferring zero, that doesn't mean it will always match
zero characters. If it did so here, for example, and left C<$1> set to
C<"">, then the second "C<e>" wouldn't be found, since it doesn't
immediately follow the first one.

You might also wonder why, in minimally matching C</e(.*?)e/>, Perl
didn't stick "C<rat>" into C<$1>. After all, "C<rat>" also falls
between two C<e>'s, and is shorter than "C<xasp>". In Perl, the
minimal/maximal choice applies only when selecting the shortest or
longest from among several matches that all have the same starting
point.
If two possible matches exist, but these start at different offsets in
the string, then their lengths don't matter--and neither does whether
you've used a minimal quantifier or a maximal one. The earliest of
several valid matches always wins out over all latecomers. It's only
when multiple possible matches start at the same point that you use
minimal or maximal matching to break the tie. If the starting points
differ, there's no tie to break. Perl's matching is normally
I<leftmost longest>; with minimal matching, it becomes I<leftmost
shortest>. But the "leftmost" part never varies, and is the dominant
criterion.

Not all regex engines work this way. Some believe in overall greed, in
which the longest match always wins, even if it shows up later. Perl
isn't that way. You might say that eagerness holds
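The two "exasperate" matches from the excerpt can be run as written. This sketch adds only the variable names and the print statements.

```perl
use strict;
use warnings;

my $word = "exasperate";

# Maximal: from the leftmost "e", take as much as still lets the final
# "e" of the pattern match.
my ($greedy) = $word =~ /e(.*)e/;

# Minimal: same starting "e", but stop at the first later "e".
my ($minimal) = $word =~ /e(.*?)e/;

print "maximal: '$greedy'\n";
print "minimal: '$minimal'\n";
```

The minimal version yields the four-character capture rather than the shorter string that sits between the last two e's, because both quantifier flavors are bound to the same leftmost starting point.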
Re: RFC 308 (v1) Ban Perl hooks into regexes
>I consider recursive regexps very useful:
>
> $a = qr{ (?> [^()]+ ) | \( (??{ $a }) \) };

Yes, they're "useful", but darned tricky sometimes, and in
ways other than simple regex-related stuff. For example,
consider what happens if you do

    my $regex = qr{ (?> [^()]+ ) | \( (??{ $regex }) \) };

That doesn't work due to differing scopings on either side
of the assignment. And clearly a non-regex approach could be
more legible for recursive parsing.

--tom
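One way around the scoping problem is to declare the lexical before assigning to it, so the name inside the (??{ ... }) block refers to the same variable. The sketch below is an assumption-laden illustration, not the pattern from the quoted message: it uses the fuller balanced-parentheses form (with a repeated group), the /x flag, and a hypothetical helper named is_balanced.

```perl
use strict;
use warnings;

# Declare first, assign second: by the time the pattern is compiled,
# $balanced already names the lexical that the assignment fills in.
my $balanced;
$balanced = qr{
    \(                       # an opening parenthesis
    (?:
        (?> [^()]+ )         # a run of non-parenthesis characters
      |
        (??{ $balanced })    # or a nested, balanced group
    )*
    \)                       # the matching closing parenthesis
}x;

# Hypothetical helper: true if the whole string is one balanced group.
sub is_balanced {
    my ($s) = @_;
    return $s =~ /\A$balanced\z/ ? 1 : 0;
}

for my $s ("(a(b)c)", "(a(b)", "((()))") {
    printf "%-8s %s\n", $s, is_balanced($s) ? "balanced" : "not balanced";
}
```

The remark about legibility still stands: for anything beyond a toy grammar, an explicit recursive-descent routine is usually easier to read than this.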
RFC 198 (v2) Boolean Regexes
This seems very complicated. Did you look at the Ram:6 recipe on
expressing AND, OR, and NOT in a regex? For example, to do /FOO/ && /BAR/
you need not write /FOO.*BAR|BAR.*FOO/ -- and in fact, should not, as
it doesn't work properly on some pairs! For example, /CAN/ && /ANAL/
can't be written /CAN.*ANAL|ANAL.*CAN/ if you expect to match "CANAL".
Overlaps bite you. You really need /(?=.*CAN)(?=.*ANAL)/ instead, which
permits multiple assertions. Please check out the recipe I'm talking about.

--tom, from a strange place

PS: NB -- I cannot access my mail spool. And the mailing list
archives are 4 days behind on the website, so there is
no hope of me participating in real-time, nor in seeing
any private replies.
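The overlap problem is quick to demonstrate. In this sketch the two patterns are the ones from the message; the list of test strings and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $ordered   = qr/CAN.*ANAL|ANAL.*CAN/;      # each branch consumes text
my $lookahead = qr/(?=.*CAN)(?=.*ANAL)/;      # assertions consume nothing

my %seen;
for my $s ("CANAL", "ANAL CAN", "CAN") {
    $seen{$s} = [ ($s =~ $ordered ? 1 : 0), ($s =~ $lookahead ? 1 : 0) ];
    printf "%-10s ordered=%d lookahead=%d\n", $s, @{ $seen{$s} };
}
```

Because each lookahead starts from the same position and consumes nothing, the two words are free to share characters, which the alternation form cannot allow.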
Re: \z vs \Z vs $
>I gather you're talking about //s making perl ignore the setting of $*.
>You're right, I didn't know that. But I doubt if it's that important,
>this variable already has been marked as deprecated since Perl5 came
>out, about 5 years ago. It's a good candiadte to be removed in Perl6.

Agreed.

>My point is: to most people, //s already mostly means "treat \n as an
>ordinary character". Let's draw this through, and make //s remove all
>special meanings of "\n", in particular WRT /$/.

>Then, there's the matter of combining //m and //s. It would have no
>effect in that case, because //m makes /$/ behave like /\n|\z/. //ms
>wouldn't change that.

Er, not quite. It's a lookahead.

    /foo$/  is /foo(?=\n?\z)/
    /foo$/m is /foo(?=\n|\z)/

or some such.

>p.s. The mnemonic of //s (single line) would not make any sense any
>more. It never really did work.

No, it never did. Camel-3 doesn't use it much/really.

    Modifier  Meaning
    --------  -------
    C</i>     Ignore alphabetic case distinctions (case insensitive).
    C</s>     Let C<.> match newline and ignore deprecated C<$*>.
    C</m>     Let C<^> and C<$> match next to embedded C<\n>.
    C</x>     Ignore (most) whitespace and permit comments in pattern.
    C</o>     Compile pattern once only.

--tom
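A short sketch of what /s and /m change, using a two-line string of my own choosing. Nothing here comes from the thread except the modifiers themselves.

```perl
use strict;
use warnings;

my $two_lines = "foo\nbar";    # hypothetical test string

my %r = (
    dot_plain    => ($two_lines =~ /foo.bar/  ? 1 : 0),  # . stops at \n
    dot_s        => ($two_lines =~ /foo.bar/s ? 1 : 0),  # /s lets . cross it
    dollar_plain => ($two_lines =~ /foo$/     ? 1 : 0),  # only near string end
    dollar_m     => ($two_lines =~ /foo$/m    ? 1 : 0),  # before embedded \n too
);

printf "%-13s %d\n", $_, $r{$_} for sort keys %r;
```

Note that /s affects only the dot and /m affects only the anchors, so the two can be combined without interfering with each other.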
Re: \z vs \Z vs $
>Tom Christiansen wrote: >> Don't forget /s's other meaning. >Do you enjoy making people ask what you're talking about? Of course not. I enjoy giving people enough pointers to help them learn things for themselves. >What other >meaning did you have in mind, overriding $*? Yes. --tom
Re: \z vs \Z vs $
>That was my second thought. I kinda like it, because //s would have two >effects: > + let . match a newline too (current) > + let /$/ NOT accept a trailing newline (new) Don't forget /s's other meaning. --tom
Re: \z vs \Z vs $
>>>>>> "TC" == Tom Christiansen <[EMAIL PROTECTED]> writes:
>>> Could you explain what the problem is?
>TC> /$/ does not only match at the end of the string.
>TC> It also matches one character fewer. This makes
>TC> code like $path =~ /etc$/ "wrong".

>Sorry, I'm missing it.

I know.

On your "longest match", you are committing the classic error of
thinking greed more important than eagerness. It's not.

This is unrelated to /m.

Go back and read all the insanities we (mostly gbacon and yours truly)
went through to fix the 5.6 release's modules. People coded them
*WRONG*. Wrong means incorrect behaviour. Sometimes this even leads
to security foo.

BOTTOM LINE: You cannot use /foo$/ to say "does the string end in
`foo'?". You can't do that. You can't even use /s to fix it. It
doesn't fix it. This is an annoying gotcha. Larry once said that he
wished he had made \Z do what \z now does. One would like $ to (be
able to) mean "ONLY AT END OF STRING".

--tom

EXAMPLE 1:

--- /usr/local/lib/perl5/5.00554/File/Basename.pm Mon Jan 4 13:00:53 1999 +++ /usr/local/lib/perl5/5.6.0/File/Basename.pm Sun Mar 12 22:24:29 2000 @@ -37,10 +37,10 @@ "VMS", "MSDOS", "MacOS", "AmigaOS" or "MSWin32", the file specification syntax of that operating system is used in future calls to fileparse(), basename(), and dirname(). If it contains none of -these substrings, UNIX syntax is used. This pattern matching is +these substrings, Unix syntax is used. This pattern matching is case-insensitive. If you've selected VMS syntax, and the file specification you pass to one of these routines contains a "/", -they assume you are using UNIX emulation and apply the UNIX syntax +they assume you are using Unix emulation and apply the Unix syntax rules instead, for that function call only. 
If the argument passed to it contains one of the substrings "VMS", @@ -73,7 +73,7 @@ =head1 EXAMPLES -Using UNIX file syntax: +Using Unix file syntax: ($base,$path,$type) = fileparse('/virgil/aeneid/draft.book7', '\.book\d+'); @@ -102,7 +102,7 @@ The basename() routine returns the first element of the list produced by calling fileparse() with the same arguments, except that it always quotes metacharacters in the given suffixes. It is provided for -programmer compatibility with the UNIX shell command basename(1). +programmer compatibility with the Unix shell command basename(1). =item C @@ -111,8 +111,8 @@ second element of the list produced by calling fileparse() with the same input file specification. (Under VMS, if there is no directory information in the input file specification, then the current default device and -directory are returned.) When using UNIX or MSDOS syntax, the return -value conforms to the behavior of the UNIX shell command dirname(1). This +directory are returned.) When using Unix or MSDOS syntax, the return +value conforms to the behavior of the Unix shell command dirname(1). This is usually the same as the behavior of fileparse(), but differs in some cases. For example, for the input file specification F, fileparse() considers the directory name to be F, while dirname() considers the @@ -124,12 +124,22 @@ ## use strict; -use re 'taint'; +# A bit of juggling to insure that C always works, since +# File::Basename is used during the Perl build, when the re extension may +# not be available. 
+BEGIN { + unless (eval { require re; }) +{ eval ' sub re::import { $^H |= 0x0010; } ' } + import re 'taint'; +} + + +use 5.005_64; +our(@ISA, @EXPORT, $VERSION, $Fileparse_fstype, $Fileparse_igncase); require Exporter; @ISA = qw(Exporter); @EXPORT = qw(fileparse fileparse_set_fstype basename dirname); -use vars qw($VERSION $Fileparse_fstype $Fileparse_igncase); $VERSION = "2.6"; @@ -162,23 +172,23 @@ if ($fstype =~ /^VMS/i) { if ($fullname =~ m#/#) { $fstype = '' } # We're doing Unix emulation else { - ($dirpath,$basename) = ($fullname =~ /^(.*[:>\]])?(.*)/); + ($dirpath,$basename) = ($fullname =~ /^(.*[:>\]])?(.*)/s); $dirpath ||= ''; # should always be defined } } if ($fstype =~ /^MS(DOS|Win32)/i) { -($dirpath,$basename) = ($fullname =~ /^((?:.*[:\\\/])?)(.*)/); -$dirpath .= '.\\' unless $dirpath =~ /[\\\/]$/; +($dirpath,$basename) = ($fullname =~ /^((?:.*[:\\\/])?)(.*)/s); +$dirpath .= '.\\' unless $dirpath =~ /[\\\/]\z/; } - elsif ($fstype =~ /^MacOS/i) { -($dirpath,$basename) = ($fullname =~ /^(.*:)?(.*)/); + elsif ($fstype =~ /^MacOS/si) { +($dirpath,$basename) = ($fullname =~ /^(.*:)?(.*)/s); } elsif ($fstype =~ /^AmigaOS
\z vs \Z vs $
What can be done to make $ work "better", so we don't have to make people use /foo\z/ to mean /foo$/? They'll keep writing the $ for things that probably oughtn't abide optional newlines. Remember that /$/ really means /(?=\n?\z)/. And likewise with \Z. --tom
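For anyone who wants to see the gotcha rather than take it on faith, here is a minimal sketch. The sample string and the hash name %matched are made up for illustration; the four patterns are the ones named in this thread.

```perl
use strict;
use warnings;

# Hypothetical input: a path that arrived with its newline still attached.
my $path = "/etc\n";

# Record which anchors accept the string despite the trailing newline.
my %matched = (
    'etc$'         => ($path =~ /etc$/         ? 1 : 0),  # $ tolerates one trailing \n
    'etc\Z'        => ($path =~ /etc\Z/        ? 1 : 0),  # so does \Z
    'etc(?=\n?\z)' => ($path =~ /etc(?=\n?\z)/ ? 1 : 0),  # the spelled-out equivalent
    'etc\z'        => ($path =~ /etc\z/        ? 1 : 0),  # only \z insists on the true end
);

printf "%-14s %s\n", $_, $matched{$_} ? "matches" : "does not match"
    for sort keys %matched;
```

Only the \z version rejects the string, which is why the Basename.pm patch swaps $ for \z.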
Re: What's in a Regex (was RFC 145)
The phrase "die a horrible death" clearly reads that something was a bletcherous botch--a terribly brain-damaged mistake, if you would--and so must necessarily be expurgated from the language. For example, when Larry said, "...this does not mean that some of us should not want, in a rather dispassionate sort of way, to put a bullet through csh's head," *that* was the sort of thing that might be described as something he wanted to die a horrible death. Yet note how mildly worded even this is. While others sometimes say this about various elements of Perl, Larry seldom states matters so strongly, as you did when you portrayed him as having said that it should die a horrible death. After all, if he *really* felt that strongly about some (mis)feature (and yes, this sometimes happens), then said misfeature would almost certainly be long dead already. Think about it. :-) That's why I thought clarification was in order. --tom
Re: What's in a Regex (was RFC 145)
> 2. Many people - including Larry - have voiced their desire > to see =~ die a horrible death Please provide a look-up-able reference to Larry's saying that he wanted =~ to die a horrible death. That's very strongly worded for him. Are you sure this tale hasn't merely grown in the telling? --tom
Re: What's in a Regex (was RFC 145)
>Can be rewritten as the shorter and more readable: > ($name) =~ split /\s+/; > $string =~ quotemeta; > @array =~ reverse; > @vals =~ sort { $a <=> $b }; > $string =~ s/\s+/SPACE/;# looks familiar > $string =~ m/\w+/; # this too > @strs =~ m/\w+/;# cool extension > @strs =~ s/foo/bar/gi; # ditto Which can of course be written in an immeasurably more legible fashion using current Perl, a little-known language: ($name) = split /\s+/, $name; $string = quotemeta($string); @array = reverse @array; @vals = sort { $a <=> $b } @vals; $string =~ s/\s+/SPACE/; $string =~ /\w+/; map { m/\w+/ } @strs; s/foo/bar/gi for @strs; Although the invention of redundant and obfuscated syntactic alternatives for operations which are not only perfectly feasible already but also more readable in their current incarnations seems to be a not infrequent theme in these documents, one must always carefully consider whether any scant benefit these cutesinesses might provide can be truly worth further exacerbating the rampant inscrutability problems (stemming mainly from punctuation in lieu of alphabetics and from magically implicit targets, arguments, and side-effects) for which Perl is already soundly--and not always undeservedly--derided. Explicitly saying precisely what you mean is perfectly acceptable--and usually desirable. Inventing subtleties merely to avoid typing, however, may not be. --tom
Re: What's in a Regex (was RFC 145)
>But you said "lists" up there and that sparked an idea in me ... What >does > @a =~ /pattern/; >currently do? AFAICT, nothing useful. But it could be a syntactic >shorcut for a pattern matching grep() That changes semantics in places you might not expect. What does fn() =~ /pattern/ currently do? It calls fn() in scalar context, of course. But with your suggestion, the =~ operator is no longer a scalar operator, so this changes. --tom
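The scalar context on the left of =~ is easy to check. This is only a sketch; fn() here is a stand-in that reports the context it was called in, not anything from the quoted proposal.

```perl
use strict;
use warnings;

# Stand-in function: returns a word naming the context it was called in.
sub fn { return wantarray ? "list" : "scalar" }

# The left operand of =~ is evaluated in scalar context,
# so the pattern sees the string "scalar".
my $context_seen = fn() =~ /^(\w+)$/ ? $1 : "none";

print "fn() =~ /pattern/ called fn() in $context_seen context\n";
```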
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
>I am working on an RFC >to allow boolean logic ( && and || and !) to apply a number of patterns to >the same substring to allow easier mining of information out of such >constructs. What, you don't like: :-) $pattern = $conjunction eq "AND" ? join('' => map { "(?=.*$_)" } @patterns) : join("|" => @patterns); --tom
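A sketch of that one-liner wrapped up so it can be tried out. The build_pattern() name and the sample strings are invented for the example. AND becomes a run of lookaheads, OR becomes plain alternation.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the ternary: AND => chained lookaheads,
# anything else => alternation.
sub build_pattern {
    my ($conjunction, @patterns) = @_;
    return $conjunction eq "AND"
        ? join('' => map { "(?=.*$_)" } @patterns)
        : join("|" => @patterns);
}

my $and = build_pattern(AND => qw(foo bar));   # (?=.*foo)(?=.*bar)
my $or  = build_pattern(OR  => qw(foo bar));   # foo|bar

for my $text ("bar then foo", "foo only", "neither") {
    printf "%-14s AND:%s OR:%s\n", $text,
        ($text =~ /$and/s ? "yes" : "no"),
        ($text =~ /$or/   ? "yes" : "no");
}
```

The /s on the AND match lets the .* inside each lookahead cross newlines, should the target contain any.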
Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))
>...My point is that I think we're approaching this >the wrong way. We're trying to apply more and more parser power into what >classically has been the lexer / tokenizer, namely our beloved >regular-expression engine. >A great deal of string processing is possible with perls enhanced NFA >engine, but at some point we're looking at perl code that is inside out: all >code embedded within a reg-ex. That, boys and girls, is a parser, and I'm >not convinced it's the right approach for rapid design, and definately not >for large-scale robust design. What you say has, I think, a great deal of sense. While Jon and I--with Nathan, actually (see inside page credits)--were trying to figure out how to go about presenting all this wacky stuff for the final section of the new regex chapter in the Camel: Fancy Patterns Lookaround Assertions Non-Backtracking Subpatterns Programmatic Patterns Generated patterns Substitution evaluations Match-time code evaluation Match-time pattern interpolation Conditional interpolation Defining Your Own Assertions We kept coming back to sentiments remarkably similar to those you yourself have just expressed: although I think we managed to put a decently positive shine on the matter for the print version, it still really seems that the inside-outness of this is very hard on your brain, and of remarkably abstruse appeal to the incredibly few. (Names of the usual suspects omitted to avoid using four-letter words in public forums. :-) I would welcome a less inside-out approach, as well as one that were more procedural--or at least more symbolic and less punctuational. --tom
Re: $& and copying: rfc 158 (was Re: RFC 110 (v3) counting matches)
>actually it is more like which code refers to $& and which regex that >caem from. the problem stems from $& being a global and not local like >$1. Say what? They scope the same! sub foo { /./ } $_ = "stuff"; /.../; foo(); print $&; --tom
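Here is that snippet again as a runnable sketch, with capture groups added (my addition) so $1 can be compared alongside $&. Both are restored when foo() returns, because match variables are dynamically scoped to the enclosing block.

```perl
use strict;
use warnings;

# foo() performs its own successful match on $_, inside its own block.
sub foo { /(.)/ }

$_ = "stuff";
/(...)/;          # outer match: $& and $1 are both "stu"
foo();            # inner match sets $& and $1 to "s", but only inside foo()

# Back in the outer scope, both variables show the outer match again.
my ($whole, $first) = ($&, $1);
print "\$& is '$whole', \$1 is '$first'\n";
```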
Re: RFC 72 (v1) The regexp engine should go backward as well as forward.
Whenever I seem to have this "search backwards" urge (not very often, I admit), I without much thought just throw memory at it with reverse($str) =~ /pat/ Or, if that's not the "search backwards" sense intended, then maybe I'll throw time at it: $str =~ /.*pat/ Sometimes I've also done ($str . $str) =~ /pat/ to effect a search that wraps around--kinda. --tom
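A small sketch of the three tricks, using made-up strings. Note that with the reverse() approach the pattern has to be written backwards too, which is only painless for literal text like this.

```perl
use strict;
use warnings;

my $str = "one fish two fish";
my $len = length $str;

# 1) Throw memory at it: reverse the string (and the pattern) and
#    convert the match end back into an offset in the original.
my $rev = reverse $str;
my $last_via_reverse = $rev =~ /hsif/ ? $len - $+[0] : -1;

# 2) Throw time at it: a greedy .* pushes the match as far right as it can.
my $last_via_greedy = $str =~ /.*(fish)/s ? $-[1] : -1;

# 3) Wraparound, kinda: double the string so a match may span the seam.
my $ring  = "abcdef";
my $wraps = ($ring . $ring) =~ /efab/ ? 1 : 0;

print "last 'fish' at $last_via_reverse (reverse), $last_via_greedy (greedy)\n";
print "'efab' wraps around '$ring': ", ($wraps ? "yes" : "no"), "\n";
```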
Re: copying and s/// (was Re: Overlapping RFCs 135 138 164)
>Uri Guttman wrote: >> >> TC> ($this = $that) =~ s/foo/bar/; >> TC> for (@these = @those) { s/foo/bar/ } >> >> TC> You can't really do those in one step without it. >RFC 164 v2 has a new syntax that lets you do the above or, if you want: > $this = s/foo/bar/, $that; > @these = s/foo/bar/, @those; Those really aren't any more obvious to the reader than what we already have. Less so, in fact, since you can understand what the current ones are doing based on simple operators and precedences. --tom
Re: RFC 165 (v1) Allow Varibles in tr///
>For the record, the UTF8 version of tr/// does not use a plain 256K >table. It uses a data strcuture called a 'swash'; this is also the >data structure that is used by the UTF8 versions of 'uc' etc., the >\p{...} regex escapes, and the others. The swash is based on a hash, >and the code is in utf8.c. And is connected to a "swatch": /usr/local/src/perl/utf8.c:/* a "swash" is a swatch hash */ --tom
Re: Overlapping RFCs 135 138 164
>I was referring to the visual similarity of = and =~, when in fact they >have nothing to do with one another. The expression I picked is just a >frequently encountered idiom that puts the two in close proximity. Your >proposed ~ thing would make it much rarer, but I still think =~ looks >like something to do with assignment. Well, with a /match/, it's read-only, true, and thus nothing like "an assignment". But with either s/ubs/titut/e or a tr/ansli/terate/, you do (potentially) change the variable. --tom
Re: copying and s/// (was Re: Overlapping RFCs 135 138 164)
> TC> ($this = $that) =~ s/foo/bar/; > TC> for (@these = @those) { s/foo/bar/ } > TC> You can't really do those in one step without it. >but do they really need to be combined into one step? i sometimes prefer >the separate assignment statement for clarity. other times i feel i am >in a compressing mood. It's like why you can say while ( (ch = getc()) != EOF ) { ... } With assignment an expression, not a statement, you can use a larger expression. Python people hate this. :-) --tom
Re: copying and s/// (was Re: Overlapping RFCs 135 138 164)
>RFC 164 v2 has a new syntax that lets you do the above or, if you want: > $this = s/foo/bar/, $that; > @these = s/foo/bar/, @those; >Consistent with split, join, splice, etc, etc. That looks tremendously *IN*consistent, since now you must alter the laws of precedence! :-( % perl -MO=Deparse,-p -e '$this = s/foo/bar/, $that;' (($this = s/foo/bar/), $that); >> but we need a better syntax for s/// that doesn't modify its string but >> returns a copy which has had the substitution applied to it. >See RFC 164 v2, all this is supported, as well as this: > @str =~ s/foo/bar/; >Which has been a pipe dream for some time. I can't imagine that the number of elements in @str contains the string "foo". Or has one decided that @array in scalar context no longer returns that? Anyway, this is nothing we don't have, or which is broken. We already have the highly readable: for (@str) { s/foo/bar/ } Why do you want to magically use scalar operators on arrays? This was suggested back as early as perl1 or so because people wanted to write @a + @b and even @a + 2, and Larry wasn't interested in doing that. That's a big new world. --tom
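To make the precedence point concrete without reaching for B::Deparse, here is a sketch showing what that statement does under today's rules. The sample data is invented. The comma binds more loosely than assignment, so s/// runs against $_ and $that is merely evaluated and thrown away.

```perl
use strict;
use warnings;

$_ = "foo fighters";
my $that = "food";
my $this;
{
    no warnings 'void';          # the trailing ", $that" is evaluated and discarded
    $this = s/foo/bar/, $that;   # parsed as (($this = s/foo/bar/), $that)
}
print "\$this=$this \$_=$_ \$that=$that\n";

# The existing, readable way to edit every element in place:
my @str = ("foo", "seafood");
for (@str) { s/foo/bar/ }
print "@str\n";
```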
Re: copying and s/// (was Re: Overlapping RFCs 135 138 164)
I keep noticing the connection between $foo =~ /whatever/; $foo->whatever; for ($foo) { whatever } They're all topicalizers. --tom
Re: RFC 165 (v1) Allow Varibles in tr///
>Lightning flashed, thunder crashed and Tom Christiansen >m> whispered: >| >Even if I only do something like tr/a/A/? >| >And, it is going to get worse for UTF8/UTF16? >| >| Use the Source. >If we all always used the source, we wouldn't need books and trainers. >Where would you and I be then? What, you can't read C code? Try it.
Re: RFC 165: Allow variables in a tr///
>| >A lot of >| >non-gurus >| >| So what? >There are far more non-gurus using perl than there are gurus. If all we >cared about was the gurus, we wouldn't need Perl. Wrong. And irrelevant. >| Pick your own quotes is a perl thing. Let them learn this concept. >| If they can't, you made a bad hiring choice. >It may be a perl thing, but it isn't a Perl thing, at least not until >"recently". You're completely full of ... wrongness. Again. % perl1 -e '$_ = "fred"; s#d#e#; print;' free >make it difficult on them in the first place? Remember the easy things >easy, etc. Catering to people who don't know Perl in such a way that it hamstrings those who *do* is brain-dead. Ease-of-long-term is more important than ease-of-learning. You're only a beginner once--or, if you would, ignorance is merely an ephemeral state. Well, for most people; as for those who are permanently ignorant, you can't fix them, so don't even try. We don't write Greek using Latin letters just because more people know Latin letters! It's a minor point. It's part of the language. Pick-your-own-quotes is part of what makes Perl useful and easy to write and easy to read. Just as Greek would be *harder* to read transliterated into Latin script, so too would Perl be harder to read if you had to go shoving a backslash up the frontside of every slash. Teepees are *not* harder to read. It's too much to factor out, and makes no sense. Just because the smallminded blow a fuse doesn't mean we should screw Perl--those fuses were meant to be fried. >| Transforming everything that's syntactically distinctive in Perl a >| simple C-looking function will homogenize it into the same boring >| sameness (and thence to illegibility) as the proverbial fingernail >| clippings stranded in a bowl of oatmeal. >This is also an opinion. Homogeneity isn't necessarily boring. In fact, >it can often be quite liberating and allow one far more flexibility and >creativity than previously. 
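A quick sketch of why pick-your-own-quotes earns its keep. The path is an arbitrary example. All three substitutions do the same thing; only the first needs a backslash in front of every slash.

```perl
use strict;
use warnings;

# Same substitution three ways; only the delimiters differ.
(my $slashed = "/usr/local/bin") =~ s/\/usr\/local/\/opt/;   # leaning toothpicks
(my $hashed  = "/usr/local/bin") =~ s#/usr/local#/opt#;      # pick your own quotes
(my $braced  = "/usr/local/bin") =~ s{/usr/local}{/opt};     # bracketing pairs work too

print "$slashed $hashed $braced\n";
```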
It may be just an opinion, but it is the PERL OPINION (read: part of what makes perl, perl) and if you don't like it, go play with a different language, or write your own. Different things are supposed to look different. Different things are *not* supposed to look the same. You haven't read enough of Larry's writings about this. --tom
Re: RFC 165 (v1) Allow Varibles in tr///
>Lightning flashed, thunder crashed and Mark-Jason Dominus <[EMAIL PROTECTED]> whispered: >| > > The way tr/// works is that a 256-byte table is constructed at compile >| > > time that say for each input character what output character is >| > >| > Speaking of which, what's going to happen when there are more than 256 >| > values to map? >| >| It's already happened, but I forget the details. >Let me see if I understand this correctly. For every tr/// in a program, >256 bytes have to be allocated? Yes, once upon a time. >Even if I only do something like tr/a/A/? >And, it is going to get worse for UTF8/UTF16? Use the Source. >Is this really the optimal >solution for this (sorry, this is probably going into -internals space). >Seems to me that we could very quickly end up with a really large memory >image. Memory usage is irrelevant compared with speed. --tom
Re: RFC 165: Allow variables in a tr///
>Personally, I would say that q/.../ and friends were a bad idea. That's one opinion. As Piers points out, it's hardly universal. Go read what I just wrote Uri. >A lot of >non-gurus So what? >see /.../ (whatever comes before it) and their first impression >is that it has something to do with regex. I would suggest that anything >that isn't a regex should not use /.../. Make q, qq, etc use matched >pairs. Pick your own quotes is a perl thing. Let them learn this concept. If they can't, you made a bad hiring choice. >Make tr look like a regular function and do >tr(SEARCH, REPLACE, MOD, STR). It just seems more orthagonal to me. Transforming everything that's syntactically distinctive in Perl into a simple C-looking function will homogenize it into the same boring sameness (and thence to illegibility) as the proverbial fingernail clippings stranded in a bowl of oatmeal. Don't even dream of it. This is part of what makes Perl, Perl, you know. Not everything looks like an import from libc. And shouldn't. --tom
Re: Overlapping RFCs 135 138 164
> TC> ($foo += 3) *= 2; >that is way too many assignment ops. better is the normalized > $foo = ($foo + 3) * 2; > TC> $n = select($rout=$rin, $wout=$win, $eout=$ein, 2.5); >who uses select directly anymore? use a module! :) I see the smiley, but one must be exceedingly careful not to enshrine one's own personal preferences and predilections--one's own small choices of style and nuance--into laws inviolate, and then to further go on to hold others accountable for not having followed those choices that one has made for oneself and then dictated to others. There's plenty that is convenient--not to mention familiar, reasonable, and perhaps even idiomatically comforting--about changing en passant via assignment's lvaluability: ($this = $that) =~ s/foo/bar/; or for a whole bunch of them: for (@these = @those) { s/foo/bar/ } You can't really do those in one step without it. I have in passing proposed a form of s/// that acts upon a temporary not the original and returns the new value not the success status. This would employ the previously unused binary ~ operator (I mean binary as in two operands; the unary ~ is bitwise, but I don't mean that kind of binary.) Were this around, one could write that first one as $this = $that ~ s/foo/bar/; Because the right side of the assignment is the string resulting from that substitute, without harming $that. By extension, the array case could be @these = map { $_ ~ s/foo/bar/ } @those Which is still not very appealing, actually. Hm... --tom
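A sketch of the copy-then-modify idioms discussed above, with invented sample data. The last two lines are not the binary ~ proposal from this thread. They use the /r modifier that later perls (5.14 and up) provide, which does much the same job of returning the changed copy and leaving the original alone.

```perl
use v5.14;       # only needed for the s///r lines at the end
use warnings;

my $that  = "foo fighters";
my @those = qw(foo food);

# Assignment yields an lvalue, so the copy is what gets modified.
(my $this = $that) =~ s/foo/bar/;

my @these;
for (@these = @those) { s/foo/bar/ }

# Assumption: perl 5.14 or later. /r returns the new string, not the count.
my $copy   = $that =~ s/foo/bar/r;
my @copies = map { s/foo/bar/r } @those;

say "$this | @these | $copy | @copies";
say "originals intact: $that | @those";
```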
Re: Overlapping RFCs 135 138 164
>What about these, which are much the same thing in that they all >use the lvaluability of assignment: And don't forget: for (@new = @old) { s/foo/bar/ } --tom
Re: Overlapping RFCs 135 138 164
>($foo = $bar) =~ s/x/y/; will never make much sense to me. What about these, which are much the same thing in that they all use the lvaluability of assignment: chomp($line = ); ($foo = $bar) += 10; ($foo += 3) *= 2; func($diddle_me = $protect_me); $n = select($rout=$rin, $wout=$win, $eout=$ein, 2.5); --tom
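A few of those run as a sketch, with values chosen so the arithmetic is easy to follow. The trim_in_place() helper is hypothetical, standing in for func() as something that modifies its argument through @_.

```perl
use strict;
use warnings;

my $bar = 5;
my $foo;

($foo = $bar) += 10;       # copy 5 into $foo, then add 10 to the copy
my $after_add = $foo;

($foo += 3) *= 2;          # add 3, then double the result in place
my $after_mul = $foo;

# Hypothetical stand-in for func(): edits its argument via the @_ alias.
sub trim_in_place { $_[0] =~ s/^\s+|\s+$//g }

my $protect_me = "  keep  ";
trim_in_place(my $diddle_me = $protect_me);   # only the copy gets trimmed

print "$after_add $after_mul '$diddle_me' '$protect_me' (\$bar still $bar)\n";
```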
Re: RFC 110 (v3) counting matches
>I think what Tom means is that (for example) >print "${\(localtime())}\n"; >does not produce "Tue Aug 29 19:15:55 2000". Yup. You are hereby appointed tchrist-to-lateur translator. :-) --tom
Re: RFC 110 (v3) counting matches
>>>p.s. Has anybody already suggested that we ought to have a nicer >>>solution to execute perl code inside a string, replacing "${\(...)}" and >>>"@{[...]}", which also won't ever win a beauty contest? Oops, wrong >>>mailing list. >> >>The first one doesn't work, and never did. You want >>@{[]} and @{[scalar ]} instead. >"Doesn't work"? > print "The sum of 1 + 2 is ${\(1+2)}.\n"; >--> > The sum of 1 + 2 is 3. >I'm surprised your wouldn't have known this. The principle is the same: >"${...}" expects a scalar reference inside the block, and '\' provides >one. Of course, there shouldn't be a real multi-element list inside the >parens, but just one scalar. And often, the parens aren't needed. I'm surprised that you still don't understand. Notice what I showed you for the replacement above: @{[scalar ]}. Using ${\(...)} doesn't work in the sense that contrary to popular belief, it fails to provide a scalar context to the contents of those parens. Thus ${ \( fn() ) } is still calling fn() in list context, not scalar context. Witness: sub fn { sprintf "called in %s context", wantarray ? "list" : "scalar" } print "Test 1: "; print "@{ [fn()] }\n"; print "Test 2: "; print "${ \(fn()) }\n"; print "Test 3: "; print "@{ [scalar fn()] }\n"; That, when executed, yields: Test 1: called in list context Test 2: called in list context Test 3: called in scalar context *That's* why test 2 "doesn't work". --tom
Re: RFC 165 (v1) Allow Varibles in tr///
>tr///e is the same as s///g: > >tr/$foo/$bar/e == s/$foo/$bar/g I suggest you read up on tr///, sir. You are completely wrong. --tom
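For the record, a sketch of how far apart the two are, on an arbitrary sample string. tr/// maps individual characters one for one; s///g replaces occurrences of a pattern.

```perl
use strict;
use warnings;

# tr/// maps characters: a->x, b->y, c->z, wherever they occur.
(my $translated = "cab cabbage") =~ tr/abc/xyz/;

# s///g replaces the substring "abc", which this string doesn't contain.
(my $substituted = "cab cabbage") =~ s/abc/xyz/g;

print "tr:  $translated\n";
print "s/g: $substituted\n";
```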
Re: RFC 165: Allow variables in a tr///
>Building a tr/// table is much much simpler and much less work than >compiling a regex, but we don't make people write >eval " \$s =~ m/$pat/ " >when they want to interpolate a string into a regex at run time. >Instead, we take care of it transparently. tr/// could easily be made >to work the exact same way. One thing to be careful of there is thread safety. You can't hang the data off the syntax node (the one with the tr op on it), because tr/$foo/$bar/ wouldn't work for several threads in it at the same time then. --tom
Re: RFC 110 (v3) counting matches
>p.s. Has anybody already suggested that we ought to have a nicer >solution to execute perl code inside a string, replacing "${\(...)}" and >"@{[...]}", which also won't ever win a beauty contest? Oops, wrong >mailing list. The first one doesn't work, and never did. You want @{[]} and @{[scalar ]} instead. And I can't see you coming up with anything that's "better" than this, since this already works and follows directly from understanding of Perl. Too often on these lists anything that "follows directly" one seeks to special-case with brand-new syntax. This is a poor general principle. This has nothing to do with regexes (although it could if we had @foo normally interpolate into patterns with $" = '|' instead, which would break that), so when you find a better list to discuss it on, I'll mumble again. --tom
Re: RFC 165: Allow variables in a tr///
Perl has always excelled at convenience. Look at this code: while (<>) { for (split) { s/foo/bar/g; next if /glarch/i; tr/aeiou/eioua/s; print; } } There is *nothing*wrong* with any of them, and to suggest breaking them is extremely demoralizing. Don't you people have anything that's *broken* to fix? Sheesh. I fully expect to see an RFC for each and every lovely Perlism that isn't in C, Python, and Java. Well, Perl *isn't* C, Python, or Java, and there's no need to freak out just because of this!! --tom
Re: RFC 165: Allow variables in a tr///
>But I think this is worth discussing further, because it neatly >accomplishes the goal of the RFC in a straightforward way: >tr('a-z', 'A-Z', $str) >replaces a-z with A-Z, and >tr($foo, $bar, $str) >replaces the characters from $foo with the characters from $bar. >No special syntax is necessary. When does the structure get built? That's why eg. tr[a-z][A-Z] brooks no variables, for it is solely at compile time that these things occur, and why you must resort to delayed compilation via eval qq/.../ to prod the compiler into building you a new one. Maybe you want qt/.../.../ or something. --tom
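A minimal sketch of that delayed compilation. The make_tr() wrapper is my own invention, not anything in the RFC. It evals the tr/// once per character set and hands back a closure, so the table is built once, not on every call. The specs are pasted into code, so this is only safe with trusted input.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a tr/// for run-time character sets.
# $from and $to are interpolated into source text, so they must be trusted
# (a stray "/" or untrusted data would change the generated code).
sub make_tr {
    my ($from, $to) = @_;
    my $code = "sub { (my \$s = shift) =~ tr/$from/$to/; \$s }";
    my $tr = eval $code or die "bad tr spec: $@";
    return $tr;
}

my $upcase  = make_tr('a-z', 'A-Z');
my $shouted = $upcase->("hello, world");
print "$shouted\n";
```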
Re: RFC 165: Allow variables in a tr///
>Would there be any interest in adding these two ideas to this RFC: >1) tr is not regex function, so it should be regularized to > tr(SEARCH, REPLACE, MOD, STR) >The // tend to confuse people and make them expect tr to operate as a >regular expression. So what? q/.../ is not a "regex function" either. These are all pick-your-own-quotes functions. This makes no sense. --tom
Re: RFC 110 (v3) counting matches
>And hashes are assembled just like lists anyways: > %hash = list get_key_vals; > %hash = (key, val, key2, val2); # same thing Eh? List context is conferred by the % on the LHS. You need no redundant listification redundancy there. >But no, I certainly wouldn't suggest going down the path of 1000 >explicit contexts. Bad. Implicit context good! But a "list" helper >function like a "scalar" helper function would solve a lot of common >problems. No, a list helper function would *not* solve a lot of *common* problems: There's no "list" function corresponding to "scalar" since, in practice, one never needs to force evaluation in a list context. That's because any operation that wants a LIST already provides a list context to its list arguments for free. It's not a "common problem". Now, you *can* force list context, but I (and Larry, one of whose texts I just quoted) don't see it as common, so it's not worth the word. But it's not impossible, either, as you can use either the construct @{ [ ... ] } if you're in a string and trying to interpolate some function call, or simply through ()=... otherwise. Education is a wonderful thing. --tom
Re: RFC 110 (v3) counting matches
>For me, yeah. But I can name at least 30 people in my building alone >that have been hacking Perl for years who wouldn't get this. And a "well >they don't know what's going on" argument doesn't work. Not everyone is >a Perl expert. I will always find this argument specious. Some people "hack on X" for years yet never do more than scratch the surface. There are various reasons for this, depending. But you won't fix it--especially by adding more crudola for them to scratch up. >Besides, you're telling me this: > foo(list bar()) >is *LESS* intuitive? I really don't buy that. Noting that we use [] for an anon array and {} for an anon hash, not ARRAY or array and HASH or hash, it seems to follow to use () for the list. It's not my fault that people don't know this. I've certainly explained it. % tcgrep '^\s.*\(\s*\)\s*=' ~/cookbook/*.pod /home/tchrist/cookbook/chap10.pod:() = some_function(); % tcgrep '^\s.*\(\s*\)\s*=' ~/camel/*.pod /home/tchrist/camel/200lexical.pod:() = funkshun(); /home/tchrist/camel/200lexical.pod:$x = ( () = funk() ); # also set $x to funk()'s return count /home/tchrist/camel/290subs.pod:canmod() = 5; # Assigns to $val. /home/tchrist/camel/290subs.pod:nomod() = 5; # ERROR /home/tchrist/camel/650threads.pod:$t1->tid() == $td->tid() --tom
Re: RFC 166 (does-not-match)
>I can tighten the definition up. If there have been calls for a >(?^baz) type construct before, there will be again. It is a matter of >getting the definition straightforward and useable. Are you really just wanting !/BAD/ there? That is, something that isn't matched by /BAD/? One would, of course, normally simply write !/BAD/, or perhaps !~ /BAD/. However, if reading a config file of patterns, you can't go invert the sense of the match. Well, easily, that is. The Perl Cookbook, in Chapter 6, has these solutions:
* True if either /ALPHA/ or /BETA/ matches, like /ALPHA/ || /BETA/: /ALPHA|BETA/
* True if both /ALPHA/ and /BETA/ match, but may overlap, meaning that "BETALPHA" should be ok, like /ALPHA/ && /BETA/: /^(?=.*ALPHA)(?=.*BETA)/s
* True if both /ALPHA/ and /BETA/ match, but may not overlap, meaning that "BETALPHA" should fail: /ALPHA.*BETA|BETA.*ALPHA/s
* True if pattern /PAT/ does not match, like $var !~ /PAT/: /^(?:(?!PAT).)*$/s
* True if pattern BAD does not match, but pattern GOOD does: /(?=^(?:(?!BAD).)*$)GOOD/s
I suspect the penultimate is just what you're looking for. Or shall I go back and deepread the whole thread? :-( --tom
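A sketch exercising the first four of those Cookbook patterns against a handful of made-up strings, so the overlap and negation cases can be seen side by side. The variable names are mine.

```perl
use strict;
use warnings;

my $either     = qr/ALPHA|BETA/;                  # OR
my $both       = qr/^(?=.*ALPHA)(?=.*BETA)/s;     # AND, overlap allowed
my $both_apart = qr/ALPHA.*BETA|BETA.*ALPHA/s;    # AND, no overlap
my $not_pat    = qr/^(?:(?!PAT).)*$/s;            # NOT, as a single pattern

for my $s ("BETALPHA", "ALPHA BETA", "GAMMA", "a PAT b") {
    printf "%-12s either:%d both:%d apart:%d notPAT:%d\n", $s,
        ($s =~ $either     ? 1 : 0),
        ($s =~ $both       ? 1 : 0),
        ($s =~ $both_apart ? 1 : 0),
        ($s =~ $not_pat    ? 1 : 0);
}
```

In "BETALPHA" the two words share an A, which is why the overlapping form accepts it and the non-overlapping form does not.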
Re: RFC 110 (v3) counting matches
>But, for "crying out loud!", then what the hell do we need "scalar" for? >You can accomplish the same thing like this: > $num = @array; > print "Got $num elements"; Wrong. You just wasted a scalar needlessly, which ()= doesn't do. Of course, you *don't* need scalar() there. print "Got " . @array . " elements"; >"scalar" makes things easy. So does something like "list". This > $stuff = () = $r =~ /crap/shit/; >Doesn't make anything easy. Goodness, it certainly does. It's loads easier than learning a new buzz^Wkeyword or a new switch, because you already know it. >> Perl does context. Perl does *IMPLICIT* context. Cope. >Great. Then let's drop "scalar" to be consistent. This can be done >completely implicitly, right? There are no anonymous scalars. You'd at best have to write foo(scalar bar()) as something more like foo(do { my $x = bar() }) which is lame. However, if foo($) is thus "prototyped", you need but write foo( () = bar() ) to get bar() to be called in list context. This is wholly intuitive. If it isn't, you need to review how my($x) works--once again. --tom
Re: RFC 110 (v3) counting matches
> $count = () = /PATTERN/g; >With a keyword forcing a list context, this new option is unnecessary. We already *HAVE* a token set that forces list context, thank you very much. It's called "()=". I'm glad you like it. --tom
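A sketch of the countof idiom on invented data. The empty list on the left forces the match (or any function) into list context, and the outer scalar assignment then receives the number of elements that were on the right.

```perl
use strict;
use warnings;

my $text = "10 apples, 20 pears, 30 plums";

# List-context //g returns every match; "= () =" turns that into a count.
my $count = () = $text =~ /\d+/g;

# Works for any list-returning call, not just pattern matches.
sub three_things { return qw(a b c) }
my $returned = () = three_things();

print "matches: $count, values returned: $returned\n";
```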
Re: RFC 110 (v3) counting matches
>While I agree that /l is bad, I think going through the crap of "= () =" >is even worse. Does it work? Yes. But is it easily usable and fun, even >for non-experts? No. Oh, for crying out loud--at some point, you have to stop tossing rotting fish for the starving ignorant and actually get them to LEARN something. Or let them die of starvation. Note the difference between my $var = func(); and my($var) = func(); Those are completely different in that they call func() in scalar and list contexts. Why? Because of the presence or absence of (), of course. If they can't learn that adding () to the LHS of an assignment makes it list context, then they will be forever miserable. Perl does context. Perl does *IMPLICIT* context. Cope. --tom
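The difference is easy to watch with a function that reports its own context. A sketch only: func() here is a stand-in built on wantarray, and the variable names are mine.

```perl
use strict;
use warnings;

# Stand-in for func(): says which context it was called in.
sub func { return wantarray ? "list" : "scalar" }

my $plain   = func();    # no parens on the left: scalar context
my($parens) = func();    # parens on the left: list context

print "my \$var  = func() saw $plain context\n";
print "my(\$var) = func() saw $parens context\n";
```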
Re: RFC 110 (v2) counting matches
>If we want to use uppercase, make these unique as well. That gives us >many more combinations, and is not necessarily confusing: > m//f - fast match > m//F - first match > m//i - case-insentitive > m//I - ignore whitespace > >And so on. This seems like a much more productive use, otherwise we're >just wasting characters. Larry's on record as preferring not to have us going down the road of using distinct upper and lower case regex switches. The distance between //c and //C, say, is far too narrow. --tom
Re: RFC 110 (v3) counting matches
>That empty list to force the proper context irks me. How about a >modifier to the RE that forces it (this would solve the "counting matches" >problem too). > $string =~ m{ > (\d\d) - (\d\d) - (\d\d) > (?{ push @dates, makedate($1,$2,$3) }) > }gxl; > $count = $string =~ m/foo/gl; # always list context The reason why not is because you're adding a special case hack to one particular place, rather than promoting a general mechanism that can be used everywhere. Tell me: which is better and why. 1) A regex switch to specify scalar context, as in a mythical /r: push(@got, /bar/r) 2) A general mechanism, say for example, "scalar": push(@got, scalar /bar/) Obviously the "scalar" is better, because it does not require that a new switch be learnt, nor is its use restricted to pattern matching. Furthermore, it's inarguably more mnemonic for the sense of "match this scalarishly". Likewise, to force list context (a far less common operation, mind you), it is a bad idea to have what amounts to a special argument to just one function to do this. What happens to the next function you want to do this to? How about if I want to force getpwnam() into list context and get back a scalar result? $count = getpwnam("tchrist")/l; $count = getpwnam("tchrist", LIST); $count = getpwnam("tchrist")->as_list; All of those, frankly, suck. This is much better: $count = () = getpwnam("tchrist"); It's better because * You don't have to invent anything new, whether syntactically or mnemonically. The sucky solutions all require modification of Perl's very syntax. With the list assignment, you just need to learn how to use what you *already have*. I could say as much for (?{...}). Think how many of the suggestions on these lists can be dealt with simply through using existing features that the suggesting party was unaware of. * It's a general mechanism that isn't tailored for this particular function call. 
Special-purpose solutions are often inferior to general-purpose ones, because the latter are more likely to be creatively usable in a fashion unforeseen by the author. * What could possibly be more intuitive for the action of acting as though one were assigning to a list than doing that very thing itself? Since () is the canonical list (it's empty, after all), this follows directly and requires no special knowledge whatsoever. --tom
Re: RFC 110 (v3) counting matches
>much possible action at a distance. I'm not seeing a nicely-parseable, >easily-understandable way of doing this. Would this be a possible: > $string =~ /(\d\d)-(\d\d)-(\d\d)?&{push @list,makedate(\1,\2,\3)}/g; >Or is that just too ugly and nasty for words? Yes, passing a reference to the numbers 1, 2, and 3 is clearly too ugly. But you'll find we've already got that, I think. sub makedate { my($dd,$mm,$yy) = @_; warn "Just got a date for @_\n"; return "[$yy/$mm/$dd]"; } $string = "22-33-44 and 55-66-77 are ok"; @dates = (); () = $string =~ m{ (\d\d) - (\d\d) - (\d\d) (?{ push @dates, makedate($1,$2,$3) }) }gx; print "Now the dates are: @dates\n"; Running that yields: Just got a date for 22 33 44 Just got a date for 55 66 77 Now the dates are: [44/33/22] [77/66/55] --tom
Re: RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()
>But for style, I don't see why
>the interpreter can't also check for various non-obscure syntaxes / styles.

(You mean "compiler", not interpreter.)

You have to be quite careful there: Perl is so humungous that what's obscure to one person is well-known to the next. For example, $#foo is verging on the obscure for many these days, who would surely pause at reading

    $#foo /= 2;

I don't mean to suggest that $#foo should be "preserved"; I'm just pointing out that in many places, "obscure" is a judgment call, and suggesting that we should avoid being too judgmental.

--tom, who is about ready to give up on this lame American habit of writing "judgment" and "acknowledgment" with their e's!
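For anyone who did pause at that line, a small sketch of what it does in Perl 5. The array contents are invented for the illustration.

```perl
use strict;
use warnings;

my @foo = (1 .. 10);     # ten elements, so $#foo (the last index) is 9

$#foo /= 2;              # 9 / 2 is 4.5; stored as a last index it becomes 4

my $remaining = scalar @foo;
print "$remaining elements left: @foo\n";
```

Assigning to $#foo resizes the array, so halving the last index throws away the back half of the list.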
Re: RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()
>The compatibility path for perl5 to perl6 is via a translator. It >is not expected that perl6 will run perl5 programs unchanged. The >complexity of the translator and the depth of the changes will be >decided by the decisions Larry makes. This becomes not merely "It is not expected that perl6 will run perl5 programs unchanged." but also "It is not expected that perl6 will run perl4 programs unchanged." "It is not expected that perl6 will run perl3 programs unchanged." "It is not expected that perl6 will run perl2 programs unchanged." "It is not expected that perl6 will run perl1 programs unchanged." This has never been the case before, at least, not so dramatically. Sure, the edges have been dodgy, like what happened with "[EMAIL PROTECTED]". But if *MOST* perl1 .. perl5 programs aren't going to work unchanged, that means that most people's existing perl knowledge-base will no longer be valid. That probably means that they aren't going to be able to just type in the Perl that they already know, either, since that Perl will no longer be valid. And in my ever so humble opinion, that's when one should consider dropping the name "perl". This is *not* a bad thing; think of it as much the same as occurred when people stopped calling their improved version of Lisp "Lisp" and started calling it Scheme, or how "C with Classes" eventually took on a different name as well. Names--or, I suppose, "branding", if you truly must--are important things. If the perl6:perl5 relationship is similar in breadth to what we saw in the perl5:perl4 one, then perhaps, maybe even probably, one will get away with it. However, if the stretch is appreciably further, I don't think one will. And I do fear the negative public image ramifications to Perl. This will have to be handled gently and sensitively lest the public lose faith. (No, I didn't really *say* "spin control" there--you just read it.) A new dialect name might save some public confusion. --tom
Re: RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()
>Simple solution. >If you want to require formats such as m/.../ (which I actually think is a >good idea), then make it part of -w, -W, -ww, or -WW, which would be a perl6 >enhancement of strictness. That's like having "use strict" enable mandatory perlstyle compliance checks, and rejecting the program otherwise. Doesn't seem sensible. --tom
Re: RFC 110 (v3) counting matches
>Have you ever wanted to count the number of matches of a pattern? s///g
>returns the number of matches it finds. m//g just returns 1 for matching.
>Counts can be made using s//$&/g but this is wasteful, or by putting some
>counting loop round a m//g. But this all seems rather messy.

It's really much easier than all that:

    $count = () = $string =~ /pattern/g;

--tom
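A short sketch comparing the one-liner with the counting loop mentioned in the quoted text. The sample string and the /foo/ pattern are made up for the example.

```perl
use strict;
use warnings;

my $string = "foo bar foo baz foo";

# A plain scalar-context match only says whether there was a match.
my $matched = ($string =~ /foo/) ? 1 : 0;

# List context makes m//g return every match; the empty-list
# assignment in scalar context then yields how many there were.
my $count = () = $string =~ /foo/g;

# The counting loop, for comparison.
my $loop = 0;
$loop++ while $string =~ /foo/g;

print "matched=$matched count=$count loop=$loop\n";
```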
Re: RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()
>> It's nearly part of Perl's language signature. I wouldn't count >> on this going away if you still think to call this "Perl". It is >> of course much more likely in the renamed "Frob" language, however. >First off, this argument is just a little too grandiose, because if we >can't change anything because of precedent, then we're stuck and Perl 6 >should just be Perl 5.9 instead. How nice of you to put words in my mouth. Please cite me the precise message ID, date, and appropriate text in which I said "we can't change anything because of precedent". Right. I didn't say that. So don't *you* go saying that I did say it, or pretend that I did, or allege that I did, or infer that I did. It's deceptive, misleading, and flat-out wrong, and I'll thank you not to repeat the error. Yes, it's a hot button, so don't push it. Here's something you can quote, however: You cannot hope to just mutate absolutely *everything* willy-nilly and still expect that the language should keep the same name. It's not fair to anyone. If you want to make a language with a similar relationship to Perl as Scheme has to Lisp, then by all means do so, but note the wisdom of the name-change that the lispers pulled. Thus, there *is* fundamental merit in respecting and understanding the appeal of precedent. That is a *long* way from saying "never change anything". Where are the reasonable boundaries here? Well, although it's hard to say with inerrant precision, it's trivially easy to make a good stab at it. You just look at usage--how much has this feature been used? For how long (eg been there since perl1 vs just got added in perl5.003)? What is its prevalence in Perl scripts? Is it a rare feature (like formats) or a ubiquitous one (like hashes)? 
While there are other criteria one can apply, such as whether its presence necessarily dead-ends some other desired functionality, this has to be taken in the light of understanding what's indispensable because of its longevity and ubiquity--not to mention convenience. If you look at the perl1 manpage, then consider usage over time, you'll get a good feel for these fundamentals, which range from single- vs double- vs back-quote distinctions to if/unless variances, from pick-your-own-quotes features to automatic memory management. Almost all of these in turn have their ancestral roots as well, like dollar signs for variables inside of interpolated strings. Perl is easy to learn because you don't need to know much of it, and also because the parts you do need to know you're apt to already know from Perl's parents. These two features are also critical.

>That being said, I don't see why this wouldn't work still. As I noted in
>an email to Scott, at the very least this will work:
>    next if m/\s+/ || m/\w+/;

Having to write m// is needlessly burdensome, flying in the face of thirteen years of experience and millions of users. I guarantee you that there are more people who know about

    if (/foo/)

and about

    if ($var =~ /foo/)

than there are people who know that you can use m// for the same things.

Have you ever noticed how that many

Most of the drastic changes suggested here seem to

I have a long list of changes, things I'd like to see *fixed* in Perl, but virtually none of which anybody here has ever even managed to mention. You're all too busy giving the baby a brain transplant to trim his toenails. (And you forget that the baby is rather grown up now.)

An exception is Mark's addressing of the empty regex problem, which was on my list of niggles. Another of my fix-the-pointy-edges niggles is the way wait() and waitpid() have the wrong semantics for a syscall, since they should always be writable as syscall() || die, just like the rest of them.
Then there's the "1;" at the end of a require()d file, or that index/rindex don't grok negative offsets the way the rest of the language does. I think there might have been something about $/ and its currently all-filehandle nature, but certainly that's there, too. There are plenty more where that came from, but they're all easy and obvious changes that could almost be called fixes to simple design oversights. Nearly none of them seem to be being addressed. --tom
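To make the index/rindex niggle concrete, here is a small sketch of the inconsistency as Perl 5 behaves: substr() counts a negative offset from the end of the string, while index() treats a negative position as the beginning. The sample string is invented.

```perl
use strict;
use warnings;

my $s = "hello world";

# substr() understands a negative offset as "from the end".
my $tail = substr($s, -5);

# index() does not: a negative position is treated as position 0,
# so this finds the first "o", not one in the last five characters.
my $from_negative = index($s, "o", -5);

# What one has to write instead to search the last five characters.
my $from_tail = index($s, "o", length($s) - 5);

print "tail=$tail from_negative=$from_negative from_tail=$from_tail\n";
```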
Re: RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()
>> >next if /\s+/ || /\w+/;    next if match /\s+/ or match /\w+/;
>>
>> Gosh this is annoying. I *really* don't want to have to type "match"
>> all the time. And now I have to use C<or> rather than C<||>, which is
>> already ingrained in my head (I rarely use "or" or "and")

There are thirteen years of precedent, not to mention the millions of users, who are completely accustomed to writing expressions like

    next if /\s+/ || /\w+/;

It's nearly part of Perl's language signature. I wouldn't count on this going away if you still think to call this "Perl". It is of course much more likely in the renamed "Frob" language, however.

--tom
Re: RFC 145 (v2) Brace-matching for Perl Regular Expressions
>All in all, though, you're right that neither set of features is particularly >well-known/used outside of p5p followers. At least from what I've seen. >Virtually every person I've worked with since 5.6 came out has been surprised >and amazed at the REx eval stuff. The completely reworked regex chapter in Camel III explains and demos all the new 5.6 features. I do not believe they will long remain the Cabal's secret. --tom
Re: RFC 158 (v1) Regular Expression Special Variables
>> There's also long been talk/thought about making $& and $1
>> and friends magic aliases into the original string, which would
>> save that cost.
>Please correct me if I'm mistaken, but I believe that that's the way
>they are implemented now. A regex match populates the ->startp and
>->endp parts of the regex structure, and the elements of these items
>are byte offsets into the original string.

I haven't looked at it at all, and perhaps that's something Ilya did when creating @+ etc. So you might be right. Yet if so, I don't see the great fears of massive copies for once-ever use of $` and all, since I should have thought that that would have addressed it.

--tom
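As a sketch of what the offsets in @- and @+ make possible at the Perl level (this says nothing about how the engine stores things internally), the pieces that $`, $& and $' would give can be cut out of the original string with substr(). The sample string is made up.

```perl
use strict;
use warnings;

my $str = "the date is 22-33-44 today";
my ($pre, $match, $post, $second);

if ($str =~ /(\d\d)-(\d\d)-(\d\d)/) {
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    $pre    = substr($str, 0, $-[0]);                 # like $`
    $match  = substr($str, $-[0], $+[0] - $-[0]);     # like $&
    $post   = substr($str, $+[0]);                    # like $'
    # $-[2] and $+[2] bracket the second capture group.
    $second = substr($str, $-[2], $+[2] - $-[2]);     # like $2
}

print "pre=[$pre] match=[$match] post=[$post] second=[$second]\n";
```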
Re: RFC 158 (v1) Regular Expression Special Variables
>those early perl3 scripts by lwall floating around in /etc were poorly >written. i am glad they are finally out of the distribution. Those weren't the scripts I was thinking about, and it is *NOT* ipso facto true that something which uses $& or $` is poorly written. --tom
Re: RFC 138 (v1) Eliminate =~ operator.
>Solve the larger problem: permit method calls in qq() strings. You mean outside of @{[ ... ]}, eh? :=}, I think Larry *might* have said something about making this work. I'm just a bit concerned with the general notion that functions would under some circumstances trigger in qq guys. It's a bit odd to explain that things like abs() for $n+3 won't work, but $o->foo() would. Then again, it's already curious with $a[$n+3]. :-) --tom
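A minimal sketch of the current state of affairs being discussed: what does and does not interpolate in a double-quoted string, and the @{[ ... ]} workaround. The Counter class is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical class, invented for the example.
package Counter;
sub new   { my ($class, $n) = @_; return bless { n => $n }, $class }
sub value { return $_[0]{n} }

package main;

my $o = Counter->new(7);
my $n = 4;
my @a = (0 .. 9);

my $naive   = "value: $o->value";            # only $o interpolates; "->value" stays literal
my $wrapped = "value: @{[ $o->value ]}";     # method call runs inside @{[ ... ]}
my $element = "elem: $a[$n+3]";              # subscript expressions are evaluated
my $func    = "abs: @{[ abs($n - 10) ]}";    # functions need the same wrapper

print "$naive\n$wrapped\n$element\n$func\n";
```

The trick works because @{[ ... ]} builds an anonymous array from an arbitrary expression and immediately dereferences it, and array dereferences do interpolate.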
Re: RFC 158 (v1) Regular Expression Special Variables
>$`, $& and $' are useful variables which are never used by any
>experienced Perl hacker since they have well known problems with
>efficiency.

That's hardly true. I could show you plenty of code from inexperienced Perl hackers like lwall that use them. But the cost is understood. :-)

The rest of what you said probably is reasonable, however. The (.*?)(blah)(.*) solution kind of works sometimes, but is hardly pleasant. Likewise the @+ and @- stuff.

There's also long been talk/thought about making $& and $1 and friends magic aliases into the original string, which would save that cost.

--tom
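A sketch of the (.*?)(blah)(.*) workaround, with one reason it only "kind of" works: without /s the dots stop at newlines. The \A and \z anchors and the sample text are additions for this illustration, not part of the original suggestion.

```perl
use strict;
use warnings;

my $text = "alpha\nbeta blah gamma\n";

# With /s, "." also matches newlines, so the three groups together
# cover the whole string: prematch, match, postmatch.
my ($pre, $hit, $post) = $text =~ /\A(.*?)(blah)(.*)\z/s;

# Without /s the same anchored pattern cannot cross the first newline.
my $without_s = ($text =~ /\A(.*?)(blah)(.*)\z/) ? 1 : 0;

print "pre=[$pre] hit=[$hit] post=[$post] without_s=$without_s\n";
```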
Re: RFC 145 (v1) Brace-matching for Perl Regular Expressions
>How about \p and \P ("P" for "pairwise groupings" or just "pairs")?

I'm afraid those are taken, too.

    Symbol        Atomic  Meaning
    ------        ------  -------
    C<\0>         yes     Match the null character (ASCII NUL).
    C<\I<NNN>>    yes     Match the character given in octal, up to C<\377>.
    C<\I<N>>      yes     Match I<N>th previously captured string (decimal).
    C<\a>         yes     Match the alarm character (BEL).
    C<\A>         no      True at beginning of string.
    C<\b>         yes     Match the backspace character (BS).
    C<\b>         no      True at word boundary.
    C<\B>         no      True when not at word boundary.
    C<\cR>        yes     Match the control character Control-R (C<\cZ>, C<\c[>).
    C<\C>         yes     Match one byte (C C<char>) even in utf8 (dangerous).
    C<\d>         yes     Match any digit character.
    C<\D>         yes     Match any non-digit character.
    C<\e>         yes     Match the escape character (ASCII ESC, not backslash).
    C<\E>         --      End case (C<\L>, C<\U>) or metaquote (C<\Q>).
    C<\f>         yes     Match the form feed character (FF).
    C<\G>         no      True at end-of-match position of prior C<m//g>.
    C<\l>         --      Lowercase next character only.
    C<\L>         --      Lowercase till C<\E>.
    C<\n>         yes     Match the newline character (NL, CR on Macs).
    C<\N{R}>      yes     Match the named char (C<\N{greek:Sigma}>).
    C<\p{R}>      yes     Match any character with named property.
    C<\P{R}>      yes     Match any character without named property.
    C<\Q>         --      Quote (de-meta) metacharacters till C<\E>.
    C<\r>         yes     Match the return character (CR, NL on Macs).
    C<\s>         yes     Match any whitespace character.
    C<\S>         yes     Match any non-whitespace character.
    C<\t>         yes     Match the tab character (HT).
    C<\u>         --      Titlecase next character only.
    C<\U>         --      Uppercase (not titlecase) till C<\E>.
    C<\w>         yes     Match any "word" character (alphanums plus "_").
    C<\W>         yes     Match any non-word character.
    C<\x{abcd}>   yes     Match the character given in hexadecimal.
    C<\X>         yes     Match "combining character sequence" string.
    C<\z>         no      True at end of string only.
    C<\Z>         no      True at end of string or before optional newline.
Re: RFC 144 (v1) Behavior of empty regex should be simple
>Thanks, I will add this to the next version. I did consider that, and >I rejected it. Here's my thinking: s/successful// does make the >feature somewhat more useful, but (a) all those uses are more easily >accomplished with qr() these days, and (b) it's still an >action-at-a-distance effect, which means that it's fragile and that >the behavior of working code can change suddenly and surprisingly when >it is modified. I agree with your reasoning there. I just thought it should be spelt out in the document, since it's a common first thought that we've all had, but which we've not necessarily taken to its conclusions. thanks, --tom
Re: RFC 150 (v1) Extend regex syntax to provide for return of a hash of matched subpatterns
This is useful in that it would stop being number dependent. For example, you can't now safely say

    /$var (foo) \1/

and guarantee for arbitrary contents of $var that you have the right numbered backref anymore, since any parentheses inside $var shift the numbering.

If I recall correctly, the Python folks addressed this. One might check that.

--tom
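A sketch of the numbering problem, plus two spellings that sidestep it. Note that the \g{-1} relative backreference and the (?<name>...) named capture are features of later Perls (5.10 and up), not of the Perl under discussion in this thread; the contents of $var and the sample text are invented.

```perl
use 5.010;
use strict;
use warnings;

my $var  = '(\w+)';        # an interpolated fragment that brings its own group
my $text = "x foo foo";

# \1 was meant to refer to (foo), but now refers to the group inside $var.
my $absolute = ($text =~ /$var (foo) \1/) ? 1 : 0;

# A relative backreference always means "the group just before me".
my $relative = ($text =~ /$var (foo) \g{-1}/) ? 1 : 0;

# A named group does not care how many groups precede it.
my $named = ($text =~ /$var (?<word>foo) \k<word>/) ? 1 : 0;
my $word  = $named ? $+{word} : undef;

print "absolute=$absolute relative=$relative named=$named\n";
```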
Re: RFC 145 (v1) Brace-matching for Perl Regular Expressions
>=head1 ABSTRACT >It is quite difficult to match paired characters in Perl 5 regular >expressions. A solution is proposed, using new \g (match opening grouping >character) and \G (match closing grouping character) metacharacters. >Two new special variables, @^g and @^G control which strings are >considered grouping characters and what their complement is. What about the meaning that \G already holds? Wasn't one going to avoid using any more cryptic variables? You can't use $^g for a variable name, because you're pretending it's different than $^G. But notice that you can't use a lower case letter there. --tom
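For reference, a small sketch of the meaning \G already has in Perl 5: it anchors a match at the position where the previous m//g on the same string left off. The sample string is made up.

```perl
use strict;
use warnings;

my $s = "aaabbb";

# A scalar-context m//g remembers where it stopped, in pos($s).
my $after_a = ($s =~ /a+/g) ? pos($s) : -1;

# \G pins the next match to exactly that position.
my $after_b = ($s =~ /\Gb+/gc) ? pos($s) : -1;

# From a position where no "b" starts, the anchored match fails;
# /c keeps pos() where it was instead of resetting it.
pos($s) = 1;
my $anchored_fail = ($s =~ /\Gb+/gc) ? 1 : 0;

print "after_a=$after_a after_b=$after_b anchored_fail=$anchored_fail\n";
```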
Re: RFC 144 (v1) Behavior of empty regex should be simple
>I propose that this 'last successful match' behavior be discarded
>entirely, and that an empty pattern always match the empty string.

I don't see a consideration for simply s/successful// above, which has also been talked about. That would also match expected usage based upon existing editors.

--tom
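A minimal sketch of the Perl 5 behavior the RFC wants to drop, as documented in perlop: an empty pattern reuses the last successfully matched pattern. The strings are invented for the illustration.

```perl
use strict;
use warnings;

my @log;

"hello" =~ /l+/ or die "setup match failed";   # last successful pattern is now /l+/

# The empty pattern does not match the empty string here;
# it silently reuses /l+/.
push @log, ("world" =~ //) ? "match" : "no match";
push @log, ("abc"   =~ //) ? "match" : "no match";

print join(", ", @log), "\n";
```

This is the action at a distance complained about above: the result of the two empty matches depends on a match performed somewhere else.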
Re: RFC 138 (v1) Eliminate =~ operator.
>But I agree that such examples would certainly make a better argument.
>The only concrete thing I can come up with is that I and several other
>perl coders I know had a lot of trouble remembering the =~ symbol. I've
>been a very frequent perl user for about 8 years, and after I didn't use
>perl for about a month (2 week vacation + intense pressure at work,
>it'll never happen again, I promise!), I couldn't for the life of me
>remember whether it was ~= or =~. I've also observed one perl beginner
>look up the symbol in a book every single time she needed it for a new
>program.

Changing anything that has ever or shall ever confuse anyone is a task without end: you will never be done, so don't even start.

The =~ operator is perfectly obvious to csh programmers, of course, which is where it came from. There can be no ~= operator because that is obviously construing a binary ~ operator, which currently remains nonexistent. The ~ operator is unary only, thus far.

Whether there *should* be a binary ~ operator, one related to pattern matching, is a different question. The awkers expect there to be, but you can't please all the people all the time. They've got their !~, so hopefully this will semiappease them.

The occasionally proposed use for a binary ~ is mostly in conjunction with a substitute or translit, so that it returns the new string rather than the success count. The original would be unchanged.

    # eg "fred & barney" or "wilma and barney"
    if ( "barney" eq ($var ~ s/.*\W//) ) {

Or thus

    $alteration = ($original ~ s/old/new/);

Which is really just like

    ($alteration = $original) =~ s/old/new/;

but might be better optimized. Sure would be nice if these en-passant things would be better optimized anyway, though.

Seems to me that if ~ did this, then ~= might be a reasonable thing!

    $foo = $foo ~ s/old/new/

could then of course be written

    $foo ~= s/old/new/

which would in fact be the same as

    $foo =~ s/old/new/

Whether this would actually be desirable is highly open to debate. :-)

--tom
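A sketch of the two ways to get the altered string while leaving the original alone. The first is the en-passant idiom mentioned above. The second uses the /r modifier that later Perls (5.14 and up) added, which does roughly what the hypothetical binary ~ would have done; it is not the proposed operator itself. The sample strings are invented.

```perl
use 5.014;
use strict;
use warnings;

my $original = "old wine, old bottles";

# En passant: copy first, then substitute in the copy.
(my $copy = $original) =~ s/old/new/;

# /r returns the new string and leaves $original untouched.
my $alteration = $original =~ s/old/new/r;

# The "barney" test from above, written with /r.
my $var       = "wilma and barney";
my $is_barney = ("barney" eq ($var =~ s/.*\W//r)) ? 1 : 0;

print "$copy\n$alteration\n$original\nis_barney=$is_barney\n";
```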