RE: atomicness and \n
On Tue, 2002-09-03 at 23:57, Luke Palmer wrote:
> On Tue, 3 Sep 2002, Brent Dax wrote:
> > How can you be sure that <roundascii> is implemented as a character
> > class, as opposed to (say) an alternation?
>
> What's the difference? :) Neglecting internals, semantically what I<is>
> the difference?

One *possible* semantic difference is a guaranteed matching order.
Nothing (historically) has ever really dictated that character classes
must match left-to-right, as alternation does. That's mainly because
character classes have always been of a uniform width, in which case it
is only going to match one thing and one thing only. Whether that will
be an issue with variable-width characters in a class is largely going
to rely on the semantics that are dictated.

--
Bryan C. Warnock
bwarnock(gtemail.net|raba.com)
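The matching-order point can be illustrated outside Perl 6. What follows is a minimal sketch in Python's `re` module, not the rule syntax under discussion; the patterns and the helper `first_match` are made up for illustration. It only shows that an ordered alternation is sensitive to the order of its branches once they differ in width, while a class of single characters is not.

```python
import re

# Alternation is ordered: the first alternative that matches wins,
# so listing the shorter alternative first hides the longer one.
short_first = re.compile(r"a|ab")
long_first = re.compile(r"ab|a")

# A character class is a set of single characters, so the order in
# which its members are listed cannot change what it matches.
class_17 = re.compile(r"[17]")
class_71 = re.compile(r"[71]")

def first_match(pattern, text):
    """Return the text of the leftmost match, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None
```

If a class were ever allowed to hold variable-width members, it would need a rule (declared order, longest first, or something else) to decide between cases like the first two patterns.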
RE: atomicness and \n
On 4 Sep 2002 at 0:22, Aaron Sherman wrote:
> On Wed, 2002-09-04 at 00:01, Sean O'Rourke wrote:
> > None, I think. Of course, if we ignore internals, there's no
> > difference between that and rx /<roundascii> | 1 | 7/.
>
> Then, why is there a C<+>? Why not make it C<|>?
>
>     $foo = rx/ a|b|<[cde]>|f /

Because it's good to have MTOWTDI.
(= More than one way to do it)

--
Markus Laire 'malaire' [EMAIL PROTECTED]
RE: atomicness and \n
On Wed, 2002-09-04 at 09:55, Markus Laire wrote:
> On 4 Sep 2002 at 0:22, Aaron Sherman wrote:
> > On Wed, 2002-09-04 at 00:01, Sean O'Rourke wrote:
> > > None, I think. Of course, if we ignore internals, there's no
> > > difference between that and rx /<roundascii> | 1 | 7/.
> >
> > Then, why is there a C<+>? Why not make it C<|>?
> >
> >     $foo = rx/ a|b|<[cde]>|f /
>
> Because it's good to have MTOWTDI.
> (= More than one way to do it)

But, there isn't. There's only one way to indicate character-class
unions, and that's C<+>. If we had C<+> and C<|> as synonyms, I'd be ok
with that, though I'd only tell people about C<|> to avoid the confusion
(mind if we call you Bruce?)
Re: atomicness and \n
Jonathan Scott Duff wrote:
> How can you be sure that <roundascii> is implemented as a character
> class instead of being some other arbitrary rule? An answer is that
> perl should know how these things are implemented and if you try
> arithmetic on something that's not a character class, it should carp
> appropriately. Another answer might be that <<roundascii>+[17]> is
> actually syntactically illegal and you MUST perform character class
> arithmetic as <[abc]+[def]>. Somehow I prefer the former to the latter.

It will definitely be the former, since we have to support named
character classes like <alpha>, <digit>, <printable>, etc.

Damian
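The "know how it is implemented, and carp otherwise" option could look roughly like this. This is a sketch in Python under loud assumptions: the registry `RULES`, the rule name `identifier`, and the function `class_union` are all hypothetical, and the members given for roundascii are a small stand-in rather than the class from the thread. Nothing here reflects how Perl 6 actually stores rules.

```python
# Hypothetical registry: a rule name maps to a set of characters when the
# rule is a character class, or to None when it is some other kind of rule.
RULES = {
    "roundascii": set("abcdeo"),   # stand-in members, not the real class
    "identifier": None,            # an arbitrary multi-character rule
}

def class_union(name, extra):
    """Union a named class with extra characters, or complain loudly.

    Unknown names are treated the same way as non-class rules.
    """
    members = RULES.get(name)
    if members is None:
        raise TypeError(f"{name} is not a character class")
    return members | set(extra)
```

With this shape, `class_union("roundascii", "17")` yields a larger set, while `class_union("identifier", "17")` raises instead of silently producing something meaningless.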
RE: atomicness and \n
Damian Conway:
# Neither. You need:
#
#     $roundor7 = rx /<<roundascii>+[17]>/
#
# That is: the union of the two character classes.

How can you be sure that <roundascii> is implemented as a character
class, as opposed to (say) an alternation?

--Brent Dax [EMAIL PROTECTED]
@roles=map {Parrot $_} qw(embedding regexen Configure)

In other words, it's the 'Blow up this Entire Planet and Possibly One
or Two Others We Noticed on our Way Out Here' operator.
    --Damian Conway
RE: atomicness and \n
On Tue, 3 Sep 2002, Brent Dax wrote:
> Damian Conway:
> # Neither. You need:
> #
> #     $roundor7 = rx /<<roundascii>+[17]>/
> #
> # That is: the union of the two character classes.
>
> How can you be sure that <roundascii> is implemented as a character
> class, as opposed to (say) an alternation?

What's the difference? :) Neglecting internals, semantically what I<is>
the difference?

Luke
RE: atomicness and \n
On Tue, 3 Sep 2002, Luke Palmer wrote:
> On Tue, 3 Sep 2002, Brent Dax wrote:
> > Damian Conway:
> > # Neither. You need:
> > #
> > #     $roundor7 = rx /<<roundascii>+[17]>/
> > #
> > # That is: the union of the two character classes.
> >
> > How can you be sure that <roundascii> is implemented as a character
> > class, as opposed to (say) an alternation?
>
> What's the difference? :) Neglecting internals, semantically what I<is>
> the difference?

None, I think. Of course, if we ignore internals, there's no difference
between that and rx /<roundascii> | 1 | 7/.

/s
Re: atomicness and \n
On Tue, Sep 03, 2002 at 09:57:31PM -0600, Luke Palmer wrote:
> On Tue, 3 Sep 2002, Brent Dax wrote:
> > Damian Conway:
> > # Neither. You need:
> > #
> > #     $roundor7 = rx /<<roundascii>+[17]>/
> > #
> > # That is: the union of the two character classes.
> >
> > How can you be sure that <roundascii> is implemented as a character
> > class, as opposed to (say) an alternation?
>
> What's the difference? :) Neglecting internals, semantically what I<is>
> the difference?

I think the point still stands. How can you be sure that <roundascii>
is implemented as a character class instead of being some other
arbitrary rule?

An answer is that perl should know how these things are implemented and
if you try arithmetic on something that's not a character class, it
should carp appropriately.

Another answer might be that <<roundascii>+[17]> is actually
syntactically illegal and you MUST perform character class arithmetic as
<[abc]+[def]>. Somehow I prefer the former to the latter.

-Scott
--
Jonathan Scott Duff
[EMAIL PROTECTED]
RE: atomicness and \n
On Wed, 2002-09-04 at 00:01, Sean O'Rourke wrote:
> On Tue, 3 Sep 2002, Luke Palmer wrote:
> > On Tue, 3 Sep 2002, Brent Dax wrote:
> > > Damian Conway:
> > > #     $roundor7 = rx /<<roundascii>+[17]>/
> > > #
> > > # That is: the union of the two character classes.
> > >
> > > How can you be sure that <roundascii> is implemented as a
> > > character class, as opposed to (say) an alternation?
> >
> > What's the difference? :)
>
> None, I think. Of course, if we ignore internals, there's no
> difference between that and rx /<roundascii> | 1 | 7/.

Then, why is there a C<+>? Why not make it C<|>?

    $foo = rx/ a|b|<[cde]>|f /
Re: atomicness and \n
At 9:24 PM -0400 8/31/02, Ken Fox wrote:
> Damian Conway wrote:
> > No. It will be equivalent to:
> >
> >     <[\x0a\x0d...]>
>
> I don't think \n can be a character class because it is a
> two character sequence on some systems. Apoc 5 said \n
> will be the same everywhere, so won't it be something like
>
>     rule \n { \x0d \x0a | \x0d | \x0a }

That should be rule ASCII::\n or something of the sort. That particular
rule will only be valid for ASCII data. Unicode will have a superset of
that, and the other character sets will have a different line ending
rule.

This, like the other shortcut characters, will be character-set
specific. (And overridable, in case someone feels like making \b work
properly (FSVO properly) for Asian data that doesn't use word
delimiters)

--
Dan

--it's like this---
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
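As a rough illustration of "character-set specific" line endings, here is a sketch in Python rather than Perl 6. The ASCII pattern mirrors the alternation quoted above. The Unicode pattern is an assumption on my part: it uses the line-boundary set from Unicode Technical Standard #18 (adding VT, FF, NEL, LS and PS), which is not something this thread spells out. The names `ASCII_NEWLINE`, `UNICODE_NEWLINE` and `split_lines` are invented for the example.

```python
import re

# ASCII-only rule, mirroring the alternation quoted above.
# CRLF is listed first so that it counts as a single line ending.
ASCII_NEWLINE = re.compile(r"\x0d\x0a|\x0d|\x0a")

# Assumed superset for Unicode data (line boundaries per UTS #18).
UNICODE_NEWLINE = re.compile(r"\x0d\x0a|[\x0a\x0b\x0c\x0d\x85\u2028\u2029]")

def split_lines(text, newline=ASCII_NEWLINE):
    """Split text on whichever newline rule the caller selects."""
    return newline.split(text)
```

Passing the rule in as a parameter is one way to get the "overridable" behaviour: a caller working with another character set supplies its own pattern.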
Re: atomicness and \n
Aaron Sherman wrote:
> Is C<\n> going to be a rule (e.g. C<< <eol> >>)

There might be a named rule like that. But C<\n> will certainly still
be available.

> or is it implicitly translated to:
>
>     <[\x0a\x0d...]>+

No. It will be equivalent to:

    <[\x0a\x0d...]>

(no repetition)

> Along those lines, will <[\n]> work

Yes.

> Hmm... this is a slippery slope. That gets me thinking about
>
>     rule roundascii { <[a-hjm-uB-DGJO-SU23568-0]> }
>     $roundor7 = rx /<[<roundascii>17]>/;
>
> or do I have to
>
>     $roundor7 = rx /<roundascii>|<[17]>/;

Neither. You need:

    $roundor7 = rx /<<roundascii>+[17]>/

That is: the union of the two character classes.

Damian
Re: atomicness and \n
[EMAIL PROTECTED] (Damian Conway) writes:
> Neither. You need:
>
>     $roundor7 = rx /<<roundascii>+[17]>/
>
> That is: the union of the two character classes.

Thank you; that wasn't in A5, E5 or S5. Will there be foo-bar as well?

--
I wish my keyboard had a SMITE key
    -- J-P Stacey
Re: atomicness and \n
> >     $roundor7 = rx /<<roundascii>+[17]>/
> >
> > That is: the union of the two character classes.
>
> Thank you; that wasn't in A5, E5 or S5. Will there be foo-bar as well?

From A5:

    The outer <...> also naturally serves as a container for any extra
    syntax we decide to come up with for character set manipulation:

        <[_]+<alpha>+<digit>-<Swedish>>

--
ralph
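One way to read the A5 expression is as ordinary set arithmetic on characters. The sketch below is in Python, not Perl 6, and rests on several assumptions: the operators are applied left to right (A5 does not say), `alpha` is taken to be ASCII letters plus six Swedish letters, and `swedish` is a stand-in set chosen for the example rather than any real definition of the Swedish class. The helper `to_regex` is likewise hypothetical.

```python
import re
import string

# Hypothetical named classes, modelled as plain sets of characters.
swedish = set("åäöÅÄÖ")                        # stand-in, not A5's definition
alpha = set(string.ascii_letters) | swedish    # assumed to include them
digit = set(string.digits)

# The A5 expression, read left to right: [_] + alpha + digit - Swedish
word_chars = ({"_"} | alpha | digit) - swedish

def to_regex(chars):
    """Compile a set of characters into an equivalent character class."""
    body = "".join(re.escape(c) for c in sorted(chars))
    return re.compile("[" + body + "]")

word = to_regex(word_chars)
```

Under this reading, subtraction (the foo-bar case asked about above) is just set difference, so `word` matches `_`, `x` and `7` but not `ö`.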
Re: atomicness and \n
On Sat, 2002-08-31 at 07:07, Damian Conway wrote:
> Aaron Sherman wrote:
> > Is C<\n> going to be a rule (e.g. C<< <eol> >>)
>
> There might be a named rule like that. But C<\n> will certainly still
> be available.
>
> > or is it implicitly translated to:
> >
> >     <[\x0a\x0d...]>+
>
> No. It will be equivalent to:
>
>     <[\x0a\x0d...]>
>
> (no repetition)

Didn't A5 or E5 say that C<\n> was going to match sequences like
C<\x0d\x0a> so that file formats that treat this as EOL would be
supported natively? Was this a mistake?

If it's not a mistake, then is there a way to minimally match the first
end-of-line character, regardless of what follows it (even if it's
another end-of-line character)? That was the crux of my question.

On the union operator stuff, thanks; I'd forgotten about that.
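The two readings of \n give different answers on CRLF data, which is the crux of the question. A small sketch in Python (not Perl 6) shows this; the names `EOL_SEQUENCE`, `EOL_CHAR` and `first_eol` are invented here, and neither pattern is claimed to be what Perl 6 will do.

```python
import re

# Reading 1: \n as a rule that may consume a two-character sequence.
EOL_SEQUENCE = re.compile(r"\x0d\x0a|\x0d|\x0a")

# Reading 2: \n as a character class, always exactly one character.
EOL_CHAR = re.compile(r"[\x0a\x0d]")

def first_eol(pattern, text):
    """Return the first end-of-line match in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None
```

On `"one\r\ntwo"` the first pattern consumes both characters as one line ending, while the second stops after the carriage return, which is the "minimal" match being asked for.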
Re: atomicness and \n
Damian Conway wrote:
> No. It will be equivalent to:
>
>     <[\x0a\x0d...]>

I don't think \n can be a character class because it is a
two character sequence on some systems. Apoc 5 said \n
will be the same everywhere, so won't it be something like

    rule \n { \x0d \x0a | \x0d | \x0a }

Hmm. Now that I read that, I'm thinking some characters will be
multi-byte sequences. Is there going to be multi-byte magic for line
endings? Even in ASCII data streams?

- Ken