RE: atomicness and \n

2002-09-04 Thread Bryan C. Warnock

On Tue, 2002-09-03 at 23:57, Luke Palmer wrote:
> On Tue, 3 Sep 2002, Brent Dax wrote:
> >
> > How can you be sure that roundascii is implemented as a character
> > class, as opposed to (say) an alternation?
>
> What's the difference? :)
>
> Neglecting internals, semantically what I<is> the difference?

One *possible* semantic difference is a guaranteed matching order.
Nothing (historically) has ever really dictated that character classes
must match left-to-right, as alternation does.

That's mainly because character classes have always been of a uniform
width, in which case it is only going to match one thing and one thing
only.  Whether that will be an issue with variable-width characters in a
class is largely going to rely on the semantics that are dictated.
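
To make the ordering point concrete, here is a rough sketch in Python's re module rather than Perl 6 rules (an analogy only; Perl 6 semantics were still undecided at the time). It assumes a backtracking engine that tries alternatives left to right, which Python's is. With single-width alternatives the listing order cannot change what is matched, but with alternatives of different widths it can:

```python
import re

# Single-width alternatives: whichever order they are listed in,
# the same one character is matched, just as with a character class.
assert re.match(r"a|b|c", "b").group() == "b"
assert re.match(r"c|b|a", "b").group() == "b"
assert re.match(r"[cba]", "b").group() == "b"

# Variable-width alternatives: the engine tries them left to right,
# so the listing order decides between "\r" and "\r\n".
text = "\r\nrest"
short_first = re.match(r"\r|\r\n", text).group()
long_first = re.match(r"\r\n|\r", text).group()

print(repr(short_first))
print(repr(long_first))
```

A character class holding variable-width members would need a rule of its own (listing order, longest match, or something else) to settle the same question.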

-- 
Bryan C. Warnock
bwarnock(gtemail.net|raba.com)



RE: atomicness and \n

2002-09-04 Thread Markus Laire

On 4 Sep 2002 at 0:22, Aaron Sherman wrote:

> On Wed, 2002-09-04 at 00:01, Sean O'Rourke wrote:
>
> > None, I think.  Of course, if we ignore internals, there's no
> > difference between that and rx /roundascii | 1 | 7/.
>
> Then, why is there a C<+>? Why not make it C<|>?
>
>   $foo = rx/ a|b|[cde]|f /

Because it's good to have MTOWTDI. (= More than one way to do it)

-- 
Markus Laire 'malaire' [EMAIL PROTECTED]





RE: atomicness and \n

2002-09-04 Thread Aaron Sherman

On Wed, 2002-09-04 at 09:55, Markus Laire wrote:
> On 4 Sep 2002 at 0:22, Aaron Sherman wrote:
>
> > On Wed, 2002-09-04 at 00:01, Sean O'Rourke wrote:
> >
> > > None, I think.  Of course, if we ignore internals, there's no
> > > difference between that and rx /roundascii | 1 | 7/.
> >
> > Then, why is there a C<+>? Why not make it C<|>?
> >
> > $foo = rx/ a|b|[cde]|f /
>
> Because it's good to have MTOWTDI. (= More than one way to do it)

But, there isn't. There's only one way to indicate character-class
unions, and that's C<+>. If we had C<+> and C<|> as synonyms, I'd be ok
with that, though I'd only tell people about C<|> to avoid the confusion
(mind if we call you Bruce?)





Re: atomicness and \n

2002-09-04 Thread Damian Conway

Jonathan Scott Duff wrote:

> How can you be sure that roundascii is
> implemented as a character class instead of being some other arbitrary
> rule? An answer is that perl should know how these things are
> implemented and if you try arithmetic on something that's not a
> character class, it should carp appropriately. Another answer might be
> that roundascii+[17] is actually syntactically illegal and you MUST
> perform character class arithmetic as [abc]+[def].
>
> Somehow I prefer the former to the latter.

It will definitely be the former, since we have to support named character
classes like alpha, digit, printable, etc.
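
A minimal sketch of what "the former" could look like, written in Python as an analogy rather than as anything Perl 6 will actually do. The names Rule and CharClass, and the sample classes, are hypothetical; the only point is that the engine knows which named rules are character classes and complains when class arithmetic is applied to anything else:

```python
class Rule:
    """Hypothetical stand-in for an arbitrary named rule."""

    def __init__(self, name):
        self.name = name


class CharClass(Rule):
    """Hypothetical named rule known to be a set of single characters."""

    def __init__(self, name, chars):
        super().__init__(name)
        self.chars = frozenset(chars)

    def __add__(self, other):
        # Union is only defined between two character classes.
        if not isinstance(other, CharClass):
            raise TypeError(f"{other.name} is not a character class")
        return CharClass(f"{self.name}+{other.name}", self.chars | other.chars)


roundish = CharClass("roundish", "abdeo")   # made-up class for illustration
one_seven = CharClass("[17]", "17")
eol = Rule("eol")                           # an arbitrary, non-class rule

union = roundish + one_seven                # fine: both are classes

try:
    roundish + eol                          # not a class, so this "carps"
except TypeError as err:
    complaint = str(err)

print(sorted(union.chars))
print(complaint)
```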

Damian





RE: atomicness and \n

2002-09-03 Thread Brent Dax

Damian Conway:
# Neither. You need:
# 
#  $roundor7 = rx /roundascii+[17]/
# 
# That is: the union of the two character classes.

How can you be sure that roundascii is implemented as a character
class, as opposed to (say) an alternation?

--Brent Dax [EMAIL PROTECTED]
@roles=map {Parrot $_} qw(embedding regexen Configure)

In other words, it's the 'Blow up this Entire Planet and Possibly One
or Two Others We Noticed on our Way Out Here' operator.
--Damian Conway




RE: atomicness and \n

2002-09-03 Thread Luke Palmer

On Tue, 3 Sep 2002, Brent Dax wrote:

> Damian Conway:
> # Neither. You need:
> #
> #  $roundor7 = rx /roundascii+[17]/
> #
> # That is: the union of the two character classes.
>
> How can you be sure that roundascii is implemented as a character
> class, as opposed to (say) an alternation?

What's the difference? :)

Neglecting internals, semantically what I<is> the difference?

Luke





RE: atomicness and \n

2002-09-03 Thread Sean O'Rourke

On Tue, 3 Sep 2002, Luke Palmer wrote:

> On Tue, 3 Sep 2002, Brent Dax wrote:
>
> > Damian Conway:
> > # Neither. You need:
> > #
> > #  $roundor7 = rx /roundascii+[17]/
> > #
> > # That is: the union of the two character classes.
> >
> > How can you be sure that roundascii is implemented as a character
> > class, as opposed to (say) an alternation?
>
> What's the difference? :)
>
> Neglecting internals, semantically what I<is> the difference?

None, I think.  Of course, if we ignore internals, there's no difference
between that and rx /roundascii | 1 | 7/.

/s




Re: atomicness and \n

2002-09-03 Thread Jonathan Scott Duff

On Tue, Sep 03, 2002 at 09:57:31PM -0600, Luke Palmer wrote:
> On Tue, 3 Sep 2002, Brent Dax wrote:
>
> > Damian Conway:
> > # Neither. You need:
> > #
> > #  $roundor7 = rx /roundascii+[17]/
> > #
> > # That is: the union of the two character classes.
> >
> > How can you be sure that roundascii is implemented as a character
> > class, as opposed to (say) an alternation?
>
> What's the difference? :)
>
> Neglecting internals, semantically what I<is> the difference?

I think the point still stands. How can you be sure that roundascii is
implemented as a character class instead of being some other arbitrary
rule? An answer is that perl should know how these things are
implemented and if you try arithmetic on something that's not a
character class, it should carp appropriately. Another answer might be
that roundascii+[17] is actually syntactically illegal and you MUST
perform character class arithmetic as [abc]+[def].

Somehow I prefer the former to the latter.

-Scott
-- 
Jonathan Scott Duff
[EMAIL PROTECTED]



RE: atomicness and \n

2002-09-03 Thread Aaron Sherman

On Wed, 2002-09-04 at 00:01, Sean O'Rourke wrote:
> On Tue, 3 Sep 2002, Luke Palmer wrote:
>
> > On Tue, 3 Sep 2002, Brent Dax wrote:
> >
> > > Damian Conway:
> > >
> > > #  $roundor7 = rx /roundascii+[17]/
> > > #
> > > # That is: the union of the two character classes.
> > >
> > > How can you be sure that roundascii is implemented as a character
> > > class, as opposed to (say) an alternation?
> >
> > What's the difference? :)
>
> None, I think.  Of course, if we ignore internals, there's no difference
> between that and rx /roundascii | 1 | 7/.

Then, why is there a C<+>? Why not make it C<|>?

$foo = rx/ a|b|[cde]|f /






Re: atomicness and \n

2002-09-02 Thread Dan Sugalski

At 9:24 PM -0400 8/31/02, Ken Fox wrote:
> Damian Conway wrote:
> > No. It will be equivalent to:
> >
> >    [\x0a\x0d...]
>
> I don't think \n can be a character class because it
> is a two character sequence on some systems. Apoc 5
> said \n will be the same everywhere, so won't it be
> something like
>
>    rule \n { \x0d \x0a | \x0d | \x0a }

That should be

   rule ASCII::\n

or something of the sort. That particular rule will only be valid for 
ASCII data. Unicode will have a superset of that, and the other 
character sets will have a different line ending rule.

This, like the other shortcut characters, will be character-set
specific. (And overridable, in case someone feels like making \b work
properly (FSVO properly) for Asian data that doesn't use word
delimiters.)
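
As a rough illustration of "Unicode will have a superset", here is the same data split two ways in Python (an analogy only, not Perl 6 syntax). ASCII_EOL is a made-up name for the ASCII-only rule quoted above; str.splitlines() is Python's built-in splitter, which also treats Unicode separators such as U+2028 as line boundaries:

```python
import re

# ASCII-only line-ending rule, as in: \x0d \x0a | \x0d | \x0a
ASCII_EOL = re.compile(r"\r\n|\r|\n")

data = "one\r\ntwo\u2028three\nfour"

# The ASCII rule does not know about U+2028 LINE SEPARATOR.
ascii_lines = ASCII_EOL.split(data)

# Python's splitlines() recognises a Unicode superset of line endings.
unicode_lines = data.splitlines()

print(ascii_lines)
print(unicode_lines)
```

The ASCII rule leaves "two" and "three" joined in one line, while the Unicode-aware split separates them.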
-- 
 Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
   teddy bears get drunk



Re: atomicness and \n

2002-08-31 Thread Damian Conway

Aaron Sherman wrote:

> Is C<\n> going to be a rule (e.g. C< <eol> >)

There might be a named rule like that. But C<\n> will certainly
still be available.

> or is it implicitly translated to:
>
>   [\x0a\x0d...]+

No. It will be equivalent to:

[\x0a\x0d...]

(no repetition)


> Along those lines, will
>
>   [\n]
>
> work

Yes.



> Hmm... this is a slippery slope. That gets me thinking about
>
>   rule roundascii { [a-hjm-uB-DGJO-SU23568-0] }
>   $roundor7 = rx /[roundascii17]/;
>
> or do I have to
>
>   $roundor7 = rx /roundorascii|[17]/;

Neither. You need:

 $roundor7 = rx /roundascii+[17]/

That is: the union of the two character classes.


Damian





Re: atomicness and \n

2002-08-31 Thread Simon Cozens

[EMAIL PROTECTED] (Damian Conway) writes:
> Neither. You need:
>  $roundor7 = rx /roundascii+[17]/
> That is: the union of the two character classes.

Thank you; that wasn't in A5, E5 or S5. Will there be foo-bar as
well?

-- 
I wish my keyboard had a SMITE key
-- J-P Stacey



Re: atomicness and \n

2002-08-31 Thread Me

> >  $roundor7 = rx /roundascii+[17]/
> > That is: the union of the two character classes.
>
> Thank you; that wasn't in A5, E5 or S5. Will there be foo-bar as
> well?

From A5:

The outer <...> also naturally serves as a container
for any extra syntax we decide to come up with for
character set manipulation:

<[_]+<alpha>+<digit>-<Swedish>>
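
One way to read that expression is as plain set arithmetic, sketched here in Python as an analogy. The definitions of alpha and swedish below are invented for the example, and left-to-right evaluation is an assumption; A5 does not spell either out:

```python
import re
import string

underscore = set("_")
swedish = set("åäöÅÄÖ")                          # hypothetical definition
alpha = set(string.ascii_letters) | swedish      # hypothetical definition
digit = set(string.digits)

# [_] + alpha + digit - Swedish, evaluated left to right (assumed)
members = (underscore | alpha | digit) - swedish

# Turn the resulting set back into an ordinary character class.
pattern = re.compile("[" + "".join(re.escape(c) for c in sorted(members)) + "]")

print(bool(pattern.fullmatch("x")), bool(pattern.fullmatch("å")))
```

Here `+` is a union and `-` a difference, so the underscore, ASCII letters and digits stay in the class and the Swedish letters drop out.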

--
ralph




Re: atomicness and \n

2002-08-31 Thread Aaron Sherman

On Sat, 2002-08-31 at 07:07, Damian Conway wrote:
> Aaron Sherman wrote:
>
> > Is C<\n> going to be a rule (e.g. C< <eol> >)
>
> There might be a named rule like that. But C<\n> will certainly
> still be available.
>
> > or is it implicitly translated to:
> >
> >   [\x0a\x0d...]+
>
> No. It will be equivalent to:
>
>   [\x0a\x0d...]
>
> (no repetition)

Didn't A5 or E5 say that C<\n> was going to match sequences like
C<\x0d\x0a> so that file formats that treat this as EOL would be
supported natively? Was this a mistake? If it's not a mistake, then is
there a way to minimally match the first end-of-line character,
regardless of what follows it (even if it's another end-of-line
character)?

That was the crux of my question.

On the union operator stuff, thanks; I'd forgotten about that.





Re: atomicness and \n

2002-08-31 Thread Ken Fox

Damian Conway wrote:
> No. It will be equivalent to:
>
>   [\x0a\x0d...]

I don't think \n can be a character class because it
is a two character sequence on some systems. Apoc 5
said \n will be the same everywhere, so won't it be
something like

   rule \n { \x0d \x0a | \x0d | \x0a }

Hmm. Now that I read that, I'm thinking some characters
will be multi-byte sequences. Is there going to be
multi-byte magic for line endings? Even in ASCII data
streams?
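
To show why the two readings differ, here is a small sketch in Python's re module (an analogy, not Perl 6 syntax). The character-class reading treats CR LF as two separate line endings, while the alternation reading treats it as one:

```python
import re

data = "first\r\nsecond\rthird\n"

# Character-class reading: any single CR or LF is a line ending.
as_class = re.findall(r"[\r\n]", data)

# Rule reading: \x0d \x0a | \x0d | \x0a, with the two-character form tried first.
as_rule = re.findall(r"\r\n|\r|\n", data)

print(len(as_class), as_class)
print(len(as_rule), as_rule)
```

The class finds four line endings in data that has three lines; the rule finds three.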

- Ken