atomicness and \n

2002-08-30 Thread Aaron Sherman

Is C<\n> going to be a rule (e.g. C<<  >>) or is it implicitly
translated to:

<[\x0a\x0d...]>+

If it's the latter, then what does this do?

\n?

Do I get

[<[\x0a\x0d...]>+]?

Or do I get

<[\x0a\x0d...]>+?

If the former (which I assume is the case), how do I get the latter
without having to know what the C<...> is, above?

Along those lines, will

<[\n]>

work, or should I expect that to be an error because C<\n> is no longer
an actual character class?

Hmm... this is a slippery slope. That gets me thinking about

rule roundascii { <[a-hjm-uB-DGJO-SU23568-0]> }
$roundor7 = rx /<[17]>/;

or do I have to

$roundor7 = rx /<roundascii>|<[17]>/;

Which seems somewhat clunky and less efficient.
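
The two readings of C<\n?> asked about above can be sketched in Python's re module, which is only a stand-in for Perl 6 rules. The two-character class [\x0a\x0d] is an assumption replacing the elided <[\x0a\x0d...]>, and the names EOL_CLASS, grouped_optional and minimal_repeat are hypothetical.

```python
import re

# Assumption: a two-character class stands in for the elided <[\x0a\x0d...]>.
EOL_CLASS = r"[\x0a\x0d]"

# Reading 1: the whole "one or more EOL characters" unit is optional.
grouped_optional = re.compile(rf"(?:{EOL_CLASS}+)?")

# Reading 2: the ? attaches to the +, giving minimal (non-greedy) repetition.
minimal_repeat = re.compile(rf"{EOL_CLASS}+?")

text = "\r\n\r\nrest"
grouped = grouped_optional.match(text).group()  # greedy run of EOL characters
minimal = minimal_repeat.match(text).group()    # stops after one character
```

Under these assumptions the first reading consumes the whole run of line-ending characters (or nothing at all), while the second consumes exactly one and fails when none is present.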





Re: atomicness and \n

2002-08-31 Thread Damian Conway

Aaron Sherman wrote:

> Is C<\n> going to be a rule (e.g. C<<  >>)

There might be a named rule like that. But C<\n> will certainly
still be available.

> or is it implicitly translated to:
> 
>   <[\x0a\x0d...]>+

No. It will be equivalent to:

<[\x0a\x0d...]>

(no repetition)


> Along those lines, will
> 
>   <[\n]>
> 
> work

Yes.



> Hmm... this is a slippery slope. That gets me thinking about
> 
>   rule roundascii { <[a-hjm-uB-DGJO-SU23568-0]> }
>   $roundor7 = rx /<[17]>/;
> 
> or do I have to
> 
>   $roundor7 = rx /<roundascii>|<[17]>/;

Neither. You need:

 $roundor7 = rx /<+[17]>/

That is: the union of the two character classes.


Damian
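
As a rough illustration of "the union of the two character classes", here is a sketch in Python, whose re module has no class-union syntax, so the union is taken on plain sets first. The helper char_class and the contents of round_chars are hypothetical stand-ins, not the roundascii rule from the thread.

```python
import re

def char_class(chars):
    """Build a regex character class matching any one character in chars."""
    return "[" + "".join(re.escape(c) for c in sorted(chars)) + "]"

# Hypothetical stand-in for a named class such as roundascii.
round_chars = set("abcdeg")
extra_chars = set("17")

# Set union first, then a single character class built from the result.
roundor7 = re.compile(char_class(round_chars | extra_chars))
```

The result is still a single class: it matches exactly one character, drawn from either of the original sets.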





Re: atomicness and \n

2002-08-31 Thread Simon Cozens

[EMAIL PROTECTED] (Damian Conway) writes:
> Neither. You need:
>  $roundor7 = rx /<+[17]>/
> That is: the union of the two character classes.

Thank you; that wasn't in A5, E5 or S5. Will there be <-> as
well?

-- 
I wish my keyboard had a SMITE key
-- J-P Stacey



Re: atomicness and \n

2002-08-31 Thread Me

> >  $roundor7 = rx /<+[17]>/
> > That is: the union of the two character classes.
>
> Thank you; that wasn't in A5, E5 or S5. Will there be <-> as
> well?

From A5:

The outer <...> also naturally serves as a container
for any extra syntax we decide to come up with for
character set manipulation:

<[_]+<alpha>+<digit>-<Swedish>>

--
ralph




Re: atomicness and \n

2002-08-31 Thread Aaron Sherman

On Sat, 2002-08-31 at 07:07, Damian Conway wrote:
> Aaron Sherman wrote:
> 
> > Is C<\n> going to be a rule (e.g. C<<  >>)
> 
> There might be a named rule like that. But C<\n> will certainly
> still be available.
> 
> > or is it implicitly translated to:
> > 
> > <[\x0a\x0d...]>+
> 
> No. It will be equivalent to:
> 
>   <[\x0a\x0d...]>
> 
> (no repetition)

Didn't A5 or E5 say that C<\n> was going to match sequences like
C<\x0d\x0a> so that file formats that treat this as EOL would be
supported natively? Was this a mistake? If it's not a mistake, then is
there a way to minimally match "the first end-of-line character,
regardless of what follows it (even if it's another end-of-line
character)"?

That was the crux of my question.

On the union operator stuff, thanks; I'd forgotten about that.





Re: atomicness and \n

2002-08-31 Thread Ken Fox

Damian Conway wrote:
> No. It will be equivalent to:
> 
>   <[\x0a\x0d...]>

I don't think \n can be a character class because it
is a two character sequence on some systems. Apoc 5
said \n will be the same everywhere, so won't it be
something like

   rule \n { \x0d \x0a | \x0d | \x0a }

Hmm. Now that I read that, I'm thinking some characters
will be multi-byte sequences. Is there going to be
multi-byte magic for line endings? Even in ASCII data
streams?

- Ken
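
The gap between the character-class reading and the alternation rule above can be sketched with Python's re module as a stand-in for Perl 6 rules. The two-character class is an assumption replacing the elided <[\x0a\x0d...]>, and the names newline_rule and newline_class are hypothetical.

```python
import re

# Alternation in the spirit of: rule \n { \x0d \x0a | \x0d | \x0a }
newline_rule = re.compile(r"\x0d\x0a|\x0d|\x0a")

# Single-character class (assumed stand-in for the elided class).
newline_class = re.compile(r"[\x0a\x0d]")

crlf = "\r\n"
rule_match = newline_rule.match(crlf).group()    # takes CR LF as one unit
class_match = newline_class.match(crlf).group()  # takes only the CR

# Splitting mixed line endings shows the practical difference.
rule_lines = newline_rule.split("a\r\nb\rc\nd")
class_lines = newline_class.split("a\r\nb")
```

With the alternation, a CR LF pair counts as one line ending; with the class, the same pair counts as two, which leaves an empty field when splitting.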




Re: atomicness and \n

2002-09-02 Thread Dan Sugalski

At 9:24 PM -0400 8/31/02, Ken Fox wrote:
>Damian Conway wrote:
>>No. It will be equivalent to:
>>
>>   <[\x0a\x0d...]>
>
>I don't think \n can be a character class because it
>is a two character sequence on some systems. Apoc 5
>said \n will be the same everywhere, so won't it be
>something like
>
>   rule \n { \x0d \x0a | \x0d | \x0a }

That should be

   rule ASCII::\n

or something of the sort. That particular rule will only be valid for 
ASCII data. Unicode will have a superset of that, and the other 
character sets will have a different line ending rule.

This, like the other shortcut characters, will be character-set 
specific. (And overridable, in case someone feels like making \b work 
properly (FSVO "properly") for Asian data that doesn't use word 
delimiters)
-- 
 Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
   teddy bears get drunk



RE: atomicness and \n

2002-09-03 Thread Brent Dax

Damian Conway:
# Neither. You need:
# 
#  $roundor7 = rx /<+[17]>/
# 
# That is: the union of the two character classes.

How can you be sure that <roundascii> is implemented as a character
class, as opposed to (say) an alternation?

--Brent Dax <[EMAIL PROTECTED]>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)

"In other words, it's the 'Blow up this Entire Planet and Possibly One
or Two Others We Noticed on our Way Out Here' operator."
--Damian Conway




RE: atomicness and \n

2002-09-03 Thread Luke Palmer

On Tue, 3 Sep 2002, Brent Dax wrote:

> Damian Conway:
> # Neither. You need:
> # 
> #  $roundor7 = rx /<+[17]>/
> # 
> # That is: the union of the two character classes.
> 
> How can you be sure that <roundascii> is implemented as a character
> class, as opposed to (say) an alternation?

What's the difference? :)

Neglecting internals, semantically what is the difference?

Luke





RE: atomicness and \n

2002-09-03 Thread Sean O'Rourke

On Tue, 3 Sep 2002, Luke Palmer wrote:

> On Tue, 3 Sep 2002, Brent Dax wrote:
>
> > Damian Conway:
> > # Neither. You need:
> > #
> > #  $roundor7 = rx /<+[17]>/
> > #
> > # That is: the union of the two character classes.
> >
> > How can you be sure that <roundascii> is implemented as a character
> > class, as opposed to (say) an alternation?
>
> What's the difference? :)
>
> Neglecting internals, semantically what is the difference?

None, I think.  Of course, if we ignore internals, there's no difference
between that and "rx /<roundascii> | 1 | 7/".

/s




Re: atomicness and \n

2002-09-03 Thread Jonathan Scott Duff

On Tue, Sep 03, 2002 at 09:57:31PM -0600, Luke Palmer wrote:
> On Tue, 3 Sep 2002, Brent Dax wrote:
> 
> > Damian Conway:
> > # Neither. You need:
> > # 
> > #  $roundor7 = rx /<+[17]>/
> > # 
> > # That is: the union of the two character classes.
> > 
> > How can you be sure that <roundascii> is implemented as a character
> > class, as opposed to (say) an alternation?
> 
> What's the difference? :)
> 
> Neglecting internals, semantically what is the difference?

I think the point still stands. How can you be sure that <roundascii> is
implemented as a character class instead of being some other arbitrary
rule? An answer is that perl should know how these things are
implemented and if you try arithmetic on something that's not a
character class, it should carp appropriately. Another answer might be
that <+[17]> is actually syntactically illegal and you MUST
perform character class arithmetic as <[abc]+[def]>.

Somehow I prefer the former to the latter.

-Scott
-- 
Jonathan Scott Duff
[EMAIL PROTECTED]



RE: atomicness and \n

2002-09-03 Thread Aaron Sherman

On Wed, 2002-09-04 at 00:01, Sean O'Rourke wrote:
> On Tue, 3 Sep 2002, Luke Palmer wrote:
> 
> > On Tue, 3 Sep 2002, Brent Dax wrote:
> >
> > > Damian Conway:

> > > #  $roundor7 = rx /<+[17]>/
> > > #
> > > # That is: the union of the two character classes.
> > >
> > > How can you be sure that <roundascii> is implemented as a character
> > > class, as opposed to (say) an alternation?
> >
> > What's the difference? :)

> None, I think.  Of course, if we ignore internals, there's no difference
> between that and "rx /<roundascii> | 1 | 7/".

Then, why is there a C<+>? Why not make it C<|>?

$foo = rx/ <||[cde]>|f /






RE: atomicness and \n

2002-09-04 Thread Bryan C. Warnock

On Tue, 2002-09-03 at 23:57, Luke Palmer wrote:
> On Tue, 3 Sep 2002, Brent Dax wrote:
> > 
> > How can you be sure that <roundascii> is implemented as a character
> > class, as opposed to (say) an alternation?
> 
> What's the difference? :)
> 
> Neglecting internals, semantically what is the difference?
> 

One *possible* semantic difference is a guaranteed matching order.
Nothing (historically) has ever really dictated that character classes
must match left-to-right, as alternation does.

That's mainly because character classes have always been of a uniform
width, in which case it is only going to match one thing and one thing
only.  Whether that will be an issue with variable-width characters in a
class is largely going to rely on the semantics that are dictated.

-- 
Bryan C. Warnock
bwarnock@(gtemail.net|raba.com)
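
The ordering point can be sketched with Python's re module, a backtracking engine that tries alternatives left to right. It is only a stand-in here and says nothing about what Perl 6 will specify. All variable names are hypothetical.

```python
import re

text = "\r\n"

# Alternation: the first alternative that succeeds wins, so order matters
# once the alternatives have different widths.
cr_first = re.match(r"\x0d|\x0d\x0a", text).group()
crlf_first = re.match(r"\x0d\x0a|\x0d", text).group()

# Fixed-width character class: member order has no effect on the match.
class_a = re.match(r"[17]", "7").group()
class_b = re.match(r"[71]", "7").group()
```

Reordering the alternation changes what is consumed from the same input, whereas reordering the members of a uniform-width class changes nothing.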



RE: atomicness and \n

2002-09-04 Thread Markus Laire

On 4 Sep 2002 at 0:22, Aaron Sherman wrote:

> On Wed, 2002-09-04 at 00:01, Sean O'Rourke wrote:
> 
> > None, I think.  Of course, if we ignore internals, there's no
> > difference between that and "rx /<roundascii> | 1 | 7/".
> 
> Then, why is there a C<+>? Why not make it C<|>?
> 
>   $foo = rx/ <||[cde]>|f /

Because it's good to have MTOWTDI. (= More than one way to do it)

-- 
Markus Laire 'malaire' <[EMAIL PROTECTED]>





RE: atomicness and \n

2002-09-04 Thread Aaron Sherman

On Wed, 2002-09-04 at 09:55, Markus Laire wrote:
> On 4 Sep 2002 at 0:22, Aaron Sherman wrote:
> 
> > On Wed, 2002-09-04 at 00:01, Sean O'Rourke wrote:
> > 
> > > None, I think.  Of course, if we ignore internals, there's no
> > > difference between that and "rx /<roundascii> | 1 | 7/".
> > 
> > Then, why is there a C<+>? Why not make it C<|>?
> > 
> > $foo = rx/ <||[cde]>|f /
> 
> Because it's good to have MTOWTDI. (= More than one way to do it)

But, there isn't. There's only one way to indicate character-class
unions, and that's C<+>. If we had C<+> and C<|> as synonyms, I'd be ok
with that, though I'd only tell people about C<|> to avoid the confusion
(mind if we call you Bruce?)





Re: atomicness and \n

2002-09-04 Thread Damian Conway

Jonathan Scott Duff wrote:

 > How can you be sure that <roundascii> is
> implemented as a character class instead of being some other arbitrary
> rule? An answer is that perl should know how these things are
> implemented and if you try arithmetic on something that's not a
> character class, it should carp appropriately. Another answer might be
> that <+[17]> is actually syntactically illegal and you MUST
> perform character class arithmetic as <[abc]+[def]>.
> 
> Somehow I prefer the former to the latter.

It will definitely be the former, since we have to support named character
classes like , , , etc.

Damian