Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-10-07 Thread Bart Lateur

On Sat, 7 Oct 2000 00:00:56 -0400, Bennett Todd wrote:

>this proposal is hammering out a little
>bit of irregularity, removing a subtle difference between the
>behavior of $ at the end and ^ at the beginning under /s. I offer
>this as another argument in favour of RFC 332.

That was the basic idea right from the start. You do put it nicely,
though.

-- 
Bart.



Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-10-06 Thread Bennett Todd

I started to write:

Is there a reason for introducing an asymmetry, or
should this proposal read "... and /^/ equivalent to
/\A/ ..."?

but then I re-re-read perlre(1) and realized that that is the
current behavior already: this proposal is hammering out a little
bit of irregularity, removing a subtle difference between the
behavior of $ at the end and ^ at the beginning under /s. I offer
this as another argument in favour of RFC 332.

-Bennett



Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-10-01 Thread Bart Lateur

On Thu, 28 Sep 2000 23:54:20 +0100, Hugo wrote:


>I still like the idea of $$, as I described it in the original thread.
>I've seen no comments for or against at this time.

I intend to put this in the RFC:

  Hugo prefers to add an alternative, like /$$/, which would behave like
  this. But an alternative already exists: /\z/. We don't want Yet
  Another Alternative. We want to fix /$/ so that it will Do The Right
  Thing.

>:=head2 Getting the old behaviour back
>:
>:You can't. Question is: do you really want to?
>:
>:=head1 MIGRATION
>:
>:It's not unlikely that currently having /$/ in your regexes, is actually
>:a bug in your script, but you don't care because the data won't ever
>:make it visible.

>This seems like a really bad idea. I think you have to assume people
>are feeding you the code they want to run.

I'll replace the relevant parts of the RFC with this:

=head2 '/\z/' and '/\Z/'

  /\z/ and /\Z/ will not be altered. They will still behave as before.

=head1 MIGRATION

  Replace /$/s with /\Z/s. The behaviour of /\Z/ will not be altered.
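
As a rough Perl 5 check of that migration rule, /$/s and /\Z/s can be
compared side by side. This is only a sketch: the sample strings and
variable names are invented for illustration and do not come from the RFC.

```perl
use strict;
use warnings;

# Illustrative inputs only; none of these come from the RFC.
my @samples = ("foo", "foo\n", "foo\n\n", "foo\nbar");

my @disagreements;
for my $s (@samples) {
    my $dollar = ($s =~ /foo$/s)  ? 1 : 0;   # Perl 5 meaning of /$/ under /s
    my $big_z  = ($s =~ /foo\Z/s) ? 1 : 0;   # the suggested replacement
    push @disagreements, $s if $dollar != $big_z;
}

print scalar(@disagreements), " disagreement(s)\n";
```

If the two forms really are interchangeable in Perl 5, the list of
disagreements stays empty for these inputs.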


>:=head2 '/ms': combined '/m' and '/s'
>:
>:'/ms' still works as before. Internally, '/m' has taken over the job of
>:matching before a newline at the end of the string, simply because /$/m
>:can match before a newline.
>
>Eh? Surely /$/ms would now only match _after_ the newline, or at end of
>string, whereas before it would match before _or_ after any newline, or
>at end of string?

My gut feeling tells me this just ain't right. /$/m is supposed to match
at the end of any line (or at the end of the string). The "end of line"
is before the newline, not after it.

Perl5 agrees with me:

$_ = "foo\nbar\nbaz\n";
# Does /$/m match right after an embedded newline?
if (/^bar\n$/m) {
    print "'/\$/m' matches following a newline\n";
} else {
    print "'/\$/m' does not match following just any newline\n";
}
# Does /$/m match right before an embedded newline?
if (/^bar$/m) {
    print "'/\$/m' matches just before any newline\n";
} else {
    print "'/\$/m' does not match before just any newline\n";
}

-->
'/$/m' does not match following just any newline
'/$/m' matches just before any newline

The meaning of the term "end of line" won't change because of the
'/s'. So, I expect that under '/ms', /$/ can match before any newline
thanks to the '/m', and '/s' will make /./ match any newline.
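
A minimal sketch of what Perl 5 does today under '/ms', which is the
behaviour I expect to keep. The sample string and the variable names are
made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data: two lines, each ending in a newline.
my $text = "foo\nbar\n";

# Under /ms, '$' still matches before an embedded newline (thanks to /m) ...
my $dollar_before_newline = ($text =~ /foo$/ms) ? 1 : 0;

# ... and '.' matches the newline itself (thanks to /s).
my $dot_matches_newline = ($text =~ /foo.bar/ms) ? 1 : 0;

print "dollar_before_newline=$dollar_before_newline\n";
print "dot_matches_newline=$dot_matches_newline\n";
```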

-- 
Bart.



Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-09-29 Thread Hugo

In <[EMAIL PROTECTED]>, Nathan Wiger writes:
:> Is $$ the only alternative, or did I miss more? I don't think I've even
:> seen this $$ mentioned before?
:
:$$ is not a suitable alternative. It already means the current process
:ID. It really cannot be messed with. And ${$} is identical to $$ by
:definition.

Well, not quite. First, writing $var as ${var} is the usual and common
way to disambiguate where there is a problem; second, $$ is rarely
used in a regexp pattern. We can easily migrate perl5 scripts by
translating $$ to ${$} throughout. There is no problem here.

Hugo



Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-09-28 Thread Nathan Wiger

> Is $$ the only alternative, or did I miss more? I don't think I've even
> seen this $$ mentioned before?

$$ is not a suitable alternative. It already means the current process
ID. It really cannot be messed with. And ${$} is identical to $$ by
definition.

> >I still like the idea of $$, as I described it in the original thread.
> >I've seen no comments for or against at this time.

See above.

> I can't see how yet another alternative, /$$/, is any better than what
> we have now: /\z/.

I agree. If it's more alternatives we're after, just have the person
write a custom regex. The idea is to make Perl do the right thing,
whatever that may be.

The big problem with changing $, as you note, is for people that need to
catch multiple instances in a string:

   $string = "Hello\nGoodbye\nHello\nHello\n";
   $string =~ s/Hello$/Goodbye/gm;

Without $, you can work around this like so:

   $string =~ s/Hello\n/Goodbye\n/gm;
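
For what it's worth, here is a small sketch comparing the two
substitutions on that same string under current Perl 5. The variable
names are mine, and the workaround only covers lines that actually end
in a newline.

```perl
use strict;
use warnings;

my $original = "Hello\nGoodbye\nHello\nHello\n";

# Current Perl 5 idiom: '$' under /m matches before every newline.
(my $with_dollar = $original) =~ s/Hello$/Goodbye/gm;

# The workaround: spell the newline out explicitly.
(my $with_newline = $original) =~ s/Hello\n/Goodbye\n/g;

print $with_dollar eq $with_newline ? "same result\n" : "different result\n";
```

A final "Hello" with no newline after it would be replaced by the first
form but missed by the second.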

My suggestion would be:

   1. Make $ exactly always match just before the last \n, as the
  RFC suggests.

   2. Introduce some new \X switch that does what $ does
  currently if it's deemed necessary.

We're back to new alternatives again, but the one thing this buys you is
a $ that works consistently. I don't think many people need $'s current
functionality, and those that do can have a new \X.

-Nate



Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-09-28 Thread Hugo

In <[EMAIL PROTECTED]>, Bart Lateur writes:
:I'll try to find that "thread" back.

This was my message:

  http://www.mail-archive.com/perl6-language-regex%40perl.org/msg00354.html

:>I don't think changing /s is the right solution. I think this will
:>incline people to try and fix their problems by adding /s, without
:>realising that this changes the definition of every . in their
:>regexp as well.
:
:Perhaps. I do think that, in general, textual data falls into one of
:three categories:
:
: * text with possibly embedded newlines
: * text with no embedded newlines
: * text with an irrelevant newline at the very end.
:
:The '/s' option is for the 1st case. No '/s' for the 3rd. As for #2: you
:don't care.

I'd distinguish the first case further into 'the newlines are
significant' or not - /s is often desired for the first case,
and /m often for the second. And then I'd be tempted to repeat
the whole list, replacing 'newline' with 'record separator'.

I have to say I'm quite prejudiced against /s - I consider myself
reasonably knowledgeable about regexps, but on average about once
a month I find myself unsure enough about which is /m and which
is /s that I need to check the top of perlre to be sure. I think
we've appreciated for some time that it was a mistake to name them
as if they were opposites, but if anything I'd like to reduce the
need for them rather than to increase it.

Hugo



Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-09-28 Thread Bart Lateur

On Thu, 28 Sep 2000 23:54:20 +0100, Hugo wrote:

>We thought of a few other possibilities too. I think it is a shame you
>did not mention them, and explain why your proposal is better.

Let me think on it.

Is $$ the only alternative, or did I miss more? I don't think I've even
seen this $$ mentioned before?

>I still like the idea of $$, as I described it in the original thread.
>I've seen no comments for or against at this time. 

I'll try to find that "thread" back.

>>Perhaps '$$' to mean match at end of string (without /m) or at end
>>of any line (with /m)? The p52p6 translator can easily replace
>>references to $$ with ${$}.

I can't see how yet another alternative, /$$/, is any better than what
we have now: /\z/.

>:=head2 '/ms': combined '/m' and '/s'
>:
>:'/ms' still works as before. Internally, '/m' has taken over the job of
>:matching before a newline at the end of the string, simply because /$/m
>:can match before a newline.
>
>Eh? Surely /$/ms would now only match _after_ the newline, or at end of
>string, whereas before it would match before _or_ after any newline, or
>at end of string?

Oh damned, you're probably right. This makes me wonder if this is doing
the right thing...

>This seems like a really bad idea. I think you have to assume people
>are feeding you the code they want to run. At worst you should
>generate a warning, but I think it is evil not to migrate things
>properly.

Well... there's a simple solution: replace /$/ with /\Z/. That one would
remain the same. Wouldn't it? I'll surely add that.

>I don't think changing /s is the right solution. I think this will
>incline people to try and fix their problems by adding /s, without
>realising that this changes the definition of every . in their
>regexp as well.

Perhaps. I do think that, in general, textual data falls into one of
three categories:

 * text with possibly embedded newlines
 * text with no embedded newlines
 * text with an irrelevant newline at the very end.

The '/s' option is for the 1st case. No '/s' for the 3rd. As for #2: you
don't care.

-- 
Bart.



Re: RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-09-28 Thread Hugo

In <[EMAIL PROTECTED]>, Perl6 RFC Librarian writes:
:Originally, we had thought of adding Yet Another Regex Modifier; but to
:be honest, having 2 modifiers just for the newline is already confusing
:enough, for too many people. A third is definitely out.

We thought of a few other possibilities too. I think it is a shame you
did not mention them, and explain why your proposal is better.

I still like the idea of $$, as I described it in the original thread.
I've seen no comments for or against at this time. To recap:

>Perhaps '$$' to mean match at end of string (without /m) or at end
>of any line (with /m)? The p52p6 translator can easily replace
>references to $$ with ${$}.

:=head2 The $* variable
:
:'/s' and '/m' also have a lesser known side effect: they both override
:the setting of the $* special variable, which controls multiline related
:behaviour in regexes.
:
:Use of this special variable has already been deprecated at least since
:Perl5 first came out, more than 5 years ago. It is a very good candidate
:to be removed from Perl6 altogether, which would result in fewer
:gotchas in the language. That is a Good Thing.

Has there not been an RFC to remove this yet? If not I'll write one.
(Or if someone else has more spare time on their hands and wants to
do it, please let me know.)

:=head2 Getting the old behaviour back
:
:You can't. Question is: do you really want to?
:
:=over 2
:
:=item *
:
:If you know your data can contain newlines, and you want to treat them
:as ordinary characters, you probably don't want to make an exception for
:a trailing newline, anyway.

So you _can_ recreate the original behaviour. Why did you just say you
can't?

:=head2 '/ms': combined '/m' and '/s'
:
:'/ms' still works as before. Internally, '/m' has taken over the job of
:matching before a newline at the end of the string, simply because /$/m
:can match before a newline.

Eh? Surely /$/ms would now only match _after_ the newline, or at end of
string, whereas before it would match before _or_ after any newline, or
at end of string?

:=head1 MIGRATION
:
:It's not unlikely that currently having /$/ in your regexes, is actually
:a bug in your script, but you don't care because the data won't ever
:make it visible.
:
:Therefore, I think it is not desirable to have the Perl5 To Perl6
:converter actually change your source code. A warning if /$/ is found in
:combination with a bare '/s' modifier, not combined with '/m', is
:probably all that is wanted.

This seems like a really bad idea. I think you have to assume people
are feeding you the code they want to run. At worst you should
generate a warning, but I think it is evil not to migrate things
properly.

I don't think changing /s is the right solution. I think this will
incline people to try and fix their problems by adding /s, without
realising that this changes the definition of every . in their
regexp as well. I like the idea of $$ better - this is a natural
and obvious extension to $, which adds a new capability without
messing with any existing capability. Furthermore people who find
that they have a problem in their existing regexp because $ does
not mean what they thought will not set themselves up for new and
different problems when they apply the obvious one-byte fix.

Hugo



RFC 332 (v1) Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

2000-09-28 Thread Perl6 RFC Librarian

This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

=head1 VERSION

  Maintainer: Bart Lateur <[EMAIL PROTECTED]>
  Date: 28 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 332
  Version: 1
  Status: Developing

=head1 ABSTRACT

To most Perlers, /$/ in a regex simply means "end of string". This is
only right if you're absolutely sure your string doesn't end in a
newline, as is commonly the case in a large part of all textual data:
ordinary strings don't contain newlines. Lines coming from text files
can only contain a newline as the very last character. The '/s' modifier
is usually only used in combination with the former class of textual
data.

However, this situation is basically a bug hole.

This RFC proposes to change the '/s' modifier so that under '/s', /$/
will only match at the very end of a string, and not also before a
newline at the end of the string.

=head1 DESCRIPTION

To most Perl programmers, /^foo$/ is a regex that can only match the
string "foo". It's not, actually: it can match "foo\n", too. This
assumption is usually safe, because people know the kind of data they're
dealing with, and they "know" that it won't ever end in a newline.

However, this basically is a chance for bugs to creep in, if for some
reason this assumption about the data no longer holds.
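
A minimal sketch of the pitfall under current Perl 5 semantics. The
helper name matches_foo and the sample inputs are illustrative only.

```perl
use strict;
use warnings;

# Returns 1 if $s matches /^foo$/ as Perl 5 interprets it, else 0.
sub matches_foo {
    my ($s) = @_;
    return $s =~ /^foo$/ ? 1 : 0;
}

# Sample inputs chosen for illustration only.
for my $s ("foo", "foo\n", "foo\n\n", "foobar") {
    (my $shown = $s) =~ s/\n/\\n/g;    # make newlines visible when printing
    printf "%-10s %s\n", "\"$shown\"", matches_foo($s) ? "matches" : "no match";
}
```

The second input is the surprising one: a single trailing newline does
not stop the match.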

To make matters worse, Perl doesn't even have a mechanism to prevent the
regex engine from matching /$/ just before the last character if it's
a newline.

Originally, we had thought of adding Yet Another Regex Modifier; but to
be honest, having 2 modifiers just for the newline is already confusing
enough, for too many people. A third is definitely out.

Therefore, the proposal is instead to modify the behaviour of the '/s'
modifier.

Under '/s':

=over 2

=item *

/./ can match any character, including newline;

=item *

/$/ can match only at the very end of the string, not also in front of a
last character, if it happens to be a newline.

=back

This seems simple enough.

=head1 CONSIDERATIONS

=head2 Mnemonic value of '/s'

'/s' originally stood for "single line". This can no longer be true, so
the mnemonic value of the "s" is thereby reduced to zero.

However, the mnemonic value wasn't that great to begin with, especially
if you consider that combining '/s' and '/m' is not only possible, but a
useful option, too. How can a string both be a single line and
multiline, at the same time?

So, to most Perl programmers, '/s' simply stands for

=over 2

=item

let /./ match a newline too

=back

which now gets turned into:

=over 2

=item

treat "\n" as an ordinary character

=back

The change isn't that big, so it is just as easy to remember. Or not.

=head2 The $* variable

'/s' and '/m' also have a lesser known side effect: they both override
the setting of the $* special variable, which controls multiline related
behaviour in regexes.

Use of this special variable has already been deprecated at least since
Perl5 first came out, more than 5 years ago. It is a very good candidate
to be removed from Perl6 altogether, which would result in fewer
gotchas in the language. That is a Good Thing.

Perlvar says:

Use of `$*' is deprecated in modern Perl, supplanted by the `/s'
and `/m' modifiers on pattern matching.


Therefore, any change in the behaviour of '/s' with regard to $* can
hardly be considered relevant any more.

=head2 Getting the old behaviour back

You can't. Question is: do you really want to?

=over 2

=item *

If you know your data can contain newlines, and you want to treat them
as ordinary characters, you probably don't want to make an exception for
a trailing newline, anyway.

=item *

If you still want to ignore a trailing newline in the regex, you can
either adjust your regex so that it contains /\n?$/ or something like
it, instead of plain /$/; or you can chomp() your data, before doing the
match.

=item * 

And finally, there's still the option for simply not using '/s', and all
things will remain as they were before.  ;-)

=back
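
A rough sketch of the two workarounds from the list above, written for
current Perl 5. It assumes that /\z/ can stand in for the proposed
meaning of /$/ under '/s'; the sub names are hypothetical.

```perl
use strict;
use warnings;

# Workaround 1: allow one optional newline in the regex itself.
# \z stands in for the proposed meaning of '$' under /s.
sub optional_newline {
    my ($s) = @_;
    return $s =~ /^foo\n?\z/s ? 1 : 0;
}

# Workaround 2: chomp a copy of the data before matching.
sub chomp_first {
    my ($s) = @_;
    chomp $s;    # strips one trailing newline (with the default $/)
    return $s =~ /^foo\z/s ? 1 : 0;
}

for my $s ("foo", "foo\n", "foo\n\n") {
    print optional_newline($s), " ", chomp_first($s), "\n";
}
```

Both workarounds accept a single trailing newline and reject a second
one, which is what plain /$/ does in Perl 5.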

=head2 '/ms': combined '/m' and '/s'

'/ms' still works as before. Internally, '/m' has taken over the job of
matching before a newline at the end of the string, simply because /$/m
can match before a newline.

=head1 MIGRATION

It's not unlikely that currently having /$/ in your regexes, is actually
a bug in your script, but you don't care because the data won't ever
make it visible.

Therefore, I think it is not desirable to have the Perl5 To Perl6
converter actually change your source code. A warning if /$/ is found in
combination with a bare '/s' modifier, not combined with '/m', is
probably all that is wanted.

=head1 IMPLEMENTATION

Under '/s', make '$' behave as /\z/ does now.
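
To illustrate what would change, here is a small sketch of how the three
anchors behave in Perl 5 today on a string with one trailing newline.
The input and the %matches hash are illustrative only.

```perl
use strict;
use warnings;

my $s = "foo\n";    # illustrative input with one trailing newline

my %matches = (
    '$'  => ($s =~ /foo$/s)  ? 1 : 0,   # today: also matches before the final newline
    '\Z' => ($s =~ /foo\Z/s) ? 1 : 0,   # same as '$' without /m
    '\z' => ($s =~ /foo\z/s) ? 1 : 0,   # only the very end of the string
);

printf "%-3s %s\n", $_, $matches{$_} ? "matches" : "no match" for sort keys %matches;
```

Under this proposal, the '$' row would report the same result as the
'\z' row whenever '/s' is given without '/m'.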

=head1 REFERENCES

perlre, about '/s' and '/m'

perlvar, section about $*