Re: problems with case insensitive tr/// regexp

2003-11-28 Thread R. Joseph Newton
Daniel Staal wrote:
...

Cool!  Thanks, Daniel, that is very nice work.  I could feel myself going
back over those first steps in using regexes as I followed your post.

Joseph


-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: problems with case insensitive tr/// regexp

2003-11-28 Thread Daniel Staal
--As of Friday, November 28, 2003 1:08 PM -0800, R. Joseph Newton is 
alleged to have said:

s[\</?font.*?\>][]gsi

Cool!  Thanks, Daniel, that is very nice work.  I could feel myself
going back over those first steps in using regexes as I followed
your post.
--As for the rest, it is mine.

Heh, thanks.  I'm still on the first steps, most of the time...

Quick bonus question (which I won't be around to answer, most likely; 
I'm going to be offline for the next few days): Find me a valid HTML 
snippet that the above matches but it (probably) shouldn't.  I can 
think of at least one case...
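
For anyone who wants to experiment, here is a minimal sketch of two candidate answers. These are guesses, not necessarily the case the author had in mind, and the sample strings and variable names are invented for illustration:

```perl
use strict;
use warnings;

# Candidate 1 (assumed example): a '>' inside a quoted attribute value.
# The non-greedy '.*?' stops at that '>' and leaves half the tag behind.
my $attr = '<font face="a>b">x</font>';
(my $attr_out = $attr) =~ s[</?font.*?>][]gsi;
print "$attr_out\n";

# Candidate 2 (assumed example): a font tag mentioned inside a comment.
# The pattern knows nothing about comments, so it edits the comment text.
my $comment = '<!-- <font> -->text';
(my $comment_out = $comment) =~ s[</?font.*?>][]gsi;
print "$comment_out\n";
```

In the first case the leftover is b">x rather than x; in the second the comment is altered even though it contains no real element.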

Daniel T. Staal

---
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---


problems with case insensitive tr/// regexp

2003-11-27 Thread Dan Anderson

I'm trying to create a script to remove all font tags from an
HTML document.  I created a regular expression like this:

,[ working code
| use strict;
| use warnings;
| my $foo = "font whe";
| $foo =~ tr/\<.*font.*\>//d;
| print $foo, "\n";
`---

But, in order to remove font tags from documents where the writers
liked to use uppercase (or camel case), I want to make the search case
insensitive.  So I added an i, as I would when matching font tags with
m/\<.*font.*\>/i.  So I had:

,[ erroneous code
| use strict;
| use warnings;
| my $foo = "font whe";
| $foo =~ tr/\<.*font.*\>//di;
| print $foo, "\n";
`---

This code produces the error:

,[ the error
| Bareword found where operator expected at - line 4, near
| "tr/\<.*font.*\>//di"
| syntax error at - line 4, near "tr/\<.*font.*\>//di"
`--

So what am I doing wrong, and how do I make a case insensitive
tr/// regexp?

Thanks for your help,

Dan







Re: problems with case insensitive tr/// regexp

2003-11-27 Thread Daniel Staal
--As of Thursday, November 27, 2003 7:42 PM -0500, Dan Anderson is 
alleged to have said:

So what am I doing wrong, and how do I make a case
insensitive tr/// regexp?
Thanks for your help,
--As for the rest, it is mine.

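You can't make a case insensitive tr/// regexp: tr/// doesn't do 
regexps at all.  It does transliteration: it replaces each character 
listed in the first part with the corresponding one in the second 
part (or deletes it, if you use the 'd' modifier and there is no 
corresponding character).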

You want the s/// operator:
s/\<font.*\>//i
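
To make the difference concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration; they are not from the original post:

```perl
use strict;
use warnings;

# Hypothetical sample input, invented for this illustration.
my $html = '<FONT size="2">front text';

# tr/// treats its first part as a list of characters, not a pattern.
# With /d and an empty second part, each listed character is deleted
# wherever it occurs, so ordinary text gets mangled and the tag survives.
(my $translit = $html) =~ tr/font//d;
print "$translit\n";

# s/// takes a real regular expression, so the /i modifier is available
# and the uppercase tag is matched as a unit.
(my $subst = $html) =~ s/<font.*>//i;
print "$subst\n";
```

The tr/// line leaves the tag alone and eats letters out of "front text", while the s/// line removes the tag and nothing else.
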
Daniel T. Staal



Re: problems with case insensitive tr/// regexp

2003-11-27 Thread Daniel Staal
--As of Thursday, November 27, 2003 7:05 PM -0600, Perl Newbies is 
alleged to have said:

You want the s/// operator:
s/\<font.*\>//i
--As for the rest, it is mine.

Hold on one moment before you shoot yourself in the foot with that 
loaded gun I just gave you...

You definitely need the s/// operator (unless you can use one of the 
HTML parsing modules).  But let's fix that regexp first, shall we?

First off, you may have noticed I removed the first '.*' from your 
regexp: that's because nothing is allowed between the opening '<' 
and the name of the element.  Unless, of course, it is a closing tag, 
in which case you have a '/' in there.  So, that would be:
s/\<\/?font.*\>//i

Just a moment, that's ugly.  Substitution allows different delimiters, 
so let's use something else.  I'll use '[' and ']'.  Rewritten, that 
is:
s[\</?font.*\>][]i
(Note that we've dropped the escape on the slash: it is no longer 
needed.)
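
A quick sketch to check that the two spellings behave the same. The input string and variable names are invented for this illustration:

```perl
use strict;
use warnings;

# Hypothetical one-tag input, invented for this illustration.
my $with_slashes  = 'a</font>b';
my $with_brackets = 'a</font>b';

# With '/' as the delimiter, the literal slash in the pattern must be escaped.
$with_slashes =~ s/<\/?font.*>//i;

# With '[' and ']' as delimiters, the slash is an ordinary character.
$with_brackets =~ s[</?font.*>][]i;

print "$with_slashes\n";
print "$with_brackets\n";
```

Both variables end up holding the same string, since only the delimiters differ.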

Ok, let's try that.  Yikes!!!  It matches _everything_ after the 
first font tag!!  Um, that greedy '.*' needs to be fixed, to stop as 
soon as it can instead of matching as much as it can.  We do that by 
adding a '?' after it:
s[\</?font.*?\>][]i
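
Here is a small sketch of the greedy and non-greedy versions side by side. The sample string and variable names are assumptions made for the illustration:

```perl
use strict;
use warnings;

# Hypothetical input with other markup after the font tag.
my $text = '<font size="2">Hello</font> <b>world</b>';

# Greedy: '.*' runs to the end of the string, then backs up only as far
# as the LAST '>', so everything from the first font tag onward is removed.
(my $greedy = $text) =~ s[</?font.*>][]i;

# Non-greedy: '.*?' stops at the FIRST '>' it can, so only the opening
# font tag is removed.
(my $lazy = $text) =~ s[</?font.*?>][]i;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```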

There, that's better.  Oh, but there is one other problem:  by default 
'.' does not match a newline, so '.*?' stops there.  That may sound 
fine, but a newline is legal inside an HTML element tag...  We change 
this by adding the 's' modifier alongside 'i':
s[\</?font.*?\>][]si
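
A minimal sketch of the effect of 's', using an invented input whose tag is split across two lines (the variable names are also made up):

```perl
use strict;
use warnings;

# Hypothetical input: a newline sits inside the opening tag.
my $text = "<font\n size=\"2\">Hello";

# Without /s, '.' cannot cross the newline, so the pattern never reaches
# the closing '>' and the string is left unchanged.
(my $without_s = $text) =~ s[</?font.*?>][]i;

# With /s, '.' matches the newline too, and the whole tag is removed.
(my $with_s = $text) =~ s[</?font.*?>][]si;

print "without /s: [$without_s]\n";
print "with /s:    [$with_s]\n";
```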

That should work.  Of course, it only changes the first font tag it 
finds...  To fix that we need another modifier: 'g'.  So the final 
pattern is:
s[\</?font.*?\>][]gsi
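
Putting the pieces together, a sketch of the final pattern at work. The sample document and variable names are invented, and this only shows the happy path:

```perl
use strict;
use warnings;

# Hypothetical input: mixed-case tags, one of them split across lines.
my $html = "<FONT\n face=\"Arial\">Hello</FONT> <Font color=\"red\">world</font>!";

# Without /g only the first font tag is removed.
(my $first_only = $html) =~ s[</?font.*?>][]si;

# With /g every opening and closing font tag is removed.
(my $all = $html) =~ s[</?font.*?>][]gsi;

print "first only: $first_only\n";
print "all:        $all\n";
```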

I think that covers everything...  And it is a quick lesson in why 
we usually tell people not to try matching HTML with regexps.

Daniel T. Staal
