regexp

R. Joseph Newton Fri, 28 Nov 2003 14:09:35 -0800

Daniel Staal wrote:
...

> You definitely need the s/// operator (unless you can use one of the
> HTML parsing modules).  But let's fix that regexp first, shall we?
>
> First off, you may have noticed I removed the first '.*' from your
> regexp: that's because nothing is allowed between the opening '<'
> and the name of the element.  Unless, of course, it is a closing tag,
> in which case you have a '/' in there.  So, that would be:
> s/\<\/?font.*\>//i
>
> Just a moment, that's ugly.  Substitution allows different delimiters,
> so let's use something else.  I'll use '[' and ']'.  Rewritten, that
> is:
> s[\</?font.*\>][]i
> (Note that we've dropped the escape on the slash: it is no longer
> needed.)
>
> Ok, let's try that.  Yikes!!!  It matches _everything_ after the
> first font tag!!  Um, that greedy '.*' needs to be fixed, to stop as
> soon as it can instead of matching as much as it can.  We do that by
> adding a '?' after it:
> s[\</?font.*?\>][]i
>
> There, that's better.  Oh, but there is one other problem:  '.' does
> not match a newline by default, so '.*?' stops at one.  That may sound
> fine, but a newline is legal inside an HTML element tag...  We change
> this by adding an 's' modifier alongside the 'i':
> s[\</?font.*?\>][]si
>
> That should work.  Of course, it only changes the first font tag it
> finds...  To fix that we need another modifier: 'g'.  So the final
> pattern is:
> s[\</?font.*?\>][]gsi
>
> I think that covers everything...  And it is a quick lesson in why
> we usually tell people not to try matching HTML with regexps.
>
> Daniel T. Staal
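
For anyone who wants to watch each of those steps happen, here is a minimal sketch that runs the patterns from the quoted post against two small strings. The sample strings and variable names are made up for illustration, and the /r modifier (return the result instead of modifying in place, Perl 5.14 or later) is only there to keep the inputs intact between steps; it is not part of the original patterns.

```perl
use strict;
use warnings;

# Hypothetical sample inputs, not from the original thread.
my $line  = 'A <font size="2">B</font> C';      # everything on one line
my $multi = "A <font\nsize=\"2\">B</font> C";   # opening tag split by a newline

# Greedy '.*' runs to the last '>' on the line, swallowing "B</font>" too.
my $greedy = $line =~ s[\</?font.*\>][]ir;

# Non-greedy '.*?' stops at the first '>', so only the opening tag goes.
my $lazy = $line =~ s[\</?font.*?\>][]ir;

# Without /s, '.' cannot cross the newline, so the split opening tag is
# skipped and the first successful match is the closing tag.
my $no_s = $multi =~ s[\</?font.*?\>][]ir;

# With /s, '.' matches the newline and the split opening tag is removed.
my $with_s = $multi =~ s[\</?font.*?\>][]sir;

# With /g as well, every font tag is removed, not just the first.
my $all = $multi =~ s[\</?font.*?\>][]gsir;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
print "no /s:  [$no_s]\n";
print "/s:     [$with_s]\n";
print "/gsi:   [$all]\n";
```

Only the last line of output is the fully cleaned string; the others show the intermediate behaviour described at each step.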


Cool!  Thanks, Daniel, that is very nice work.  I could feel myself going
back over those first steps in using regexes as I followed your post.

Joseph
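
As a footnote to the closing remark about not matching HTML with regexps, here is a small sketch of two inputs where the final pattern from the quoted post does something other than what was intended. Both inputs are hypothetical, "fontana" is an invented tag name used purely for illustration, and /r again assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical input: a '>' inside a quoted attribute value.
my $attr = '<font title="a>b">text</font>';

# The non-greedy match ends at the '>' inside the attribute value,
# so part of the opening tag is left behind.
my $attr_out = $attr =~ s[\</?font.*?\>][]gsir;

# Hypothetical input: an invented element whose name merely starts with "font".
my $other = '<fontana>text</fontana>';

# Nothing in the pattern requires the tag name to end after "font",
# so these tags are removed as well.
my $other_out = $other =~ s[\</?font.*?\>][]gsir;

print "attribute case: [$attr_out]\n";
print "prefix case:    [$other_out]\n";
```

A real HTML parsing module handles quoting and tag names properly, which is why it is the usual recommendation.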



Re: problems with case insensitive tr/// regexp
