Re: problems with case insensitive tr/// regexp
Daniel Staal wrote:

> [...]
> So the final pattern is:
>
>     s[\</?font.*?\>][]gsi
>
> I think that covers everything... And it is a quick lesson in why we
> usually tell people not to try matching HTML with regexps.
>
> Daniel T. Staal

Cool! Thanks, Daniel, that is very nice work. I could feel myself going back over those first steps in using regexes as I followed your post.

Joseph

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: problems with case insensitive tr/// regexp
--As of Friday, November 28, 2003 1:08 PM -0800, R. Joseph Newton is alleged to have said:

>     s[\</?font.*?\>][]gsi
>
> Cool! Thanks, Daniel, that is very nice work. I could feel myself
> going back over those first steps in using regexes as I followed
> your post.

--As for the rest, it is mine.

Heh, thanks. I'm still on the first steps, most of the time...

Quick bonus question (which I most likely won't be around to answer; I'm going to be offline for the next few days): find me a valid HTML snippet that the above pattern matches but (probably) shouldn't. I can think of at least one case...

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------
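One candidate answer to the bonus question, offered as a sketch and not necessarily the case Daniel had in mind: a font tag that appears inside an HTML comment. The comment is valid HTML, the pattern cannot tell it apart from live markup, and so it edits the comment's contents. The sample string below is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical snippet: a font tag mentioned inside an HTML comment.
my $html = '<!-- old markup: <font size="2"> --><p>text</p>';

# The final pattern from the thread, applied unchanged.
(my $stripped = $html) =~ s[\</?font.*?\>][]gsi;

# The text inside the comment has been altered, which is probably not wanted.
print "$stripped\n";
```

A pattern has no notion of comments, scripts or quoted attribute values, which is why an HTML parsing module is usually the safer choice.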
problems with case insensitive tr/// regexp
I'm trying to create a script to remove all font tags from an HTML document. I created a regular expression like this:

,----[ working code ]
| use strict;
| use warnings;
| my $foo = "font whe";
| $foo =~ tr/\<.*font.*\>//d;
| print $foo, "\n";
`----

But, in order to remove tags from documents where the writers liked to use uppercase (or camel case), I want to make the search case insensitive. So I added an i, as I would when matching font tags with m/\<.*font.*\>/i. So I had:

,----[ erroneous code ]
| use strict;
| use warnings;
| my $foo = "font whe";
| $foo =~ tr/\<.*font.*\>//di;
| print $foo, "\n";
`----

This code produces the error:

,----[ the error ]
| Bareword found where operator expected at - line 4, near "tr/\<.*font.*\>//di"
| syntax error at - line 4, near "tr/\<.*font.*\>//di"
`----

So what am I doing wrong, and how do I make a case insensitive tr/// regexp?

Thanks for your help,

Dan
Re: problems with case insensitive tr/// regexp
--As of Thursday, November 27, 2003 7:42 PM -0500, Dan Anderson is alleged to have said:

> So what am I doing wrong and how do I make a case insensitive tr///
> regexp?
>
> Thanks for your help,

--As for the rest, it is mine.

You can't make a case insensitive tr/// regexp: tr/// doesn't do regexps. It does transliteration: it replaces each character in the first part with the corresponding character in the second part. You want the s/// operator:

    s/\<font.*\>//i

Daniel T. Staal
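To make the difference concrete, here is a minimal sketch comparing the two operators on a hypothetical sample string (the string and variable names are illustrative, not from the original post). It uses a deliberately simple pattern, not the one from the reply above, so that only the tr/// versus s/// behaviour is on display.

```perl
use strict;
use warnings;

# Hypothetical sample string, not from the original post.
my $html = '<FONT>often</FONT>';

# tr/// treats its first part as a list of characters, not a pattern.
# With /d it deletes every lowercase f, o, n and t, wherever they occur.
(my $tr_result = $html) =~ tr/font//d;

# s/// takes a regexp and accepts /i, so it can remove the tag as a unit.
(my $s_result = $html) =~ s/<font>//i;

print "tr: $tr_result\n";
print "s:  $s_result\n";
```

The tr/// call leaves the uppercase tag alone and mangles the word "often", while s/// removes the opening tag regardless of case.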
Re: problems with case insensitive tr/// regexp
--As of Thursday, November 27, 2003 7:05 PM -0600, Perl Newbies is alleged to have said:

> --As of Thursday, November 27, 2003 7:42 PM -0500, Dan Anderson is
> alleged to have said:
>
>> So what am I doing wrong and how do I make a case insensitive tr///
>> regexp?
>>
>> Thanks for your help,
>
> --As for the rest, it is mine.
>
> You can't make a case insensitive tr/// regexp: tr/// doesn't do
> regexp. It does transliteration: it replaces the characters in the
> first part with the respective ones in the second part.
>
> You want the s/// operator:
>
>     s/\<font.*\>//i

--As for the rest, it is mine.

Hold on one moment before you shoot yourself in the foot with that loaded gun I just gave you...

You definitely need the s/// operator (unless you can use one of the HTML parsing modules). But let's fix that regexp first, shall we?

First off, you may have noticed I removed the first '.*' from your regexp: that's because nothing is allowed between the opening '<' and the name of the element. Unless, of course, it is a closing tag, in which case you have a '/' in there. So, that would be:

    s/\<\/?font.*\>//i

Just a moment, that's ugly. Substitution allows different delimiters, so let's use something else. I'll use '[' and ']'. Rewritten, that is:

    s[\</?font.*\>][]i

(Note that we've dropped the escape on the slash: it is no longer needed.)

Ok, let's try that. Yikes!!! It matches _everything_ after the first font tag!! Um, that greedy '.*' needs to be fixed, to stop as soon as it can instead of matching as much as it can. We do that by adding a '?' after it:

    s[\</?font.*?\>][]i

There, that's better. Oh, but there is one other problem: by default '.' does not match a newline, so '.*?' stops at one. That may sound fine, but a newline is legal inside an HTML element tag... We change this by adding the 's' modifier alongside 'i':

    s[\</?font.*?\>][]si

That should work. Of course, it only changes the first font tag it finds... To fix that we need another modifier: 'g'. So the final pattern is:

    s[\</?font.*?\>][]gsi

I think that covers everything... And it is a quick lesson in why we usually tell people not to try matching HTML with regexps.

Daniel T. Staal
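The steps above can be replayed on a small input to see what each change buys. This is a sketch under stated assumptions: the sample string, the step labels and the %out hash are hypothetical, chosen so that every step produces a different result; only the four patterns come from the message.

```perl
use strict;
use warnings;

# Hypothetical sample input (not from the thread): an uppercase tag with a
# newline inside it, followed by a lowercase font element on the same line.
my $html = qq{<FONT\nsize="2">one</FONT> and <font color="red">two</font> end};

# Each step applies one of the patterns from the message to its argument.
my @steps = (
    [ 'greedy, /i only' => sub { $_[0] =~ s[\</?font.*\>][]i    } ],
    [ 'non-greedy'      => sub { $_[0] =~ s[\</?font.*?\>][]i   } ],
    [ 'with /s'         => sub { $_[0] =~ s[\</?font.*?\>][]si  } ],
    [ 'with /g'         => sub { $_[0] =~ s[\</?font.*?\>][]gsi } ],
);

my %out;
for my $step (@steps) {
    my ($label, $code) = @$step;
    my $copy = $html;            # work on a fresh copy for every step
    $code->($copy);              # $_[0] is an alias, so $copy is modified
    $out{$label} = $copy;
    (my $shown = $copy) =~ s/\n/\\n/g;   # make the newline visible when printing
    printf "%-16s %s\n", $label, $shown;
}
```

On this input the greedy version swallows the text between two tags, the non-greedy version removes a single closing tag, adding /s lets the pattern reach the tag that contains a newline, and adding /g removes all four tags.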