How about using HTML::Parser? It may be overkill (for what you want, a
regex might be fine), but it lets you be very flexible with the HTML, and
you don't have to worry about things like nested <script> tags the way
you would with a regex.

HTML::Parser code can also easily be tweaked to do a lot of other
HTML-related stuff. Via a cron job, I parse bostonphoenix.com and email
myself movie listings and reviews each week for my local movie theaters. I
parse harvard.edu and email myself Harvard's events each morning.
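
For comparison, here is roughly what the regex route could look like. This is only a sketch: replace_js_regex is a name made up for illustration, and it assumes the blocks are not nested and that no </script> or > appears inside a JavaScript string or an attribute value, which is exactly where a regex gets fooled. It replaces each run of adjacent <script>/<noscript> blocks with one copy of the replacement text.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every run of adjacent <script>/<noscript>
# blocks in $html with a single copy of $replacement.
sub replace_js_regex {
    my ($html, $replacement) = @_;

    # One block: opening tag, shortest possible body, matching close tag.
    # /s lets . match newlines, /i ignores case, /x allows the layout.
    my $block = qr{
        <script\b[^>]*>   .*? </script\s*>
      | <noscript\b[^>]*> .*? </noscript\s*>
    }isx;

    # A run is one block followed by any number of further blocks that
    # are separated from it only by whitespace.
    $html =~ s/$block(?:\s*$block)*/$replacement/g;
    return $html;
}
```

The non-greedy .*? together with /s is what makes the multiline match stop at the first closing tag instead of the last one in the file.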
I've written some sample HTML::Parser code below (also see perldoc
HTML::Parser). You would use it like this:

use GetRidOfJavaScript;
my $html_with_image_instead_of_javascript =
    GetRidOfJavaScript::get_rid_of_js('http://www.vendor.com/index.html',
                                      '<IMG src="tmp.gif">');

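# GetRidOfJavaScript.pm
package GetRidOfJavaScript;
use HTML::Parser;
use LWP::Simple;
@ISA = qw(HTML::Parser);

# object creation wrapper method
sub get_rid_of_js {
    # $js_replacement_text stays a package global because text() reads it
    (my $url, $js_replacement_text) = @_;
    my $html = get($url);
    # get() returns undef on failure and does not set $!
    die "Couldn't get html from $url\n" unless defined $html;
    # init global vars
    $keep_text = $in_javascript = "";
    my $p = GetRidOfJavaScript->new(); # inherited from HTML::Parser
    # parses HTML and calls the start, end, text, comment and
    # declaration methods
    $p->parse($html);
    $p->eof; # flush any text the parser is still buffering
    return $keep_text;
}
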
# overrides the start method, called when a start tag is found
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    if ($tag eq 'script' or $tag eq 'noscript') {
        $in_javascript = 1;
    }
    elsif (!$in_javascript) {
        # skip tags nested inside <noscript>, such as the vendor's <a>/<img>
        $keep_text .= $origtext;
    }
}

# overrides the end method, called when an end tag is found
sub end {
    my ($self, $tag, $origtext) = @_;
    if ($tag eq 'script' or $tag eq 'noscript') {
        $in_javascript = 0;
    }
    elsif (!$in_javascript) {
        # skip end tags nested inside <noscript>
        $keep_text .= $origtext;
    }
}

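# overrides the text method, called for any text between tags
sub text {
    my ($self, $origtext) = @_;
    if (!$in_javascript) {
        $keep_text .= $origtext;
    }
    elsif ($in_javascript++ == 1) {
        # start() sets the flag to 1, so this is true only for the first
        # text chunk of each block: insert the replacement once per block
        $keep_text .= $js_replacement_text;
    }
}
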
# called for comments
sub comment {
    my ($self, $origtext) = @_;
    if (!$in_javascript) {
        # the parser hands over the comment body without its <!-- and -->
        # delimiters, so put them back
        $keep_text .= "<!--$origtext-->";
    }
}

# called for declarations like <!DOCTYPE ...>
sub declaration {
    my ($self, $origtext) = @_;
    if (!$in_javascript) {
        # the parser strips the <! and >, so put them back
        $keep_text .= "<!$origtext>";
    }
}

1; # a module file must end with a true value
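
Since the files in question are on disk rather than behind a URL, here is a rough sketch of the same idea working on a string, written with the handler interface of HTML::Parser version 3 instead of a subclass. It assumes HTML::Parser 3.x is installed; strip_js and its variable names are made up for illustration. Unlike the regex route, nested tags inside <noscript> are no problem, and a run of adjacent blocks separated only by whitespace gets a single replacement.

```perl
use strict;
use warnings;
use HTML::Parser;   # assumes version 3.x from CPAN

# Hypothetical helper: return $html with each run of adjacent
# <script>/<noscript> blocks replaced by one copy of $replacement.
sub strip_js {
    my ($html, $replacement) = @_;
    my $out      = '';
    my $depth    = 0;   # > 0 while inside <script> or <noscript>
    my $replaced = 0;   # true once the current run has its replacement

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            if ($tag eq 'script' or $tag eq 'noscript') {
                $out .= $replacement unless $replaced++;
                $depth++;
            }
            elsif (!$depth) {
                $replaced = 0;      # ordinary markup ends the run
                $out .= $text;
            }
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            if ($tag eq 'script' or $tag eq 'noscript') {
                $depth-- if $depth;
            }
            elsif (!$depth) {
                $replaced = 0;
                $out .= $text;
            }
        }, 'tagname, text' ],
        # text, comments, declarations and anything else without a handler
        default_h => [ sub {
            my ($text) = @_;
            return if $depth or !defined $text or $text eq '';
            $replaced = 0 if $text =~ /\S/;  # whitespace keeps the run open
            $out .= $text;
        }, 'text' ],
    );
    $p->parse($html);
    $p->eof;            # flush anything still buffered
    return $out;
}
```

Called as strip_js($html_file, $replacement_tag) on a slurped file, it returns the cleaned HTML, which can then be written back out.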
>From: "Matthew J. Brooks" <[EMAIL PROTECTED]>
>Reply-To: "Matthew J. Brooks" <[EMAIL PROTECTED]>
>To: "Boston Perl Mongers" <[EMAIL PROTECTED]>
>Subject: Re: [Boston.pm] multiline search and replace
>Date: Tue, 20 Mar 2001 03:24:23 -0500
>
>Thanks for the reply Mike.
>
>The problem seems more to do with the regex than anything. I need to search
>for something like...
>
><script LANGUAGE="JAVASCRIPT">
><!--
>VendorSite= "Companycom";
>VendorPage= "CompanyMail";
>VendorWidth= 468;
>VendorHeight= 60;
>VendorPrintTag= true;
>VendorNewAd= true;
>VendorLoaded= false;
>VendorVersion= 3.2;
>//-->
></script>
><script
>SRC="http://Vendor.com/ads/" LANGUAGE="JAVASCRIPT"></script>
><script
>LANGUAGE="JAVASCRIPT">
><!--
>if (VendorLoaded) VendorDeliverAd();
>//-->
></script>
><NOSCRIPT> <a target="_top"
>HREF="http://Vendor.com/server/click/Compancom/CompanyMail/123"><img
>SRC="http://Vendor.com/server/ad/Companycom/CompanyMail/123" border="0"
>width="468" height="60"></a>
></NOSCRIPT>
>
>
>...in the html files and replace it with:
>
><IMG SRC="http://www.Company.com/ads/default.gif">
>
>
>The Vendor here dropped their services on Company early (basically as soon
>as they found out their contract was not going to be renewed) and now
>Company's web site has public service ads being fed to it. They want to
>have default.gif displayed until they come up with an internal solution
>for banner rotation (that probably will end up in my lap too). Either way,
>they've got a lost revenue issue, but without this change they also have a
>content control issue.
>
>
>----- Original Message -----
>From: <[EMAIL PROTECTED]>
>To: "Matthew J. Brooks" <[EMAIL PROTECTED]>
>Cc: "Boston Perl Mongers" <[EMAIL PROTECTED]>
>Sent: Tuesday, March 20, 2001 12:46 AM
>Subject: Re: [Boston.pm] multiline search and replace
>
>
> > sub blah {
> >     my $dir = shift;
> >     opendir (my $dh, $dir) or die "can't open $dir: $!";
> >     my @files = readdir ($dh);   # read the list before recursing
> >     closedir ($dh);
> >     foreach my $file (@files) {
> >         next if ($file =~ /^\.\.?$/);   # skip . and ..
> >         my $path = "$dir/$file";
> >         if (-d $path) {
> >             blah ($path);
> >         }
> >         elsif ($file =~ /\.html?$/) {
> >             open (FILE, $path) or die "can't open $path: $!";
> >             my $html_file = join ("",<FILE>);
> >             close (FILE);
> >             $html_file =~
> >                 s/regex_that_matches_java_script/your_replacement_a_href/g;
> >             open (FILE,">$path") or die "can't write $path: $!"; #open for writing
> >             print FILE $html_file;
> >             close (FILE);
> >         }
> >     }
> > }
> >
> >
> > eh, this works i think... well the logic does, i didn't compile it. you
> > could tinker with it and see what you get. good luck. it's not going to
> > be efficient. all i did was slurp up the whole file to do one swap
> > statement. it puts it all into memory like in my xml routine. i am
> > changing this very soon in it because it's a bad idea but for a quick
> > html file hacker it works great. i'll be posting the xml_routine.pl that
> > i presented very soon when i've finished changing some things that need
> > tweaking and it's ready for the eyes of the masses.
> >
> > -mike
> >
> >
>