Re: [Boston.pm] multiline search and replace

Jon Gunnip Tue, 20 Mar 2001 06:19:52 -0800
How about using HTML::Parser?  It may be overkill (for what you want, a
regex might be fine), but it lets you be very flexible with the
HTML, and you don't have to worry about things like nested <script> tags
like you would with a regex.
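
If you do go the regex route, a rough sketch might look like the
following.  It assumes the whole file is already in one string and that
every ad block runs from a <script> tag through the next closing
</NOSCRIPT> tag; a page with an unrelated <script> earlier on would need a
tighter pattern.  The sample HTML and the variable names are made up for
illustration.

```perl
use strict;
use warnings;

# Hypothetical sample page; a real one would be slurped from a file.
my $html = <<'END_HTML';
<p>before</p>
<script LANGUAGE="JAVASCRIPT">
<!--
VendorLoaded= false;
//-->
</script>
<NOSCRIPT> <a HREF="http://Vendor.com/server/click"><img
SRC="http://Vendor.com/server/ad" border="0"></a>
</NOSCRIPT>
<p>after</p>
END_HTML

my $replacement = '<IMG SRC="http://www.Company.com/ads/default.gif">';

# /s lets . match newlines, /i ignores tag case, and .*? is non-greedy
# so each ad block is matched separately rather than first-to-last.
(my $cleaned = $html) =~ s{<script\b.*?</noscript>}{$replacement}gis;

print $cleaned;
```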

And HTML::Parser code can easily be tweaked to do a lot of other HTML-related
stuff.  Via a cron job, I parse bostonphoenix.com and email myself movie
listings and reviews each week for my local movie theaters.  I parse
harvard.edu and email myself Harvard's events each morning.

I've written some sample code below (also see perldoc HTML::Parser).  You
would use it like this:

use GetRidOfJavaScript;
$html_with_image_instead_of_javascript =
    GetRidOfJavaScript::get_rid_of_js('http://www.vendor.com/index.html',
                                      '<IMG src="tmp.gif">');



package GetRidOfJavaScript;
use HTML::Parser;
@ISA = qw(HTML::Parser);

# object creation wrapper method
sub get_rid_of_js {
    # package globals, shared with the handler methods below
    ($url, $js_replacement_text) = @_;
    use LWP::Simple;
    # get() returns undef on failure and does not set $!
    my $html = get($url);
    die "Couldn't get html from $url" unless defined $html;

    #init global vars
    $keep_text = $in_javascript = "";

    my $p = GetRidOfJavaScript->new();  # inherited from HTML::Parser
    # parses HTML and calls start, end, text, comment, declaration methods
    $p->parse($html);
    $p->eof;  # flush any text the parser is still buffering
    return $keep_text;
}

#override start method executed when a start tag is found
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    if ($tag eq 'script' or $tag eq 'noscript') {
        $in_javascript = 1;
    }
    elsif (!$in_javascript) {
        # tags nested inside <noscript> (e.g. <a>, <img>) are dropped
        $keep_text .= $origtext;
    }
}

#override end method executed when an end tag is found
sub end {
    my ($self, $tag, $origtext) = @_;
    if ($tag eq 'script' or $tag eq 'noscript') {
        $in_javascript = 0;
    }
    elsif (!$in_javascript) {
        # end tags nested inside <noscript> are dropped too
        $keep_text .= $origtext;
    }
}

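#override text method executed for any text between tags
sub text {
    my ($self, $origtext) = @_;
    if (!$in_javascript) {
        $keep_text .= $origtext;
    }
    elsif ($in_javascript == 1) {
        # add the replacement once per block, not once per chunk of text
        $keep_text .= $js_replacement_text;
        $in_javascript = 2;
    }
}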

# called for comments
sub comment {
    my ($self, $origtext) = @_;
    if (!$in_javascript) {
        # the parser strips the <!-- and --> delimiters, so put them back
        $keep_text .= "<!--$origtext-->";
    }
}

# called for declarations like <!DOCTYPE>
sub declaration {
    my ($self, $origtext) = @_;
    if (!$in_javascript) {
        $keep_text .= "<!$origtext>"; # parser strips off <! and >
    }
}

1;  # a module must return a true value to be loaded with "use"



>From: "Matthew J. Brooks" <[EMAIL PROTECTED]>
>Reply-To: "Matthew J. Brooks" <[EMAIL PROTECTED]>
>To: "Boston Perl Mongers" <[EMAIL PROTECTED]>
>Subject: Re: [Boston.pm] multiline search and replace
>Date: Tue, 20 Mar 2001 03:24:23 -0500
>
>Thanks for the reply Mike.
>
>The problem seems more to do with the regex than anything. I need to search
>for something like...
>
><script LANGUAGE="JAVASCRIPT">
><!--
>VendorSite= "Companycom";
>VendorPage= "CompanyMail";
>VendorWidth= 468;
>VendorHeight= 60;
>VendorPrintTag= true;
>VendorNewAd= true;
>VendorLoaded= false;
>VendorVersion= 3.2;
>//-->
></script>
><script
>SRC="http://Vendor.com/ads/" LANGUAGE="JAVASCRIPT"></script>
><script
>LANGUAGE="JAVASCRIPT">
><!--
>if (VendorLoaded) VendorDeliverAd();
>//-->
></script>
><NOSCRIPT> <a target="_top"
>HREF="http://Vendor.com/server/click/Compancom/CompanyMail/123"><img
>SRC="http://Vendor.com/server/ad/Companycom/CompanyMail/123" border="0"
>width="468" height="60"></a>
></NOSCRIPT>
>
>
>...in the html files and replace it with:
>
><IMG SRC="http://www.Company.com/ads/default.gif">
>
>
>The Vendor here dropped their services on Company early (basically as soon
>as they found out their contract was not going to be renewed) and now
>Company's web site has public service ads being fed to it. They want to have
>default.gif displayed until they come up with an internal solution for
>banner rotation (that probably will end up in my lap too). Either way,
>they've got a lost revenue issue, but without this change they also have a
>content control issue.
>
>
>----- Original Message -----
>From: <[EMAIL PROTECTED]>
>To: "Matthew J. Brooks" <[EMAIL PROTECTED]>
>Cc: "Boston Perl Mongers" <[EMAIL PROTECTED]>
>Sent: Tuesday, March 20, 2001 12:46 AM
>Subject: Re: [Boston.pm] multiline search and replace
>
>
> > sub blah {
> >   my $dir = shift;
> >   opendir (DIR, $dir) or die "can't open $dir: $!";
> >   my @files = readdir (DIR);  # read everything before recursing
> >   closedir (DIR);
> >   foreach my $file (@files) {
> >     next if ($file =~ /^\.\.?$/);  # skip . and ..
> >     my $path = "$dir/$file";
> >     if (-d $path) {
> >       blah ($path);
> >       next;
> >     }
> >     if ($file =~ /\.html?$/) {
> >       open (FILE, $path) or die "can't open $path: $!";
> >       my $html_file = join ("",<FILE>);
> >       close (FILE);
> >       $html_file =~
> >         s/regex_that_matches_java_script/your_replacement_a_href/g;
> >       open (FILE,">$path") or die "can't write $path: $!"; #open for writing
> >       print FILE $html_file;
> >       close (FILE);
> >     }
> >   }
> > }
> >
> >
> > eh, this works i think... well the logic does, i didn't compile it. you
> > could tinker with it and see what you get. good luck. it's not going to be
> > efficient. all i did was slurp up the whole file to do one swap
> > statement. it puts it all into memory like in my xml routine. i am changing
> > this very soon in it because it's a bad idea but for a quick html file
> > hacker it works great. i'll be posting the xml_routine.pl that i presented
> > very soon when i've finished changing some things that need tweaking and
> > it's ready for the eyes of the masses.
> >
> > -mike
> >
> >
>
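
For what it's worth, the directory walk and slurp quoted above can also be
sketched with the core File::Find module.  This is only a sketch: the sub
name replace_in_tree, the regex, and the replacement text are placeholders
made up for illustration, and it rewrites files in place, so try it on a
copy of the site first.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: rewrite every .htm/.html file under $dir,
# replacing whatever $regex matches with $replacement.
# Returns the number of files that were changed.
sub replace_in_tree {
    my ($dir, $regex, $replacement) = @_;
    my $changed = 0;
    find(sub {
        # find() has already chdir'ed here, so $_ is the bare file name
        return unless -f $_ && /\.html?$/i;
        my $path = $File::Find::name;

        # Slurp the whole file so the regex can span lines.
        open(my $in, '<', $_) or die "can't read $path: $!";
        my $html = do { local $/; <$in> };
        close($in);

        # Only rewrite the file if something matched.
        if ($html =~ s/$regex/$replacement/g) {
            open(my $out, '>', $_) or die "can't write $path: $!";
            print $out $html;
            close($out) or die "can't close $path: $!";
            $changed++;
        }
    }, $dir);
    return $changed;
}
```

You might call it as replace_in_tree('htdocs', qr{<script\b.*?</noscript>}is,
'<IMG SRC="default.gif">'), where 'htdocs' stands in for wherever the HTML
files actually live.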
