--On Friday, March 19, 2004 10:24 AM -0600 Matthew Simpson <[EMAIL PROTECTED]> wrote:
I need some quick help scanning the message body for URLs and certain HTML tags.
We do this in filter() to catch tags. It changes <object to <no-object, and so on, which disables the tag.
    # Check for bad code in HTML parts
    if ($type eq "text/html") {
        my($bla,$badtag);
        if ($io = $entity->open("r")) {
            while (defined($_ = $io->getline)) {
                # note iframe, script, object
                # (\b rather than a trailing space, so a bare <script> is caught too)
                if (/<(iframe|script|object)\b/i) {
                    $badtag = $1;
                    $_ =~ s/<(iframe|script|object)\b/<no-$1 /ig;
                }
                $bla .= $_;
            }
            $io->close;
        }
        if ($badtag) {
            if ($io = $entity->open("w")) {
                $io->print($bla);
                $io->close;
            }
            md_graphdefang_log('modify',"$badtag tag deactivated");
            action_change_header("X-Warning", "$badtag tag modified by Columbia filter");
            action_rebuild();
        }
    }
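If you want to try the substitution on its own before putting it in a filter, something like this should work as a rough sketch. The deactivate_tags name is made up for illustration and is not part of MIMEDefang; it runs the same regex over a plain string and also hands back the last tag name it rewrote.

```perl
use strict;
use warnings;

# Hypothetical helper (not a MIMEDefang function): rewrite the opening
# iframe/script/object tags in a string. Returns the modified text and
# the lowercased name of the last tag rewritten (undef if none).
sub deactivate_tags {
    my ($html) = @_;
    my $badtag;
    # /e lets us record the tag name while building the replacement.
    $html =~ s/<(iframe|script|object)\b/$badtag = lc $1; "<no-$1 "/ige;
    return ($html, $badtag);
}

my ($out, $tag) = deactivate_tags(qq{<p>hi</p><SCRIPT src="x.js"></script>\n});
print $out;
print "deactivated: $tag\n" if defined $tag;
```

Only opening tags are touched; the closing </script> is left alone, which is harmless once the opening tag no longer parses as a script.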
Bugged IMG tags are probably the next thing to go into this section. Personally I use a MUA that does not show images.
Scanning for URLs is much harder. The above does not catch things broken over more than one line. You can set $/="\n\n" to work by paragraphs, but I think some of the more obfuscated garbage even spans paragraphs. I just started looking at this. Basically you have to catch <a.href and then buffer everything until the next </a>, with some kind of stream input. I didn't peek at Anomy HTML Cleaner yet to see how they do it :-) And if you really want to do a lot of HTML cleaning, well, they do it all, more than we want to do.
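One way around the line and paragraph problem, as a rough sketch only: read the whole HTML part into a single string and match with the /s modifier, so that an anchor may span any number of lines. The extract_links name and the sample body are made up for illustration. This is a plain regex, not a real HTML parser, so it will still miss plenty of obfuscation (entity-encoded URLs, nested or unclosed tags); a module like HTML::Parser would be sturdier.

```perl
use strict;
use warnings;

# Hypothetical helper: return [href, link text] pairs for each
# <a ... href=...> ... </a> found in a string holding the whole HTML part.
sub extract_links {
    my ($html) = @_;
    my @links;
    while ($html =~ m{<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))[^>]*>(.*?)</a\s*>}gis) {
        # Exactly one of the three href alternatives matched.
        my $href = defined $1 ? $1 : defined $2 ? $2 : $3;
        my $text = $4;
        $href =~ s/\s+//g;    # drop line breaks hidden inside the URL
        push @links, [$href, $text];
    }
    return @links;
}

# Anchor broken across lines, with a blank line inside the link text.
my $body = qq{<p>Click <a\n  href="http://example.com/\nlogin">here\n\nnow</a></p>};
for my $link (extract_links($body)) {
    print "$link->[0]\n";
}
```

Slurping the part costs memory on very large messages, so a size cap before scanning is probably sensible.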
Joseph Brennan Academic Technologies Group, Academic Information Systems (AcIS) Columbia University in the City of New York