--On Friday, March 19, 2004 10:24 AM -0600 Matthew Simpson <[EMAIL PROTECTED]> wrote:

I need some quick help scanning the message body for URLs and certain HTML
tags.


We do this, in filter(), to catch tags.  It changes <object to
<no-object, etc, disabling it.


    # Check for bad code in HTML parts
    if ($type eq "text/html") {
        my($bla,$badtag);
        if ($io = $entity->open("r")) {
            while (defined($_ = $io->getline)) {
                # note iframe, script, object
                # \b rather than a literal space, so that <script> and a
                # tag name at the end of a line are caught too
                if (/<(iframe|script|object)\b/i) {
                    $badtag = $1;
                    $_ =~ s/<(iframe|script|object)\b/<no-$1 /ig;
                }
                $bla .= $_;
            }
            $io->close;
        }
        if ($badtag) {
            # write the modified body back and flag the message
            if ($io = $entity->open("w")) {
                $io->print($bla);
                $io->close;
            }
            md_graphdefang_log('modify',"$badtag tag deactivated");
            action_change_header("X-Warning",
                                 "$badtag tag modified by Columbia filter");
            action_rebuild();
        }
    }



Bugged IMG tags are probably the next thing to go into this section.
Personally I use a MUA that does not show images.

Scanning for URLs is much harder.  The above reads one line at a time,
so it does not catch anything broken over more than one line.  You can
set $/="\n\n" to work by paragraphs, but I think some of the more
obfuscated garbage even spans paragraphs.  I just started looking at
this.  Basically you have to catch <a.href and then buffer everything
up to the next </a>, with some kind of stream input.  I didn't peek at
Anomy HTML Cleaner yet to see how they do it :-)   And if you really
want to do a lot of HTML cleaning, well, they do it all, which is more
than we want to do.
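
A rough sketch of the "catch <a.href and buffer till </a>" idea, not
something we run here.  It assumes the whole decoded HTML part is
already in one string (for instance the concatenated lines, like $bla
above), so line and paragraph breaks stop mattering.  extract_links is
a made-up helper name, and a regex like this will still miss badly
malformed HTML that a real parser would handle.

```perl
use strict;
use warnings;

# Hypothetical helper: return [href, link text] pairs from an HTML string.
# [^>] and (with /s) "." both match newlines, so an anchor broken over
# several lines or paragraphs is still seen as one unit.
sub extract_links {
    my ($html) = @_;
    my @links;
    while ($html =~ m{<a\b([^>]*)>(.*?)</a\s*>}gis) {
        my ($attrs, $text) = ($1, $2);
        # Accept double-quoted, single-quoted, or bare attribute values.
        next unless $attrs =~
            m{\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))}is;
        my $href = defined $1 ? $1 : defined $2 ? $2 : $3;
        $href =~ s/\s+//g;          # drop whitespace used to split the URL
        $text =~ s/<[^>]*>//g;      # strip nested tags from the visible text
        $text =~ s/\s+/ /g;         # collapse runs of whitespace
        $text =~ s/^\s+|\s+$//g;    # trim both ends
        push @links, [$href, $text];
    }
    return @links;
}

# Usage: an anchor whose tag, URL and text are all split across lines.
my $sample = "Click <a\nhref=\"http://example.com/\npath\">here\nnow</a> please";
print "$_->[0] => $_->[1]\n" for extract_links($sample);
# prints: http://example.com/path => here now
```

Slurping the whole part costs memory on very large messages, which is
the trade-off against the stream approach mentioned above.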

Joseph Brennan
Academic Technologies Group, Academic Information Systems (AcIS)
Columbia University in the City of New York



_______________________________________________
Visit http://www.mimedefang.org and http://www.canit.ca
MIMEDefang mailing list
[EMAIL PROTECTED]
http://lists.roaringpenguin.com/mailman/listinfo/mimedefang
