I've noticed several spam mails that include a lot of quoted text
(quotes from Dave Barry, some of Moby Dick, that sort of thing; usually
all punctuation is stripped out, but not always) within angle brackets
or an HTML title. It's likely being used to counterweight the message
against a Bayesian filter, since most of the words also appear in ham.
I made two rules to catch this. They don't seem likely to cause false
positives (though perhaps the title length threshold should be raised
past 80), and they work quite well against my corpus, but are there any
problems I'm overlooking with this approach?

rawbody L_Text_Padding_In_Html      /<(title>)?[ '-.,?!\w]{50,}>/
describe L_Text_Padding_In_Html  Text padding within brackets or HTML title to fool Bayesian filter
score L_Text_Padding_In_Html 3.0

rawbody L_Very_Long_Title  /<title>[ '-.,?!\w]{80,}<\/title>/
describe L_Very_Long_Title HTML title longer than 80 characters to fool Bayesian filter
score L_Very_Long_Title 1.0
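
For anyone who wants to poke at the two patterns outside SpamAssassin,
here is a rough sketch in Python. It assumes Python's re module reads
these particular patterns the same way Perl does (I believe it does for
this syntax), and every sample string is made up for illustration, not
taken from real mail. The helper name check() is my own.

```python
import re

# Patterns copied from the two rawbody rules.
PADDING = re.compile(r"<(title>)?[ '-.,?!\w]{50,}>")
LONG_TITLE = re.compile(r"<title>[ '-.,?!\w]{80,}</title>")


def check(sample):
    """Return (padding_hit, long_title_hit) for one line of raw body text."""
    return (PADDING.search(sample) is not None,
            LONG_TITLE.search(sample) is not None)


# Hypothetical filler text, 61 characters including the trailing space.
words = "call me ishmael some years ago never mind how long precisely "

samples = {
    "padding in brackets": "<" + words + ">",
    "padding in title": "<title>" + words * 2 + "</title>",
    "ordinary title": "<title>Weekly newsletter</title>",
    "ordinary link": '<a href="http://example.com/page">',
    # '-. inside the class is read as a range (' through .),
    # so ( ) * + are accepted as well as the hyphen.
    "range characters": "<" + "(*+)" * 15 + ">",
}

for name, text in samples.items():
    print(name, check(text))
```

In this sketch the padded title is caught only by the second pattern:
the first one wants a ">" directly after the run of words, and a
well-formed title has "</title>" there instead.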

Thanks,
sckot Vokes
-- 
"I wish I had a 2 liter of Pepsi in my box of replacement
 staples, so if they needed to quench their thirst, then
 they could ride the snake." -Kefka P


_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
