What might be interesting is to run a regexp on JUST the domain names, similar to how MT-Blacklist works (the spam plugin for MovableType weblogs): http://www.jayallen.org/comment_spam/
I've seen that most spam websites follow patterns similar to the ones Jay Allen has pointed out. Why not run these tests on the spam URLs themselves?
He's got a nice base to work from, and his top 15 rules seem to catch most of the spammers:
([\w\-_.]+\.)?(l(so|os)tr)\.[a-z]{2,} # Catchall regex for lsotr.xxx and lostr.xxx with or without a subdomain
(blow)[\w\-_.]*job[\w\-_.]*\.[a-z]{2,}
(buy)[\w\-_.]*online[\w\-_.]*\.[a-z]{2,} # Catchall regexp for many spam sites
(diet|penis)[\w\-_.]*(pills|enlargement)[\w\-_.]*\.[a-z]{2,} # Catchall regexp for many spam sites
(i|la)-sonneries?[\w\-_.]*\.[a-z]{2,}
(levitra|lolita|phentermine|viagra|vig-?rx|zyban|valtex|xenical|adipex|meridia\b)[\w\-_.]*\.[a-z]{2,} # Super regexp for domains containing levitra, lolita, phentermine, viagra, vigrx, vig-rx, zyban, valtex, xenical, adipex and meridia
(magazine)[\w\-_.]*(finder|netfirms)[\w\-_.]*\.[a-z]{2,}
(mike)[\w\-_.]*apartment[\w\-_.]*\.[a-z]{2,} # Catchall regexp for Mike's Apartment variations
(milf)[\w\-_.]*(hunter|moms|fucking)[\w\-_.]*\.[a-z]{2,}
(online)[\w\-_.]*casino[\w\-_.]*\.[a-z]{2,} # Catchall regexp for a hundred online casino sites
(prozac|zoloft|xanax|valium|hydrocodone|vicodin|paxil|vioxx)[\w\-_.]*\.[a-z]{2,} # Super regexp for domains containing prozac, zoloft, xanax, valium, hydrocodone, vicodin, paxil, vioxx
(ragazze)-?\w+\.[a-z]{2,} # Catchall regexp for many spam sites
(ultram\b|\btenuate|tramadol|pheromones|phendimetrazine|ionamin|ortho.?tricyclen|retin.?a)[\w\-_.]*\.[a-z]{2,} # Third drug super regexp
(valtrex|zyrtec|\bhgh\b|ambien\b|flonase|allegra|didrex|renova\b|bontril|nexium)[\w\-_.]*\.[a-z]{2,} # Fourth drug super regexp
That covers a ton of spam URLs.
His excellent plugin works off the URLs that spammers specify. It's a great solution for that situation. Email is a bit tougher, but perhaps we could harness this capability?
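To make the idea concrete, here is a rough Python sketch of running such patterns against only the host names found in a message body. The three patterns are copied from the list quoted in this message. The URL-finding regex, the function name spam_hosts, and the choice of case-insensitive matching are my own assumptions for illustration, not anything MT-Blacklist or SpamAssassin actually does.

```python
import re
from urllib.parse import urlsplit

# Three of the quoted MT-Blacklist patterns, verbatim; the others
# would be added to this list in the same way.
PATTERNS = [
    r"(online)[\w\-_.]*casino[\w\-_.]*\.[a-z]{2,}",
    r"(buy)[\w\-_.]*online[\w\-_.]*\.[a-z]{2,}",
    r"(prozac|zoloft|xanax|valium|hydrocodone|vicodin|paxil|vioxx)[\w\-_.]*\.[a-z]{2,}",
]
# Assumption: match case-insensitively.
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]

# Assumption: a deliberately crude URL finder, good enough for a sketch.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

def spam_hosts(body):
    """Return the host names of URLs in body that match any pattern."""
    hits = []
    for url in URL_RE.findall(body):
        host = urlsplit(url).hostname  # host part only, lowercased
        if host and any(p.search(host) for p in COMPILED):
            hits.append(host)
    return hits
```

Testing only the host name, rather than the whole message, keeps ordinary text that happens to mention one of these words from triggering a match.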
Gary Funck wrote:
As a follow-up to, but off-topic from the bug report ...
------- Additional Comments From [EMAIL PROTECTED] 2004-01-25 02:18 -------
I don't like the idea of having to run mass-checks manually and extracting domain names to check from that -- mostly because most people won't do it.
How about this:
- Extract the registrable domain part using reportedly existing heuristics (hostpart.spammer.co.uk -> spammer.co.uk)
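One possible shape for such a heuristic, as a minimal Python sketch: keep the last two labels, unless they form a known two-label suffix such as co.uk, in which case keep three. The suffix set below is a tiny hypothetical sample and the function name registrable_domain is made up for this example; a usable version would need a far more complete suffix list.

```python
# Assumption: a tiny illustrative sample of two-label suffixes.
# A real list would be much longer.
TWO_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp"}

def registrable_domain(host):
    """Reduce a host name to its (guessed) registrable domain."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) <= 2:
        return ".".join(labels)
    # Keep three labels when the last two form a known suffix.
    if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])
```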
Over the weekend, I've collected 3,600 host names associated with 16,300 URLs extracted from about 80,000 spam messages going back to August of this year. They're sorted in reverse dot order, for example:
trimtram.net
trinketreach.net
www.try4free.net
www.ultrastats.net
umbrellacover.net
www.usagov.net
www.usaskylink.net
ns.usenetsolution.net
www.vacationpromo.net
mysite.verizon.net
viva-x.net
www.vivato.net
bradford.hfwnflvzxb.wealthnation.net
lane.nerbq.wealthnation.net
www.whitephantom.net
www.whitetrashsluts.net
www.whoringfor-college.net
www.wideep.net
As you can see, for example, the wealthnation.net entries are together, but the host name prefixes are different.
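For reference, "reverse dot order" can be reproduced with a sort key that reverses the labels of each host name, so comparison starts at the top-level domain and works leftward. This is a small Python sketch; the name reverse_dot_key is hypothetical, and the four hosts are taken from the list above.

```python
def reverse_dot_key(host):
    """Sort key: labels compared right to left (TLD first)."""
    return tuple(reversed(host.lower().split(".")))

hosts = [
    "www.vivato.net",
    "lane.nerbq.wealthnation.net",
    "viva-x.net",
    "bradford.hfwnflvzxb.wealthnation.net",
]
ordered = sorted(hosts, key=reverse_dot_key)
# The two wealthnation.net hosts end up next to each other.
```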
Question: is there a Perl package that can be used to boil these down to their domain name part, suitable for a whois lookup? Where I'm going with this is to try to build a database of hosts that share the same registrar, technical point of contact, and so on. One approach I thought of was to try a whois on the fully qualified host names above and, if it doesn't succeed, remove the first component and try again, and so on, but that's not very elegant.
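The strip-and-retry approach described above could look roughly like this in Python. The lookup argument is a stand-in for whatever performs the actual whois query (a hypothetical callable that returns a record, or something false when nothing is found); no real whois client is assumed here.

```python
def first_whois_hit(host, lookup):
    """Try host, then host minus its first label, and so on.

    Returns (name, record) for the first name that lookup() finds,
    or None if no candidate produces a record.
    """
    labels = host.rstrip(".").split(".")
    # Stop before querying the bare top-level label.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        record = lookup(candidate)
        if record:
            return candidate, record
    return None
```

With a dictionary standing in for the whois service, first_whois_hit("lane.nerbq.wealthnation.net", table.get) would try the full name, then nerbq.wealthnation.net, then wealthnation.net. The cost is up to one query per label, which is part of why the approach feels inelegant.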
Regarding whois, I tried a few of the domains in the list and noticed that whois turned up empty. Is there a database somewhere that relates domain names to their registrar, or to a server that will reply with their whois info?
-- Robert J. Accettura [EMAIL PROTECTED]
