http://bugzilla.spamassassin.org/show_bug.cgi?id=1375
------- Additional Comments From [EMAIL PROTECTED] 2004-01-24 15:46 -------
FYI -- I got into a discussion about this elsewhere, and here's a comment I
posted.
The big problem with querying A records for URLs in mail messages,
at scan time, is that this can be *very expensive* in terms of runtime. ...
Consider a spammer who wants to DDoS someone's mail site. If they know
that site uses a scanner which will perform A lookups on all URLs
in the message, they set up a really slow nameserver for a zone,
use URLs in that zone in their messages, and then send hundreds of
messages. The scanner will take forever, mail will back up, ouch.
Alternatively, if the scanner times out after 30 seconds of checking
URL A records, then they insert maybe 5 links with really slow A
records, in tiny img tags (let's say) so that humans will overlook
them, and 1 link with the *real* payload after that. The scanner
will time out after checking several, and not get to the real meat.
If we randomly select N URLs to check from a 200-URL message, this
also provides a way for them to get around it; they just throw in
hundreds of junk links to Yahoo! etc.
We can keep coming up with new ways to heuristically determine which URLs
are likely to be spammy, but there's a whole metric crapload of ways for
spammers to avoid it, or attack it, IMO.
I'm thinking a good approach to this problem would be this:
- run an offline scanner (something like SpamAssassin's "mass-check")
over a spam/spamtrap corpus periodically
- this scanner greps out the IMG SRC and A href links
- parses out hostname parts
- does SBL/XBL/whatever lookups *in parallel* so timeouts are not a
bottleneck
- if a hostname uses an SBL-listed IP, create a SpamAssassin rule for
that hostname
- output the SpamAssassin ruleset to catch those URLs using the "uri"
rule type (see the sketch after this list)
- also, or alternatively, add them to a DNSBL of "spammer URLs"
for network lookups
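To make the parallel-lookup and rule-output steps concrete, here's a rough
Python sketch. It is not mass-check itself; the URL regex is a toy, the
sbl-xbl.spamhaus.org zone just stands in for whichever BL you trust, and
the rule names and score are invented:

  import re
  import socket
  from concurrent.futures import ThreadPoolExecutor
  from urllib.parse import urlparse

  # Toy extractor: grabs http:// URLs out of href=/src= attributes.
  URL_RE = re.compile(r'(?:href|src)\s*=\s*["\']?(http://[^\s"\'>]+)', re.I)

  def hostnames(message):
      # Grep out the IMG SRC / A HREF links, parse out the hostname parts.
      return {urlparse(u).hostname for u in URL_RE.findall(message)} - {None}

  def bl_listed(host, zone="sbl-xbl.spamhaus.org"):
      # DNSBLs are queried by reversing the IP's octets under the BL zone;
      # the name only resolves (to 127.0.0.x) if the IP is listed.
      try:
          ip = socket.gethostbyname(host)
          rev = ".".join(reversed(ip.split(".")))
          socket.gethostbyname("%s.%s" % (rev, zone))
          return True
      except socket.gaierror:   # NXDOMAIN, resolver failure, etc.
          return False

  def emit_rules(corpus):
      hosts = sorted(set().union(*(hostnames(m) for m in corpus)))
      # Parallel lookups: a super-slow nameserver ties up one worker
      # thread instead of stalling the whole run.
      with ThreadPoolExecutor(max_workers=20) as pool:
          flags = list(pool.map(bl_listed, hosts))
      for i, host in enumerate(h for h, bad in zip(hosts, flags) if bad):
          name = "SPAM_URI_%03d" % i
          print("uri      %s  /%s/i" % (name, re.escape(host)))
          print("describe %s  URL hostname %s has a BL-listed IP" % (name, host))
          print("score    %s  3.0" % name)

The printed output is a plain SpamAssassin ruleset using the "uri" rule
type, so it can be dropped into the site config; regenerate it on each
periodic run, and the slow-nameserver problem never touches scan time.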
This has two benefits:
- spammers listing legit URLs like www.yahoo.com do not cause FPs,
because those are not on BL-listed IPs
- super-slow servers will not bottleneck the scanner itself, just
the offline rule-generation step
(oh look, Chris Santerre suggested that! Great minds think alike, Chris ;)
Comments? I would be *very* interested in getting this working, given that
spammers nowadays seem to be doing a lot of self-hosting, and/or using
proxies to host their sites.
Re: the danger of confirming email addresses. Consider this link:
img src=http://9eea82a2a786474ac9ceebe1ba296ad4.spamscumbag.biz
That's my address, MD5-encoded. It's also a valid hostname, because that
zone is a wildcard zone. To avoid confirming the address, I'd suggest we
detect long strings in hostname components that could be labels in a
wildcard zone, and throw in some random bits -- or just query random bits
ourselves; a wildcard zone will resolve anything there. (A sketch follows.)
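Here's a minimal sketch of the "query random bits ourselves" probe, in
Python. The function name and the 32-hex-char probe label are my own
choices, nothing from the scanner:

  import secrets
  import socket

  def is_wildcard_zone(zone):
      # Look up a label we invented ourselves: a wildcard zone will
      # resolve *anything*, so a hit on a random label means the
      # md5-looking label proves nothing about any real address.
      probe = secrets.token_hex(16) + "." + zone
      try:
          socket.gethostbyname(probe)
          return True    # our random label resolved: wildcard zone
      except socket.gaierror:
          return False   # NXDOMAIN: this zone isn't wildcarded

If is_wildcard_zone("spamscumbag.biz") comes back true, we can skip (or
distrust) lookups on the spammer-chosen label entirely -- and since we only
ever queried our own random label, no address gets confirmed.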