On Monday, March 29, 2004, 10:01:59 AM, Marc Perkel wrote:
> Here's some of my initial thoughts.
> In the domain is what I would call the "real" part of the domain.
> farmsex.com
> farmsex.co.uk
> The part before the "farmsex" should be ignored. Anyone who controls the
> domains also probably controls the subdomains and that is likely the
> rotating part.

Yes, that's done automatically by the averaging and summarization on the data service side (i.e. in SURBL). This effect is hopefully adequately explained in my docs:

http://spamcheck.freeapp.net/
http://sc.surbl.org/

On the data service side, this desirable effect happens essentially as a side effect of the data handling. It's part of the design, but the "real" domains pop out of the data and into SURBL pretty much on their own.

Extraction of the "real" domain also needs to be done on the SA/SURBL client side so that only the "real" part of the domain from the message is compared against SURBL. Heuristically all that's needed most of the time is to compare the second and third levels of any given domain that occurs in a message body URI. That will match the data in SURBL very well:

http://spamcheck.freeapp.net/top-sites-domains

for named URIs. Numeric addresses should match on all four octets.

> Additionally - a reverse lookup should be done on the IPs of the links
> for the purposes of statistical tracking. We might find that the
> resolved IP is always spam - or always not spam - or sometimes spam and
> sometimes not spam. We may be able to return a score on the resolved IP
> addresses. I believe that we are going to see a lot of spam linking to
> the same IP or groups of IPs and that if a new URI resolves to the same
> IP address as farmsex.com then it is likely also spam.

That can be done, but it's not really part of my intended purpose for the SURBL data itself. I envision it as a literal comparison of unresolved domain names (and of IP addresses such as 1.2.3.4 when the original URI is like http://1.2.3.4/foo). I expect no DNS resolution to be used at all anywhere around SURBL, and I expect this to work well!
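To make the client-side heuristic above concrete, here is a rough Python sketch. It is only an illustration, not the SA/SURBL client code: the function names and the in-memory set standing in for the list are hypothetical, and it assumes "second and third levels" means the rightmost two and the rightmost three labels of the host name. Numeric hosts are compared as the full dotted quad, and nothing is resolved through DNS.

```python
import ipaddress
from urllib.parse import urlsplit


def uri_host(uri):
    """Return the lowercased host part of a URI, or None if it has none."""
    host = urlsplit(uri).hostname  # urlsplit lowercases the hostname
    return host.rstrip(".") if host else None


def lookup_keys(host):
    """Return the literal strings to compare against the list for one host.

    Hypothetical helper: named hosts yield their last two and last three
    labels; numeric IPv4 hosts yield the full address (all four octets).
    """
    try:
        return [str(ipaddress.IPv4Address(host))]
    except ValueError:
        pass  # not a dotted quad, so treat it as a name
    labels = host.split(".")
    keys = []
    if len(labels) >= 2:
        keys.append(".".join(labels[-2:]))  # e.g. farmsex.com
    if len(labels) >= 3:
        keys.append(".".join(labels[-3:]))  # e.g. farmsex.co.uk
    return keys


def is_listed(uri, listed):
    """True if any lookup key of the URI's host is in the 'listed' set."""
    host = uri_host(uri)
    return bool(host) and any(key in listed for key in lookup_keys(host))


# Stand-in data for illustration only; a real client would query the list.
listed = {"farmsex.com", "farmsex.co.uk", "1.2.3.4"}
for uri in ("http://abc123.farmsex.com/x",
            "http://rotating.farmsex.co.uk/",
            "http://1.2.3.4/foo",
            "http://www.cnn.com/"):
    print(uri, is_listed(uri, listed))
```

For a host under co.uk the two-label key is just "co.uk", which is harmless as long as the list never contains a bare registry suffix. The rotating subdomain part is dropped before any comparison, which is the point of the heuristic.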
> The thought is that spammers might start linking to cnn.com or something
> to try to raise the score - even if it's in hidden text. And - that's an
> issue - but live links to other sites might defeat the purpose of the
> spam and mixing blacklisted sites with nonblacklisted might even become
> a stronger indicator of spam.

I'd lump this into the general category of Joe Job. It can pretty easily be defeated by whitelisting. In fact I already have cnn.com in my small whitelist along with a couple of other news sites:

http://spamcheck.freeapp.net/whitelist-domains.sort

Due to the averaging effect, careful reporting, and a somewhat high inclusion threshold, full domains in the whitelist are seldom actually hit, however.

Frankly this all works somewhat better than I expected on the data side, probably mostly due to the quality of the SpamCop URI data.

Jeff C.
--
Jeff Chan
mailto:[EMAIL PROTECTED]
http://sc.surbl.org/
