So here's an interesting idea -- it seems likely that (a) there is a small
number of spammers who send all the spam (we knew that already), (b) they
can be identified from their spam, and (c) by correlating known
fingerprints (such as ROKSO records) against the text, the spammers'
output can then be tracked into new spams. In other words, we could
identify *which* spammer is likely to have produced a given spam.

Anyone interested in trying this out?

--j.

------- Forwarded Message

> From: Brian McNett <[EMAIL PROTECTED]>
> Subject: Statistical methods for determining spam authorship
> ...
> Basically I just identify features in spam, and build clusters based on
> them. What the clusters mean depends on which features I match against.
> There is no right or wrong way to cluster the data. If I were more of a
> programmer, I'd just suck all the spam features into an SQL database,
> and dash off a simple clustering algorithm in perl.
>
> Identifying a spammer based on his spam is largely a matter of
> determining authorship based on consistent patterns in the text.
> Techniques for doing this are already known. They are collectively
> called "Stylometry".
>
> One of the stylometric techniques which has already been applied to spam
> is "Chi by degrees of freedom". O'Brien and Vogel applied it
> SPECIFICALLY in the context of identifying spammers from their spam.
>
> http://master.iu.hio.no/wiki/index.php/Spam#Chi_by_degrees_of_freedom
>
> The note attached to that wiki entry, that the method requires that
> spammers not consciously change their writing style across texts, is
> somewhat inaccurate. It takes a rather LARGE change in writing style
> to defeat it, and some identifying features will always be present.
> The original paper is here:
>
> http://www.cs.tcd.ie/publications/tech-reports/reports.03/TCD-CS-2003-13.pdf
>
> There is also a more recent paper by the same authors on feature
> selection:
>
> http://www.cs.tcd.ie/Cormac.OBrien/subject.pdf
>
> And yet another comparing chi by d.f. with SpamAssassin:
>
> http://www.cs.tcd.ie/Cormac.OBrien/spamAss.pdf
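
For anyone who wants to experiment with the chi by d.f. idea from the forwarded message, here is a minimal Python sketch. It assumes one common formulation of the statistic: take the most frequent tokens across two text samples, compute a chi-square value over their counts, and divide by the degrees of freedom. The tokenizer, the `top_n` cutoff, and the function names are illustrative assumptions, not taken from O'Brien and Vogel's papers, whose feature choices may well differ.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately crude feature extractor (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def chi_by_df(text_a, text_b, top_n=50):
    """Chi-square over the top_n most frequent joint tokens, divided by d.f.

    Lower values suggest the two samples are more similar in word usage.
    top_n is an arbitrary illustrative cutoff, not a recommended setting.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both texts must contain at least one token")

    # Features are the most frequent tokens in the two samples combined.
    joint = counts_a + counts_b
    features = [tok for tok, _ in joint.most_common(top_n)]
    if len(features) < 2:
        raise ValueError("need at least two distinct tokens")

    grand_total = total_a + total_b
    chi2 = 0.0
    for tok in features:
        observed_a = counts_a[tok]
        observed_b = counts_b[tok]
        joint_count = observed_a + observed_b
        # Expected counts if both samples drew tokens from the same source.
        expected_a = joint_count * total_a / grand_total
        expected_b = joint_count * total_b / grand_total
        chi2 += (observed_a - expected_a) ** 2 / expected_a
        chi2 += (observed_b - expected_b) ** 2 / expected_b

    # Degrees of freedom: number of features minus one.
    return chi2 / (len(features) - 1)
```

Under these assumptions, one could score a new spam against a pool of text already attributed to each known spammer and treat the lowest score as the most likely match. Real samples would need far more text than a single short message for the counts to mean much.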
