Hi,

I have an idea similar to Scott A Crosby's datamining application. Instead of a datamining/analysis program, I used the Bayes database. For example, if you run:

    sa-learn --dump all | grep "^0\.999 *[0-9]* *0 [0-9]*"

sa-learn will show all Bayes entries that are a clear sign of spam (score 0.999, zero occurrences in ham). After manually cleaning the list of non-URL entries, I have lines like:

    0.999 36 0 1073851236 www.10cial.biz
    0.999 49 0 1074054013 www.tupit.info
    0.999 58 0 1074283556 U*www.treasurecity.biz.in
    0.999 38 0 1073851236 D*naturalgrowth.us
    0.999 48 0 1074371753 www.mytoyz.biz
    0.999 34 0 1073976168 N:www.hwyNNz.com
    0.999 35 0 1073769982 www.560000x.com
    0.999 36 0 1074416509 www.gowebrx.com
    0.999 36 0 1073841838 UD:2005hosting.com
    0.999 54 0 1074302451 UD:3001hosting.com
    0.999 34 0 1074301410 UD:getwebrx.com
    0.999 47 0 1074279713 UD:mytoyz.biz
    0.999 63 0 1074270837 UD:cashcome.net
    0.999 58 0 1074283556 UD:ktbxurnjlpe.ph
    0.999 38 0 1074111779 UD:whokz.info
    0.999 36 0 1074036850 UD:freeadultranch.com
    0.999 35 0 1073769982 UD:560000x.com
    0.999 85 0 1074304161 UD:herbalsforcheap.com
    0.999 45 0 1073719261 UD:mdpillsource.com
    0.999 39 0 1074148074 UD:net.tw
    0.999 31 0 1073802737 UD:2006hosting.com
    0.999 36 0 1074025244 UD:bestofthestarz.com
    0.999 38 0 1074133361 UD:ez-123hosting.com
    0.999 71 0 1074372616 UD:amyz.info
    0.999 34 0 1073976168 UD:hwy55z.com
    0.999 39 0 1074302451 UD:3002hosting.com
    0.999 49 0 1073888477 UD:e-hostzz.com
    0.999 73 0 1073871887 UD:kimo.com.tw
    0.999 36 0 1073851236 UD:10cial.biz
    0.999 89 0 1074193423 UD:tupit.info
    0.999 31 0 1074318551 UD:nepzzz.com

I'm thinking of writing a script that uses this information to filter the spam mbox and find the full URL patterns. These URL patterns can then be used to write custom rules, or to extend the bigevil ruleset.

Some questions:

- Does this sound like a good idea?
- Is the source list of domains listed in bigevil available?
- Making it easier to contribute URLs for bigevil might increase the number of false positives. How can this be prevented? (e.g.
  using only tokens with a Bayes score of 0.999, having the contributor check that the URL parts don't occur in ham, requiring that the domain name exists in DNS, logging the contributor, requiring an example spam mail with each contribution, etc.)

Suggestions are welcome,

Regards,
Pieter
--
http://zwiki.org/PieterB

_______________________________________________
Spamassassin-talk mailing list
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
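
A minimal sketch of how such a script could look, in Python. It assumes the five-column dump layout from the listing above (read here as probability, spam count, ham count, timestamp, token) and that token prefixes are limited to the kinds seen in that listing (UD:, U*, D*, N:). The function names spam_domains and matching_urls are made up for illustration and are not part of SpamAssassin.

```python
import re

# Assumption: token prefixes look like the ones in the listing (UD:, U*, D*, N:).
PREFIX = re.compile(r"^[A-Z]{1,2}[:*]")
# Deliberately simple URL matcher; real spam uses many obfuscations this misses.
URL = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def spam_domains(dump_lines, min_prob=0.999):
    """Collect domains from dump lines with prob >= min_prob and zero ham hits."""
    domains = set()
    for line in dump_lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip header/magic lines and anything malformed
        prob, _spam, ham, _atime, token = fields
        try:
            keep = float(prob) >= min_prob and int(ham) == 0
        except ValueError:
            continue
        if not keep:
            continue
        domain = PREFIX.sub("", token).lower()
        if domain.startswith("www."):
            domain = domain[4:]
        if "." in domain:
            domains.add(domain)
    return domains


def matching_urls(text, domains):
    """Return URLs in text whose host equals, or is a subdomain of, a listed domain."""
    hits = []
    for url in URL.findall(text):
        host = re.sub(r"^https?://", "", url, flags=re.IGNORECASE)
        host = host.split("/")[0].split(":")[0].lower()
        if any(host == d or host.endswith("." + d) for d in domains):
            hits.append(url)
    return hits
```

To run this over a spam mbox, the message bodies could be read with Python's standard mailbox module and passed to matching_urls one at a time. Note that an entry such as UD:net.tw would match every host under net.tw, so the domain list still needs a manual check before any rules are generated from it.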