Re: [SAtalk] common patterns / improving bigevil

PieterB Sun, 18 Jan 2004 09:54:15 -0800

Hi,

I have an idea similar to Scott A Crosby's data-mining application.
Instead of a data-mining/analysis program, I used the Bayes
database. For example, if you run:


        sa-learn --dump all | grep "^0\.999 *[0-9]*  *0 [0-9]*"

sa-learn will show all Bayes entries which are clearly a sign of spam
(score=0.999, zero occurrences in ham). After manually cleaning up the
list to remove non-URL entries, I have lines like:

0.999         36          0 1073851236  www.10cial.biz
0.999         49          0 1074054013  www.tupit.info
0.999         58          0 1074283556  U*www.treasurecity.biz.in
0.999         38          0 1073851236  D*naturalgrowth.us
0.999         48          0 1074371753  www.mytoyz.biz
0.999         34          0 1073976168  N:www.hwyNNz.com
0.999         35          0 1073769982  www.560000x.com
0.999         36          0 1074416509  www.gowebrx.com
0.999         36          0 1073841838  UD:2005hosting.com
0.999         54          0 1074302451  UD:3001hosting.com
0.999         34          0 1074301410  UD:getwebrx.com
0.999         47          0 1074279713  UD:mytoyz.biz
0.999         63          0 1074270837  UD:cashcome.net
0.999         58          0 1074283556  UD:ktbxurnjlpe.ph
0.999         38          0 1074111779  UD:whokz.info
0.999         36          0 1074036850  UD:freeadultranch.com
0.999         35          0 1073769982  UD:560000x.com
0.999         85          0 1074304161  UD:herbalsforcheap.com
0.999         45          0 1073719261  UD:mdpillsource.com
0.999         39          0 1074148074  UD:net.tw
0.999         31          0 1073802737  UD:2006hosting.com
0.999         36          0 1074025244  UD:bestofthestarz.com
0.999         38          0 1074133361  UD:ez-123hosting.com
0.999         71          0 1074372616  UD:amyz.info
0.999         34          0 1073976168  UD:hwy55z.com
0.999         39          0 1074302451  UD:3002hosting.com
0.999         49          0 1073888477  UD:e-hostzz.com
0.999         73          0 1073871887  UD:kimo.com.tw
0.999         36          0 1073851236  UD:10cial.biz
0.999         89          0 1074193423  UD:tupit.info
0.999         31          0 1074318551  UD:nepzzz.com
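
The manual cleanup could be partly automated. Below is a minimal
Python sketch, assuming the five columns in the listing above are
probability, spam count, ham count, last-seen timestamp and token.
The function name spam_only_domains and the prefix list are my own
assumptions, taken only from the tokens shown here. N: tokens are
skipped because their digits are masked (hwyNNz vs. hwy55z), and
broad entries such as net.tw would still need a manual look.

```python
import re

# Assumed column layout, as in the listing above:
# probability, spam count, ham count, last-seen timestamp, token.
DUMP_LINE = re.compile(r"^(\d\.\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)$")

# Prefixes seen in the listing above; other prefixes may exist.
PREFIX = re.compile(r"^(?:UD:|U\*|D\*)")


def spam_only_domains(lines, min_prob=0.999):
    """Collect domains from dump lines that never occurred in ham."""
    domains = set()
    for line in lines:
        m = DUMP_LINE.match(line.strip())
        if not m:
            continue  # not a plain five-column token line
        prob, _nspam, nham, _atime, token = m.groups()
        if float(prob) < min_prob or int(nham) != 0:
            continue
        if token.startswith("N:"):
            continue  # digits are masked, so this is not a real host name
        host = PREFIX.sub("", token).lower()
        if host.startswith("www."):
            host = host[4:]
        domains.add(host)
    return domains
```

It could be fed with the saved output of the sa-learn command above,
one line per entry.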

I'm thinking of writing a script that can use this information and
can filter the spam mbox to find the full URL patterns. These URL
patterns can then be used to write custom rules, or to extend the
bigevil ruleset.
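
A rough sketch of what such a script might look like in Python,
using the standard mailbox module. The names matching_urls and
scan_mbox are hypothetical, the URL regex is deliberately simple
(it does not handle user@host style URLs or encoded host names),
and the domains argument is assumed to be a set of bare domain
names such as "mytoyz.biz".

```python
import mailbox
import re
from collections import Counter

# Simplified URL pattern: scheme, host, then everything up to
# whitespace, a quote or an angle bracket.
URL_RE = re.compile(r"https?://([A-Za-z0-9.-]+)([^\s\"'<>]*)", re.IGNORECASE)


def matching_urls(text, domains):
    """Return URLs in text whose host is a listed domain or a subdomain of one."""
    found = []
    for m in URL_RE.finditer(text):
        host = m.group(1).lower().rstrip(".")
        if any(host == d or host.endswith("." + d) for d in domains):
            found.append(m.group(0))
    return found


def scan_mbox(path, domains):
    """Count the matching full URLs over all text parts of an mbox file."""
    counts = Counter()
    for msg in mailbox.mbox(path):
        for part in msg.walk():
            if part.get_content_maintype() != "text":
                continue
            payload = part.get_payload(decode=True) or b""
            # latin-1 never fails to decode; good enough for finding URLs
            text = payload.decode("latin-1", errors="replace")
            counts.update(matching_urls(text, domains))
    return counts
```

The counts returned by scan_mbox would show which full URL patterns
occur most often, which is the information needed to write a rule.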

Some questions:
- Does this sound like a good idea?
- Is the source list of domains used in bigevil available?
- Making it easier to contribute URLs to bigevil might increase
  the number of false positives. How can this be prevented?
  (E.g. accepting only tokens with a 0.999 Bayes score, having the
  contributor check that the URL parts don't occur in ham, requiring
  that the domain name exists in DNS, logging the contributor,
  requiring an example spam mail with each contribution, etc.)
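
Two of the checks suggested above (not present in ham, domain name
exists in DNS) could be sketched like this in Python. The function
names are hypothetical, ham_texts is assumed to be an iterable of
already decoded ham message bodies, and the DNS check only asks for
an address record on the bare name, so a domain that exists but has
no such record would be reported as not resolving.

```python
import socket


def resolves(domain, lookup=socket.gethostbyname):
    """True if the name resolves; lookup can be swapped out for testing."""
    try:
        lookup(domain)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def vet_candidate(domain, ham_texts, lookup=socket.gethostbyname):
    """Return the reasons to reject a domain; an empty list means it passed."""
    problems = []
    needle = domain.lower()
    # Plain substring search: strict on purpose, to avoid false positives.
    if any(needle in text.lower() for text in ham_texts):
        problems.append("appears in ham")
    if not resolves(needle, lookup):
        problems.append("does not resolve")
    return problems
```

The substring test is stricter than necessary, but erring on that
side seems right when the goal is to avoid false positives.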

Suggestions are welcome,
Regards,
Pieter

-- 
http://zwiki.org/PieterB


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
