While I still plan for this to be used primarily via rsync and a
spamassassin plugin, I've loaded the data into DNS records and created
spamassassin rules so it can easily be tested now.  The DNS data is
updated automatically once a day.

I'm hoping this will encourage people to contribute data, because now you
should get an immediate improvement in your spam filtering, based on the
data you've provided on which IPs send you ham and spam.

More info, including the script to submit data (either from spam/ham
folders, or individual emails piped to standard input) here:
http://www.chaosreigns.com/iprep/

The spamassassin rules:


ifplugin Mail::SpamAssassin::Plugin::DNSEval
# Look up the first trusted relay's IP in the iprep zone.
header  __RCVD_IN_IPREP   eval:check_rbl('iprep-firsttrusted', 'iprep.chaosreigns.com.')
tflags  __RCVD_IN_IPREP   nice net

# The sub-tests match on the last octet of the returned 127.x.x.N record.
header   RCVD_IN_IPREPDNS_100       eval:check_rbl_sub('iprep-firsttrusted', '^127\.\d+\.\d+\.100$')
describe RCVD_IN_IPREPDNS_100       Sender listed at http://www.chaosreigns.com/iprep/, 100% ham
tflags   RCVD_IN_IPREPDNS_100       nice net

header   RCVD_IN_IPREPDNS_50        eval:check_rbl_sub('iprep-firsttrusted', '^127\.\d+\.\d+\.50$')
describe RCVD_IN_IPREPDNS_50        Sender listed at http://www.chaosreigns.com/iprep/, 50% ham
tflags   RCVD_IN_IPREPDNS_50        nice net

header   RCVD_IN_IPREPDNS_0         eval:check_rbl_sub('iprep-firsttrusted', '^127\.\d+\.\d+\.0$')
describe RCVD_IN_IPREPDNS_0         Sender listed at http://www.chaosreigns.com/iprep/, 0% ham
tflags   RCVD_IN_IPREPDNS_0         net

# Fires when none of the three rules above hit and the mail had relays.
meta     RCVD_NOT_IN_IPREPDNS       ( ! RCVD_IN_IPREPDNS_100 && ! RCVD_IN_IPREPDNS_50 && ! RCVD_IN_IPREPDNS_0 && ! NO_RELAYS )
describe RCVD_NOT_IN_IPREPDNS       Sender not listed at http://www.chaosreigns.com/iprep/
tflags   RCVD_NOT_IN_IPREPDNS       net

score RCVD_IN_IPREPDNS_100 -0.1
score RCVD_IN_IPREPDNS_50  -0.0001
score RCVD_IN_IPREPDNS_0    0.1
score RCVD_NOT_IN_IPREPDNS  0.0001
endif
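
For anyone who wants to poke at the DNS data outside spamassassin, here is
a rough Python sketch of what the rules above amount to.  It assumes the
zone is queried the usual DNSBL way (reversed IPv4 octets prepended to the
zone name) and that only the last octet of the returned 127.x.x.N record
matters, as the rule patterns suggest.  The function names are my own
invention, and the sketch ignores the NO_RELAYS part of the meta rule.

```python
import ipaddress

ZONE = "iprep.chaosreigns.com"  # zone name taken from the rules above


def iprep_query_name(ip, zone=ZONE):
    """Build a DNSBL-style query name: reversed IPv4 octets + zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def rule_for_answer(answer):
    """Map a returned A record (or None if unlisted) to a rule name.

    Assumption: only the last octet is significant, and anything that is
    not 100, 50 or 0 falls through to RCVD_NOT_IN_IPREPDNS.
    """
    rules = {
        "100": "RCVD_IN_IPREPDNS_100",
        "50": "RCVD_IN_IPREPDNS_50",
        "0": "RCVD_IN_IPREPDNS_0",
    }
    if answer is None:
        return "RCVD_NOT_IN_IPREPDNS"
    octets = answer.split(".")
    if len(octets) != 4 or octets[0] != "127":
        return "RCVD_NOT_IN_IPREPDNS"
    return rules.get(octets[3], "RCVD_NOT_IN_IPREPDNS")
```

For example, iprep_query_name("192.0.2.10") gives the name
10.2.0.192.iprep.chaosreigns.com, which you could then resolve with any
DNS tool and pass the resulting address to rule_for_answer().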



For people not contributing data, this is not likely to be useful yet.

Out of the 86,899 IPs I have data for, all but 38 are either 100% spam or
100% ham, so an IP's history is a great predictor of what the next email
from that IP will be.  This is nothing new: it is the idea behind
blacklists and whitelists, including spamassassin's AWL (which is another
combination of both).

The advantages I'm providing over SA's AWL are:
1) It's based on human verified ham and spam, not SA's previous opinions of
   emails.
2) Shared knowledge from other people's email.

What I hope to be an advantage over dnswl.org, which I've been involved in,
is increased automation.


Here's a test I ran using only the last 500 of my own emails, all hand
categorized as spam or ham and sorted by received date.  One by one, it
learns each email's IP as a ham source, spammer, or mix and, using what it
has learned so far, guesses what the next email is.  Every 100 emails it
reports its success rate for the last 100 emails:

$ ./progress.pl
Rank 100, hit 51.7647058823529% of ham, hit 0% of spam.
Rank 50, hit 0% of ham, hit 0% of spam.
Rank 0, hit 0% of ham, hit 0% of spam.
Rank none, hit 48.2352941176471% of ham, hit 100% of spam.

Rank 100, hit 76% of ham, hit 0% of spam.
Rank 50, hit 0% of ham, hit 0% of spam.
Rank 0, hit 0% of ham, hit 28% of spam.
Rank none, hit 24% of ham, hit 72% of spam.

Rank 100, hit 72.3684210526316% of ham, hit 0% of spam.
Rank 50, hit 0% of ham, hit 0% of spam.
Rank 0, hit 0% of ham, hit 4.16666666666667% of spam.
Rank none, hit 27.6315789473684% of ham, hit 95.8333333333333% of spam.

Rank 100, hit 79.4520547945205% of ham, hit 0% of spam.
Rank 50, hit 0% of ham, hit 0% of spam.
Rank 0, hit 0% of ham, hit 48.1481481481481% of spam.
Rank none, hit 20.5479452054795% of ham, hit 51.8518518518519% of spam.

Rank 100, hit 79.2682926829268% of ham, hit 0% of spam.
Rank 50, hit 0% of ham, hit 0% of spam.
Rank 0, hit 0% of ham, hit 27.7777777777778% of spam.
Rank none, hit 20.7317073170732% of ham, hit 72.2222222222222% of spam.
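
The test procedure above could be sketched in Python roughly as follows.
This is not progress.pl itself, just a minimal illustration of the
predict-then-learn loop.  It assumes an IP seen with both ham and spam is
ranked 50 and an IP never seen before is ranked "none"; the function
names and the (ip, is_ham) input format are made up for the example.

```python
from collections import defaultdict

RANKS = ("100", "50", "0", "none")


def rank(ham, spam):
    """Rank an IP from the ham/spam counts seen so far."""
    if ham == 0 and spam == 0:
        return "none"  # IP not seen before
    if spam == 0:
        return "100"   # only ham so far
    if ham == 0:
        return "0"     # only spam so far
    return "50"        # mix (assumed bucket for mixed senders)


def progressive_test(emails, window=100):
    """emails: (ip, is_ham) pairs sorted by received date.

    After every `window` emails, yield {rank: (% of ham hit, % of spam hit)}
    for that window only.  A trailing partial window is not reported.
    """
    seen = defaultdict(lambda: [0, 0])  # ip -> [ham count, spam count]
    hits = defaultdict(lambda: [0, 0])  # rank -> [ham hits, spam hits]
    totals = [0, 0]                     # [ham, spam] in current window
    for n, (ip, is_ham) in enumerate(emails, start=1):
        idx = 0 if is_ham else 1
        hits[rank(*seen[ip])][idx] += 1  # guess from what is known so far
        totals[idx] += 1
        seen[ip][idx] += 1               # then learn from this email
        if n % window == 0:
            yield {
                r: tuple(100.0 * h / t if t else 0.0
                         for h, t in zip(hits[r], totals))
                for r in RANKS
            }
            hits.clear()
            totals = [0, 0]
```

The important detail is the order inside the loop: each email is scored
against what was learned from earlier emails only, and is added to the
data afterwards.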


So after 400 emails, RCVD_IN_IPREPDNS_100 is hitting 79% of ham and no
spam.  I don't think anything else spamassassin uses can do this well.

But I have data from 184,335 emails.  Using all that data, results for
the last 10,000 emails were:

Rank 100, hit 94.1176470588235% of ham, hit 0.0101553772722657% of spam.
Rank 50, hit 1.30718954248366% of ham, hit 0.0101553772722657% of spam.
Rank 0, hit 0% of ham, hit 64.2022951152635% of spam.
Rank none, hit 4.57516339869281% of ham, hit 35.7773941301919% of spam.

RCVD_IN_IPREPDNS_100 hits 94% of ham, and 0.01% of spam.
RCVD_IN_IPREPDNS_0 hits 64% of spam and no ham.  Again, I don't think
anything else spamassassin uses can do this well.  

But results this good can only be expected for people contributing data,
at least until we get more people contributing data.

-- 
"The price of freedom is the willingness to do sudden battle, anywhere,
at any time, and with utter recklessness." - Robert A. Heinlein
http://www.ChaosReigns.com
