Hello,

I created a Python script that can process Unix mbox files and
generate statistics on the Bayesian filtering of SpamAssassin. 

I ran the script (called 'spamp.py') on a Unix mbox containing
a total of 4381 messages. The mbox contained 2615 Dutch non-spam
messages, 1 misclassified English non-spam message and 728 spam
messages (all from December 2003). The mbox also contained 133
messages that didn't have an "X-Spam-Status" header, and another 904
messages that had the header but no BAYES_* test.
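
For anyone who wants to reproduce this kind of count, the tallying can be
sketched roughly as follows. This is not the code of spamp.py, only a
minimal sketch assuming Python 3 and the standard mailbox module. The
names tally_bayes and tally_mbox are made up for illustration, and the
verdict is simply read from the "Yes"/"No" at the start of the
X-Spam-Status header.

```python
import mailbox
import re
from collections import Counter

# Matches the first BAYES_nn test name in an X-Spam-Status header.
BAYES_RE = re.compile(r"\bBAYES_\d+\b")


def tally_bayes(messages):
    """Count (BAYES test, verdict) pairs over an iterable of messages.

    Each message only needs a .get(header_name) method, so both
    mailbox messages and plain dicts work (illustrative helper).
    """
    counts = Counter()
    total = no_saheader = no_bayes = 0
    for msg in messages:
        total += 1
        status = msg.get("X-Spam-Status")
        if status is None:
            no_saheader += 1
            continue
        # Unfold the header: collapse line breaks and indentation.
        status = " ".join(str(status).split())
        match = BAYES_RE.search(status)
        if match is None:
            no_bayes += 1
            continue
        verdict = "spam" if status.lower().startswith("yes") else "non-spam"
        counts[(match.group(0), verdict)] += 1
    return counts, total, no_saheader, no_bayes


def tally_mbox(path):
    """Run the tally over a Unix mbox file (hypothetical wrapper)."""
    return tally_bayes(mailbox.mbox(path))
```

Calling tally_mbox() with the path of an mbox file returns the per-test
counts together with the total, the number of messages without the
header, and the number with the header but without a BAYES_* test.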

The script produced the following results:

SPAMP_REPORT_BAYES, (#=4381, no_saheader=133, no_bayes=904)
===========================================================

Bayes     Non-spam      Spam     Total
--------------------------------------
BAYES_00      1836         2      1838
BAYES_01       567         1       568
BAYES_10       109         0       109
BAYES_20        61         0        61
BAYES_30        29         0        29
BAYES_40         2         0         2
BAYES_44         7         2         9
BAYES_50         2         6         8
BAYES_60         1         0         1
BAYES_70         0         4         4
BAYES_80         0         5         5
BAYES_90         1         7         8
BAYES_99         0       702       702
--------------------------------------
Total:        2615       729      3344


----------------------------------------------------------------

Interpretation of the results
=============================

So only 1 out of the 3344 e-mails was misclassified! 
I'm not very unsatisfied with my Bayesian filter ;)
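
As a quick sanity check on the numbers, the column totals and the error
rate can be recomputed from the table. The snippet below is only a
sketch: the counts are copied by hand from the report, and the names
REPORT and error_rate are chosen for illustration.

```python
# (non-spam, spam) counts per BAYES_* test, copied from the report.
REPORT = {
    "BAYES_00": (1836, 2),
    "BAYES_01": (567, 1),
    "BAYES_10": (109, 0),
    "BAYES_20": (61, 0),
    "BAYES_30": (29, 0),
    "BAYES_40": (2, 0),
    "BAYES_44": (7, 2),
    "BAYES_50": (2, 6),
    "BAYES_60": (1, 0),
    "BAYES_70": (0, 4),
    "BAYES_80": (0, 5),
    "BAYES_90": (1, 7),
    "BAYES_99": (0, 702),
}

ham_total = sum(ham for ham, _ in REPORT.values())
spam_total = sum(spam for _, spam in REPORT.values())
total = ham_total + spam_total

# One message (the English ham mail) ended up on the wrong side.
misclassified = 1
error_rate = misclassified / total

print(ham_total, spam_total, total)  # 2615 729 3344
print(f"{error_rate:.4%}")           # 0.0299%
```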

I then took a closer look at the spam mails with BAYES_00 or BAYES_01,
and the ham mail with BAYES_90:

X-Comment: This is a Japanese mail which I assume is correctly
           classified as spam (as far as I can see)
Subject: [Plone-developers] ?$BL$>5Bz9-9p"(
X-Spam-Status: Yes, hits=5.5 required=5.0 tests=BAYES_00,
        CHARSET_FARAWAY_HEADER,JAPANESE_UCE_SUBJECT,NO_REAL_NAME,
        RCVD_IN_BL_SPAMCOP_NET,UNWANTED_LANGUAGE_BODY autolearn=no
        version=2.60

X-Comment: this is definite spam, but why did it have BAYES_00?
Subject: [Zope] How to make $250.000.-
X-Spam-Status: Yes, hits=5.6 required=5.0 tests=AWL,BAYES_00,EARN_MONEY,
        FORGED_HOTMAIL_RCVD,HTML_FONT_BIG,HTML_MESSAGE,MIME_BASE64_LATIN,
        MIME_BASE64_TEXT,RAZOR2_CF_RANGE_51_100,RAZOR2_CHECK,
        RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_SORBS autolearn=no version=2.60

X-Comment: This mail is correctly classified as ham, but it has quite a
           high score and BAYES_90
From: "eBay" <[EMAIL PROTECTED]>
Subject: Who's left on your list?
X-Spam-Status: No, hits=4.3 required=5.0 tests=AWL,BAYES_90,EXCUSE_14,
        HTML_50_60,HTML_MESSAGE,HTML_TITLE_UNTITLED,OFFERS_ETC autolearn=no
        version=2.60

X-Comment: This is the only misclassified ham message I could find.
           The To header contains a lot of repetition, and it was sent using
           an Asian (?) mail client (I haven't obfuscated the e-mail address
           in this e-mail yet)
Subject: jahia.properties  Database
X-Spam-Status: Yes, hits=8.5 required=5.0 tests=AWL,BAYES_01,BODY_8BITS,
        CHARSET_FARAWAY_HEADER,DATE_IN_PAST_12_24,INVALID_DATE,
        MIME_BASE64_TEXT, MIME_CHARSET_FARAWAY, SORTED_RECIPS,
        SUSPICIOUS_RECIPS autolearn=no version=2.60
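
Headers like the ones quoted above are folded over several lines, so they
need to be unfolded before the test names can be picked out. A minimal
sketch of that step, assuming the "hits=... required=... tests=..."
layout shown in these examples (the function name parse_spam_status is
made up and is not taken from spamp.py):

```python
import re


def parse_spam_status(value):
    """Split an X-Spam-Status header value into its parts (sketch)."""
    # Unfold: collapse line breaks and continuation indentation.
    text = " ".join(value.split())
    is_spam = text.lower().startswith("yes")

    hits = re.search(r"hits=(-?\d+(?:\.\d+)?)", text)
    required = re.search(r"required=(-?\d+(?:\.\d+)?)", text)

    # The test list runs until the next "key=" field or the end of the
    # header; it may contain spaces after commas once unfolded.
    tests_match = re.search(r"tests=(.*?)(?:\s+\w+=|$)", text)
    tests = []
    if tests_match:
        tests = [t.strip() for t in tests_match.group(1).split(",") if t.strip()]

    return {
        "is_spam": is_spam,
        "hits": float(hits.group(1)) if hits else None,
        "required": float(required.group(1)) if required else None,
        "tests": tests,
    }


example = (
    "No, hits=4.3 required=5.0 tests=AWL,BAYES_90,EXCUSE_14,\n"
    "        HTML_50_60,HTML_MESSAGE,HTML_TITLE_UNTITLED,OFFERS_ETC autolearn=no\n"
    "        version=2.60"
)
parsed = parse_spam_status(example)
print(parsed["is_spam"], parsed["hits"], parsed["tests"])
```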

---------------------------

The mboxes containing these 4 messages can be found in my public corpus:
http://gewis.nl/~pieterb/spamp/publiccorpus/small/20031222-hardham.mbox.txt
http://gewis.nl/~pieterb/spamp/publiccorpus/small/20031222-hardspam.mbox.txt
(I did some obfuscation already)

Download spamp.py version 0.100 (alpha)
=======================================
The Python script can be downloaded from:

        http://www.gewis.nl/~pieterb/spamp/spamp.py/

The Python code and its documentation (called the 'work') are
licensed under the Creative Commons Attribution-ShareAlike 1.0
licence, see http://creativecommons.org/licenses/by-sa/1.0/
All disputes should be handled according to Dutch law.

All bug reports, questions, patches and feedback are welcome ;)

Regards,

PieterB

-- 
No matter what goes wrong, there is always somebody
who knew it would.


_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
