Hello, I created a Python script that can process Unix mbox files and generate statistics on the Bayesian filtering of SpamAssassin.

I've run the script (called 'spamp.py') on a Unix mbox containing a total
of 4381 messages. The mbox contained 2615 Dutch non-spam messages,
1 misclassified English non-spam message and 728 spam messages (all from
December 2003). The mbox also contained 133 messages that didn't have an
"X-Spam-Status" header, and 904 of the messages didn't have a BAYES_* test.

The script produced the following results:

SPAMP_REPORT_BAYES, (#=4381, no_saheader=133, no_bayes=904)
===========================================================
Bayes       Non-spam   Spam   Total
--------------------------------------
BAYES_00        1836      2    1838
BAYES_01         567      1     568
BAYES_10         109      0     109
BAYES_20          61      0      61
BAYES_30          29      0      29
BAYES_40           2      0       2
BAYES_44           7      2       9
BAYES_50           2      6       8
BAYES_60           1      0       1
BAYES_70           0      4       4
BAYES_80           0      5       5
BAYES_90           1      7       8
BAYES_99           0    702     702
--------------------------------------
Total:          2615    729    3344
----------------------------------------------------------------

Interpretation of the results
=============================

So only 1 out of the 3344 e-mails was misclassified! I'm not very
unsatisfied with my Bayesian filter ;)

I then took a closer look at the spam mails with BAYES_00 or BAYES_01,
and the ham mail with BAYES_90:

X-Comment: This is a Japanese mail which I assume is correctly
  classified as spam (as far as I can see)
Subject: [Plone-developers] ?$BL$>5Bz9-9p"(
X-Spam-Status: Yes, hits=5.5 required=5.0 tests=BAYES_00,
  CHARSET_FARAWAY_HEADER,JAPANESE_UCE_SUBJECT,NO_REAL_NAME,
  RCVD_IN_BL_SPAMCOP_NET,UNWANTED_LANGUAGE_BODY autolearn=no
  version=2.60

X-Comment: This is definite spam, but why did it have BAYES_00?
Subject: [Zope] How to make $250.000.-
X-Spam-Status: Yes, hits=5.6 required=5.0 tests=AWL,BAYES_00,EARN_MONEY,
  FORGED_HOTMAIL_RCVD,HTML_FONT_BIG,HTML_MESSAGE,MIME_BASE64_LATIN,
  MIME_BASE64_TEXT,RAZOR2_CF_RANGE_51_100,RAZOR2_CHECK,
  RCVD_IN_BL_SPAMCOP_NET,RCVD_IN_SORBS autolearn=no version=2.60

X-Comment: This mail is correctly classified as ham, but it has quite
  a high score and BAYES_90
From: "eBay" <[EMAIL PROTECTED]>
Subject: Who's left on your list?
X-Spam-Status: No, hits=4.3 required=5.0 tests=AWL,BAYES_90,EXCUSE_14,
  HTML_50_60,HTML_MESSAGE,HTML_TITLE_UNTITLED,OFFERS_ETC autolearn=no
  version=2.60

X-Comment: This is the only misclassified ham message I could find.
  The To header contains a lot of repetition, and it was sent using an
  Asian (?) mail client. (I didn't obfuscate the e-mail address in
  this e-mail yet.)
Subject: jahia.properties Database
X-Spam-Status: Yes, hits=8.5 required=5.0 tests=AWL,BAYES_01,BODY_8BITS,
  CHARSET_FARAWAY_HEADER,DATE_IN_PAST_12_24,INVALID_DATE,
  MIME_BASE64_TEXT, MIME_CHARSET_FARAWAY, SORTED_RECIPS,
  SUSPICIOUS_RECIPS autolearn=no version=2.60

---------------------------

The mailboxes containing these 4 messages can be found in my public corpus:

http://gewis.nl/~pieterb/spamp/publiccorpus/small/20031222-hardham.mbox.txt
http://gewis.nl/~pieterb/spamp/publiccorpus/small/20031222-hardspam.mbox.txt

(I did some obfuscation already)

Download spamp.py version 0.100 (alpha)
=======================================

The Python script can be downloaded from:
http://www.gewis.nl/~pieterb/spamp/spamp.py/

The Python code and its documentation (called the 'work') are licensed
under the Creative Commons Attribution-ShareAlike 1.0 licence, see
http://creativecommons.org/licenses/by-sa/1.0/
All disputes should be handled according to Dutch law.

All bugs, questions, patches and feedback are welcome ;)

Regards,

PieterB

-- 
No matter what goes wrong, there is always somebody who knew it would.
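
P.S. For anyone who wants to try this kind of tally without downloading
anything, here is a minimal sketch of the idea. It is not the code from
spamp.py: the names classify, tally and tally_mbox are made up for
illustration, and it assumes that the verdict ("Yes" or "No") is the first
word of the X-Spam-Status header and that at most one BAYES_* test appears
per message.

```python
import mailbox
import re
from collections import Counter

# Matches rule names such as BAYES_00 or BAYES_99 inside the tests= list.
BAYES_RE = re.compile(r"\bBAYES_\d+\b")


def classify(status):
    """Return (bucket, verdict) for one X-Spam-Status header value.

    bucket is 'no_saheader', 'no_bayes', or a rule name like 'BAYES_99'.
    verdict is 'spam', 'non-spam', or None when the header is missing.
    """
    if status is None:
        return "no_saheader", None
    status = str(status)  # the header may arrive as a Header object
    # Assumption: the header value starts with "Yes" for spam, "No" otherwise.
    is_spam = status.lstrip().lower().startswith("yes")
    verdict = "spam" if is_spam else "non-spam"
    match = BAYES_RE.search(status)  # also works across folded header lines
    bucket = match.group(0) if match else "no_bayes"
    return bucket, verdict


def tally(statuses):
    """Count (bucket, verdict) pairs over an iterable of header values."""
    counts = Counter()
    for status in statuses:
        counts[classify(status)] += 1
    return counts


def tally_mbox(path):
    """Tally the X-Spam-Status headers of every message in a Unix mbox."""
    box = mailbox.mbox(path, create=False)  # do not create a missing file
    try:
        return tally(msg.get("X-Spam-Status") for msg in box)
    finally:
        box.close()
```

Calling tally_mbox() on an mbox path returns a Counter keyed by
(bucket, verdict) pairs, which can then be printed as a table like the
one in the report above.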
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk