Bart Schaefer <[email protected]> writes:
> On Sun, Sep 15, 2013 at 7:53 PM, Harry Putnam <[email protected]> wrote:
>> I've been trying to `teach' SA to tell spam from ham in my mail system.
>>
>> I've made it thru two main learning sessions, running around 450
>> msgs thru sa-learn spam/ham each time, and yet SA still gets it
>> right no more than about 40% of the time, maybe less.
>
> You say you've run 1100 messages through -- have at least 200 of those
> been ham? Bayes won't kick in until 200 *each* of spam and ham are
> trained. You can run "sa-learn --dump magic" to see how many of each
> it believes it has seen.
Yes
> If you've sa-learned enough of both types, is it possible you haven't
> enabled bayes scoring? Are the BAYES_* rules showing up at all in the
> score details for newly arrived messages fed through spamc?
Yes, here is an example of a message rated as spam:
X-Spam-Report: * 3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
    *      [score: 0.9999]
    * 0.4 STOX_REPLY_TYPE STOX_REPLY_TYPE
    * 1.2 RCVD_NUMERIC_HELO Received: contains an IP address used for HELO
    * 1.8 STOX_REPLY_TYPE_WITHOUT_QUOTES STOX_REPLY_TYPE_WITHOUT_QUOTES
------- --------- ---=--- --------- --------
This message is a bit disorganized, but I'm experimenting as I go.
Below are the message counts and the 'magic' produced by my 2 learning
sessions:
675 msgs thru sa-learn --mbox --spam spam
228 msgs thru sa-learn --mbox --ham ham
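(For reference, the per-mbox counts like 675/228 can be taken with grep:
in mbox format every message starts with a "From " separator line. The
demo below runs on a tiny inline mbox; on the real files it would be
`grep -c '^From ' spam`.)

```shell
# Count messages in an mbox: each message begins with a "From " line.
# Demo on an inline two-message mbox; real use: grep -c '^From ' spam
grep -c '^From ' <<'EOF'
From sender@example.com Mon Sep 16 10:00:00 2013
Subject: one

first message body
From other@example.com Mon Sep 16 10:05:00 2013
Subject: two

second message body
EOF
```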
Resulting in this magic output:
reader > sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0        675          0  non-token data: nspam
0.000          0        214          0  non-token data: nham
0.000          0     117579          0  non-token data: ntokens
0.000          0 1369611901          0  non-token data: oldest atime
0.000          0 1374276652          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count
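(A quick way to pull out just the two counts that matter for the
200-each threshold. The awk field positions assume the column layout of
the dump; the dump lines are pasted inline here, but normally you'd pipe
`sa-learn --dump magic` straight into the awk.)

```shell
# Extract the spam/ham training counts from the magic dump:
# the count is field 3, the label is field 7. Normal use:
#   sa-learn --dump magic | awk '/nspam|nham/ { print $7, $3 }'
awk '/nspam|nham/ { print $7, $3 }' <<'EOF'
0.000          0        675          0  non-token data: nspam
0.000          0        214          0  non-token data: nham
EOF
```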
Now I'm running several thousand mixed spam/ham messages thru
procmail/SA with the magic as above.
------- --------- ---=--- --------- --------
.procmailrc consists of:
# -*-shell-script-*-
PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin
SHELL=/bin/sh
MAILDIR=/home/reader/projects/reader/proc/spool
LOGFILE=/home/reader/projects/reader/proc/log/log
ORGMAIL=/home/reader/projects/reader/proc/spool/$LOGNAME
DEFAULT=$ORGMAIL
VERBOSE=YES
LOG="Processing <$FILENO>
"
TRAP='formail -XMessage-Id: && date +"%b %d %T%nSTOP"'
PSCRIPTS="/home/reader/projects/perl"
SCRIPTS="/home/reader/scripts/"
MAILARC="/home/reader/proc/spool"
:0fw
| /usr/bin/spamc
:0:
* ^X-Spam-Status: Yes
spam
:0
ham
------- --------- ---=--- --------- --------
Local.cf looks like:
ok_locales en
report_safe 0
## Trusted network
trusted_networks 192.168.1.
use_bayes 1
bayes_auto_learn 0
------- --------- ---=--- --------- --------
The file sizes below show what happens with no learning sessions:
-rw------- 1 reader nfsu 16878376 Sep 16 10:45 ham
-rw------- 1 reader nfsu 4406449 Sep 16 10:45 spam
The ham file is roughly 4 times the size of the spam file, yet my
actual ham is probably something like 10-12% of my mail... probably
less. So far too much mail is being registered as ham, when it should
be the other way around. But that is with no learning.
------- --------- ---=--- --------- --------
Below is the relative size of ham/spam at the 3825 mark in message
count. The ham file is still way over what it should be, even though
this is after the learning sessions that produced the 'magic' posted
above.
So I guess that is significant improvement, although it seems like it
should be a good bit better: here the ratio is closer to 3:1, while
above it was closer to 4:1.
reader > lsp
total 119741
-rw------- 1 reader nfsu 92106819 Sep 16 16:36 ham
-rw------- 1 reader nfsu 30382534 Sep 16 16:36 spam
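(The ratios I keep quoting are just the byte-size ratios of the two
spool files. A sketch of that arithmetic, with the sizes above
hard-coded; on the live files you'd take them from `ls -l` or
`stat -c %s ham spam` instead.)

```shell
# Ham:spam ratio from the spool file sizes shown above (hard-coded
# here; on live files: stat -c %s ham spam).
ham=92106819
spam=30382534
awk -v h="$ham" -v s="$spam" 'BEGIN { printf "ham:spam ~ %.1f:1\n", h/s }'
```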
------- --------- ---=--- --------- --------
Do you think the ratio shown above is about normal for the amount of
learning done?