Bart Schaefer <[email protected]> writes:

> On Sun, Sep 15, 2013 at 7:53 PM, Harry Putnam <[email protected]> wrote:
>> I've been trying to `teach' SA to separate spam from ham in my mail system.
>>
>> I've made it through two main learning sessions where I ran around 450
>> msgs (each time) through sa-learn as spam/ham, and yet SA is still
>> incapable of getting it right more than about 40% of the time, maybe less.
>
> You say you've run 1100 messages through -- have at least 200 of those
> been ham?  Bayes won't kick in until 200 *each* of spam and ham are
> trained.  You can run "sa-learn --dump magic" to see how many of each
> it believes it has seen.

Yes

> If you've sa-learned enough of both types, is it possible you haven't
> enabled bayes scoring?  Are the BAYES_* rules showing up at all in the
> score details for newly arrived messages fed through spamc?

Yes, here is an example of a message rated as spam:

X-Spam-Report: *  3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
        *      [score: 0.9999]
        *  0.4 STOX_REPLY_TYPE STOX_REPLY_TYPE
        *  1.2 RCVD_NUMERIC_HELO Received: contains an IP address used for
        HELO
        *  1.8 STOX_REPLY_TYPE_WITHOUT_QUOTES STOX_REPLY_TYPE_WITHOUT_QUOTES

-------        ---------       ---=---       ---------      -------- 

This message is a bit disorganized, but I'm still experimenting all
through this.

Below are the message counts and the 'magic' output produced by my two
learning sessions:

  675 msgs thru sa-learn --mbox --spam spam
  228 msgs thru sa-learn --mbox --ham  ham

Resulting in this magic output:
reader > sa-learn --dump magic 
0.000     0          3     0  non-token data: bayes db version
0.000     0        675     0  non-token data: nspam
0.000     0        214     0  non-token data: nham
0.000     0     117579     0  non-token data: ntokens
0.000     0 1369611901     0  non-token data: oldest atime
0.000     0 1374276652     0  non-token data: newest atime
0.000     0          0     0  non-token data: last journal sync atime
0.000     0          0     0  non-token data: last expiry atime
0.000     0          0     0  non-token data: last expire atime delta
0.000     0          0     0  non-token data: last expire reduction count
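As an aside, the nspam/nham counts can be pulled out of that dump
mechanically.  A minimal sketch with awk -- the two relevant lines from
the dump above are inlined in a here-doc so it runs standalone, but
normally you would pipe `sa-learn --dump magic` straight in.  (Note the
dump shows 214 ham although 228 were fed; presumably sa-learn skipped
some messages it had already seen.)

```shell
# Extract the trained message counts from `sa-learn --dump magic` output.
# The two relevant lines from the dump above are inlined here; on a live
# system you would run:  sa-learn --dump magic | awk '...'
awk '/nspam/ {print "spam trained: " $3}
     /nham/  {print "ham trained: " $3}' <<'EOF'
0.000     0        675     0  non-token data: nspam
0.000     0        214     0  non-token data: nham
EOF
```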


Now I'm running several thousand mixed spam/ham messages through
procmail/SA with the magic as above.
 -------        ---------       ---=---       ---------      -------- 

.procmailrc consists of:

# -*- shell-script -*-
PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin
SHELL=/bin/sh
MAILDIR=/home/reader/projects/reader/proc/spool
LOGFILE=/home/reader/projects/reader/proc/log/log
ORGMAIL=/home/reader/projects/reader/proc/spool/$LOGNAME
DEFAULT=$ORGMAIL
VERBOSE=YES 
LOG="Processing <$FILENO>
"
TRAP='formail -XMessage-Id: && date +"%b %d %T%nSTOP"'

PSCRIPTS="/home/reader/projects/perl"
SCRIPTS="/home/reader/scripts/"
MAILARC="/home/reader/proc/spool"

:0fw
| /usr/bin/spamc

:0:
* ^X-Spam-Status: Yes
spam

:0
ham
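
For what it's worth, the `^X-Spam-Status: Yes` condition in that recipe
is just an egrep-style regex applied to the message, so it can be
sanity-checked by hand.  A minimal sketch -- the sample headers here are
made up, not from a real message:

```shell
# Check that the recipe's condition matches a spamc-tagged header.
# Sample headers are inlined; normally you would test against a real
# message, e.g.:  spamc < msg | grep '^X-Spam-Status: Yes'
grep -c '^X-Spam-Status: Yes' <<'EOF'
X-Spam-Status: Yes, score=6.9 required=5.0 tests=BAYES_99
Subject: test message
EOF
```

grep -c prints the number of matching header lines, so anything other
than 0 means the recipe would file the message into spam.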

-------        ---------       ---=---       ---------      -------- 

local.cf looks like:

ok_locales en
report_safe 0

## Trusted network
trusted_networks 192.168.1.

use_bayes 1

bayes_auto_learn 0

-------        ---------       ---=---       ---------      -------- 

The file sizes below show what happens with no learning
sessions.

-rw------- 1 reader nfsu 16878376 Sep 16 10:45 ham
-rw------- 1 reader nfsu  4406449 Sep 16 10:45 spam

There is far more mail in the ham file than in the spam file, yet my
actual ham is probably only something like 10-12% of my mail... probably
less.  So roughly four times more mail is registered as ham than there
should be, i.e. most of that 'ham' is really spam.  But that is with no
learning.

-------        ---------       ---=---       ---------      -------- 

Below is the relative size of the ham and spam files at the 3825-message
mark.  The ratio is still far above what it should be, even though this
run comes after the learning sessions that produced the 'magic' posted
above.

So I guess that is a significant improvement, although it seems like it
should be a good bit better.  Here the ratio is closer to 3:1, whereas
above it is closer to 4:1.

reader > lsp
total 119741
-rw------- 1 reader nfsu 92106819 Sep 16 16:36 ham
-rw------- 1 reader nfsu 30382534 Sep 16 16:36 spam
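
Just to put numbers on those ratios, they can be computed straight from
the byte counts (sizes inlined here from the two listings above):

```shell
# ham:spam byte ratio before learning (16878376 / 4406449) and at the
# 3825-message mark after learning (92106819 / 30382534).
awk 'BEGIN {
    printf "before learning: %.1f:1\n", 16878376 / 4406449
    printf "after learning:  %.1f:1\n", 92106819 / 30382534
}'
```

That comes out to about 3.8:1 before and 3.0:1 after, matching the rough
4:1 and 3:1 figures above.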

-------        ---------       ---=---       ---------      -------- 

Do you think the ratio shown above is about normal for the amount of
learning done?


