    Dhaval> I also know that training the same messages twice is not a good
    Dhaval> thing.  Are there flags which will not train any message which
    Dhaval> has already been trained?

Dunno, but in the contrib directory of the CVS repository (does contrib make
it into distributions?) there is a fuzzy checksum program (pycksum.py) I
wrote a long time ago based upon a similar tool developed by the
SpamAssassin folks.  If you pipe your mails through it before training it
might do a reasonable job of deleting putative duplicates.  If they are true
duplicates you can do something similar: just replace the guts of the
generate_checksum() function with a plain MD5 digest of the message,
something like hashlib.md5(data).hexdigest().
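
For the true-duplicate case, here is a minimal sketch of the idea.  It is not
the code in pycksum.py, and the names exact_checksum() and drop_duplicates()
are made up for illustration.  It assumes that two copies of a message share
an identical body and differ only in their headers, so it hashes the body
parts and ignores the headers entirely.

```python
import hashlib
from email import message_from_string


def exact_checksum(msg):
    """Return an MD5 hex digest of the message body, ignoring all headers.

    Hypothetical stand-in for a fuzzy checksum: it only catches messages
    whose decoded body parts are byte-for-byte identical.
    """
    digest = hashlib.md5()
    for part in msg.walk():
        if part.is_multipart():
            # Containers have no payload of their own; their children
            # are visited by walk().
            continue
        digest.update(part.get_payload(decode=True) or b"")
    return digest.hexdigest()


def drop_duplicates(messages):
    """Keep the first message seen for each checksum, preserving order."""
    seen = set()
    unique = []
    for msg in messages:
        key = exact_checksum(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique


# Two copies of the same body with different headers, plus one other mail.
first = message_from_string("Subject: offer\n\nbuy now\n")
copy = message_from_string("Subject: offer\nX-Hammie: spam\n\nbuy now\n")
other = message_from_string("Subject: lunch\n\nnoon?\n")
kept = drop_duplicates([first, copy, other])
```

Run over a ham or spam collection before training, this keeps the first copy
of each message and drops the later ones.  Near-duplicates (a changed
tracking URL, say) still get through, which is what the fuzzy checksum is
for.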

    Dhaval> If possible, it would be helpful if you show me the flags you
    Dhaval> use when training initially after a fresh db is made, and the
    Dhaval> flags you use for ongoing training.

I never incrementally train.  I use my train-to-exhaustion script (tte.py,
also in the contrib directory) fronted by a small shell script that sets
parameters, cleans the mails, etc.  It doesn't sound like you do incremental
training either.  I don't expect you will be able to use it as-is, but I've
attached it as something you can use as a starting point for tte.py
experimentation.

Skip

#!/bin/bash

# various generated files
## DB=$HOME/tmp/tte.db
LOG=$HOME/tmp/tte.log

# these are tighter than my actual scoring thresholds
HC=${HC:-0.03}
SC=${SC:-0.75}

TTEPY=$HOME/src/spambayes/contrib/tte.py

PYTHON=python
EXPIMP=sb_dbexpimp.py
UNHDR=sb_unheader.py

# can override these from the environment but I never do...
NEWHAM=${NEWHAM:-$HOME/tmp/newham}
NEWSPAM=${NEWSPAM:-$HOME/tmp/newspam}

# output of preliminary cleaning
OLDHAM=${NEWHAM}.old
OLDSPAM=${NEWSPAM}.old

# ratio of spam to ham
RATIO=${RATIO:-1:1}

# there must be at least one new ham or spam message to consider retraining
if [ -f $NEWHAM -o -f $NEWSPAM ] ; then
    # clean up ham and spam collections, removing spambayes headers
    echo cleaning $NEWHAM
    touch $NEWHAM
    $UNHDR -p 'X-Hammie|X-Spam' $NEWHAM >> $OLDHAM
    chmod 600 $OLDHAM
    rm $NEWHAM

    echo cleaning $NEWSPAM
    touch $NEWSPAM
    $UNHDR -p 'X-Hammie|X-Spam' $NEWSPAM >> $OLDSPAM
    chmod 600 $OLDSPAM
    rm $NEWSPAM

    echo $TTEPY > $LOG
    if $PYTHON $TTEPY \
        --ratio=$RATIO \
          -g $OLDHAM \
          -s $OLDSPAM \
          -R \
          -o Categorization:ham_cutoff:$HC \
          -o Categorization:spam_cutoff:$SC \
          -c .cull \
          -v 2>> $LOG ; then
        chmod 600 $OLDHAM.cull $OLDSPAM.cull
        exit $?
    else
        echo "db generation failed - check tte.log"
        exit 1
    fi
else
    echo "nothing new to train on"
fi