Dhaval> I also know that training the same messages twice is not a good
Dhaval> thing. Are there flags which will not train any message which
Dhaval> has already been trained?
Dunno, but in the contrib directory of the CVS repository (does contrib make
it into distributions?) there is a fuzzy checksum program (pycksum.py) I
wrote a long time ago based upon a similar tool developed by the
SpamAssassin folks. If you pipe your mails through it before training it
might do a reasonable job of deleting putative duplicates. If they are true
(byte-for-byte) duplicates you can do something similar: just replace the
guts of the generate_checksum() function with an exact hash, something like
hashlib.md5(msg).hexdigest().
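
For the exact-duplicate case, here is a minimal standalone sketch of the
idea. It does not use pycksum.py at all; the function names and the mbox
paths are made up for illustration, and it assumes Python 3 and mbox-format
files. Note that it hashes the whole message, headers included, so two
copies of a mail that arrived by different routes will not match.

```python
import hashlib
import mailbox


def exact_checksum(raw_bytes):
    """Return the MD5 hex digest of the raw message bytes."""
    return hashlib.md5(raw_bytes).hexdigest()


def drop_exact_duplicates(messages):
    """Yield each message (as bytes) whose checksum has not been seen yet."""
    seen = set()
    for raw in messages:
        digest = exact_checksum(raw)
        if digest not in seen:
            seen.add(digest)
            yield raw


def dedupe_mbox(src_path, dst_path):
    """Copy src_path to dst_path, skipping byte-identical messages.

    Both paths are hypothetical; dst_path is created if it is missing.
    """
    src = mailbox.mbox(src_path)
    dst = mailbox.mbox(dst_path)
    try:
        raws = (msg.as_bytes() for msg in src)
        for raw in drop_exact_duplicates(raws):
            dst.add(raw)
        dst.flush()
    finally:
        src.close()
        dst.close()
```

You would run something like dedupe_mbox("newspam", "newspam.unique") on
each collection before handing it to the trainer.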
Dhaval> If possible, it would be helpful if you show me the flags you
Dhaval> use when training initially after a fresh db is made, and the
Dhaval> flags you use for ongoing training.
I never incrementally train. I use my train-to-exhaustion script (tte.py,
also in the contrib directory) fronted by a small shell script that sets
parameters, cleans the mails, etc. It doesn't sound like you do incremental
training either. I don't expect you will be able to use it as-is, but I've
attached it as something you can use as a starting point for tte.py
experimentation.
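
In case the term is unfamiliar, the general train-to-exhaustion idea is
roughly: score every message, train only on the ones that land on the wrong
side of the cutoffs, and repeat until a full pass produces no misses. The
sketch below illustrates that loop only. It is not tte.py's code;
ToyClassifier is a made-up stand-in for the real classifier, and the
function name, the max_rounds cap and the message ordering are my
simplifications.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier; scores by token counts."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, tokens, is_spam):
        """Count each distinct token once for the given class."""
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        """Return a spam score in [0, 1]; 0.5 means no evidence at all."""
        probs = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            if s + h:
                probs.append(s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5


def train_to_exhaustion(clf, hams, spams,
                        ham_cutoff=0.03, spam_cutoff=0.75, max_rounds=10):
    """Train only on misses until one pass is clean.

    Returns the number of the first clean round, or None if max_rounds
    passes were not enough.
    """
    labelled = [(h, False) for h in hams] + [(s, True) for s in spams]
    for round_no in range(1, max_rounds + 1):
        misses = 0
        for tokens, is_spam in labelled:
            score = clf.score(tokens)
            # A spam is a miss below spam_cutoff, a ham above ham_cutoff.
            missed = score < spam_cutoff if is_spam else score > ham_cutoff
            if missed:
                clf.learn(tokens, is_spam)
                misses += 1
        if misses == 0:
            return round_no
    return None
```

The cutoffs default to the same HC and SC values the shell script passes
in, which are tighter than my scoring thresholds so that borderline
messages get trained on too.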
Skip
#!/bin/bash
# various generated files
## DB=$HOME/tmp/tte.db
LOG=$HOME/tmp/tte.log
# these are tighter than my actual scoring thresholds
HC=${HC:-0.03}
SC=${SC:-0.75}
TTEPY=$HOME/src/spambayes/contrib/tte.py
PYTHON=python
EXPIMP=sb_dbexpimp.py
UNHDR=sb_unheader.py
# can override these from the environment but I never do...
NEWHAM=${NEWHAM:-$HOME/tmp/newham}
NEWSPAM=${NEWSPAM:-$HOME/tmp/newspam}
# output of preliminary cleaning
OLDHAM=${NEWHAM}.old
OLDSPAM=${NEWSPAM}.old
# ratio of spam to ham
RATIO=${RATIO:-1:1}
# there must be at least one new ham or spam message to consider retraining
if [ -f $NEWHAM -o -f $NEWSPAM ] ; then
    # clean up ham and spam collections, removing spambayes headers
    echo cleaning $NEWHAM
    touch $NEWHAM
    $UNHDR -p 'X-Hammie|X-Spam' $NEWHAM >> $OLDHAM
    chmod 600 $OLDHAM
    rm $NEWHAM
    echo cleaning $NEWSPAM
    touch $NEWSPAM
    $UNHDR -p 'X-Hammie|X-Spam' $NEWSPAM >> $OLDSPAM
    chmod 600 $OLDSPAM
    rm $NEWSPAM
    echo $TTEPY > $LOG
    if $PYTHON $TTEPY \
        --ratio=$RATIO \
        -g $OLDHAM \
        -s $OLDSPAM \
        -R \
        -o Categorization:ham_cutoff:$HC \
        -o Categorization:spam_cutoff:$SC \
        -c .cull \
        -v 2>> $LOG ; then
        chmod 600 $OLDHAM.cull $OLDSPAM.cull
        exit $?
    else
        echo "db generation failed - check tte.log"
        exit 1
    fi
else
    echo "nothing new to train on"
fi