On Apr 25, 2007, at 4:30 PM, Arik Raffael Funke wrote:
I am now probably venturing off-topic on my own thread but the point you make is interesting: You train only misfiled messages. What about new but correctly filed messages? You _never_ train on them? Given that bayes is a statistical method, is it really sufficient to only train on the mis-files?

the nightly cron job trained against the spam folder and a subset of the read folders likely to have spam in them (archive, recent working folders, etc.). i'd periodically retrain across the entire mail tree. retraining only on specific misfiled messages handles both spam and ham.

retraining only on misfiles is not as accurate as training on all mail, but is a lot lighter weight, so i can run it every 5 minutes instead of every night.
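
For illustration only, here is a minimal sketch of the train-on-error idea, assuming a toy token-count Bayes model. It is not the filter's actual implementation: the names `TinyBayes` and `train_on_error`, the add-one smoothing, and the whitespace tokenizer are all hypothetical choices made for this example.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy naive Bayes over whitespace tokens (hypothetical, for illustration)."""

    def __init__(self):
        # Per-class count of messages containing each token.
        self.tokens = {"spam": Counter(), "ham": Counter()}
        # Per-class count of trained messages.
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct token once per message.
        self.tokens[label].update(set(text.lower().split()))
        self.messages[label] += 1

    def spam_score(self, text):
        # Sum of per-token log-odds with add-one smoothing; > 0 leans spam.
        score = 0.0
        for tok in set(text.lower().split()):
            p_spam = (self.tokens["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.tokens["ham"][tok] + 1) / (self.messages["ham"] + 2)
            score += math.log(p_spam / p_ham)
        return score

    def classify(self, text):
        return "spam" if self.spam_score(text) > 0 else "ham"


def train_on_error(model, text, true_label):
    """Train only when the model misfiled the message; return True if it trained."""
    if model.classify(text) != true_label:
        model.train(text, true_label)
        return True
    return False
```

In this sketch a correctly filed message leaves the counts untouched, so token weights only move when a misfile is corrected. Training on all mail would call `model.train` unconditionally instead, which keeps the counts closer to current traffic at the cost of processing every message.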

The proportional spam/ham weight of keywords would in this case not be adjusted in the database if/when they change in your mail traffic, would it? Are you not encountering a higher number of mis-files compared to your previous learning practice?

the number of misfiles i get is so low that it's hard to tell if there's a difference. i periodically get floods of new false negatives, but those typically correct after the first few are retrained. when retraining across the entire mail spool the problems usually corrected after the first night.

-faisal
