Do you have reason to believe that incremental training on the messages you're currently receiving would be ineffective? I retrain from scratch periodically, and I generally find that a remarkably small corpus (maybe a couple of dozen trained messages in total) is effective. I retrain in part because I suspect that the content of the spam I receive changes over time, so training performed on messages from the distant past (say, six months ago) may be irrelevant, or worse, for my current message stream.
One of the counter-intuitive things about SpamBayes is how little data it needs to go on. This makes retraining fast, easy, and (for me, at least) perversely rewarding.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of gpr
Sent: Thursday, January 03, 2008 1:44 PM
To: [email protected]
Subject: [Spambayes] Does SpamBayes support automatic selective training?

Hi,

I have a large number of good and spam messages (a few thousand) collected over a year and would like to use these for initial training. But I know that it is preferable to train with only a small subset of these messages (maybe a thousand: 500 spam and 500 ham) to keep my training db minimal, fast and effective.

My query is: do I need to manually pick out the latest thousand or so messages from this large corpus and feed them to SpamBayes, or can SpamBayes automatically (in fact, smartly) do this job for me when given the entire set and a required corpus size? If this feature is not available, would it not be a hell of a useful feature to support?

Here is why I think manual selection (just picking the latest 1000 messages from my large corpus, for a corpus size of 1000) may not be very effective. With a train-on-errors-and-unsures strategy, not all the messages in the corpus need to be trained. For example, if the last hundred good messages I received are all of the same type (e.g., a long-running thread about a specific topic), then SpamBayes can easily classify any future message of this type after training on just a small part of them. So to reach a corpus size of 1000 messages (and to train SpamBayes over a wide coverage of spam and ham message types), I may need to repeat the training multiple times with different subsets until I achieve an effective corpus.

Hope I have explained my query clearly. Pardon me for any ignorance. Thanks for clarifications in advance.
Ram

--
View this message in context: http://www.nabble.com/Does-SpamBayes-support-automatic-selective-training--tp14602895p14602895.html
Sent from the Spambayes - General mailing list archive at Nabble.com.

_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Info/Unsubscribe: http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
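
For what it's worth, the train-on-errors-and-unsures selection described in the question can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TinyBayes` is a toy word-count classifier made up for this example, not the SpamBayes classifier or its API, and `select_training_set`, `max_size`, and the two cutoff values are hypothetical names and numbers chosen for the sketch. The idea is to walk the corpus newest first, train only on messages the classifier gets wrong or is unsure about, and stop once the requested corpus size is reached.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy word-count classifier; a stand-in, NOT the SpamBayes classifier."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.n = {"ham": 0, "spam": 0}

    def train(self, text, label):
        # Count each distinct token once per message.
        self.counts[label].update(set(text.lower().split()))
        self.n[label] += 1

    def score(self, text):
        """Return a spam score in [0, 1]; 0.5 until both classes are trained."""
        if not self.n["ham"] or not self.n["spam"]:
            return 0.5
        log_odds = 0.0
        for tok in set(text.lower().split()):
            # Laplace-smoothed per-class token frequencies.
            p_spam = (self.counts["spam"][tok] + 1) / (self.n["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.n["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on long messages.
        log_odds = max(-30.0, min(30.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


def select_training_set(messages, max_size, ham_cutoff=0.2, spam_cutoff=0.9):
    """Train only on errors and unsures.

    messages: iterable of (text, label) pairs, newest first,
              with label "ham" or "spam".
    Returns the trained classifier and the messages actually trained on.
    The cutoff values are illustrative assumptions.
    """
    clf = TinyBayes()
    selected = []
    for text, label in messages:
        if len(selected) >= max_size:
            break
        s = clf.score(text)
        confidently_right = (
            (label == "spam" and s >= spam_cutoff)
            or (label == "ham" and s <= ham_cutoff)
        )
        if not confidently_right:  # an error or an unsure: train on it
            clf.train(text, label)
            selected.append((text, label))
    return clf, selected


if __name__ == "__main__":
    corpus = [
        ("cheap pills buy now", "spam"),
        ("meeting agenda for monday", "ham"),
        ("cheap pills buy now", "spam"),        # near-duplicate of the first
        ("meeting agenda for monday", "ham"),   # near-duplicate of the second
    ]
    clf, selected = select_training_set(corpus, max_size=1000)
    print(len(selected), "of", len(corpus), "messages were trained on")
```

Messages that repeat what the classifier already handles confidently (such as a long-running thread) are skipped, so the trained set stays small while still covering the distinct message types in the corpus.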
