Hi, I have a large number of good and spam messages (a few thousand) collected over a year and would like to use them for initial training. But I understand that it is preferable to train with only a small subset of these messages (maybe a thousand: 500 spam and 500 ham) to keep my training database small, fast and effective.
My question is: do I need to manually pick out the latest thousand or so messages from this large corpus and feed them to SpamBayes, or can SpamBayes do this job for me automatically (and smartly) when given the entire set and a required corpus size? If this feature is not available, would it not be a hell of a useful feature to support?

Here is why I think manual selection (just picking the latest 1000 messages from my large corpus, for a corpus size of 1000) may not be very effective. Not all the messages in the corpus need to be trained on if I use the train-on-errors-and-unsures strategy. For example, if the last hundred good messages I received are all of the same type (e.g. a long-running thread about a specific topic), then SpamBayes can easily classify any future message of this type after training on only a small part of them. So to reach a corpus of 1000 messages that covers a wide range of spam and ham message types, I may need to repeat the training multiple times with different subsets until I arrive at an effective corpus.

I hope I have explained my question clearly. Pardon any ignorance on my part. Thanks in advance for any clarification.

Ram
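
P.S. To make the idea concrete, here is a rough Python sketch of the kind of selective training I have in mind. It is only an illustration: ToyClassifier is a made-up stand-in and not the SpamBayes API, select_training_subset and its parameters are hypothetical names, and the 0.2 / 0.9 cutoffs are just assumed thresholds for "sure ham" and "sure spam". The loop walks the corpus newest first and trains only on messages that are currently misclassified or unsure, stopping at the requested corpus size.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier (NOT the SpamBayes API)."""

    def __init__(self):
        self.spam = Counter()  # word -> number of spam messages containing it
        self.ham = Counter()   # word -> number of ham messages containing it

    def learn(self, text, is_spam):
        counts = self.spam if is_spam else self.ham
        counts.update(set(text.lower().split()))

    def score(self, text):
        """Return a spam score in [0, 1]; 0.5 means no evidence either way."""
        probs = []
        for word in set(text.lower().split()):
            s, h = self.spam[word], self.ham[word]
            if s + h:  # ignore words never seen in training
                probs.append(s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5


def select_training_subset(messages, clf, ham_cutoff=0.2, spam_cutoff=0.9,
                           max_size=1000):
    """Train only on messages the classifier gets wrong or is unsure about.

    messages: iterable of (text, is_spam) pairs, ordered newest first.
    Returns the list of messages that were actually trained on.
    """
    trained = []
    for text, is_spam in messages:
        if len(trained) >= max_size:
            break
        score = clf.score(text)
        # "Confident and right" means the score is already on the correct
        # side of the relevant cutoff; anything else is an error or unsure.
        if is_spam:
            confident_and_right = score >= spam_cutoff
        else:
            confident_and_right = score <= ham_cutoff
        if not confident_and_right:
            clf.learn(text, is_spam)
            trained.append((text, is_spam))
    return trained
```

With something like this, a second message from the same long-running thread would be skipped once the first has been learned. The sketch does not try to balance spam against ham (e.g. 500 of each); that would need separate counters per class.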
