FWIW I use a combination of two sources for HAM training:
1) Selected chunks of my own email (e.g. mailing lists not involving SA, personal email, etc.)
2) I set up a "nonspamtrap" account and subscribed it to a few of the newsletters my users commonly subscribe to.
Note that an equal amount of spam and ham isn't strictly required, and it isn't optimal either, so don't kill yourself trying to make the numbers match exactly. Just avoid a huge imbalance. Ideally, your training set would have the same spam/ham ratio that your server sees in reality.
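If it helps, here is a rough sketch of that ratio arithmetic in Python. The function name and the traffic counts are made up for illustration; plug in whatever your own logs show.

```python
def ham_needed(spam_stockpile, observed_spam, observed_ham):
    """Estimate how many ham messages to train on so the training set
    has roughly the same spam/ham ratio as real traffic.

    spam_stockpile: number of spam messages you plan to train with
    observed_spam, observed_ham: message counts seen on the server over
        some sample period (hypothetical numbers, taken from your logs)
    """
    if observed_spam <= 0:
        raise ValueError("observed_spam must be positive")
    # Scale the stockpile by the ham-to-spam ratio seen in real traffic.
    return round(spam_stockpile * observed_ham / observed_spam)


# Example: 1000 spam on hand, and the server saw 800 spam and 200 ham
# over the sample period.
print(ham_needed(1000, 800, 200))
```

Treat the result as a ballpark target rather than a number to hit exactly; the point is only to avoid a huge imbalance.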
At 04:35 PM 2/2/2004, Mike Samba wrote:
This might sound more than a little stupid, but...
I am looking into implementing Bayes filtering and have stockpiled a TON of Spam to train with. Where are you getting an equal amount of Ham to train with? I administer an email domain, but only have access to my own mail (ethically). What are your suggestions on rounding up 1000 or so Ham messages from my users so that it is not too intrusive or annoying for the user?
Any suggestions would be great!!!
Mike
