The data set I use for Bayes consists of both ham and spam (https://www.cs.cmu.edu/~./enron/).
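
A minimal sketch of how such a corpus might be fed to SpamAssassin's sa-learn, assuming the ham and spam have already been sorted into two local directories. The directory names and the helper functions are hypothetical and are not the layout of the archive linked above; only the sa-learn options --ham and --spam are taken from SpamAssassin itself.

```python
import subprocess


def sa_learn_commands(ham_dir, spam_dir):
    """Build the sa-learn command lines for one ham and one spam directory."""
    return [
        ["sa-learn", "--ham", ham_dir],
        ["sa-learn", "--spam", spam_dir],
    ]


def train(ham_dir, spam_dir, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in sa_learn_commands(ham_dir, spam_dir):
        print(" ".join(cmd))
        if not dry_run:
            # Requires SpamAssassin to be installed on this machine.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths: point these at wherever the corpus was sorted.
    train("corpus/ham", "corpus/spam")
```

With dry_run left at its default the script only prints the two commands, so the paths can be checked before anything is written to the Bayes database.
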
Let's consider a scenario where I have a domain and I point it to a mail server. It might take a while before I receive 50,000 mails a day (Mailinator gives me that volume). I would need to plant multiple mail addresses in several forums for web scrapers to pick them up. I have tried to get hold of mails from my university, but it is a long and tedious process. I can try the method which Reindl suggested.

On Tue, May 31, 2016 at 6:32 AM, Reindl Harald <h.rei...@thelounge.net> wrote:

> On 31.05.2016 at 15:28, Antony Stone wrote:
>
>> 2. You should be aware (*especially* if using this stuff as the basis
>> of a research project - any competent referee should pick up on
>> something like this) that SA works best when the emails it is asked to
>> process are from the same source as it has been trained with. In other
>> words, you shovel real emails through a real mail server and train SA
>> using this spam and ham; you then use that trained SA to assess mail
>> passing through that same mail server, for the same users. Anything
>> significantly varying from this is not going to work well, and is
>> certainly not a good test of how well SA works.
>
> not true - I heard similar nonsense about "you can't re-use your MX
> bayes database on a submission server" - I can, I do, and it works like
> a charm
>
> our current corpus is 90000 mails large, contains samples in many
> languages for many users (site-wide setup), and that bayes is shared
> with another company for more than a year now and has the same results
> there as here (96% hit rate)