In general, please stop worrying about whether your corpus is ideal. Our sample size right now is so small that even a non-ideal corpus is helpful. Get started with nightly masschecks run from cron, then work on improving your corpus later.
I personally include:

* The last 4 weeks of spam. I use logrotate to automatically rotate one week at a time so I don't have to worry about it. I receive LOTS of spam so this is a good quantity. IMHO, spam older than a month is far less useful to test spamassassin's rules.
* Last 2 years of ham. If we had 10x as many contributors to nightly masscheck then I might reduce this to last 1 year of ham.

Warren
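
For anyone who wants to apply similar age windows without logrotate, here is a rough Python sketch. It is not part of masscheck and not how the setup above actually works. It assumes each corpus file (one message, or one rotated mailbox) has a modification time that roughly matches when the mail arrived. The names `is_recent` and `recent_files` and the two constants are made up for illustration.

```python
import os
import time

# Age windows taken from the list above; adjust to your own mail volume.
SPAM_MAX_AGE_DAYS = 4 * 7      # last 4 weeks of spam
HAM_MAX_AGE_DAYS = 2 * 365     # last 2 years of ham (approximate, ignores leap days)

SECONDS_PER_DAY = 86400


def is_recent(mtime, now, max_age_days):
    """Return True if a file modified at `mtime` is at most `max_age_days` old."""
    return (now - mtime) <= max_age_days * SECONDS_PER_DAY


def recent_files(directory, max_age_days, now=None):
    """List regular files in `directory` whose mtime falls inside the window.

    Assumption: mtime approximates the arrival date of the mail in the file.
    """
    if now is None:
        now = time.time()
    selected = []
    for entry in os.scandir(directory):
        if entry.is_file() and is_recent(entry.stat().st_mtime, now, max_age_days):
            selected.append(entry.path)
    return sorted(selected)
```

You could then call something like `recent_files("corpus/spam", SPAM_MAX_AGE_DAYS)` and `recent_files("corpus/ham", HAM_MAX_AGE_DAYS)` (both paths are placeholders) and hand the resulting file lists to whatever feeds your nightly run.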