Justin Mason wrote:
Darryl Bleau writes:
>The SA Public Corpus at spamassassin.org/publiccorpus has been a great help to myself and others who like to use a standard corpus of mail to evaluate new anti-spam ideas and current techniques.
>However, it's now quite dated, with the newest collection being 2003/02/28.
>My question is, is there a newer collection in another location that I'm missing out on, or if not, are there any plans to have an updated public corpus?
Well, it's pretty labour-intensive to put together -- but I suppose the ham hasn't changed much since 2003/02, so I could just upload some newer spam.
Sound useful?
Yes, quite. :)
The issue really isn't gathering spam... while it is a pain to manually verify the messages, anyone with enough time can do it. What's nice about the SA public corpus is that it's a common, open set of mail from a trusted source, which makes it quite useful when comparing results with others.
The only suggestion I would have for the ham would be to remove the SA-list (or Spam-topic ham) related messages, for the same reason that you don't include these types of messages in the mass checks.
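For illustration only, here is a rough sketch of how that kind of filtering might be done. It is not how the corpus or the mass checks actually handle it; the header names checked, the "spamassassin" marker string, and the function names are all assumptions made up for this example.

```python
from email import message_from_string

# Hypothetical marker strings; adjust to whichever lists should be excluded.
LIST_MARKERS = ("spamassassin",)
# Headers assumed likely to reveal list traffic; this choice is illustrative.
LIST_HEADERS = ("List-Id", "Sender", "To", "Cc")


def is_list_ham(raw_message, markers=LIST_MARKERS):
    """Return True if any checked header mentions one of the markers."""
    msg = message_from_string(raw_message)
    for header in LIST_HEADERS:
        # get_all returns every occurrence of the header, or [] if absent.
        for value in msg.get_all(header, []):
            if any(marker in str(value).lower() for marker in markers):
                return True
    return False


def filter_ham(raw_messages, markers=LIST_MARKERS):
    """Keep only the messages that do not look like list traffic."""
    return [m for m in raw_messages if not is_list_ham(m, markers)]
```

A header check like this would miss spam-topic mail that never went through a list, so some manual review would presumably still be needed.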
On a related note, there was talk some time back (I'm not sure if it was on this list or not) about setting up a publicly-updated corpus using some sort of trust/verification mechanism. If there is interest (besides my own) in this sort of thing, I could look into how to get it going.
