On 2/2/2011 5:25 PM, Karsten Bräckelmann wrote:
> Spam to live accounts strongly preferred, human reviewed by "trained
> monkeys". Emphasis on trained. ;)  Some crap like backscatter should be
> filtered from the trap data, if possible, and trap volume kept lower --
> best done by random sampling, rather than dupe elimination.
>
> How much will that add to the corpus? In particular, how much would the
> first class be, without trap data at all?

Karsten brings up a good point about two types of spam. How about something like:

* We want a total of 70K spam in your nightly corpus over the past week. This means 10K spam per day.
* 3K spam on Monday is from trained monkeys. Include 7K from a random selection of trap spam.
* 2K spam on Tuesday is from trained monkeys. Include 8K from a random selection of trap spam.
* etc.

You could even split it into two separate masscheck runs.
anubis-monkey
anubis-trap
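To make the quota idea concrete, here is a minimal sketch of the fill-up rule: take everything the trained monkeys reviewed that day, then randomly sample trap spam to top the day up to 10K. The function name and the list-of-message-IDs representation are my own illustration, not anything in the masscheck tooling.

```python
import random

DAILY_TOTAL = 10_000  # 70K per week => 10K spam per day, per the proposal

def pick_daily_corpus(monkey_spam, trap_spam, total=DAILY_TOTAL, seed=None):
    """One day's corpus: all monkey-reviewed spam, plus a random sample
    of trap spam filling the remainder up to `total`.

    Random sampling (rather than dupe elimination) is what keeps the
    trap volume down, as suggested up-thread.
    """
    rng = random.Random(seed)
    need = max(0, total - len(monkey_spam))
    sample = rng.sample(trap_spam, min(need, len(trap_spam)))
    return list(monkey_spam) + sample

# e.g. Monday: 3K monkey spam in, 7K trap spam randomly drawn to fill.
```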



> Given we're talking original figures of 1 million spam per *day*,
> already discussing ways to cut that down to 50-100k -- over a period of
> up to 2 months for spam, 60 days, mind you -- which is less than 2k a
> day...

It seems his spam is lacking spamassassin headers, so without "reuse" we are unable to determine the delivery-time status of the network rules. I suggested that, as long as his mail lacks spamassassin headers, perhaps his random sample should be limited to the past week. Although not perfect, the past week might come closest to "reuse" in results.
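The "past week only" restriction could be applied before sampling, judging each trap message by its Date header. A rough sketch, assuming the messages are `email.message.Message` objects and that Date headers carry a timezone offset (messages with a missing or unparsable Date are simply excluded):

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def recent_only(messages, now=None, max_age_days=7):
    """Keep only messages whose Date header is within `max_age_days`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    keep = []
    for msg in messages:
        try:
            when = parsedate_to_datetime(msg["Date"])
        except (TypeError, ValueError):
            continue  # missing or unparsable Date: leave it out
        if when >= cutoff:
            keep.append(msg)
    return keep
```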

A better alternative would be to add spamassassin headers as each message is added to the nightly masscheck corpus. The random subset of trap spam would have headers from seconds after delivery, and trained-monkey spam would have headers from whenever it was sorted. "reuse" would then be possible, and the age of spam included in the nightly masscheck could be calibrated based upon how much this corpus overwhelms everyone else's recent spam.
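The bookkeeping side of stamping at ingest time might look like this. Note this is only an illustration of the record-keeping with Python's email module: `spam_status` stands in for the real X-Spam-Status line that an actual run through spamassassin/spamc would produce, and the X-Spam-Scan-Date header is a hypothetical name I made up to record when the scan happened.

```python
from datetime import datetime, timezone
from email.message import Message
from email.utils import format_datetime

def stamp_for_reuse(msg: Message, spam_status: str) -> Message:
    """Record scan results on a message as it enters the corpus.

    Storing the scan timestamp alongside the status lets a later
    masscheck judge how fresh the network-rule results are.
    """
    del msg["X-Spam-Status"]      # drop any stale copy first (no-op if absent)
    msg["X-Spam-Status"] = spam_status
    msg["X-Spam-Scan-Date"] = format_datetime(datetime.now(timezone.utc))
    return msg
```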

Warren
