On 2/2/2011 5:25 PM, Karsten Bräckelmann wrote:
> Spam to live accounts strongly preferred, human reviewed by "trained
> monkeys". Emphasis on trained. ;)  Some crap like backscatter should be
> filtered from the trap data, if possible, and trap volume kept lower --
> best done by random sampling, rather than dupe elimination.
>
> How much will that add to the corpus? In particular, how much would the
> first class be, without trap data at all?

Karsten brings up a good point about two types of spam. How about something like:

* We want a total of 70K spam in your nightly corpus over the past week. This means 10K spam per day.
* 3K spam on Monday is from trained monkeys. Include 7K from a random selection of trap spam.
* 2K spam on Tuesday is from trained monkeys. Include 8K from a random selection of trap spam.
* etc.

You could even split it into two separate masscheck runs.
anubis-monkey
anubis-trap
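To make the quota idea concrete, here is a minimal sketch of the fill-up rule: take everything the trained monkeys reviewed that day, then randomly sample trap spam to top the day up to 10K. The function name and the list-of-message-IDs representation are my own illustration, not anything in the masscheck tooling.

```python
import random

DAILY_TOTAL = 10_000  # 70K per week => 10K spam per day, per the proposal

def pick_daily_corpus(monkey_spam, trap_spam, total=DAILY_TOTAL, seed=None):
    """One day's corpus: all monkey-reviewed spam, plus a random sample
    of trap spam filling the remainder up to `total`.

    Random sampling (rather than dupe elimination) is what keeps the
    trap volume down, as suggested up-thread.
    """
    rng = random.Random(seed)
    need = max(0, total - len(monkey_spam))
    sample = rng.sample(trap_spam, min(need, len(trap_spam)))
    return list(monkey_spam) + sample

# e.g. Monday: 3K monkey spam in, 7K trap spam randomly drawn to fill.
```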



> Given we're talking original figures of 1 million spam per *day*,
> already discussing ways to cut that down to 50-100k -- over a period of
> up to 2 months for spam, 60 days, mind you -- which is less than 2k a
> day...

It seems his spam is lacking spamassassin headers, so without "reuse" we are unable to determine the delivery-time status of the network rules. I suggested that, as long as his mail lacks spamassassin headers, perhaps his random sample should be limited to the past week. Although not perfect, the past week might come closest to "reuse" in results.
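The "past week only" restriction could be applied before sampling, judging each trap message by its Date header. A rough sketch, assuming the messages are `email.message.Message` objects and that Date headers carry a timezone offset (messages with a missing or unparsable Date are simply excluded):

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def recent_only(messages, now=None, max_age_days=7):
    """Keep only messages whose Date header is within `max_age_days`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    keep = []
    for msg in messages:
        try:
            when = parsedate_to_datetime(msg["Date"])
        except (TypeError, ValueError):
            continue  # missing or unparsable Date: leave it out
        if when >= cutoff:
            keep.append(msg)
    return keep
```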

A better alternative would be to add spamassassin headers as each message is added to the nightly masscheck corpus. The random subset of trap spam would have headers from seconds after delivery, and trained-monkey spam would have headers from whenever it was sorted. "reuse" would then be possible, and the age of spam included in the nightly masscheck could be calibrated based upon how much this corpus overwhelms everyone else's recent spam.
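The bookkeeping side of stamping at ingest time might look like this. Note this is only an illustration of the record-keeping with Python's email module: `spam_status` stands in for the real X-Spam-Status line that an actual run through spamassassin/spamc would produce, and the X-Spam-Scan-Date header is a hypothetical name I made up to record when the scan happened.

```python
from datetime import datetime, timezone
from email.message import Message
from email.utils import format_datetime

def stamp_for_reuse(msg: Message, spam_status: str) -> Message:
    """Record scan results on a message as it enters the corpus.

    Storing the scan timestamp alongside the status lets a later
    masscheck judge how fresh the network-rule results are.
    """
    del msg["X-Spam-Status"]      # drop any stale copy first (no-op if absent)
    msg["X-Spam-Status"] = spam_status
    msg["X-Spam-Scan-Date"] = format_datetime(datetime.now(timezone.utc))
    return msg
```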

Warren
