----- "Warren Togami Jr." <[email protected]> wrote:

> On 2/2/2011 5:25 PM, Karsten Bräckelmann wrote:
> > Spam to live accounts strongly preferred, human reviewed by "trained
> > monkeys". Emphasis on trained. ;)  Some crap like backscatter should be
> > filtered from the trap data, if possible, and trap volume kept lower --
> > best done by random sampling, rather than dupe elimination.
> >
> > How much will that add to the corpus? In particular, how much would the
> > first class be, without trap data at all?
> 
> Karsten brings up a good point about two types of spam.  How about 
> something like:
> 
> * We want a total of 70K spam in your nightly corpus over the past week.
>   This means 10K spam per day.
> * 3K spam on Monday is from trained monkeys.  Include 7K from a random
>   selection of trap spam.
> * 2K spam on Tuesday is from trained monkeys.  Include 8K from a random
>   selection of trap spam.
> * etc.
> 
> You could even split it into two separate masscheck runs.
> anubis-monkey
> anubis-trap

Thanks for the clear specs, Warren, that helps ;-)
We shall try to do it like that.
I still need to set up a proper environment for this, hopefully this coming
weekend.
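
As a rough illustration of the per-day selection described in the quoted
specs, here is a minimal Python sketch. It takes all human-reviewed
("trained monkey") spam first and tops up to the daily target with a
uniform random sample of trap spam, with no dupe elimination. The function
name, the 10K default, the seed parameter, and the cap applied when
reviewed spam alone exceeds the target are assumptions for illustration;
none of this is existing masscheck tooling.

```python
import random


def select_daily_spam(monkey_spam, trap_spam, daily_target=10_000, seed=None):
    """Return (monkey_part, trap_part) totalling at most daily_target items.

    monkey_spam and trap_spam are lists of message identifiers (hypothetical;
    they could be file paths or anything else that names a message).
    """
    rng = random.Random(seed)

    # Human-reviewed spam is preferred, so it is taken first.
    # Assumption: if it exceeds the target by itself, sample it down.
    if len(monkey_spam) > daily_target:
        monkey_part = rng.sample(monkey_spam, daily_target)
    else:
        monkey_part = list(monkey_spam)

    # Fill the rest of the quota with a random sample of trap spam.
    remaining = daily_target - len(monkey_part)
    trap_part = rng.sample(trap_spam, min(remaining, len(trap_spam)))
    return monkey_part, trap_part


if __name__ == "__main__":
    # Monday from the example: 3K reviewed spam, so 7K comes from the traps.
    monkey = [f"monkey-{i}" for i in range(3_000)]
    trap = [f"trap-{i}" for i in range(50_000)]
    m, t = select_daily_spam(monkey, trap)
    print(len(m), len(t))
```

Keeping the two returned lists separate would also fit the suggestion of
two separate masscheck runs (anubis-monkey and anubis-trap).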

> >
> >
> > Given we're talking original figures of 1 million spam per *day*,
> > already discussing ways to cut that down to 50-100k -- over a period of
> > up to 2 months for spam, 60 days, mind you -- which is less than 2k a
> > day...
> 
> It seems his spam is lacking spamassassin headers, so without "reuse" we
> are unable to determine delivery-time status of the network rules.  I
> suggested that as long as his mail is lacking spamassassin headers,
> perhaps his random sample should be limited to the past week.  Although
> not perfect, the past week might be closest to "reuse" in results.
> 
> A better alternative would be to add spamassassin headers to each message
> at the time it is selected for the nightly masscheck corpus.  The random
> subset of trap spam would have headers from seconds after delivery, and
> trained-monkey spam headers would be from whenever it was sorted.
> "reuse" would then be possible, and the age of spam included in the
> nightly masscheck can be calibrated based upon how much this corpus
> overwhelms everyone else's recent spam.
> 
> Warren

-- 
João Gouveia
