----- "Karsten Bräckelmann" <[email protected]> wrote:

> > > > > SPAM: 51330 (150000 required)
> > > >
> > > > Joao Gouveia will soon be requesting an account to join the nightly
> > > > masscheck. He has a significant quantity of spam, and hopefully much
> > > > of it is European language so it should add to our diversity.
> > >
> > > I wonder how scoring will be affected if his corpus is >50k messages?
> > > :)
> > 
> > Yikes.  He has over 1 million per day spam.  He's figuring out a way to
> > filter it to eliminate duplicates and do a random sample of ~20k * 7
> > days.  But still, that's going to skew us too much.
> 
> Yikes indeed.
> 
> Maybe Joao should answer these himself...
> 
> Given the numbers, is that purely trap driven? Is there a legion of human
> users manually verifying the spam?

Can't really go into much detail (unrelated to SpamAssassin), but there are
some traps involved of course, and some trained monkeys as well.

> 
> What exactly does "filter duplicates" mean? If that includes "identical"
> payload sent to different users, these dupes should not be eliminated, I
> believe, since it will bias results. A random sample already will
> eliminate most duplicates, while preserving distribution.

It means equal or similar messages from the same spam campaign (not directly 
related to recipients).
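
For readers following the thread, the two approaches under discussion can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual pipeline (which is not described here): `campaign_of` is a hypothetical function that maps a message to some campaign key, and `reservoir_sample` is a standard uniform sampling technique swapped in for "a random sample". Campaign-level de-duplication keeps one message per campaign and so flattens the distribution, while uniform sampling shrinks the corpus but keeps large campaigns proportionally large.

```python
import random


def dedupe_by_campaign(messages, campaign_of):
    """Keep the first message seen for each campaign key.

    `campaign_of` is a hypothetical callable (an assumption for this
    sketch) returning a hashable campaign identifier for a message.
    """
    seen = set()
    kept = []
    for msg in messages:
        key = campaign_of(msg)
        if key not in seen:
            seen.add(key)
            kept.append(msg)
    return kept


def reservoir_sample(messages, k, rng=None):
    """Return up to k messages chosen uniformly at random in one pass.

    Works on any iterable, so the full day's feed never has to be
    held in memory.
    """
    rng = rng or random.Random()
    sample = []
    for i, msg in enumerate(messages):
        if i < k:
            sample.append(msg)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = msg
    return sample
```

For example, `reservoir_sample(feed, 20000)` would draw one day's worth of the ~20k sample mentioned above from a hypothetical iterable `feed`.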

> Is there also ham?

Unfortunately no (not yet anyway).
I cannot guarantee ham of sufficiently high quality (in variety and
contributors) to be useful at this point.

> 
> Regarding skewing of results due to a single source with overwhelming
> numbers: I recall days, where mass-checks (though not for scoring)
> basically consisted of one huge corpus, and a bunch of additional,
> *much* smaller corpora. It did indeed have an impact on quite a few
> rules, hardly matching the dominant corpus at all, though others quite
> nicely. :/
> 
> 
> -- 
> char
> *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
> main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8?
> c<<=1:
> (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){
> putchar(t[s]);h=m;s=0; }}}

-- 
João Gouveia
