Mass-check Corpora (once was: Re: Update Mirror Issues)

Karsten Bräckelmann Tue, 01 Feb 2011 15:03:15 -0800

> > > > SPAM: 51330 (150000 required)
> > >
> > > Joao Gouveia will soon be requesting an account to join the nightly
> > > masscheck. He has a significant quantity of spam, and hopefully much
> > > of it is European language so it should add to our diversity.
> >
> > I wonder how scoring will be affected if his corpus is >50k messages?
> > :)
> 
> Yikes.  He has over 1 million per day spam.  He's figuring out a way to 
> filter it to eliminate duplicates and do a random sample of ~20k * 7 
> days.  But still, that's going to skew us too much.


Yikes indeed.

Maybe Joao should answer these himself...

Given the numbers, is that purely trap driven? Is there a legion of human
users manually verifying the spam?

What exactly does "filter duplicates" mean? If that includes "identical"
payload sent to different users, I believe these dupes should not be
eliminated, since removing them will bias the results. A random sample
will already eliminate most duplicates, while preserving the distribution.
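
To illustrate the point about sampling versus de-duplication, here is a
rough sketch in Python. The feed, the campaign labels "A"/"B"/"C" and the
helper campaign_shares() are all made up for illustration; this is not
how Joao's filtering or the mass-check tooling actually works.

```python
import random
from collections import Counter


def campaign_shares(messages):
    """Return each payload's share of the given message list."""
    counts = Counter(messages)
    total = len(messages)
    return {payload: n / total for payload, n in counts.items()}


# Hypothetical one-day feed: three campaigns of very different volume.
feed = ["A"] * 9000 + ["B"] * 900 + ["C"] * 100

rng = random.Random(42)          # fixed seed so the run is repeatable
sample = rng.sample(feed, 1000)  # uniform random sample, without replacement
deduped = sorted(set(feed))      # payload-level de-duplication

print("feed:   ", campaign_shares(feed))
print("sample: ", campaign_shares(sample))
print("deduped:", campaign_shares(deduped))
```

The sample keeps roughly the 90/9/1 split of the feed while shrinking
it tenfold, whereas the de-duplicated set gives every campaign equal
weight, regardless of how much of it real users actually received.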

Is there also ham?


Regarding skewing of results due to a single source with overwhelming
numbers: I recall days when mass-checks (though not for scoring)
basically consisted of one huge corpus and a bunch of additional,
*much* smaller corpora. It did indeed have an impact on quite a few
rules, which hardly matched the dominant corpus at all, though they
matched the others quite nicely. :/
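
A back-of-the-envelope sketch of that effect, in Python. The message
counts come from this thread (51330 spam today, ~20k * 7 days from the
new source); the per-corpus hit rates of 20% and 1% and the helper
pooled_hit_rate() are purely hypothetical.

```python
def pooled_hit_rate(corpora):
    """corpora: list of (message_count, rule_hit_rate) pairs."""
    total = sum(n for n, _ in corpora)
    hits = sum(n * rate for n, rate in corpora)
    return hits / total


# Existing spam corpora, lumped together; the 20% hit rate is assumed.
existing = [(51330, 0.20)]
# New dominant source, ~20k * 7 days; the 1% hit rate is assumed.
dominant = (140000, 0.01)

before = pooled_hit_rate(existing)
after = pooled_hit_rate(existing + [dominant])
print(f"pooled hit rate: {before:.3f} -> {after:.3f}")
```

Under these assumptions, a rule that looks quite useful on the existing
corpora ends up with a pooled hit rate much closer to what the single
large corpus says about it.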


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
