----- "Karsten Bräckelmann" <[email protected]> wrote:
> > > > > SPAM: 51330 (150000 required)
> > > >
> > > > Joao Gouveia will soon be requesting an account to join the nightly
> > > > masscheck. He has a significant quantity of spam, and hopefully much
> > > > of it is European language so it should add to our diversity.
> > >
> > > I wonder how scoring will be affected if his corpus is >50k messages?
> > > :)
> >
> > Yikes. He has over 1 million per day spam. He's figuring out a way to
> > filter it to eliminate duplicates and do a random sample of ~20k * 7
> > days. But still, that's going to skew us too much.
>
> Yikes indeed.
>
> Maybe Joao should answer these himself...
>
> Given the numbers, is that purely trap driven? Is there a legion of human
> users manually verifying the spam?
I can't really go into much detail (it's unrelated to SpamAssassin), but there
are some traps involved, of course, and some trained monkeys as well.
>
> What exactly does "filter duplicates" mean? If that includes "identical"
> payload sent to different users, these dupes should not be eliminated I
> believe, since it will bias results. A random sample already will
> eliminate most duplicates, while preserving distribution.
It means removing identical or similar messages from the same spam campaign
(not directly related to recipients).
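
As a rough illustration only, the following Python sketch shows one way
campaign-level de-duplication followed by a random sample could work. The
normalisation rule, the function names (campaign_key, dedupe_by_campaign,
daily_sample) and the sample size are all assumptions made for the example;
they are not the actual filtering discussed in this thread.

```python
import hashlib
import random
import re


def campaign_key(body: str) -> str:
    """Hypothetical near-duplicate key: hash of the body after lowercasing,
    replacing digit runs with "0" and collapsing whitespace."""
    text = body.lower()
    text = re.sub(r"\d+", "0", text)
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def dedupe_by_campaign(messages):
    """Keep the first message seen for each campaign key.
    Recipients play no part in the key, only the message body."""
    seen = set()
    kept = []
    for body in messages:
        key = campaign_key(body)
        if key not in seen:
            seen.add(key)
            kept.append(body)
    return kept


def daily_sample(messages, k, seed=None):
    """De-duplicate one day's messages, then draw at most k at random."""
    rng = random.Random(seed)
    unique = dedupe_by_campaign(messages)
    if len(unique) <= k:
        return unique
    return rng.sample(unique, k)
```

Note that sampling after de-duplication gives every campaign equal weight,
whereas sampling the raw stream would preserve the original distribution,
which is the trade-off raised in the quoted question above.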
> Is there also ham?
Unfortunately no (not yet, anyway).
I cannot guarantee ham of sufficiently high quality (in variety and
contributors) to be useful at this point.
>
> Regarding skewing of results due to a single source with overwhelming
> numbers: I recall days, where mass-checks (though not for scoring)
> basically consisted of one huge corpus, and a bunch of additional,
> *much* smaller corpora. It did indeed have an impact on quite a few
> rules, hardly matching the dominant corpus at all, though others quite
> nicely. :/
>
>
> --
> char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
> main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
> (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
--
João Gouveia