On Thu, 2011-02-03 at 02:40 +0000, João Gouveia wrote:
> ----- "Karsten Bräckelmann" <[email protected]> wrote:

> > Given the numbers, is that purely trap driven? Is there a legion of human
> > users manually verifying the spam?
> 
> Can't really go into much detail (unrelated to SpamAssassin), but there
> are some traps involved of course, and some trained monkeys as well.
> 
> > What exactly does "filter duplicates" mean? If that includes "identical"
> > payload sent to different users, these dupes should not be eliminated I
> > believe, since it will bias results. A random sample already will
> > eliminate most duplicates, while preserving distribution.
> 
> It means equal or similar messages from the same spam campaign (not
> directly related to recipients).

Sure, not all detail, but maybe some (more)? :)

According to my comments above and Justin's addition regarding trap
data, plus the overall consensus in that sub-thread...

Spam to live accounts strongly preferred, human reviewed by "trained
monkeys". Emphasis on trained. ;)  Some crap like backscatter should be
filtered from the trap data, if possible, and trap volume kept lower --
best done by random sampling, rather than dupe elimination.
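
To illustrate the sampling point, here is a minimal sketch of how random
sampling shrinks a feed while keeping its campaign mix, which dupe
elimination would flatten to one message per campaign. Everything in it is
an assumption made for illustration: the function name, the campaign
labels, and the 9000/1000 split are hypothetical and not taken from the
actual feed.

```python
import random
from collections import Counter


def reservoir_sample(messages, k, seed=None):
    """Return k items chosen uniformly at random from an iterable (Algorithm R).

    Works on a stream of unknown length, keeping only k items in memory.
    """
    rng = random.Random(seed)
    sample = []
    for i, msg in enumerate(messages):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(msg)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = msg
    return sample


# Hypothetical trap feed: one large campaign and one small one.
stream = ["campaign-A"] * 9000 + ["campaign-B"] * 1000

sample = reservoir_sample(stream, 1000, seed=42)
counts = Counter(sample)

# Dupe elimination, for comparison, leaves one message per campaign.
deduped = sorted(set(stream))

print(counts)
print(deduped)
```

The sample stays roughly proportional to the original campaign sizes,
while the deduplicated set weighs both campaigns equally.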

How much will that add to the corpus? In particular, how much would the
first class be, without trap data at all?


Given we're talking original figures of 1 million spam per *day*,
already discussing ways to cut that down to 50-100k -- over a period of
up to 2 months for spam, 60 days, mind you -- which is less than 2k a
day...

How much trap data would you actually need, to match that fraction of
the overall data available?
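
The arithmetic behind those figures, as a small sketch. It assumes the
target corpus is spread evenly over the 60 day window and that the feed
stays at a constant 1 million messages per day; the function name is
hypothetical.

```python
def daily_quota(target_total, window_days):
    """Messages per day needed to reach target_total over window_days."""
    return target_total / window_days


DAILY_FEED = 1_000_000  # original figure: 1 million spam per day
WINDOW_DAYS = 60        # spam is kept for up to 2 months

for target in (50_000, 100_000):
    per_day = daily_quota(target, WINDOW_DAYS)
    share = per_day / DAILY_FEED
    print(f"{target} total: {per_day:.0f} per day, {share:.3%} of the feed")
```

Even the upper target of 100k works out to fewer than 2k messages a day,
a small fraction of a percent of the daily feed.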


> > Is there also ham?
> 
> Unfortunately no (not yet anyway). Cannot guarantee sufficient high
> quality ham (variety and contributors) to be useful at this point.

Fair enough, we'll gladly accept the ham, once you're happy with the
quality and variety. :)

Though particularly the variety probably won't be that much of an issue.
Keep in mind "some" of our ham corpora most likely are geek biased, let
alone not as globally diverse as we'd wish for. The latter I guess you
could help with single-handedly.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
