2011/2/2 Warren Togami Jr. <[email protected]>:
> On 2/1/2011 1:02 PM, Karsten Bräckelmann wrote:
>>
>> Yikes indeed.
>>
>> Maybe Joao should answer these himself...
>>
>> Given the numbers, is that purely trap driven? Is there a legion of human
>> users manually verifying the spam?
>>
>> What exactly does "filter duplicates" mean? If that includes "identical"
>> payload sent to different users, these dupes should not be eliminated I
>> believe, since it will bias results. A random sample already will
>> eliminate most duplicates, while preserving distribution.
>
> Good point. +1

+1.

My approach btw when dealing with traps is to (a) upload those using a
distinct filename if possible (e.g. "spam-jm-traps.log" or similar),
and (b) sample randomly to get the volume down to something comparable
to the other corpora.  Trap spam tends to contain bounce blowback and
other "noise" that we don't necessarily want in large numbers in our
corpora.
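
For what it's worth, step (b) could look roughly like the sketch below.
This is only an illustration, not what any of our scripts actually do:
the function name sample_corpus, the target_size parameter and the idea
of passing in a list of message paths are all assumptions made up for
the example.  It draws a uniform random sample without deduplicating
first, so payloads that hit the trap many times stay proportionally
represented.

```python
import random


def sample_corpus(messages, target_size, seed=None):
    """Return a uniform random sample of at most target_size messages.

    `messages` is any iterable of message identifiers (hypothetical:
    file paths of trapped mails).  No deduplication is done, so the
    sample keeps roughly the same distribution as the full trap corpus.
    """
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    messages = list(messages)
    if len(messages) <= target_size:
        # Already small enough; keep everything.
        return messages
    # random.Random.sample picks target_size items without replacement.
    return rng.sample(messages, target_size)


if __name__ == "__main__":
    # Hypothetical trap corpus of 50,000 messages cut down to 5,000.
    trap_paths = [f"traps/msg-{i:05d}" for i in range(50000)]
    subset = sample_corpus(trap_paths, 5000, seed=42)
    print(len(subset))
```

The target size is a judgment call; the point is just to bring the trap
volume into the same range as the other corpora.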
