On 01/21/2011 01:06 AM, Warren Togami Jr. wrote:
> On 1/20/2011 7:23 AM, R - elists wrote:
> 
>> initially this came across as a really suspect idea...
>>
>> i.e., one man's junk is another man's treasure
> 
> Ham is a lot easier to define than Spam.  Ham is simply anything that
> you subscribed for.
> 

I am currently subscribed to a number of mailing lists to collect ham
emails (in addition to other sources). While it may be true that
mailing lists can be good sources of ham, their emails do not contain
a realistic diversity of features/characteristics.

In my view, the issue is not just ensuring that an email is ham, but
also ensuring that it contains a realistic set of features. If the
features are not realistic, and we optimize test scores based on them,
then we might end up worsening test scores for real end users.

For example, most list emails are non-HTML, while most end-user ham and
spam emails are HTML. Evaluating sets of features (or tests) against
this unrealistic corpus is likely to fool us into thinking that a
feature/test is more effective than it is in reality (i.e. we might
end up giving MIME-based tests higher scores).
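
One rough way to check a corpus for this kind of skew is to measure
what fraction of its messages carry an HTML part and compare that with
the fraction seen in real end-user mail. The sketch below does this
with Python's standard email module. The function names and the two
sample messages are made up for illustration and are not part of any
existing tool.

```python
from email import message_from_string

# Two tiny hypothetical messages, used only to illustrate the check.
PLAIN_SAMPLE = (
    "Subject: list post\n"
    "Content-Type: text/plain\n"
    "\n"
    "Hello list\n"
)
HTML_SAMPLE = (
    "Subject: newsletter\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body>Hi</body></html>\n"
)


def has_html_part(raw_message):
    """Return True if any MIME part of the message is text/html."""
    msg = message_from_string(raw_message)
    # walk() yields the message itself and, for multipart mail, each subpart.
    return any(part.get_content_type() == "text/html" for part in msg.walk())


def html_fraction(raw_messages):
    """Return the fraction of messages that contain an HTML part."""
    messages = list(raw_messages)
    if not messages:
        return 0.0
    return sum(has_html_part(m) for m in messages) / len(messages)


if __name__ == "__main__":
    # A list-heavy corpus: three plain-text posts and one HTML message.
    corpus = [PLAIN_SAMPLE, PLAIN_SAMPLE, PLAIN_SAMPLE, HTML_SAMPLE]
    print("HTML fraction:", html_fraction(corpus))
```

If the fraction for a list-derived ham corpus comes out far below the
fraction for end-user mail, then scores tuned on that corpus for
HTML- or MIME-related tests deserve extra suspicion.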


Mahmoud
