On 01/21/2011 01:06 AM, Warren Togami Jr. wrote:
> On 1/20/2011 7:23 AM, R - elists wrote:
>
>> initially this came across as a really suspect idea...
>>
>> i.e., one man's junk is another man's treasure
>
> Ham is a lot easier to define than Spam. Ham is simply anything that
> you subscribed for.
>

I am currently subscribed to a number of mailing lists to collect ham emails (in addition to other sources). While it may be true that mailing lists are good sources of ham, their emails do not contain a realistic diversity of features/characteristics.

In my view, the issue is not just ensuring that an email is ham, but also ensuring that it contains a realistic set of features. If the features are not realistic, and we optimize test scores based on them, then we might end up worsening test scores for real end-users.

For example, most list emails are non-HTML, while most end-user ham and spam emails are HTML. Evaluating sets of features (or tests) against this unrealistic corpus is likely to fool us into thinking that a feature/test is more effective than it is in reality (i.e. we might end up giving MIME-based tests higher scores).

Mahmoud