On Fri, Oct 08, 2004 at 03:49:08PM -0700, Justin Mason wrote:
> I've been considering possible improvements to how we figure out what
> rules are effective.
> 
> Currently we use the S/O ratio and hit-rate of each individual rule, in
> other words, if a rule hits a lot of spam, and little nonspam, we detect
> that and consider it "good".
> 
> However, that doesn't take into account the situation where multiple rules
> are hitting mostly the same mail; for example, like this:
> 
>              S1  S2  S3  S4  S5  H1  H2  H3  H4  H5
>     RULE1:   x   x   x   x                       
>     RULE2:   x   x   x   x                       
>     RULE3:               x   x                   x
>     RULE4:                   x                    

I've thought about this as well.  In the social sciences, a common
statistical technique for describing things like "personality" is
factor analysis.  It's been years, but from what I remember, factor
analysis groups correlated variables into "factors", each of which
describes a common underlying construct or idea.
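
To make that concrete, here is a rough sketch of the first step a factor
analysis would take: building the correlation matrix between rules.  It
uses only the hit table from the quoted example above.  The names HITS,
pearson and redundant_pairs, and the 0.9 threshold, are my own
assumptions for illustration; none of this is SA code, and it stops well
short of actually extracting factors.

```python
from itertools import combinations
from math import sqrt

# Rule-hit matrix transcribed from the quoted example (1 = rule hit).
# Columns are messages S1..S5 (spam) followed by H1..H5 (ham).
HITS = {
    "RULE1": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE2": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "RULE3": [0, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    "RULE4": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
}


def pearson(x, y):
    """Pearson correlation of two equal-length hit vectors.

    Assumes each rule hits at least one message and misses at least one,
    otherwise the variance is zero and the division below fails.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def redundant_pairs(hits, threshold=0.9):
    """Return (rule_a, rule_b) pairs whose hit patterns correlate >= threshold."""
    pairs = []
    for a, b in combinations(hits, 2):
        if pearson(hits[a], hits[b]) >= threshold:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    for a, b in combinations(HITS, 2):
        print(f"{a} vs {b}: r = {pearson(HITS[a], HITS[b]):+.3f}")
    print("redundant:", redundant_pairs(HITS))
```

On this toy data only RULE1 and RULE2 come out as redundant, since they
hit exactly the same messages.  A real factor analysis would go further
and decompose the whole correlation matrix into a few shared factors
rather than just comparing rules pairwise.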

I'm not familiar with the genetic algorithm (or whatever is currently
used) in SA, but I would be interested in looking into alternative
approaches like factor analysis for identifying spam.

One interesting thing about factor analysis is that it could describe
mail in finer terms than just "SPAM" or "HAM".  It could categorize
mails as, say, "Nigerian spam", "porn spam", "mortgage spam", etc.

Mike

-- 
/-----------------------------------------\
| Michael Barnes <[EMAIL PROTECTED]> |
| UNIX Systems Administrator              |
| College of William and Mary             |
| Phone: (757) 879-3930                   |
\-----------------------------------------/
