Re: Score Hit Frequency in SA Corpus?

Justin Mason Mon, 22 Sep 2008 06:37:21 -0700

Joseph Brennan writes:
> 
> 
> --On Sunday, September 21, 2008 18:39 -0600 Bob Proulx <[EMAIL PROTECTED]> 
> wrote:
> 
> >> OVERALL    SPAM%     HAM%     S/O    RANK   SCORE  NAME
> >>   1.116   1.5957   0.2705    0.855   0.51    2.08  SUBJ_ALL_CAPS
> >
> > Am I reading that correctly to see that in spam all caps showed up in
> > 1.60% of the regression corpus and only in 0.27% of the non-spam?
> > Gosh that seems like a very small indicator.
> 
> 
> No, it's high.  Only 1.87% had all caps subject, but of those 85%
> were spam: 1.60 / 1.87.
> 
> If I am reading correctly.


That's right.  
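
To make the arithmetic concrete, here is a minimal sketch in Python. It
assumes S/O is simply SPAM% / (SPAM% + HAM%), which reproduces the 0.855
in the row quoted above; the function name is made up for illustration
and is not part of any SpamAssassin tool.

```python
def spam_over_overall(spam_pct: float, ham_pct: float) -> float:
    """Fraction of a rule's hits that are spam, from the SPAM% and HAM% columns.

    Assumption: S/O = SPAM% / (SPAM% + HAM%), i.e. both corpora weighted equally.
    """
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule never hit, S/O is undefined")
    return spam_pct / total


# Values from the SUBJ_ALL_CAPS row quoted above.
so = spam_over_overall(1.5957, 0.2705)
print(f"S/O = {so:.3f}")
```

An S/O near 1.0 means the rule hits almost only spam, 0.5 means it says
nothing either way, and near 0.0 means it hits mostly ham.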

The problem with SUBJ_ALL_CAPS is that it tends to catch really odd
fraud spams, foreign-language spam, etc. that the other rules fail to
spot. This means that the GA that assigns the rule scores likes it quite
a lot: despite the occasional FP (false positive), it reduces FNs
(false negatives) enough to make it "worth it".

It's hard to avoid this issue. :(

--j.
