https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114


Adam Katz <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #4448|0                           |1
        is obsolete|                            |




--- Comment #8 from Adam Katz <[email protected]>  2009-07-15 17:29:42 PST ---
Created an attachment (id=4485)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4485)
khop-sc-neighbors channel SA config from 2009-07-15 8p EDT

> KHOP_SC_TOP_CIDR16 looks good!

Actually, I think most of them look pretty good:

MSECS      SPAM%     HAM%     S/O    RANK   SCORE  NAME
0.00000  12.3655   0.8152   0.938    0.70    0.01  T_KHOP_SC_CIDR8
0.00000  22.1685   0.5394   0.976    0.75    1.00  KHOP_SC_TOP_CIDR8
0.00000   0.3683   0.0000   1.000    0.79    1.00  KHOP_SC_CIDR16
0.00000   0.8412   0.0000   1.000    0.85    1.00  KHOP_SC_TOP_CIDR16
0.00000   0.0129   0.0000   1.000    0.56    0.01  T_KHOP_SC_CIDR24
0.00000   0.0000   0.0000   0.500    0.49    0.01  T_KHOP_SC_TOP_CIDR24
0.00000   0.0909   0.0000   1.000    0.70    1.00  KHOP_SC_TOP200
0.00000   0.3400   0.0000   1.000    0.79    1.00  KHOP_SC_TOP100
0.00000   0.0024   0.0000   1.000    0.50    0.01  T_KHOP_SC_TOP20
0.00000   0.0008   0.0000   1.000    0.49    0.01  T_KHOP_SC_TOP10
0.00000   0.4341   0.0000   1.000   >0.79    0.00  (union of last 4)
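
The "union of last 4" row can be rechecked from the table itself. The sketch below sums the SPAM% of the four single-IP rules, which is only valid if the rules never hit the same message (an assumption, as is the S/O formula used here: spam% / (spam% + ham%), which reproduces the printed column for the rows checked). The function and variable names are made up for illustration.

```python
# Values copied from the SPAM% column of the table above.
single_ip_spam_pct = {
    "KHOP_SC_TOP200": 0.0909,
    "KHOP_SC_TOP100": 0.3400,
    "T_KHOP_SC_TOP20": 0.0024,
    "T_KHOP_SC_TOP10": 0.0008,
}


def spam_over_overall(spam_pct, ham_pct):
    """Assumed S/O definition: spam% / (spam% + ham%); 0.5 if the rule never hits."""
    total = spam_pct + ham_pct
    return 0.5 if total == 0 else spam_pct / total


# Summing is only a true union if the four rules are mutually exclusive.
union_spam_pct = sum(single_ip_spam_pct.values())

print(f"union of last 4 SPAM%: {union_spam_pct:.4f}")
print(f"T_KHOP_SC_CIDR8 S/O:   {spam_over_overall(12.3655, 0.8152):.3f}")
print(f"KHOP_SC_TOP_CIDR8 S/O: {spam_over_overall(22.1685, 0.5394):.3f}")
```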

Keep in mind that this is using data that is 57 days old (May 19, new version
attached) for a data set that is very time-specific.  You can see this impact
in the hit-rate over time graph, best illustrated by KHOP_SC_TOP_CIDR8, 
http://tinyurl.com/ksc3wa  (that's a shot of what it looks like now).  There
were almost zero ham hits on May 19, but they spiked a week later and again
this week.  Who's to say that the problematic entries were present at those
times?  We know only that the ham count was best on the day it was released.

This data suggests that I should either fold TOP10 and TOP20 back into TOP100
and possibly TOP200 (as summed above) or get rid of those single-ip hits
altogether.  I do worry about the length of the regular expression ... though
it's not as long as some of the sought rules.  I've considered fixing it with a
search-tree optimization that short-circuits groups by octet, so something like
/\b(?:1\.(?:2\.(?:3\.(?:4|5|6)|7\.(?:8|9))))\b/ to match what would otherwise
be /\b(?:1\.2\.3\.4|1\.2\.3\.5|1\.2\.3\.6|1\.2\.7\.8|1\.2\.7\.9)\b/, but either
sa-compile is smart enough to do that for me or it isn't worth my time (or both).
This stuff was mostly just to appease the people who wanted to highly penalize
the top 200 offender list (like the original SARE channel).
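
For illustration, the octet-by-octet grouping described above could be generated along these lines. This is a minimal sketch, not the channel's actual build script: the helper names (build_trie, trie_to_pattern, ip_list_regex) are hypothetical, and it assumes every entry is a full dotted-quad IPv4 address.

```python
import re


def build_trie(ips):
    """Nest the addresses octet by octet, e.g. {'1': {'2': {'3': {'4': {}}}}}."""
    trie = {}
    for ip in ips:
        node = trie
        for octet in ip.split("."):
            node = node.setdefault(octet, {})
    return trie


def trie_to_pattern(node):
    """Render one trie level as a non-capturing alternation of its octets."""
    branches = []
    for octet in sorted(node, key=int):
        child = node[octet]
        if child:
            # More octets follow: emit the octet, a literal dot, then recurse.
            branches.append(octet + r"\." + trie_to_pattern(child))
        else:
            branches.append(octet)
    return "(?:" + "|".join(branches) + ")"


def ip_list_regex(ips):
    """Wrap the grouped alternation in word boundaries, as in the flat form."""
    return r"\b" + trie_to_pattern(build_trie(ips)) + r"\b"


ips = ["1.2.3.4", "1.2.3.5", "1.2.3.6", "1.2.7.8", "1.2.7.9"]
pattern = ip_list_regex(ips)
print(pattern)
```

For the five sample addresses this yields the same grouped expression quoted above. Whether it is actually faster than the flat alternation once sa-compile has processed the rule is not something this sketch measures.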

Running some math using just SpamCop's numbers, the top200 list's summed
percentage of contributions to their spam total is only 1.356% (or 1.556% if we
assume rounding by truncation with full-blown optimism on the hidden values). 
Adjusting for the fact that RCVD_IN_BL_SPAMCOP_NET only hits 56.7% of the SA
test corpus, we're down to 0.769% (or round that up to 0.883%).  I guess that's
not bad, but it is nearly twice the 0.434% reported above.
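
The adjustment above is a single multiplication, spelled out here so the figures can be rechecked. The variable names are illustrative only; the inputs are the percentages quoted in this comment.

```python
# Percentages quoted in the paragraph above.
top200_share_pct = 1.356             # summed SpamCop top-200 share of their spam total
top200_share_optimistic_pct = 1.556  # same, assuming the hidden digits were truncated
spamcop_hit_rate_pct = 56.7          # RCVD_IN_BL_SPAMCOP_NET hit rate on the SA corpus
measured_union_pct = 0.4341          # "union of last 4" SPAM% from the table

# Scale SpamCop's own share down to the fraction of the SA corpus it covers.
expected_pct = top200_share_pct * spamcop_hit_rate_pct / 100
expected_optimistic_pct = top200_share_optimistic_pct * spamcop_hit_rate_pct / 100

# How far the measured hit rate falls short of the expectation.
ratio = expected_pct / measured_union_pct

print(f"expected:            {expected_pct:.3f}%")
print(f"expected (optimistic): {expected_optimistic_pct:.3f}%")
print(f"expected / measured: {ratio:.2f}")
```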

I've also noticed that a large number of SA admins don't have DNSEval
functioning properly.  My khop-sc-neighbors channel now compensates for this by
adding the points that would have been expected from those DNSBLs, which you
can see at the very end of the attached latest version.

Now that I know a little more about the ruleqa system (the T_* bit), I'll try
to post more immediate stats on the data from this attachment once it lands; it
should yield results a few days after landing in SVN, right?  Last time I missed
out a bit: by the time I found the stats, the data had already grown stale,
as shown by the following week's ham spike detailed above.


Additionally, recall that I assigned a very small number of points to the CIDR8
rules as I was fully expecting some FPs.  I've even scored them a little lower
just in case, clocking in at 0.6 for TOP_CIDR8 and 0.2 for CIDR8.  Perhaps I'm
not reading the score-map right, but 95.77% of the ham hits scored under 3.999
(84.14% scored under 0.999), so a small bump won't make a difference.  Given
the current data, T_KHOP_SC_CIDR8 would only add points to ONE false positive
hit (0.21% of its ham hits), and even if scored at 2.0, it would create 23 FPs
(4.87% of the 0.8152% of ham that it hits, which is to say 0.0397% of all ham).
Scoring it 1.0 or less wouldn't actually have added any FPs.
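
The false-positive percentages above chain together as follows. This is only a back-of-the-envelope check using the figures quoted in this comment; the implied count of ham hits is derived from them, not read from the ruleqa data, and the variable names are illustrative.

```python
# Figures quoted in the paragraph above.
rule_ham_hit_pct = 0.8152    # HAM% for T_KHOP_SC_CIDR8 from the table
pushed_over_pct = 4.87       # share of those ham hits that become FPs at score 2.0
new_fps_at_score_2 = 23      # FP count quoted for a score of 2.0

# Share of ALL ham that would become false positives at score 2.0.
overall_fp_pct = rule_ham_hit_pct * pushed_over_pct / 100

# Implied number of ham messages hit by the rule, derived from 23 FPs = 4.87%.
implied_ham_hits = new_fps_at_score_2 / (pushed_over_pct / 100)

# One message out of those hits, as a percentage.
one_hit_pct = 100 / implied_ham_hits

print(f"FPs at score 2.0, % of all ham: {overall_fp_pct:.4f}")
print(f"implied ham hits for the rule:  {implied_ham_hits:.0f}")
print(f"one hit as % of rule ham hits:  {one_hit_pct:.2f}")
```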
