https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6114

Adam Katz <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #4448|0                           |1
        is obsolete|                            |

--- Comment #8 from Adam Katz <[email protected]>  2009-07-15 17:29:42 PST ---
Created an attachment (id=4485)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4485)
khop-sc-neighbors channel SA config from 2009-07-15 8p EDT

> KHOP_SC_TOP_CIDR16 looks good!

Actually, I think most of them look pretty good:

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
0.00000  12.3655   0.8152   0.938    0.70    0.01  T_KHOP_SC_CIDR8
0.00000  22.1685   0.5394   0.976    0.75    1.00  KHOP_SC_TOP_CIDR8
0.00000   0.3683   0.0000   1.000    0.79    1.00  KHOP_SC_CIDR16
0.00000   0.8412   0.0000   1.000    0.85    1.00  KHOP_SC_TOP_CIDR16
0.00000   0.0129   0.0000   1.000    0.56    0.01  T_KHOP_SC_CIDR24
0.00000   0.0000   0.0000   0.500    0.49    0.01  T_KHOP_SC_TOP_CIDR24
0.00000   0.0909   0.0000   1.000    0.70    1.00  KHOP_SC_TOP200
0.00000   0.3400   0.0000   1.000    0.79    1.00  KHOP_SC_TOP100
0.00000   0.0024   0.0000   1.000    0.50    0.01  T_KHOP_SC_TOP20
0.00000   0.0008   0.0000   1.000    0.49    0.01  T_KHOP_SC_TOP10
0.00000   0.4341   0.0000   1.000   >0.79    0.00  (union of last 4)

Keep in mind that this is using data that is 57 days old (May 19, new version attached) for a data set that is very time-specific. You can see this impact in the hit-rate over time graph, best illustrated by KHOP_SC_TOP_CIDR8, http://tinyurl.com/ksc3wa (that's a shot of what it looks like now) - there were almost zero hams on May 19, but the hams spiked up a week later and again for this week. Who's to say that the problematic entries were present at those times? We know only that the ham count was best on the day it was released.

This data suggests that I should either fold TOP10 and TOP20 back into TOP100 and possibly TOP200 (as summed above) or get rid of those single-ip hits altogether. I do worry about the length of the regular expression ... though it's not as long as some of the sought rules.
I've considered fixing it with a search tree optimization, short-circuiting groups by octet, so something like /\b(?:1\.(?:2\.(?:3\.(?:4|5|6)|7\.(?:8|9))))\b/ to match what would otherwise be /\b(?:1\.2\.3\.4|1\.2\.3\.5|1\.2\.3\.6|1\.2\.7\.8|1\.2\.7\.9)\b/, but either sa-compile is smart enough to do that for me or it isn't worth my time (or both).

This stuff was mostly just to appease the people who wanted to highly penalize the top 200 offender list (like the original SARE channel). Running some math using just SpamCop's numbers, the top200 list's summed percentage of contributions to their spam total is only 1.356% (or 1.556% if we assume rounding by truncation with full-blown optimism on the hidden values). Adjusting for the fact that RCVD_IN_BL_SPAMCOP_NET only hits 56.7% of the SA test corpus, we're down to 0.769% (or round that up to 0.883%). I guess that's not bad, but it is roughly twice the 0.434% reported above.

I've also noticed that a large number of SA admins don't have DNSEval functioning properly. My khop-sc-neighbors channel now compensates for this by adding the points that would have been expected from those DNSBLs, which you can see at the very end of the attached latest version.

Now that I know a little more about the ruleqa system (the T_* bit), I'll try to post more immediate stats on the data from this attachment once it lands; it should yield results a few days after landing in SVN, right? Last time I missed out a bit, in that by the time I found the stats, the data had already grown stale, as noted in the next week's ham spike detailed above.

Additionally, recall that I assigned a very small number of points to the CIDR8 rules as I was fully expecting some FPs. I've even scored them a little lower just in case, clocking in at 0.6 for TOP_CIDR8 and 0.2 for CIDR8. Perhaps I'm not reading the score-map right, but 95.77% of the ham hits scored under 3.999 (84.14% scored under 0.999), so a small bump won't make a difference.
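
For reference, the octet-grouped search tree mentioned above could be generated along these lines. This is only a minimal sketch in Python (SpamAssassin itself is Perl), it assumes every entry is a full dotted-quad IPv4 address, and the function names are hypothetical; it is not how the khop-sc-neighbors channel is actually built.

```python
import re


def build_trie(ips):
    """Nest dotted-quad addresses into a dict-of-dicts keyed by octet."""
    trie = {}
    for ip in ips:
        node = trie
        for octet in ip.split("."):
            node = node.setdefault(octet, {})
    return trie


def trie_to_regex(node):
    """Render one trie level as a (possibly nested) alternation."""
    parts = []
    for octet in sorted(node, key=int):
        child = node[octet]
        if child:
            # Shared prefix: emit the octet once, then recurse.
            parts.append(octet + r"\." + trie_to_regex(child))
        else:
            # Last octet of an address.
            parts.append(octet)
    if len(parts) == 1:
        return parts[0]  # no alternation needed for a single branch
    return "(?:" + "|".join(parts) + ")"


def ip_list_regex(ips):
    """Build a word-bounded pattern matching any address in ips."""
    return r"\b(?:" + trie_to_regex(build_trie(ips)) + r")\b"


# The five sample addresses from the comment above.
sample = ["1.2.3.4", "1.2.3.5", "1.2.3.6", "1.2.7.8", "1.2.7.9"]
pattern = ip_list_regex(sample)
print(pattern)
print(bool(re.search(pattern, "Received: from host [1.2.7.9]")))
```

The generated pattern is grouped slightly differently from the hand-written one (single branches are not wrapped in their own group), but it is intended to match the same addresses as the flat alternation.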

Given the current data, T_KHOP_SC_CIDR8 would only add points to ONE false positive hit (0.21% of the ham) and even if scored at 2.0, it would create 23 FPs (4.87% of the 0.8152% of the hams, which is to say 0.0397% of the ham). Scoring it 1.0 or less wouldn't actually have added any FPs.
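
The "percentage of a percentage" figures quoted in this comment can be rechecked with a few lines of arithmetic. The sketch below only reuses numbers stated above; the helper name and variable names are made up for illustration.

```python
def pct_of(part_pct, whole_pct):
    """Return part_pct percent of whole_pct percent, as a percentage."""
    return part_pct * whole_pct / 100.0


# SpamCop top-200 share of spam, scaled by the RCVD_IN_BL_SPAMCOP_NET
# hit rate on the SA test corpus (all values in percent).
spamcop_hit_rate = 56.7
adjusted = pct_of(1.356, spamcop_hit_rate)
adjusted_optimistic = pct_of(1.556, spamcop_hit_rate)

# Share of all ham that would become FPs with T_KHOP_SC_CIDR8 at 2.0:
# 4.87% of the 0.8152% of ham that the rule hits.
fp_share_at_2 = pct_of(4.87, 0.8152)

# Ratio of the adjusted SpamCop estimate to the measured union hit rate.
ratio_to_measured = adjusted / 0.4341

print(round(adjusted, 3), round(adjusted_optimistic, 3))
print(round(fp_share_at_2, 4), round(ratio_to_measured, 2))
```

The optimistic figure comes out a hair below the 0.883% quoted above, which is consistent with that value having been rounded up by hand.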
