[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #185 from Henrik Krohns h...@hege.li 2010-01-05 10:47:51 UTC --- I have a hunch that FREEMAIL_ENVFROM_END_DIGIT has a bit too high score (1.553). Probably there wasn't enough nicedude90 ham in corpora. Strangely FREEMAIL_REPLYTO_END_DIGIT has a lower score, one would think it would be safer FP wise.. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
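For readers unfamiliar with the rule names: the "nicedude90" remark refers to freemail addresses whose local part ends in a digit. The sketch below only illustrates that idea. It is not the rule's actual definition, and the domain list is a hypothetical stand-in for the FreeMail plugin's own list.

```python
import re

# Hypothetical, illustrative freemail domain list (assumption, not the plugin's list).
FREEMAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}


def ends_in_digit_on_freemail(address: str) -> bool:
    """Return True if the domain is in the freemail list and the local part ends in a digit."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local:
        return False  # not a usable address
    return domain.lower() in FREEMAIL_DOMAINS and re.search(r"\d$", local) is not None
```

A legitimate sender such as nicedude90 at a freemail domain matches just like a spammer would, which is why a score of 1.553 may be risky if such ham was rare in the corpora.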
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #180 from Mark Martinec mark.marti...@ijs.si 2009-12-02 07:31:01 UTC --- Mark, please correct me if I am wrong. But it seems only you can complete the final steps since we don't know exactly which subset of data you used. I'm doing it right now. The config.set* is already checked in, logs are being transferred, ... -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #181 from Mark Martinec mark.marti...@ijs.si 2009-12-02 10:48:45 UTC ---

Ok, I think I'm done now (RescoreMassCheck):

5. generate scores for score sets

  svn commit -m "runGA config files used" masses/config.set*
  r886173 | mmartinec | 2009-12-02 16:24:32 +0100 (Wed, 02 Dec 2009) | 1 line
  runGA config files used

  tar cvf rescore-logs.tar gen-set{0,1,2,3}-*

6. upload the test logs to zone (spamassassin.zones.apache.org):

  sudo mkdir /home/corpus-rsync/ARCHIVE/3.3.0
  sudo mv rescore-logs.tar.bz2 \
    /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tar.bz2
  ls -l /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tar.bz2
  -rw-r--r-- 1 mmartinec other 20380424 Dec 2 18:23
    /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tar.bz2

6.5. mark evolved-score rules as 'always published'

  ./force-publish-active-rules ../rules/active.list ../rulesrc/10_force_active.cf
  svn commit -m "force publish of rescored rules" ../rulesrc/10_force_active.cf
  r886212 | mmartinec | 2009-12-02 18:33:57 +0100 (Wed, 02 Dec 2009) | 3 lines
  Bug 6155: generated new rulesrc/10_force_active.cf
  as per step 6.5 in RescoreMassCheck

6.6. fix test failures

  nothing to tweak, all tests pass

7. upload proposed new scores

  done some time ago, some tweaks later:
  r881159 | wtogami | 2009-11-17 06:35:00 +0100 (Tue, 17 Nov 2009) | 2 lines
  Bug #6155 commit raw scores from Comment #146 as documented in #162.

  To view the diffs:
  svn diff -r 881158:886232 rules/50_scores.cf

8. Make the stats files

  cp config.set0 config ; bash ./runGA stats
  cp config.set1 config ; bash ./runGA stats
  cp config.set2 config ; bash ./runGA stats
  cp config.set3 config ; bash ./runGA stats

8(.1) upload new stats files

  r886232 | mmartinec | 2009-12-02 19:11:35 +0100 (Wed, 02 Dec 2009) | 2 lines
  rules/STATISTICS-set*.txt

Attach the new proposed STATISTICS*.txt as a patch to the rescoring bug

  too many differences, just do a:
  svn diff -c886232
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #183 from Warren Togami wtog...@redhat.com 2009-12-02 11:43:16 UTC --- Why is active.list (the result of auto-promotion) relevant as input to this script? Seems kind of like circular logic that makes no sense. + SPAMMY_MIME_BDRY_01 force-publish-active-rules added a few lines like this that have no scores assigned in rules/50_scores.cf. It seems what I already did by copying rule names from rules/50_scores.cf into rulesrc/10_force_active.cf is more correct? If so, then it appears we are ready for beta if we can clear up the GPG key issue in Bug #6223. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #175 from Justin Mason j...@jmason.org 2009-12-01 05:08:47 UTC --- 10_force_active.cf is generated at this step in the RescoreMassCheck process (see https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155#c3): 6.5. mark evolved-score rules as 'always published' sounds like we could be missing a few steps if that got missed... -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 Mark Thomas ma...@apache.org changed: What|Removed |Added CC|ma...@apache.org| -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #176 from Warren Togami wtog...@redhat.com 2009-12-01 08:50:38 UTC --- http://wiki.apache.org/spamassassin/RescoreMassCheck Mark, did you do these steps? 6. upload the test logs to zone 8. Make the stats files 8. upload new stats files -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #177 from Mark Martinec mark.marti...@ijs.si 2009-12-01 09:17:58 UTC --- Mark, did you do these steps? 6. upload the test logs to zone 8. Make the stats files 8. upload new stats files No, I left at the '5. generate scores for score sets', I only attached the score file for considerations. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #178 from Warren Togami wtog...@redhat.com 2009-12-01 10:28:26 UTC --- Mark, it appears that only you can do those steps? -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 Mark Thomas ma...@apache.org changed: What|Removed |Added CC||ma...@apache.org --- Comment #174 from Mark Thomas ma...@apache.org 2009-11-30 13:40:07 UTC --- Restoring comment originally made by Mark Martinec (In reply to comment #171) Btw, the: prove xt/10_rule_test_suite.t is failing for several rules. Can someone more familiar with rules please check where the reported problems lie? Actually it's just two rules failing on multiple tests: FM_FRM_RN_L_BRACK and TVD_SPACE_RATIO. Luckily their score is zero or near zero: score TVD_SPACE_RATIO 0.001 score FM_FRM_RN_L_BRACK 0 | Changed score of FM_FRM_RN_L_BRACK from 0 into 0.001, | to make xt/10_rule_test_suite.t happy. | Sending rules/50_scores.cf | Committed revision 884927. So that leaves the TVD_SPACE_RATIO. Is it something to worry about? -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #173 from Warren Togami wtog...@redhat.com 2009-11-27 09:13:25 UTC ---

Sending rulesrc/10_force_active.cf
Transmitting file data .
Committed revision 884912.

Please review.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #174 from Mark Martinec mark.marti...@ijs.si 2009-11-27 10:03:42 UTC ---

(In reply to comment #171)
| Btw, the: prove xt/10_rule_test_suite.t
| is failing for several rules. Can someone more familiar with rules
| please check where the reported problems lie?

Actually it's just two rules failing on multiple tests:
FM_FRM_RN_L_BRACK and TVD_SPACE_RATIO.
Luckily their score is zero or near zero:

  score TVD_SPACE_RATIO 0.001
  score FM_FRM_RN_L_BRACK 0

| Changed score of FM_FRM_RN_L_BRACK from 0 into 0.001,
| to make xt/10_rule_test_suite.t happy.
| Sending rules/50_scores.cf
| Committed revision 884927.

So that leaves the TVD_SPACE_RATIO. Is it something to worry about?
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #172 from Daryl C. W. O'Shea spamassas...@dostech.ca 2009-11-26 17:24:49 UTC --- Warren, The file was originally used to list all *rules from sandboxes* that had scores assigned by the GA so that they didn't get auto-demoted leaving a score line but no rule. I don't think its use has changed, but I'm not completely up-to-date on the re-org of the rules source structure. jm might have a script to generate the file... although it's been a long time. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #169 from Warren Togami wtog...@redhat.com 2009-11-23 20:08:06 UTC --- spamassassin/trunk/rulesrc/10_force_active.cf It seems this file needs to be updated after the rescoring. Should all the rules in 50_scores.cf be listed in 10_force_active.cf? Even the rules that are zeroed out in 50_scores.cf? -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #168 from Justin Mason j...@jmason.org 2009-11-20 15:10:05 UTC ---

(In reply to comment #167)
| locally, I've have lowered the MISSING_HB_SEP score to 0.5
| lottsa funky ERP stuff seems to have a talent to FP on it.
| its great for metas but usually triggers scores close to FP with the usual
| suspects their very ugly HTML formatting.
| (sorry, cannot supply samples)
| I'd say 2.5 is sorta high

ok -- I was under the impression it was FP-free. 0.5 works for me in that case.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #164 from Mark Martinec mark.marti...@ijs.si 2009-11-17 03:03:22 UTC --- It appears that tests here are failing after commit because rules required by this test were zeroed out. It seems these rules have almost zero hits in masscheck. What should we do about this? Bug 6155 #163: force nonzero scores on MISSING_HB_SEP and X_MESSAGE_INFO for the test Sending t/missing_hb_separator.t Committed revision 881240. I hope this is the right approach. Alternative would be to introduce a file similar to t/data/01_test_rules.cf to hold score overrides, but with a name like 51_test_rules.cf to be sorted after the 50_scores.cf. Btw, is the 01_ in the name intentional, or could the existing file just be renamed to something like 99_test_rules.cf ? -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
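The file-naming question above comes down to load order. A minimal sketch, assuming configuration files are read in sorted filename order and that a later score line for a rule overrides an earlier one (the `effective_scores` helper and the dictionaries are hypothetical illustrations, not SpamAssassin code):

```python
def effective_scores(files):
    """files maps filename -> {rule: score}; files later in sorted order win."""
    merged = {}
    for name in sorted(files):
        merged.update(files[name])
    return merged


# With the override in 01_test_rules.cf, 50_scores.cf is read later and zeroes it.
early = effective_scores({
    "01_test_rules.cf": {"MISSING_HB_SEP": 2.5},
    "50_scores.cf": {"MISSING_HB_SEP": 0.0},
})

# With the override in 51_test_rules.cf, it is read after 50_scores.cf and sticks.
late = effective_scores({
    "50_scores.cf": {"MISSING_HB_SEP": 0.0},
    "51_test_rules.cf": {"MISSING_HB_SEP": 2.5},
})
```

Under that assumption, a test override only takes effect if its file sorts after 50_scores.cf, which is the motivation for a 51_ or 99_ prefix.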
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #165 from Mark Martinec mark.marti...@ijs.si 2009-11-17 03:18:15 UTC ---

(In reply to comment #161)
| -score RDNS_NONE 0.1
| -score RDNS_DYNAMIC 0.1
| +# score RDNS_NONE 0 1.1 0 0.7
| +# score RDNS_DYNAMIC 0 0.5 0 0.5
|
| Doesn't commented out mean 1 point?

It would mean 1 point, if there were no other score lines for these two rules:

  score RDNS_DYNAMIC 2.639 0.363 1.663 0.982
  score RDNS_NONE 2.399 1.274 1.228 0.793

| These are supposed to be informational rules according to the comment.
| Is this supposed to become commented out?

Comment 116, 120, 124, 137, 139.
I left it mutable, I think it still makes sense - it's kind of a poor man's Botnet plugin.
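The point about commented-out score lines can be made concrete with a small parser. This is a rough sketch, not SpamAssassin's own config parser: it assumes a rule with no active score line defaults to 1.0, that a single value applies to all four score sets, and it ignores relative scores and other syntax.

```python
DEFAULT_SCORE = 1.0  # assumed default when a rule has no active score line


def parse_scores(cf_text):
    """Collect active 'score NAME v0 [v1 v2 v3]' lines; commented lines are ignored."""
    scores = {}
    for raw in cf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments, including whole-line ones
        parts = line.split()
        if len(parts) < 3 or parts[0] != "score":
            continue
        values = [float(v) for v in parts[2:]]
        if len(values) == 1:
            values = values * 4  # one value covers all four score sets
        if len(values) == 4:
            scores[parts[1]] = values
    return scores


def score_for(scores, rule, score_set):
    """Score of a rule in score set 0..3, falling back to the assumed default."""
    return scores.get(rule, [DEFAULT_SCORE] * 4)[score_set]
```

With both the commented line and the four-value line present, the active line decides the score; with only the commented line, the rule would fall back to 1 point.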
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #166 from Justin Mason j...@jmason.org 2009-11-17 07:41:11 UTC --- (In reply to comment #164) It appears that tests here are failing after commit because rules required by this test were zeroed out. It seems these rules have almost zero hits in masscheck. What should we do about this? Bug 6155 #163: force nonzero scores on MISSING_HB_SEP and X_MESSAGE_INFO for the test Sending t/missing_hb_separator.t Committed revision 881240. I hope this is the right approach. Alternative would be to introduce a file similar to t/data/01_test_rules.cf to hold score overrides, but with a name like 51_test_rules.cf to be sorted after the 50_scores.cf. Btw, is the 01_ in the name intentional, or could the existing file just be renamed to something like 99_test_rules.cf ? X_MESSAGE_INFO can be dropped, but MISSING_HB_SEP should not have been made mutable; I'd say lock to 2.5. btw it is to be expected that with less mutability the scores become slightly less optimal for the rescoring corpus; this always happens. If scores are allowed to wander without locking down the unsafe rules, the GA will overfit to the training data and produce great FP/FN figures, but scores that are risky for real world usage. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 AXB alex.ur...@gmail.com changed: What|Removed |Added CC||alex.ur...@gmail.com --- Comment #167 from AXB alex.ur...@gmail.com 2009-11-17 07:56:17 UTC --- (In reply to comment #166) (In reply to comment #164) It appears that tests here are failing after commit because rules required by this test were zeroed out. It seems these rules have almost zero hits in masscheck. What should we do about this? Bug 6155 #163: force nonzero scores on MISSING_HB_SEP and X_MESSAGE_INFO for the test Sending t/missing_hb_separator.t Committed revision 881240. I hope this is the right approach. Alternative would be to introduce a file similar to t/data/01_test_rules.cf to hold score overrides, but with a name like 51_test_rules.cf to be sorted after the 50_scores.cf. Btw, is the 01_ in the name intentional, or could the existing file just be renamed to something like 99_test_rules.cf ? X_MESSAGE_INFO can be dropped, but MISSING_HB_SEP should not have been made mutable; I'd say lock to 2.5. btw it is to be expected that with less mutability the scores become slightly less optimal for the rescoring corpus; this always happens. If scores are allowed to wander without locking down the unsafe rules, the GA will overfit to the training data and produce great FP/FN figures, but scores that are risky for real world usage. locally, I've have lowered the MISSING_HB_SEP score to 0.5 lottsa funky ERP stuff seems to have a talent to FP on it. its great for metas but usually triggers scores close to FP with the usual suspects their very ugly HTML formatting. (sorry, cannot supply samples) I'd say 2.5 is sorta high Axb -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #159 from Justin Mason j...@jmason.org 2009-11-16 16:27:51 UTC --- will we go ahead and check in those scores, anyway? that would allow another beta (soon). re: HTML_IMAGE_RATIO_* -- it's very common for that kind of multi-valued set of rules to wind up with nonintuitive scoring. This happens from either low hitrates or hitting alongside other (better) rules. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #160 from Warren Togami wtog...@redhat.com 2009-11-16 18:28:03 UTC --- (In reply to comment #142) Seems to me that many / most(?) HABEAS_ACCREDITED_SOI supposedly false positives are due to freelotto.com mail. I wonder whether such samples are rightfully in the spam* corpora - I'd say yes, but, as they say, spam is about consent, not content, and people receiving mail from freelotto.com most likely did register once, not realizing what they are dealing with. So there was a consent, at least initially. It is also about fraud and advertising, so, should one leave such mail samples in the spam corpus or not? Perhaps we should explicitly exclude known sketchy senders like freelotto.com from HABEAS_ACCREDITED_SOI. This would allow us to more easily monitor for clear violators by not being distracted by the common FP's. Exclusion in this case only brings the listed back to neutral which is pretty clearly a good idea. Any objections? Otherwise I'll file a separate bug for this. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #161 from Warren Togami wtog...@redhat.com 2009-11-16 19:27:50 UTC ---

-score RDNS_NONE 0.1
-score RDNS_DYNAMIC 0.1
+# score RDNS_NONE 0 1.1 0 0.7
+# score RDNS_DYNAMIC 0 0.5 0 0.5

These are supposed to be informational rules according to the comment. Is this supposed to become commented out? Doesn't commented out mean 1 point?
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #162 from Warren Togami wtog...@redhat.com 2009-11-16 21:28:44 UTC ---

fp-fn-statistics across the entire rescore logs.

Set 3 Before
===
# SUMMARY for threshold 5.0:
# Correctly non-spam:  703647  99.90%
# Correctly spam:     2559525  98.28%
# False positives:        719   0.10%
# False negatives:      44795   1.72%
# TCR(l=50): 32.253638  SpamRecall: 98.280%  SpamPrec: 99.972%

Set 3 Raw Rescoring from Comment #146
==
# SUMMARY for threshold 5.0:
# Correctly non-spam:  703520  99.88%
# Correctly spam:     2548134  97.84%
# False positives:        846   0.12%
# False negatives:      56186   2.16%
# TCR(l=50): 26.443555  SpamRecall: 97.843%  SpamPrec: 99.967%

Doesn't look like an improvement.

Set 3 + Rescore + Reductions
==
# SUMMARY for threshold 5.0:
# Correctly non-spam:  704002  99.95%
# Correctly spam:     2558896  98.26%
# False positives:        364   0.05%
# False negatives:      45424   1.74%
# TCR(l=50): 40.932981  SpamRecall: 98.256%  SpamPrec: 99.986%

Looks like a statistically insignificant improvement over the old scores. I only hope our corpora were sufficiently varied.

Rules Made Informational
==
TVD_RCVD_SPACE_BRACKET
MISSING_MIME_HB_SEP
FUZZY_CPILL
X_IP  Bug #5920 appears not fixed as claimed.
FRT_SOMA2
CTYPE_001C_B
MIME_BASE64_BLANKS
WEIRD_QUOTING
SPF_HELO_FAIL
HTML_IMAGE_RATIO_06
HTML_IMAGE_RATIO_08

Other Changes
* EXTRA_MPART_TYPE was left as 1.0 because while it does relatively poorly in the weekly masscheck, it did far better in rescore masscheck.
* I am increasing the scores of PSBL *after* the above fp-fn-statistics run because the old logs do not reflect its current safety level.

I am committing these changes now. I suspect the key to these reductions is getting rid of the rules that wouldn't have passed our ruleqa auto-promotion criteria? There might be additional tweaks to make. Please comment here.
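The summary figures in comment #162 can be recomputed from the four raw counts. A minimal sketch, assuming TCR is total spam divided by (lambda times false positives plus false negatives) with lambda = 50; the formula is inferred from the reported numbers, which it reproduces, and `fp_fn_summary` is a hypothetical helper rather than the project's fp-fn-statistics script.

```python
def fp_fn_summary(ham_ok, spam_ok, fp, fn, lam=50):
    """Derive TCR, spam recall, spam precision and FP rate from raw counts."""
    nspam = spam_ok + fn   # all spam messages in the corpus
    nham = ham_ok + fp     # all ham messages in the corpus
    return {
        "tcr": nspam / (lam * fp + fn),         # a false positive costs lam times a miss
        "recall": spam_ok / nspam,              # share of spam that was caught
        "precision": spam_ok / (spam_ok + fp),  # share of flagged mail that was spam
        "fp_rate": fp / nham,                   # share of ham wrongly flagged
    }


before = fp_fn_summary(703647, 2559525, 719, 44795)
raw = fp_fn_summary(703520, 2548134, 846, 56186)
reduced = fp_fn_summary(704002, 2558896, 364, 45424)
```

Because each false positive is weighted 50 times as heavily as a false negative, halving the false positives from 719 to 364 is what lifts TCR from about 32.3 to about 40.9, even though recall barely changes.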
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #163 from Warren Togami wtog...@redhat.com 2009-11-16 22:58:57 UTC ---

http://hudson.zones.apache.org/hudson/job/SpamAssassin-trunk/4344/testReport/

-score MISSING_HB_SEP 2.5
+# score MISSING_HB_SEP 2.5
+score MISSING_HB_SEP 0 # n=0 n=1 n=2
-score X_MESSAGE_INFO 3.499 3.496 3.330 1.597
+score X_MESSAGE_INFO 0 # n=0 n=1 n=2 n=3

It appears that tests here are failing after commit because rules required by this test were zeroed out. It seems these rules have almost zero hits in masscheck. What should we do about this?
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #157 from Warren Togami wtog...@redhat.com 2009-11-12 10:07:55 UTC ---

TVD_RCVD_SPACE_BRACKET
MISSING_MIME_HB_SEP
FUZZY_CPILL
X_IP  Bug #5920 appears not fixed as claimed.
FRT_SOMA2
CTYPE_001C_B
MIME_BASE64_BLANKS
WEIRD_QUOTING
SPF_HELO_FAIL
EXTRA_MPART_TYPE

It appears to be correct to zero out these rules, or at least make them informational.

spamassassin-3.2.5
score HTML_IMAGE_RATIO_02 1.518 0.550 0.573 0.383
score HTML_IMAGE_RATIO_04 1.561 0.170 0.863 0.172
score HTML_IMAGE_RATIO_06 0.401 0.001 0.501 0.001
score HTML_IMAGE_RATIO_08 0.203 0.001 0.179 0.001

attachment 4565 resulting 50_scores.cf from garescorer runs - V5
score HTML_IMAGE_RATIO_02 2.199 0.805 1.200 0.437
score HTML_IMAGE_RATIO_04 2.089 0.610 0.607 0.556
score HTML_IMAGE_RATIO_06 1.799 0.579 0.901 0.882
score HTML_IMAGE_RATIO_08 1.410 0.351 0.874 0.021

The old scores showed a more linear relationship, with a sharp drop-off between _04 and _06. Our masscheck results indicate _02 and _04 hit on more spam than ham, but _06 and _08 are pretty worthless. I think we should zero out _06 and _08 while reducing the scores of _02 and _04.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #158 from Adam Katz antis...@khopis.com 2009-11-12 16:20:15 UTC ---

(In reply to comment #157)
| spamassassin-3.2.5
| score HTML_IMAGE_RATIO_02 1.518 0.550 0.573 0.383
| score HTML_IMAGE_RATIO_04 1.561 0.170 0.863 0.172
| score HTML_IMAGE_RATIO_06 0.401 0.001 0.501 0.001
| score HTML_IMAGE_RATIO_08 0.203 0.001 0.179 0.001
|
| attachment 4565 [details] resulting 50_scores.cf from garescorer runs - V5
| score HTML_IMAGE_RATIO_02 2.199 0.805 1.200 0.437
| score HTML_IMAGE_RATIO_04 2.089 0.610 0.607 0.556
| score HTML_IMAGE_RATIO_06 1.799 0.579 0.901 0.882
| score HTML_IMAGE_RATIO_08 1.410 0.351 0.874 0.021
|
| The old scores showed a more linear relationship, with a sharp drop-off
| between _04 and _06. Our masscheck results indicate _02 and _04 hit on more
| spam than ham, but _06 and _08 are pretty worthless. I think we should zero
| out _06 and _08 while reducing the scores of _02 and _04.

I didn't mention _08 because it wasn't a remarkable enough margin of HAM SPAM (my script only reports if HAM% + 0.05 SPAM%) and my hand-sampling utilized S/O ratios under .250 while this rule is .320. Still, it has the problem:

SPAM%   HAM%    S/O    RANK  SCORE  NAME                 DateRev
0.2709  0.5491  0.330  0.34  0.20   HTML_IMAGE_RATIO_08  2009-r834803-n
0.2717  0.5492  0.331  0.34  0.20   HTML_IMAGE_RATIO_08  20091110-r834389-n
0.2672  0.5493  0.327  0.34  0.20   HTML_IMAGE_RATIO_08  20091109-r833997-n
0.2075  0.4995  0.294  0.34  0.20   HTML_IMAGE_RATIO_08  20091104-r832683-n
0.2548  0.5476  0.318  0.34  0.20   HTML_IMAGE_RATIO_08  20091028-r830464-n

Here are the results from the 2009-r834803-n set, pruning only rules scoring under 0.2 (all hits from my last report are present and asterisked):

S/O   RANK  HAM%    SPAM%    Score in attachment 4565  Rule
.014  .15   0.6328   0.0093  0.001 0.001 0.131 0.700   TVD_RCVD_SPACE_BRACKET*
.015  .24   0.1927   0.0029  0.000 2.099 0.001 1.711   MISSING_MIME_HB_SEP*
.019  .22   0.2528   0.0049  1.482 0.855 2.399 2.399   FUZZY_CPILL*
.043  .29   0.1298   0.0059  0.001 1.699 1.498 1.699   X_IP*
.075  .35   0.0603   0.0049  0.000 0.001 0.308 0.001   HTML_NONELEMENT_30_40
.092  .21   0.8123   0.0825  0.699 0.332 0.480 0.800   MIME_BASE64_BLANKS*
.106  .25   0.2483   0.0293  0.551 1.026 1.033 1.250   CTYPE_001C_B*
.123  .33   0.0837   0.0117  0.001 0.648 0.836 1.293   TVD_FW_GRAPHIC_NAME_LONG
.123  .28   0.1632   0.0229  0.001 2.499 0.392 0.164   DRUGS_MUSCLE(*)
.130  .25   0.3663   0.0547  2.385 0.345 0.998 2.503   FRT_SOMA2*
.155  .29   0.1736   0.0317  0.001 0.001 0.001 1.741   MIME_BASE64_TEXT
.188  .27   0.4622   0.1069  0     0.973 0     2.385   SPF_HELO_FAIL*
.214  .31   0.1449   0.0395  2.200 2.199 0.540 2.199   WEIRD_QUOTING*
.239  .30   0.8321   0.2612  1.799 0.579 0.901 0.882   HTML_IMAGE_RATIO_06*
.254  .34   1.3070   0.4442  1.0                       EXTRA_MPART_TYPE*
.330  .34   0.5491   0.2709  1.410 0.351 0.874 0.021   HTML_IMAGE_RATIO_08
.363  .38   1.0856   0.6194  2.600 2.070 1.233 3.405   DATE_IN_PAST_96_XX
.368  .36   0.3029   0.1767  0.001 0.791 0.001 0.008   UPPERCASE_50_75
.381  .37   0.6473   0.3983  0.354 0.001 0.725 0.428   MIME_HTML_MOSTLY
.660  .51   1.8514   3.5893  0.518 1.625 1.197 1.506   SUBJ_ALL_CAPS
.905  .58   1.0822  10.2987  0     1.246 0     1.347   RCVD_IN_BL_SPAMCOP_NET
.934  .56   3.6172  51.2001  2.199 1.105 1.199 0.723   MIME_HTML_ONLY
.957  .52   2.2200  50.3063  2.399 1.274 1.228 0.793   RDNS_NONE

DRUGS_MUSCLE met all the requirements I set for my last report, but I removed it because it had almost no hits anyway, and it scored very very low except on net+no-bayes, so I was assuming it had some justification there somehow.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #154 from Warren Togami wtog...@redhat.com 2009-11-11 11:38:13 UTC --- (In reply to comment #152) | Please manually adjust the scores of RCVD_IN_PSBL up. At the time of the | rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a | number of major ISP's. As a result, for 5 weeks straight RCVD_IN_PSBL has | been almost completely devoid of FP's in our weekly masschecks. I am | confident that PSBL performs safer than measured during the rescore masscheck Ok, I suggest we collect some manual fixes like the ones suggested here (with specific score suggestions), and wrap it up. Let's just go ahead with committing as jm suggested in Comment #153 and make the manual adjustments after that in separate commits each with explanations. RCVD_IN_PSBL I suggest 2.7 for both network sets. Adam Katz in Comment #153 makes a good argument for reducing those rules to informational. Any comments on that? -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Adam Katz antis...@khopis.com changed:

           What            |Removed |Added
  Attachment #4564 is obsolete|0       |1

--- Comment #153 from Adam Katz antis...@khopis.com 2009-11-09 15:40:31 UTC ---

Created an attachment (id=4568) -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4568)
Checker for rules that match more ham than spam

Collected selections from several more runs of my script. I took the last three days' worth of masschecks plus the run last week, hand-picked rules with a high score (~1.0+) but low S/O (~0.250-), and then looked for repeat offenders. This is the list, with each rule's worst S/O of any run:

S/O   RANK  HAM%    SPAM%   Score attachment 4565     Rule
.002  .14   1.2650  0.0024  0.001 0.001 0.131 0.700   TVD_RCVD_SPACE_BRACKET
.002  .23   0.4472  0.0008  0.000 2.099 0.001 1.711   MISSING_MIME_HB_SEP
.019  .22   0.2529  0.0049  1.482 0.855 2.399 2.399   FUZZY_CPILL
.019  .29   0.2809  0.0056  0.001 1.699 1.498 1.699   X_IP
.046  .22   0.4010  0.0193  2.385 0.345 0.998 2.503   FRT_SOMA2
.077  .25   0.2643  0.0221  0.551 1.026 1.033 1.250   CTYPE_001C_B
.092  .21   0.8712  0.0878  0.699 0.332 0.480 0.800   MIME_BASE64_BLANKS
.095  .31   0.2735  0.0286  2.200 2.199 0.540 2.199   WEIRD_QUOTING
.178  .28   0.4948  0.1069  0     0.973 0     2.385   SPF_HELO_FAIL
.195  .29   0.8975  0.2173  1.799 0.579 0.901 0.882   HTML_IMAGE_RATIO_06
.241  .34   1.4248  0.4529  1.0                       EXTRA_MPART_TYPE

I don't think it wise to release with these scores quite so high. I propose we score them all 0.1 or 0.001 so as to not hold up the release and bookmark the issue (likely a bug in the GA, probably best registered as its own bugzilla bug) for dealing with later.

Additionally, I've updated my script to do the reverse - seek out negatively scored rules that hit more spam than ham. This doesn't currently find anything beyond SPF_PASS (due to having =1% spam hits, while it was previously found for having hamspam), but it does prevent listing SPF_HELO_PASS and theoretically will help find poorly-written ham rules in the future.
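The selection described in comment #153 can be sketched in a few lines. This is not the attached script. It assumes S/O is SPAM% / (SPAM% + HAM%), which matches the table figures, and it treats the "~1.0+" and "~0.250-" cut-offs as adjustable parameters; `so_ratio` and `suspicious` are hypothetical names.

```python
def so_ratio(spam_pct, ham_pct):
    """Spam/overall ratio: near 0 means mostly ham hits, near 1 mostly spam hits."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0


def suspicious(rows, max_so=0.25, min_score=1.0):
    """rows: (rule, ham_pct, spam_pct, scores). Flag high-scoring rules that hit mostly ham."""
    return [
        rule
        for rule, ham_pct, spam_pct, scores in rows
        if so_ratio(spam_pct, ham_pct) <= max_so and max(scores) >= min_score
    ]


# A few rows taken from the tables in this thread.
rows = [
    ("FUZZY_CPILL", 0.2529, 0.0049, [1.482, 0.855, 2.399, 2.399]),
    ("TVD_RCVD_SPACE_BRACKET", 1.2650, 0.0024, [0.001, 0.001, 0.131, 0.700]),
    ("RDNS_NONE", 2.2200, 50.3063, [2.399, 1.274, 1.228, 0.793]),
]
flagged = suspicious(rows)
```

With a strict 1.0 score cut-off only FUZZY_CPILL is flagged from these rows; the hand-picked list above evidently used a looser cut-off, since it also includes rules whose highest score is 0.700 or 0.800.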
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #152 from Mark Martinec mark.marti...@ijs.si 2009-11-08 16:36:24 UTC ---

| A new run, this time I left the URIBL whitelists and similar fixed (at their
| relatively high manual scores) as they were in current 50_scores.cf

Or to say it better: unlike my previous runs where I commented out most scores in the existing 50_scores.cf (thus making them mutable, regardless of a gen:mutable markup) except for a couple of exceptions, this time I did not comment-out scores, and let gen:mutable markup do its job. So this is now more like how it was intended to run GA.

| After a little examination, they look good to me! +1 to check in.

Thanks. I'm sure we can still do some manual tweaks and improvements, but perhaps we can indeed freeze the rest to automatically assigned scores in this run.

| btw if you feel like cranking up the max gens, go for it. fwiw,
| spamassassin2.zones has a very powerful CPU -- if it's taking too long on your
| own machine, try scping stuff up and running it there.

My office workstation is quite beefy too, and I hope we won't need to do many further runs, so for now I'd just stick to what I'm familiar with. Btw, my set3 run at 14000 iterations takes 5 hours, similar for set1, the other two are much faster (less than 30 minutes each). I just let it run overnight, so it wouldn't matter if it takes half that time.

I did some previous runs at 3 iterations, and a diagram (like the one attached earlier) does not show noticeable improvements beyond about 1, or even small worsening by the end, so the 14000 limit seems reasonable. And the GA algorithms are said to be prone to overfitting, so it's probably prudent not to go too far.

| RCVD_IN_XBL is still surprisingly low -- I bet there's some additive behaviour
| overlapping between XBL and PBL, though. RCVD_IN_SBL is _very_ low in set 3
| too, bizarre! otherwise I can't see any issues

| Please manually adjust the scores of RCVD_IN_PSBL up. At the time of the
| rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a
| number of major ISP's. As a result, for 5 weeks straight RCVD_IN_PSBL has
| been almost completely devoid of FP's in our weekly masschecks. I am
| confident that PSBL performs safer than measured during the rescore masscheck

Ok, I suggest we collect some manual fixes like the ones suggested here (with specific score suggestions), and wrap it up.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #150 from Justin Mason j...@jmason.org 2009-11-07 13:33:19 UTC ---

(In reply to comment #146)
> Created an attachment (id=4565)
> -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4565) [details]
> resulting 50_scores.cf from garescorer runs - V5
>
> A new run, this time I left the URIBL whitelists and similar fixed (at their
> relatively high manual scores) as they were in current 50_scores.cf

After a little examination, they look good to me! +1 to check in.

RCVD_IN_XBL is still surprisingly low -- I bet there's some additive behaviour overlapping between XBL and PBL, though. RCVD_IN_SBL is _very_ low in set 3 too, bizarre! otherwise I can't see any issues

btw if you feel like cranking up the max gens, go for it. fwiw, spamassassin2.zones has a very powerful CPU -- if it's taking too long on your own machine, try scping stuff up and running it there.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #151 from Warren Togami wtog...@redhat.com 2009-11-07 15:46:54 UTC ---

Please manually adjust the scores of RCVD_IN_PSBL up. At the time of the rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a number of major ISP's. As a result, for 5 weeks straight RCVD_IN_PSBL has been almost completely devoid of FP's in our weekly masschecks. I am confident that PSBL performs safer than measured during the rescore masscheck.

http://ruleqa.spamassassin.org/20090829-r809102-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090905-r811608-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090912-r814117-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090919-r816871-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090926-r819101-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091003-r821273-n/RCVD_IN_PSBL/detail
(below this point FP rate dropped to nearly zero)
http://ruleqa.spamassassin.org/20091010-r823821-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091017-r826198-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091024-r829323-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091031-r831520-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091107-r833654-n/RCVD_IN_PSBL/detail

You can plainly see steady and sustained improvement in FP safety in these past weeks.

RCVD_IN_PSBL in the rescore masscheck was without lastexternal. Clearly with the added limitation of lastexternal it is safer than measured.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Adam Katz antis...@khopis.com changed:

           What    |Removed |Added
   Attachment #4561|0       |1
        is obsolete|        |

--- Comment #145 from Adam Katz antis...@khopis.com 2009-11-04 15:52:15 UTC ---

Created an attachment (id=4564)
-- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4564)
Checker for rules that match more ham than spam

Updated my checker to use S/O (now that I understand that stat). It also supports specifying the DateRev for the specific masscheck run. Since today's run was sparse, here are yesterday's results.

$ ./sa33badrules.pl 20091103-r832343-n
 S/O  RANK    HAM%    SPAM%  Score in attachment 4558  Rule
.008  .12   1.2401   0.0105  0.001                     MSGID_MULTIPLE_AT
.011  .22   0.3066   0.0035  0                         OBSCURED_EMAIL
.012  .25   0.2058   0.0025  0.000 2.099 0.001 1.212   MISSING_MIME_HB_SEP
.014  .17   0.5822   0.0080  0.001 0.001 0.699 0.699   TVD_RCVD_SPACE_BRACKET
.028  .20   0.4339   0.0125  unknown                   TVD_FUZZY_SECTOR
.042  .28   0.1732   0.0075  0                         SUBJECT_FUZZY_TION
.048  .77   4.4862   0.2279  -0.001                    SPF_HELO_PASS
.052  .29   0.1476   0.0080  1.494 1.699 1.591 1.516   X_IP
.055  .22   0.3914   0.0226  2.205 0.174 1.299 1.806   FRT_SOMA2
.062  .74   5.1484   0.3424  -0.001                    SPF_PASS
.077  .25   0.2643   0.0221  0.987 0.750 0.943 1.318   CTYPE_001C_B
.079  .36   0.0640   0.0055  0.001 0.001 0.605 0.378   HTML_NONELEMENT_30_40
.080  .28   0.1742   0.0151  0.001 2.499 0.268 0.516   DRUGS_MUSCLE
.084  .36   0.0660   0.0060  0                         FORGED_IMS_TAGS
.090  .32   0.1114   0.0110  0.033 0.001 0.365 0.413   WEIRD_PORT
.092  .21   0.8712   0.0878  1.499 0.419 0.904 0.798   MIME_BASE64_BLANKS
.102  .37   0.0577   0.0065  0                         HTML_IFRAME_SRC
.123  .34   0.0821   0.0115  0.003 0.978 0.100 1.515   TVD_FW_GRAPHIC_NAME_LONG
.128  .37   0.0614   0.0090  0                         RCVD_BAD_ID
.130  .29   0.1851   0.0276  0.001 0.020 0.001 1.799   MIME_BASE64_TEXT
.178  .28   0.4948   0.1069  0 1.200 0 2.514           SPF_HELO_FAIL
.202  .32   0.1590   0.0402  0.1                       ANY_BOUNCE_MESSAGE
.205  .35   0.0817   0.0211  2.199 1.622 2.199 1.086   LONGWORDS
.213  .34   0.1186   0.0321  0                         BLANK_LINES_80_90
.216  .32   0.1474   0.0407  2.199 2.199 1.246 2.090   WEIRD_QUOTING
.218  .32   0.1445   0.0402  0.1                       BOUNCE_MESSAGE
.223  .30   0.7605   0.2179  1.799 0.572 1.182 1.138   HTML_IMAGE_RATIO_06
.241  .34   1.3973   0.4438  1.0                       EXTRA_MPART_TYPE
.254  .34   0.1222   0.0417  0.001 2.185 1.936 0.476   FRT_SOMA
.283  .33   0.6883   0.2711  0.539 0.001 0.332 0.488   MIME_HTML_MOSTLY
.299  .36   0.0908   0.0387  0.799 0.001 0.711 0.026   TVD_FW_GRAPHIC_NAME_MID
.303  .34   0.4938   0.2143  1.899 0.496 0.950 0.445   HTML_IMAGE_RATIO_08
.367  .40   1.2775   0.7409  0.001                     TVD_SPACE_RATIO
.379  .37   0.3182   0.1943  0.023 0.887 0.000 0.417   UPPERCASE_50_75
.434  .39   0.3261   0.2505  3.099 1.823 1.802 1.998   BAD_ENC_HEADER
.436  .46  15.3798  11.8920  0.001                     FREEMAIL_FROM
.454  .41   0.5503   0.4573  2.260 0.742 1.199 0.640   MPART_ALT_DIFF
.516  .47   3.6581   3.9024  0.001                     MIME_QP_LONG_LINE
.655  .51   1.9537   3.7036  1.154 1.677 1.198 1.453   SUBJ_ALL_CAPS
.665  .49  42.2269  83.7383  0.001                     HTML_MESSAGE
.692  .52   1.1850   2.6580  0.001                     UNPARSEABLE_RELAY
.922  .58   1.1584  13.7423  0 1.322 0 1.237           RCVD_IN_BL_SPAMCOP_NET
.935  .57   3.5421  50.6034  2.199 0.955 1.215 0.549   MIME_HTML_ONLY
.970  .52   1.5729  51.1430  0 1.1 0 0.7               RDNS_NONE

Note, I hacked RDNS_NONE so that it removes the Enron hits.

Problem rules this week include X_IP, EXTRA_MPART_TYPE, FRT_SOMA2, and BAD_ENC_HEADER (scored 3.099?!).

Food for thought: while it's good to create workarounds for the problematic outcomes from the genetic algorithm, I think that these should be examples with which to troubleshoot the algorithm itself. While this might just be an early sign of over-fitting (which is largely fine as long as we comb through the results with scripts like this), it might also be indicative of a problem in the system's prioritization.
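The S/O column in the checker output can be reproduced from the HAM% and SPAM% columns. The following is a minimal sketch, assuming S/O is the spam hit rate divided by the sum of the ham and spam hit rates; that formula is not spelled out in the thread, but it matches the listed rows. The function name `s_over_o` is made up for illustration and is not part of sa33badrules.pl.

```python
def s_over_o(ham_pct: float, spam_pct: float) -> float:
    """Spam/overall ratio from per-corpus hit percentages.

    Assumption: S/O = SPAM% / (HAM% + SPAM%). Values near 0 mean the
    rule hits mostly ham; values near 1 mean it hits mostly spam.
    """
    total = ham_pct + spam_pct
    if total == 0:
        raise ValueError("rule has no hits in either corpus")
    return spam_pct / total


# (rule, HAM%, SPAM%) copied from three rows of the table in Comment #145
rows = [
    ("MSGID_MULTIPLE_AT", 1.2401, 0.0105),
    ("SPF_PASS", 5.1484, 0.3424),
    ("RDNS_NONE", 1.5729, 51.1430),
]

for rule, ham_pct, spam_pct in rows:
    print(f"{s_over_o(ham_pct, spam_pct):.3f}  {rule}")
```

Because the inputs are percentages of each corpus rather than raw counts, the ratio is not skewed by the spam corpus being much larger than the ham corpus.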
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #144 from Warren Togami wtog...@redhat.com 2009-10-29 18:33:38 UTC ---

What is the next step in order to move forward?
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #141 from Mark Martinec mark.marti...@ijs.si 2009-10-28 09:02:40 UTC ---

> > But I agree that more may need re-fixing.
>
> cool. In particular, some of the DNSBLs and most of the DNSWLs are good to
> 'lock down', I feel, as users tend to 'compensate' or correct their scores
> more frequently than other rules -- in my opinion. Also, if those are given
> low scores by the GA, their operators tend to be annoyed, and it's not good
> to annoy people who we're relying on ;) It also reflects that those rules are
> slightly different, and hopefully more reliable, than a typical body rule for
> example -- there's no way to indicate this to the GA yet, so locking the
> rules is as good as we can do.

| It is quite possible that some of these hits are still false positives,
| despite several iterations of cleaning

I wonder how much the low score for some ham rules is affected by false positives present in the spam* corpora.

Here are some statistics for the more prominent ham rules (i.e. the ones with negative scores). For each rule the table shows the number of hits of this rule for each corpus - both as a percentage of all entries in a file, and as absolute counts. The entries standing out from the crowd that may need re-checking are labeled with *** :

score ALL_TRUSTED -1.000
   0.046 %        1/2194     spam-bayes-net-bb-kmcgrail
   0.017 %        4/23761    spam-bayes-net-mmartinec
   0.014 %        5/36941    spam-bayes-net-hege
   0.001 %        1/81265    spam-bayes-net-bluestreak
   0.000 %        1/931863   spam-bayes-net-dos

score BAYES_00 0 0 -1.2 -1.9
   5.652 %      104/1840     spam-bayes-net-bb-jhardin ***
   1.805 %      429/23761    spam-bayes-net-mmartinec
   1.606 %       33/2055     spam-bayes-net-ahenry
   0.439 %      357/81265    spam-bayes-net-bluestreak
   0.374 %      138/36941    spam-bayes-net-hege
   0.030 %      445/1489699  spam-bayes-net-jm
   0.017 %      156/931863   spam-bayes-net-dos

score DCC_REPUT_00_12 0 -0.8 0 -0.4
   0.164 %       39/23761    spam-bayes-net-mmartinec

score HABEAS_ACCREDITED_SOI 0 -1.634 0 -0.475
   5.382 %       76/1412     spam-bayes-net-bb-guenther_fraud ***
   0.272 %        5/1840     spam-bayes-net-bb-jhardin
   0.091 %        2/2194     spam-bayes-net-bb-kmcgrail
   0.059 %       14/23761    spam-bayes-net-mmartinec
   0.049 %       18/36941    spam-bayes-net-hege
   0.037 %      558/1489699  spam-bayes-net-jm
   0.030 %        2/6728     spam-bayes-net-wt-en1
   0.018 %       15/81265    spam-bayes-net-bluestreak
   0.000 %        1/931863   spam-bayes-net-dos

score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8
   0.163 %        3/1840     spam-bayes-net-bb-jhardin ***
   0.091 %        2/2194     spam-bayes-net-bb-kmcgrail
   0.071 %        1/1412     spam-bayes-net-bb-guenther_fraud
   0.003 %        1/36941    spam-bayes-net-hege
   0.000 %        1/1489699  spam-bayes-net-jm

score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
   1.250 %       23/1840     spam-bayes-net-bb-jhardin ***
  (1.108 %        7/632      spam-bayes-net-binnocenti.OFF)
   0.638 %       14/2194     spam-bayes-net-bb-kmcgrail
   0.469 %      381/81265    spam-bayes-net-bluestreak
   0.438 %        9/2055     spam-bayes-net-ahenry
   0.223 %       15/6728     spam-bayes-net-wt-en1
   0.214 %       79/36941    spam-bayes-net-hege
   0.046 %      682/1489699  spam-bayes-net-jm
   0.042 %        3/7185     spam-bayes-net-bb-zmi
   0.013 %        3/23761    spam-bayes-net-mmartinec
   0.010 %        2/19160    spam-bayes-net-wt-en4
   0.003 %       29/931863   spam-bayes-net-dos

score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
  16.153 %   240627/1489699  spam-bayes-net-jm ***
  (9.810 %       62/632      spam-bayes-net-binnocenti.OFF)
   1.739 %       32/1840     spam-bayes-net-bb-jhardin
   1.600 %      591/36941    spam-bayes-net-hege
   1.159 %       78/6728     spam-bayes-net-wt-en1
   1.133 %       16/1412     spam-bayes-net-bb-guenther_fraud
   0.925 %       19/2055     spam-bayes-net-ahenry
   0.365 %        8/2194     spam-bayes-net-bb-kmcgrail
   0.107 %       87/81265    spam-bayes-net-bluestreak
   0.097 %        7/7185     spam-bayes-net-bb-zmi
   0.022 %      201/931863   spam-bayes-net-dos
   0.021 %        5/23761    spam-bayes-net-mmartinec
   0.016 %        3/19160    spam-bayes-net-wt-en4

score RCVD_IN_BSP_TRUSTED 0 -0.001 0 -0.001
   5.312 %       75/1412     spam-bayes-net-bb-guenther_fraud ***
   0.030 %        2/6728     spam-bayes-net-wt-en1
   0.029 %        7/23761    spam-bayes-net-mmartinec
   0.029 %      435/1489699  spam-bayes-net-jm
   0.015 %       12/81265    spam-bayes-net-bluestreak
   0.003 %        1/36941    spam-bayes-net-hege
   0.001 %       11/931863   spam-bayes-net-dos

score RCVD_IN_IADB_DK 0 -0.044 0 -0.001
   0.059 %        4/6728     spam-bayes-net-wt-en1
   0.054 %        1/1840     spam-bayes-net-bb-jhardin
   0.033 %       27/81265    spam-bayes-net-bluestreak
   0.004 %        1/23761    spam-bayes-net-mmartinec
   0.001 %       21/1489699  spam-bayes-net-jm

score RCVD_IN_IADB_RDNS 0 -0.018 0 -0.001
   0.342 %       23/6728     spam-bayes-net-wt-en1 ***
   0.054 %        1/1840     spam-bayes-net-bb-jhardin
   0.049 %        1/2055     spam-bayes-net-ahenry
   0.033 %       27/81265    spam-bayes-net-bluestreak
   0.004 %        1/23761    spam-bayes-net-mmartinec
   0.002 %       26/1489699  spam-bayes-net-jm

score RCVD_IN_IADB_OPTIN 0 -3.265 0 -2.791
   0.342 %       23/6728     spam-bayes-net-wt-en1 ***
   0.049 %        1/2055     spam-bayes-net-ahenry
   0.000 %        4/1489699  spam-bayes-net-jm

score
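The percentages in the per-corpus listing are simply hits divided by the number of entries in each spam log. As a rough sketch of that arithmetic, using four of the BAYES_00 rows from Comment #141 (the function name and the dictionary layout are illustrative, not taken from any masses/ script):

```python
def hit_rate_pct(hits: int, total: int) -> float:
    """Percentage of messages in one corpus file that hit a rule."""
    return 100.0 * hits / total


# BAYES_00 hits in spam corpora as (hits, entries), from Comment #141
bayes_00 = {
    "spam-bayes-net-bb-jhardin": (104, 1840),
    "spam-bayes-net-mmartinec": (429, 23761),
    "spam-bayes-net-jm": (445, 1489699),
    "spam-bayes-net-dos": (156, 931863),
}

rates = {corpus: hit_rate_pct(h, t) for corpus, (h, t) in bayes_00.items()}

# Print highest rate first, the same ordering the listing uses
for corpus, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    hits, total = bayes_00[corpus]
    print(f"{rate:7.3f} %  {hits}/{total}  {corpus}")
```

Sorting by rate rather than by absolute count is what makes a small corpus such as bb-jhardin stand out: 104 hits is a small number next to 445, but it is a far larger share of that contributor's spam.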
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #142 from Mark Martinec mark.marti...@ijs.si 2009-10-28 10:23:19 UTC ---

Seems to me that many / most(?) HABEAS_ACCREDITED_SOI supposedly false positives are due to freelotto.com mail.

I wonder whether such samples are rightfully in the spam* corpora - I'd say yes, but, as they say, spam is about consent, not content, and people receiving mail from freelotto.com most likely did register once, not realizing what they are dealing with. So there was a consent, at least initially. It is also about fraud and advertising, so, should one leave such mail samples in the spam corpus or not?
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #143 from Mark Martinec mark.marti...@ijs.si 2009-10-28 10:41:31 UTC ---

> Seems to me that many / most(?) HABEAS_ACCREDITED_SOI supposedly false
> positives are due to freelotto.com mail.

Same applies to RCVD_IN_BSP_TRUSTED spam hits.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #136 from Justin Mason j...@jmason.org 2009-10-27 07:09:36 UTC ---

(In reply to comment #133)
> it looks like there might be a bit of a problem there -- definitely some
> rules that are in immutable sections, like the above, have been allowed to
> be mutable in ranges.data

just wondering, Mark, did you do this deliberately? or is it just a bug in the tool that it's ignoring the non-mutable flag for those rules for some reason?
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #137 from Mark Martinec mark.marti...@ijs.si 2009-10-27 14:18:14 UTC ---

> > it looks like there might be a bit of a problem there -- definitely some
> > rules that are in immutable sections, like the above, have been allowed to
> > be mutable in ranges.data
>
> just wondering, Mark, did you do this deliberately? or is it just a bug in
> the tool that it's ignoring the non-mutable flag for those rules for some
> reason?

Sort-of deliberately. Initially I followed the idea in wiki RescoreMassCheck section 4.2:

  'comment out all score lines except for rules that you think the scores are accurate like carefully-vetted net rules, or 0.001 informational rules'

which made perfect sense to me, so I did it for 50_scores.cf, except for a couple of rather obvious rules like _WHITELIST and similar, and the ones clearly indicated as 'indicators' only in the surrounding comments, or set to 0.001. Later I nailed a couple more.

I followed a principle: when in doubt, leave it floating, it can be fixed later if necessary. It gives some insight into what GA 'thinks' about certain rules.

I think at least for some rules GA makes perfect sense, like RDNS_NONE and RDNS_DYNAMIC. For some of them the GA result is close to the manually assigned score, or may indicate a need for reconsidering the assigned score. But I agree that more may need re-fixing.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #138 from Mark Martinec mark.marti...@ijs.si 2009-10-27 14:29:03 UTC ---

(In reply to comment #134)
> Some of the spam in my corpora is from third parties. I do check it for
> correct classification before uploading, but I was wondering: how does
> masscheck determine the correct lastexternal for corpora containing messages
> from multiple different networks? Or does it assume all of the messages in a
> given contributor's corpora have the same network boundary? If the latter, I
> need to remove those third-party messages from my spam corpora...
>
> Might lastexternal confusion in the masschecks be contributing in some way to
> the odd RCVD_IN_* score generation?

I believe the masschecks leave internal/external/msa_networks at their defaults, unless one cares to configure them correctly for one's corpus. And I believe that it is more likely than not that some corpora were scanned with unsuitable network settings. I know that configuring it for my mass-check runs gave me a headache (but I did it right in the end). Which is why I posted the following note on the ML at that time:

  From: Mark Martinec mark.martinec...@ijs.si
  To: dev@spamassassin.apache.org
  Subject: Re: SpamAssassin 3.3.0 mass-checks now starting
  Date: Fri, 4 Sep 2009 21:46:59 +0200

  Docs don't say where one is supposed to put a local.cf with options which are ignored in masses/spamassassin/user_prefs (like Bayes SQL options, DCC, Pyzor timeouts etc). I tried to place local.cf into masses/spamassassin/, with horror results (some directives in local.cf proclaimed as invalid, as apparently plugins have not yet been loaded at the time of parsing this file, but only later). I finally placed it into ../rules/ as mylocal.cf, which finally works as expected, but I wonder if this is the proper solution. Should be documented I guess...
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #139 from Justin Mason j...@jmason.org 2009-10-27 15:00:50 UTC ---

(In reply to comment #137)
> Sort-of deliberately. Initially I followed the idea in wiki RescoreMassCheck
> section 4.2:
>
>   'comment out all score lines except for rules that you think the scores are
>   accurate like carefully-vetted net rules, or 0.001 informational rules'
>
> which made perfect sense to me, so I did it for 50_scores.cf, except for a
> couple of rather obvious rules like _WHITELIST and similar, and the ones
> clearly indicated as 'indicators' only in the surrounding comments, or set to
> 0.001. Later I nailed a couple more.
>
> I followed a principle: when in doubt, leave it floating, it can be fixed
> later if necessary. It gives some insight into what GA 'thinks' about
> certain rules.

That's true. It's good to hear it's not a bug in the masses scripts, anyway ;)

> I think at least for some rules GA makes perfect sense, like RDNS_NONE and
> RDNS_DYNAMIC.

Yes, I agree, it's actually done a (surprisingly) good job with those.

> For some of them the GA result is close to the manually assigned score, or
> may indicate a need for reconsidering the assigned score. But I agree that
> more may need re-fixing.

cool. In particular, some of the DNSBLs and most of the DNSWLs are good to 'lock down', I feel, as users tend to 'compensate' or correct their scores more frequently than other rules -- in my opinion. Also, if those are given low scores by the GA, their operators tend to be annoyed, and it's not good to annoy people who we're relying on ;)

It also reflects that those rules are slightly different, and hopefully more reliable, than a typical body rule for example -- there's no way to indicate this to the GA yet, so locking the rules is as good as we can do.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #140 from Justin Mason j...@jmason.org 2009-10-27 15:04:51 UTC ---

(In reply to comment #138)
> I believe the masschecks leave internal/external/msa_networks at their
> defaults, unless one cares to configure them correctly for one's corpus. And
> I believe that it is more likely than not that some corpora were scanned with
> unsuitable network settings. I know that configuring it for my mass-check
> runs gave me a headache (but I did it right in the end).

What should be happening, though, is that we're just underestimating the amount of -lastexternal rule hits -- the S/O should still be correct, but the overall number of hits will be less. Hopefully that will still provide a useful estimation of accuracy.

> Docs don't say where one is supposed to put a local.cf with options which
> are ignored in masses/spamassassin/user_prefs (like Bayes SQL options, DCC,
> Pyzor timeouts etc). I tried to place local.cf into masses/spamassassin/,
> with horror results (some directives in local.cf proclaimed as invalid, as
> apparently plugins have not yet been loaded at the time of parsing this
> file, but only later). I finally placed it into ../rules/ as mylocal.cf,
> which finally works as expected, but I wonder if this is the proper
> solution. Should be documented I guess...

yuck. bug 6227.
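The claim that S/O survives an undercount can be checked with a little arithmetic. This sketch rests on two assumptions that the thread does not state: that S/O is SPAM% / (HAM% + SPAM%), and that a misconfigured network boundary loses the same fraction of hits in ham and in spam. The hit rates below are invented for illustration.

```python
def s_over_o(ham_pct: float, spam_pct: float) -> float:
    """Assumed definition: spam hit rate over combined hit rate."""
    return spam_pct / (ham_pct + spam_pct)


# Hypothetical true hit rates for a -lastexternal DNSBL rule
true_ham_pct, true_spam_pct = 2.0, 18.0

# Suppose only half of the hits are observed, in ham and spam alike
observed_fraction = 0.5
seen_ham_pct = true_ham_pct * observed_fraction
seen_spam_pct = true_spam_pct * observed_fraction

print(f"true S/O     = {s_over_o(true_ham_pct, true_spam_pct):.3f}")
print(f"observed S/O = {s_over_o(seen_ham_pct, seen_spam_pct):.3f}")
```

The common factor cancels, so the ratio is unchanged while the hit counts halve. If the loss were uneven between ham and spam, for instance because the misconfigured corpora were mostly spam, the S/O would shift as well.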
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Mark Martinec mark.marti...@ijs.si changed:

           What    |Removed |Added
   Attachment #4542|0       |1
        is obsolete|        |
   Attachment #4553|0       |1
        is obsolete|        |

--- Comment #124 from Mark Martinec mark.marti...@ijs.si 2009-10-26 07:49:13 UTC ---

Created an attachment (id=4558)
-- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4558)
resulting 50_scores.cf from garescorer runs - V3

Attached is the latest 50_scores.cf file, obtained in a couple of iterations during the last few days.

It takes into account the updated results files from the rsync submit area, in particular the updated net-wt* (Comment 99, 102, 103), and net-hege* files. The binnocenti* are still excluded. The rest of the corpora tweaks/decimation as per my previous run, Comment 96.

The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101), otherwise the _MED stands out above the _HI due to its significantly higher hit rate.

The KB_RATWARE_OUTLOOK_08, KB_RATWARE_OUTLOOK_12, KB_RATWARE_OUTLOOK_16 and KB_RATWARE_BOUNDARY were now zeroed-out according to Comment 115.

I tried leaving RDNS_NONE and RDNS_DYNAMIC floating (Comment 116, 120, 122), and it seems to me the obtained score is perfectly sensible and useful, and still not too high to punish incompetent admins too hard:

  score RDNS_NONE     0 1.1 0 0.7
  score RDNS_DYNAMIC  0 0.5 0 0.5

so I'm leaving these floating.

According to Comment 122 I zeroed out (actually, 0.001'd out) the HTML_MESSAGE, MIME_QP_LONG_LINE, FREEMAIL_FROM, TVD_SPACE_RATIO, and MSGID_MULTIPLE_AT.

Some further tweaks: I reduced the BAYES scores somewhat (e.g. from 4.5 to 3.5 for BAYES_99 scoreset3) and tamed down the BAYES_50, which was standing out from the crowd.

For DCC_* rules I used the already described approach: obtain DCC_CHECK score from a GA run with all DCC_REPUT_* zeroed-out, then fix the obtained DCC_CHECK, and let DCC_REPUT_* float for the final run.

The NML_ADSP_CUSTOM_MED was obtained from a GA run, but the others (_LOW, _HIGH) were set manually (currently no hits). The DKIM_ADSP_ALL, DKIM_ADSP_DISCARD, and DKIM_ADSP_NXDOMAIN are based on GA runs, but hand-tweaked somewhat due to inconsistencies between corpora.

A word about JM_SOUGHT_FRAUD_{1,2,3}: these three rules come out from a GA run with scores between 2 and 3, but are somewhat inconsistent between runs and corpora. As requested by Comment 38 their scores were fixed at zero for the final run, but I'd set these manually to 2.2 each for the published 50_scores.cf.

After preparing my manual fixes from a couple of trial runs, I made a final run for each scoreset with these fixed scores, so as to allow other scores to adjust themselves to the new constraints. So here are the manual fixes:

score SPF_PASS -0.001
score SPF_HELO_PASS -0.001
score BAYES_00 0 0 -1.2 -1.9
score BAYES_05 0 0 -0.2 -0.5
score BAYES_20 0 0 -0.001 -0.001
score BAYES_40 0 0 -0.001 -0.001
score BAYES_50 0 0 2.0 0.8
score BAYES_60 0 0 2.5 1.5
score BAYES_80 0 0 2.7 2.0
score BAYES_95 0 0 3.2 3.0
score BAYES_99 0 0 3.8 3.5
score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8
score HTML_MESSAGE 0.001
score NO_RELAYS -0.001
score UNPARSEABLE_RELAY 0.001
score NO_RECEIVED -0.001
score NO_HEADERS_MESSAGE 0.001
score DKIM_ADSP_ALL 0 1.1 0 0.8
score DKIM_ADSP_DISCARD 0 1.8 0 1.8
score DKIM_ADSP_NXDOMAIN 0 0.8 0 0.9
score NML_ADSP_CUSTOM_LOW 0 0.7 0 0.7
score NML_ADSP_CUSTOM_MED 0 1.2 0 0.9
score NML_ADSP_CUSTOM_HIGH 0 2.6 0 2.5
score JM_SOUGHT_FRAUD_1 0
score JM_SOUGHT_FRAUD_2 0
score JM_SOUGHT_FRAUD_3 0
score MIME_QP_LONG_LINE 0.001
score FREEMAIL_FROM 0.001
score TVD_SPACE_RATIO 0.001
score MSGID_MULTIPLE_AT 0.001
score EXTRA_MPART_TYPE 1.0
score RDNS_NONE 0 1.1 0 0.7
score RDNS_DYNAMIC 0 0.5 0 0.5
score KB_RATWARE_OUTLOOK_08 0
score KB_RATWARE_OUTLOOK_12 0
score KB_RATWARE_OUTLOOK_16 0
score KB_RATWARE_BOUNDARY 0
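For readers unfamiliar with the layout of these score lines: a line carries either one value, which applies to every score set, or four values, one for each of score sets 0 to 3 in that order. The set labels below follow the headings used in Comment #125. This is an illustrative sketch only; `parse_score_line` is a hypothetical helper, not SpamAssassin's own configuration parser.

```python
SCORE_SETS = (
    "set 0 (no net, no bayes)",
    "set 1 (net, no bayes)",
    "set 2 (no net, bayes)",
    "set 3 (net, bayes)",
)


def parse_score_line(line: str) -> tuple[str, list[float]]:
    """Split 'score RULE v' or 'score RULE v0 v1 v2 v3' into name and 4 values."""
    parts = line.split()
    if len(parts) < 3 or parts[0] != "score":
        raise ValueError(f"not a score line: {line!r}")
    name = parts[1]
    values = [float(v) for v in parts[2:]]
    if len(values) == 1:
        values = values * 4          # one value is used for all four sets
    elif len(values) != 4:
        raise ValueError(f"expected 1 or 4 scores, got {len(values)}")
    return name, values


for text in ("score BAYES_99 0 0 3.8 3.5", "score HTML_MESSAGE 0.001"):
    rule, scores = parse_score_line(text)
    print(rule)
    for label, value in zip(SCORE_SETS, scores):
        print(f"  {label}: {value}")
```

Read this way, `score BAYES_99 0 0 3.8 3.5` is zero in the two non-Bayes sets and only contributes when Bayes is enabled, while `score RDNS_NONE 0 1.1 0 0.7` only contributes when network tests are enabled.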
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #125 from Mark Martinec mark.marti...@ijs.si 2009-10-26 08:00:59 UTC ---

$ head test scores

= score set 3 (net, bayes) - gen-set3-20-5.0-12200-ga

test (10%)
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21172  99.93%
# Correctly spam:      43597  98.78%
# False positives:        14   0.07%
# False negatives:       537   1.22%
# TCR(l=50): 35.678254  SpamRecall: 98.783%  SpamPrec: 99.968%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168143  32.193%  (99.979% of non-spam corpus)
# Correctly spam:     349734  66.961%  (98.763% of spam corpus)
# False positives:        36   0.007%  (0.021% of nonspam, 8360 weighted)
# False negatives:      4382   0.839%  (1.237% of spam, 14401 weighted)
# Average score for spam: 21.1  nonspam: -2.2
# Average for false-pos: 5.5  false-neg: 3.3
# TOTAL:              522295  100.00%

= score set 2 (no net, bayes) - gen-set2-10-5.0-12200-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21148  99.82%
# Correctly spam:      41172  93.29%
# False positives:        38   0.18%
# False negatives:      2962   6.71%
# TCR(l=50): 9.077334  SpamRecall: 93.289%  SpamPrec: 99.908%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 167953  32.157%  (99.866% of non-spam corpus)
# Correctly spam:     329931  63.169%  (93.170% of spam corpus)
# False positives:       226   0.043%  (0.134% of nonspam, 26882 weighted)
# False negatives:     24185   4.631%  (6.830% of spam, 89229 weighted)
# Average score for spam: 10.8  nonspam: -0.7
# Average for false-pos: 5.6  false-neg: 3.7
# TOTAL:              522295  100.00%

= score set 1 (net, no bayes) - gen-set1-10-5.0-12201-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21155  99.85%
# Correctly spam:      43153  97.78%
# False positives:        31   0.15%
# False negatives:       981   2.22%
# TCR(l=50): 17.437377  SpamRecall: 97.777%  SpamPrec: 99.928%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168012  32.168%  (99.901% of non-spam corpus)
# Correctly spam:     346216  66.287%  (97.769% of spam corpus)
# False positives:       167   0.032%  (0.099% of nonspam, 20194 weighted)
# False negatives:      7900   1.513%  (2.231% of spam, 23052 weighted)
# Average score for spam: 19.8  nonspam: -0.5
# Average for false-pos: 5.7  false-neg: 2.9
# TOTAL:              522295  100.00%

= score set 0 (no net, no bayes) - gen-set0-5-5.0-12201-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  20919  98.74%
# Correctly spam:      34081  77.22%
# False positives:       267   1.26%
# False negatives:     10053  22.78%
# TCR(l=50): 1.885827  SpamRecall: 77.222%  SpamPrec: 99.223%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 166261  31.833%  (98.860% of non-spam corpus)
# Correctly spam:     271409  51.965%  (76.644% of spam corpus)
# False positives:      1918   0.367%  (1.140% of nonspam, 126535 weighted)
# False negatives:     82707  15.835%  (23.356% of spam, 235514 weighted)
# Average score for spam: 10.4  nonspam: 0.6
# Average for false-pos: 6.3  false-neg: 2.8
# TOTAL:              522295  100.00%

=

In summary:

set 3
# False positives:     36  (0.021% of nonspam)
# False negatives:   4382  (1.237% of spam)

set 2
# False positives:    226  (0.134% of nonspam)
# False negatives:  24185  (6.830% of spam)

set 1
# False positives:    167  (0.099% of nonspam)
# False negatives:   7900  (2.231% of spam)

set 0
# False positives:   1918  (1.140% of nonspam)
# False negatives:  82707  (23.356% of spam)
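The derived figures on the TCR line can be recomputed from the four counts in each test summary. The sketch below assumes the usual definitions: recall is caught spam over all spam, precision is caught spam over everything flagged as spam, and TCR with l=50 is the spam count divided by (50 x false positives + false negatives). These reproduce the reported numbers for the set 3 test block; the function and its parameter names are made up for this example and are not taken from the masses/ scripts.

```python
def summarize(ham_ok: int, spam_ok: int, fp: int, fn: int, lam: int = 50) -> dict:
    """Recompute summary statistics from confusion-matrix counts.

    ham_ok / spam_ok are correctly classified ham / spam,
    fp is ham flagged as spam, fn is spam that got through.
    """
    n_ham = ham_ok + fp
    n_spam = spam_ok + fn
    return {
        "fp_pct": 100.0 * fp / n_ham,
        "fn_pct": 100.0 * fn / n_spam,
        "recall_pct": 100.0 * spam_ok / n_spam,
        "precision_pct": 100.0 * spam_ok / (spam_ok + fp),
        # Total cost ratio: each false positive costs lam times a false negative
        "tcr": n_spam / (lam * fp + fn),
    }


# Counts from the "test (10%)" block of score set 3 in Comment #125
set3 = summarize(ham_ok=21172, spam_ok=43597, fp=14, fn=537)

print(f"TCR(l=50): {set3['tcr']:.6f}")
print(f"SpamRecall: {set3['recall_pct']:.3f}%")
print(f"SpamPrec: {set3['precision_pct']:.3f}%")
```

Because a false positive is weighted 50 times as heavily as a false negative, the 14 false positives of set 3 account for 700 of the 1237 cost units in the TCR denominator, more than the 537 missed spams do.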
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #127 from Mark Martinec mark.marti...@ijs.si 2009-10-26 08:09:26 UTC ---

Created an attachment (id=4560)
-- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4560)
ranges.data on corpora used for score set 3 and 2 runs
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #128 from Karsten Bräckelmann guent...@rudersport.de 2009-10-26 09:57:28 UTC ---

(In reply to comment #124)
> Created an attachment (id=4558)
> -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4558) [details]
> resulting 50_scores.cf from garescorer runs - V3

Now I am getting really nervous. :-/ From the scores:

score KB_DATE_CONTAINS_TAB 3.799 3.799 3.315 2.871
score KB_FAKED_THE_BAT 1.447 2.273 2.452 3.799

The bad thing about this is, that onet.pl / onet.eu (a Polish free-mailer AFAIK) actually munges the header, and injects the tab into the Date header on their outgoing SMTP servers. Apparently, they do that harm to all outgoing mail, not limited to their web-mailer.

It is a very, very stupid thing to do for them, to munge MUA generated headers like that, but still they appear to do it. :( That means their customers will really be punished, and using them *and* The Bat! is a killer.

FWIW, I once wrote these to counter a flood of low-scorers -- but the above scores are scaring me. This is quite bad.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #129 from Matthias Leisi matth...@leisi.net 2009-10-26 10:36:56 UTC ---

(In reply to comment #124)
> The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101),
> otherwise the _MED stands out above the _HI due to its significantly higher
> hit rate.
> [..]
> score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
> score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
> score RCVD_IN_DNSWL_HI 0 -1.8 0 -1.8

Is there a particular reason why these are so much different from those in https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf:

| score RCVD_IN_DNSWL_LOW 0 -1 0 -1
| score RCVD_IN_DNSWL_MED 0 -4 0 -4
| score RCVD_IN_DNSWL_HI 0 -8 0 -8
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #130 from Mark Martinec mark.marti...@ijs.si 2009-10-26 11:03:28 UTC ---
> > The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101), otherwise the _MED stands out above the _HI due to its significantly higher hit rate.
> > score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
> > score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
> > score RCVD_IN_DNSWL_HI  0 -1.8 0 -1.8
>
> Is there a particular reason why these are so much different from those in https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf:
> | score RCVD_IN_DNSWL_LOW 0 -1 0 -1
> | score RCVD_IN_DNSWL_MED 0 -4 0 -4
> | score RCVD_IN_DNSWL_HI  0 -8 0 -8

The -1/-4/-8 were manually provided (don't know the background on this decision). The RCVD_IN_DNSWL_MED in my GA results was obtained automatically, and the other two were manually adjusted to make some sense compared to _MED.

Btw, the GA results on scoreset 3 from one of my previous runs were:

  RCVD_IN_DNSWL_LOW -2.761
  RCVD_IN_DNSWL_MED -0.999
  RCVD_IN_DNSWL_HI  -0.966
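The ordering concern in this exchange (a higher DNSWL trust level should never earn a weaker negative score than a lower one) can be checked mechanically. What follows is only a minimal sketch, assuming the usual `score NAME v0 v1 v2 v3` layout with one value per score set 0 to 3 and a single value meaning "use for all sets". The helper names are made up for illustration and are not part of the masses/ tooling, and the zeros in the second sample are placeholders, since only the scoreset 3 values were quoted.

```python
# Hypothetical helpers, not part of SpamAssassin's masses/ scripts.
# Assumes "score NAME v0 v1 v2 v3" lines (or a single value for all sets).

def parse_scores(text):
    """Return {rule_name: [v0, v1, v2, v3]} from 'score' lines."""
    scores = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()    # drop trailing comments
        if len(fields) >= 3 and fields[0] == "score":
            values = [float(v) for v in fields[2:]]
            if len(values) == 1:                  # one value covers every set
                values = values * 4
            scores[fields[1]] = values
    return scores


def dnswl_inversions(scores, prefix="RCVD_IN_DNSWL_"):
    """List the score sets where HI <= MED <= LOW does not hold."""
    low, med, hi = (scores[prefix + s] for s in ("LOW", "MED", "HI"))
    return [i for i in range(4) if not (hi[i] <= med[i] <= low[i])]


# Hand-tweaked values as posted in the thread.
hand_tweaked = parse_scores("""
score RCVD_IN_DNSWL_LOW 0 -0.6 0 -1.1
score RCVD_IN_DNSWL_MED 0 -1.5 0 -1.2
score RCVD_IN_DNSWL_HI  0 -1.8 0 -1.8
""")

# Raw GA output for scoreset 3; sets 0-2 are placeholder zeros (assumption).
raw_ga = parse_scores("""
score RCVD_IN_DNSWL_LOW 0 0 0 -2.761
score RCVD_IN_DNSWL_MED 0 0 0 -0.999
score RCVD_IN_DNSWL_HI  0 0 0 -0.966
""")

print("hand-tweaked inversions:", dnswl_inversions(hand_tweaked))
print("raw GA inversions:", dnswl_inversions(raw_ga))
```

On these inputs the hand-tweaked scores keep the expected order in every set, while the raw GA numbers invert it in scoreset 3, which is the situation the manual adjustment was meant to repair.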
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #133 from Justin Mason j...@jmason.org 2009-10-26 13:51:54 UTC ---
Strange, some of the more trustworthy BLs are very low scoring:

  RCVD_IN_XBL: 0.404 and 0.722

These have been effectively zeroed, although they are supposed to be immutable:

  RCVD_IN_SSC_TRUSTED_COI is 0 (with a 0.012 S/O, low hit rate though)
  HABEAS_ACCREDITED_COI is 0 (ditto)
  RCVD_IN_BSP_TRUSTED is -0.001 (although with a 0.002 S/O)

The HASHCASH rules likewise aren't supposed to be mutable. It looks like there might be a bit of a problem there -- definitely some rules that are in immutable sections, like the above, have been allowed to be mutable in ranges.data.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #134 from John Hardin jhar...@impsec.org 2009-10-26 14:31:20 UTC ---
(In reply to comment #132)

  $ grep RCVD_IN_DNSWL_ freqs.full
  OVERALL   SPAM%     HAM%    S/O  RANK  SCORE  NAME
    0.184  0.0005   0.5708  0.001  0.76  -1.80  RCVD_IN_DNSWL_HI
    7.410  0.1094  22.7527  0.005  0.67  -1.20  RCVD_IN_DNSWL_MED
    2.551  0.1810   7.5322  0.023  0.59  -1.10  RCVD_IN_DNSWL_LOW

It is quite possible that some of these hits are still false positives, despite several iterations of cleaning:

  for j in spam*.log; do echo -n "$j "; grep RCVD_IN_DNSWL_HI "$j" | wc -l; done | sort -k2nr

  spam-bayes-net-bb-jhardin.log 3

same on _MED:

  spam-bayes-net-bb-jhardin.log 23

All but one of those are obvious spams, and I've removed the one questionable one from my corpora.

Some of the spam in my corpora is from third parties. I do check it for correct classification before uploading, but I was wondering: how does masscheck determine the correct lastexternal for corpora containing messages from multiple different networks? Or does it assume all of the messages in a given contributor's corpora have the same network boundary? If the latter, I need to remove those third-party messages from my spam corpora...

Might lastexternal confusion in the masschecks be contributing in some way to the odd RCVD_IN_* score generation?
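For readers following the freqs.full columns: the S/O value appears to be the spam hit rate divided by the sum of the spam and ham hit rates, and the three DNSWL rows above are consistent with that reading. A minimal sketch under that assumption (the function name is illustrative, and returning 0.5 for a rule with no hits at all is a choice made here, not something the thread specifies):

```python
# Assumed definition: S/O = SPAM% / (SPAM% + HAM%), as the rows above suggest.

def spam_over_overall(spam_pct, ham_pct):
    """Fraction of a rule's hit rate that comes from spam (0 = pure ham)."""
    total = spam_pct + ham_pct
    if total == 0:
        return 0.5    # no hits: no evidence either way (assumption)
    return spam_pct / total


# (SPAM%, HAM%, listed S/O) taken from the freqs.full excerpt above.
rows = {
    "RCVD_IN_DNSWL_HI":  (0.0005, 0.5708, 0.001),
    "RCVD_IN_DNSWL_MED": (0.1094, 22.7527, 0.005),
    "RCVD_IN_DNSWL_LOW": (0.1810, 7.5322, 0.023),
}

for name, (spam_pct, ham_pct, listed) in rows.items():
    computed = spam_over_overall(spam_pct, ham_pct)
    print(f"{name}: computed {computed:.3f}, listed {listed:.3f}")
```

A whitelist rule that works as intended should sit very close to 0 on this scale, which is why even a handful of misfiled spam messages in one contributor's corpus is worth chasing down.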
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #135 from Adam Katz antis...@khopis.com 2009-10-26 16:27:56 UTC ---
Created an attachment (id=4561) -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4561)
Checker for rules that match more ham than spam

I've updated my checker to an actual perl script (still uses elinks as I don't feel like learning LWP and then parsing HTML). I've attached the checker, which can be run with custom parameters for a different ruleset, ham threshold, or minimum difference for ham:spam ratio.

Here's the current output, listing all rules that hit 1+% of the ham corpus or that hit 0.05% more of the ham corpus than of the spam corpus.

  H^2/S     HAM%     SPAM%    Score in attachment 4558   Rule
  331.9     0.3319   0.0010   0                          OBSCURED_EMAIL
  117.4     4.8566   0.2009   -0.001                     SPF_HELO_PASS
  88.52     5.5735   0.3509   -0.001                     SPF_PASS
  85.61     0.2226   0.0026   0.000 2.099 0.001 1.212    MISSING_MIME_HB_SEP
  76.18     0.7085   0.0093   0.001 0.001 0.699 0.699    TVD_RCVD_SPACE_BRACKET
  66.19     0.2780   0.0042   1.145 1.542 1.912 2.400    FUZZY_CPILL
  49.98     1.0676   0.0228   0.001                      MSGID_MULTIPLE_AT
  31.82     0.1496   0.0047   1.494 1.699 1.591 1.516    X_IP
  21.86     0.1465   0.0067   0                          SUBJECT_FUZZY_TION
  20.40    15.6218  11.9604   0.001                      FREEMAIL_FROM
  20.00*   40.9055  83.6301   0.001                      HTML_MESSAGE
  17.10     0.1710   0        1.222 0.001 0.082 0.476    MIME_BOUND_DIGITS_15
  12.95     0.0609   0.0047   0                          HTML_IFRAME_SRC
  12.52     0.0714   0.0057   0                          FORGED_IMS_TAGS
  11.56     0.0659   0.0057   0.001 0.001 0.605 0.378    HTML_NONELEMENT_30_40
  10.83     0.1127   0.0104   0.033 0.001 0.365 0.413    WEIRD_PORT
  10.18     0.3494   0.0343   2.205 0.174 1.299 1.806    FRT_SOMA2
  9.721     0.8934   0.0919   1.499 0.419 0.904 0.798    MIME_BASE64_BLANKS
  8.996     0.2474   0.0275   0.987 0.750 0.943 1.318    CTYPE_001C_B
  8.918     0.1525   0.0171   0.001 2.499 0.268 0.516    DRUGS_MUSCLE
  8.373     0.0829   0.0099   0.003 0.978 0.100 1.515    TVD_FW_GRAPHIC_NAME_LONG
  8.016     0.1956   0.0244   0.001 0.020 0.001 1.799    MIME_BASE64_TEXT
  6.850     0.0685   0        0                          HTML_NONELEMENT_40_50
  5.404     0.5356   0.0991   0 1.200 0 2.514            SPF_HELO_FAIL
  4.237     0.1585   0.0374   2.199 2.199 1.246 2.090    WEIRD_QUOTING
  4.159     3.8908   3.6392   0.001                      MIME_QP_LONG_LINE
  3.483     0.8570   0.2460   1.799 0.572 1.182 1.138    HTML_IMAGE_RATIO_06
  3.219     1.2399   0.4775   1.0                        EXTRA_MPART_TYPE
  2.913*   12.1047  50.2891   0 1.1 0 0.7                RDNS_NONE
  2.839     0.1164   0.0410   0.001 2.185 1.936 0.476    FRT_SOMA
  2.751     0.1172   0.0426   0.1                        ANY_BOUNCE_MESSAGE
  2.417     0.6787   0.2808   0.539 0.001 0.332 0.488    MIME_HTML_MOSTLY
  2.370     0.1010   0.0426   0.1                        BOUNCE_MESSAGE
  2.078     0.5534   0.2663   1.899 0.496 0.950 0.445    HTML_IMAGE_RATIO_08
  1.899     1.2077   0.7677   0.001                      TVD_SPACE_RATIO
  1.726     0.3227   0.1869   0.023 0.887 0.000 0.417    UPPERCASE_50_75
  1.517     0.9658   0.6364   2.801 2.080 1.780 3.387    DATE_IN_PAST_96_XX
  1.269     0.4224   0.3327   0.000 0.001 0.264 0.001    HTML_FONT_SIZE_LARGE
  1.151     0.5492   0.4770   2.260 0.742 1.199 0.640    MPART_ALT_DIFF
  0.913*    1.8488   3.7425   1.154 1.677 1.198 1.453    SUBJ_ALL_CAPS
  0.703*    1.3317   2.5216   0.001                      UNPARSEABLE_RELAY
  0.278*    3.7480  50.4848   2.199 0.955 1.215 0.549    MIME_HTML_ONLY
  0.121*    1.2540  12.9472   0 1.322 0 1.237            RCVD_IN_BL_SPAMCOP_NET

(Anything asterisked is included because it matched 1% of the ham corpus but matched a larger percent of the spam corpus while everything else matched a larger percent of the ham corpus than the spam corpus.)

Mark's fixes solved the immediate issues raised earlier, so I decided to order this by the ratio of percentage of ham corpus hit to percentage of spam corpus hit, but that under-emphasized the ham hits, so I then multiplied that by the ham percentage again (unless the percent was under 1). It's easy enough to browse for non-zero ham% hits.

Any rule with a ratio over 1.000 is a problem when scored positively unless it is exempted for applying to popular spam patterns that the corpus is known to lack. For completeness, this list includes all tests that hit at least 1% of the ham corpus (thus the presence of HTML_MESSAGE, RDNS_NONE, and the four tests with ratios under 1.0).
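The ordering key described above (the ham:spam ratio, multiplied by the ham percentage again unless that percentage is under 1) can be written down in a few lines. This is a sketch of that description rather than the attached checker itself. The stand-in of 0.01 for a spam rate of exactly zero is inferred from the two zero-spam rows (17.10 and 6.850) and is an assumption, as is the function name.

```python
# Sketch of the "H^2/S" sort key described above; not the attached script.

def ham_weighted_ratio(ham_pct, spam_pct, zero_spam_stand_in=0.01):
    """Ham:spam ratio, weighted by ham% again once ham% reaches 1."""
    # A spam rate of 0 would divide by zero; 0.01 is inferred from the table.
    spam = spam_pct if spam_pct > 0 else zero_spam_stand_in
    ratio = ham_pct / spam
    return ratio * ham_pct if ham_pct >= 1 else ratio


# (HAM%, SPAM%, listed value) for a few rows of the table above.
samples = {
    "OBSCURED_EMAIL":       (0.3319, 0.0010, 331.9),
    "SPF_HELO_PASS":        (4.8566, 0.2009, 117.4),
    "FREEMAIL_FROM":        (15.6218, 11.9604, 20.40),
    "MIME_BOUND_DIGITS_15": (0.1710, 0.0, 17.10),
    "RDNS_NONE":            (12.1047, 50.2891, 2.913),
}

for rule, (ham_pct, spam_pct, listed) in samples.items():
    value = ham_weighted_ratio(ham_pct, spam_pct)
    print(f"{rule}: computed {value:.3f}, listed {listed}")
```

The computed values agree with the listed ones to within rounding, which suggests the table was generated this way. The extra ham weighting is what pushes widely hit rules such as FREEMAIL_FROM above rarely hit rules with a similar plain ratio.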
Re: [Bug 6155] generate new scores for 3.3.0 release
On Wed, Oct 21, 2009 at 06:34:47PM -0700, Michael Peddemors wrote:
> On October 20, 2009, bugzilla-dae...@bugzilla.spamassassin.org wrote:
> > Getting back to this issue: I don't see any problem with prejudice against poorly constructed network infrastructures that can't bother to adhere to the SMTP standard (RFC1912 section 2.1). This is something that any network admin who should legitimately be managing a mail server should be able to fix with a single phone call (please correct me if this sentence is prejudiced in any way). The SMTP standard requires a server's rDNS must match the server's reported name (thus the IP must have rDNS), and most allocated IPs have them anyway (even if they're wrong or ~dynamic, e.g. RDNS_DYNAMIC). There is also a growing number of deployments that block improper FCrDNS at the door (RDNS_NONE is a subset of failing FCrDNS).
>
> MagicMail Servers have been blocking all email at the connection level that do not have rDNS now for the past couple of years, except when SMTP AUTH is presented, and we haven't had an F/P reported in over a year.

Maybe I'm beating a dead horse but..

http://ruleqa.spamassassin.org/20091021-r826376-n/RDNS_NONE/detail

Hopefully you didn't mean that MagicMail somehow is an authority on the stats or a good example to follow. Even if this isn't the users list, you should never imply that RDNS_NONE is safe to block at a general 2% ham rate.

Of course it's up to the site policy, but be prepared to..

- Listen to user complaints
- Create a large whitelist
- Deal with imbeciles and hope they fix the DNS _some_ day ;-)
Re: [Bug 6155] generate new scores for 3.3.0 release
On Wed, 2009-10-21 at 23:35 -0400, Warren Togami wrote: On 10/21/2009 10:46 PM, Karsten Bräckelmann wrote: s/ Warren /SA devs, contributors and mass-check contributors/x # There is something seriously disturbing with the above comment. # Fix using a trivial substitution. What's disturbing about it is, that despite the recent discussion, Michael still seems to perceive the entire process of distributed mass-checks to be writing a rule, and reduces it to that. This is not about Warren. He just happens to dump random BLs for a short time in his granted sandbox. It is everyone else, who does the heavy weight lifting. While I agree it is unfortunate that he used my name there, don't you think what you wrote here a bit unnecessarily insulting? This suggests that dumping random BL's into the sandbox is all I do. Granted, I could have phrased that better. Though just as in my previous post, this is not about you ;) but the unfortunate depiction in the original post. I do not question your contributions and effort. guenther -- char *t=\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4; main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;il;i++){ i%8? c=1: (c=*++x); c128 (s+=h); if (!(h=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
Re: [Bug 6155] generate new scores for 3.3.0 release
On October 21, 2009, Karsten Bräckelmann wrote:
> SA is ALL about scores, and NOT absolute. If you want absolute, reject BEFORE even passing the mail to SA. Easy. Lots of cycles spared. But since you're a regular on the user's list, I assume you've read that before...

hehe.. no, not on the users list.. And I think you missed the point: SA is about scores, so the really 'prejudiced' rules might belong in a place other than SA.. That is one line that is blurry in SA discussions.. at what point is a rule prejudiced enough to consider that it is almost an absolute? Some rules score extremely high.. No rDNS goes past the idea of scoring.. so does it belong in a scoring system? Just a topic for discussion..

-- -- Catch the Magic of Linux... Michael Peddemors - President/CEO - LinuxMagic Products, Services, Support and Development Visit us at http://www.linuxmagic.com A Wizard IT Company - For More Info http://www.wizard.ca LinuxMagic is a Registered TradeMark of Wizard Tower TechnoServices Ltd. 604-589-0037 Beautiful British Columbia, Canada This email and any electronic data contained are confidential and intended solely for the use of the individual or entity to which they are addressed. Please note that any views or opinions presented in this email are solely those of the author and are not intended to represent those of the company.
Re: [Bug 6155] generate new scores for 3.3.0 release
On Thu, 22 Oct 2009 09:34:13 -0700, Michael Peddemors wrote: I am curious to the large HAM rate.. Again, I think the testing of this rule against a corpus might be affecting this.. I tend to agree. AOL announced wholesale blocking of anyone with NXDOMAIN rDNS a few years back now, and that caused big changes in people thinking it was OK to mail from an IP with NXDOMAIN rDNS. Matt.
Re: [Bug 6155] generate new scores for 3.3.0 release
Henrik Krohns wrote:
> I only have to look at my mail logs from today, and I see dozens of legitimate RDNS_NONE hits originating from real people. I'm happy to greylist it at MTA, but not block directly. As said, it's a site policy. Some people use high FP BLs also happily. Many people might not report FPs for one reason or another, but it doesn't mean they don't exist.. I like to be on the safe side.

The question is what defines safe and why is the score pinned to 0.1? Isn't the whole point of the genetic algorithm to determine what safe value to assign it? Who's to say that 0.2 isn't safe? (I suppose there's no way to *cap* a GA score rather than just pin it?)

SA is a system of probabilities. We don't define ham as having 0 or fewer points. Again, I cite the masscheck results. Is 1.7% of the ham corpus bad? What about MIME_HTML_ONLY's 3.7% ham, or RCVD_IN_BL_SPAMCOP_NET's 1.3% ham or SUBJ_ALL_CAPS's 1.8%, ...? All of those have GA-generated scores over 0.1. What about the fact that this only scores 0.8528% corpus overlap for ham scoring 4+? (Like RDNS_NONE, MIME_HTML_ONLY's 3.7% ham overlap is mostly low-scoring ham, with only 1.5625% matching corpus ham at 4+.)

Even the latest scoring proposal here has this line:

  score HTML_MESSAGE 2.199 0.838 1.473 0.511

despite HTML_MESSAGE hitting 40.9% of the ham corpus.

Here are some that hit a larger portion of the ham corpus than of the spam corpus despite having positive scores in bugzilla attachment 4553 (the latest scoring proposal) at https://issues.apache.org/SpamAssassin/attachment.cgi?id=4553

  MIME_QP_LONG_LINE
  FREEMAIL_FROM
  TVD_SPACE_RATIO
  EXTRA_MPART_TYPE
  (among others)

These were found by applying this search to the front page at http://ruleqa.spamassassin.org (using a firefox regexp search add-on)

  /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w/

In shell (guess whose bourne scripting is better than his perl?),

  elinks -dump http://ruleqa.spamassassin.org/ | perl -ne 'print if /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' | tee rules.txt

  for rule in `perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }' rules.txt`; do grep "^[^#]*$rule" /tmp/50_scores_newest.cf; done
Re: [Bug 6155] generate new scores for 3.3.0 release
On Thu, Oct 22, 2009 at 20:35, Adam Katz antis...@khopis.com wrote:
> Henrik Krohns wrote:
> > I only have to look at my mail logs from today, and I see dozens of legitimate RDNS_NONE hits originating from real people. I'm happy to greylist it at MTA, but not block directly. As said, it's a site policy. Some people use high FP BLs also happily. Many people might not report FPs for one reason or another, but it doesn't mean they don't exist.. I like to be on the safe side.
>
> The question is what defines safe and why is the score pinned to 0.1? Isn't the whole point of the genetic algorithm to determine what safe value to assign it? Who's to say that 0.2 isn't safe? (I suppose there's no way to *cap* a GA score rather than just pin it?)

One thing we need to take into account is that some rules are harder for senders to fix than others. Whether or not their ISP gives them rDNS is quite tricky to fix. The GA can't take that into account, but we can, by setting a score manually and locking it as non-mutable.

--j.

> SA is a system of probabilities. We don't define ham as having 0 or fewer points. Again, I cite the masscheck results. Is 1.7% of the ham corpus bad? What about MIME_HTML_ONLY's 3.7% ham, or RCVD_IN_BL_SPAMCOP_NET's 1.3% ham or SUBJ_ALL_CAPS's 1.8%, ...? All of those have GA-generated scores over 0.1. What about the fact that this only scores 0.8528% corpus overlap for ham scoring 4+? (Like RDNS_NONE, MIME_HTML_ONLY's 3.7% ham overlap is mostly low-scoring ham, with only 1.5625% matching corpus ham at 4+.)
>
> Even the latest scoring proposal here has this line:
>
>   score HTML_MESSAGE 2.199 0.838 1.473 0.511
>
> despite HTML_MESSAGE hitting 40.9% of the ham corpus.

agh! that's a bug.

> Here are some that hit a larger portion of the ham corpus than of the spam corpus despite having positive scores in bugzilla attachment 4553 (the latest scoring proposal) at https://issues.apache.org/SpamAssassin/attachment.cgi?id=4553
>
>   MIME_QP_LONG_LINE
>   FREEMAIL_FROM
>   TVD_SPACE_RATIO
>   EXTRA_MPART_TYPE
>   (among others)
>
> These were found by applying this search to the front page at http://ruleqa.spamassassin.org (using a firefox regexp search add-on)
>
>   /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w/
>
> In shell (guess whose bourne scripting is better than his perl?),
>
>   elinks -dump http://ruleqa.spamassassin.org/ | perl -ne 'print if /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' | tee rules.txt
>
>   for rule in `perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }' rules.txt`; do grep "^[^#]*$rule" /tmp/50_scores_newest.cf; done

Could you add a comment to the rescoring bug (bug 6155) noting those over-high scores? HTML_MESSAGE at least should NOT be mutable like that :(

-- --j.
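The idea of setting a score by hand and locking it as non-mutable amounts to an overlay applied after the GA has run, plus a check that the GA did not move anything it was not supposed to. A rough sketch of that idea follows; it does not describe how ranges.data or the rescoring scripts actually enforce immutability. The pinned value 0.1 for RDNS_NONE comes from the thread, while the 0.001 pin for HTML_MESSAGE is only an illustrative assumption.

```python
# Illustrative only: the real immutability handling lives in the masses/
# scripts and ranges.data, whose formats are not reproduced here.

def apply_pins(ga_scores, pinned):
    """Overlay hand-set scores on GA output; return (merged, overridden)."""
    merged = dict(ga_scores)
    overridden = sorted(
        name for name, value in pinned.items()
        if name in ga_scores and ga_scores[name] != value
    )
    merged.update(pinned)          # pinned values always win
    return merged, overridden


# GA proposal for HTML_MESSAGE as quoted from attachment 4553.
ga_scores = {
    "HTML_MESSAGE": (2.199, 0.838, 1.473, 0.511),
    "RDNS_NONE": (0.1, 0.1, 0.1, 0.1),
}
pinned = {
    "HTML_MESSAGE": (0.001, 0.001, 0.001, 0.001),   # assumed pin value
    "RDNS_NONE": (0.1, 0.1, 0.1, 0.1),              # nailed to 0.1
}

merged, overridden = apply_pins(ga_scores, pinned)
print("rules the GA moved despite being pinned:", overridden)
print("final HTML_MESSAGE score:", merged["HTML_MESSAGE"])
```

Reporting the overridden names, rather than silently fixing them, is what would have surfaced the HTML_MESSAGE problem before the proposal was posted.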
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #122 from Adam Katz antis...@khopis.com 2009-10-22 13:32:40 UTC ---
Some bugs in the auto-generated rules from attachment 4553: HTML_MESSAGE scores WAY too high. There are others too. Full list as of right now:

  MSECS    SPAM%     HAM%    S/O  RANK  SCORE  NAME
      0   0.1848   4.8675  0.037  0.78   0.00  SPF_HELO_PASS
      0   0.3294   5.5859  0.056  0.74   0.00  SPF_PASS
      0  12.2476   1.2568  0.907  0.58   0.00  RCVD_IN_BL_SPAMCOP_NET
      0  50.4453   3.7391  0.931  0.57   2.30  MIME_HTML_ONLY
      0  49.9300  12.1231  0.805  0.52   0.10  RDNS_NONE
      0   3.8466   1.8427  0.676  0.51   2.30  SUBJ_ALL_CAPS
      0   2.3989   1.3218  0.645  0.50   0.00  UNPARSEABLE_RELAY
      0  83.7769  40.8865  0.672  0.49   0.00  HTML_MESSAGE
      0   3.4477   3.8932  0.470  0.47   2.50  MIME_QP_LONG_LINE
      0  12.2361  15.6252  0.439  0.46   0.00  FREEMAIL_FROM
      0   0.7695   1.2102  0.389  0.41   2.90  TVD_SPACE_RATIO
      0   0.4610   1.2409  0.271  0.35   1.00  EXTRA_MPART_TYPE
      0   0.0271   1.0700  0.025  0.15   1.22  MSGID_MULTIPLE_AT

  score SPF_HELO_PASS -0.001
  score SPF_PASS -0.001
  score RCVD_IN_BL_SPAMCOP_NET 0 1.725 0 1.180 # n=2
  score MIME_HTML_ONLY 1.474 0.737 0.829 0.462
  score RDNS_NONE 0.1
  score SUBJ_ALL_CAPS 0.264 1.568 0.593 1.045
  score UNPARSEABLE_RELAY 0.001
  score HTML_MESSAGE 2.199 0.838 1.473 0.511
  score MIME_QP_LONG_LINE 0.074 0.242 0.116 0.002
  score FREEMAIL_FROM 0.817 1.020 0.401 0.856
  score TVD_SPACE_RATIO 0.001 0.201 0.398 0.001
  score MSGID_MULTIPLE_AT 0.001 0.001 0.598 0.000

To fetch them for yourself (so as to get something more up-to-date or from a different URL, etc), here's the code I ran (sorry, I know posix shell better than perl, so I dip into both):

  elinks -dump http://ruleqa.spamassassin.org/ | perl -ne 'print if /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' | tee rules.txt

  for rule in $(perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }' rules.txt); do grep "^[^#]*$rule" /tmp/50_scores_newest.cf; done

That could probably be written better, e.g. looking for ham% spam% in addition to ham% 0.%, but this is a good first-pass.

Obviously, /removing/ fixed scores for things like RDNS_NONE can't possibly be considered until the GA is a little more apt at figuring this sort of thing out.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #123 from Adam Katz antis...@khopis.com 2009-10-22 13:47:40 UTC ---
(In reply to comment #122)
sorry, that should be:

  elinks -dump http://ruleqa.spamassassin.org/ | perl -ne 'print if /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' | tee rules.txt

  for rule in $(perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }' rules.txt); do grep "^[^#]*$rule" /tmp/50_scores_newest.cf || echo "score $rule UNKNOWN"; done

With each of those two stanzas living on just one line. Obviously, ignore the genuine ham rules.
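The same first pass can be expressed without elinks, perl and grep once the ruleqa listing and the proposed scores are available as text. The following is a hedged sketch rather than a drop-in replacement: it assumes rows shaped like `MSECS SPAM% HAM% S/O RANK SCORE NAME`, it applies the stricter ham% > spam% test instead of the "ham% of at least 1" regex, and the function name is invented for illustration.

```python
# Hypothetical re-implementation of the elinks/perl/grep pipeline above.
# Assumes ruleqa rows of the form: MSECS SPAM% HAM% S/O RANK SCORE NAME

def ham_heavy_rules(freqs_text, scores_text):
    """Return non-T_ rules that hit more ham than spam yet score above zero."""
    scores = {}
    for line in scores_text.splitlines():
        fields = line.split("#", 1)[0].split()    # drop trailing comments
        if len(fields) >= 3 and fields[0] == "score":
            scores[fields[1]] = [float(v) for v in fields[2:]]

    flagged = []
    for line in freqs_text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue                               # not a data row
        try:
            spam_pct, ham_pct = float(fields[1]), float(fields[2])
        except ValueError:
            continue                               # header line
        name = fields[6]
        if name.startswith("T_"):
            continue                               # skip testing rules
        if ham_pct > spam_pct and any(v > 0 for v in scores.get(name, [])):
            flagged.append(name)
    return sorted(flagged)


# A few rows and score lines copied from comment #122.
FREQS = """\
MSECS SPAM% HAM% S/O RANK SCORE NAME
0 0.3294 5.5859 0.056 0.74 0.00 SPF_PASS
0 50.4453 3.7391 0.931 0.57 2.30 MIME_HTML_ONLY
0 3.4477 3.8932 0.470 0.47 2.50 MIME_QP_LONG_LINE
0 12.2361 15.6252 0.439 0.46 0.00 FREEMAIL_FROM
"""
SCORES = """\
score SPF_PASS -0.001
score MIME_HTML_ONLY 1.474 0.737 0.829 0.462
score MIME_QP_LONG_LINE 0.074 0.242 0.116 0.002
score FREEMAIL_FROM 0.817 1.020 0.401 0.856
"""

print(ham_heavy_rules(FREQS, SCORES))
```

Genuine ham rules such as SPF_PASS drop out on their own because their scores are negative, so there is nothing to ignore by hand afterwards.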
Re: [Bug 6155] generate new scores for 3.3.0 release
On October 20, 2009, bugzilla-dae...@bugzilla.spamassassin.org wrote:
> Getting back to this issue: I don't see any problem with prejudice against poorly constructed network infrastructures that can't bother to adhere to the SMTP standard (RFC1912 section 2.1). This is something that any network admin who should legitimately be managing a mail server should be able to fix with a single phone call (please correct me if this sentence is prejudiced in any way). The SMTP standard requires a server's rDNS must match the server's reported name (thus the IP must have rDNS), and most allocated IPs have them anyway (even if they're wrong or ~dynamic, e.g. RDNS_DYNAMIC). There is also a growing number of deployments that block improper FCrDNS at the door (RDNS_NONE is a subset of failing FCrDNS).

MagicMail Servers have been blocking all email at the connection level that do not have rDNS now for the past couple of years, except when SMTP AUTH is presented, and we haven't had an F/P reported in over a year.

However, this SHOULD be the MTA's responsibility, and not the filtering system's. Of course there are some MTAs still out there where this may help, but it is better to reject those during the SMTP phase, so that the clueless admin can get reverse DNS up as soon as possible..

HOWEVER.. Please note, you have to watch this.. we have seen too many times where temporary DNS failures resulted in email blockages, and you don't want to be dropping those messages on the floor when that happens.. Better to reject them, or at least send back temporary deferrals...

Another point is that the SMTP 'standard' is not yet a standard.. In the real world, just be happy they have any sort of reverse DNS.. We are trying to adopt a standard where at least the reverse DNS resolves to a domain owned by the email operator (and not his upstream provider's generic addressing scheme) and we still get some push back on that.. to get the average MS Exchange operator to set up the server's reported name.. how many times do we see HELO localhost.localnet still :) And there are many operators who have reasons NOT to do this.. (Email Clusters, Servers with Internal Naming Conventions et al)

It would be nice to see SA having to do less of the 'Best Practices' stuff.. leave that to MTAs..

Just thought I would put my two bits in: SA 'could' go farther with 'prejudiced' rules, but if they are sufficiently prejudiced, should they not be absolutes, instead of scores?

PS, since I am posting.. Warren, have you done any 'testing' with the SPAM-RATS RBL's against the corpus? would be interested in numbers.. even with the variables of aged dating, and not checking SMTP Authed messages..

-- -- Catch the Magic of Linux... Michael Peddemors - President/CEO - LinuxMagic
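The policy sketched in this message (treat missing or unconfirmed rDNS as a failure, but never turn a temporary DNS problem into a hard rejection) can be illustrated in isolation from any particular MTA. The code below is a minimal sketch with injected lookup functions so it runs without a resolver. `TemporaryDNSError`, the function name and the sample addresses are all hypothetical and do not correspond to any MTA's or SpamAssassin's API.

```python
# Sketch of a forward-confirmed reverse DNS (FCrDNS) decision.
# The lookups are passed in, so no real DNS traffic is needed to try it.

class TemporaryDNSError(Exception):
    """Hypothetical marker for a lookup that timed out or got SERVFAIL."""


def fcrdns_verdict(ip, ptr_lookup, addr_lookup):
    """Return 'pass', 'fail' or 'tempfail' for the connecting IP.

    ptr_lookup(ip)    -> list of hostnames from the PTR record(s)
    addr_lookup(name) -> list of IP addresses the hostname resolves to
    """
    try:
        names = ptr_lookup(ip)
        if not names:
            return "fail"             # no rDNS at all (the RDNS_NONE case)
        for name in names:
            if ip in addr_lookup(name):
                return "pass"         # a PTR name resolves back to the IP
        return "fail"                 # rDNS exists but is not confirmed
    except TemporaryDNSError:
        return "tempfail"             # defer with a 4xx, do not reject


# Fake zone data using documentation address space (RFC 5737).
PTR = {"192.0.2.10": ["mail.example.net"], "192.0.2.20": ["host.example.org"]}
ADDR = {"mail.example.net": ["192.0.2.10"], "host.example.org": ["192.0.2.99"]}


def fake_ptr(ip):
    if ip == "192.0.2.30":
        raise TemporaryDNSError("resolver timed out")
    return PTR.get(ip, [])


def fake_addr(name):
    return ADDR.get(name, [])


for ip in ("192.0.2.10", "192.0.2.20", "192.0.2.30", "192.0.2.40"):
    print(ip, fcrdns_verdict(ip, fake_ptr, fake_addr))
```

Keeping "tempfail" as a separate outcome is the point of the warning above: a resolver outage should produce deferrals that the sender retries, not lost mail.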
Re: [Bug 6155] generate new scores for 3.3.0 release
On Wed, 2009-10-21 at 18:34 -0700, Michael Peddemors wrote:
> MagicMail Servers have been blocking all email at the connection level that do not have rDNS now for the past couple of years, except when SMTP AUTH is presented, and we haven't had an F/P reported in over a year.

Funnily enough, there are ISPs out there advertising targeted towards small businesses, handing over static IPs with NO rDNS whatsoever. Dialup customers do have (generic) rDNS. Not made up. Political decision.

> Just thought I would put my two bits in SA 'could' go farther with 'prejudiced' rules, but if they are sufficiently prejudiced, should they not be absolutes, instead of scores?

SA is ALL about scores, and NOT absolute. If you want absolute, reject BEFORE even passing the mail to SA. Easy. Lots of cycles spared. But since you're a regular on the user's list, I assume you've read that before...
Re: [Bug 6155] generate new scores for 3.3.0 release
On Wed, 2009-10-21 at 22:03 -0400, Warren Togami wrote:
> On 10/21/2009 09:34 PM, Michael Peddemors wrote:
> > Warren, have you done any 'testing' with the SPAM-RATS RBL's against the corpus? would be interested in numbers.. even with the variables of aged dating, and not checking SMTP Authed messages..

  s/ Warren /SA devs, contributors and mass-check contributors/x
  # There is something seriously disturbing with the above comment.
  # Fix using a trivial substitution.

This is not about Warren. He just happens to dump random BLs for a short time in his granted sandbox. It is everyone else, who does the heavy weight lifting.

> I have never seen this RBL before.

You might want to catch up on years of user's list archives, first. There are opinions, and folks who tested it. Nothing new, really.

> I assume this is your service, and you give us permission to swamp it with hundreds of thousands of rapid queries every Saturday? If so I'll give sufficient warning to the list here and add it before Saturday masscheck.

Warning, or a brief discussion, if it might actually be worthwhile. Or not.
Re: [Bug 6155] generate new scores for 3.3.0 release
On 10/21/2009 10:46 PM, Karsten Bräckelmann wrote:
>   s/ Warren /SA devs, contributors and mass-check contributors/x
>   # There is something seriously disturbing with the above comment.
>   # Fix using a trivial substitution.
>
> This is not about Warren. He just happens to dump random BLs for a short time in his granted sandbox. It is everyone else, who does the heavy weight lifting.

While I agree it is unfortunate that he used my name there, don't you think what you wrote here is a bit unnecessarily insulting? This suggests that dumping random BL's into the sandbox is all I do.

> Warning, or a brief discussion, if it might actually be worthwhile. Or not.

Sure.

Warren
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #110 from Justin Mason j...@jmason.org 2009-10-20 03:46:49 UTC ---
(In reply to comment #109)
> (In reply to comment #108)
> > The important questions are, where is KB_RATWARE_BOUNDARY, which was specifically pushed right before the deadline to supersede these?
>
> Argh! It is in freqs.full, attachment 4541 [details]. However, it appears we've been using inconsistent rule-sets, with most contributors using one outdated rule-set or the other. :-(
>
>   10.830  14.1437  0.1901  0.987  0.67  0.00  T_KB_RATWARE_BOUNDARY
>    0.025   0.0327  0.      1.000  0.65  1.00  KB_RATWARE_BOUNDARY

mysterious:

  : exit=[130] uid=jm Tue Oct 20 10:40:30 GMT 2009; cd /export/home/corpus-rsync/corpus/submit
  : 6...; grep KB_RATWARE_BOUNDARY *.log | grep -v T_KB_RATWARE_BOUNDARY
  : exit=[0 1] uid=jm Tue Oct 20 10:43:41 GMT 2009; cd /export/home/corpus-rsync/corpus/submit

I can't find any non-T_ hits in the submit logs. Mark?
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #111 from Justin Mason j...@jmason.org 2009-10-20 03:48:45 UTC ---
(In reply to comment #110)
> (In reply to comment #109)
> > (In reply to comment #108)
> > > The important questions are, where is KB_RATWARE_BOUNDARY, which was specifically pushed right before the deadline to supersede these?

anyway it doesn't look like that rule is good enough to supersede them:

  10.830  14.1437  0.1901  0.987  0.67  0.00  T_KB_RATWARE_BOUNDARY

vs

   9.846  12.9126  0.0003  1.000  0.98  1.00  KB_RATWARE_OUTLOOK_08
   9.836  12.8985  0.0003  1.000  0.98  1.00  KB_RATWARE_OUTLOOK_MID
   9.835  12.8976  0.0003  1.000  0.98  1.00  KB_RATWARE_OUTLOOK_16
   9.835  12.8976  0.0003  1.000  0.98  1.00  KB_RATWARE_OUTLOOK_12

that's a much higher FP rate!
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #112 from Karsten Bräckelmann guent...@rudersport.de 2009-10-20 04:15:03 UTC ---
> anyway it doesn't look like that rule is good enough to supersede them:
> that's a much higher FP rate!

Yes. It's all Warren's fault! ;)

Seriously, the new BOUNDARY one does indeed have quite some FPs, all in Warren's corpus, and he kindly provided me with the samples. It appears these are all entirely legit, though auto-generated, messages. I wish MS wouldn't re-use their code like that.

  X-Mailer: Microsoft CDO for Windows 2000

Anyway, I agree -- RATWARE_BOUNDARY is bad, I screwed up with too low a range between headers. One of the previous rules needs to be kept. (The massive overlap along with the introduced FNs made it drop off of the active rules.)

Still wondering why there are different rule names in freqs.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #113 from Karsten Bräckelmann guent...@rudersport.de 2009-10-20 04:43:31 UTC ---
   9.836  12.8985  0.0003  1.000  0.98  1.00  KB_RATWARE_OUTLOOK_MID

Proposing the MID variant for inclusion, and dropping the other variants. The BOUNDARY one is bad, and the variants do have an almost 100% overlap with the MID one. It's also the most strict one. (Funny side-effect of the additional constraint is actually catching a spam or two more... Go figure.) The ham hit probably is not really ham (no FP in nightlies).
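"Almost 100% overlap" between rule variants is a statement about the sets of messages each rule hits. A minimal sketch of how such an overlap figure could be computed from per-rule hit sets follows. The message identifiers are invented purely for illustration, and this is not how the ruleqa pages themselves derive their overlap numbers.

```python
# Illustrative overlap measure between two rules' hit sets.

def overlap(hits_a, hits_b):
    """Fraction of rule A's hits that rule B also hits (0.0 if A never hits)."""
    hits_a, hits_b = set(hits_a), set(hits_b)
    if not hits_a:
        return 0.0
    return len(hits_a & hits_b) / len(hits_a)


# Hypothetical message IDs; real data would come from the mass-check logs.
outlook_08 = {"msg1", "msg2", "msg3", "msg4"}
outlook_mid = {"msg1", "msg2", "msg3", "msg4", "msg5"}

print("08 covered by MID:", overlap(outlook_08, outlook_mid))
print("MID covered by 08:", overlap(outlook_mid, outlook_08))
```

The measure is deliberately asymmetric: a variant whose hits are all covered by another rule adds nothing, while the rule that covers it may still catch a message or two more, as noted above for the MID variant.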
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #114 from Justin Mason j...@jmason.org 2009-10-20 08:26:26 UTC ---
(In reply to comment #113)
>    9.836  12.8985  0.0003  1.000  0.98  1.00  KB_RATWARE_OUTLOOK_MID
>
> Proposing the MID variant for inclusion, and dropping the other variants.

can you list exactly which rules you want zeroed, before Mark reruns the GA accordingly? minimize the work he has to do ;)
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #115 from Karsten Bräckelmann guent...@rudersport.de 2009-10-20 08:46:55 UTC ---

Err, sure. :) The following variations should just be dropped.

score KB_RATWARE_OUTLOOK_08  0
score KB_RATWARE_OUTLOOK_12  0
score KB_RATWARE_OUTLOOK_16  0
score KB_RATWARE_BOUNDARY    0

Keep KB_RATWARE_OUTLOOK_MID (instead of the above) and KB_RATWARE_MSGID (which is an unrelated rule anyway).
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 Adam Katz antis...@khopis.com changed: What|Removed |Added CC||antis...@khopis.com --- Comment #116 from Adam Katz antis...@khopis.com 2009-10-20 13:08:15 UTC --- Standing up for RDNS_NONE ... http://ruleqa.spamassassin.org/week/RDNS_NONE/detail bb_trec_enron has 98.9497% of its ham match RDNS_NONE, which is to say that it's bogus. Discounting that corpus, RDNS_NONE matches 58.7244% of the total spam corpus and 1.7463% of the total ham corpus (down from 12.1273%), which makes it far more interesting. Many of the people on the sa-users list have manually scored RDNS_NONE higher than the default 0.1. I score it at 0.9 on my own production servers. (Not sure if this is the right venue -- or if I'm an approved kibitzer) -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #117 from Karsten Bräckelmann guent...@rudersport.de 2009-10-20 13:17:26 UTC --- bb_trec_enron has 98.9497% of its ham match RDNS_NONE, which is to say that it's bogus. Indeed. From the dev list earlier today, that's a corpus with generated (synthetic) headers [...], only useful for body hits, and is not included in the re-scoring. Many of the people on the sa-users list have manually scored RDNS_NONE higher than the default 0.1. FWIW, nailed to 0.1 as per comment 56. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
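The effect discussed above, where a single synthetic corpus dominates a rule's overall ham hit-rate, can be sketched as follows. This is only an illustration: the thread gives percentages, not counts, so the per-corpus numbers and the names `corpus_a` and `corpus_b` are made up, and `hit_rate` is a hypothetical helper, not part of ruleqa.

```python
def hit_rate(counts, exclude=()):
    """Percentage of messages hit by a rule across corpora.

    counts maps corpus name -> (messages hit, total messages).
    Corpora named in `exclude` are left out of both sums.
    """
    hits = sum(h for name, (h, _) in counts.items() if name not in exclude)
    total = sum(t for name, (_, t) in counts.items() if name not in exclude)
    return 100.0 * hits / total if total else 0.0


# Hypothetical ham counts for one rule; only the corpus name
# bb_trec_enron comes from the thread.
ham_counts = {
    "bb_trec_enron": (9895, 10000),  # synthetic headers, nearly all hit
    "corpus_a": (150, 50000),
    "corpus_b": (725, 40000),
}

print(f"all corpora:     {hit_rate(ham_counts):.4f}%")
print(f"excluding enron: {hit_rate(ham_counts, exclude={'bb_trec_enron'}):.4f}%")
```

With these invented counts the ham hit-rate falls by roughly a factor of ten once the synthetic corpus is excluded, which is the same kind of shift Adam reports for RDNS_NONE.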
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #119 from Warren Togami wtog...@redhat.com 2009-10-20 13:47:28 UTC --- (In reply to comment #118) ... despite the current corpus data (unless 1.7% is a high ham hit-rate)? http://ruleqa.spamassassin.org/20091017-r826198-n/RDNS_NONE/detail The most recent weekly run has pretty substantial hits even outside of the synthetic corpus. Adam, this like your RCVD_IN_APNIC are examples of inherently prejudiced rules. It might work for the most part, and you might accept the risk of accidental FP's because the score alone wont push it above the threshold. However the combined risks of multiple prejudiced rules is too great. Prejudiced rules should be up to the sysadmin if they want to enable. We should not highly score any known prejudiced rules in the default ruleset. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #118 from Adam Katz antis...@khopis.com 2009-10-20 13:38:04 UTC --- (In reply to comment #117) bb_trec_enron has 98.9497% of its ham match RDNS_NONE, which is to say that it's bogus. Indeed. From the dev list earlier today, that's a corpus with generated (synthetic) headers [...], only useful for body hits, and is not included in the re-scoring. Ah, I thought I saw that corpus mentions somewhere ... only thought to search the bug. I had assumed that if the rulesqa page mentioned it, it was factored in everywhere. Many of the people on the sa-users list have manually scored RDNS_NONE higher than the default 0.1. FWIW, nailed to 0.1 as per comment 56. I saw that but did not understand it ... It says most of these are already documented and labeled as [fixed/immutable] but it doesn't say where. Is this because it triggers when rDNS checks aren't performed by the first trusted relay, and if so, can we work around that somehow (wasn't that bug 5586 )? Or is this a remnant of Justin's checkin r497852 from 2007 which states: move 20_dynrdns.cf from sandbox into main ruleset, so RDNS_DYNAMIC and RDNS_NONE are core rules; lock their scores to an informational 0.1, however, since they still have a high ham hit-rate alone ... despite the current corpus data (unless 1.7% is a high ham hit-rate)? -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #120 from Adam Katz antis...@khopis.com 2009-10-20 16:25:36 UTC --- (In reply to comment #119) (In reply to comment #118) ... despite the current corpus data (unless 1.7% is a high ham hit-rate)? http://ruleqa.spamassassin.org/20091017-r826198-n/RDNS_NONE/detail The most recent weekly run has pretty substantial hits even outside of the synthetic corpus. Your link is just a longer version of mine. It still results in a 1.7% total ham hit-rate. Is that too substantial? Is there detail on what each corpus is (specifically nbebout, since that's the only other corpus that hit 4+% of spam)? Looking only at ham scoring 4 or higher (including enron since I can't remove it), RDNS_NONE hit 0.8528% of the total ham corpus. Of the ham scoring JUST 4 (4.0-4.9), we're looking at 0.5865% that would become FPs assuming a score of 1.1 (increasing the 0.1 by 1), and I'm not even proposing my own implementation's 0.9. Adam, this [... and] your RCVD_IN_APNIC are examples of inherently prejudiced rules. It might work for the most part, and you might accept the risk of accidental FP's because the score alone wont push it above the threshold. However the combined risks of multiple prejudiced rules is too great. Prejudiced rules should be up to the sysadmin if they want to enable. We should not highly score any known prejudiced rules in the default ruleset. I agree that RCVD_VIA_APNIC is a prejudiced rule, and my channels have had it rated 0.001 ever since you called me out on it (RCVD_VIA_APNIC accidentally came in when I migrated from an internal-only propagation to a published channel). KHOP_NO_FIRST_NAME, my other poorly-considered published test, pre-dates my more thorough testing mechanism (which has limited new rules' entry quite considerably). My rules will get even more cleaned up once I get an svn account to test them here. 
(Some of them, like the biased RCVD_IN_APNIC and quasi-biased/unfair KHOP_SC_CIDR8, would either never get pushed up for testing or would get the nopublish flag, depending on the guidelines ... that nobody has yet pointed me to.) (Side note: I see __RCVD_VIA_APNIC is already in your own sandbox, hitting 86% of all Japanese ham.) Getting back to this issue: I don't see any problem with prejudice against poorly constructed network infrastructures that can't bother to adhere to the SMTP standard (RFC1912 section 2.1). This is something that any network admin who should legitimately be managing a mail server should be able to fix with a single phone call (please correct me if this sentence is prejudiced in any way). The SMTP standard requires a server's rDNS must match the server's reported name (thus the IP must have rDNS), and most allocated IPs have them anyway (even if they're wrong or ~dynamic, e.g. RDNS_DYNAMIC). There is also a growing number of deployments that block improper FCrDNS at the door (RDNS_NONE is a subset of failing FCrDNS). SA already has built-in prejudices against poorly constructed email clients (e.g. MISSING_HEADERS) and relays (e.g. DATE_IN_FUTURE_48_96), so why not the network? Isn't SPF_FAIL a prejudiced test against network configuration? SA at its core is merely a system of probabilities. Even without bayes, the masscheck mechanism and its points are awarded based on statistical significance. Very few rules are actually free of FPs (or FNs for negative rules). My question still stands: what does SA deem statistically significant when it comes to FPs? Why does RDNS_NONE need to be immutable rather than dictated by the masscheck results? What would the automated system score RDNS_NONE if it were allowed to? I'm guessing something between 0.2 and 0.7. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
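Adam's estimate above (ham that already scores 4.0-4.9 and hits the rule would cross the threshold if the rule's score rose by 1) can be reproduced with a short sketch. It assumes the ham corpus is available as (score, set of rule names) pairs and uses the 5.0 threshold; `new_false_positives` and the sample data are hypothetical.

```python
def new_false_positives(ham, rule, delta, threshold=5.0):
    """Count ham messages that would newly reach the threshold
    if `rule` were scored `delta` points higher.

    ham is an iterable of (current_score, set_of_rule_names).
    Messages already at or above the threshold are not counted,
    since they are false positives regardless of the change.
    """
    return sum(
        1
        for score, hits in ham
        if rule in hits and score < threshold <= score + delta
    )


# Invented sample: five ham messages with their current scores.
sample_ham = [
    (4.2, {"RDNS_NONE"}),          # 4.2 + 1.0 crosses 5.0
    (4.5, {"SOME_OTHER_RULE"}),    # rule not hit, unaffected
    (3.5, {"RDNS_NONE"}),          # stays below 5.0
    (5.3, {"RDNS_NONE"}),          # already a false positive
    (4.0, {"RDNS_NONE"}),          # reaches exactly 5.0
]

print(new_false_positives(sample_ham, "RDNS_NONE", delta=1.0))
```

Dividing the count by the total ham size gives the percentage Adam quotes; the sketch says nothing about how the GA would rebalance other scores in response.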
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #121 from Warren Togami wtog...@redhat.com 2009-10-20 19:00:36 UTC ---

(In reply to comment #120)
> I agree that RCVD_VIA_APNIC is a prejudiced rule, and my channels have had it rated 0.001 ever since you called me out on it (RCVD_VIA_APNIC accidentally

OK, glad to hear that you reduced it. I didn't look at your scores after that first time. You should really get a SpamAssassin account so your rules can be more thoroughly tested against more varied corpora.

> nobody has yet pointed me to.) (Side note: I see __RCVD_VIA_APNIC is already in your own sandbox, hitting 86% of all Japanese ham.)

Yes, I'm using it as a softener to exclude from the extremely prejudiced CN_NUMBER rules. It just so happens that the majority of CN_NUMBER spam comes from !APNIC, and APNIC is prejudiced in exactly the way to make CN_NUMBER rules less dangerous. Even though those rules have high spam hit rates and zero FPs across our nightly masscheck corpora, they are still too prejudiced to be safe as default rules.

> SA at its core is merely a system of probabilities. Even without bayes, the masscheck mechanism and its points are awarded based on statistical significance. Very few rules are actually free of FPs (or FNs for negative rules). My question still stands: what does SA deem statistically significant when it comes to FPs? Why does RDNS_NONE need to be immutable rather than dictated by the masscheck results? What would the automated system score RDNS_NONE if it were allowed to? I'm guessing something between 0.2 and 0.7.

That is an interesting question.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #101 from Justin Mason j...@jmason.org 2009-10-19 07:53:59 UTC ---

(In reply to comment #98)
> The RCVD_IN_DNSWL_* scores are again unusual:
>
> score RCVD_IN_DNSWL_HI   0 -0.466 0 -0.001
> score RCVD_IN_DNSWL_LOW  0 -0.292 0 -0.760
> score RCVD_IN_DNSWL_MED  0 -1.703 0 -0.727
>
> probably because of their low frequency, especially the _HI rule:
>
> OVERALL   SPAM%     HAM%    S/O   RANK  SCORE  NAME
>   0.184  0.0007   0.5707  0.001   0.76  -1.00  RCVD_IN_DNSWL_HI
>   7.408  0.1096  22.7509  0.005   0.67  -1.00  RCVD_IN_DNSWL_MED
>   2.553  0.1816   7.5365  0.024   0.59  -1.00  RCVD_IN_DNSWL_LOW
>
> and resulting zero ranges (tmp/ranges.data):
>
> 0.000 0.000 0 RCVD_IN_DNSWL_HI
> 0.000 0.000 0 RCVD_IN_DNSWL_MED
> 0.000 0.000 0 RCVD_IN_DNSWL_LOW
>
> Don't know what a clean solution is, apart from fixing their scores manually.

feel free to fix them; it's hard for the GA to be mostly right about network rules. tbh I'm surprised the ranges were zeroed (for _MED at least).
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #102 from Justin Mason j...@jmason.org 2009-10-19 07:55:57 UTC --- (In reply to comment #99) I just discovered that I was falsely triggering rules like RCVD_IN_SORBS_DUL, RCVD_IN_PBL or RDNS_DYNAMIC on some of my corpus ham due to a misconfiguration on my server. My users delivering mail directly to other users on my server from their home ISP or mobile phone were lacking authenticated user within the Received header causing many hits on these and unknown other rules. Roughly ~150-170 of my FP's on these three rules should not count against those rules. Nearly all of my RCVD_IN_SORBS_DUL and RCVD_IN_PBL should have been AllTrusted instead. Is this enough to throw off the GA scoring? if you want, feel free to sed the log files to fix this, or just remove the lines entirely, and reupload. 170 FPs for those DUL rules is quite strong imo. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #103 from Warren Togami wtog...@redhat.com 2009-10-19 10:31:26 UTC --- if you want, feel free to sed the log files to fix this, or just remove the lines entirely, and reupload. 170 FPs for those DUL rules is quite strong imo. Removed the majority of the offending lines and reuploaded ham-rescore-wt*.log. I also zeroed out *wt-en6.log because they were found to be too corrupted to trust the results. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
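The cleanup described above (dropping log lines whose hits on certain rules are known to be bogus) could look roughly like this. The sketch assumes each mass-check log line has whitespace-separated fields with the comma-separated rule hits in the fourth field and that comment lines start with `#`; that layout, the sample lines, and the helper `drop_lines_hitting` are assumptions for illustration, not taken from this thread.

```python
def drop_lines_hitting(lines, bad_rules):
    """Return the log lines that do not hit any rule in bad_rules.

    Assumed line layout (an assumption, check it against real logs):
        <result> <score> <path> <RULE_A,RULE_B,...> <metadata>
    Comment lines starting with '#' are always kept.
    """
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.split(None, 4)
        hits = set(fields[3].split(",")) if len(fields) > 3 else set()
        if hits & bad_rules:
            continue  # this message's hits are not trustworthy, skip it
        kept.append(line)
    return kept


# Invented example lines; paths and metadata are placeholders.
sample_log = [
    "# mass-check results",
    ". 1 /ham/msg1 RCVD_IN_PBL,RDNS_DYNAMIC time=1250000000",
    ". 0 /ham/msg2 BAYES_00 time=1250000100",
    ". 2 /ham/msg3 RCVD_IN_SORBS_DUL time=1250000200",
]

suspect = {"RCVD_IN_SORBS_DUL", "RCVD_IN_PBL", "RDNS_DYNAMIC"}
for line in drop_lines_hitting(sample_log, suspect):
    print(line)
```

Removing whole lines discards the other rule hits on those messages too; the alternative Justin mentions, editing the rule names out of the lines, keeps them but needs more care.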
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #104 from Mark Martinec mark.marti...@ijs.si 2009-10-19 11:28:49 UTC --- (In reply to comment #103) Removed the majority of the offending lines and reuploaded ham-rescore-wt*.log. I also zeroed out *wt-en6.log because they were found to be too corrupted to trust the results. Thanks. Seems you did it in the 'corpus' rsync directory. Please also update them in the 'submit' directory using existing names, otherwise in few weeks time we'll all forget which file came from where - after all, the 'submit' directory is the official source for rescoring runs. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #105 from Karsten Bräckelmann guent...@rudersport.de 2009-10-19 12:21:56 UTC --- Argh, late to the show, sorry. :-/ From the second GA re-score run, attachment 4553 (aligned for readability): score KB_RATWARE_MSGID 4.099 3.315 4.095 1.475 This is awesome! :) Though unrelated, so let me move on to the issue. score KB_RATWARE_OUTLOOK_08 1.100 3.232 0.776 0.025 score KB_RATWARE_OUTLOOK_12 2.734 2.826 1.654 0.041 score KB_RATWARE_OUTLOOK_16 1.725 3.331 2.532 0.887 score KB_RATWARE_OUTLOOK_MID 2.259 2.485 3.121 0.001 This is also awesome -- kind of. But frankly, it also is a total mess. They are essentially the same, just slightly differing in strictness or fuzziness. They are almost *exactly* overlapping -- *all* four of them (see ruleqa). These rules are really redundant, and there should be only one instead. FWIW, that *should* be KB_RATWARE_BOUNDARY, which was added specifically for this. This rule seems to be missing entirely, though. :( Looking at the scores, I don't think simply adding them would do. Also, I'm kind of un-satisfied with the score-set 3 scores. The FP rate is 0! (Almost, I'll challenge the ham hits.) For all five rules above. Net tests or not... -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #106 from Warren Togami wtog...@redhat.com 2009-10-19 12:35:30 UTC --- Thanks. Seems you did it in the 'corpus' rsync directory. Please also update them in the 'submit' directory using existing names, otherwise in few weeks time we'll all forget which file came from where - after all, the 'submit' directory is the official source for rescoring runs. Fixed in 'submit'. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #107 from Justin Mason j...@jmason.org 2009-10-19 14:26:25 UTC --- (In reply to comment #105) score KB_RATWARE_OUTLOOK_08 1.100 3.232 0.776 0.025 score KB_RATWARE_OUTLOOK_12 2.734 2.826 1.654 0.041 score KB_RATWARE_OUTLOOK_16 1.725 3.331 2.532 0.887 score KB_RATWARE_OUTLOOK_MID 2.259 2.485 3.121 0.001 This is also awesome -- kind of. But frankly, it also is a total mess. They are essentially the same, just slightly differing in strictness or fuzziness. They are almost *exactly* overlapping -- *all* four of them (see ruleqa). These rules are really redundant, and there should be only one instead. FWIW, that *should* be KB_RATWARE_BOUNDARY, which was added specifically for this. This rule seems to be missing entirely, though. :( Looking at the scores, I don't think simply adding them would do. Also, I'm kind of un-satisfied with the score-set 3 scores. The FP rate is 0! (Almost, I'll challenge the ham hits.) For all five rules above. Net tests or not... it looks like they overlap a lot with some other rules. But yes, if they were just 1 rule, it probably would have gotten a better single score. I'm not sure if it's too late to fix this or not. :( -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #108 from Karsten Bräckelmann guent...@rudersport.de 2009-10-19 14:49:17 UTC --- (In reply to comment #107) it looks like they overlap a lot with some other rules. But yes, if they were just 1 rule, it probably would have gotten a better single score. I'm not sure if it's too late to fix this or not. :( Frankly, pretty much either one could be used and all other variants simply be dropped for the next re-score run. Keeping all of them is just a waste of cycles. The important questions are, where is KB_RATWARE_BOUNDARY, which was specifically pushed right before the deadline to supersede these? And of course, why do the scores drop that drastically with score-set 3, if there is *no* FP? Regardless of the spam already scoring above 5, there is no FP reason to lower the score. -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #109 from Karsten Bräckelmann guent...@rudersport.de 2009-10-19 15:37:16 UTC ---

(In reply to comment #108)
> The important questions are, where is KB_RATWARE_BOUNDARY, which was specifically pushed right before the deadline to supersede these?

Argh! It is in freqs.full, attachment 4541. However, it appears we've been using inconsistent rule-sets, with most contributors using one outdated rule-set or the other. :-(

 10.830  14.1437  0.1901  0.987  0.67  0.00  T_KB_RATWARE_BOUNDARY
  0.025   0.0327  0.      1.000  0.65  1.00  KB_RATWARE_BOUNDARY
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #100 from Mark Martinec mark.marti...@ijs.si 2009-10-15 11:56:23 UTC --- Btw, I added a Target Milestone 3.3.1, so that a triage on 3.3.0 bugs may be more selective, choosing between Future/Undefined/3.3.1 -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Mark Martinec mark.marti...@ijs.si changed:

              What |Removed |Added
  Attachment #4550 |0       |1
       is obsolete |        |

--- Comment #96 from Mark Martinec mark.marti...@ijs.si 2009-10-14 16:21:44 UTC ---

Created an attachment (id=4553) -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4553)
resulting 50_scores.cf from garescorer runs - V2

Here is now a 50_scores.cf from my second attempt after cleaning some logs: removed binnocenti and wt-en6 logs as per Comment 93, removed DKIM_ADSP_DISCARD hits from ham-bayes-net-bluestreak.log.

I have also limited the log entries to fewer months following the RescoreMassCheck (wiki): -m 6 for spam, and -m 25 for ham (after 25th month there is a large gap in data till the next peak, too far in the past).

This leaves us with the following number of entries in merged logs:

score set 1 (no data from score set 3), provides data for set0 and set1:
  360070 ham-full-set1.log
  472682 spam-full-set1.log

score set 3, provides data for set2 and set3:
  210603 ham-full-set3.log
  442709 spam-full-set3.log

For DCC_ rules, I took the DCC_CHECK value of 1.1 from a preliminary run which had all the DCC_REPUT_* scores fixed at 0, then for the next run I fixed the DCC_CHECK, but left the DCC_REPUT_* scores floating. This should cope with both types of sites: those with a commercial license that do receive reputation results from DCC servers, and those who don't.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #97 from Mark Martinec mark.marti...@ijs.si 2009-10-14 16:29:29 UTC ---

gen-set0-5-5.0-1-ga

test (10%)
# SUMMARY for threshold 5.0:
# Correctly non-spam:  35461  98.50%
# Correctly spam:      38357  81.35%
# False positives:       541   1.50%
# False negatives:      8794  18.65%
# TCR(l=50): 1.315450  SpamRecall: 81.349%  SpamPrec: 98.609%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 283119  42.494%  (98.304% of non-spam corpus)
# Correctly spam:     306367  45.984%  (80.997% of spam corpus)
# False positives:      4886   0.733%  (1.696% of nonspam, 179777 weighted)
# False negatives:     71879  10.789%  (19.003% of spam, 231331 weighted)
# Average score for spam:  10.4  nonspam: 1.7
# Average for false-pos:   5.6   false-neg: 3.2
# TOTAL:              666251  100.00%

gen-set1-10-5.0-1-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  35942  99.83%
# Correctly spam:      45983  97.52%
# False positives:        60   0.17%
# False negatives:      1168   2.48%
# TCR(l=50): 11.312620  SpamRecall: 97.523%  SpamPrec: 99.870%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 287639  43.173%  (99.873% of non-spam corpus)
# Correctly spam:     368783  55.352%  (97.498% of spam corpus)
# False positives:       366   0.055%  (0.127% of nonspam, 27040 weighted)
# False negatives:      9463   1.420%  (2.502% of spam, 29645 weighted)
# Average score for spam:  20.3  nonspam: 0.2
# Average for false-pos:   5.6   false-neg: 3.1
# TOTAL:              666251  100.00%

gen-set2-10-5.0-1-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  35949  99.85%
# Correctly spam:      44538  94.46%
# False positives:        53   0.15%
# False negatives:      2613   5.54%
# TCR(l=50): 8.958959  SpamRecall: 94.458%  SpamPrec: 99.881%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 287557  43.160%  (99.844% of non-spam corpus)
# Correctly spam:     357656  53.682%  (94.556% of spam corpus)
# False positives:       448   0.067%  (0.156% of nonspam, 33456 weighted)
# False negatives:     20590   3.090%  (5.444% of spam, 73371 weighted)
# Average score for spam:  12.3  nonspam: 0.8
# Average for false-pos:   5.7   false-neg: 3.6
# TOTAL:              666251  100.00%

gen-set3-20-5.0-1-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21173  99.92%
# Correctly spam:      43749  99.08%
# False positives:        17   0.08%
# False negatives:       404   0.92%
# TCR(l=50): 35.209729  SpamRecall: 99.085%  SpamPrec: 99.961%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168159  32.186%  (99.976% of non-spam corpus)
# Correctly spam:     350875  67.159%  (99.046% of spam corpus)
# False positives:        40   0.008%  (0.024% of nonspam, 9039 weighted)
# False negatives:      3379   0.647%  (0.954% of spam, 11476 weighted)
# Average score for spam:  19.3  nonspam: -0.8
# Average for false-pos:   5.4   false-neg: 3.4
# TOTAL:              522453  100.00%

===

In summary, the essential data:

score set 0 (no net, no bayes):
# False positives:      4886   0.733%  (1.696% of nonspam, 179777 weighted)
# False negatives:     71879  10.789%  (19.003% of spam, 231331 weighted)

score set 1 (net, no bayes):
# False positives:       366   0.055%  (0.127% of nonspam, 27040 weighted)
# False negatives:      9463   1.420%  (2.502% of spam, 29645 weighted)

score set 2 (no net, bayes):
# False positives:       448   0.067%  (0.156% of nonspam, 33456 weighted)
# False negatives:     20590   3.090%  (5.444% of spam, 73371 weighted)

score set 3 (net, bayes):
# False positives:        40   0.008%  (0.024% of nonspam, 9039 weighted)
# False negatives:      3379   0.647%  (0.954% of spam, 11476 weighted)
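The TCR, SpamRecall and SpamPrec figures in the test summaries above are consistent with the usual definitions: recall is caught spam over all spam, precision is caught spam over everything flagged, and TCR (total cost ratio) is the spam count divided by lambda times the false positives plus the false negatives. The sketch below is a reading of the numbers, not the garescorer source; the function names are hypothetical.

```python
def spam_recall(spam_caught, false_neg):
    """Fraction of all spam that was flagged as spam."""
    return spam_caught / (spam_caught + false_neg)


def spam_precision(spam_caught, false_pos):
    """Fraction of flagged messages that really were spam."""
    return spam_caught / (spam_caught + false_pos)


def tcr(spam_caught, false_pos, false_neg, lam=50):
    """Total cost ratio: cost of no filter over cost of this filter,
    with each false positive weighted lam times a false negative."""
    n_spam = spam_caught + false_neg
    return n_spam / (lam * false_pos + false_neg)


# Counts from the gen-set0 test (10%) summary above.
caught, fp, fn = 38357, 541, 8794
print(f"TCR(l=50): {tcr(caught, fp, fn):.6f}")
print(f"SpamRecall: {100 * spam_recall(caught, fn):.3f}%")
print(f"SpamPrec: {100 * spam_precision(caught, fp):.3f}%")
```

With lambda = 50, a TCR below 1 means the filter costs more in false positives than it saves, which is why the score set 0 figure sits so close to 1 while score set 3 reaches about 35.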
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #98 from Mark Martinec mark.marti...@ijs.si 2009-10-14 16:48:26 UTC ---

The RCVD_IN_DNSWL_* scores are again unusual:

score RCVD_IN_DNSWL_HI   0 -0.466 0 -0.001
score RCVD_IN_DNSWL_LOW  0 -0.292 0 -0.760
score RCVD_IN_DNSWL_MED  0 -1.703 0 -0.727

probably because of their low frequency, especially the _HI rule:

OVERALL   SPAM%     HAM%    S/O   RANK  SCORE  NAME
  0.184  0.0007   0.5707  0.001   0.76  -1.00  RCVD_IN_DNSWL_HI
  7.408  0.1096  22.7509  0.005   0.67  -1.00  RCVD_IN_DNSWL_MED
  2.553  0.1816   7.5365  0.024   0.59  -1.00  RCVD_IN_DNSWL_LOW

and resulting zero ranges (tmp/ranges.data):

0.000 0.000 0 RCVD_IN_DNSWL_HI
0.000 0.000 0 RCVD_IN_DNSWL_MED
0.000 0.000 0 RCVD_IN_DNSWL_LOW

Don't know what a clean solution is, apart from fixing their scores manually.
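The S/O column in the freqs table above appears to be the spam hit percentage divided by the sum of the spam and ham hit percentages, so a value near 0 marks a rule that fires almost only on ham. The sketch below checks that reading against the three DNSWL rows; `s_o` is a hypothetical helper and the formula is inferred from the numbers, not quoted from the freqs tooling.

```python
def s_o(spam_pct, ham_pct):
    """Spam/overall ratio from per-class hit percentages.

    Returns None when the rule hit nothing at all, since the
    ratio is undefined in that case.
    """
    total = spam_pct + ham_pct
    if total == 0:
        return None
    return spam_pct / total


# (SPAM%, HAM%) pairs copied from the table above.
rows = {
    "RCVD_IN_DNSWL_HI": (0.0007, 0.5707),
    "RCVD_IN_DNSWL_MED": (0.1096, 22.7509),
    "RCVD_IN_DNSWL_LOW": (0.1816, 7.5365),
}

for name, (spam_pct, ham_pct) in rows.items():
    print(f"{s_o(spam_pct, ham_pct):.3f}  {name}")
```

Because the ratio is built from percentages rather than raw counts, it does not show how few messages stand behind a row such as _HI, which is the low-frequency problem Mark points out.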
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #99 from Warren Togami wtog...@redhat.com 2009-10-14 21:58:58 UTC --- I just discovered that I was falsely triggering rules like RCVD_IN_SORBS_DUL, RCVD_IN_PBL or RDNS_DYNAMIC on some of my corpus ham due to a misconfiguration on my server. My users delivering mail directly to other users on my server from their home ISP or mobile phone were lacking authenticated user within the Received header causing many hits on these and unknown other rules. Roughly ~150-170 of my FP's on these three rules should not count against those rules. Nearly all of my RCVD_IN_SORBS_DUL and RCVD_IN_PBL should have been AllTrusted instead. Is this enough to throw off the GA scoring? -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #93 from Warren Togami wtog...@redhat.com 2009-10-11 00:01:01 UTC ---

Bad news. Please remove the binnocenti logs from the rescore masschecks. Working with him we discovered 50+ additional spam messages in his ham folders, and there are certainly more. Furthermore, his ham contains lots of automated low-quality sources like Bugzilla, trac, cron and log monitoring daemons that should probably be removed from ham corpora. It seems the incorrect FPs and bias introduced by this corpus could be large enough to throw off scoring.

Did you also remove wt-en6 after we discovered that copying mail from a Yahoo account corrupts the messages?
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Matthias Leisi matth...@leisi.net changed:

  What |Removed |Added
    CC |        |matth...@leisi.net

--- Comment #94 from Matthias Leisi matth...@leisi.net 2009-10-11 02:19:21 UTC ---

(In reply to comment #56)
> Here is a set of rules in 50_scores.cf that I ended up as fixed (immutable) for the GA run (score set 3). Most of these are already documented and labeled as such, but it doesn't hurt to post it here as a double-check.

I suspect that RCVD_IN_DNSWL_* should be immutable as well; in generated scores, there are counter-intuitive scores assigned (expected _HI _MED _LOW, observed _MED _HI _LOW).

https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf has the following outside the gen:mutable section:

| score RCVD_IN_DNSWL_LOW 0 -1 0 -1
| score RCVD_IN_DNSWL_MED 0 -4 0 -4
| score RCVD_IN_DNSWL_HI  0 -8 0 -8

The DNSWL stats posted by Warren to the users list seem to indicate that this should be the correct ordering (at least based on safety):

|   SPAM%     HAM%  RANK  RULE
| 0.0016%  4.2489%  0.91  RCVD_IN_DNSWL_HI
| 0.0281%  6.9639%  0.90  RCVD_IN_DNSWL_MED
| 0.1147%  3.9169%  0.81  RCVD_IN_DNSWL_LOW
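The ordering concern above can be checked mechanically: for a whitelist family, the _HI score should be at least as negative as _MED, and _MED at least as negative as _LOW, in every score set. The sketch below parses `score NAME v0 v1 v2 v3` lines and reports the score sets where that fails. The helpers `parse_scores` and `ordering_violations` are hypothetical; the score values are the GA output quoted in comment #98 and the trunk values quoted above.

```python
def parse_scores(text):
    """Parse 'score RULE v0 v1 v2 v3' lines into {RULE: [floats]}."""
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "score":
            scores[parts[1]] = [float(v) for v in parts[2:]]
    return scores


def ordering_violations(scores, ordered_rules):
    """Return indices of score sets where the rules, listed from
    strongest to weakest whitelist, are not in ascending score order."""
    n_sets = min(len(scores[r]) for r in ordered_rules)
    bad = []
    for i in range(n_sets):
        vals = [scores[r][i] for r in ordered_rules]
        if any(a > b for a, b in zip(vals, vals[1:])):
            bad.append(i)
    return bad


ga_output = """
score RCVD_IN_DNSWL_HI 0 -0.466 0 -0.001
score RCVD_IN_DNSWL_LOW 0 -0.292 0 -0.760
score RCVD_IN_DNSWL_MED 0 -1.703 0 -0.727
"""
trunk = """
score RCVD_IN_DNSWL_LOW 0 -1 0 -1
score RCVD_IN_DNSWL_MED 0 -4 0 -4
score RCVD_IN_DNSWL_HI 0 -8 0 -8
"""
family = ["RCVD_IN_DNSWL_HI", "RCVD_IN_DNSWL_MED", "RCVD_IN_DNSWL_LOW"]

print(ordering_violations(parse_scores(ga_output), family))
print(ordering_violations(parse_scores(trunk), family))
```

A check like this only flags the inversion; whether to fix the scores by hand or mark the rules immutable is the policy question the thread is debating.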
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #95 from Warren Togami wtog...@redhat.com 2009-10-11 07:03:21 UTC ---

(In reply to comment #94)
> The DNSWL stats posted by Warren to the users list seem to indicate that this should be the correct ordering (at least based on safety):
>
> |   SPAM%     HAM%  RANK  RULE
> | 0.0016%  4.2489%  0.91  RCVD_IN_DNSWL_HI
> | 0.0281%  6.9639%  0.90  RCVD_IN_DNSWL_MED
> | 0.1147%  3.9169%  0.81  RCVD_IN_DNSWL_LOW

These were yesterday's weekly results, not the rescore masscheck. Weekly results are a smaller sample size and lower confidence.

http://ruleqa.spamassassin.org/20090930-r808953-n

  SPAM%      HAM%  RANK  RULE
0.0002%   0.3651%  0.75  RCVD_IN_DNSWL_HI
0.0288%  18.6970%  0.79  RCVD_IN_DNSWL_MED
0.0753%   8.1433%  0.68  RCVD_IN_DNSWL_LOW

This was the rescore masscheck.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155 --- Comment #88 from Mark Martinec mark.marti...@ijs.si 2009-10-09 06:23:06 PDT --- The release notes could then say that one should lower the DKIM_ADSP_* scores on installations where it is known that mail is not reaching SpamAssassin in its pristine form (as received by the MTA). This case or old ham where the sender subsequently changed their DKIM policy is only an issue for masscheck, not production scanning. True for the case of old ham where the sender subsequently changed their DKIM policy, or for the case of expired signatures - these are only an issue with masscheck. ...but not the case of wt-en6, where mail is transformed by its path through webmail. This is an issue both for masschecks, as well as for production runs. Lowering the DKIM scores makes no sense then? If one knows that mail reaching SpamAssassin will be modified by his mail path, then one must disable rules targeting mail forgery and depending on a pristine mail, such as the DKIM_ADSP_DISCARD rule. Otherwise the rule would generate FP score points for legitimate mail from domains publishing ADSP (explicitly or through overrides). -- Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are the assignee for the bug.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #89 from Mark Martinec mark.marti...@ijs.si 2009-10-09 06:38:09 PDT ---
Created an attachment (id=4550)
-- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4550)
resulting 50_scores.cf from garescorer runs

Ok, here it is at last, the auto-generated 50_scores.cf from garescorer runs on all four sets, with no hand-tweaking of results (yet) ... to give us something to digest and comment on, and to serve as a first approximation. Some values are surprising or plain wrong, I'll comment on some later.

I used the submitted logs (tweaked as per Comment 78), with all the recent updates to them as posted so far in this ticket.

I left the BAYES scores fully floating. I fixed at zero the DCC_REPUT_* scores and JM_SOUGHT_FRAUD_*, as was discussed previously (as can be seen at the end of the attached file). Eventually these will need to be set to some manually determined score.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #90 from Mark Martinec mark.marti...@ijs.si 2009-10-09 06:49:27 PDT ---
To assess the quality and repeatability of results, here are the summaries for all four score sets. Each pair consists of a normal run on 90% of the log entries and a test run on the remaining 10%. The most interesting figures are the FP and FN percentages, e.g. 0.028% and 0.961% in this clipping:

# False positives:        65   0.011%  (0.028% of nonspam, 10580 weighted)
# False negatives:      3411   0.578%  (0.961% of spam, 12054 weighted)

== gen-set0-5-5.0-25000-ga
SCORESET 0: (no net, no bayes)

test (10%):
# SUMMARY for threshold 5.0:
# Correctly non-spam:  45335  98.03%
# Correctly spam:      39320  81.61%
# False positives:       913   1.97%
# False negatives:      8860  18.39%
# TCR(l=50): 0.883875  SpamRecall: 81.611%  SpamPrec: 97.731%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 365397  48.193%  (98.401% of non-spam corpus)
# Correctly spam:     314466  41.476%  (81.286% of spam corpus)
# False positives:      5936   0.783%  (1.599% of nonspam, 173347 weighted)
# False negatives:     72396   9.548%  (18.714% of spam, 226867 weighted)
# Average score for spam: 10.0  nonspam: 1.4
# Average for false-pos:   5.6  false-neg: 3.1
# TOTAL:              758195  100.00%

== gen-set1-10-5.0-3-ga
SCORESET 1: (net, no bayes)

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  46183  99.86%
# Correctly spam:      46648  96.82%
# False positives:        65   0.14%
# False negatives:      1532   3.18%
# TCR(l=50): 10.075282  SpamRecall: 96.820%  SpamPrec: 99.861%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 370804  48.906%  (99.858% of non-spam corpus)
# Correctly spam:     374579  49.404%  (96.825% of spam corpus)
# False positives:       529   0.070%  (0.142% of nonspam, 31804 weighted)
# False negatives:     12283   1.620%  (3.175% of spam, 39385 weighted)
# Average score for spam: 17.4  nonspam: 0.4
# Average for false-pos:   5.8  false-neg: 3.2
# TOTAL:              758195  100.00%

== gen-set2-10-5.0-3-ga
SCORESET 2: (no net, bayes)

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  29308  99.78%
# Correctly spam:      42344  95.69%
# False positives:        64   0.22%
# False negatives:      1907   4.31%
# TCR(l=50): 8.664774  SpamRecall: 95.690%  SpamPrec: 99.849%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 234375  39.745%  (99.864% of non-spam corpus)
# Correctly spam:     339736  57.612%  (95.700% of spam corpus)
# False positives:       320   0.054%  (0.136% of nonspam, 26164 weighted)
# False negatives:     15265   2.589%  (4.300% of spam, 58794 weighted)
# Average score for spam: 10.4  nonspam: 0.6
# Average for false-pos:   5.4  false-neg: 3.9
# TOTAL:              589696  100.00%

== gen-set3-20-5.0-2-ga
SCORESET 3: (net, bayes)

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  29342  99.90%
# Correctly spam:      43843  99.08%
# False positives:        30   0.10%
# False negatives:       408   0.92%
# TCR(l=50): 23.192348  SpamRecall: 99.078%  SpamPrec: 99.932%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 234630  39.788%  (99.972% of non-spam corpus)
# Correctly spam:     351590  59.622%  (99.039% of spam corpus)
# False positives:        65   0.011%  (0.028% of nonspam, 10580 weighted)
# False negatives:      3411   0.578%  (0.961% of spam, 12054 weighted)
# Average score for spam: 18.5  nonspam: -0.1
# Average for false-pos:   5.4  false-neg: 3.5
# TOTAL:              589696  100.00%
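Editor's note: the derived figures in the summaries above can be recomputed from the four raw counts. The following is a minimal sketch, not the code that produced the logs. The `Summary` class and its method names are hypothetical, and the TCR formula (spam count divided by lambda times false positives plus false negatives) is an assumption inferred from the reported numbers, which it reproduces for all four test runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Raw counts from one '# SUMMARY for threshold 5.0' block (hypothetical helper)."""
    ham_ok: int     # "Correctly non-spam"
    spam_ok: int    # "Correctly spam"
    false_pos: int  # ham that scored at or above the threshold
    false_neg: int  # spam that scored below the threshold

    @property
    def n_ham(self) -> int:
        return self.ham_ok + self.false_pos

    @property
    def n_spam(self) -> int:
        return self.spam_ok + self.false_neg

    def fp_pct_of_ham(self) -> float:
        # The "(x% of nonspam ...)" figure on the False positives line.
        return 100.0 * self.false_pos / self.n_ham

    def fn_pct_of_spam(self) -> float:
        # The "(x% of spam ...)" figure on the False negatives line.
        return 100.0 * self.false_neg / self.n_spam

    def spam_recall_pct(self) -> float:
        return 100.0 * self.spam_ok / self.n_spam

    def spam_precision_pct(self) -> float:
        return 100.0 * self.spam_ok / (self.spam_ok + self.false_pos)

    def tcr(self, lam: float = 50.0) -> float:
        # Assumed formula: each false positive costs `lam` times a false negative.
        return self.n_spam / (lam * self.false_pos + self.false_neg)


# Counts copied from the score set 0 test run and the score set 3 scores run.
set0_test = Summary(ham_ok=45335, spam_ok=39320, false_pos=913, false_neg=8860)
set3_scores = Summary(ham_ok=234630, spam_ok=351590, false_pos=65, false_neg=3411)

print(f"set0 test: TCR={set0_test.tcr():.6f} "
      f"recall={set0_test.spam_recall_pct():.3f}% "
      f"precision={set0_test.spam_precision_pct():.3f}%")
print(f"set3 scores: FP={set3_scores.fp_pct_of_ham():.3f}% of nonspam, "
      f"FN={set3_scores.fn_pct_of_spam():.3f}% of spam")
```

With lambda = 50, a single false positive outweighs fifty false negatives, which is why score set 0 (913 false positives in the test run) ends up with a TCR below 1 despite catching over 80% of spam.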
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #92 from Warren Togami wtog...@redhat.com 2009-10-09 20:22:24 UTC ---
(In reply to comment #89)
> Created an attachment (id=4550)
> -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4550) [details]
> resulting 50_scores.cf from garescorer runs
>
> Ok, here it is at last, the auto-generated 50_scores.cf from garescorer runs on all four sets, with no hand-tweaking of results (yet) ... to give us something to digest and comment on, and can serve as the first approximation. Some values are surprising or plain wrong, I'll comment on some later.

Bug #6156 RCVD_IN_PSBL

We should manually adjust this score to something between 2.0 and 2.5, for these reasons:

* Rescore masschecks were with deep parsing. We have subsequently changed it to lastexternal, which should be much safer. Even with deep parsing it proved to be very good.
* At the time of the rescore masschecks, PSBL's recent whitelist filtering of gmail, yahoo, rr.com and several other major ISPs had not yet timed out legitimate MTAs. Safety should be improved further now.
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #83 from Henrik Krohns h...@hege.li 2009-10-08 01:02:43 PDT ---
Cleaned up my DKIM_ADSP_DISCARD hits (old 2005 ebay mails removed) and some other old stuff, logs sent..
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #84 from Mark Martinec mark.marti...@ijs.si 2009-10-08 06:50:37 PDT ---
> These are all legitimate looking paypal mail delivered to a Yahoo account from mid-2008 through recently.

Thanks Warren for your out-of-band mail.

Apart from some general comments from my previous posting, there is a real problem regarding your method of fetching mail for a Yahoo account.

You are using FetchYahoo to download these messages from the Yahoo webmail interface. FetchYahoo has to jump through hoops to retrieve a message as close to its original form as possible, but there are some real obstacles there. Glancing at its source code, it has to pull attachments separately and splice them back together into a message, necessarily reinventing the MIME boundaries. This is enough to render DomainKeys and DKIM signatures invalid. Apart from this, it also converts QP and base64 encoded messages into UTF-8 binary, which again is a sufficient reason for signature breakage. Moreover, it has to repair some damage to header field folding and empty lines, which are broken either due to bugs in Yahoo HTML rendering (indicated by comments in the FetchYahoo code), or because details are simply lost in the conversion to HTML and back to mail.

This method of fetching mail is bound to cause trouble. It may quite easily cause some other low-level SpamAssassin rules to misfire or to fail to trigger, not just the signature verification failures.
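Editor's note: the signature breakage described above follows from how DKIM covers the message body: the signature carries a hash of the canonicalized body bytes, so any re-encoding changes the hash. The sketch below is a simplified illustration, not FetchYahoo's code and not a full verifier. It assumes SHA-256 and the "simple" body canonicalization from RFC 6376, and the function names and sample body are hypothetical. A real verifier also hashes selected header fields and may use "relaxed" canonicalization.

```python
import base64
import hashlib
import quopri


def simple_body_canon(body: bytes) -> bytes:
    """Approximate RFC 6376 'simple' body canonicalization.

    Trailing empty lines are ignored and the body ends with exactly one CRLF.
    Line endings are first normalized to CRLF (a convenience for this demo).
    """
    body = body.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
    while body.endswith(b"\r\n\r\n"):
        body = body[:-2]
    if not body.endswith(b"\r\n"):
        body += b"\r\n"
    return body


def body_hash(body: bytes) -> str:
    """Base64 SHA-256 digest of the canonicalized body, as carried in a bh= tag."""
    digest = hashlib.sha256(simple_body_canon(body)).digest()
    return base64.b64encode(digest).decode("ascii")


# Body as signed by the sender: quoted-printable encoded (hypothetical sample).
as_signed = b"Caf=C3=A9 receipt: total =E2=82=AC10\r\n"

# Body after a webmail round trip that decodes QP into raw UTF-8 bytes.
as_refetched = quopri.decodestring(as_signed)

print("signed   bh:", body_hash(as_signed))
print("fetched  bh:", body_hash(as_refetched))
print("match:", body_hash(as_signed) == body_hash(as_refetched))
```

The text a human reads is identical in both cases, but the bytes are not, so the body hash no longer matches and verification fails. Extra trailing blank lines, by contrast, are tolerated by this canonicalization.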
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #85 from Warren Togami wtog...@redhat.com 2009-10-08 10:15:55 PDT ---
I guess we have no choice but to drop wt-en6 from the rescore GA. Should I drop it from nightly masscheck as well?
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #86 from Mark Martinec mark.marti...@ijs.si 2009-10-08 10:37:23 PDT ---
> I guess we have no choice but to drop wt-en6 from the rescore GA. Should I drop it from nightly masscheck as well?

I can imagine such a problem could also affect other users, especially those not running SpamAssassin close to their MTA.

I guess we can keep the wt-en6 corpus (and similar ones, if identified), but keep in mind that FP hits on DKIM_ADSP_DISCARD (and possibly on some other rule, if identified) should be disregarded. I already removed the DKIM_ADSP_DISCARD hit from my copy of the wt-en6 log.

If it turns out the undesired mail modifications are more common in submitted corpora, we could perhaps re-run the GA on a subset of logs known not to be suffering from the problem, and just fetch the DKIM_* scores from the results of that run.

The release notes could then say that one should lower the DKIM_ADSP_* scores on installations where it is known that mail is not reaching SpamAssassin in its pristine form (as received by the MTA).
[Bug 6155] generate new scores for 3.3.0 release
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155
--- Comment #87 from Warren Togami wtog...@redhat.com 2009-10-08 13:51:31 PDT ---
(In reply to comment #86)
> The release notes could then say that one should lower the DKIM_ADSP_* scores on installations where it is known that mail is not reaching SpamAssassin in its pristine form (as received by the MTA).

This case or old ham where the sender subsequently changed their DKIM policy is only an issue for masscheck, not production scanning.

Lowering the DKIM scores makes no sense then?