[Bug 6155] generate new scores for 3.3.0 release

2010-01-05 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #185 from Henrik Krohns h...@hege.li 2010-01-05 10:47:51 UTC ---

I have a hunch that FREEMAIL_ENVFROM_END_DIGIT has a bit too high score
(1.553). Probably there wasn't enough nicedude90 ham in corpora. Strangely
FREEMAIL_REPLYTO_END_DIGIT has a lower score, one would think it would be safer
FP wise..

-- 
Configure bugmail: 
https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are the assignee for the bug.


[Bug 6155] generate new scores for 3.3.0 release

2009-12-02 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #180 from Mark Martinec mark.marti...@ijs.si 2009-12-02 07:31:01 
UTC ---
> Mark, please correct me if I am wrong.  But it seems only you can complete the
> final steps since we don't know exactly which subset of data you used.

I'm doing it right now. The config.set* is already checked in, logs are
being transferred, ...



[Bug 6155] generate new scores for 3.3.0 release

2009-12-02 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #181 from Mark Martinec mark.marti...@ijs.si 2009-12-02 10:48:45 
UTC ---
Ok, I think I'm done now (RescoreMassCheck):

5. generate scores for score sets
svn commit -m "runGA config files used" masses/config.set*
  r886173 | mmartinec | 2009-12-02 16:24:32 +0100 (Wed, 02 Dec 2009) | 1 line
  runGA config files used
tar cvf rescore-logs.tar gen-set{0,1,2,3}-*

6. upload the test logs to zone (spamassassin.zones.apache.org):
sudo mkdir /home/corpus-rsync/ARCHIVE/3.3.0
sudo mv rescore-logs.tar.bz2 \
  /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tar.bz2
ls -l /home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tar.bz2
  -rw-r--r--   1 mmartinec other 20380424 Dec  2 18:23
/home/corpus-rsync/ARCHIVE/3.3.0/rescore-logs-bug6155.tar.bz2

6.5. mark evolved-score rules as 'always published'
./force-publish-active-rules ../rules/active.list ../rulesrc/10_force_active.cf
svn commit -m "force publish of rescored rules" ../rulesrc/10_force_active.cf
  r886212 | mmartinec | 2009-12-02 18:33:57 +0100 (Wed, 02 Dec 2009) | 3 lines
  Bug 6155: generated new rulesrc/10_force_active.cf
  as per step 6.5 in RescoreMassCheck

6.6. fix test failures
nothing to tweak, all tests pass

7. upload proposed new scores
done some time ago, some tweaks later:
  r881159 | wtogami | 2009-11-17 06:35:00 +0100 (Tue, 17 Nov 2009) | 2 lines
  Bug #6155 commit raw scores from Comment #146 as documented in #162.
To view the diffs: svn diff -r 881158:886232 rules/50_scores.cf

8. Make the stats files
cp config.set0 config ; bash ./runGA stats
cp config.set1 config ; bash ./runGA stats
cp config.set2 config ; bash ./runGA stats
cp config.set3 config ; bash ./runGA stats

8(.1) upload new stats files
  r886232 | mmartinec | 2009-12-02 19:11:35 +0100 (Wed, 02 Dec 2009) | 2 lines
  rules/STATISTICS-set*.txt
 Attach the new proposed STATISTICS*.txt as a patch to the rescoring bug
too many differences, just do a: svn diff -c886232



[Bug 6155] generate new scores for 3.3.0 release

2009-12-02 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #183 from Warren Togami wtog...@redhat.com 2009-12-02 11:43:16 
UTC ---
Why is active.list (the result of auto-promotion) relevant as input to this
script?  Seems kind of like circular logic that makes no sense.

+ SPAMMY_MIME_BDRY_01

force-publish-active-rules added a few lines like this that have no scores
assigned in rules/50_scores.cf.

It seems what I already did by copying rule names from rules/50_scores.cf into
rulesrc/10_force_active.cf is more correct?

If so, then it appears we are ready for beta if we can clear up the GPG key
issue in Bug #6223.



[Bug 6155] generate new scores for 3.3.0 release

2009-12-01 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #175 from Justin Mason j...@jmason.org 2009-12-01 05:08:47 UTC ---
10_force_active.cf is generated at this step in the RescoreMassCheck process
(see https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155#c3):

6.5. mark evolved-score rules as 'always published'

sounds like we could be missing a few steps if that got missed...



[Bug 6155] generate new scores for 3.3.0 release

2009-12-01 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Mark Thomas ma...@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|ma...@apache.org            |



[Bug 6155] generate new scores for 3.3.0 release

2009-12-01 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #176 from Warren Togami wtog...@redhat.com 2009-12-01 08:50:38 
UTC ---
http://wiki.apache.org/spamassassin/RescoreMassCheck

Mark, did you do these steps?

6. upload the test logs to zone
8. Make the stats files
8. upload new stats files



[Bug 6155] generate new scores for 3.3.0 release

2009-12-01 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #177 from Mark Martinec mark.marti...@ijs.si 2009-12-01 09:17:58 
UTC ---
> Mark, did you do these steps?
> 6. upload the test logs to zone
> 8. Make the stats files
> 8. upload new stats files

No, I left at the '5. generate scores for score sets',
I only attached the score file for considerations.



[Bug 6155] generate new scores for 3.3.0 release

2009-12-01 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #178 from Warren Togami wtog...@redhat.com 2009-12-01 10:28:26 
UTC ---
Mark, it appears that only you can do those steps?



[Bug 6155] generate new scores for 3.3.0 release

2009-11-30 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Mark Thomas ma...@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ma...@apache.org

--- Comment #174 from Mark Thomas ma...@apache.org 2009-11-30 13:40:07 UTC ---
Restoring comment originally made by Mark Martinec

(In reply to comment #171)
> Btw, the:
>   prove xt/10_rule_test_suite.t
> is failing for several rules. Can someone more familiar with rules
> please check where the reported problems lie?

Actually it's just two rules failing on multiple tests:
  FM_FRM_RN_L_BRACK and TVD_SPACE_RATIO.
Luckily their score is zero or near zero:
  score TVD_SPACE_RATIO 0.001
  score FM_FRM_RN_L_BRACK 0

| Changed score of FM_FRM_RN_L_BRACK from 0 into 0.001,
| to make xt/10_rule_test_suite.t happy.
| Sending        rules/50_scores.cf
| Committed revision 884927.

So that leaves the TVD_SPACE_RATIO. Is it something to worry about?



[Bug 6155] generate new scores for 3.3.0 release

2009-11-27 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #173 from Warren Togami wtog...@redhat.com 2009-11-27 09:13:25 
UTC ---
Sending        rulesrc/10_force_active.cf
Transmitting file data .
Committed revision 884912.

Please review.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-27 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #174 from Mark Martinec mark.marti...@ijs.si 2009-11-27 10:03:42 
UTC ---
(In reply to comment #171)
> Btw, the:
>   prove xt/10_rule_test_suite.t
> is failing for several rules. Can someone more familiar with rules
> please check where the reported problems lie?

Actually it's just two rules failing on multiple tests:
  FM_FRM_RN_L_BRACK and TVD_SPACE_RATIO.
Luckily their score is zero or near zero:
  score TVD_SPACE_RATIO 0.001
  score FM_FRM_RN_L_BRACK 0

| Changed score of FM_FRM_RN_L_BRACK from 0 into 0.001,
| to make xt/10_rule_test_suite.t happy.
| Sending        rules/50_scores.cf
| Committed revision 884927.

So that leaves the TVD_SPACE_RATIO. Is it something to worry about?



[Bug 6155] generate new scores for 3.3.0 release

2009-11-26 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #172 from Daryl C. W. O'Shea spamassas...@dostech.ca 2009-11-26 
17:24:49 UTC ---
Warren,

The file was originally used to list all *rules from sandboxes* that had scores
assigned by the GA so that they didn't get auto-demoted leaving a score line
but no rule.

I don't think its use has changed, but I'm not completely up-to-date on the
re-org of the rules source structure.

jm might have a script to generate the file... although it's been a long time.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-23 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #169 from Warren Togami wtog...@redhat.com 2009-11-23 20:08:06 
UTC ---
spamassassin/trunk/rulesrc/10_force_active.cf

It seems this file needs to be updated after the rescoring.  Should all the
rules in 50_scores.cf be listed in 10_force_active.cf?

Even the rules that are zeroed out in 50_scores.cf?



[Bug 6155] generate new scores for 3.3.0 release

2009-11-20 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #168 from Justin Mason j...@jmason.org 2009-11-20 15:10:05 UTC ---
(In reply to comment #167)
> locally, I've lowered the MISSING_HB_SEP score to 0.5
>
> lottsa funky ERP stuff seems to have a talent to FP on it.
> its great for metas but usually triggers scores close to FP with the usual
> suspects & their very ugly HTML formatting.
> (sorry, cannot supply samples)
>
> I'd say 2.5 is sorta high

ok -- I was under the impression it was FP-free.  0.5 works for me in that
case.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-17 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #164 from Mark Martinec mark.marti...@ijs.si 2009-11-17 03:03:22 
UTC ---
> It appears that tests here are failing after commit because rules required by
> this test were zeroed out.  It seems these rules have almost zero hits in
> masscheck.  What should we do about this?

  Bug 6155 #163: force nonzero scores on MISSING_HB_SEP and X_MESSAGE_INFO
  for the test
  Sending t/missing_hb_separator.t
  Committed revision 881240.

I hope this is the right approach. Alternative would be to introduce
a file similar to t/data/01_test_rules.cf to hold score overrides, but
with a name like 51_test_rules.cf to be sorted after the 50_scores.cf.
Btw, is the 01_ in the name intentional, or could the existing file
just be renamed to something like 99_test_rules.cf ?



[Bug 6155] generate new scores for 3.3.0 release

2009-11-17 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #165 from Mark Martinec mark.marti...@ijs.si 2009-11-17 03:18:15 
UTC ---
(In reply to comment #161)
> -score RDNS_NONE 0.1
> -score RDNS_DYNAMIC  0.1
> +# score RDNS_NONE 0 1.1 0 0.7
> +# score RDNS_DYNAMIC  0 0.5 0 0.5

> Doesn't commented out mean 1 point?

It would mean 1 point, if there were no other score lines for these two rules:
score RDNS_DYNAMIC 2.639 0.363 1.663 0.982
score RDNS_NONE    2.399 1.274 1.228 0.793
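
The lookup described above can be sketched in Python. This is only an
illustration, not SpamAssassin's own parser: the function name
effective_scores, the constant DEFAULT_SCORE and the "last active line wins"
behaviour are assumptions made for the example. It assumes an ordinary rule
with no active score line falls back to 1 point, and that a single value
applies to all four score sets.

```python
DEFAULT_SCORE = 1.0  # assumed fallback for an ordinary rule with no active score line

SAMPLE = """\
# score RDNS_NONE 0 1.1 0 0.7
# score RDNS_DYNAMIC  0 0.5 0 0.5
score RDNS_DYNAMIC 2.639 0.363 1.663 0.982
score RDNS_NONE    2.399 1.274 1.228 0.793
"""


def effective_scores(config_text, rule):
    """Return the four per-score-set values for `rule` (hypothetical helper)."""
    scores = None
    for raw in config_text.splitlines():
        # Everything after '#' is a comment, so commented-out score lines vanish here.
        parts = raw.split("#", 1)[0].split()
        if len(parts) < 3 or parts[0] != "score" or parts[1] != rule:
            continue
        values = [float(v) for v in parts[2:]]
        if len(values) == 1:
            values = values * 4  # one value covers all four score sets
        elif len(values) != 4:
            raise ValueError(f"expected 1 or 4 scores for {rule}, got {len(values)}")
        scores = values  # assumption: a later active line overrides an earlier one
    return scores if scores is not None else [DEFAULT_SCORE] * 4


print(effective_scores(SAMPLE, "RDNS_NONE"))
print(effective_scores("# score RDNS_NONE 0 1.1 0 0.7", "RDNS_NONE"))
```

With the four lines quoted in this comment, the commented-out entries are
ignored and the active lines supply the scores; only when no active line
exists does the assumed 1-point default apply.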

> These are supposed to be informational rules according to the comment.
> Is this supposed to become commented out?

Comment 116, 120, 124, 137, 139.
I left it mutable, I think it still makes sense - it's kind of a poor man's
Botnet plugin.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-17 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #166 from Justin Mason j...@jmason.org 2009-11-17 07:41:11 UTC ---
(In reply to comment #164)
> > It appears that tests here are failing after commit because rules required by
> > this test were zeroed out.  It seems these rules have almost zero hits in
> > masscheck.  What should we do about this?
>
>   Bug 6155 #163: force nonzero scores on MISSING_HB_SEP and X_MESSAGE_INFO
>   for the test
>   Sending t/missing_hb_separator.t
>   Committed revision 881240.
>
> I hope this is the right approach. Alternative would be to introduce
> a file similar to t/data/01_test_rules.cf to hold score overrides, but
> with a name like 51_test_rules.cf to be sorted after the 50_scores.cf.
> Btw, is the 01_ in the name intentional, or could the existing file
> just be renamed to something like 99_test_rules.cf ?

X_MESSAGE_INFO can be dropped, but MISSING_HB_SEP should not have been made
mutable; I'd say lock to 2.5.

btw it is to be expected that with less mutability the scores become slightly
less optimal for the rescoring corpus; this always happens.  If scores are
allowed to wander without locking down the unsafe rules, the GA will overfit
to the training data and produce great FP/FN figures, but scores that are risky
for real world usage.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-17 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

AXB alex.ur...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alex.ur...@gmail.com

--- Comment #167 from AXB alex.ur...@gmail.com 2009-11-17 07:56:17 UTC ---
(In reply to comment #166)
> (In reply to comment #164)
> > > It appears that tests here are failing after commit because rules required by
> > > this test were zeroed out.  It seems these rules have almost zero hits in
> > > masscheck.  What should we do about this?
> >
> >   Bug 6155 #163: force nonzero scores on MISSING_HB_SEP and X_MESSAGE_INFO
> >   for the test
> >   Sending t/missing_hb_separator.t
> >   Committed revision 881240.
> >
> > I hope this is the right approach. Alternative would be to introduce
> > a file similar to t/data/01_test_rules.cf to hold score overrides, but
> > with a name like 51_test_rules.cf to be sorted after the 50_scores.cf.
> > Btw, is the 01_ in the name intentional, or could the existing file
> > just be renamed to something like 99_test_rules.cf ?
>
> X_MESSAGE_INFO can be dropped, but MISSING_HB_SEP should not have been made
> mutable; I'd say lock to 2.5.
>
> btw it is to be expected that with less mutability the scores become slightly
> less optimal for the rescoring corpus; this always happens.  If scores are
> allowed to wander without locking down the unsafe rules, the GA will overfit
> to the training data and produce great FP/FN figures, but scores that are risky
> for real world usage.

locally, I've lowered the MISSING_HB_SEP score to 0.5

lottsa funky ERP stuff seems to have a talent to FP on it.
its great for metas but usually triggers scores close to FP with the usual
suspects & their very ugly HTML formatting.
(sorry, cannot supply samples)

I'd say 2.5 is sorta high

Axb



[Bug 6155] generate new scores for 3.3.0 release

2009-11-16 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #159 from Justin Mason j...@jmason.org 2009-11-16 16:27:51 UTC ---
will we go ahead and check in those scores, anyway?  that would allow another
beta (soon).

re: HTML_IMAGE_RATIO_* -- it's very common for that kind of multi-valued set
of rules to wind up with nonintuitive scoring.  This happens from either low
hitrates or hitting alongside other (better) rules.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-16 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #160 from Warren Togami wtog...@redhat.com 2009-11-16 18:28:03 
UTC ---
(In reply to comment #142)
> Seems to me that many / most(?) HABEAS_ACCREDITED_SOI supposedly
> false positives are due to freelotto.com mail. I wonder whether such
> samples are rightfully in the spam* corpora - I'd say yes, but,
> as they say, spam is about consent, not content, and people receiving
> mail from freelotto.com most likely did register once, not realizing
> what they are dealing with. So there was a consent, at least initially.
> It is also about fraud and advertising, so, should one leave such
> mail samples in the spam corpus or not?

Perhaps we should explicitly exclude known sketchy senders like freelotto.com
from HABEAS_ACCREDITED_SOI.  This would allow us to more easily monitor for
clear violators by not being distracted by the common FP's.  Exclusion in this
case only brings the listed back to neutral which is pretty clearly a good
idea.

Any objections?  Otherwise I'll file a separate bug for this.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-16 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #161 from Warren Togami wtog...@redhat.com 2009-11-16 19:27:50 
UTC ---
-score RDNS_NONE 0.1
-score RDNS_DYNAMIC  0.1
+# score RDNS_NONE 0 1.1 0 0.7
+# score RDNS_DYNAMIC  0 0.5 0 0.5

These are supposed to be informational rules according to the comment.  Is this
supposed to become commented out?  Doesn't commented out mean 1 point?



[Bug 6155] generate new scores for 3.3.0 release

2009-11-16 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #162 from Warren Togami wtog...@redhat.com 2009-11-16 21:28:44 
UTC ---
fp-fn-statistics across the entire rescore logs.

Set 3 Before
===
# SUMMARY for threshold 5.0:
# Correctly non-spam: 703647  99.90%
# Correctly spam: 2559525  98.28%
# False positives:   719  0.10%
# False negatives: 44795  1.72%
# TCR(l=50): 32.253638  SpamRecall: 98.280%  SpamPrec: 99.972%

Set 3 Raw Rescoring from Comment #146
==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 703520  99.88%
# Correctly spam: 2548134  97.84%
# False positives:   846  0.12%
# False negatives: 56186  2.16%
# TCR(l=50): 26.443555  SpamRecall: 97.843%  SpamPrec: 99.967%

Doesn't look like an improvement.

Set 3 + Rescore + Reductions
==
# SUMMARY for threshold 5.0:
# Correctly non-spam: 704002  99.95%
# Correctly spam: 2558896  98.26%
# False positives:   364  0.05%
# False negatives: 45424  1.74%
# TCR(l=50): 40.932981  SpamRecall: 98.256%  SpamPrec: 99.986%

Looks like a statistically insignificant improvement over the old scores.  I
only hope our corpora were sufficiently varied.
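
The derived figures in the three summaries above can be recomputed from the
four raw counts. The sketch below is an illustration in Python, not the
fp-fn-statistics script itself; the function name fp_fn_summary and the
dictionary keys are made up for the example. It assumes TCR(l=50) is the
number of spam messages divided by (50 * false positives + false negatives),
which is consistent with the values printed above.

```python
def fp_fn_summary(ham_ok, spam_ok, fp, fn, lam=50):
    """Derive the SUMMARY percentages and TCR from raw counts (hypothetical helper)."""
    n_ham = ham_ok + fp      # all ham messages in the logs
    n_spam = spam_ok + fn    # all spam messages in the logs
    return {
        "fp_pct": 100.0 * fp / n_ham,
        "fn_pct": 100.0 * fn / n_spam,
        "recall_pct": 100.0 * spam_ok / n_spam,
        "precision_pct": 100.0 * spam_ok / (spam_ok + fp),
        # Total cost ratio: one false positive is weighted like `lam` false negatives.
        "tcr": n_spam / (lam * fp + fn),
    }


# Counts copied from the three "SUMMARY for threshold 5.0" blocks above.
RUNS = {
    "Set 3 before": (703647, 2559525, 719, 44795),
    "Set 3 raw rescoring": (703520, 2548134, 846, 56186),
    "Set 3 rescore + reductions": (704002, 2558896, 364, 45424),
}

for name, counts in RUNS.items():
    s = fp_fn_summary(*counts)
    print(f"{name}: FP {s['fp_pct']:.2f}%  FN {s['fn_pct']:.2f}%  "
          f"TCR {s['tcr']:.6f}  recall {s['recall_pct']:.3f}%")
```

Because a false positive counts 50 times as much as a false negative in this
measure, the drop from 719 to 364 false positives is what moves the TCR most
between the first and third run.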

Rules Made Informational
==
TVD_RCVD_SPACE_BRACKET
MISSING_MIME_HB_SEP
FUZZY_CPILL
X_IP Bug #5920 appears not fixed as claimed.
FRT_SOMA2
CTYPE_001C_B
MIME_BASE64_BLANKS
WEIRD_QUOTING
SPF_HELO_FAIL
HTML_IMAGE_RATIO_06
HTML_IMAGE_RATIO_08

Other Changes

* EXTRA_MPART_TYPE was left as 1.0 because while it does relatively poorly in
the weekly masscheck, it did far better in rescore masscheck.
* I am increasing the scores of PSBL *after* the above fp-fn-statistics run
because the old logs do not reflect its current safety level.

I am committing these changes now.  I suspect the key to these reductions is
getting rid of the rules that wouldn't have passed our ruleqa auto-promotion
criteria?  There might be additional tweaks to make.  Please comment here.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-16 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #163 from Warren Togami wtog...@redhat.com 2009-11-16 22:58:57 
UTC ---
http://hudson.zones.apache.org/hudson/job/SpamAssassin-trunk/4344/testReport/
-score MISSING_HB_SEP 2.5
+# score MISSING_HB_SEP 2.5
+score MISSING_HB_SEP 0 # n=0 n=1 n=2

-score X_MESSAGE_INFO 3.499 3.496 3.330 1.597
+score X_MESSAGE_INFO 0 # n=0 n=1 n=2 n=3

It appears that tests here are failing after commit because rules required by
this test were zeroed out.  It seems these rules have almost zero hits in
masscheck.  What should we do about this?



[Bug 6155] generate new scores for 3.3.0 release

2009-11-12 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #157 from Warren Togami wtog...@redhat.com 2009-11-12 10:07:55 
UTC ---
TVD_RCVD_SPACE_BRACKET
MISSING_MIME_HB_SEP
FUZZY_CPILL
X_IP Bug #5920 appears not fixed as claimed.
FRT_SOMA2
CTYPE_001C_B
MIME_BASE64_BLANKS
WEIRD_QUOTING
SPF_HELO_FAIL
EXTRA_MPART_TYPE

It appears to be correct to zero out these rules, or at least make them
informational.

spamassassin-3.2.5
score HTML_IMAGE_RATIO_02 1.518 0.550 0.573 0.383
score HTML_IMAGE_RATIO_04 1.561 0.170 0.863 0.172
score HTML_IMAGE_RATIO_06 0.401 0.001 0.501 0.001
score HTML_IMAGE_RATIO_08 0.203 0.001 0.179 0.001

attachment 4565
resulting 50_scores.cf from garescorer runs - V5
score HTML_IMAGE_RATIO_02 2.199 0.805 1.200 0.437
score HTML_IMAGE_RATIO_04 2.089 0.610 0.607 0.556
score HTML_IMAGE_RATIO_06 1.799 0.579 0.901 0.882
score HTML_IMAGE_RATIO_08 1.410 0.351 0.874 0.021

The old scores showed a more linear relationship, with a sharp drop-off between
_04 and _06.  Our masscheck results indicate _02 and _04 hit on more spam than
ham, but _06 and _08 are pretty worthless.  I think we should zero out _06 and
_08 while reducing the scores of _02 and _04.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-12 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #158 from Adam Katz antis...@khopis.com 2009-11-12 16:20:15 UTC 
---
(In reply to comment #157)
> spamassassin-3.2.5
> score HTML_IMAGE_RATIO_02 1.518 0.550 0.573 0.383
> score HTML_IMAGE_RATIO_04 1.561 0.170 0.863 0.172
> score HTML_IMAGE_RATIO_06 0.401 0.001 0.501 0.001
> score HTML_IMAGE_RATIO_08 0.203 0.001 0.179 0.001
>
> attachment 4565 [details]
> resulting 50_scores.cf from garescorer runs - V5
> score HTML_IMAGE_RATIO_02 2.199 0.805 1.200 0.437
> score HTML_IMAGE_RATIO_04 2.089 0.610 0.607 0.556
> score HTML_IMAGE_RATIO_06 1.799 0.579 0.901 0.882
> score HTML_IMAGE_RATIO_08 1.410 0.351 0.874 0.021
>
> The old scores showed a more linear relationship, with a sharp drop-off
> between _04 and _06.  Our masscheck results indicate _02 and _04 hit on
> more spam than ham, but _06 and _08 are pretty worthless.  I think we
> should zero out _06 and _08 while reducing the scores of _02 and _04.

I didn't mention _08 because it wasn't a remarkable enough margin of HAM > SPAM
(my script only reports if HAM% + 0.05  SPAM%) and my hand-sampling utilized
S/O ratios under .250 while this rule is .320.  Still, it has the problem:

SPAM%   HAM%    S/O    RANK  SCORE NAME                DateRev
0.2709  0.5491  0.330  0.34  0.20  HTML_IMAGE_RATIO_08 2009-r834803-n
0.2717  0.5492  0.331  0.34  0.20  HTML_IMAGE_RATIO_08 20091110-r834389-n
0.2672  0.5493  0.327  0.34  0.20  HTML_IMAGE_RATIO_08 20091109-r833997-n
0.2075  0.4995  0.294  0.34  0.20  HTML_IMAGE_RATIO_08 20091104-r832683-n
0.2548  0.5476  0.318  0.34  0.20  HTML_IMAGE_RATIO_08 20091028-r830464-n

Here are the results from the 2009-r834803-n set, pruning only rules
scoring under 0.2 (all hits from my last report are present and asterisked):

 S/O RANK HAM%    SPAM%   Score in attachment 4565 Rule
.014 .15  0.6328  0.0093  0.001 0.001 0.131 0.700  TVD_RCVD_SPACE_BRACKET*
.015 .24  0.1927  0.0029  0.000 2.099 0.001 1.711  MISSING_MIME_HB_SEP*
.019 .22  0.2528  0.0049  1.482 0.855 2.399 2.399  FUZZY_CPILL*
.043 .29  0.1298  0.0059  0.001 1.699 1.498 1.699  X_IP*
.075 .35  0.0603  0.0049  0.000 0.001 0.308 0.001  HTML_NONELEMENT_30_40
.092 .21  0.8123  0.0825  0.699 0.332 0.480 0.800  MIME_BASE64_BLANKS*
.106 .25  0.2483  0.0293  0.551 1.026 1.033 1.250  CTYPE_001C_B*
.123 .33  0.0837  0.0117  0.001 0.648 0.836 1.293  TVD_FW_GRAPHIC_NAME_LONG
.123 .28  0.1632  0.0229  0.001 2.499 0.392 0.164  DRUGS_MUSCLE(*)
.130 .25  0.3663  0.0547  2.385 0.345 0.998 2.503  FRT_SOMA2*
.155 .29  0.1736  0.0317  0.001 0.001 0.001 1.741  MIME_BASE64_TEXT
.188 .27  0.4622  0.1069  0 0.973 0 2.385  SPF_HELO_FAIL*
.214 .31  0.1449  0.0395  2.200 2.199 0.540 2.199  WEIRD_QUOTING*
.239 .30  0.8321  0.2612  1.799 0.579 0.901 0.882  HTML_IMAGE_RATIO_06*
.254 .34  1.3070  0.4442  1.0  EXTRA_MPART_TYPE*
.330 .34  0.5491  0.2709  1.410 0.351 0.874 0.021  HTML_IMAGE_RATIO_08
.363 .38  1.0856  0.6194  2.600 2.070 1.233 3.405  DATE_IN_PAST_96_XX
.368 .36  0.3029  0.1767  0.001 0.791 0.001 0.008  UPPERCASE_50_75
.381 .37  0.6473  0.3983  0.354 0.001 0.725 0.428  MIME_HTML_MOSTLY
.660 .51  1.8514  3.5893  0.518 1.625 1.197 1.506  SUBJ_ALL_CAPS
.905 .58  1.0822 10.2987  0 1.246 0 1.347  RCVD_IN_BL_SPAMCOP_NET
.934 .56  3.6172 51.2001  2.199 1.105 1.199 0.723  MIME_HTML_ONLY
.957 .52  2.2200 50.3063  2.399 1.274 1.228 0.793  RDNS_NONE

DRUGS_MUSCLE met all the requirements I set for my last report, but I removed
it because it had almost no hits anyway, and it scored very very low except on
net+no-bayes, so I was assuming it had some justification there somehow.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-11 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #154 from Warren Togami wtog...@redhat.com 2009-11-11 11:38:13 
UTC ---
(In reply to comment #152)
 
> | Please manually adjust the scores of RCVD_IN_PSBL up.  At the time of the
> | rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a
> | number of major ISP's.  As a result, for 5 weeks straight RCVD_IN_PSBL has
> | been almost completely devoid of FP's in our weekly masschecks.  I am
> | confident that PSBL performs safer than measured during the rescore masscheck
>
> Ok, I suggest we collect some manual fixes like the ones suggested here
> (with specific score suggestions), and wrap it up.

Let's just go ahead with committing as jm suggested in Comment #153 and make
the manual adjustments after that in separate commits each with explanations.

RCVD_IN_PSBL I suggest 2.7 for both network sets.

Adam Katz in Comment #153 makes a good argument for reducing those rules to
informational.  Any comments on that?



[Bug 6155] generate new scores for 3.3.0 release

2009-11-09 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Adam Katz antis...@khopis.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #4564|0                           |1
        is obsolete|                            |

--- Comment #153 from Adam Katz antis...@khopis.com 2009-11-09 15:40:31 UTC 
---
Created an attachment (id=4568)
 -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4568)
Checker for rules that match more ham than spam

Collected selections from several more runs of my script.  I took the last
three days' worth of masschecks plus the run last week, hand-picked rules with
a high score (~1.0+) but low S/O (~0.250-), and then looked for repeat
offenders.  This is the list, with each rule's worst S/O of any run:

 S/O RANK HAM%    SPAM%   Score attachment 4565    Rule
.002 .14  1.2650  0.0024  0.001 0.001 0.131 0.700  TVD_RCVD_SPACE_BRACKET
.002 .23  0.4472  0.0008  0.000 2.099 0.001 1.711  MISSING_MIME_HB_SEP
.019 .22  0.2529  0.0049  1.482 0.855 2.399 2.399  FUZZY_CPILL
.019 .29  0.2809  0.0056  0.001 1.699 1.498 1.699  X_IP
.046 .22  0.4010  0.0193  2.385 0.345 0.998 2.503  FRT_SOMA2
.077 .25  0.2643  0.0221  0.551 1.026 1.033 1.250  CTYPE_001C_B
.092 .21  0.8712  0.0878  0.699 0.332 0.480 0.800  MIME_BASE64_BLANKS
.095 .31  0.2735  0.0286  2.200 2.199 0.540 2.199  WEIRD_QUOTING
.178 .28  0.4948  0.1069  0 0.973 0 2.385  SPF_HELO_FAIL
.195 .29  0.8975  0.2173  1.799 0.579 0.901 0.882  HTML_IMAGE_RATIO_06
.241 .34  1.4248  0.4529  1.0  EXTRA_MPART_TYPE
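
The selection described above (high score, low S/O) can be sketched in Python.
This is not the script from attachment 4568; the names s_o and ham_heavy and
the exact cut-offs (a maximum score of at least 1.0, an S/O of at most 0.25)
are assumptions for illustration. S/O is computed here as
SPAM% / (SPAM% + HAM%), which reproduces the S/O column of the table.

```python
def s_o(spam_pct, ham_pct):
    """Spam-over-total ratio: near 1.0 means mostly spam hits, near 0.0 mostly ham."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0


def ham_heavy(rules, min_score=1.0, max_so=0.25):
    """Flag rules that carry a sizeable score yet hit mostly ham (hypothetical filter)."""
    flagged = []
    for name, ham_pct, spam_pct, scores in rules:
        ratio = s_o(spam_pct, ham_pct)
        if max(scores) >= min_score and ratio <= max_so:
            flagged.append((round(ratio, 3), name))
    return sorted(flagged)


# (rule, HAM%, SPAM%, scores): the first three rows come from the table above,
# MIME_HTML_ONLY from the table in Comment #158.
RULES = [
    ("FUZZY_CPILL", 0.2529, 0.0049, [1.482, 0.855, 2.399, 2.399]),
    ("WEIRD_QUOTING", 0.2735, 0.0286, [2.200, 2.199, 0.540, 2.199]),
    ("EXTRA_MPART_TYPE", 1.4248, 0.4529, [1.0]),
    ("MIME_HTML_ONLY", 3.6172, 51.2001, [2.199, 1.105, 1.199, 0.723]),
]

for ratio, name in ham_heavy(RULES):
    print(f"{ratio:.3f}  {name}")
```

With a strict 1.0 cut-off, rules such as TVD_RCVD_SPACE_BRACKET (maximum score
0.700) would not be flagged, so the "~1.0+" in the description was evidently
applied loosely by hand.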

I don't think it wise to release with these scores quite so high.  I propose we
score them all 0.1 or 0.001 so as to not hold up the release and bookmark the
issue (likely a bug in the GA, probably best registered as its own bugzilla
bug) for dealing with later.


Additionally, I've updated my script to do the reverse - seek out negatively
scored rules that hit more spam than ham.  This doesn't currently find anything
beyond SPF_PASS (due to having >=1% spam hits, while it was previously found
for having ham<spam), but it does prevent listing SPF_HELO_PASS and
theoretically will help find poorly-written ham rules in the future.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-08 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #152 from Mark Martinec mark.marti...@ijs.si 2009-11-08 16:36:24 
UTC ---
  A new run, this time I left the URIBL whitelists and similar fixed
  (at their relatively high manual scores) as they were in current
  50_scores.cf

Or to say it better: unlike my previous runs where I commented out most
scores in the existing 50_scores.cf (thus making them mutable, regardless
of a gen:mutable markup) except for a couple of exceptions, this time
I did not comment-out scores, and let gen:mutable markup do its job.
So this is now more like how it was intended to run GA.

 After a little examination, they look good to me!  +1 to check in.

Thanks. I'm sure we can still do some manual tweaks and improvements,
but perhaps we can indeed freeze the rest to automatically assigned scores
in this run.

 btw if you feel like cranking up the max gens, go for it.  fwiw,
 spamassassin2.zones has a very powerful CPU -- if it's taking too long
 on your own machine, try scping stuff up and running it there.

My office workstation is quite beefy too, and I hope we won't need to do
many further runs, so for now I'd just stick to what I'm familiar with.
Btw, my set3 run at 14000 iterations takes 5 hours, similar for set1, the
other two are much faster (less than 30 minutes each). I just let it run
overnight, so it wouldn't matter if it takes half that time. I did some
previous runs at 3 iterations, and a diagram (like the one attached
earlier) does not show noticeable improvements beyond about 1, or even
small worsening by the end, so the 14000 limit seems reasonable. And the
GA algorithms are said to be prone to overfitting, so it's probably prudent
not to go too far.



 RCVD_IN_XBL is still surprisingly low -- I bet there's some additive
 behaviour overlapping between XBL and PBL, though.
 RCVD_IN_SBL is _very_ low in set 3 too, bizarre!
 otherwise I can't see any issues

| Please manually adjust the scores of RCVD_IN_PSBL up.  At the time of the
| rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a
| number of major ISP's.  As a result, for 5 weeks straight RCVD_IN_PSBL has
| been almost completely devoid of FP's in our weekly masschecks.  I am
| confident that PSBL performs safer than measured during the rescore masscheck

Ok, I suggest we collect some manual fixes like the ones suggested here
(with specific score suggestions), and wrap it up.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-07 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #150 from Justin Mason j...@jmason.org 2009-11-07 13:33:19 UTC ---
(In reply to comment #146)
 Created an attachment (id=4565)
 -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4565) [details]
 resulting 50_scores.cf from garescorer runs - V5
 
 A new run, this time I left the URIBL whitelists and similar fixed
 (at their relatively high manual scores) as they were in current 50_scores.cf

After a little examination, they look good to me!  +1 to check in.

RCVD_IN_XBL is still surprisingly low -- I bet there's some additive behaviour
overlapping between XBL and PBL, though.  

RCVD_IN_SBL is _very_ low in set 3 too, bizarre!

otherwise I can't see any issues



btw if you feel like cranking up the max gens, go for it.  fwiw,
spamassassin2.zones has a very powerful CPU -- if it's taking too long on your
own machine, try scping stuff up and running it there.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-07 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #151 from Warren Togami wtog...@redhat.com 2009-11-07 15:46:54 
UTC ---
Please manually adjust the scores of RCVD_IN_PSBL up.  At the time of the
rescore masscheck PSBL had not yet whitelisted hotmail, yahoo, gmail and a
number of major ISP's.  As a result, for 5 weeks straight RCVD_IN_PSBL has been
almost completely devoid of FP's in our weekly masschecks.  I am confident that
PSBL performs safer than measured during the rescore masscheck.

http://ruleqa.spamassassin.org/20090829-r809102-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090905-r811608-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090912-r814117-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090919-r816871-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20090926-r819101-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091003-r821273-n/RCVD_IN_PSBL/detail
(below this point FP rate dropped to nearly zero)
http://ruleqa.spamassassin.org/20091010-r823821-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091017-r826198-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091024-r829323-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091031-r831520-n/RCVD_IN_PSBL/detail
http://ruleqa.spamassassin.org/20091107-r833654-n/RCVD_IN_PSBL/detail
You can plainly see steady and sustained improvement in FP safety in these past
weeks.

RCVD_IN_PSBL in the rescore masscheck was without lastexternal.  Clearly with
the added limitation of lastexternal it is safer than measured.



[Bug 6155] generate new scores for 3.3.0 release

2009-11-04 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Adam Katz antis...@khopis.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #4561|0                           |1
        is obsolete|                            |

--- Comment #145 from Adam Katz antis...@khopis.com 2009-11-04 15:52:15 UTC 
---
Created an attachment (id=4564)
 -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4564)
Checker for rules that match more ham than spam

Updated my checker to use S/O (now that I understand that stat).  It also
supports specifying the DateRev for the specific masscheck run.  Since today's
run was sparse, here are yesterday's results.

$ ./sa33badrules.pl 20091103-r832343-n
 S/O  RANK     HAM%    SPAM%  Score in attachment 4558  Rule
.008  .12    1.2401   0.0105  0.001                     MSGID_MULTIPLE_AT
.011  .22    0.3066   0.0035  0                         OBSCURED_EMAIL
.012  .25    0.2058   0.0025  0.000 2.099 0.001 1.212   MISSING_MIME_HB_SEP
.014  .17    0.5822   0.0080  0.001 0.001 0.699 0.699   TVD_RCVD_SPACE_BRACKET
.028  .20    0.4339   0.0125  unknown                   TVD_FUZZY_SECTOR
.042  .28    0.1732   0.0075  0                         SUBJECT_FUZZY_TION
.048  .77    4.4862   0.2279  -0.001                    SPF_HELO_PASS
.052  .29    0.1476   0.0080  1.494 1.699 1.591 1.516   X_IP
.055  .22    0.3914   0.0226  2.205 0.174 1.299 1.806   FRT_SOMA2
.062  .74    5.1484   0.3424  -0.001                    SPF_PASS
.077  .25    0.2643   0.0221  0.987 0.750 0.943 1.318   CTYPE_001C_B
.079  .36    0.0640   0.0055  0.001 0.001 0.605 0.378   HTML_NONELEMENT_30_40
.080  .28    0.1742   0.0151  0.001 2.499 0.268 0.516   DRUGS_MUSCLE
.084  .36    0.0660   0.0060  0                         FORGED_IMS_TAGS
.090  .32    0.1114   0.0110  0.033 0.001 0.365 0.413   WEIRD_PORT
.092  .21    0.8712   0.0878  1.499 0.419 0.904 0.798   MIME_BASE64_BLANKS
.102  .37    0.0577   0.0065  0                         HTML_IFRAME_SRC
.123  .34    0.0821   0.0115  0.003 0.978 0.100 1.515   TVD_FW_GRAPHIC_NAME_LONG
.128  .37    0.0614   0.0090  0                         RCVD_BAD_ID
.130  .29    0.1851   0.0276  0.001 0.020 0.001 1.799   MIME_BASE64_TEXT
.178  .28    0.4948   0.1069  0 1.200 0 2.514           SPF_HELO_FAIL
.202  .32    0.1590   0.0402  0.1                       ANY_BOUNCE_MESSAGE
.205  .35    0.0817   0.0211  2.199 1.622 2.199 1.086   LONGWORDS
.213  .34    0.1186   0.0321  0                         BLANK_LINES_80_90
.216  .32    0.1474   0.0407  2.199 2.199 1.246 2.090   WEIRD_QUOTING
.218  .32    0.1445   0.0402  0.1                       BOUNCE_MESSAGE
.223  .30    0.7605   0.2179  1.799 0.572 1.182 1.138   HTML_IMAGE_RATIO_06
.241  .34    1.3973   0.4438  1.0                       EXTRA_MPART_TYPE
.254  .34    0.1222   0.0417  0.001 2.185 1.936 0.476   FRT_SOMA
.283  .33    0.6883   0.2711  0.539 0.001 0.332 0.488   MIME_HTML_MOSTLY
.299  .36    0.0908   0.0387  0.799 0.001 0.711 0.026   TVD_FW_GRAPHIC_NAME_MID
.303  .34    0.4938   0.2143  1.899 0.496 0.950 0.445   HTML_IMAGE_RATIO_08
.367  .40    1.2775   0.7409  0.001                     TVD_SPACE_RATIO
.379  .37    0.3182   0.1943  0.023 0.887 0.000 0.417   UPPERCASE_50_75
.434  .39    0.3261   0.2505  3.099 1.823 1.802 1.998   BAD_ENC_HEADER
.436  .46   15.3798  11.8920  0.001                     FREEMAIL_FROM
.454  .41    0.5503   0.4573  2.260 0.742 1.199 0.640   MPART_ALT_DIFF
.516  .47    3.6581   3.9024  0.001                     MIME_QP_LONG_LINE
.655  .51    1.9537   3.7036  1.154 1.677 1.198 1.453   SUBJ_ALL_CAPS
.665  .49   42.2269  83.7383  0.001                     HTML_MESSAGE
.692  .52    1.1850   2.6580  0.001                     UNPARSEABLE_RELAY
.922  .58    1.1584  13.7423  0 1.322 0 1.237           RCVD_IN_BL_SPAMCOP_NET
.935  .57    3.5421  50.6034  2.199 0.955 1.215 0.549   MIME_HTML_ONLY
.970  .52    1.5729  51.1430  0 1.1 0 0.7               RDNS_NONE

Note, I hacked RDNS_NONE so that it removes the Enron hits.

Problem rules this week include X_IP, EXTRA_MPART_TYPE, FRT_SOMA2, and
BAD_ENC_HEADER (scored 3.099?!).

Food for thought:  while it's good to create workarounds for the problematic
outcomes from the genetic algorithm, I think that these should be examples with
which to troubleshoot the algorithm itself.  While this might just be an early
sign of over-fitting (which is largely fine as long as we comb through the
results with scripts like this), it might also be indicative of a problem in
the system's prioritization.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-29 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #144 from Warren Togami wtog...@redhat.com 2009-10-29 18:33:38 
UTC ---
What is the next step in order to move forward?



[Bug 6155] generate new scores for 3.3.0 release

2009-10-28 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #141 from Mark Martinec mark.marti...@ijs.si 2009-10-28 09:02:40 
UTC ---
 But I agree that more may need re-fixing.
 
 cool.
 In particular, some of the DNSBLs and most of the DNSWLs are good to 'lock
 down', I feel, as users tend to 'compensate' or correct their scores more
 frequently than other rules -- in my opinion.  Also, if those are given low
 scores by the GA, their operators tend to be annoyed, and it's not good to
 annoy people who we're relying on ;)
 
 It also reflects that those rules are slightly different, and hopefully 
 more reliable, than a typical body rule for example -- there's no way to
 indicate this to the GA yet, so locking the rules is as good as we can do.

| It is quite possible that some of these hits are still false positives,
| despite several iterations of cleaning

I wonder how much the low score for some ham rules is affected by false
positives present in the spam* corpora. Here are some statistics for
the more prominent ham rules (i.e. the ones with negative scores).

For each rule the table shows a number of hits of this rule for each
corpus - both as a percentage of all entries in a file, and as absolute
counts. The entries standing out from the crowd that may need re-checking
are labeled with *** :

score ALL_TRUSTED -1.000
  0.046 %          1/2194  spam-bayes-net-bb-kmcgrail
  0.017 %         4/23761  spam-bayes-net-mmartinec
  0.014 %         5/36941  spam-bayes-net-hege
  0.001 %         1/81265  spam-bayes-net-bluestreak
  0.000 %        1/931863  spam-bayes-net-dos

score BAYES_00  0 0 -1.2 -1.9
  5.652 %        104/1840  spam-bayes-net-bb-jhardin  ***
  1.805 %       429/23761  spam-bayes-net-mmartinec
  1.606 %         33/2055  spam-bayes-net-ahenry
  0.439 %       357/81265  spam-bayes-net-bluestreak
  0.374 %       138/36941  spam-bayes-net-hege
  0.030 %     445/1489699  spam-bayes-net-jm
  0.017 %      156/931863  spam-bayes-net-dos

score DCC_REPUT_00_12  0 -0.8 0 -0.4
  0.164 %        39/23761  spam-bayes-net-mmartinec

score HABEAS_ACCREDITED_SOI 0 -1.634 0 -0.475
  5.382 %         76/1412  spam-bayes-net-bb-guenther_fraud  ***
  0.272 %          5/1840  spam-bayes-net-bb-jhardin
  0.091 %          2/2194  spam-bayes-net-bb-kmcgrail
  0.059 %        14/23761  spam-bayes-net-mmartinec
  0.049 %        18/36941  spam-bayes-net-hege
  0.037 %     558/1489699  spam-bayes-net-jm
  0.030 %          2/6728  spam-bayes-net-wt-en1
  0.018 %        15/81265  spam-bayes-net-bluestreak
  0.000 %        1/931863  spam-bayes-net-dos

score RCVD_IN_DNSWL_HI  0 -1.8 0 -1.8
  0.163 %          3/1840  spam-bayes-net-bb-jhardin  ***
  0.091 %          2/2194  spam-bayes-net-bb-kmcgrail
  0.071 %          1/1412  spam-bayes-net-bb-guenther_fraud
  0.003 %         1/36941  spam-bayes-net-hege
  0.000 %       1/1489699  spam-bayes-net-jm

score RCVD_IN_DNSWL_MED  0 -1.5 0 -1.2
  1.250 %         23/1840  spam-bayes-net-bb-jhardin  ***
 (1.108 %           7/632  spam-bayes-net-binnocenti.OFF)
  0.638 %         14/2194  spam-bayes-net-bb-kmcgrail
  0.469 %       381/81265  spam-bayes-net-bluestreak
  0.438 %          9/2055  spam-bayes-net-ahenry
  0.223 %         15/6728  spam-bayes-net-wt-en1
  0.214 %        79/36941  spam-bayes-net-hege
  0.046 %     682/1489699  spam-bayes-net-jm
  0.042 %          3/7185  spam-bayes-net-bb-zmi
  0.013 %         3/23761  spam-bayes-net-mmartinec
  0.010 %         2/19160  spam-bayes-net-wt-en4
  0.003 %       29/931863  spam-bayes-net-dos

score RCVD_IN_DNSWL_LOW  0 -0.6 0 -1.1
 16.153 %  240627/1489699  spam-bayes-net-jm  ***
 (9.810 %          62/632  spam-bayes-net-binnocenti.OFF)
  1.739 %         32/1840  spam-bayes-net-bb-jhardin
  1.600 %       591/36941  spam-bayes-net-hege
  1.159 %         78/6728  spam-bayes-net-wt-en1
  1.133 %         16/1412  spam-bayes-net-bb-guenther_fraud
  0.925 %         19/2055  spam-bayes-net-ahenry
  0.365 %          8/2194  spam-bayes-net-bb-kmcgrail
  0.107 %        87/81265  spam-bayes-net-bluestreak
  0.097 %          7/7185  spam-bayes-net-bb-zmi
  0.022 %      201/931863  spam-bayes-net-dos
  0.021 %         5/23761  spam-bayes-net-mmartinec
  0.016 %         3/19160  spam-bayes-net-wt-en4

score RCVD_IN_BSP_TRUSTED 0 -0.001 0 -0.001
  5.312 %         75/1412  spam-bayes-net-bb-guenther_fraud  ***
  0.030 %          2/6728  spam-bayes-net-wt-en1
  0.029 %         7/23761  spam-bayes-net-mmartinec
  0.029 %     435/1489699  spam-bayes-net-jm
  0.015 %        12/81265  spam-bayes-net-bluestreak
  0.003 %         1/36941  spam-bayes-net-hege
  0.001 %       11/931863  spam-bayes-net-dos

score RCVD_IN_IADB_DK 0 -0.044 0 -0.001
  0.059 %          4/6728  spam-bayes-net-wt-en1
  0.054 %          1/1840  spam-bayes-net-bb-jhardin
  0.033 %        27/81265  spam-bayes-net-bluestreak
  0.004 %         1/23761  spam-bayes-net-mmartinec
  0.001 %      21/1489699  spam-bayes-net-jm

score RCVD_IN_IADB_RDNS 0 -0.018 0 -0.001
  0.342 %         23/6728  spam-bayes-net-wt-en1  ***
  0.054 %          1/1840  spam-bayes-net-bb-jhardin
  0.049 %          1/2055  spam-bayes-net-ahenry
  0.033 %        27/81265  spam-bayes-net-bluestreak
  0.004 %         1/23761  spam-bayes-net-mmartinec
  0.002 %      26/1489699  spam-bayes-net-jm

score RCVD_IN_IADB_OPTIN 0 -3.265 0 -2.791
  0.342 %         23/6728  spam-bayes-net-wt-en1  ***
  0.049 %          1/2055  spam-bayes-net-ahenry
  0.000 %       4/1489699  spam-bayes-net-jm

score 

[Bug 6155] generate new scores for 3.3.0 release

2009-10-28 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #142 from Mark Martinec mark.marti...@ijs.si 2009-10-28 10:23:19 
UTC ---
Seems to me that many / most(?) HABEAS_ACCREDITED_SOI supposedly
false positives are due to freelotto.com mail. I wonder whether such
samples are rightfully in the spam* corpora - I'd say yes, but,
as they say, spam is about consent, not content, and people receiving
mail from freelotto.com most likely did register once, not realizing
what they are dealing with. So there was a consent, at least initially.
It is also about fraud and advertising, so, should one leave such
mail samples in the spam corpus or not?



[Bug 6155] generate new scores for 3.3.0 release

2009-10-28 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #143 from Mark Martinec mark.marti...@ijs.si 2009-10-28 10:41:31 
UTC ---
 Seems to me that many / most(?) HABEAS_ACCREDITED_SOI supposedly
 false positives are due to freelotto.com mail.

Same applies to RCVD_IN_BSP_TRUSTED spam hits.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-27 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #136 from Justin Mason j...@jmason.org 2009-10-27 07:09:36 UTC ---
(In reply to comment #133)
 it looks like there might be a bit of a problem there -- definitely some rules
 that are in immutable sections, like the above, have been allowed to be
 mutable in ranges.data

just wondering, Mark, did you do this deliberately?  or is it just a bug in the
tool that it's ignoring the non-mutable flag for those rules for some reason?



[Bug 6155] generate new scores for 3.3.0 release

2009-10-27 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #137 from Mark Martinec mark.marti...@ijs.si 2009-10-27 14:18:14 
UTC ---
  it looks like there might be a bit of a problem there -- definitely some
  rules that are in immutable sections, like the above, have been allowed
  to be mutable in ranges.data
 
 just wondering, Mark, did you do this deliberately?  or is it just a bug
 in the tool that it's ignoring the non-mutable flag for those rules for
 some reason?

Sort-of deliberately. Initially I followed the idea in wiki RescoreMassCheck
section 4.2: 'comment out all score lines except for rules that you think
the scores are accurate like carefully-vetted net rules, or 0.001 informational
rules' which made perfect sense to me, so I did it for 50_scores.cf, except
for a couple of rather obvious rules like _WHITELIST and similar, and the ones
clearly indicated as 'indicators' only in the surrounding comments, or set to
0.001. Later I nailed a couple more. I followed a principle: when in doubt,
leave it floating, it can be fixed later if necessary. It gives some insight
into what GA 'thinks' about certain rules.
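
To make the "comment out everything except what you trust" step concrete,
here is a rough Python sketch. It is not part of the masses tooling, and the
function name, the keep_fixed set and the 0.001 test for informational rules
are assumptions made for illustration; it also assumes every score value on
a line is a plain number.

```python
import re

# Matches an active (uncommented) score line: "score RULE v1 [v2 v3 v4]"
SCORE_LINE = re.compile(r"^\s*score\s+(\S+)\s+(.+)$")


def float_scores(lines, keep_fixed):
    """Comment out score lines so those rules become mutable for a GA run.

    Rules named in keep_fixed, and informational rules whose scores are all
    within +/-0.001, are left untouched.
    """
    out = []
    for line in lines:
        m = SCORE_LINE.match(line)
        if m:
            rule = m.group(1)
            values = m.group(2).split("#", 1)[0].split()
            informational = all(abs(float(v)) <= 0.001 for v in values)
            if rule not in keep_fixed and not informational:
                line = "# " + line
        out.append(line)
    return out


example = [
    "score FOO_EXAMPLE 1.5 2.0 1.0 0.5",   # hypothetical rule name
    "score USER_IN_WHITELIST -100",
    "score HTML_MESSAGE 0.001",
]
for line in float_scores(example, keep_fixed={"USER_IN_WHITELIST"}):
    print(line)
```

Only the first line ends up commented out; the other two stay fixed, which is
the "when in doubt, leave it floating" rule applied mechanically.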

I think at least for some rules GA makes perfect sense, like RDNS_NONE
and RDNS_DYNAMIC. For some of them the GA result is close to the manually
assigned score, or may indicate a need for reconsidering the assigned score.
But I agree that more may need re-fixing.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-27 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #138 from Mark Martinec mark.marti...@ijs.si 2009-10-27 14:29:03 
UTC ---
(In reply to comment #134)
 Some of the spam in my corpora is from third parties. I do check it for
 correct classification before uploading, but I was wondering: how does
 masscheck determine the correct lastexternal for corpora containing messages
 from multiple different networks? Or does it assume all of the messages in a
 given contributor's corpora have the same network boundary? If the latter, I
 need to remove those third-party messages from my spam corpora...
 
 Might lastexternal confusion in the masschecks be contributing in some way to
 the odd RCVD_IN_* score generation?

I believe the mass-checks leave internal/external/msa_networks at their
defaults, unless one cares to configure them correctly for one's corpus. And
I believe it is more likely than not that some corpora were scanned
with unsuitable network settings. I know that configuring this for my
mass-check runs gave me a headache (but I got it right in the end).
Which is why I posted the following note on the ML at that time:


  From: Mark Martinec mark.martinec...@ijs.si
  To: dev@spamassassin.apache.org
  Subject: Re: SpamAssassin 3.3.0 mass-checks now starting
  Date: Fri, 4 Sep 2009 21:46:59 +0200

  Docs don't say where one is supposed to put a local.cf with
  options which are ignored in masses/spamassassin/user_prefs
  (like Bayes SQL options, DCC, Pyzor timeouts etc).

  I tried to place local.cf into masses/spamassassin/, with
  horror results (some directives in local.cf proclaimed as
  invalid, as apparently plugins have not yet been loaded
  at the time of parsing this file, but only later).

  I finally placed it into ../rules/ as mylocal.cf, which
  finally works as expected, but I wonder if this is the proper
  solution. Should be documented I guess...



[Bug 6155] generate new scores for 3.3.0 release

2009-10-27 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #139 from Justin Mason j...@jmason.org 2009-10-27 15:00:50 UTC ---
(In reply to comment #137)
 Sort-of deliberately. Initially I followed the idea in wiki RescoreMassCheck
 section 4.2: 'comment out all score lines except for rules that you think
 the scores are accurate like carefully-vetted net rules, or 0.001
 informational rules' which made perfect sense to me, so I did it for
 50_scores.cf, except for a couple of rather obvious rules like _WHITELIST and
 similar, and the ones clearly indicated as 'indicators' only in the
 surrounding comments, or set to 0.001. Later I nailed a couple more. I
 followed a principle: when in doubt, leave it floating, it can be fixed later
 if necessary. It gives some insight into what GA 'thinks' about certain rules.

That's true.  It's good to hear it's not a bug in the masses scripts, anyway ;)

 I think at least for some rules GA makes perfect sense, like RDNS_NONE
 and RDNS_DYNAMIC.

Yes, I agree, it's actually done a (surprisingly) good job with those.

 For some of them the GA result is close to the manually
 assigned score, or may indicate a need for reconsidering the assigned score.
 But I agree that more may need re-fixing.

cool.

In particular, some of the DNSBLs and most of the DNSWLs are good to 'lock
down', I feel, as users tend to 'compensate' or correct their scores more
frequently than other rules -- in my opinion.  Also, if those are given low
scores by the GA, their operators tend to be annoyed, and it's not good to
annoy people who we're relying on ;)

It also reflects that those rules are slightly different, and hopefully 
more reliable, than a typical body rule for example -- there's no way to
indicate this to the GA yet, so locking the rules is as good as we can do.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-27 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #140 from Justin Mason j...@jmason.org 2009-10-27 15:04:51 UTC ---
(In reply to comment #138)
 I believe the masschecks leaves internal/external/msa_networks to their
 defaults, unless one cares to configure it correctly for his corpus. And
 I believe that it is more likely than not that some corpora were scanned
 with unsuitable settings of networks. I know that configuring it for my
 mass checks runs it gave me a headache (but I did it right in the end).

What should be happening, though, is that we're just underestimating the amount
of -lastexternal rule hits -- the S/O should still be correct, but the overall
number of hits will be less.  Hopefully that will still provide a useful
estimation of accuracy.
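
A small worked example of that argument, as a Python sketch. The hit counts
are invented, and the key assumption (labeled in the code) is that a wrong
network boundary hides the same fraction of hits in ham and in spam; if the
loss were lopsided, S/O would shift as well.

```python
def s_o(ham_hits, spam_hits, n_ham, n_spam):
    """S/O computed from hit rates, so corpus sizes do not skew it."""
    ham_rate = ham_hits / n_ham
    spam_rate = spam_hits / n_spam
    return spam_rate / (spam_rate + ham_rate)


N_HAM = N_SPAM = 100_000          # invented corpus sizes
HAM_HITS, SPAM_HITS = 20, 9_000   # invented hit counts for a -lastexternal rule

full = s_o(HAM_HITS, SPAM_HITS, N_HAM, N_SPAM)

# Assumption: 40% of the hits are missed, evenly across ham and spam.
seen_fraction = 0.6
seen = s_o(HAM_HITS * seen_fraction, SPAM_HITS * seen_fraction, N_HAM, N_SPAM)

print("S/O with all hits:   %.4f" % full)
print("S/O with 60%% of hits: %.4f" % seen)
```

Both lines print the same S/O; only the hit percentages shrink.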


   Docs don't say where one is supposed to put a local.cf with
   options which are ignored in masses/spamassassin/user_prefs
   (like Bayes SQL options, DCC, Pyzor timeouts etc).
 
   I tried to place local.cf into masses/spamassassin/, with
   horror results (some directives in local.cf proclaimed as
   invalid, as apparently plugins have not yet been loaded
   at the time of parsing this file, but only later).
 
   I finally placed it into ../rules/ as mylocal.cf, which
   finally works as expected, but I wonder if this is the proper
   solution. Should be documented I guess...

yuck.  bug 6227.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-26 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Mark Martinec mark.marti...@ijs.si changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #4542|0                           |1
        is obsolete|                            |
   Attachment #4553|0                           |1
        is obsolete|                            |

--- Comment #124 from Mark Martinec mark.marti...@ijs.si 2009-10-26 07:49:13 
UTC ---
Created an attachment (id=4558)
 -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4558)
resulting 50_scores.cf from garescorer runs - V3

Attached is the latest 50_scores.cf file, obtained in a couple of iterations
during the last few days. It takes into account the updated results files
from the rsync submit area, in particular the updated net-wt* (Comment 99,
102, 103), and net-hege* files. The binnocenti* are still excluded.
The rest of the corpora tweaks/decimation as per my previous run, Comment 96.

The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101),
otherwise the _MED stands out above the _HI due to its significantly higher
hit rate.

The KB_RATWARE_OUTLOOK_08, KB_RATWARE_OUTLOOK_12, KB_RATWARE_OUTLOOK_16
and KB_RATWARE_BOUNDARY were now zeroed-out according to Comment 115.

I tried leaving RDNS_NONE and RDNS_DYNAMIC floating (Comment 116, 120, 122),
and it seems to me the obtained score is perfectly sensible and useful,
and still not too high to punish incompetent admins too hard:
  score RDNS_NONE 0 1.1 0 0.7
  score RDNS_DYNAMIC  0 0.5 0 0.5
so I'm leaving these floating.

According to Comment 122 I zeroed out (actually, 0.001'd out) the
HTML_MESSAGE, MIME_QP_LONG_LINE, FREEMAIL_FROM, TVD_SPACE_RATIO,
and MSGID_MULTIPLE_AT.

Some further tweaks: I reduced the BAYES scores somewhat (e.g. from 4.5
to 3.5 for BAYES_99 scoreset3) and tamed down the BAYES_50, which was
standing out from the crowd.

For DCC_* rules I used the already described approach: obtain DCC_CHECK score
from a GA run with all DCC_REPUT_* zeroed-out, then fix the obtained DCC_CHECK,
and let DCC_REPUT_* float for the final run.

The NML_ADSP_CUSTOM_MED was obtained from a GA run, but other (_LOW, _HIGH)
were set manually (currently no hits). The DKIM_ADSP_ALL, DKIM_ADSP_DISCARD,
and DKIM_ADSP_NXDOMAIN are based on GA runs, but hand-tweaked somewhat due
to inconsistencies between corpora.

A word about JM_SOUGHT_FRAUD_{1,2,3}: these three rules come out from
a GA run with scores between 2 and 3, but are somewhat inconsistent
between runs and corpora. As requested by Comment 38 their scores
were fixed at zero for the final run, but I'd set these manually
to 2.2 each for the published 50_scores.cf.

After preparing my manual fixes from a couple of trial runs, I made a
final run for each scoreset with these fixed scores, so as to allow other
scores to adjust themselves to the new constraints.

So here are the manual fixes:

score SPF_PASS -0.001
score SPF_HELO_PASS -0.001

score BAYES_00  0  0  -1.2    -1.9
score BAYES_05  0  0  -0.2    -0.5
score BAYES_20  0  0  -0.001  -0.001
score BAYES_40  0  0  -0.001  -0.001
score BAYES_50  0  0   2.0     0.8
score BAYES_60  0  0   2.5     1.5
score BAYES_80  0  0   2.7     2.0
score BAYES_95  0  0   3.2     3.0
score BAYES_99  0  0   3.8     3.5

score RCVD_IN_DNSWL_LOW  0 -0.6 0 -1.1
score RCVD_IN_DNSWL_MED  0 -1.5 0 -1.2
score RCVD_IN_DNSWL_HI   0 -1.8 0 -1.8

score HTML_MESSAGE 0.001
score NO_RELAYS -0.001
score UNPARSEABLE_RELAY 0.001
score NO_RECEIVED -0.001
score NO_HEADERS_MESSAGE 0.001

score DKIM_ADSP_ALL         0 1.1 0 0.8
score DKIM_ADSP_DISCARD     0 1.8 0 1.8
score DKIM_ADSP_NXDOMAIN    0 0.8 0 0.9
score NML_ADSP_CUSTOM_LOW   0 0.7 0 0.7
score NML_ADSP_CUSTOM_MED   0 1.2 0 0.9
score NML_ADSP_CUSTOM_HIGH  0 2.6 0 2.5

score JM_SOUGHT_FRAUD_1 0
score JM_SOUGHT_FRAUD_2 0
score JM_SOUGHT_FRAUD_3 0

score MIME_QP_LONG_LINE 0.001
score FREEMAIL_FROM 0.001
score TVD_SPACE_RATIO   0.001
score MSGID_MULTIPLE_AT 0.001
score EXTRA_MPART_TYPE 1.0
score RDNS_NONE 0 1.1 0 0.7
score RDNS_DYNAMIC  0 0.5 0 0.5

score KB_RATWARE_OUTLOOK_08  0
score KB_RATWARE_OUTLOOK_12  0
score KB_RATWARE_OUTLOOK_16  0
score KB_RATWARE_BOUNDARY    0
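
For readers less familiar with the score line format: a single value applies
to all four score sets, while four values are read as set 0 through set 3
(no net/no bayes, net/no bayes, no net/bayes, net/bayes). A small Python
sketch of that reading follows; parse_score and SCORE_SETS are illustrative
names, not SpamAssassin code.

```python
# Score-set order as used in this thread: set 0 .. set 3.
SCORE_SETS = ("no net, no bayes", "net, no bayes", "no net, bayes", "net, bayes")


def parse_score(line):
    """Return (rule, {score set: value}) for a 'score RULE ...' line."""
    parts = line.split("#", 1)[0].split()
    if len(parts) < 3 or parts[0] != "score":
        raise ValueError("not a score line: %r" % line)
    rule = parts[1]
    values = [float(v) for v in parts[2:]]
    if len(values) == 1:
        values = values * 4        # one value is used for every score set
    elif len(values) != 4:
        raise ValueError("expected 1 or 4 scores: %r" % line)
    return rule, dict(zip(SCORE_SETS, values))


for line in ("score RDNS_NONE 0 1.1 0 0.7", "score SPF_PASS -0.001"):
    rule, per_set = parse_score(line)
    print(rule, per_set)
```

Read this way, RDNS_NONE only contributes in the two network-enabled sets,
which is what one would expect of a rule that needs DNS lookups.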



[Bug 6155] generate new scores for 3.3.0 release

2009-10-26 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #125 from Mark Martinec mark.marti...@ijs.si 2009-10-26 08:00:59 
UTC ---
$ head test scores

=
score set 3 (net, bayes) - gen-set3-20-5.0-12200-ga

test (10%)
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21172  99.93%
# Correctly spam:      43597  98.78%
# False positives:        14   0.07%
# False negatives:       537   1.22%
# TCR(l=50): 35.678254  SpamRecall: 98.783%  SpamPrec: 99.968%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168143  32.193%  (99.979% of non-spam corpus)
# Correctly spam:     349734  66.961%  (98.763% of spam corpus)
# False positives:        36   0.007%  (0.021% of nonspam,   8360 weighted)
# False negatives:      4382   0.839%  (1.237% of spam,  14401 weighted)
# Average score for spam:  21.1    nonspam: -2.2
# Average for false-pos:   5.5  false-neg: 3.3
# TOTAL:              522295  100.00%

=
score set 2 (no net, bayes) - gen-set2-10-5.0-12200-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21148  99.82%
# Correctly spam:      41172  93.29%
# False positives:        38   0.18%
# False negatives:      2962   6.71%
# TCR(l=50): 9.077334  SpamRecall: 93.289%  SpamPrec: 99.908%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 167953  32.157%  (99.866% of non-spam corpus)
# Correctly spam:     329931  63.169%  (93.170% of spam corpus)
# False positives:       226   0.043%  (0.134% of nonspam,  26882 weighted)
# False negatives:     24185   4.631%  (6.830% of spam,  89229 weighted)
# Average score for spam:  10.8    nonspam: -0.7
# Average for false-pos:   5.6  false-neg: 3.7
# TOTAL:              522295  100.00%

=
score set 1 (net, no bayes) - gen-set1-10-5.0-12201-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21155  99.85%
# Correctly spam:      43153  97.78%
# False positives:        31   0.15%
# False negatives:       981   2.22%
# TCR(l=50): 17.437377  SpamRecall: 97.777%  SpamPrec: 99.928%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168012  32.168%  (99.901% of non-spam corpus)
# Correctly spam:     346216  66.287%  (97.769% of spam corpus)
# False positives:       167   0.032%  (0.099% of nonspam,  20194 weighted)
# False negatives:      7900   1.513%  (2.231% of spam,  23052 weighted)
# Average score for spam:  19.8    nonspam: -0.5
# Average for false-pos:   5.7  false-neg: 2.9
# TOTAL:              522295  100.00%

=
score set 0 (no net, no bayes) - gen-set0-5-5.0-12201-ga

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  20919  98.74%
# Correctly spam:  34081  77.22%
# False positives:   267  1.26%
# False negatives: 10053  22.78%
# TCR(l=50): 1.885827  SpamRecall: 77.222%  SpamPrec: 99.223%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 166261  31.833%  (98.860% of non-spam corpus)
# Correctly spam: 271409  51.965%  (76.644% of spam corpus)
# False positives:  1918  0.367%  (1.140% of nonspam, 126535 weighted)
# False negatives: 82707  15.835%  (23.356% of spam, 235514 weighted)
# Average score for spam:  10.4    nonspam: 0.6
# Average for false-pos:   6.3  false-neg: 2.8
# TOTAL:  522295  100.00%

=
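
The TCR, SpamRecall and SpamPrec figures in the "test:" blocks above can be recomputed from the four counts in each block. A minimal Python sketch, assuming TCR is the usual total cost ratio (spam count divided by lambda-weighted false positives plus false negatives); that assumption reproduces the logged values. The function names are illustrative and are not taken from the masses/ scripts.

```python
def total_cost_ratio(spam_ok, false_neg, false_pos, lam=50):
    """TCR = n_spam / (lam * FP + FN); lam is the cost weight of one FP."""
    n_spam = spam_ok + false_neg
    return n_spam / (lam * false_pos + false_neg)


def spam_recall(spam_ok, false_neg):
    """Fraction of all spam that was caught."""
    return spam_ok / (spam_ok + false_neg)


def spam_precision(spam_ok, false_pos):
    """Fraction of messages flagged as spam that really were spam."""
    return spam_ok / (spam_ok + false_pos)


if __name__ == "__main__":
    # Counts copied from the score set 2 "test:" block above.
    spam_ok, false_neg, false_pos = 41172, 2962, 38
    print(f"TCR(l=50): {total_cost_ratio(spam_ok, false_neg, false_pos):.6f}")
    print(f"SpamRecall: {spam_recall(spam_ok, false_neg) * 100:.3f}%")
    print(f"SpamPrec: {spam_precision(spam_ok, false_pos) * 100:.3f}%")
```

With lam=50 a single false positive costs as much as fifty missed spams, which is why set 0, with 267 false positives in the test split, ends up with a TCR below 2.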




In summary:
set 3
# False positives:    36  (0.021% of nonspam)
# False negatives:  4382  (1.237% of spam)

set 2
# False positives:   226  (0.134% of nonspam)
# False negatives: 24185  (6.830% of spam)

set 1
# False positives:   167  (0.099% of nonspam)
# False negatives:  7900  (2.231% of spam)

set 0
# False positives:  1918  (1.140% of nonspam)
# False negatives: 82707  (23.356% of spam)
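
The percentages in this summary follow directly from the counts in the "scores:" blocks. A small Python sketch of that arithmetic, using the set 2, set 1 and set 0 counts quoted above; the function name is made up for illustration:

```python
def error_rates(ham_ok, spam_ok, false_pos, false_neg):
    """Return (FP % of nonspam corpus, FN % of spam corpus)."""
    n_ham = ham_ok + false_pos      # every nonspam message, right or wrong
    n_spam = spam_ok + false_neg    # every spam message, right or wrong
    return 100.0 * false_pos / n_ham, 100.0 * false_neg / n_spam


if __name__ == "__main__":
    runs = {
        "set 2": (167953, 329931, 226, 24185),
        "set 1": (168012, 346216, 167, 7900),
        "set 0": (166261, 271409, 1918, 82707),
    }
    for name, counts in runs.items():
        fp_pct, fn_pct = error_rates(*counts)
        print(f"{name}: FP {fp_pct:.3f}% of nonspam, FN {fn_pct:.3f}% of spam")
```

In each run the four counts add up to the logged TOTAL of 522295 (168179 nonspam plus 354116 spam).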



[Bug 6155] generate new scores for 3.3.0 release

2009-10-26 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #127 from Mark Martinec mark.marti...@ijs.si 2009-10-26 08:09:26 
UTC ---
Created an attachment (id=4560)
 -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4560)
ranges.data on corpora used for score set 3 and 2 runs



[Bug 6155] generate new scores for 3.3.0 release

2009-10-26 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #128 from Karsten Bräckelmann guent...@rudersport.de 2009-10-26 
09:57:28 UTC ---
(In reply to comment #124)
 Created an attachment (id=4558)
 -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4558)
 resulting 50_scores.cf from garescorer runs - V3

Now I am getting really nervous. :-/  From the scores:

 score KB_DATE_CONTAINS_TAB  3.799 3.799 3.315 2.871
 score KB_FAKED_THE_BAT  1.447 2.273 2.452 3.799

The bad thing about this is that onet.pl / onet.eu (a Polish free-mailer,
AFAIK) actually munges the header and injects the tab into the Date header on
their outgoing SMTP servers. Apparently they do that to all outgoing
mail, not only to mail from their web-mailer.

It is a very, very stupid thing for them to do, munging MUA-generated headers
like that, but still they appear to do it. :(  That means their customers will
really be punished, and using them *and* The Bat! is a killer.

FWIW, I once wrote these to counter a flood of low-scorers -- but the above
scores are scaring me. This is quite bad.


[Bug 6155] generate new scores for 3.3.0 release

2009-10-26 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #129 from Matthias Leisi matth...@leisi.net 2009-10-26 10:36:56 
UTC ---
(In reply to comment #124)

 The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101),
 otherwise the _MED stands out above the _HI due to its significantly higher
 hit rate.
 [..]

 score RCVD_IN_DNSWL_LOW  0 -0.6 0 -1.1
 score RCVD_IN_DNSWL_MED  0 -1.5 0 -1.2
 score RCVD_IN_DNSWL_HI   0 -1.8 0 -1.8

Is there a particular reason why these are so much different from those in 
https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf:

| score RCVD_IN_DNSWL_LOW 0 -1 0 -1
| score RCVD_IN_DNSWL_MED 0 -4 0 -4
| score RCVD_IN_DNSWL_HI 0 -8 0 -8



[Bug 6155] generate new scores for 3.3.0 release

2009-10-26 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #130 from Mark Martinec mark.marti...@ijs.si 2009-10-26 11:03:28 
UTC ---
  The RCVD_IN_DNSWL_* scores are hand-tweaked (according to Comment 101),
  otherwise the _MED stands out above the _HI due to its significantly higher
  hit rate.
  score RCVD_IN_DNSWL_LOW  0 -0.6 0 -1.1
  score RCVD_IN_DNSWL_MED  0 -1.5 0 -1.2
  score RCVD_IN_DNSWL_HI   0 -1.8 0 -1.8
 
 Is there a particular reason why these are so much different from those in 
 https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf:
 
 | score RCVD_IN_DNSWL_LOW 0 -1 0 -1
 | score RCVD_IN_DNSWL_MED 0 -4 0 -4
 | score RCVD_IN_DNSWL_HI  0 -8 0 -8

The -1/-4/-8 were manually provided (don't know the background on this
decision).

The RCVD_IN_DNSWL_MED in my GA results was obtained automatically, and the
other two were manually adjusted to make some sense compared to _MED.
Btw, the GA results on scoreset 3 from one of my previous runs were:
  RCVD_IN_DNSWL_LOW -2.761
  RCVD_IN_DNSWL_MED -0.999
  RCVD_IN_DNSWL_HI  -0.966
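
For readers comparing the four-column score lines quoted in this thread: each column belongs to one score set (set 0 no net/no bayes, set 1 net/no bayes, set 2 no net/bayes and, by elimination, set 3 net plus bayes), and a line with a single value applies to all four sets. A rough Python sketch of reading such a line; the helper name is hypothetical and this is not SpamAssassin's own parser:

```python
def parse_score_line(line):
    """Parse 'score NAME v0 [v1 v2 v3]' into (name, [set0, set1, set2, set3])."""
    fields = line.split("#", 1)[0].split()   # drop any trailing comment
    if len(fields) < 3 or fields[0] != "score":
        raise ValueError(f"not a score line: {line!r}")
    name = fields[1]
    values = [float(v) for v in fields[2:]]
    if len(values) == 1:
        values = values * 4                  # one value covers every score set
    elif len(values) != 4:
        raise ValueError(f"expected 1 or 4 scores: {line!r}")
    return name, values


if __name__ == "__main__":
    for text in ("score RCVD_IN_DNSWL_MED  0 -1.5 0 -1.2",
                 "score RCVD_IN_DNSWL_HI 0 -8 0 -8"):
        name, per_set = parse_score_line(text)
        print(name, "net-enabled sets 1 and 3:", per_set[1], per_set[3])
```

The zeros in columns 0 and 2 of the DNSWL lines are expected, since those are the score sets without network tests.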



[Bug 6155] generate new scores for 3.3.0 release

2009-10-26 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #133 from Justin Mason j...@jmason.org 2009-10-26 13:51:54 UTC ---
strange, some of the more trustworthy BLs are very low scoring.

RCVD_IN_XBL: 0.404 and 0.722

these have been effectively zeroed, although are supposed to be immutable:
RCVD_IN_SSC_TRUSTED_COI is 0  (with a 0.012 S/O, low hit rate though)
HABEAS_ACCREDITED_COI is 0  (ditto)
RCVD_IN_BSP_TRUSTED is -0.001  (although with a 0.002 S/O)

the HASHCASH rules likewise aren't supposed to be mutable.

it looks like there might be a bit of a problem there -- definitely some rules
that are in immutable sections, like the above, have been allowed to be mutable
in ranges.data



[Bug 6155] generate new scores for 3.3.0 release

2009-10-26 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #134 from John Hardin jhar...@impsec.org 2009-10-26 14:31:20 UTC 
---
(In reply to comment #132)

 $ grep RCVD_IN_DNSWL_ freqs.full
 OVERALL   SPAM%     HAM%    S/O   RANK   SCORE  NAME
   0.184   0.0005   0.5708  0.001   0.76   -1.80  RCVD_IN_DNSWL_HI
   7.410   0.1094  22.7527  0.005   0.67   -1.20  RCVD_IN_DNSWL_MED
   2.551   0.1810   7.5322  0.023   0.59   -1.10  RCVD_IN_DNSWL_LOW
 
 It is quite possible that some of these hits are still false positives,
 despite several iterations of cleaning:
 
 for j in spam*.log; do echo -n $j; grep RCVD_IN_DNSWL_HI $j | \
   wc -l; done | sort -k2nr
 
 spam-bayes-net-bb-jhardin.log 3
 
 same on _MED:
 
 spam-bayes-net-bb-jhardin.log  23

All but one of those are obvious spams, and I've removed the one questionable
one from my corpora.

Some of the spam in my corpora is from third parties. I do check it for correct
classification before uploading, but I was wondering: how does masscheck
determine the correct lastexternal for corpora containing messages from
multiple different networks? Or does it assume all of the messages in a given
contributor's corpora have the same network boundary? If the latter, I need to
remove those third-party messages from my spam corpora...

Might lastexternal confusion in the masschecks be contributing in some way to
the odd RCVD_IN_* score generation?



[Bug 6155] generate new scores for 3.3.0 release

2009-10-26 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #135 from Adam Katz antis...@khopis.com 2009-10-26 16:27:56 UTC 
---
Created an attachment (id=4561)
 -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4561)
Checker for rules that match more ham than spam

I've updated my checker to an actual perl script (still uses elinks as I don't
feel like learning LWP and then parsing HTML).  I've attached the checker,
which can be run with custom parameters for a different ruleset, ham threshold,
or minimum difference for ham:spam ratio.  Here's the current output, listing
all rules that hit 1+% of the ham corpus or that hit 0.05% more of the ham
corpus than of the spam corpus.

H^2/S       HAM%    SPAM%   Score in attachment 4558  Rule
331.9     0.3319   0.0010   0                         OBSCURED_EMAIL
117.4     4.8566   0.2009   -0.001                    SPF_HELO_PASS
88.52     5.5735   0.3509   -0.001                    SPF_PASS
85.61     0.2226   0.0026   0.000 2.099 0.001 1.212   MISSING_MIME_HB_SEP
76.18     0.7085   0.0093   0.001 0.001 0.699 0.699   TVD_RCVD_SPACE_BRACKET
66.19     0.2780   0.0042   1.145 1.542 1.912 2.400   FUZZY_CPILL
49.98     1.0676   0.0228   0.001                     MSGID_MULTIPLE_AT
31.82     0.1496   0.0047   1.494 1.699 1.591 1.516   X_IP
21.86     0.1465   0.0067   0                         SUBJECT_FUZZY_TION
20.40    15.6218  11.9604   0.001                     FREEMAIL_FROM
20.00*   40.9055  83.6301   0.001                     HTML_MESSAGE
17.10     0.1710        0   1.222 0.001 0.082 0.476   MIME_BOUND_DIGITS_15
12.95     0.0609   0.0047   0                         HTML_IFRAME_SRC
12.52     0.0714   0.0057   0                         FORGED_IMS_TAGS
11.56     0.0659   0.0057   0.001 0.001 0.605 0.378   HTML_NONELEMENT_30_40
10.83     0.1127   0.0104   0.033 0.001 0.365 0.413   WEIRD_PORT
10.18     0.3494   0.0343   2.205 0.174 1.299 1.806   FRT_SOMA2
9.721     0.8934   0.0919   1.499 0.419 0.904 0.798   MIME_BASE64_BLANKS
8.996     0.2474   0.0275   0.987 0.750 0.943 1.318   CTYPE_001C_B
8.918     0.1525   0.0171   0.001 2.499 0.268 0.516   DRUGS_MUSCLE
8.373     0.0829   0.0099   0.003 0.978 0.100 1.515   TVD_FW_GRAPHIC_NAME_LONG
8.016     0.1956   0.0244   0.001 0.020 0.001 1.799   MIME_BASE64_TEXT
6.850     0.0685        0   0                         HTML_NONELEMENT_40_50
5.404     0.5356   0.0991   0 1.200 0 2.514           SPF_HELO_FAIL
4.237     0.1585   0.0374   2.199 2.199 1.246 2.090   WEIRD_QUOTING
4.159     3.8908   3.6392   0.001                     MIME_QP_LONG_LINE
3.483     0.8570   0.2460   1.799 0.572 1.182 1.138   HTML_IMAGE_RATIO_06
3.219     1.2399   0.4775   1.0                       EXTRA_MPART_TYPE
2.913*   12.1047  50.2891   0 1.1 0 0.7               RDNS_NONE
2.839     0.1164   0.0410   0.001 2.185 1.936 0.476   FRT_SOMA
2.751     0.1172   0.0426   0.1                       ANY_BOUNCE_MESSAGE
2.417     0.6787   0.2808   0.539 0.001 0.332 0.488   MIME_HTML_MOSTLY
2.370     0.1010   0.0426   0.1                       BOUNCE_MESSAGE
2.078     0.5534   0.2663   1.899 0.496 0.950 0.445   HTML_IMAGE_RATIO_08
1.899     1.2077   0.7677   0.001                     TVD_SPACE_RATIO
1.726     0.3227   0.1869   0.023 0.887 0.000 0.417   UPPERCASE_50_75
1.517     0.9658   0.6364   2.801 2.080 1.780 3.387   DATE_IN_PAST_96_XX
1.269     0.4224   0.3327   0.000 0.001 0.264 0.001   HTML_FONT_SIZE_LARGE
1.151     0.5492   0.4770   2.260 0.742 1.199 0.640   MPART_ALT_DIFF
0.913*    1.8488   3.7425   1.154 1.677 1.198 1.453   SUBJ_ALL_CAPS
0.703*    1.3317   2.5216   0.001                     UNPARSEABLE_RELAY
0.278*    3.7480  50.4848   2.199 0.955 1.215 0.549   MIME_HTML_ONLY
0.121*    1.2540  12.9472   0 1.322 0 1.237           RCVD_IN_BL_SPAMCOP_NET

(Anything asterisked is included because it matched 1% of the ham corpus but
matched a larger percent of the spam corpus while everything else matched a
larger percent of the ham corpus than the spam corpus.)

Mark's fixes solved the immediate issues raised earlier, so I decided to order
this by the ratio of percentage of ham corpus hit to percentage of spam corpus
hit, but that under-emphasized the ham hits, so I then multiplied that by the
ham percentage again (unless the percent was under 1).  It's easy enough to
browse for non-zero ham% hits.

Any rule with a ratio over 1.000 is a problem when scored positively unless it
is exempted for applying to popular spam patterns that the corpus is known to
lack.  For completeness, this list includes all tests that hit at least 1% of
the ham corpus (thus the presence of HTML_MESSAGE, RDNS_NONE, and the four
tests with ratios under 1.0).
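
The ordering metric described above (the H^2/S column) can be sketched in a few lines of Python. This is a reconstruction from the prose and the table values, not the attached perl script. The 0.01 stand-in for a spam percentage of 0 is only inferred from the two rows that list SPAM% as 0, and the function name is made up.

```python
def ham_weighted_ratio(ham_pct, spam_pct, spam_floor=0.01):
    """Ratio of ham% to spam%, multiplied by ham% again when ham% >= 1.

    spam_floor replaces a spam percentage of 0 to avoid dividing by zero;
    the value 0.01 is an inference from the table, not a documented constant.
    """
    spam = spam_pct if spam_pct > 0 else spam_floor
    ratio = ham_pct / spam
    if ham_pct >= 1:
        ratio *= ham_pct          # emphasise rules that hit a lot of ham
    return ratio


if __name__ == "__main__":
    rows = [
        ("OBSCURED_EMAIL", 0.3319, 0.0010),
        ("SPF_HELO_PASS", 4.8566, 0.2009),
        ("MIME_BOUND_DIGITS_15", 0.1710, 0.0),
    ]
    for name, ham, spam in rows:
        print(f"{ham_weighted_ratio(ham, spam):8.2f}  {name}")
```

A few values differ from the table in the last digit, which suggests the original script truncated or worked from unrounded percentages.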



Re: [Bug 6155] generate new scores for 3.3.0 release

2009-10-22 Thread Henrik Krohns
On Wed, Oct 21, 2009 at 06:34:47PM -0700, Michael Peddemors wrote:
 On October 20, 2009, bugzilla-dae...@bugzilla.spamassassin.org wrote:
  Getting back to this issue:  I don't see any problem with prejudice against
  poorly constructed network infrastructures that can't bother to adhere to
   the SMTP standard (RFC1912 section 2.1).  This is something that any
   network admin who should legitimately be managing a mail server should be
   able to fix with a single phone call (please correct me if this sentence
   is prejudiced in any way).
  
  The SMTP standard requires a server's rDNS must match the server's reported
  name (thus the IP must have rDNS), and most allocated IPs have them anyway
  (even if they're wrong or ~dynamic, e.g. RDNS_DYNAMIC).  There is also a
  growing number of deployments that block improper FCrDNS at the door
   (RDNS_NONE is a subset of failing FCrDNS).
  
 
 MagicMail Servers have been blocking all email at the connection level that
 do not have rDNS now for the past couple of years, except when SMTP AUTH is
 presented, and we haven't had an F/P reported in over a year.

Maybe I'm beating a dead horse but..

http://ruleqa.spamassassin.org/20091021-r826376-n/RDNS_NONE/detail

Hopefully you didn't mean that MagicMail somehow is an authority on the
stats or a good example to follow. Even if this isn't the users list, you should
never imply that RDNS_NONE is safe to block at a general 2% ham rate.

Of course it's up to the site policy, but be prepared to..

- Listen to user complaints
- Create a large whitelist
- Deal with imbeciles and hope they fix the DNS _some_ day

;-)



Re: [Bug 6155] generate new scores for 3.3.0 release

2009-10-22 Thread Karsten Bräckelmann
On Wed, 2009-10-21 at 23:35 -0400, Warren Togami wrote:
 On 10/21/2009 10:46 PM, Karsten Bräckelmann wrote:
 
  s/ Warren /SA devs, contributors and mass-check contributors/x
 
  # There is something seriously disturbing with the above comment.
  # Fix using a trivial substitution.

What's disturbing about it is, that despite the recent discussion,
Michael still seems to perceive the entire process of distributed
mass-checks to be writing a rule, and reduces it to that.


  This is not about Warren. He just happens to dump random BLs for a short
  time in his granted sandbox. It is everyone else, who does the heavy
  weight lifting.
 
 While I agree it is unfortunate that he used my name there, don't you 
 think what you wrote here a bit unnecessarily insulting?  This suggests 
 that dumping random BL's into the sandbox is all I do.

Granted, I could have phrased that better. Though just as in my previous
post, this is not about you ;)  but the unfortunate depiction in the
original post. I do not question your contributions and effort.

  guenther


-- 
char *t=\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4;
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;il;i++){ i%8? c=1:
(c=*++x); c128  (s+=h); if (!(h=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: [Bug 6155] generate new scores for 3.3.0 release

2009-10-22 Thread Michael Peddemors
On October 21, 2009, Karsten Bräckelmann wrote:
 SA is ALL about scores, and NOT absolute.
 
 If you want absolute, reject BEFORE even passing the mail to SA. Easy.
 Lots of cycles spared. But since you're a regular on the user's list, I
 assume you've read that before...
 

hehe.. no, not on the users list.. And I think you missed the point, SA is 
about scores, so the really 'prejudiced' rules might belong in a place other 
than SA.. That is one line that is blurry in SA discussions.. at what point is 
a rule prejudiced enough to consider that it is almost an absolute.  Some 
rules score extremely high.. 

No rDNS goes past the idea of scoring.. so does it belong in a scoring system?

Just a topic for discussion.. 


-- 
--
Catch the Magic of Linux...

Michael Peddemors - President/CEO - LinuxMagic
Products, Services, Support and Development
Visit us at http://www.linuxmagic.com

A Wizard IT Company - For More Info http://www.wizard.ca
LinuxMagic is a Registered TradeMark of Wizard Tower TechnoServices Ltd.

604-589-0037 Beautiful British Columbia, Canada

This email and any electronic data contained are confidential and intended 
solely for the use of the individual or entity to which they are addressed. 
Please note that any views or opinions presented in this email are solely 
those of the author and are not intended to  represent those of the company.


Re: [Bug 6155] generate new scores for 3.3.0 release

2009-10-22 Thread Matt Sergeant
On Thu, 22 Oct 2009 09:34:13 -0700, Michael Peddemors wrote:
 I am curious to the large HAM rate..  Again, I think the testing of this rule
 against a corpus might be affecting this.. 

I tend to agree. AOL announced wholesale blocking of anyone with 
NXDOMAIN rDNS a few years back now, and that caused big changes in 
people thinking it was OK to mail from an IP with NXDOMAIN rDNS.

Matt.



Re: [Bug 6155] generate new scores for 3.3.0 release

2009-10-22 Thread Adam Katz
Henrik Krohns wrote:
 I only have to look at my mail logs from today, and I see dozens of legitimate
 RDNS_NONE hits originating from real people. I'm happy to greylist it at
 MTA, but not block directly.
 
 As said, it's a site policy. Some people use high FP BLs also happily. Many
 people might not report FPs for one reason or another, but it doesn't mean
 they don't exist.. I like to be on the safe side.

The question is what defines safe and why is the score pinned to
0.1?  Isn't the whole point of the genetic algorithm to determine what
safe value to assign it?  Who's to say that 0.2 isn't safe?  (I
suppose there's no way to *cap* a GA score rather than just pin it?)

SA is a system of probabilities.  We don't define ham as having 0 or
fewer points.  Again, I cite the masscheck results.  Is 1.7% of the
ham corpus bad?  What about MIME_HTML_ONLY's 3.7% ham, or
RCVD_IN_SPAMCOP_BL's 1.3% ham or SUBJ_ALL_CAPS's 1.8%, ...?  All of
those have GA-generated scores over 0.1.

What about the fact that this only scores 0.8528% corpus overlap for
ham scoring 4+? (like RDNS_NONE, MIME_HTML_ONLY's 3.7% ham overlap is
mostly low-scoring ham, with only 1.5625% matching corpus ham at 4+).

Even the latest scoring proposal here has this line:

  score HTML_MESSAGE 2.199 0.838 1.473 0.511

despite HTML_MESSAGE hitting 40.9% of the ham corpus.

Here are some that hit a larger portion of the ham corpus than of the
spam corpus despite having positive scores in bugzilla attachment 4553
(the latest scoring proposal) at
https://issues.apache.org/SpamAssassin/attachment.cgi?id=4553

MIME_QP_LONG_LINE
FREEMAIL_FROM
TVD_SPACE_RATIO
EXTRA_MPART_TYPE

(among others)

These were found by applying this search to the front page at
http://ruleqa.spamassassin.org (using a firefox regexp search add-on)

/(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w/

In shell (guess whose Bourne scripting is better than his perl?),

elinks -dump http://ruleqa.spamassassin.org/ | perl -ne \
  'print if /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' \
  | tee rules.txt

for rule in $(perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }' rules.txt); do
  grep "^[^#]* $rule " /tmp/50_scores_newest.cf
done
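
For anyone without elinks to hand, the filtering step can be tried offline. Below is a small Python sketch that applies the same regular expression to text that has already been dumped. It assumes the dump has the columns MSECS, SPAM%, HAM%, S/O, RANK, SCORE, NAME, so that the third number is the ham percentage; the sample lines and the function name are made up for illustration.

```python
import re

# Same pattern as the perl one-liner: two numbers, then a HAM% value that
# starts with 1-9 (roughly ham >= 1%), three more numbers, and a rule name
# that does not begin with T_. The MSECS alternative keeps the header line.
HIGH_HAM = re.compile(
    r"(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS"
)


def high_ham_lines(dump_text):
    """Return the lines of a ruleqa-style dump that match the pattern."""
    return [line for line in dump_text.splitlines() if HIGH_HAM.search(line)]


if __name__ == "__main__":
    sample = "\n".join([
        "   MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME",
        "       0  50.4453   3.7391   0.931    0.57    2.30  MIME_HTML_ONLY",
        "       0   0.1848   0.8675   0.037    0.78    0.00  EXAMPLE_LOW_HAM",
        "       0  14.1437   1.1901   0.987    0.67    0.00  T_EXAMPLE_SANDBOX",
    ])
    for line in high_ham_lines(sample):
        print(line)
```

Only the header and the MIME_HTML_ONLY line should survive: the second rule hits under 1% of ham and the third is a T_ sandbox rule.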


Re: [Bug 6155] generate new scores for 3.3.0 release

2009-10-22 Thread Justin Mason
On Thu, Oct 22, 2009 at 20:35, Adam Katz antis...@khopis.com wrote:
 Henrik Krohns wrote:
 I only have to look at my mail logs from today, and I see dozens of legitimate
 RDNS_NONE hits originating from real people. I'm happy to greylist it at
 MTA, but not block directly.

 As said, it's a site policy. Some people use high FP BLs also happily. Many
 people might not report FPs for one reason or another, but it doesn't mean
 they don't exist.. I like to be on the safe side.

 The question is what defines safe and why is the score pinned to
 0.1?  Isn't the whole point of the genetic algorithm to determine what
 safe value to assign it?  Who's to say that 0.2 isn't safe?  (I
 suppose there's no way to *cap* a GA score rather than just pin it?)

One thing we need to take into account is that some rules are harder
for senders to fix than others.  Whether or not their ISP gives them
rDNS is quite tricky to fix.  The GA can't take that into account, but
we can, by setting a score manually and locking it as non-mutable.
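
The difference between pinning a score and merely capping it can be pictured as a per-rule range that the optimiser's proposal is clamped into. This is only an illustrative Python sketch with made-up names and example ranges; it does not describe how the GA or ranges.data actually handles mutability.

```python
def clamp_scores(proposed, ranges):
    """Clamp each proposed score into its (low, high) range.

    A pinned (immutable) rule has low == high, so any proposal collapses
    to the hand-set value. A capped rule has low < high and may move
    freely below the cap. Rules without a range are left untouched.
    """
    clamped = {}
    for rule, score in proposed.items():
        low, high = ranges.get(rule, (float("-inf"), float("inf")))
        clamped[rule] = min(max(score, low), high)
    return clamped


if __name__ == "__main__":
    proposed = {"RDNS_NONE": 1.7, "HTML_MESSAGE": 2.199, "SUBJ_ALL_CAPS": 1.045}
    ranges = {
        "RDNS_NONE": (0.0, 0.2),         # hypothetical cap at 0.2
        "HTML_MESSAGE": (0.001, 0.001),  # pinned: low == high
    }
    print(clamp_scores(proposed, ranges))
```

Under this picture the existing "score RDNS_NONE 0.1" is a pin, while a cap would let the optimiser choose anything up to the limit.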

--j.

 SA is a system of probabilities.  We don't define ham as having 0 or
 fewer points.  Again, I cite the masscheck results.  Is 1.7% of the
 ham corpus bad?  What about MIME_HTML_ONLY's 3.7% ham, or
 RCVD_IN_SPAMCOP_BL's 1.3% ham or SUBJ_ALL_CAPS's 1.8%, ...?  All of
 those have GA-generated scores over 0.1.

 What about the fact that this only scores 0.8528% corpus overlap for
 ham scoring 4+? (like RDNS_NONE, MIME_HTML_ONLY's 3.7% ham overlap is
 mostly low-scoring ham, with only 1.5625% matching corpus ham at 4+).

 Even the latest scoring proposal here has this line:

  score HTML_MESSAGE 2.199 0.838 1.473 0.511

 despite HTML_MESSAGE hitting 40.9% of the ham corpus.

agh!  that's a bug.

 Here are some that hit a larger portion of the ham corpus than of the
 spam corpus despite having positive scores in bugzilla attachment 4553
 (the latest scoring proposal) at
 https://issues.apache.org/SpamAssassin/attachment.cgi?id=4553

 MIME_QP_LONG_LINE
 FREEMAIL_FROM
 TVD_SPACE_RATIO
 EXTRA_MPART_TYPE

 (among others)

 These were found by applying this search to the front page at
 http://ruleqa.spamassassin.org (using a firefox regexp search add-on)

 /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w/

 In shell (guess who's bourne scripting is better than his perl?),

 elinks -dump http://ruleqa.spamassassin.org/ |perl -ne 'print if
 /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' |tee
 rules.txt

 for rule in `perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }'
 rules.txt`; do grep ^[^#]* $rule  /tmp/50_scores_newest.cf; done


Could you add a comment  to the rescoring bug (bug 6155) noting those
over-high scores?  HTML_MESSAGE at least should NOT be mutable like
that :(

-- 
--j.


[Bug 6155] generate new scores for 3.3.0 release

2009-10-22 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #122 from Adam Katz antis...@khopis.com 2009-10-22 13:32:40 UTC 
---
Some bugs in the auto-generated rules from attachment 4553

HTML_MESSAGE scores WAY too high.  There are others too.

Full list as of right now:


   MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
       0   0.1848   4.8675   0.037    0.78    0.00  SPF_HELO_PASS
       0   0.3294   5.5859   0.056    0.74    0.00  SPF_PASS
       0  12.2476   1.2568   0.907    0.58    0.00  RCVD_IN_BL_SPAMCOP_NET
       0  50.4453   3.7391   0.931    0.57    2.30  MIME_HTML_ONLY
       0  49.9300  12.1231   0.805    0.52    0.10  RDNS_NONE
       0   3.8466   1.8427   0.676    0.51    2.30  SUBJ_ALL_CAPS
       0   2.3989   1.3218   0.645    0.50    0.00  UNPARSEABLE_RELAY
       0  83.7769  40.8865   0.672    0.49    0.00  HTML_MESSAGE
       0   3.4477   3.8932   0.470    0.47    2.50  MIME_QP_LONG_LINE
       0  12.2361  15.6252   0.439    0.46    0.00  FREEMAIL_FROM
       0   0.7695   1.2102   0.389    0.41    2.90  TVD_SPACE_RATIO
       0   0.4610   1.2409   0.271    0.35    1.00  EXTRA_MPART_TYPE
       0   0.0271   1.0700   0.025    0.15    1.22  MSGID_MULTIPLE_AT

score SPF_HELO_PASS -0.001
score SPF_PASS -0.001
score RCVD_IN_BL_SPAMCOP_NET 0 1.725 0 1.180 # n=2
score MIME_HTML_ONLY 1.474 0.737 0.829 0.462
score RDNS_NONE 0.1
score SUBJ_ALL_CAPS 0.264 1.568 0.593 1.045
score UNPARSEABLE_RELAY 0.001
score HTML_MESSAGE 2.199 0.838 1.473 0.511
score MIME_QP_LONG_LINE 0.074 0.242 0.116 0.002
score FREEMAIL_FROM 0.817 1.020 0.401 0.856
score TVD_SPACE_RATIO 0.001 0.201 0.398 0.001
score MSGID_MULTIPLE_AT 0.001 0.001 0.598 0.000
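
As a reading aid for the table above: the S/O column is consistent with spam% divided by the sum of spam% and ham%, so values near 1 mean the rule hits almost only spam and values near 0 almost only ham. A quick Python check of that relationship against three of the rows; the function name is illustrative:

```python
def spam_over_overall(spam_pct, ham_pct):
    """S/O = spam% / (spam% + ham%); None when the rule never hit."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else None


if __name__ == "__main__":
    rows = [
        ("MIME_HTML_ONLY", 50.4453, 3.7391),
        ("RDNS_NONE", 49.9300, 12.1231),
        ("SPF_HELO_PASS", 0.1848, 4.8675),
    ]
    for name, spam, ham in rows:
        print(f"{spam_over_overall(spam, ham):.3f}  {name}")
```

HTML_MESSAGE, at an S/O of 0.672 with 40.9% of ham hit, shows why a positive score of 2.199 on it stands out as wrong.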


To fetch them for yourself (so as to get something more up-to-date or from a
different URL, etc), here's the code I ran (sorry, I know posix shell better
than perl, so I dip into both):

elinks -dump http://ruleqa.spamassassin.org/ | perl -ne \
  'print if /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' \
  | tee rules.txt

for rule in $(perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }' rules.txt); do
  grep "^[^#]* $rule " /tmp/50_scores_newest.cf
done


That could probably be written better, e.g. by looking for ham% > spam% in
addition to the ham% threshold, but this is a good first pass.

Obviously, /removing/ fixed scores for things like RDNS_NONE can't possibly be
considered until the GA is a little more apt at figuring this sort of thing
out.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-22 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #123 from Adam Katz antis...@khopis.com 2009-10-22 13:47:40 UTC 
---
(In reply to comment #122)
sorry, that should be:

elinks -dump http://ruleqa.spamassassin.org/ | perl -ne \
  'print if /(\s+[\d.]+){2}\s+[1-9][\d.]+(\s+[\d.]+){3}\s+(?!T_)\w|\sMSECS/' \
  | tee rules.txt

for rule in $(perl -ne 'if (/.*\s([A-Z]+\w*_\w*)/) { s//$1/; print; }' rules.txt); do
  grep "^[^#]* $rule " /tmp/50_scores_newest.cf || echo "score $rule UNKNOWN"
done

Each of those two stanzas can also be joined onto just one line.

Obviously, ignore the genuine ham rules.
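
The lookup loop can also be sketched in Python for anyone who wants to run it against a local copy of the scores file. This assumes the grep pattern was meant to be quoted as "^[^#]* $rule " (an uncommented line containing the rule name followed by a space); the function name and sample text are illustrative only.

```python
import re


def lookup_scores(rules, scores_cf_text):
    """Return matching score lines per rule, or 'score RULE UNKNOWN'."""
    lines = scores_cf_text.splitlines()
    found = []
    for rule in rules:
        # Mirrors grep "^[^#]* RULE ": no '#' before the name, space after it.
        pattern = re.compile(r"^[^#]* " + re.escape(rule) + " ")
        hits = [line for line in lines if pattern.search(line)]
        found.extend(hits if hits else [f"score {rule} UNKNOWN"])
    return found


if __name__ == "__main__":
    cf = "\n".join([
        "score RDNS_NONE 0.1",
        "# score HTML_MESSAGE 2.199 0.838 1.473 0.511",
        "score SPF_PASS -0.001",
    ])
    for line in lookup_scores(["RDNS_NONE", "HTML_MESSAGE", "SPF_PASS"], cf):
        print(line)
```

As with the shell version, a rule whose only score line is commented out is reported as UNKNOWN.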



Re: [Bug 6155] generate new scores for 3.3.0 release

2009-10-21 Thread Michael Peddemors
On October 20, 2009, bugzilla-dae...@bugzilla.spamassassin.org wrote:
 Getting back to this issue:  I don't see any problem with prejudice against
 poorly constructed network infrastructures that can't bother to adhere to
  the SMTP standard (RFC1912 section 2.1).  This is something that any
  network admin who should legitimately be managing a mail server should be
  able to fix with a single phone call (please correct me if this sentence
  is prejudiced in any way).
 
 The SMTP standard requires a server's rDNS must match the server's reported
 name (thus the IP must have rDNS), and most allocated IPs have them anyway
 (even if they're wrong or ~dynamic, e.g. RDNS_DYNAMIC).  There is also a
 growing number of deployments that block improper FCrDNS at the door
  (RDNS_NONE is a subset of failing FCrDNS).
 

MagicMail Servers have been blocking all email at the connection level that do 
not have rDNS now for the past couple of years, except when SMTP AUTH is 
presented, and we haven't had an F/P reported in over a year.

However, this SHOULD be the MTA responsibility, and not the filtering system.  
Of course there are some MTA's still out there where this may help, but it is 
better to reject those during SMTP phase, so that the clueless admin can get 
reverse DNS up as soon as possible.. HOWEVER.. 

Please note, you have to watch this.. we have seen too many times where 
temporary DNS failures resulted in email blockages, and you don't want to be 
dropping those messages on the floor when that happens.. 

Better to reject them, or at least send back temporary deferrals... 
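
The distinction drawn above, rejecting on a definite "no rDNS" answer but deferring when the DNS lookup itself fails, can be written down as a small decision table. This is a hypothetical Python sketch, not MagicMail's or any MTA's actual logic; the enum, function name and reply texts are invented, and only the reply codes (250, 451, 550) are standard SMTP codes.

```python
from enum import Enum


class RdnsResult(Enum):
    FOUND = "found"          # a PTR record exists
    NXDOMAIN = "nxdomain"    # authoritative answer: no PTR record
    TEMPFAIL = "tempfail"    # timeout, SERVFAIL or other transient error


def smtp_decision(result, authenticated=False):
    """Map an rDNS lookup outcome to an (SMTP reply code, text) pair."""
    if authenticated:
        return 250, "accepted (SMTP AUTH, rDNS not required)"
    if result is RdnsResult.FOUND:
        return 250, "accepted"
    if result is RdnsResult.NXDOMAIN:
        return 550, "rejected: no reverse DNS for client address"
    # Never drop mail because our own resolver hiccupped: ask for a retry.
    return 451, "deferred: temporary DNS failure, try again later"


if __name__ == "__main__":
    for outcome in RdnsResult:
        print(outcome.name, smtp_decision(outcome))
```

Either reply leaves the message in the sender's queue or produces a bounce there, so nothing is silently lost.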

Another point, is that the SMTP 'standard' is not yet a standard.. In the real 
world, just be happy they have any sort of reverse DNS.. We are trying to 
adopt a standard where at least the reverse DNS resolves to a domain owned by 
the email operator, (and not his upstream providers generic addressing scheme) 
and we still get some push back on that.. to get the average MS Exchange 
operator to set up the servers' reported name.. how many times do we see HELO 
localhost.localnet still :) And there are many operators who have reasons NOT 
to do this.. (Email Clusters, Server with Internal Naming Conventions et al)

It would be nice to see SA having to do less of the 'Best Practices' stuff.. 
leave that to MTA's.. 

Just thought I would put my two bits in SA 'could' go farther with 
'prejudiced' rules, but if they are sufficiently prejudiced, should they not 
be absolutes, instead of scores? 

PS, since I am posting.. 

Warren, have you done any 'testing' with the SPAM-RATS RBL's against the 
corpus? would be interested in numbers.. even with the variables of aged 
dating, and not checking SMTP Authed messages..



Re: [Bug 6155] generate new scores for 3.3.0 release

2009-10-21 Thread Karsten Bräckelmann
On Wed, 2009-10-21 at 18:34 -0700, Michael Peddemors wrote:
 MagicMail Servers have been blocking all email at the connection level that
 do not have rDNS now for the past couple of years, except when SMTP AUTH is
 presented, and we haven't had an F/P reported in over a year.

Funnily enough, there are ISPs out there advertising targeted towards
small businesses, handing over static IPs with NO rDNS whatsoever.
Dialup customers do have (generic) rDNS.

Not made up. Political decision.


 Just thought I would put my two bits in SA 'could' go farther with 
 'prejudiced' rules, but if they are sufficiently prejudiced, should they not 
 be absolutes, instead of scores? 

SA is ALL about scores, and NOT absolute.

If you want absolute, reject BEFORE even passing the mail to SA. Easy.
Lots of cycles spared. But since you're a regular on the user's list, I
assume you've read that before...





Re: [Bug 6155] generate new scores for 3.3.0 release

2009-10-21 Thread Karsten Bräckelmann
On Wed, 2009-10-21 at 22:03 -0400, Warren Togami wrote:
 On 10/21/2009 09:34 PM, Michael Peddemors wrote:
 
  Warren, have you done any 'testing' with the SPAM-RATS RBL's against the
  corpus? would be interested in numbers.. even with the variables of aged
  dating, and not checking SMTP Authed messages..

s/ Warren /SA devs, contributors and mass-check contributors/x

# There is something seriously disturbing with the above comment.
# Fix using a trivial substitution.

This is not about Warren. He just happens to dump random BLs for a short
time in his granted sandbox. It is everyone else, who does the heavy
weight lifting.

 I have never seen this RBL before.

You might want to catch up on years of user's list archives, first.
There are opinions, and folks who tested it. Nothing new, really.

 I assume this is your service, and you give us permission to swamp it 
 with hundreds of thousands of rapid queries every Saturday?  If so I'll 
 give sufficient warning to the list here and add it before Saturday 
 masscheck.

Warning, or a brief discussion, if it might actually be worthwhile. Or
not.





Re: [Bug 6155] generate new scores for 3.3.0 release

2009-10-21 Thread Warren Togami

On 10/21/2009 10:46 PM, Karsten Bräckelmann wrote:


s/ Warren /SA devs, contributors and mass-check contributors/x

# There is something seriously disturbing with the above comment.
# Fix using a trivial substitution.

This is not about Warren. He just happens to dump random BLs for a short
time in his granted sandbox. It is everyone else, who does the heavy
weight lifting.


While I agree it is unfortunate that he used my name there, don't you 
think what you wrote here is a bit unnecessarily insulting?  This suggests 
that dumping random BLs into the sandbox is all I do.



Warning, or a brief discussion, if it might actually be worthwhile. Or
not.


Sure.

Warren


[Bug 6155] generate new scores for 3.3.0 release

2009-10-20 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #110 from Justin Mason j...@jmason.org 2009-10-20 03:46:49 UTC ---
(In reply to comment #109)
 (In reply to comment #108)
  The important questions are, where is KB_RATWARE_BOUNDARY, which was
  specifically pushed right before the deadline to supersede these?
 
 Argh!  It is in freqs.full, attachment 4541. However, it appears we've been
 using inconsistent rule-sets, with most contributors using one outdated
 rule-set or the other. :-(
 
  10.830  14.1437   0.1901   0.987   0.67   0.00  T_KB_RATWARE_BOUNDARY
   0.025   0.0327   0.0000   1.000   0.65   1.00  KB_RATWARE_BOUNDARY

mysterious:

: exit=[130] uid=jm Tue Oct 20 10:40:30 GMT 2009; cd
/export/home/corpus-rsync/corpus/submit
: 6...; grep KB_RATWARE_BOUNDARY *.log | grep -v T_KB_RATWARE_BOUNDARY
: exit=[0 1] uid=jm Tue Oct 20 10:43:41 GMT 2009; cd
/export/home/corpus-rsync/corpus/submit

I can't find any non-T_ hits in the submit logs.  Mark?
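One caveat with the pipeline above: `grep -v T_KB_RATWARE_BOUNDARY` discards every line containing the T_ name, so a message that hit both the T_ and the non-T_ rule would be hidden too. A minimal sketch of a whole-name count follows. The sample lines and the helper name `count_rule_hits` are made up for illustration and are not taken from the submit logs.

```python
import re

def count_rule_hits(lines, rule):
    """Count lines where `rule` occurs as a complete rule name."""
    # Rule names are built from letters, digits and underscores, so the
    # lookarounds stop T_<rule> from counting as a hit on <rule>.
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(rule) + r"(?![A-Za-z0-9_])"
    )
    return sum(1 for line in lines if pattern.search(line))

# Hypothetical log lines, only to exercise the function.
sample = [
    "Y 12 /corpus/spam/1 T_KB_RATWARE_BOUNDARY,RDNS_NONE time=1",
    "Y 15 /corpus/spam/2 KB_RATWARE_BOUNDARY,T_KB_RATWARE_BOUNDARY time=2",
    ". 0 /corpus/ham/3 RDNS_NONE time=3",
]

print(count_rule_hits(sample, "KB_RATWARE_BOUNDARY"))
print(count_rule_hits(sample, "T_KB_RATWARE_BOUNDARY"))
```

In the sample, the second line would be dropped by `grep -v` even though it carries a non-T_ hit; the whole-name match still counts it.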



[Bug 6155] generate new scores for 3.3.0 release

2009-10-20 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #111 from Justin Mason j...@jmason.org 2009-10-20 03:48:45 UTC ---
(In reply to comment #110)
 (In reply to comment #109)
  (In reply to comment #108)
   The important questions are, where is KB_RATWARE_BOUNDARY, which was
   specifically pushed right before the deadline to supersede these?

anyway it doesn't look like that rule is good enough to supersede them:

 10.830  14.1437   0.1901   0.987   0.67   0.00  T_KB_RATWARE_BOUNDARY

vs

  9.846  12.9126   0.0003   1.000   0.98   1.00  KB_RATWARE_OUTLOOK_08
  9.836  12.8985   0.0003   1.000   0.98   1.00  KB_RATWARE_OUTLOOK_MID
  9.835  12.8976   0.0003   1.000   0.98   1.00  KB_RATWARE_OUTLOOK_16
  9.835  12.8976   0.0003   1.000   0.98   1.00  KB_RATWARE_OUTLOOK_12

that's a much higher FP rate!



[Bug 6155] generate new scores for 3.3.0 release

2009-10-20 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #112 from Karsten Bräckelmann guent...@rudersport.de 2009-10-20 
04:15:03 UTC ---
 anyway it doesn't look like that rule is good enough to supersede them:
 that's a much higher FP rate!

Yes. It's all Warren's fault! ;)  Seriously, the new BOUNDARY one does indeed
have quite some FPs, all in Warren's corpus, and he kindly provided me with the
samples. Appears these are all entirely legit, though auto-generated messages.
I wish MS wouldn't re-use their code like that.
  X-Mailer: Microsoft CDO for Windows 2000

Anyway, I agree -- RATWARE_BOUNDARY is bad, I screwed up with too low a range
between headers. One of the previous rules needs to be kept. (The massive
overlap along with the introduced FNs made it drop off of the active rules.)

Still wondering why there are different rule names in freqs.


[Bug 6155] generate new scores for 3.3.0 release

2009-10-20 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #113 from Karsten Bräckelmann guent...@rudersport.de 2009-10-20 
04:43:31 UTC ---
   9.836  12.8985   0.0003   1.000   0.98   1.00  KB_RATWARE_OUTLOOK_MID

Proposing the MID variant for inclusion, and dropping the other variants.

The BOUNDARY one is bad, and the variants do have an almost 100% overlap with
the MID one. It's also the most strict one. (Funny side-effect of the
additional constraint is actually catching a spam or two more... Go figure.)

The ham hit probably is not really ham (no FP in nightlies).


[Bug 6155] generate new scores for 3.3.0 release

2009-10-20 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #114 from Justin Mason j...@jmason.org 2009-10-20 08:26:26 UTC ---
(In reply to comment #113)
   9.836  12.8985   0.0003   1.000   0.98   1.00  KB_RATWARE_OUTLOOK_MID
 
 Proposing the MID variant for inclusion, and dropping the other variants.

can you list exactly which rules you want zeroed, before Mark reruns the GA
accordingly?  minimize the work he has to do ;)



[Bug 6155] generate new scores for 3.3.0 release

2009-10-20 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #115 from Karsten Bräckelmann guent...@rudersport.de 2009-10-20 
08:46:55 UTC ---
Err, sure. :)  The following variations should just be dropped.

score KB_RATWARE_OUTLOOK_08  0
score KB_RATWARE_OUTLOOK_12  0
score KB_RATWARE_OUTLOOK_16  0
score KB_RATWARE_BOUNDARY    0

Keep KB_RATWARE_OUTLOOK_MID (instead of the above) and KB_RATWARE_MSGID (which
is an unrelated rule anyway).


[Bug 6155] generate new scores for 3.3.0 release

2009-10-20 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Adam Katz antis...@khopis.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |antis...@khopis.com

--- Comment #116 from Adam Katz antis...@khopis.com 2009-10-20 13:08:15 UTC 
---
Standing up for RDNS_NONE ...

http://ruleqa.spamassassin.org/week/RDNS_NONE/detail
bb_trec_enron has 98.9497% of its ham match RDNS_NONE, which is to say that
it's bogus.  Discounting that corpus, RDNS_NONE matches 58.7244% of the total
spam corpus and 1.7463% of the total ham corpus (down from 12.1273%), which
makes it far more interesting.  Many of the people on the sa-users list have
manually scored RDNS_NONE higher than the default 0.1.  I score it at 0.9 on my
own production servers.

(Not sure if this is the right venue -- or if I'm an approved kibitzer)



[Bug 6155] generate new scores for 3.3.0 release

2009-10-20 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #117 from Karsten Bräckelmann guent...@rudersport.de 2009-10-20 
13:17:26 UTC ---
 bb_trec_enron has 98.9497% of its ham match RDNS_NONE, which is to say that
 it's bogus.

Indeed. From the dev list earlier today, that's a corpus with generated
(synthetic) headers [...], only useful for body hits, and is not included in
the re-scoring.

 Many of the people on the sa-users list have
 manually scored RDNS_NONE higher than the default 0.1.

FWIW, nailed to 0.1 as per comment 56.


[Bug 6155] generate new scores for 3.3.0 release

2009-10-20 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #119 from Warren Togami wtog...@redhat.com 2009-10-20 13:47:28 
UTC ---
(In reply to comment #118)
 ... despite the current corpus data (unless 1.7% is a high ham hit-rate)?

http://ruleqa.spamassassin.org/20091017-r826198-n/RDNS_NONE/detail
The most recent weekly run has pretty substantial hits even outside of the
synthetic corpus.

Adam, this rule, like your RCVD_IN_APNIC, is an example of an inherently
prejudiced rule.  It might work for the most part, and you might accept the
risk of accidental FPs because the score alone won't push a message above the
threshold.  However, the combined risk of multiple prejudiced rules is too
great.  Whether to enable prejudiced rules should be up to the sysadmin.  We
should not score any known prejudiced rule highly in the default ruleset.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-20 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #118 from Adam Katz antis...@khopis.com 2009-10-20 13:38:04 UTC 
---
(In reply to comment #117)
  bb_trec_enron has 98.9497% of its ham match RDNS_NONE, which is to say
  that it's bogus.
 
 Indeed. From the dev list earlier today, that's a corpus with generated
 (synthetic) headers [...], only useful for body hits, and is not included
 in the re-scoring.

Ah, I thought I saw that corpus mentioned somewhere ... I only thought to search
the bug.  I had assumed that if the ruleqa page mentioned it, it was factored
in everywhere.

  Many of the people on the sa-users list have
  manually scored RDNS_NONE higher than the default 0.1.
 
 FWIW, nailed to 0.1 as per comment 56.

I saw that but did not understand it ...  It says most of these are already
documented and labeled as [fixed/immutable] but it doesn't say where.  Is this
because it triggers when rDNS checks aren't performed by the first trusted
relay, and if so, can we work around that somehow (wasn't that bug 5586 )?

Or is this a remnant of Justin's checkin r497852 from 2007 which states:
 move 20_dynrdns.cf from sandbox into main ruleset, so RDNS_DYNAMIC
 and RDNS_NONE are core rules; lock their scores to an informational
 0.1, however, since they still have a high ham hit-rate alone 

... despite the current corpus data (unless 1.7% is a high ham hit-rate)?



[Bug 6155] generate new scores for 3.3.0 release

2009-10-20 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #120 from Adam Katz antis...@khopis.com 2009-10-20 16:25:36 UTC 
---
(In reply to comment #119)
 (In reply to comment #118)
 ... despite the current corpus data (unless 1.7% is a high ham hit-rate)?
 
 http://ruleqa.spamassassin.org/20091017-r826198-n/RDNS_NONE/detail
 The most recent weekly run has pretty substantial hits even outside of
 the synthetic corpus.

Your link is just a longer version of mine.  It still results in a 1.7% total
ham hit-rate.  Is that too substantial?  Is there detail on what each corpus is
(specifically nbebout, since that's the only other corpus that hit 4+% of
spam)?

Looking only at ham scoring 4 or higher (including enron since I can't remove
it), RDNS_NONE hit 0.8528% of the total ham corpus.  Of the ham scoring JUST 4
(4.0-4.9), we're looking at 0.5865% that would become FPs assuming a score
of 1.1 (increasing the 0.1 by 1), and I'm not even proposing my own
implementation's 0.9.
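The estimate above amounts to counting ham that currently scores just under the threshold and would cross it once the rule's score rises. A rough sketch of that count, assuming a threshold of 5.0 and a list of total scores for ham messages that hit the rule. The scores below are invented for illustration and the helper name `newly_flagged` is hypothetical.

```python
def newly_flagged(ham_scores, threshold=5.0, delta=1.0):
    """Return ham scores that become false positives after adding `delta`.

    A message is treated as spam when its score is >= threshold, so a ham
    message is newly flagged if it was below the threshold before the
    change and reaches it afterwards.
    """
    return [s for s in ham_scores if s < threshold <= s + delta]

# Made-up total scores of ham messages that hit the rule in question.
scores = [1.2, 3.9, 4.0, 4.5, 4.9, 5.3]

crossing = newly_flagged(scores, threshold=5.0, delta=1.0)
print(crossing)                      # the 4.0-4.9 band
print(len(crossing) / len(scores))   # share of these ham hits that flip
```

Ham already at or above the threshold (5.3 here) is not counted again, since it was a false positive before the change.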

 Adam, this [... and] your RCVD_IN_APNIC are examples of inherently
 prejudiced rules. It might work for the most part, and you might accept
 the risk of accidental FP's because the score alone wont push it above
 the threshold. However the combined risks of multiple prejudiced rules
 is too great. Prejudiced rules should be up to the sysadmin if they want
 to enable.  We should not highly score any known prejudiced rules in the
 default ruleset.

I agree that RCVD_VIA_APNIC is a prejudiced rule, and my channels have had it
rated 0.001 ever since you called me out on it (RCVD_VIA_APNIC accidentally
came in when I migrated from an internal-only propagation to a published
channel).  KHOP_NO_FIRST_NAME, my other poorly-considered published test,
pre-dates my more thorough testing mechanism (which has limited new rules'
entry quite considerably).  My rules will get even more cleaned up once I get
an svn account to test them here.  (Some of them, like the biased RCVD_IN_APNIC
and quasi-biased/unfair KHOP_SC_CIDR8, would either never get pushed up for
testing or would get the nopublish flag, depending on the guidelines ... that
nobody has yet pointed me to.)  (Side note: I see __RCVD_VIA_APNIC is already
in your own sandbox, hitting 86% of all Japanese ham.)

Getting back to this issue:  I don't see any problem with prejudice against
poorly constructed network infrastructures that can't bother to adhere to the
SMTP standard (RFC1912 section 2.1).  This is something that any network admin
who should legitimately be managing a mail server should be able to fix with a
single phone call (please correct me if this sentence is prejudiced in any
way).

The SMTP standard requires that a server's rDNS match the server's reported
name (thus the IP must have rDNS), and most allocated IPs have rDNS anyway
(even if they're wrong or ~dynamic, e.g. RDNS_DYNAMIC).  There is also a
growing number of deployments that block improper FCrDNS at the door (RDNS_NONE
is a subset of failing FCrDNS).

SA already has built-in prejudices against poorly constructed email clients
(e.g. MISSING_HEADERS) and relays (e.g. DATE_IN_FUTURE_48_96), so why not the
network?  Isn't SPF_FAIL a prejudiced test against network configuration?

SA at its core is merely a system of probabilities.  Even without bayes, the
masscheck mechanism and its points are awarded based on statistical
significance.  Very few rules are actually free of FPs (or FNs for negative
rules).  My question still stands:  what does SA deem statistically significant
when it comes to FPs?  Why does RDNS_NONE need to be immutable rather than
dictated by the masscheck results?  What would the automated system score
RDNS_NONE if it were allowed to?  I'm guessing something between 0.2 and 0.7.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-20 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #121 from Warren Togami wtog...@redhat.com 2009-10-20 19:00:36 
UTC ---
(In reply to comment #120)
 I agree that RCVD_VIA_APNIC is a prejudiced rule, and my channels have had it
 rated 0.001 ever since you called me out on it (RCVD_VIA_APNIC accidentally

OK, glad to hear that you reduced it.  I didn't look at your scores after that
first time.  You should really get a spamassassin account so your rules can be
more thoroughly tested against more varied corpora.

 nobody has yet pointed me to.)  (Side note: I see __RCVD_VIA_APNIC is already
 in your own sandbox, hitting 86% of all Japanese ham.)

Yes, I'm using it as a softener to exclude from the extremely prejudiced
CN_NUMBER rules.  It just so happens that the majority of CN_NUMBER spam
comes from !APNIC, and APNIC is prejudiced in exactly the way to make
CN_NUMBER rules less dangerous.  Even though those rules have high spam hit
rates and zero FPs across our nightly masscheck corpora, they are still too
prejudiced to be safe as default rules.

 SA at its core is merely a system of probabilities.  Even without bayes, the
 masscheck mechanism and its points are awarded based on statistical
 significance.  Very few rules are actually free of FPs (or FNs for negative
 rules).  My question still stands:  what does SA deem statistically significant
 when it comes to FPs?  Why does RDNS_NONE need to be immutable rather than
 dictated by the masscheck results?  What would the automated system score
 RDNS_NONE if it were allowed to?  I'm guessing something between 0.2 and 0.7.

That is an interesting question.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-19 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #101 from Justin Mason j...@jmason.org 2009-10-19 07:53:59 UTC ---
(In reply to comment #98)
 The RCVD_IN_DNSWL_* scores are again unusual:
   score RCVD_IN_DNSWL_HI  0 -0.466 0 -0.001
   score RCVD_IN_DNSWL_LOW 0 -0.292 0 -0.760
   score RCVD_IN_DNSWL_MED 0 -1.703 0 -0.727
 
 probably because of their low frequency, especially the _HI rule:
 OVERALL   SPAM%     HAM%     S/O    RANK   SCORE  NAME
   0.184   0.0007   0.5707   0.001   0.76   -1.00  RCVD_IN_DNSWL_HI
   7.408   0.1096  22.7509   0.005   0.67   -1.00  RCVD_IN_DNSWL_MED
   2.553   0.1816   7.5365   0.024   0.59   -1.00  RCVD_IN_DNSWL_LOW
 
 and resulting zero ranges (tmp/ranges.data):
   0.000 0.000 0 RCVD_IN_DNSWL_HI
   0.000 0.000 0 RCVD_IN_DNSWL_MED
   0.000 0.000 0 RCVD_IN_DNSWL_LOW
 
 Don't know what a clean solution is, apart from fixing their scores
 manually.

feel free to fix them; it's hard for the GA to be mostly right about network
rules.  tbh I'm surprised the ranges were zeroed (for _MED at least).



[Bug 6155] generate new scores for 3.3.0 release

2009-10-19 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #102 from Justin Mason j...@jmason.org 2009-10-19 07:55:57 UTC ---
(In reply to comment #99)
 I just discovered that I was falsely triggering rules like RCVD_IN_SORBS_DUL, 
 RCVD_IN_PBL or RDNS_DYNAMIC on some of my corpus ham due to a misconfiguration
 on my server.  My users delivering mail directly to other users on my server
 from their home ISP or mobile phone were lacking authenticated user within
 the Received header causing many hits on these and unknown other rules. 
 Roughly ~150-170 of my FP's on these three rules should not count against 
 those
 rules.  Nearly all of my RCVD_IN_SORBS_DUL and RCVD_IN_PBL should have been
 AllTrusted instead.  Is this enough to throw off the GA scoring?

if you want, feel free to sed the log files to fix this, or just remove the
lines entirely, and reupload.  170 FPs for those DUL rules is quite strong imo.
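For the "remove the lines entirely" option, a small filter along these lines might do. This is only a sketch. It assumes each result line is whitespace-separated with the comma-separated rule hits in the fourth field, which should be checked against the actual log files first. The sample lines and the helper name `drop_lines_hitting` are hypothetical.

```python
def drop_lines_hitting(lines, bad_rules):
    """Drop result lines that hit any rule in `bad_rules`.

    Assumed line layout (verify before use):
        <result> <score> <path> <RULE_A,RULE_B,...> <metadata>
    Comment lines and lines with fewer than four fields are kept as-is.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if line.startswith("#") or len(fields) < 4:
            kept.append(line)
            continue
        hits = set(fields[3].split(","))
        if hits.isdisjoint(bad_rules):
            kept.append(line)
    return kept

# Invented ham log lines for illustration only.
log = [
    "# mass-check log",
    ". 1 /corpus/ham/1 RCVD_IN_PBL,RDNS_DYNAMIC time=1",
    ". 0 /corpus/ham/2 ALL_TRUSTED time=2",
    ". 2 /corpus/ham/3 RCVD_IN_SORBS_DUL time=3",
]
bad = {"RCVD_IN_SORBS_DUL", "RCVD_IN_PBL", "RDNS_DYNAMIC"}

for line in drop_lines_hitting(log, bad):
    print(line)
```

Dropping whole lines also removes those messages' hits on every other rule, so it slightly shrinks the ham corpus. That is the trade-off against editing individual fields.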



[Bug 6155] generate new scores for 3.3.0 release

2009-10-19 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #103 from Warren Togami wtog...@redhat.com 2009-10-19 10:31:26 
UTC ---
 if you want, feel free to sed the log files to fix this, or just remove the
 lines entirely, and reupload.  170 FPs for those DUL rules is quite strong imo.

Removed the majority of the offending lines and reuploaded ham-rescore-wt*.log.

I also zeroed out *wt-en6.log because they were found to be too corrupted to
trust the results.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-19 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #104 from Mark Martinec mark.marti...@ijs.si 2009-10-19 11:28:49 
UTC ---
(In reply to comment #103)
 Removed the majority of the offending lines and reuploaded 
 ham-rescore-wt*.log. 
 I also zeroed out *wt-en6.log because they were found to be too corrupted to
 trust the results.

Thanks. Seems you did it in the 'corpus' rsync directory. Please also update
them in the 'submit' directory using existing names, otherwise in a few weeks'
time we'll all forget which file came from where - after all, the 'submit'
directory is the official source for rescoring runs.

-- 
Configure bugmail: 
https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are the assignee for the bug.


[Bug 6155] generate new scores for 3.3.0 release

2009-10-19 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #105 from Karsten Bräckelmann guent...@rudersport.de 2009-10-19 
12:21:56 UTC ---
Argh, late to the show, sorry. :-/  From the second GA re-score run, attachment
4553 (aligned for readability):

score KB_RATWARE_MSGID   4.099 3.315 4.095 1.475

This is awesome! :)  Though unrelated, so let me move on to the issue.


score KB_RATWARE_OUTLOOK_08  1.100 3.232 0.776 0.025
score KB_RATWARE_OUTLOOK_12  2.734 2.826 1.654 0.041
score KB_RATWARE_OUTLOOK_16  1.725 3.331 2.532 0.887
score KB_RATWARE_OUTLOOK_MID 2.259 2.485 3.121 0.001

This is also awesome -- kind of. But frankly, it also is a total mess. They are
essentially the same, just slightly differing in strictness or fuzziness. They
are almost *exactly* overlapping -- *all* four of them (see ruleqa).
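To put a number on "almost exactly overlapping", one option is the Jaccard index of the sets of messages each rule hits. This is a different measure from the per-rule overlap percentages shown on ruleqa, and the message IDs below are invented. It is meant only as a sketch of the idea.

```python
def overlap(hits_a, hits_b):
    """Jaccard index: shared hits divided by all hits of either rule."""
    a, b = set(hits_a), set(hits_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical message IDs hit by two rule variants.
outlook_08 = {"m1", "m2", "m3", "m4"}
outlook_mid = {"m2", "m3", "m4", "m5"}

print(overlap(outlook_08, outlook_mid))   # 3 shared out of 5 distinct
print(overlap(outlook_08, outlook_08))    # identical hit sets
```

A value close to 1.0 for every pair would support keeping a single variant and dropping the rest.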

These rules are really redundant, and there should be only one instead. FWIW,
that *should* be KB_RATWARE_BOUNDARY, which was added specifically for this.
This rule seems to be missing entirely, though. :(

Looking at the scores, I don't think simply adding them would do.

Also, I'm kind of un-satisfied with the score-set 3 scores. The FP rate is 0!
(Almost, I'll challenge the ham hits.) For all five rules above. Net tests or
not...


[Bug 6155] generate new scores for 3.3.0 release

2009-10-19 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #106 from Warren Togami wtog...@redhat.com 2009-10-19 12:35:30 
UTC ---
 Thanks. Seems you did it in the 'corpus' rsync directory. Please also update
 them in the 'submit' directory using existing names, otherwise in a few weeks'
 time we'll all forget which file came from where - after all, the 'submit'
 directory is the official source for rescoring runs.

Fixed in 'submit'.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-19 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #107 from Justin Mason j...@jmason.org 2009-10-19 14:26:25 UTC ---
(In reply to comment #105)
 score KB_RATWARE_OUTLOOK_08  1.100 3.232 0.776 0.025
 score KB_RATWARE_OUTLOOK_12  2.734 2.826 1.654 0.041
 score KB_RATWARE_OUTLOOK_16  1.725 3.331 2.532 0.887
 score KB_RATWARE_OUTLOOK_MID 2.259 2.485 3.121 0.001
 
 This is also awesome -- kind of. But frankly, it also is a total mess. They are
 essentially the same, just slightly differing in strictness or fuzziness. They
 are almost *exactly* overlapping -- *all* four of them (see ruleqa).
 
 These rules are really redundant, and there should be only one instead. FWIW,
 that *should* be KB_RATWARE_BOUNDARY, which was added specifically for this.
 This rule seems to be missing entirely, though. :(
 
 Looking at the scores, I don't think simply adding them would do.
 
 Also, I'm kind of un-satisfied with the score-set 3 scores. The FP rate is 0!
 (Almost, I'll challenge the ham hits.) For all five rules above. Net tests or
 not...

it looks like they overlap a lot with some other rules.  But yes, if they were
just 1 rule, it probably would have gotten a better single score.

I'm not sure if it's too late to fix this or not. :(



[Bug 6155] generate new scores for 3.3.0 release

2009-10-19 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #108 from Karsten Bräckelmann guent...@rudersport.de 2009-10-19 
14:49:17 UTC ---
(In reply to comment #107)
 it looks like they overlap a lot with some other rules.  But yes, if they were
 just 1 rule, it probably would have gotten a better single score.
 
 I'm not sure if it's too late to fix this or not. :(

Frankly, pretty much either one could be used and all other variants simply be
dropped for the next re-score run. Keeping all of them is just a waste of
cycles.

The important questions are, where is KB_RATWARE_BOUNDARY, which was
specifically pushed right before the deadline to supersede these?

And of course, why do the scores drop that drastically with score-set 3, if
there is *no* FP? Regardless of the spam already scoring above 5, there is no
FP reason to lower the score.
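For readers following the four-column score lines: the columns correspond to score sets 0 to 3, which this bug lists as (no net, no bayes), (net, no bayes), (no net, bayes) and (net, bayes). A small sketch of picking the effective value from such a line. The helper name `score_for` is hypothetical and not part of SpamAssassin.

```python
def score_for(line, net, bayes):
    """Return the score that applies for the given net/bayes combination.

    Accepts 'score NAME v' (one value for all sets) or
    'score NAME v0 v1 v2 v3' (one value per score set).
    """
    parts = line.split()
    if len(parts) < 3 or parts[0] != "score":
        raise ValueError("not a score line: %r" % line)
    values = [float(v) for v in parts[2:]]
    if len(values) == 1:
        values = values * 4          # single value applies to every set
    if len(values) != 4:
        raise ValueError("expected 1 or 4 values: %r" % line)
    index = (2 if bayes else 0) + (1 if net else 0)
    return values[index]

line = "score KB_RATWARE_OUTLOOK_MID 2.259 2.485 3.121 0.001"
print(score_for(line, net=False, bayes=False))  # score set 0
print(score_for(line, net=True, bayes=True))    # score set 3
```

With net tests and Bayes both enabled, the MID rule from comment 105 contributes only 0.001, which is the drop being questioned here.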


[Bug 6155] generate new scores for 3.3.0 release

2009-10-19 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #109 from Karsten Bräckelmann guent...@rudersport.de 2009-10-19 
15:37:16 UTC ---
(In reply to comment #108)
 The important questions are, where is KB_RATWARE_BOUNDARY, which was
 specifically pushed right before the deadline to supersede these?

Argh!  It is in freqs.full, attachment 4541. However, it appears we've been
using inconsistent rule-sets, with most contributors using one outdated
rule-set or the other. :-(

 10.830  14.1437   0.1901   0.987   0.67   0.00  T_KB_RATWARE_BOUNDARY
  0.025   0.0327   0.0000   1.000   0.65   1.00  KB_RATWARE_BOUNDARY


[Bug 6155] generate new scores for 3.3.0 release

2009-10-15 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #100 from Mark Martinec mark.marti...@ijs.si 2009-10-15 11:56:23 
UTC ---
Btw, I added a Target Milestone 3.3.1, so that a triage on 3.3.0 bugs
may be more selective, choosing between Future/Undefined/3.3.1



[Bug 6155] generate new scores for 3.3.0 release

2009-10-14 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Mark Martinec mark.marti...@ijs.si changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #4550|0                           |1
        is obsolete|                            |

--- Comment #96 from Mark Martinec mark.marti...@ijs.si 2009-10-14 16:21:44 
UTC ---
Created an attachment (id=4553)
 -- (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4553)
resulting 50_scores.cf from garescorer runs - V2

Here is now a 50_scores.cf from my second attempt after cleaning some
logs: removed binnocenti and wt-en6 logs as per Comment 93, removed
DKIM_ADSP_DISCARD hits from ham-bayes-net-bluestreak.log. I have also
limited the log entries to fewer months following the RescoreMassCheck
(wiki): -m 6 for spam, and -m 25 for ham (after 25th month there is a
large gap in data till the next peak, too far in the past).

This leaves us with the following number of entries in merged logs:
score set 1 (no data from score set 3), provides data for set0 and set1:
  360070 ham-full-set1.log
  472682 spam-full-set1.log
score set 3, provides data for set2 and set3:
  210603 ham-full-set3.log
  442709 spam-full-set3.log

For DCC_ rules, I took the DCC_CHECK value of 1.1 from a preliminary run
which had all the DCC_REPUT_* scores fixed at 0, then for the next run
I fixed the DCC_CHECK, but left the DCC_REPUT_* scores floating. This
should cope with both types of sites: those with a commercial license
that do receive reputation results from DCC servers, and those who don't.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-14 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #97 from Mark Martinec mark.marti...@ijs.si 2009-10-14 16:29:29 
UTC ---
gen-set0-5-5.0-1-ga
test (10%)
# SUMMARY for threshold 5.0:
# Correctly non-spam:  35461  98.50%
# Correctly spam:  38357  81.35%
# False positives:   541  1.50%
# False negatives:  8794  18.65%
# TCR(l=50): 1.315450  SpamRecall: 81.349%  SpamPrec: 98.609%
scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 283119  42.494%  (98.304% of non-spam corpus)
# Correctly spam: 306367  45.984%  (80.997% of spam corpus)
# False positives:  4886  0.733%  (1.696% of nonspam, 179777 weighted)
# False negatives: 71879  10.789%  (19.003% of spam, 231331 weighted)
# Average score for spam:  10.4    nonspam: 1.7
# Average for false-pos:   5.6  false-neg: 3.2
# TOTAL:  666251  100.00%

gen-set1-10-5.0-1-ga
test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  35942  99.83%
# Correctly spam:  45983  97.52%
# False positives:60  0.17%
# False negatives:  1168  2.48%
# TCR(l=50): 11.312620  SpamRecall: 97.523%  SpamPrec: 99.870%
scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 287639  43.173%  (99.873% of non-spam corpus)
# Correctly spam: 368783  55.352%  (97.498% of spam corpus)
# False positives:   366  0.055%  (0.127% of nonspam,  27040 weighted)
# False negatives:  9463  1.420%  (2.502% of spam,  29645 weighted)
# Average score for spam:  20.3    nonspam: 0.2
# Average for false-pos:   5.6  false-neg: 3.1
# TOTAL:  666251  100.00%

gen-set2-10-5.0-1-ga
test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  35949  99.85%
# Correctly spam:  44538  94.46%
# False positives:    53  0.15%
# False negatives:  2613  5.54%
# TCR(l=50): 8.958959  SpamRecall: 94.458%  SpamPrec: 99.881%
scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 287557  43.160%  (99.844% of non-spam corpus)
# Correctly spam: 357656  53.682%  (94.556% of spam corpus)
# False positives:   448  0.067%  (0.156% of nonspam,  33456 weighted)
# False negatives: 20590  3.090%  (5.444% of spam,  73371 weighted)
# Average score for spam:  12.3    nonspam: 0.8
# Average for false-pos:   5.7  false-neg: 3.6
# TOTAL:  666251  100.00%

gen-set3-20-5.0-1-ga
test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  21173  99.92%
# Correctly spam:  43749  99.08%
# False positives:    17  0.08%
# False negatives:   404  0.92%
# TCR(l=50): 35.209729  SpamRecall: 99.085%  SpamPrec: 99.961%
scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 168159  32.186%  (99.976% of non-spam corpus)
# Correctly spam: 350875  67.159%  (99.046% of spam corpus)
# False positives:    40  0.008%  (0.024% of nonspam,   9039 weighted)
# False negatives:  3379  0.647%  (0.954% of spam,  11476 weighted)
# Average score for spam:  19.3    nonspam: -0.8
# Average for false-pos:   5.4  false-neg: 3.4
# TOTAL:  522453  100.00%

===
In summary, the essential data:

score set 0 (no net, no bayes):
# False positives:  4886  0.733%  (1.696% of nonspam, 179777 weighted)
# False negatives: 71879  10.789%  (19.003% of spam, 231331 weighted)

score set 1 (net, no bayes):
# False positives:   366  0.055%  (0.127% of nonspam,  27040 weighted)
# False negatives:  9463  1.420%  (2.502% of spam,  29645 weighted)

score set 2 (no net, bayes):
# False positives:   448  0.067%  (0.156% of nonspam,  33456 weighted)
# False negatives: 20590  3.090%  (5.444% of spam,  73371 weighted)

score set 3 (net, bayes):
# False positives:    40  0.008%  (0.024% of nonspam,   9039 weighted)
# False negatives:  3379  0.647%  (0.954% of spam,  11476 weighted)
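
The percentages in these summary lines can be reproduced from the raw counts.
A minimal Python sketch, using the score set 3 counts from the "scores"
summary above; the function name error_rates and its keyword arguments are
made up for illustration and are not part of the masses tools:

```python
def error_rates(ham_ok, spam_ok, false_pos, false_neg):
    """Return FP/FN rates as percentages of the whole corpus and of each class."""
    ham_total = ham_ok + false_pos      # every message that really is ham
    spam_total = spam_ok + false_neg    # every message that really is spam
    total = ham_total + spam_total
    return {
        "fp_pct_of_total": 100.0 * false_pos / total,
        "fp_pct_of_ham": 100.0 * false_pos / ham_total,
        "fn_pct_of_total": 100.0 * false_neg / total,
        "fn_pct_of_spam": 100.0 * false_neg / spam_total,
    }


# Counts taken from the score set 3 "scores" summary (threshold 5.0).
set3 = error_rates(ham_ok=168159, spam_ok=350875, false_pos=40, false_neg=3379)
for name, value in set3.items():
    print(f"{name}: {value:.3f}%")
```

The "weighted" figures are not reproduced here, since the weighting used by
the scripts is not stated in this thread.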



[Bug 6155] generate new scores for 3.3.0 release

2009-10-14 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #98 from Mark Martinec mark.marti...@ijs.si 2009-10-14 16:48:26 
UTC ---
The RCVD_IN_DNSWL_* scores are again unusual:
  score RCVD_IN_DNSWL_HI  0 -0.466 0 -0.001
  score RCVD_IN_DNSWL_LOW 0 -0.292 0 -0.760
  score RCVD_IN_DNSWL_MED 0 -1.703 0 -0.727

probably because of their low frequency, especially the _HI rule:
OVERALL   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  0.184   0.0007   0.5707   0.001   0.76   -1.00  RCVD_IN_DNSWL_HI
  7.408   0.1096  22.7509   0.005   0.67   -1.00  RCVD_IN_DNSWL_MED
  2.553   0.1816   7.5365   0.024   0.59   -1.00  RCVD_IN_DNSWL_LOW

and resulting zero ranges (tmp/ranges.data):
  0.000 0.000 0 RCVD_IN_DNSWL_HI
  0.000 0.000 0 RCVD_IN_DNSWL_MED
  0.000 0.000 0 RCVD_IN_DNSWL_LOW

Don't know what a clean solution is, apart from fixing their scores
manually.
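
For reading the frequency table above: the S/O column looks consistent with
the spam hit rate divided by the sum of the spam and ham hit rates. That
formula is an assumption here (it happens to reproduce the three values
shown), and the helper name so_ratio is made up for this sketch:

```python
def so_ratio(spam_pct, ham_pct):
    """Spam-to-overall ratio: near 1.0 means spam-only hits, near 0.0 ham-only."""
    total = spam_pct + ham_pct
    if total == 0:
        return 0.0  # the rule never hit; the ratio is undefined, report 0
    return spam_pct / total


# (SPAM%, HAM%) pairs copied from the table above.
dnswl_rows = {
    "RCVD_IN_DNSWL_HI": (0.0007, 0.5707),
    "RCVD_IN_DNSWL_MED": (0.1096, 22.7509),
    "RCVD_IN_DNSWL_LOW": (0.1816, 7.5365),
}

for rule, (spam_pct, ham_pct) in dnswl_rows.items():
    print(f"{so_ratio(spam_pct, ham_pct):.3f}  {rule}")
```

All three rules hit almost exclusively on ham, and _HI hits well under 1%
of ham, which fits the low-frequency explanation given above.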



[Bug 6155] generate new scores for 3.3.0 release

2009-10-14 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #99 from Warren Togami wtog...@redhat.com 2009-10-14 21:58:58 UTC 
---
I just discovered that I was falsely triggering rules like RCVD_IN_SORBS_DUL,
RCVD_IN_PBL or RDNS_DYNAMIC on some of my corpus ham due to a misconfiguration
on my server.  Mail that my users delivered directly to other users on my
server from their home ISP or mobile phone was lacking the authenticated user
in the Received header, causing many hits on these and possibly other,
unidentified rules.  Roughly 150-170 of my FP's on these three rules should
not count against those rules.  Nearly all of my RCVD_IN_SORBS_DUL and
RCVD_IN_PBL hits should have been AllTrusted instead.  Is this enough to throw
off the GA scoring?



[Bug 6155] generate new scores for 3.3.0 release

2009-10-11 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #93 from Warren Togami wtog...@redhat.com 2009-10-11 00:01:01 UTC 
---
Bad news.  Please remove the binnocenti logs from the rescore masschecks.
Working with him we discovered 50+ additional spam in his ham folders and there
is certainly more.  Furthermore his ham contains lots of automated low quality
sources like Bugzilla, trac, cron and log monitoring daemons that should
probably be removed from ham corpora.  It seems the incorrect FP's and bias
introduced by this corpus could be large enough to throw off scoring.

Did you also remove wt-en6 after we discovered that copying mail from a Yahoo
account corrupts the messages?



[Bug 6155] generate new scores for 3.3.0 release

2009-10-11 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

Matthias Leisi matth...@leisi.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matth...@leisi.net

--- Comment #94 from Matthias Leisi matth...@leisi.net 2009-10-11 02:19:21 
UTC ---
(In reply to comment #56)
> Here is a set of rules in 50_scores.cf that I ended up as fixed (immutable)
> for the GA run (score set 3). Most of these are already documented and labeled
> as such, but it doesn't hurt to post it here as a double-check.

I suspect that RCVD_IN_DNSWL_* should be immutable as well; the generated
scores come out in a counter-intuitive order (expected order _HI, _MED, _LOW;
observed order _MED, _HI, _LOW).

https://svn.apache.org/repos/asf/spamassassin/trunk/rules/50_scores.cf has the
following outside the gen:mutable section:

| score RCVD_IN_DNSWL_LOW 0 -1 0 -1
| score RCVD_IN_DNSWL_MED 0 -4 0 -4
| score RCVD_IN_DNSWL_HI 0 -8 0 -8

The DNSWL stats posted by Warren to the users list seem to indicate that this
should be the correct ordering (at least based on safety):

| SPAM%   HAM%    RANK RULE
| 0.0016% 4.2489% 0.91 RCVD_IN_DNSWL_HI
| 0.0281% 6.9639% 0.90 RCVD_IN_DNSWL_MED
| 0.1147% 3.9169% 0.81 RCVD_IN_DNSWL_LOW
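
A check for this ordering could be run on generated scores before they are
accepted. A small Python sketch, assuming "correct ordering" means that a
higher trust level never gets a weaker (less negative) score; the function
name and the swapped example values are hypothetical, only the fixed scores
come from 50_scores.cf as quoted above:

```python
def dnswl_order_ok(scores):
    """True when _HI is at least as negative as _MED, and _MED as _LOW."""
    return (
        scores["RCVD_IN_DNSWL_HI"]
        <= scores["RCVD_IN_DNSWL_MED"]
        <= scores["RCVD_IN_DNSWL_LOW"]
    )


# Fixed scores from outside the gen:mutable section (one score set shown).
fixed = {
    "RCVD_IN_DNSWL_LOW": -1.0,
    "RCVD_IN_DNSWL_MED": -4.0,
    "RCVD_IN_DNSWL_HI": -8.0,
}

# Hypothetical GA output in which _MED ends up stronger than _HI.
swapped = dict(fixed, RCVD_IN_DNSWL_HI=-4.0, RCVD_IN_DNSWL_MED=-8.0)

print(dnswl_order_ok(fixed))    # ordering holds
print(dnswl_order_ok(swapped))  # ordering violated
```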


[Bug 6155] generate new scores for 3.3.0 release

2009-10-11 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #95 from Warren Togami wtog...@redhat.com 2009-10-11 07:03:21 UTC 
---
(In reply to comment #94)
> The DNSWL stats posted by Warren to the users list seem to indicate that this
> should be the correct ordering (at least based on safety):
>
> | SPAM%   HAM%    RANK RULE
> | 0.0016% 4.2489% 0.91 RCVD_IN_DNSWL_HI
> | 0.0281% 6.9639% 0.90 RCVD_IN_DNSWL_MED
> | 0.1147% 3.9169% 0.81 RCVD_IN_DNSWL_LOW

These were yesterday's weekly results, not the rescore masscheck.  Weekly
results use a smaller sample and carry lower confidence.

http://ruleqa.spamassassin.org/20090930-r808953-n

SPAM%   HAM% RANK RULE
0.0002% 0.3651%  0.75 RCVD_IN_DNSWL_HI
0.0288% 18.6970% 0.79 RCVD_IN_DNSWL_MED
0.0753% 8.1433%  0.68 RCVD_IN_DNSWL_LOW

This was the rescore masscheck.


[Bug 6155] generate new scores for 3.3.0 release

2009-10-09 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #88 from Mark Martinec mark.marti...@ijs.si 2009-10-09 06:23:06 
PDT ---
>> The release notes could then say that one should lower the DKIM_ADSP_*
>> scores on installations where it is known that mail is not reaching
>> SpamAssassin in its pristine form (as received by the MTA).
>
> This case or old ham where the sender subsequently changed their DKIM policy
> is only an issue for masscheck, not production scanning.

True for the case of old ham where the sender subsequently changed their DKIM
policy, or for the case of expired signatures: these are only an issue with
masscheck.

...but not for the case of wt-en6, where mail is transformed on its path
through webmail. This is an issue for masschecks as well as for production
runs.

> Lowering the DKIM scores makes no sense then?

If one knows that mail reaching SpamAssassin will be modified along its mail
path, then one must disable rules that target mail forgery and depend on
pristine mail, such as the DKIM_ADSP_DISCARD rule. Otherwise the rule would
generate FP score points for legitimate mail from domains publishing ADSP
(explicitly or through overrides).



[Bug 6155] generate new scores for 3.3.0 release

2009-10-09 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #89 from Mark Martinec mark.marti...@ijs.si 2009-10-09 06:38:09 
PDT ---
Created an attachment (id=4550)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4550)
resulting 50_scores.cf from garescorer runs

Ok, here it is at last: the auto-generated 50_scores.cf from garescorer runs
on all four sets, with no hand-tweaking of results (yet). It gives us
something to digest and comment on, and can serve as a first approximation.
Some values are surprising or plain wrong; I'll comment on some later.

I used the submitted logs (tweaked as per Comment 78), with all the recent
updates to them as posted so far in this ticket. I left the BAYES scores
fully floating. I fixed the DCC_REPUT_* and JM_SOUGHT_FRAUD_* scores at zero,
as was discussed previously (see the end of the attached file).
Eventually these will need to be set to some manually determined score.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-09 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #90 from Mark Martinec mark.marti...@ijs.si 2009-10-09 06:49:27 
PDT ---
To assess the quality and repeatability of results, here are the summaries
for all four score sets. Each pair consists of a normal run on 90% of the
log entries and a test run on the remaining 10%.
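
For readers unfamiliar with this kind of hold-out check, the idea can be
sketched in a few lines of Python. This is only an illustration: the function
name split_logs and the every-tenth-line rule are assumptions, and the actual
scripts may pick the 10% differently (for example at random):

```python
def split_logs(lines, test_every=10):
    """Send every test_every-th log line to the test set, the rest to training."""
    train, test = [], []
    for index, line in enumerate(lines):
        if index % test_every == 0:
            test.append(line)   # held out; never seen while generating scores
        else:
            train.append(line)  # used to generate the scores
    return train, test


# Stand-in for masscheck log lines; real logs have one message per line.
log_lines = [f"message-{n}" for n in range(100)]
train_lines, test_lines = split_logs(log_lines)
print(len(train_lines), len(test_lines))
```

If the FP and FN rates on the held-out part stay close to those on the
training part, the generated scores are not just fitted to the training logs.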

The most interesting figures are the FP and FN percentages, e.g. 0.028% and
0.961% in this clipping:
  # False positives: 65  0.011%  (0.028% of nonspam,  10580 weighted)
  # False negatives:   3411  0.578%  (0.961% of spam,  12054 weighted)


==
gen-set0-5-5.0-25000-ga
SCORESET 0: (no net, no bayes)

test (10%):
# SUMMARY for threshold 5.0:
# Correctly non-spam:  45335  98.03%
# Correctly spam:  39320  81.61%
# False positives:   913  1.97%
# False negatives:  8860  18.39%
# TCR(l=50): 0.883875  SpamRecall: 81.611%  SpamPrec: 97.731%

scores (90%):
# SUMMARY for threshold 5.0:
# Correctly non-spam: 365397  48.193%  (98.401% of non-spam corpus)
# Correctly spam: 314466  41.476%  (81.286% of spam corpus)
# False positives:  5936  0.783%  (1.599% of nonspam, 173347 weighted)
# False negatives: 72396  9.548%  (18.714% of spam, 226867 weighted)
# Average score for spam:  10.0    nonspam: 1.4
# Average for false-pos:   5.6  false-neg: 3.1
# TOTAL:  758195  100.00%

==
gen-set1-10-5.0-3-ga
SCORESET 1: (net, no bayes)

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  46183  99.86%
# Correctly spam:  46648  96.82%
# False positives:    65  0.14%
# False negatives:  1532  3.18%
# TCR(l=50): 10.075282  SpamRecall: 96.820%  SpamPrec: 99.861%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 370804  48.906%  (99.858% of non-spam corpus)
# Correctly spam: 374579  49.404%  (96.825% of spam corpus)
# False positives:   529  0.070%  (0.142% of nonspam,  31804 weighted)
# False negatives: 12283  1.620%  (3.175% of spam,  39385 weighted)
# Average score for spam:  17.4    nonspam: 0.4
# Average for false-pos:   5.8  false-neg: 3.2
# TOTAL:  758195  100.00%


==
gen-set2-10-5.0-3-ga
SCORESET 2: (no net, bayes)

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  29308  99.78%
# Correctly spam:  42344  95.69%
# False positives:    64  0.22%
# False negatives:  1907  4.31%
# TCR(l=50): 8.664774  SpamRecall: 95.690%  SpamPrec: 99.849%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 234375  39.745%  (99.864% of non-spam corpus)
# Correctly spam: 339736  57.612%  (95.700% of spam corpus)
# False positives:   320  0.054%  (0.136% of nonspam,  26164 weighted)
# False negatives: 15265  2.589%  (4.300% of spam,  58794 weighted)
# Average score for spam:  10.4    nonspam: 0.6
# Average for false-pos:   5.4  false-neg: 3.9
# TOTAL:  589696  100.00%


==
gen-set3-20-5.0-2-ga
SCORESET 3: (net, bayes)

test:
# SUMMARY for threshold 5.0:
# Correctly non-spam:  29342  99.90%
# Correctly spam:  43843  99.08%
# False positives:    30  0.10%
# False negatives:   408  0.92%
# TCR(l=50): 23.192348  SpamRecall: 99.078%  SpamPrec: 99.932%

scores:
# SUMMARY for threshold 5.0:
# Correctly non-spam: 234630  39.788%  (99.972% of non-spam corpus)
# Correctly spam: 351590  59.622%  (99.039% of spam corpus)
# False positives:    65  0.011%  (0.028% of nonspam,  10580 weighted)
# False negatives:  3411  0.578%  (0.961% of spam,  12054 weighted)
# Average score for spam:  18.5    nonspam: -0.1
# Average for false-pos:   5.4  false-neg: 3.5
# TOTAL:  589696  100.00%
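
The TCR, SpamRecall and SpamPrec figures in the test summaries can be
recomputed from the four counts. The sketch below assumes the usual total
cost ratio definition, n_spam / (lambda * FP + FN) with lambda = 50; that
formula is an assumption, but it reproduces the printed values. The function
names are made up for illustration:

```python
def tcr(spam_ok, false_pos, false_neg, lam=50):
    """Total cost ratio: spam count over the weighted error count."""
    n_spam = spam_ok + false_neg
    return n_spam / (lam * false_pos + false_neg)


def spam_recall(spam_ok, false_neg):
    """Fraction of all spam that was flagged as spam."""
    return spam_ok / (spam_ok + false_neg)


def spam_precision(spam_ok, false_pos):
    """Fraction of flagged messages that really were spam."""
    return spam_ok / (spam_ok + false_pos)


# Counts from the score set 3 test run above.
spam_ok, false_pos, false_neg = 43843, 30, 408
print(f"TCR(l=50): {tcr(spam_ok, false_pos, false_neg):.6f}")
print(f"SpamRecall: {100 * spam_recall(spam_ok, false_neg):.3f}%")
print(f"SpamPrec: {100 * spam_precision(spam_ok, false_pos):.3f}%")
```

With lambda = 50 a single false positive costs as much as fifty false
negatives, which is why the TCR reacts so strongly to small FP counts.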



[Bug 6155] generate new scores for 3.3.0 release

2009-10-09 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #92 from Warren Togami wtog...@redhat.com 2009-10-09 20:22:24 UTC 
---
(In reply to comment #89)
> Created an attachment (id=4550)
>  --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4550) [details]
> resulting 50_scores.cf from garescorer runs
>
> Ok, here it is at last, the auto-generated 50_scores.cf from garescorer runs
> on all four sets, with no hand-tweaking of results (yet) ... to give us
> something to digest and comment on, and can serve as the first approximation.
> Some values are surprising or plain wrong, I'll comment on some later.

Bug #6156 RCVD_IN_PSBL
We should manually adjust this score to somewhere between 2.0 and 2.5 for
these reasons:

* The rescore masschecks were run with deep parsing.  We have subsequently
changed it to lastexternal, which should be much safer.  Even with deep
parsing it proved to be very good.
* At the time of the rescore masschecks, PSBL's recent whitelist filtering of
gmail, yahoo, rr.com and several other major ISP's had not yet timed out
legitimate MTA's.  Safety should be improved further now.



[Bug 6155] generate new scores for 3.3.0 release

2009-10-08 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #83 from Henrik Krohns h...@hege.li 2009-10-08 01:02:43 PDT ---
Cleaned up my DKIM_ADSP_DISCARD hits (old 2005 ebay mails removed) and some
other old stuff, logs sent..



[Bug 6155] generate new scores for 3.3.0 release

2009-10-08 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #84 from Mark Martinec mark.marti...@ijs.si 2009-10-08 06:50:37 
PDT ---
> These are all legitimate looking paypal mail delivered to a Yahoo account from
> mid-2008 through recently.

Thanks Warren for your out-of-band mail. Apart from some general comments
from my previous posting, there is a real problem regarding your method of
fetching mail from a Yahoo account. You are using FetchYahoo to download
these messages from the Yahoo webmail interface. FetchYahoo has to jump
through hoops to be able to retrieve a message as close to its original form
as possible, but there are some real obstacles there. Glancing at its source
code, it has to pull attachments separately and splice them back together
into a message, necessarily reinventing the MIME boundaries. This is enough
to render DomainKeys and DKIM signatures invalid. Apart from this, it also
converts QP and base64 encoded messages into UTF-8 binary, which again is
a sufficient reason for signature breakage. Moreover, it has to repair some
damage to header field folding and empty lines, which are broken either due to
bugs in Yahoo HTML rendering (indicated by comments in the FetchYahoo code),
or because details are simply lost in the conversion to HTML and back to mail.

This method of fetching mail is bound to cause trouble. It may quite easily
cause some other low-level SpamAssassin rules to misfire or to fail to
trigger, not just the signature verification failures.
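
To illustrate why re-encoding alone breaks a signature, here is a small
Python sketch. It hashes a quoted-printable body as sent and the same body
after QP decoding, in the style of a DKIM body hash (SHA-256, base64). It is
a simplification: real DKIM canonicalizes the body first, which this sketch
skips, and the sample body and the function name body_hash are made up:

```python
import base64
import hashlib
import quopri


def body_hash(body: bytes) -> str:
    """Base64 SHA-256 digest of the body bytes (no canonicalization applied)."""
    return base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")


# Body as transmitted by the sender: quoted-printable encoded text.
as_sent = b"Caf=C3=A9 au lait\r\n"

# Body as stored by a tool that decodes QP into raw UTF-8 bytes.
as_stored = quopri.decodestring(as_sent)

print(body_hash(as_sent))
print(body_hash(as_stored))
print("hashes match:", body_hash(as_sent) == body_hash(as_stored))
```

The text a user sees is the same in both cases, but the bytes differ, so a
verifier recomputing the hash over the stored message cannot match the value
the sender signed.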



[Bug 6155] generate new scores for 3.3.0 release

2009-10-08 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #85 from Warren Togami wtog...@redhat.com 2009-10-08 10:15:55 PDT 
---
I guess we have no choice but to drop wt-en6 from the rescore GA.

Should I drop it from nightly masscheck as well?



[Bug 6155] generate new scores for 3.3.0 release

2009-10-08 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #86 from Mark Martinec mark.marti...@ijs.si 2009-10-08 10:37:23 
PDT ---
> I guess we have no choice but to drop wt-en6 from the rescore GA.
> Should I drop it from nightly masscheck as well?

I can imagine such a problem could also affect other users, especially
those not running SpamAssassin close to their MTA. I guess we can keep
the wt-en6 corpus (and similar ones, if identified), but keep in mind that FP
hits on DKIM_ADSP_DISCARD (and possibly on some other rule, if identified)
should be disregarded. I already removed the DKIM_ADSP_DISCARD hit
from my copy of the wt-en6 log.

If it turns out the undesired mail modifications are more common
in submitted corpora, we could perhaps re-run the GA on a subset
of logs known not to be suffering from the problem, and just fetch
the DKIM_* scores from the results obtained from this run.

The release notes could then say that one should lower the DKIM_ADSP_*
scores on installations where it is known that mail is not reaching
SpamAssassin in its pristine form (as received by the MTA).



[Bug 6155] generate new scores for 3.3.0 release

2009-10-08 Thread bugzilla-daemon
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #87 from Warren Togami wtog...@redhat.com 2009-10-08 13:51:31 PDT 
---
(In reply to comment #86)
> The release notes could then say that one should lower the DKIM_ADSP_*
> scores on installations where it is known that mail is not reaching
> SpamAssassin in its pristine form (as received by the MTA).

This case or old ham where the sender subsequently changed their DKIM policy is
only an issue for masscheck, not production scanning.  Lowering the DKIM scores
makes no sense then?


