[Bug 6155] generate new scores for 3.3.0 release

bugzilla-daemon Tue, 20 Oct 2009 16:26:20 -0700

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155


--- Comment #120 from Adam Katz <antis...@khopis.com> 2009-10-20 16:25:36 UTC ---
(In reply to comment #119)
> (In reply to comment #118)
>> ... despite the current corpus data (unless 1.7% is a high ham hit-rate)?
> 
> http://ruleqa.spamassassin.org/20091017-r826198-n/RDNS_NONE/detail
> The most recent weekly run has pretty substantial hits even outside of
> the synthetic corpus.

Your link is just a longer version of mine.  It still results in a 1.7% total
ham hit-rate.  Is that too substantial?  Is there detail on what each corpus is
(specifically nbebout, since that's the only other corpus that hit 4+% of
spam)?

Looking only at ham scoring 4 or higher (including enron since I can't remove
it), RDNS_NONE hit 0.8528% of the total ham corpus.  Of the ham scoring in the
4s (4.0-4.99999), we're looking at 0.5865% that would become FPs if the score
were raised from 0.1 to 1.1, and I'm not even proposing my own
implementation's 0.9.
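
As a rough sketch of that arithmetic: a ham message becomes a new FP when it
hits the rule, currently scores below the threshold, and would reach the
threshold once the rule's score goes up.  The sample scores below are made up
for illustration (they are not corpus data), the function name is
hypothetical, and the 5.0 threshold is an assumption implied by the
4.0-4.99999 band.

```python
# Hypothetical illustration: which ham messages become false positives
# when a rule's score is raised by `delta`?
THRESHOLD = 5.0  # assumed spam threshold (implied by the 4.0-4.99999 band)


def new_false_positives(ham, delta, threshold=THRESHOLD):
    """ham: list of (total_score, hit_rule) pairs for ham messages.

    Returns the fraction of all ham that is below the threshold now but
    would reach it if the rule scored `delta` points higher.
    """
    flipped = [
        score
        for score, hit_rule in ham
        if hit_rule and score < threshold <= score + delta
    ]
    return len(flipped) / len(ham)


# Made-up sample: (current total score, did RDNS_NONE hit?)
sample_ham = [
    (0.3, False), (1.2, True), (4.2, True), (4.6, False),
    (3.9, True), (4.99, True), (2.0, False), (5.4, True),
]

print(f"{new_false_positives(sample_ham, 1.0):.2%} of ham would become FPs")
```

Only ham that both hits the rule and sits within `delta` of the threshold is
counted; ham already over the threshold is an existing FP, not a new one.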

> Adam, this [... and] your RCVD_IN_APNIC are examples of inherently
> prejudiced rules. It might work for the most part, and you might accept
> the risk of accidental FPs because the score alone won't push it above
> the threshold. However the combined risks of multiple prejudiced rules
> is too great. Prejudiced rules should be up to the sysadmin if they want
> to enable.  We should not highly score any known prejudiced rules in the
> default ruleset.

I agree that RCVD_VIA_APNIC is a prejudiced rule, and my channels have had it
rated 0.001 ever since you called me out on it (RCVD_VIA_APNIC accidentally
came in when I migrated from an internal-only propagation to a published
channel).  KHOP_NO_FIRST_NAME, my other poorly-considered published test,
pre-dates my more thorough testing mechanism (which has limited new rules'
entry quite considerably).  My rules will get even more cleaned up once I get
an svn account to test them here.  (Some of them, like the biased RCVD_IN_APNIC
and quasi-biased/unfair KHOP_SC_CIDR8, would either never get pushed up for
testing or would get the nopublish flag, depending on the guidelines ... that
nobody has yet pointed me to.)  (Side note: I see __RCVD_VIA_APNIC is already
in your own sandbox, hitting 86% of all Japanese ham.)

Getting back to this issue:  I don't see any problem with prejudice against
poorly constructed network infrastructures that can't be bothered to adhere to
the SMTP standard (RFC1912 section 2.1).  This is something that any network admin
who should legitimately be managing a mail server should be able to fix with a
single phone call (please correct me if this sentence is prejudiced in any
way).

The SMTP standard requires that a server's rDNS match the server's reported
name (thus the IP must have rDNS), and most allocated IPs have rDNS anyway
(even if it's wrong or ~dynamic, e.g. RDNS_DYNAMIC).  There is also a
growing number of deployments that block improper FCrDNS at the door (RDNS_NONE
is a subset of failing FCrDNS).
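
To make the "subset" relationship concrete, here is a minimal sketch of a
forward-confirmed reverse DNS (FCrDNS) check.  This is not how SpamAssassin
evaluates RDNS_NONE; the function names and the 'none'/'mismatch'/'pass'
labels are hypothetical, and the address comparison is a plain string match
(no IPv6 normalization).

```python
import socket


def reverse_names(ip):
    """PTR lookup via the system resolver; [] when the IP has no rDNS."""
    try:
        name, aliases, _ = socket.gethostbyaddr(ip)
    except (socket.herror, socket.gaierror):
        return []
    return [name, *aliases]


def forward_addrs(name):
    """Forward lookup via the system resolver; returns a set of IP strings."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return set()
    return {info[4][0] for info in infos}


def fcrdns_status(ip, reverse=reverse_names, forward=forward_addrs):
    """Return 'none', 'mismatch', or 'pass' for the connecting IP."""
    names = reverse(ip)
    if not names:
        return "none"      # no PTR record at all: the RDNS_NONE-like case
    if any(ip in forward(name) for name in names):
        return "pass"      # some PTR name resolves back to the same IP
    return "mismatch"      # rDNS exists but is not forward-confirmed
```

Both 'none' and 'mismatch' fail FCrDNS, so a site that rejects on FCrDNS
failure already rejects everything that has no rDNS at all.  The resolver
functions are passed in as parameters so the logic can be exercised with
canned data instead of live DNS.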

SA already has built-in "prejudices" against poorly constructed email clients
(e.g. MISSING_HEADERS) and relays (e.g. DATE_IN_FUTURE_48_96), so why not the
network?  Isn't SPF_FAIL a "prejudiced" test against network configuration?

SA at its core is merely a system of probabilities.  Even without Bayes, the
masscheck mechanism awards its points based on statistical
significance.  Very few rules are actually free of FPs (or FNs for negative
rules).  My question still stands:  what does SA deem statistically significant
when it comes to FPs?  Why does RDNS_NONE need to be immutable rather than
dictated by the masscheck results?  What would the automated system score
RDNS_NONE if it were allowed to?  I'm guessing something between 0.2 and 0.7.

