Re: [Mimedefang] learner indicated ham

2014-08-12 Thread Bill Cole

On 9 Aug 2014, at 13:41, G.W. Haywood wrote:

> Hi there,
>
> On Sat, 9 Aug 2014, Bill Cole wrote:
>
>> ... you probably could get a better answer from the broader SA
>> community, but I'll offer a vague rambling one :)
>
> It wasn't all that vague. :)
>
> You guys do REJECT your spam, don't you?


Generally, yes. I actually manage spam control for multiple systems that 
operate under a diversity of policy regimes, some of which require 
tag-and-release and/or quarantine for some mail that is in fact nearly 
pure spam. On my personal domain (20yo, including still-live addresses 
used for about a decade unmunged on Usenet) I reject 95% of all 
attempted SMTP transactions before DATA (a majority doomed before MAIL) 
so my filter_end function in MD (where SA gets a look) sees a mostly 
de-spammed stream of messages.



Re: [Mimedefang] learner indicated ham

2014-08-12 Thread Bill Cole

On 11 Aug 2014, at 10:22, Justin Edmands wrote:

> Bill,
> Thank you very much for the response. The detail is much appreciated.
> As Ged mentioned, not vague, helpful to say the least. The part about
> highly trusted rules caught my attention:
>
>> Another way to increase autolearning without going all the way to the
>> learn on error behavior is to flag rules that you trust highly as
>> autolearn_force so that messages matching them won't ever be
>> excluded from autolearning based on the existing Bayes DB disagreeing
>> with the deterministic rules.
>
> I think these will get me started:
>
> tflags URIBL_DBL_SPAM autolearn_force
> tflags URIBL_JP_SURBL autolearn_force
> tflags URIBL_BLACK autolearn_force
> tflags INVALID_DATE autolearn_force
>
> Any others that are definites?


That's a hard question for anyone to answer without knowing your 
mailstream's quirks. I can't tell you who your users are and what sort 
of mail they want that matches which rules. The default SA rules have 
mostly low scores because they are all individually highly error-prone.


I'm especially wary about putting too much trust in individual rules 
because I get lots of mail that talks about spam, often with things like 
lists of evil domains that trigger URIBL rules. And INVALID_DATE shows 
up in a surprising number of ethically upstanding but technically sordid 
messages (e.g. Terminix customer notices). This is why I reserve 
autolearn_force for meta-rules, since it carries a risk of turning a few 
false positives into a bad Bayes DB. The specific example of what I 
described that I can share is this locally-defined rule:


describe URIBL_MULTI1 Multiple URIBL hits
meta URIBL_MULTI1 URIBL_DBL_SPAM + URIBL_RED + URIBL_BLACK + URIBL_SBL + URIBL_WS_SURBL + URIBL_OB_SURBL + URIBL_JP_SURBL + URIBL_SC_SURBL > 2

score URIBL_MULTI1 10
tflags URIBL_MULTI1 autolearn_force

That means that if 3 or more of 8 different URIBL tests hit on a 
message, I tack on an extra 10 points and override the learner's 
protections. I should add a note of warning by example: last week a 
thread in the Postfix users list was started with a message including a 
long list of spammer domains, causing the original message and any that 
fully quoted it to match *6* of those URIBLs. If your mailstream 
includes mail discussing spam, you have to take precautions to keep 
such things from ruining your Bayes DB.
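
One possible precaution (this is a standard SpamAssassin option rather 
than anything specific to my setup, and the list address below is only 
an illustration) is to exempt that kind of list traffic from Bayes 
handling entirely:

# Sketch: skip Bayes classification and autolearning for mail addressed
# to a list that routinely quotes spammer domains. Address is illustrative.
bayes_ignore_to postfix-users@postfix.org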


My other autolearn_force rules are also meta-rules that bundle multiple 
rules, but I unfortunately cannot freely share their details as the 
constituent rules come from private (i.e. encumbered) sources. The 
general process I use is to look for clusters of rules (positive OR 
negative) that often hit together on mail that gets a Bayes score in the 
opposite direction. Before SA 3.4 I just set high scores on those 
meta-rules to assure rejection, but autolearn_force improves on that.
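
As a rough shape for that kind of rule, here is a sketch with 
placeholder names (LOCAL_SIGNAL_A/B/C stand in for whatever rules 
cluster together in your own mailstream; they are not real rules):

# Hypothetical cluster meta-rule; constituent names are placeholders.
# Fires when at least 2 of the 3 trusted local signals hit.
meta     LOCAL_SPAM_CLUSTER (LOCAL_SIGNAL_A + LOCAL_SIGNAL_B + LOCAL_SIGNAL_C) > 1
describe LOCAL_SPAM_CLUSTER Multiple trusted local signals agree
score    LOCAL_SPAM_CLUSTER 8.0
tflags   LOCAL_SPAM_CLUSTER autolearn_force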



Re: [Mimedefang] learner indicated ham

2014-08-11 Thread Justin Edmands
On Sat, Aug 9, 2014 at 1:41 PM, G.W. Haywood
mimedef...@jubileegroup.co.uk wrote:

> It wasn't all that vague. :)
>
> You guys do REJECT your spam, don't you?
>
> --
>
> 73,
> Ged.


Bill,
Thank you very much for the response. The detail is much appreciated.
As Ged mentioned, not vague, helpful to say the least. The part about
highly trusted rules caught my attention:

> Another way to increase autolearning without going all the way to the
> learn on error behavior is to flag rules that you trust highly as
> autolearn_force so that messages matching them won't ever be
> excluded from autolearning based on the existing Bayes DB disagreeing
> with the deterministic rules.

I think these will get me started:

tflags URIBL_DBL_SPAM autolearn_force
tflags URIBL_JP_SURBL autolearn_force
tflags URIBL_BLACK autolearn_force
tflags INVALID_DATE autolearn_force

Any others that are definites?


Re: [Mimedefang] learner indicated ham

2014-08-09 Thread Bill Cole

On 8 Aug 2014, at 12:05, Justin Edmands wrote:

> Aug  8 12:00:53.067 [19948] dbg: learn: auto-learn: message score:
> 13.934, computed score for autolearn: 17.583
> Aug  8 12:00:53.067 [19948] dbg: learn: auto-learn? ham=0, spam=7,
> body-points=7.448, head-points=5.511, learned-points=-1.9
> Aug  8 12:00:53.067 [19948] dbg: learn: auto-learn: autolearn_force
> not flagged for a rule. Body Only Points: 7.448 (3 req'd) / Head Only
> Points: 5.511 (3 req'd)
> Aug  8 12:00:53.067 [19948] dbg: learn: auto-learn? no: scored as spam
> but learner indicated ham (-1.9 < -1)


This is really a SpamAssassin issue rather than a MIMEDefang issue, so 
you probably could get a better answer from the broader SA community, 
but I'll offer a vague rambling one :)


The SA auto-learn subsystem is designed to be very cautious in what it 
learns because it carries diverse mistraining risks. The obvious part of 
the caution is the spam/non-spam thresholds for auto-learning, but there 
are also less prominent ones: the message is rescored for the threshold 
check using scoreset 0 or 1, the learner demands a minimum of 3 points 
each from body and header/network rules to score as spam unless a 
matched rule has the autolearn_force tflag set, and other per-rule 
'tflags' can modify how the learner acts on a matching message. As a 
result, a message actually has 5 scores tallied by SA: the normal score 
using scoreset 3 or 4, the score using scoreset 0 or 1 that gets 
compared to the spam and nonspam autolearn threshold settings, the 
body-only score, the 
header-only score, and the score using only rules with the learn tflag 
(by default, that's only BAYES_* rules) which is reported in debug 
messages as learned-points. By default, that last value is used as a 
backstop to prevent wildly divergent auto-learning. If the Bayes rules 
score a message below -1 or above 1 (by default: a Bayes probability below 1% or 
above 50%) in dissent from the overall score, the message will not be 
autolearned.
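
For reference, the knobs that control that threshold comparison live in 
local.cf and look roughly like this (the values shown are, as far as I 
recall, the stock defaults, so adjust to taste):

# Enable auto-learning and set the thresholds that the scoreset-0/1
# score is compared against. Values shown are believed to be the defaults.
bayes_auto_learn                   1
bayes_auto_learn_threshold_nonspam 0.1
bayes_auto_learn_threshold_spam    12.0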



> Is this something that I can fix? I want stuff to be trained as spam
> but it doesn't seem to make it. I am thinking it's either a setting I
> am not aware of or I need to retrain my bayes DB ham. Any help would
> be great.


The real question is whether it is a problem at all, i.e. whether it's a 
thing that merits fixing rather than a thing that is working as designed 
and, at least in aggregate, for your benefit. Probably that particular 
message was spam, given the very high score spread across rule types, 
but it is certain that learning it as spam would change the way your 
Bayes DB interprets similar messages and possible (absent other 
evidence) that it was not spam at all. Unless you do intensive periodic 
score adjustments of your non-Bayes rules based on a carefully 
human-classified corpus of messages that are representative of the 
actual mailstream seen by SA, a well-fed Bayes DB is going to be a 
better judge than the other (static and mostly default) rules. As of SA 
v3.4 (which you apparently have, as autolearn_force is new) you can 
switch bayes_auto_learn_on_error to 1 to flip the auto-learner into a 
mode where it *ONLY* learns a message when its learned-points 
classification (i.e. the judgment of the existing Bayes DB) disagrees 
with classification based on surpassing an autolearn threshold.
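
In local.cf that is simply:

# SA 3.4+: only auto-learn when the Bayes verdict disagrees with the
# verdict reached via the autolearn thresholds ("learn on error").
bayes_auto_learn_on_error 1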


Whether you leave bayes_auto_learn_on_error at its default 0 for the 
traditional behavior or switch it to 1 depends on what you believe to 
be true about the relative accuracy of your Bayes and non-Bayes SA 
rules. The traditional behavior expresses an assumption that the Bayes 
DB is less likely to make a large classification error than the rules 
used for the autolearn score, while the learn on error behavior 
assumes that your Bayes DB is probably in error when it disagrees with 
the other SA rules. Which way is better is site-specific, as that is 
influenced by a site's particular mailstream idiosyncrasies, the 
autolearn thresholds, local rules, local score adjustments to standard 
rules, the exclusion of messages from SA scoring by other anti-spam 
measures, and the nature of what gets fed to the Bayes DB after explicit 
human classification.


Another way to increase autolearning without going all the way to the 
learn on error behavior is to flag rules that you trust highly as 
autolearn_force so that messages matching them won't ever be excluded 
from autolearning based on the existing Bayes DB disagreeing with the 
deterministic rules. I have started doing this for locally-defined 
meta-rules that match on multiple hits on net rules such as the URIBL 
family. My reasoning there is that an identical message can get 
autolearned as ham at 12:00 because the spammer filled it with 
Bayes-busting garbage and freshly minted payload URLs and sent it from 
a fresh snowshoe range, but score well past the autolearn spam 
threshold at 12:05 because by then multiple network services checked by SA rules 
have switched their opinions. In short: there are non-Bayes rules which 

Re: [Mimedefang] learner indicated ham

2014-08-09 Thread G.W. Haywood

Hi there,

On Sat, 9 Aug 2014, Bill Cole wrote:


> ... you probably could get a better answer from the broader SA
> community, but I'll offer a vague rambling one :)


It wasn't all that vague. :)

You guys do REJECT your spam, don't you?

--

73,
Ged.


[Mimedefang] learner indicated ham

2014-08-08 Thread Justin Edmands
Aug  8 12:00:53.067 [19948] dbg: learn: auto-learn: message score:
13.934, computed score for autolearn: 17.583
Aug  8 12:00:53.067 [19948] dbg: learn: auto-learn? ham=0, spam=7,
body-points=7.448, head-points=5.511, learned-points=-1.9
Aug  8 12:00:53.067 [19948] dbg: learn: auto-learn: autolearn_force
not flagged for a rule. Body Only Points: 7.448 (3 req'd) / Head Only
Points: 5.511 (3 req'd)
Aug  8 12:00:53.067 [19948] dbg: learn: auto-learn? no: scored as spam
but learner indicated ham (-1.9 < -1)


Is this something that I can fix? I want stuff to be trained as spam
but it doesn't seem to make it. I am thinking it's either a setting I
am not aware of or I need to retrain my bayes DB ham. Any help would
be great.