Re: Spamassassin Bayes... "why give that spam that score???"

2016-02-25 Thread John Hardin

On Thu, 25 Feb 2016, RW wrote:


On Thu, 25 Feb 2016 13:58:03 -0800 (PST)
John Hardin wrote:

On Thu, 25 Feb 2016, Steve wrote:



b) Configure spamc -C report  (run as any user) to initiate
training of the amavis bayes database (in ~amavis/.spamassassin) ?


That would probably be a code change, unless you want to write a
wrapper script that calls the real spamc and then sa-learn...
Probably not a good idea.


I don't see why it would require a code change if ~amavis is a real
unix home directory. It does require an instance of spamd that does
nothing else since AFAIK it's not needed by amavisd.


Sorry, I was thinking in terms of "learning" at all rather than "learning 
to a specific database".


{refreshes memory of spamc command line}

spamc -L ham|spam is for learning. I expect if you configured the correct 
database (as Reindl suggested) then -L would do what you want. Having -C 
report do that as well would be a code change, and I'm not sure that's a 
good idea.


Apologies for not mentioning -L initially, I had forgotten about it.


You can either run spamd -u amavis , or leave it as root and run
spamc -u amavis. Either way spamd will drop to the user amavis and look
for its files in ~amavis/.spamassassin

I think you do need to use both  the -C and -L options to spamc though.

The alternative for both training and reporting/revoking would be to use
the spamassassin script, but that's inefficient from the Dovecot plugin.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The Constitution is a written instrument. As such its meaning does
  not alter. That which it meant when adopted, it means now.
-- U.S. Supreme Court
   SOUTH CAROLINA v. US, 199 U.S. 437, 448 (1905)
---
 66 days since the first successful real return to launch site (SpaceX)


Re: Spamassassin Bayes... "why give that spam that score???"

2016-02-25 Thread RW
On Thu, 25 Feb 2016 13:58:03 -0800 (PST)
John Hardin wrote:

> On Thu, 25 Feb 2016, Steve wrote:

> > b) Configure spamc -C report  (run as any user) to initiate
> > training of the amavis bayes database (in ~amavis/.spamassassin) ?  
> 
> That would probably be a code change, unless you want to write a
> wrapper script that calls the real spamc and then sa-learn...
> Probably not a good idea.



I don't see why it would require a code change if ~amavis is a real
unix home directory. It does require an instance of spamd that does
nothing else since AFAIK it's not needed by amavisd.

You can either run spamd -u amavis , or leave it as root and run
spamc -u amavis. Either way spamd will drop to the user amavis and look
for its files in ~amavis/.spamassassin

I think you do need to use both  the -C and -L options to spamc though.

The alternative for both training and reporting/revoking would be to use
the spamassassin script, but that's inefficient from the Dovecot plugin.
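
A minimal sketch of that approach, assuming spamd was started with learning and reporting enabled (its --allow-tell option) and that the calling user is allowed to pass -u. The function name train_and_report and the SPAMC and SA_USER variables are made up for illustration; they are not part of spamc. Since I'm not certain one spamc call can both learn and report, the message is saved once and spamc is run twice:

```shell
#!/bin/sh
# Hypothetical wrapper for the Dovecot antispam pipe: train the amavis
# Bayes database (-L) and report/revoke (-C) in two separate spamc calls.
# SPAMC and SA_USER are illustrative overrides, not spamc features.
SPAMC=${SPAMC:-spamc}
SA_USER=${SA_USER:-amavis}

train_and_report() {
    case "$1" in
        spam) report=report ;;
        ham)  report=revoke ;;
        *)    echo "usage: train_and_report spam|ham < message" >&2
              return 2 ;;
    esac
    msg=$(mktemp) || return 1
    cat > "$msg"                                  # keep stdin for two passes
    "$SPAMC" -u "$SA_USER" -L "$1" < "$msg"       # Bayes training
    "$SPAMC" -u "$SA_USER" -C "$report" < "$msg"  # collaborative report/revoke
    rm -f "$msg"
}
```

The plugin's pipe command would then call train_and_report spam for mail moved into the spam folder and train_and_report ham for mail moved out of it. The sketch does not pass spamc's exit status back to the caller.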



> That's probably the easiest to do.
> 
> https://wiki.apache.org/spamassassin/SiteWideBayesSetup

It's presumably already site-wide with a database in
~amavis/.spamassassin

> Also, if you are going to leave autolearn on, reduce the learn-as-ham 
> threshold!

Autotraining and the Dovecot plugin aren't a good combination, since
both are very poor at learning ham. If you really must use them
together, train a few thousand hams manually and then set the threshold
low enough that it won't get screwed up by autotraining.



Re: Spamassassin Bayes... "why give that spam that score???"

2016-02-25 Thread Reindl Harald



On 25.02.2016 at 22:58, John Hardin wrote:

b) Configure spamc -C report  (run as any user) to initiate training
of the amavis bayes database (in ~amavis/.spamassassin) ?


That would probably be a code change, unless you want to write a wrapper
script that calls the real spamc and then sa-learn... Probably not a
good idea


why?

spamc --help
-F, --config path   Use this configuration file
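
For illustration: as far as I know, a spamc configuration file simply holds the same flags as the command line, one per line, with # comments. The file name and path below are hypothetical, and I have not checked that every flag is honoured from a file, so treat this as a sketch:

```
# /etc/spamassassin/spamc-amavis.conf  (hypothetical path)
# run requests as the amavis user so spamd uses ~amavis/.spamassassin
-u amavis
```

It would then be used as: spamc -F /etc/spamassassin/spamc-amavis.conf -L spam < message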







Re: Spamassassin Bayes... "why give that spam that score???"

2016-02-25 Thread John Hardin

On Thu, 25 Feb 2016, Steve wrote:

Please keep the discussion on-list so others may help/benefit.


On 25/02/2016 01:14, John Hardin wrote:

 The second one has autolearn=yes, so I would say that autolearn is
 probably the cause of this behavior.


You're right... Manual training wasn't working - and autolearn became 
self-reinforcing as a result.  I had been misinterpreting my logs 
(face-palm)! I now see that the training initiated by spamc (behind 
dovecot antispam) was trying to train the bayes database in 
~/.spamassassin/bayes* - but amavis was using the bayes database in
~amavis/.spamassassin/bayes* - and was failing as a result (which I had 
overlooked.)


Yeah, "are you training the right database?" is a standard initial 
troubleshooting question; I apologize for not asking that up front.



I can now refine my question:  Is there an easy way to:

a) Configure amavisd to use the spamassassin configuration 
(~/.spamassassin/user_prefs and bayes_*) for the intended mailbox's account? 
(As far as I can tell, this isn't supported...)


Not sure, I'm unfamiliar with the details of amavisd. Sorry.

b) Configure spamc -C report  (run as any user) to initiate training of the 
amavis bayes database (in ~amavis/.spamassassin) ?


That would probably be a code change, unless you want to write a wrapper 
script that calls the real spamc and then sa-learn... Probably not a good 
idea.


c) Configure everything to use a single site-wide database?  (I've found 
how-to documents suggesting that I set "bayes_path" and "bayes_file_mode" - 
but when I try this, this part of the configuration seems to be ignored.)


That's probably the easiest to do.

https://wiki.apache.org/spamassassin/SiteWideBayesSetup
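
A sketch of what that setup boils down to, assuming the site configuration lives in /etc/spamassassin/local.cf (the location varies by distro) and using an example database directory. As far as I know these are administrator settings, so they may be silently ignored if placed in a user's user_prefs rather than in the site configuration that amavisd and spamd both read:

```
# local.cf: one shared Bayes database for every user
# (directory and file prefix are examples, not required names)
bayes_path      /var/spamassassin/bayes/bayes
# the database files must be writable by every account that scans or trains
bayes_file_mode 0777
```

Running spamassassin --lint afterwards is a quick way to catch typos in the file.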

Also, if you are going to leave autolearn on, reduce the learn-as-ham 
threshold!



 Have you considered greylisting to give domains a chance to be added to
 URIBLs before you see them?


I have - but I quickly lost patience with it.  It is important to me that - 
if I'm having a phone conversation with someone, and they send me an email 
"there and then" - that I get to see it before hanging up.  Greylisting is 
incompatible with this wish.


It doesn't work for everyone.

I'm not comfortable increasing the URIBL_BLACK score (as you appear to have 
done) as I don't want to risk any block-list ever being a single point of 
failure for false positives.


URIBL_BLACK wouldn't become a poison pill by itself unless you score it 
over 5. I don't necessarily recommend trusting it *that* much, but 3.0 
seems reasonable to me.
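
As a concrete illustration, assuming the stock required_score of 5.0 (adjust if yours differs): with the override below, a URIBL_BLACK hit on its own stays 2 points short of the threshold, so it still needs other evidence before a message is tagged.

```
# local.cf: trust URIBL_BLACK more, but keep it below the spam threshold
score URIBL_BLACK 3.0
```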


I am, however, very curious about IXHASH - 
which looks as if it is useful.  How does this compare with (or relate to) 
RAZOR/PYZOR/DCC?  What's the best way to install it (on Ubuntu - if the 
distro is relevant to the answer...)?


Dunno, maybe somebody else will chime in.



Re: Spamassassin Bayes... "why give that spam that score???"

2016-02-25 Thread Bill Cole

On 24 Feb 2016, at 20:14, John Hardin wrote:


On Thu, 25 Feb 2016, Steve wrote:


On 24/02/2016 22:59, John Hardin wrote:

 On Wed, 24 Feb 2016, Steve wrote:

>  I've used spamassassin for many years - on Ubuntu, using amavisd - with
>  great success.  In recent months, I've been receiving several spam
>  messages each day that evade the filters.


 Can you provide samples? (e.g. three or four on Pastebin)


One of each of the most common forms:

http://pastebin.com/Wk2KD1Q1
http://pastebin.com/QCQ9Ymw7
http://pastebin.com/wgkmiJLt


The second one has autolearn=yes, so I would say that autolearn is 
probably the cause of this behavior.


Note that the bayes score doesn't contribute to the autolearning 
decision to avoid positive feedback, but if there are no non-Bayes 
spam signs and the message scores lightly negative like that one does, 
it can be learned as ham. That would make any subsequent similar 
messages score even lower, possibly offsetting actual spam hits.


Subsequently training those messages as spam will offset that effect, 
but you're to a degree playing whack-a-mole that way.


I misspoke a bit when I said there are no knobs to twiddle. I forgot 
about the autolearn thresholds, but they aren't strictly part of how 
bayes itself works, they are (again) training. If you want to use 
autolearn, you might want to reduce the learn-as-ham threshold even 
further. View autolearn as a not-quite-trustworthy user making 
submissions, and the thresholds are a way to limit the effects of poor 
judgement. :)


I'm much more certain that you should reduce your 
bayes_auto_learn_threshold_nonspam. Everyone should.


The default is 0.1, and it looks like you've left that as-is. I use -0.2 
because I really don't want the autolearner to assume mail is ham 
without at least 2 minor or one substantial indicator of hamminess. 
Maybe giving  mail the benefit of the doubt made sense circa v3.1, but 
it definitely does not today. In the case of your 2nd example, it was 
autolearned as ham because its non-bayes score was -0.101, based on 
rules that only have independent scores at all for strategic UI (some 
might even say political) purposes.
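
To make the arithmetic concrete, here is a small sketch of the comparison, assuming a message is learned as ham when its non-Bayes score is strictly below bayes_auto_learn_threshold_nonspam. The real autolearner applies further conditions, and would_autolearn_ham is a made-up helper, not a SpamAssassin command:

```shell
#!/bin/sh
# Illustrative only: mimic the "learn as ham" comparison for a given
# non-Bayes score and bayes_auto_learn_threshold_nonspam value.
would_autolearn_ham() {
    # succeed (exit 0) when score < threshold, compared numerically
    awk -v score="$1" -v thr="$2" 'BEGIN { exit !(score + 0 < thr + 0) }'
}

for thr in 0.1 -0.2; do
    if would_autolearn_ham -0.101 "$thr"; then
        echo "threshold $thr: -0.101 would be autolearned as ham"
    else
        echo "threshold $thr: -0.101 would be left alone"
    fi
done
```

With the default of 0.1 the example message qualifies as ham; with -0.2 it does not. In local.cf the change is the single line "bayes_auto_learn_threshold_nonspam -0.2".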


Re: Spamassassin Bayes... "why give that spam that score???"

2016-02-25 Thread RW
On Thu, 25 Feb 2016 00:41:04 +
Steve wrote:

> On 24/02/2016 22:59, John Hardin wrote:

> > How do you train your Bayes? Autolearn? General user submissions? 
> > Trusted user submissions? Only you, from only your personal mail?  
> Only my personal mailbox *really* matters to me.  I train from it
> using the dovecot antispam plugin... which feeds mail I shift to/from
> a spam folder through a pipe involving "spamc -C".

I think that might be your problem. The equivalent option in the
spamassassin script trains Bayes as a side-effect of reporting or
revoking. I don't think  "spamc -C" does.





Re: Spamassassin Bayes... "why give that spam that score???"

2016-02-24 Thread Reindl Harald



On 25.02.2016 at 02:14, John Hardin wrote:

On Thu, 25 Feb 2016, Steve wrote:


On 24/02/2016 22:59, John Hardin wrote:

 On Wed, 24 Feb 2016, Steve wrote:

>  I've used spamassassin for many years - on Ubuntu, using amavisd - with
>  great success.  In recent months, I've been receiving several spam
>  messages each day that evade the filters.

 Can you provide samples? (e.g. three or four on Pastebin)


One of each of the most common forms:

http://pastebin.com/Wk2KD1Q1
http://pastebin.com/QCQ9Ymw7
http://pastebin.com/wgkmiJLt


The second one has autolearn=yes, so I would say that autolearn is
probably the cause of this behavior


autolearn is the root of all evil; it's nice for a "fire and forget" 
setup with no manual training, but that's it


i got hit by it multiple times in the past, in both directions (spam 
learned as ham and ham learned as spam), and ended up purging the whole 
bayes (commercial appliance using SpamAssassin as one part)


after building my own spamfilter solution, keeping the whole corpus and 
*only* training by hand with no autolearning/autoexpire, the bayes is 100% 
trustworthy and can be scored as nearly a poison pill for spam as well as 
-3.5 for BAYES_00
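
As a sketch, the settings described above would look roughly like this in local.cf. Only the BAYES_00 score comes from this message; the spam-side scores are not stated, so none are shown, and such a strong negative score only makes sense with a carefully hand-trained database:

```
# local.cf: hand-training only, no automatic learning or token expiry
bayes_auto_learn   0
bayes_auto_expire  0
score BAYES_00    -3.5
```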


given that 99% of junk is killed long before SA on MTA-level, 30% are 
shortcircuit ham and over 70% of the messages making it through bayes are 
BAYES_00, the setup is proven to be right


0      61132  SPAM
0      21786  HAM
0    2540731  TOKEN

total 73M
-rw------- 1 sa-milt sa-milt 10M 2016-02-25 02:24 bayes_seen
-rw------- 1 sa-milt sa-milt 81M 2016-02-25 02:24 bayes_toks

BAYES_00        25445   73.52 %
BAYES_05          671    1.93 %
BAYES_20          780    2.25 %
BAYES_40          720    2.08 %
BAYES_50         2519    7.27 %
BAYES_60          370    1.06 %    7.90 % (OF TOTAL BLOCKED)
BAYES_80          288    0.83 %    6.15 % (OF TOTAL BLOCKED)
BAYES_95          284    0.82 %    6.06 % (OF TOTAL BLOCKED)
BAYES_99         3529   10.19 %   75.38 % (OF TOTAL BLOCKED)
BAYES_999        3185    9.20 %   68.04 % (OF TOTAL BLOCKED)

DNSWL   4   90.78 %
SPF             33608   65.37 %
SPF/DKIM WL     14653   28.50 %
SHORTCIRCUIT    16744   32.57 %

BLOCKED          4681    9.10 %
SPAMMY           4471    8.69 %   95.51 % (OF TOTAL BLOCKED)






Re: Spamassassin Bayes... "why give that spam that score???"

2016-02-24 Thread John Hardin

On Thu, 25 Feb 2016, Reindl Harald wrote:


 7.0 URIBL_BLACK            Contains an URL listed in the URIBL blacklist
                            [URIs: leslie-bib***b.org]


That, too. Steve, you might consider boosting your local score for 
URIBL_BLACK. :)




Re: Spamassassin Bayes... "why give that spam that score???"

2016-02-24 Thread John Hardin

On Thu, 25 Feb 2016, Steve wrote:


On 24/02/2016 22:59, John Hardin wrote:

 On Wed, 24 Feb 2016, Steve wrote:

>  I've used spamassassin for many years - on Ubuntu, using amavisd - with 
>  great success.  In recent months, I've been receiving several spam 
>  messages each day that evade the filters.


 Can you provide samples? (e.g. three or four on Pastebin)


One of each of the most common forms:

http://pastebin.com/Wk2KD1Q1
http://pastebin.com/QCQ9Ymw7
http://pastebin.com/wgkmiJLt


The second one has autolearn=yes, so I would say that autolearn is 
probably the cause of this behavior.


Note that the bayes score doesn't contribute to the autolearning decision 
to avoid positive feedback, but if there are no non-Bayes spam signs and 
the message scores lightly negative like that one does, it can be learned 
as ham. That would make any subsequent similar messages score even lower, 
possibly offsetting actual spam hits.


Subsequently training those messages as spam will offset that effect, but 
you're to a degree playing whack-a-mole that way.


I misspoke a bit when I said there are no knobs to twiddle. I forgot about 
the autolearn thresholds, but they aren't strictly part of how bayes 
itself works, they are (again) training. If you want to use autolearn, you 
might want to reduce the learn-as-ham threshold even further. View 
autolearn as a not-quite-trustworthy user making submissions, and the 
thresholds are a way to limit the effects of poor judgement. :)


I note that they tend to come from different mail servers each time - the 
URLs in the body tend to be unique, too.


Have you considered greylisting to give domains a chance to be added to 
URIBLs before you see them?


>  * The false negatives all match BAYES_00 - attracting a default score of 
>  -1.9. BAYES_00 seems to be at the crux of the misclassification.
> 
>  Is there a way to delve into why these messages have been allocated such 
>  a low bayes score - while (to a human) appearing blatant, simple, spam 
>  on "vanilla" spam topics?  Has my bayes data been "poisoned" somehow?


 Poisoning is less likely than mistraining.
 How large is your userbase and mail volume?


One user - me - several email addresses.  10,000 mails per month - several 
mailing lists where I read only a tiny fraction of the posts.


Heh. For once it's someone pretty much like me. :)

~ 1,500 spams (that survive mail server RBLs).  Autolearn is on - I don't 
think about it, it is automatic. :)



 How do you train your Bayes? Autolearn? General user submissions? Trusted
 user submissions? Only you, from only your personal mail?


Only my personal mailbox *really* matters to me.  I train from it using the 
dovecot antispam plugin... which feeds mail I shift to/from a spam folder 
through a pipe involving "spamc -C".


And I assume there's a similar ham folder? You need both.


 Do you keep base training corpora so you can wipe and retrain if it goes
 off the rails for some reason?


(In principle) I've got multi-gigabyte-scale spam/ham corpora.  I'm yet to 
[ever] do anything with it. :)


I have base bayes corpora of a few thousand messages each spam and ham, 
kept in aged corpora files. I add a handful to that every month, mostly on 
the spam side. SA is trained nightly from the current corpora files and I 
can retrain from scratch from all of them if needed, but I haven't 
needed to do that yet.
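
A rough sketch of that kind of corpus-based training. The directory layout (spam-*.mbox and ham-*.mbox files) and the SA_LEARN override are assumptions made for the example, not a description of my actual scripts. It should be run as the account that owns the Bayes database:

```shell
#!/bin/sh
# Sketch of training Bayes from aged corpus files kept in mbox format.
# File naming and the SA_LEARN variable are illustrative assumptions.
SA_LEARN=${SA_LEARN:-sa-learn}

train_from_corpora() {
    dir=$1
    for f in "$dir"/spam-*.mbox; do
        if [ -e "$f" ]; then "$SA_LEARN" --spam --mbox "$f"; fi
    done
    for f in "$dir"/ham-*.mbox; do
        if [ -e "$f" ]; then "$SA_LEARN" --ham --mbox "$f"; fi
    done
}

# Wipe the database and rebuild it from the corpora (destructive, so it
# is kept as a separate, deliberate step).
retrain_from_scratch() {
    "$SA_LEARN" --clear && train_from_corpora "$1"
}
```

A nightly cron job would call train_from_corpora with the corpus directory; messages that were already learned are skipped by sa-learn, so repeating the run is cheap.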



 If all the FNs are getting BAYES_00, make sure you're (re)training them as
 spam.


I believe I'm doing that - but it isn't easy to prove that the training 
'worked'.


If you look at the output from the training you'll be able to see how many 
"new" messages it learned from.


It will have an effect, in that it will remove a specific mistraining, but 
in the meantime autolearn may be making bad decisions about other 
messages.
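
To check what a training run actually did when using sa-learn, its summary line can be parsed. The sketch below assumes the usual "Learned tokens from N message(s) (M message(s) examined)" format; learned_summary is a made-up helper name:

```shell
#!/bin/sh
# Extract the "learned" and "examined" counts from sa-learn's summary line.
learned_summary() {
    sed -n 's/^Learned tokens from \([0-9][0-9]*\) message(s) (\([0-9][0-9]*\) message(s) examined).*/learned=\1 examined=\2/p'
}

# Sample input; in practice pipe the output of "sa-learn --spam ..." here.
echo "Learned tokens from 3 message(s) (5 message(s) examined)" | learned_summary
```

A learned count lower than the examined count usually means some of the messages were already in the database with that classification. The nspam and nham counters printed by sa-learn --dump magic should also move after a successful run.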



 Review how you're training. If your users aren't really trustworthy you
 should be manually reviewing submissions.


When spam  arrives in my primary inbox, I hand classify - I'm less obsessive 
about mailing lists. Dovecot initiates training automatically when I shift 
messages to a special spam folder.


OK, good. If you had a userbase, their judgement (or lack thereof) could 
be an issue.



 I feel autolearn can be problematic, particularly if things are already
 going off the rails.


I expect Autolearn (assisted by Razor, Pyzor and DCC) has done the vast 
majority of my training.  This year, I've hand-trained 216 false-negatives 
and 0 false positives.


For the size of your install, I'd recommend turning off autolearn and 
going with purely hand-collected corpora. It serves me well.



 If you have base training corpora, review it for misclassifications (FNs),
 wipe and retrain.


I guess I could do that... My expectation is that - if I train with the 
corpora I can pick easily (without changing configuration) I'll get the same 
bayes database I currently have... which will give the same scores.


No, autolearning would 

Re: Spamassassin Bayes... "why give that spam that score???"

2016-02-24 Thread Reindl Harald



On 25.02.2016 at 01:41, Steve wrote:

On 24/02/2016 22:59, John Hardin wrote:

On Wed, 24 Feb 2016, Steve wrote:


I've used spamassassin for many years - on Ubuntu, using amavisd -
with great success.  In recent months, I've been receiving several
spam messages each day that evade the filters.


Can you provide samples? (e.g. three or four on Pastebin)


One of each of the most common forms:


none of those 3 messages should make it into your inbox, and at the very 
least they should never get BAYES_00 - looks like bad training!


i tried to obfuscate the URIBL hits because otherwise even the 
mailing-list would reject my message



http://pastebin.com/Wk2KD1Q1


/var/www/uploadtemp/ac5a53b19de9a182194b8e94cb6724eb4b3ce574.eml: Sanesecurity.Junk.52024.UNOFFICIAL FOUND
/var/www/uploadtemp/ac5a53b19de9a182194b8e94cb6724eb4b3ce574.eml: Sanesecurity.Blurl.6a2ebd.UNOFFICIAL FOUND
/var/www/uploadtemp/ac5a53b19de9a182194b8e94cb6724eb4b3ce574.eml: Sanesecurity.Blurl.6a2ebd.UNOFFICIAL FOUND
/var/www/uploadtemp/ac5a53b19de9a182194b8e94cb6724eb4b3ce574.eml: Sanesecurity.Blurl.6a2ebd.UNOFFICIAL FOUND
/var/www/uploadtemp/ac5a53b19de9a182194b8e94cb6724eb4b3ce574.eml: Sanesecurity.Blurl.6a2ebd.UNOFFICIAL FOUND
/var/www/uploadtemp/ac5a53b19de9a182194b8e94cb6724eb4b3ce574.eml: Sanesecurity.Blurl.6a2ebd.UNOFFICIAL FOUND


--- VIRUS-SCAN SUMMARY ---
Infected files: 1
Time: 0.009 sec (0 m 0 s)
Content analysis details:   (20.6 points, 5.5 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.0 GENERIC_IXHASH         DIGEST: generic.ixhash.net
-0.3 RCVD_IN_MSPIKE_H4      RBL: Very Good reputation (+4)
                            [108.62.157.149 listed in wl.mailspike.net]
 7.0 URIBL_BLACK            Contains an URL listed in the URIBL blacklist
                            [URIs: leslie-bib***b.org]
 1.5 SPF_HELO_FAIL          SPF: HELO does not match SPF record (fail)
                            [SPF failed: Please see http://www.openspf.org/Why?s=helo;id=gw.shic.co.uk;ip=192.168.42.2;r=mail-gw.thelounge.net]
 3.0 INVESTMENT_ADVICE      BODY: Message mentions investment advice
 1.5 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5002]
 0.0 HTML_MESSAGE           BODY: HTML included in message
-0.1 DKIM_VALID_AU          Message has a valid DKIM or DK signature from
                            author's domain
-0.1 DKIM_VALID             Message has at least one valid DKIM or DK signature
 0.5 PYZOR_CHECK            Listed in Pyzor (http://pyzor.sf.net/)
 0.1 DKIM_SIGNED            Message has a DKIM or DK signature, not necessarily
                            valid
 1.5 IXHASH_CHECK           Message hits one ore more IXHASH digest-sources
 2.5 RDNS_NONE              Delivered to internal network by a host with no
                            rDNS
-0.0 RCVD_IN_MSPIKE_WL      Mailspike good senders
 2.5 DIGEST_MULTIPLE_LOCAL  Message hits more than one network digest check
                            (razor, pyzor, ixhash)


http://pastebin.com/QCQ9Ymw7


/var/www/uploadtemp/cb2bd7249493a618230fc12473f311ee092a9c6a.eml: Sanesecurity.Blurl.56d5c1.UNOFFICIAL FOUND
/var/www/uploadtemp/cb2bd7249493a618230fc12473f311ee092a9c6a.eml: Sanesecurity.Blurl.56d5c1.UNOFFICIAL FOUND
/var/www/uploadtemp/cb2bd7249493a618230fc12473f311ee092a9c6a.eml: Sanesecurity.Blurl.56d5c1.UNOFFICIAL FOUND


--- VIRUS-SCAN SUMMARY ---
Infected files: 1
Time: 0.007 sec (0 m 0 s)
Content analysis details:   (18.5 points, 5.5 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 7.0 URIBL_BLACK            Contains an URL listed in the URIBL blacklist
                            [URIs: pinkhand***print.com]
 3.5 URIBL_DBL_SPAM         Contains a spam URL listed in the DBL blocklist
                            [URIs: pinkhand***print.com]
-0.1 CUST_DNSWL_2           RBL: score.senderscore.com (Low Trust)
                            [85.195.78.13 listed in score.senderscore.com]
-0.3 RCVD_IN_MSPIKE_H4      RBL: Very Good reputation (+4)
                            [85.195.78.13 listed in wl.mailspike.net]
 1.5 SPF_HELO_FAIL          SPF: HELO does not match SPF record (fail)
                            [SPF failed: Please see http://www.openspf.org/Why?s=helo;id=gw.shic.co.uk;ip=192.168.42.2;r=mail-gw.thelounge.net]
 1.5 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5000]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 2.0 RAZOR2_CF_RANGE_E8_51_100 Razor2 gives engine 8 confidence level above 50%
                            [cf: 100]
-0.1 DKIM_VALID_AU          Message has a valid DKIM or DK signature from
                            author's domain
 0.5 RAZOR2_CHECK           Listed in Razor2 (http://razor.sf.net/)
 0.5 RAZOR2_CF_RANGE_51_100 Razor2 gives confidence level above 50%
                            [cf: 100]
-0.1 

Re: Spamassassin Bayes... "why give that spam that score???"

2016-02-24 Thread Steve

On 24/02/2016 22:59, John Hardin wrote:

On Wed, 24 Feb 2016, Steve wrote:

I've used spamassassin for many years - on Ubuntu, using amavisd - 
with great success.  In recent months, I've been receiving several 
spam messages each day that evade the filters.


Can you provide samples? (e.g. three or four on Pastebin)


One of each of the most common forms:

http://pastebin.com/Wk2KD1Q1
http://pastebin.com/QCQ9Ymw7
http://pastebin.com/wgkmiJLt

I note that they tend to come from different mail servers each time - 
the URLs in the body tend to be unique, too.




* The false negatives all match BAYES_00 - attracting a default score 
of -1.9. BAYES_00 seems to be at the crux of the misclassification.


Is there a way to delve into why these messages have been allocated 
such a low bayes score - while (to a human) appearing blatant, 
simple, spam on "vanilla" spam topics?  Has my bayes data been 
"poisoned" somehow?


Poisoning is less likely than mistraining.
How large is your userbase and mail volume?


One user - me - several email addresses.  10,000 mails per month - 
several mailing lists where I read only a tiny fraction of the posts.  
~1,500 spams (that survive mail server RBLs).  Autolearn is on - I don't 
think about it, it is automatic. :)


How do you train your Bayes? Autolearn? General user submissions? 
Trusted user submissions? Only you, from only your personal mail?

Only my personal mailbox *really* matters to me.  I train from it using 
the dovecot antispam plugin... which feeds mail I shift to/from a spam 
folder through a pipe involving "spamc -C".


Do you keep base training corpora so you can wipe and retrain if it 
goes off the rails for some reason?

(In principle) I've got multi-gigabyte-scale spam/ham corpora.  I'm yet 
to [ever] do anything with it. :)


It is worth noting that I get a lot of correctly identified spam - 
and much of that matches BAYES_99 and BAYES_999... and my ham gets 
BAYES_00... so, for many messages, bayes is working. Is it likely 
that I am suffering poor performance (for these specific messages) as 
a result of some tunable parameter?


Probably not. There's not a lot to tune in Bayes. It's pretty much 
solely dependent on what you've trained it with.



What is the most effective way to tackle this?


If all the FNs are getting BAYES_00, make sure you're (re)training 
them as spam.

I believe I'm doing that - but it isn't easy to prove that the training 
'worked'.


Review how you're training. If your users aren't really trustworthy 
you should be manually reviewing submissions.


When spam  arrives in my primary inbox, I hand classify - I'm less 
obsessive about mailing lists. Dovecot initiates training automatically 
when I shift messages to a special spam folder.


I feel autolearn can be problematic, particularly if things are 
already going off the rails.


I expect Autolearn (assisted by Razor, Pyzor and DCC) has done the vast 
majority of my training.  This year, I've hand-trained 216 
false-negatives and 0 false positives.


If you have base training corpora, review it for misclassifications 
(FNs), wipe and retrain.


I guess I could do that... My expectation is that - if I train with the 
corpora I can pick easily (without changing configuration) I'll get the 
same bayes database I currently have... which will give the same 
scores.  Really, I'd like to understand why my current bayes database 
makes the classifications it does.





Re: Spamassassin Bayes... "why give that spam that score???"

2016-02-24 Thread John Hardin

On Wed, 24 Feb 2016, Steve wrote:

I've used spamassassin for many years - on Ubuntu, using amavisd - with great 
success.  In recent months, I've been receiving several spam messages each 
day that evade the filters.


Can you provide samples? (e.g. three or four on Pastebin)

* The false negatives all match BAYES_00 - attracting a default score of 
-1.9. BAYES_00 seems to be at the crux of the misclassification.


Is there a way to delve into why these messages have been allocated such a 
low bayes score - while (to a human) appearing blatant, simple, spam on 
"vanilla" spam topics?  Has my bayes data been "poisoned" somehow?


Poisoning is less likely than mistraining.

How large is your userbase and mail volume?

How do you train your Bayes? Autolearn? General user submissions? Trusted 
user submissions? Only you, from only your personal mail?


Do you keep base training corpora so you can wipe and retrain if it goes 
off the rails for some reason?


It is worth noting that I get a lot of correctly identified spam - and 
much of that matches BAYES_99 and BAYES_999... and my ham gets 
BAYES_00... so, for many messages, bayes is working. Is it likely that I 
am suffering poor performance (for these specific messages) as a result 
of some tunable parameter?


Probably not. There's not a lot to tune in Bayes. It's pretty much solely 
dependent on what you've trained it with.



What is the most effective way to tackle this?


If all the FNs are getting BAYES_00, make sure you're (re)training them as 
spam.


Review how you're training. If your users aren't really trustworthy you 
should be manually reviewing submissions.


I feel autolearn can be problematic, particularly if things are already 
going off the rails.


If you have base training corpora, review it for misclassifications (FNs), 
wipe and retrain.


If you *don't* have base training corpora, start building them.

