Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Bill Cole

On 13 Feb 2018, at 9:33, Horváth Szabolcs wrote:

This is a production mail gateway that has been in service since 2015. I saw
that a few messages (both ham and spam) were automatically learned by
amavisd/spamassassin. Today's statistics:


   3616 autolearn=ham
  10076 autolearn=no
   2817 autolearn=spam
    134 autolearn=unavailable


That's quite high for spam, ham, AND "unavailable" (which indicates 
something wrong with the Bayes subsystem, usually transient). This seems 
like a recipe for a mis-learning disaster. For comparison, my 2018 
autolearn counts:


spam: 418
ham: 15018
unavailable: 166
no: 129555

I also manually train any spam that gets through to me (the biggest spam 
target), a small number of spams reported by others, and 'trap' hits. A 
wide variety of ham is harder to get for training but I have found it 
useful to give users a well-documented and simple way to help. One way 
is to look at what happens to mail AFTER delivery which can indicate 
that a message is ham without needing an admin to try to make a 
determination based on content. The simplest one is to learn anything 
users mark as $NotJunk as ham. Another is to create an "Archive" mailbox 
for every user and learn anything as ham that has been moved there a day 
after it is moved. The most important factor (especially in 
jurisdictions where human examination of email is a problem) is to tell 
users how to protect their email and then do what you tell them, 
robotically. In the US, Canada, and *SOME* of the EU, this is not risky. 
However, I have been told by people in *SOME* EU countries that they 
can't even robotically scan ANY mail content, so you shouldn't take my 
advice as authoritative: I'm not even a lawyer in the US, much less 
Hungary...
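As a concrete illustration of the "Archive a day later" idea, something like 
this could run from cron; the Maildir layout, folder name, and Bayes DB path 
are assumptions, not a tested recipe:

  # Run as the user that owns the Bayes DB (the amavis user in this thread).
  # Learn as ham anything sitting in users' Archive folders for over a day;
  # sa-learn skips messages it has already learned, so re-running is safe.
  find /var/vmail/*/Maildir/.Archive/cur -type f -mtime +1 -print0 \
    | xargs -0 -r sa-learn --ham --dbpath /var/spool/amavisd/.spamassassin/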



I think I have no control over what is learnt automatically.


Yes, you do. Run "perldoc 
Mail::SpamAssassin::Plugin::AutoLearnThreshold" for details.


You can set the learning thresholds, which control what gets learned. 
The defaults (0.1 and 12) mis-learn far too much spam as ham and learn 
too little spam as spam. I use -0.2 and 6, which means I don't autolearn 
a lot, but everything I autolearn as ham has at least one hit on a 
substantial "nice" rule or two hits on weak ones.


There's a lot of vehemence against autolearn expressed here but not a 
lot of evidence that it operates poorly when configured wisely. The 
defaults are NOT wise.
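In local.cf terms, the thresholds described above look like this (the values 
are the ones I mentioned, not the shipped defaults of 0.1 and 12.0):

  bayes_auto_learn_threshold_nonspam -0.2
  bayes_auto_learn_threshold_spam     6.0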



Let's just assume for a moment that 1.4M ham-samples are valid.


Bad assumption. Your Bayes checks are uncertain about mail you've told 
SA is definitely spam. That's broken. It's a sort of breakage that 
cannot exist if you do not have a large quantity of spam that has been 
learned as ham.



Is there a ham:spam ratio I should stick to?


No.

I presume if we have a 1:1 ratio then future messages won't be 
considered as spam as well.


The ham:spam ratio in the Bayes DB or its autolearning is not a 
generally useful metric. 1:1 is not magically good and neither is any 
other ratio, even with reference to a single site's mailstream. A very 
large ratio *on either side* indicates a likely problem in what is being 
learned, but you can't correlate the ratio to any particularly wrong 
bias in Bayes scoring. It is an inherently chaotic relationship. Factors 
that actually matter are correctness of learning, sample quality, and 
currency. You can control how current your Bayes DB is (USE AUTO-EXPIRE) 
but the other two factors are never going to be perfect.
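For reference, auto-expiry is a single local.cf switch and is already on by 
default; it only needs attention if someone has turned it off:

  bayes_auto_expire 1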


RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread John Hardin

On Tue, 13 Feb 2018, Horváth Szabolcs wrote:


3. populate the ham database


That's the tricky part. As I mentioned earlier, I don't really want 
end-users involved in this.


You might be able to find a few that are somewhat technically competent 
and don't mind their ham samples being manually reviewed.



One more question: is there a recommended ham to spam ratio? 1:1?


I suggest "try to match your ham:spam ratio at your MTA before filtering", 
but others may have different advice. Generally: the more *reliable* data 
you can feed Bayes, the better it does.


I'm thinking that if you see my "populating the ham database 
automatically with the outgoing emails" idea as complete nonsense, then 
I would find sysadmin resources to collect 2000 legit emails and train 
those mails as ham, but I cannot allocate 2 work-hours/day for months. 
(Also, I'm not sure if 2000 legit emails are enough for training.)


2000 is enough to start, but it would have to be ongoing as the nature of 
mail changes over time.


Generally training on misclassifications is what you do after the initial 
training. So if a ham drops into a user's quarantine folder, you'd want to 
train that as ham.
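In sa-learn terms that correction step is just the following (mailbox paths 
are illustrative; add --dbpath if Bayes lives under the amavisd user, as in 
this thread):

  sa-learn --ham  --mbox /path/to/misfiled-ham.mbox
  sa-learn --spam --mbox /path/to/missed-spam.mbox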


--
 John Hardin KA7OHZ                       http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Windows Genuine Advantage (WGA) means that now you use your
  computer at the sufferance of Microsoft Corporation. They can
  kill it remotely without your consent at any time for any reason;
  it also shuts down in sympathy when the servers at Microsoft crash.
---
 9 days until George Washington's 286th Birthday

Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Benny Pedersen

John Hardin wrote on 2018-02-14 02:28:

Properly training your Bayes and increasing the score for BAYES_80, 
BAYES_95, and BAYES_99

and BAYES_999


score BAYES_999 5000

/me hides, could not resist :=)


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread John Hardin

On Tue, 13 Feb 2018, David Jones wrote:

Properly training your Bayes and increasing the score for BAYES_80, BAYES_95, 
and BAYES_99


and BAYES_999


is the best bet on this one.
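Raising those scores is a handful of local.cf lines; the numbers below are 
illustrative local choices, not the shipped defaults, and since BAYES_999 
hits in addition to BAYES_99 it only needs a small increment:

  score BAYES_80   3.0
  score BAYES_95   3.8
  score BAYES_99   4.5
  score BAYES_999  1.0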



--
 John Hardin KA7OHZ                       http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Windows Genuine Advantage (WGA) means that now you use your
  computer at the sufferance of Microsoft Corporation. They can
  kill it remotely without your consent at any time for any reason;
  it also shuts down in sympathy when the servers at Microsoft crash.
---
 9 days until George Washington's 286th Birthday


Re: Malformed List-Id header

2018-02-13 Thread Kenneth Porter

On 2/4/2018 3:35 PM, Kenneth Porter wrote:
I've noticed quite a bit of spam lately with a malformed List-Id 
header. Most notably, the angle brackets are missing, but the contents 
of the angle brackets when present often don't look like a domain. No 
dots, for example.





It looks like the header has the same format as a To or From header, 
except that there's no local part or at-sign in the angle brackets. Is 
there some kind of test already available to validate the format of the 
to/from headers that I could adapt to validate the List-id header? I 
don't want to re-invent the wheel and create a new test if one already 
exists.


Ideally I'd like to extract the domain (minus the list name) and run 
that through the DNSBLs, as well.
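Something like this is what I have in mind, in plain SA rule syntax (the rule 
names, regex, and score are only a sketch, and it covers just the format 
check, not the DNSBL lookup on the extracted domain):

  header   __LOCAL_LISTID_PRESENT  exists:List-Id
  header   __LOCAL_LISTID_OK       List-Id =~ /<[^<>@\s]+\.[^<>@\s]+>/
  meta     LOCAL_LISTID_MALFORMED  __LOCAL_LISTID_PRESENT && !__LOCAL_LISTID_OK
  describe LOCAL_LISTID_MALFORMED  List-Id present but not shaped like <list.example.com>
  score    LOCAL_LISTID_MALFORMED  1.0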




Re: Email filtering theory and the definition of spam

2018-02-13 Thread Rupert Gallagher
Said the blind person...

Sent from ProtonMail Mobile

On Tue, Feb 13, 2018 at 21:03, @lbutlr wrote:

> On 13 Feb 2018, at 06:57, Rupert Gallagher wrote:
>> Not sure why you guys are still discussing RFCs, though,
>
> Because one person keeps insisting that RFC822 is the relevant active standard
> despite being shown multiple times that it's been obsoleted. Twice.
>
> --
> If you [Carrot] were dice, you'd always roll sixes. And the dice don't
> roll themselves. If it wasn't against everything he wanted to be true
> about the world, Vimes might just then have believed in destiny
> controlling people. And gods help the other people who were around when
> a big destiny was alive in the world, bending every poor bugger around
> itself...

Re: URIBL_BLOCKED

2018-02-13 Thread David B Funk

If you read that informational spamassassin wiki page referenced in that message
you'd know that it has nothing to do with querying a Russian RBL.

That Russian URI is what the URIBL query was looking up.
So your use of URIBL (via SpamAssassin) hit a usage threshold and the query was blocked.

Read that spamassassin wiki page for more information.
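Per that wiki page, the underlying fix is to send your DNS queries through 
your own recursive resolver rather than a shared or forwarding one. If you 
only want to silence the notice in the meantime, one local.cf line does it 
(this hides the symptom; it does not make the URIBL lookups work again):

  score URIBL_BLOCKED 0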


On Tue, 13 Feb 2018, @lbutlr wrote:


0.0 URIBL_BLOCKED  ADMINISTRATOR NOTICE: The query to URIBL was blocked.
                   See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block
                   for more information.
                   [URIs: cz-salda.ru]

So, I’ve never heard of cz-salda.ru, is that the RBL that is blocking me? If 
so, where is it listed in SA’s configuration (FreeBSD 11.1-RELEASE)? (tried a 
`grep salda.ru /usr/local/etc/mail/spamassassin/*` for no results)

Also, why would anything be checking a Russian RBL?

Supposedly I can disable this with a line like

Score RCVD_IN_ORBS 0

But “ORBS” wouldn’t be right and there’s nothing in the text above to indicate 
what it might be.





--
Dave Funk                                  University of Iowa
                                           College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{

URIBL_BLOCKED

2018-02-13 Thread @lbutlr
0.0 URIBL_BLOCKED  ADMINISTRATOR NOTICE: The query to URIBL was blocked.
                   See http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block
                   for more information.
                   [URIs: cz-salda.ru]

So, I’ve never heard of cz-salda.ru, is that the RBL that is blocking me? If 
so, where is it listed in SA’s configuration (FreeBSD 11.1-RELEASE)? (tried a 
`grep salda.ru /usr/local/etc/mail/spamassassin/*` for no results)

Also, why would anything be checking a Russian RBL?

Supposedly I can disable this with a line like

Score RCVD_IN_ORBS 0

But “ORBS” wouldn’t be right and there’s nothing in the text above to indicate 
what it might be.




Re: Email filtering theory and the definition of spam

2018-02-13 Thread @lbutlr
On 13 Feb 2018, at 06:57, Rupert Gallagher  wrote:
> Not sure why you guys are still discussing RFCs, though,

Because one person keeps insisting that RFC822 is the relevant active standard 
despite being shown multiple times that it’s been obsoleted. Twice.

-- 
If you [Carrot] were dice, you'd always roll sixes. And the dice don't
roll themselves. If it wasn't against everything he wanted to be true
about the world, Vimes might just then have believed in destiny
controlling people. And gods help the other people who were around when
a big destiny was alive in the world, bending every poor bugger around
itself...



RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Hello,

David Jones [mailto:djo...@ena.com] wrote:
> With non-English email flow, it's more challenging.  If no RBLs hit, then you
> really must train your Bayes properly, which requires some way to accurately
> determine the ham and spam.  You must keep a copy of the ham and spam corpora
> and be allowed to review suspicious email.

I really appreciate you taking the time to help with this.

Yes, I can confirm that we usually have issues with Hungarian spam. English 
spam is often caught by the default rules.

As far as I understand from today's discussion, I need to rebuild the Bayes 
database from scratch:

1. turn off autolearning

2. populate the spam database
The people behind the http://artinvoice.hu/spams/ site are doing excellent 
work; they publish caught spam in mbox format.
I checked: many of the spam e-mails that were sent in for investigation are 
in their mbox.

3. populate the ham database
That's the tricky part. As I mentioned earlier, I don't really want end-users 
involved in this. And I don't have the necessary resources to do it manually.
I assume I can hook something into the mail flow to copy all outgoing e-mail 
to a separate mailbox and - assuming every outgoing e-mail is ham - learn 
those messages.
Would that do it? (A command sketch follows below.)

End-users work in a heavily controlled environment (both technically and 
legally); in the last ten years we haven't seen spam sent from inside. That's 
why I would blindly trust outgoing e-mail as ham.
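Roughly what I have in mind for steps 2 and 3, as commands (the mbox paths 
are placeholders; --dbpath as used elsewhere in this thread):

  sa-learn --spam --mbox --dbpath /var/spool/amavisd/.spamassassin/ /path/to/artinvoice-spam.mbox
  sa-learn --ham  --mbox --dbpath /var/spool/amavisd/.spamassassin/ /path/to/outgoing-copies.mbox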

One more question: is there a recommended ham to spam ratio? 1:1? 

I'm thinking that if you see my "populating the ham database automatically 
with the outgoing emails" idea as complete nonsense, then I would find 
sysadmin resources to collect 2000 legit emails and train those mails as ham, 
but I cannot allocate 2 work-hours/day for months. (Also, I'm not sure if 
2000 legit emails are enough for training.)

Best regards,
  Szabolcs Horvath


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread David Jones

On 02/13/2018 11:45 AM, Horváth Szabolcs wrote:

Reindl Harald [mailto:h.rei...@thelounge.net] wrote:

I think I have no control over what is learnt automatically.

surely, don't do autolearning at all


This is a mail gateway for multiple companies. I'm not supposed to read 
e-mails on it, or to pick out mails that can be used for learning ham.
And I can't ask users to use a "ham" mailbox, because they are not IT experts; 
sometimes they have problems even with simple mail forwarding.



If you aren't allowed to check specific emails with a suspicious subject 
or that are reported as spam by your users, there's no way you can do 
your job of accurately filtering email.



Without autolearning and without the help of the end-users, I can't build a 
proper ham bayes database, can I?



SA's autolearning doesn't use the results from BAYES_* rules, since that 
could make incorrect training even worse, so you are going to have to 
build local rules or get help from RBLs and other SA plugins to reach the 
autolearning thresholds.


With non-English email flow, it's more challenging.  If no RBLs hit, 
then you really must train your Bayes properly, which requires some way 
to accurately determine the ham and spam.  You must keep a copy of the 
ham and spam corpora and be allowed to review suspicious email.


Can you set up a split copy of the email that can redact the recipient or 
anonymize it enough to allow for review?  If not, your filtering is not 
going to be accurate.



Best regards
   Szabolcs



--
David Jones


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread David Jones

On 02/13/2018 11:24 AM, Horváth Szabolcs wrote:

Hello,

David Jones [mailto:djo...@ena.com]  wrote:

There should be many more rule hits than just these 3.  It looks like
network tests aren't happening.
Can you post the original email to pastebin.com with minimal redacting
so the rest of us can run it through our SA to see how it scores to help
with suggestions?


Thanks for taking time to answer. Here it is: https://pastebin.com/5XZ5kbus



My SA instance would have blocked it but the 2 rules that did it won't 
apply to your mail flow based on language and non-US relays.


Properly training your Bayes and increasing the score for BAYES_80, 
BAYES_95, and BAYES_99 is the best bet on this one.  It might take some 
local content rules but I can't read the subject or body.  :)



Content analysis details:   (10.2 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 5.2 BAYES_99               BODY: Bayes spam probability is 99 to 100%
                            [score: 0.9926]
 0.0 HTML_IMAGE_RATIO_08    BODY: HTML has a low ratio of text to image area
 2.8 UNWANTED_LANGUAGE_BODY BODY: Message written in an undesired language
 0.0 HTML_MESSAGE           BODY: HTML included in message
 2.2 ENA_RELAY_NOT_US       Relayed from outside the US and not on whitelists
 0.0 ENA_BAD_SPAM           Spam hitting really bad rules.


This brings up a good point that we need help with non-English 
masscheckers and SA rules.


The sending mail server 79.96.0.147 is not listed on any major RBLs and 
it has proper FCrDNS.  I can't tell the envelope-from domain but it must 
not have an SPF record.  Definitely no DMARC record for fiok.com.


The "IdeaSmtpServer" string might be something to investigate: check its 
relationship to spam to see if it's an indicator worthy of a local rule.


The domain in the Message-ID might be worth checking with other spam to 
see if that is a pattern worth a local rule.


If there are unique body phrases or misspellings, then that is 
definitely something to put into a local rule to add a point or two in 
the future.
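Such checks are one-liners in a local .cf file; the rule names, domain, and 
phrase below are purely illustrative placeholders:

  header LOCAL_SUSP_MSGID   Message-ID =~ /\@mail\.example\.test>/i
  score  LOCAL_SUSP_MSGID   1.0
  body   LOCAL_SUSP_PHRASE  /recurring misspelled phrase/i
  score  LOCAL_SUSP_PHRASE  1.5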



I suspect there needs to be some MTA tuning in front of SA along with
some SA tuning that is mentioned on this list every couple of months --
add extra RBLs, add KAM.cf, enable some SA plugins, etc.


Oops, I'm a new member on this list. Could you please tell us which 
customizations you mean?
I already looked at KAM.cf; it doesn't really help in this situation. We're 
already using a lot of RBLs.




--
David Jones


RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Reindl Harald [mailto:h.rei...@thelounge.net] wrote:
>> This is a mail gateway for multiple companies. I'm not supposed to read 
>> e-mails on that, or picking mails that can be used for learning ham
> 
> how did you then manage 1.4 Mio ham-samples in your biased corpus

It looks like this amavisd/spamassassin combination automatically learnt a 
lot of ham (which wasn't actually ham):

Feb 11 03:37:31 amavis[20024]: (20024-06) spam-tag,  -> 
, No, score=-0.099 tagged_above=- required=4 
tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, 
HTML_MESSAGE=0.001] autolearn=ham

I never configured autolearning; I assume it came with this CentOS setup. The 
spamassassin man page says bayes_auto_learn has a default value of 1.
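So to turn it off explicitly I would put this in local.cf (or wherever 
amavisd reads its SA configuration):

  bayes_auto_learn 0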

>> Without autolearning and without the help of the end-users, I can't build a 
>> proper ham bayes database, can I?
> surely, or don't you and people around you which can help don't send and 
> receive mails?

I don't want to get into this "fight", but end-users have limited IT 
knowledge. They are 100% Outlook users (forwarding inline vs. attached always 
confuses them).
If I really want this, I need a user-proof, one-click solution like Gmail's 
"spam" and "not spam" buttons, which magically saves e-mails to the proper 
technical mailbox (which is then reviewed by the admins and used to train SA).
With Outlook users and internal Exchange MTAs, my options are limited.

So, if I understood correctly, you all agree that the Bayesian database is 
f* up: start with a new one, with autolearn turned off, and train SA from 
scratch with both ham and spam mails.

Best regards
  Szabolcs


RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Reindl Harald [mailto:h.rei...@thelounge.net] wrote:
>> I think I have no control over what is learnt automatically.
> surely, don't do autolearning at all

This is a mail gateway for multiple companies. I'm not supposed to read 
e-mails on it, or to pick out mails that can be used for learning ham.
And I can't ask users to use a "ham" mailbox, because they are not IT experts; 
sometimes they have problems even with simple mail forwarding.

Without autolearning and without the help of the end-users, I can't build a 
proper ham bayes database, can I?

Best regards
  Szabolcs


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread John Hardin

On Tue, 13 Feb 2018, Horváth Szabolcs wrote:


After:

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.0 HTML_IMAGE_RATIO_08    BODY: HTML has a low ratio of text to image area
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5000]


BAYES_50 is "can't decide".



Version: spamassassin-3.3.2-4.el6.rfx.x86_64

$ sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/
0.000          0          3          0  non-token data: bayes db version
0.000          0     338770          0  non-token data: nspam
0.000          0    1460807          0  non-token data: nham


That ratio is really suspicious. I'd expect something closer to 1:1 or 
even a bit heavier on spam.


It *seems* that you have spam trained as ham; that would explain BAYES_50 
with that much in the BAYES database.



My questions are:
1) is there any chance to change spamassassin settings to mark similar messages 
as SPAM in the future?
BAYES_50 with 0.8 points is really, really low.


No, it's not. "BAYES_50" is "I can't decide" and increasing the score for 
that implies "I can't decide" means "spam". That's not justified.


Don't adjust the score of BAYES_50.

It's recommended (if possible) to retain the training corpora so that they 
can be reviewed and Bayes retrained from scratch if necessary.


Your admin is manually vetting user-submitted training messages. Are they 
retained after being trained?


You might consider reviewing the training corpus and retraining Bayes from 
scratch.
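A sketch of that reset, assuming the reviewed corpora are saved as mbox files 
(paths are illustrative) and using the amavisd Bayes path from earlier in the 
thread:

  sa-learn --clear --dbpath /var/spool/amavisd/.spamassassin/
  sa-learn --spam  --mbox --dbpath /var/spool/amavisd/.spamassassin/ /path/to/reviewed-spam.mbox
  sa-learn --ham   --mbox --dbpath /var/spool/amavisd/.spamassassin/ /path/to/reviewed-ham.mbox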



Another note: the "before" result:


Before: spamassassin -D -t 

...with *no* BAYES hits at all (not even BAYES_50) suggests your SA is 
*not* using the database whose statistics you reported above.


First: verify which Bayes database your SA install is using, and that it 
is the one you're training into and getting those stats from.
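One way to check (the message file name is a placeholder): run a scan with 
Bayes debugging enabled and compare the paths it reports with the dbpath you 
train into. Keep in mind amavisd may load SA with a different configuration 
and user than the command line.

  spamassassin -D bayes < sample.eml 2>&1 | grep -i 'bayes:'
  sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/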



--
 John Hardin KA7OHZ                       http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Maxim IX: Never turn your back on an enemy.
---
 9 days until George Washington's 286th Birthday

RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Hello,

David Jones [mailto:djo...@ena.com]  wrote:
> There should be many more rule hits than just these 3.  It looks like 
> network tests aren't happening.
> Can you post the original email to pastebin.com with minimal redacting 
> so the rest of us can run it through our SA to see how it scores to help 
> with suggestions?

Thanks for taking time to answer. Here it is: https://pastebin.com/5XZ5kbus

> I suspect there needs to be some MTA tuning in front of SA along with 
> some SA tuning that is mentioned on this list every couple of months -- 
> add extra RBLs, add KAM.cf, enable some SA plugins, etc.

Oops, I'm a new member on this list. Could you please tell us which 
customizations you mean?
I already looked at KAM.cf; it doesn't really help in this situation. We're 
already using a lot of RBLs.


> > It only assigns 0.8. (required_hits around 4.0)
> You are certainly free to set a local score higher if you want but that  is 
> probably not the main resolution to this issue.

I agree.

> > Version: spamassassin-3.3.2-4.el6.rfx.x86_64
> This is very old and no longer supported.  Why not upgrade to 3.4.x?

Because CentOS 6 ships with this version. When the infrastructure was built, 
CentOS 7 wasn't around yet. Migration between major versions is still not an 
easy thing to do.

> > My questions are:
> > 1) is there any chance to change spamassassin settings to mark similar 
> > messages as SPAM in the future?
> > BAYES_50 with 0.8 points is really, really low.
> > 
>
> You should be hitting BAYES_95, BAYES_99, and BAYES_999 on these really 
> bad emails with proper training which would give it a higher probability 
> and thus a higher score.

I agree. Can't wait to see what your results are on this e-mail.

Best regards
  Szabolcs Horvath


Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread David Jones

On 02/13/2018 07:55 AM, Horváth Szabolcs wrote:

Dear members,

A user repeatedly sends us spam messages to train SA.
Training - at the moment - requires manual intervention: an administrator 
verifies that it's really spam and then runs sa-learn.

Then the user thinks the process is done, and the next time the same email 
arrives, it will automatically be marked as spam.

However, that doesn't happen.

Before: spamassassin -D -t 

There should be many more rule hits than just these 3.  It looks like 
network tests aren't happening.


Can you post the original email to pastebin.com with minimal redacting 
so the rest of us can run it through our SA to see how it scores to help 
with suggestions?


I suspect there needs to be some MTA tuning in front of SA along with 
some SA tuning that is mentioned on this list every couple of months -- 
add extra RBLs, add KAM.cf, enable some SA plugins, etc.
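Which plugins are worth enabling is site-specific, but as an illustration, 
the plugins that ship with SA are enabled by uncommenting loadplugin lines in 
the *.pre files, e.g. (check that the matching client tools are installed):

  loadplugin Mail::SpamAssassin::Plugin::Razor2
  loadplugin Mail::SpamAssassin::Plugin::Pyzor
  loadplugin Mail::SpamAssassin::Plugin::AutoLearnThreshold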




It only assigns 0.8. (required_hits around 4.0)



You are certainly free to set a local score higher if you want but that 
is probably not the main resolution to this issue.




Version: spamassassin-3.3.2-4.el6.rfx.x86_64



This is very old and no longer supported.  Why not upgrade to 3.4.x?



$ sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/
0.000          0          3          0  non-token data: bayes db version
0.000          0     338770          0  non-token data: nspam
0.000          0    1460807          0  non-token data: nham
0.000          0     187804          0  non-token data: ntokens
0.000          0 1512318030          0  non-token data: oldest atime
0.000          0 1518524875          0  non-token data: newest atime
0.000          0 1518524876          0  non-token data: last journal sync atime
0.000          0 1518508126          0  non-token data: last expiry atime
0.000          0      43238          0  non-token data: last expire atime delta
0.000          0     136970          0  non-token data: last expire reduction count

I can clearly see that nspam increases after running sa-learn.

When I tried to understand what was happening, I found the following:
# https://wiki.apache.org/spamassassin/BayesInSpamAssassin
The Bayesian classifier in Spamassassin tries to identify spam by looking at 
what are called tokens; words or short character sequences that are commonly 
found in spam or ham. If I've handed 100 messages to sa-learn that have the 
phrase penis enlargement and told it that those are all spam, when the 101st 
message comes in with the words penis and enlargment, the Bayesian classifier 
will be pretty sure that the new message is spam and will increase the spam 
score of that message.


My questions are:
1) is there any chance to change spamassassin settings to mark similar messages 
as SPAM in the future?
BAYES_50 with 0.8 points is really, really low.



You should be hitting BAYES_95, BAYES_99, and BAYES_999 on these really 
bad emails with proper training which would give it a higher probability 
and thus a higher score.



I know that I'm able to write custom rules based on e-mail body content but I 
flattered myself that sa-learn would do that by manipulating the bayes database.



I suspect that after the MTA and SA are tuned, this would be blocked 
without requiring a local custom rule but I would need to see the rule 
hits on my SA platform before I could say for sure.  Sometimes it does 
require a header or body rule combined with other hits in a local custom 
meta rule to block them.



2) or should I tell users that the learning process doesn't necessarily mean 
that future messages will be flagged as SPAM?
Rather, it should be considered a "warning sign".

I appreciate any feedback on this.

I have already tried to find docs that answer these questions, but no luck so 
far. If you have good documentation, just send it to me. I love reading 
manuals.

Best regards,
   Szabolcs Horvath



--
David Jones


RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Reindl Harald [mailto:h.rei...@thelounge.net] wrote:

> > However, that doesn't happen.
> > 0.000  0 338770  0  non-token data: nspam
> > 0.000  0    1460807  0  non-token data: nham

> what do you expect when you train 4 times more ham than spam?
> frankly you "flooded" your bayes with 1.4 Mio ham-samples and i thought 
> our 140k total corpus is large - don't forget that ham messages are 
> typically larger than junk trying to point you with some words to a URL
> 
> 108897   SPAM
> 31492    HAM

This is a production mail gateway that has been in service since 2015. I saw 
that a few messages (both ham and spam) were automatically learned by 
amavisd/spamassassin. Today's statistics:

   3616 autolearn=ham
  10076 autolearn=no
   2817 autolearn=spam
    134 autolearn=unavailable

I think I have no control over what is learnt automatically.

Let's just assume for a moment that 1.4M ham-samples are valid.
Is there a ham:spam ratio I should stick to? I presume if we have a 1:1 
ratio then future messages won't be considered as spam as well.

Regards
  Szabolcs


Re: Email filtering theory and the definition of spam

2018-02-13 Thread Rupert Gallagher
Humans tend to confuse Science and Engineering, including professional 
journalists: their mistake does not change the facts, but certainly confuses 
the weaker minds.

Sent from ProtonMail Mobile

On Mon, Feb 12, 2018 at 08:49, Groach  
wrote:

> On 12/02/2018 06:54, Rupert Gallagher wrote:
>
>> A "standard" "obsoleted" by a "proposed standard" or a "draft standard" is 
>> nonsense. A standard is obsoleted by a new standard, not a draft or a 
>> proposal. RFC 821-822 are still the standard, until their obsoleting drafts 
>> and proposals become the new standard, and are clearly identified as such.
>>
>> Sent from ProtonMail Mobile
>
> As ever, though, whilst technically correct by definition, things are not so 
> black and white (humans tend to wander off the binary path that logic tends 
> to define and takes a short cut until a new path appears):
>
> https://tools.ietf.org/html/rfc7127#page-2
>
> Initially it was intended that most IETF technical specifications
>    would progress through a series of maturity stages starting with
>    Proposed Standard, then progressing to Draft Standard, then finally
>    to Internet Standard (see Section 6 of RFC 2026,
>    https://tools.ietf.org/html/rfc2026#section-6).  For a number of
>    reasons this progression is not common.  Many Proposed Standards are
>    actually deployed on the Internet and used extensively, as stable
>    protocols.  This proves the point that the community often deems it
>    unnecessary to upgrade a specification to Internet Standard.  Actual
>    practice has been that full progression through the sequence of
>    standards levels is typically quite rare, and most popular IETF
>    protocols remain at Proposed Standard.
>
> (Not sure why you guys are still discussing RFCs, though, my definition of 
> Spam (as in the thread title) is what I choose to define it for my business 
> or personal likes - I don't need any RFC telling me what I find annoying or 
> unwanted or will be binned/filtered).

Train SA with e-mails 100% proven spams and next time it should be marked as spam

2018-02-13 Thread Horváth Szabolcs
Dear members,

A user repeatedly sends us spam messages to train SA.
Training - at the moment - requires manual intervention: an administrator 
verifies that it's really spam and then runs sa-learn.

Then the user thinks the process is done, and the next time the same email 
arrives, it will automatically be marked as spam.

However, that doesn't happen.

Before: spamassassin -D -t 

When I tried to understand what was happening, I found the following:
# https://wiki.apache.org/spamassassin/BayesInSpamAssassin
The Bayesian classifier in Spamassassin tries to identify spam by looking at 
what are called tokens; words or short character sequences that are commonly 
found in spam or ham. If I've handed 100 messages to sa-learn that have the 
phrase penis enlargement and told it that those are all spam, when the 101st 
message comes in with the words penis and enlargment, the Bayesian classifier 
will be pretty sure that the new message is spam and will increase the spam 
score of that message.


My questions are: 
1) is there any chance to change spamassassin settings to mark similar messages 
as SPAM in the future?
BAYES_50 with 0.8 points is really, really low.

I know that I'm able to write custom rules based on e-mail body content but I 
flattered myself that sa-learn would do that by manipulating the bayes database.

2) or should I tell users that the learning process doesn't necessarily mean 
that future messages will be flagged as SPAM?
Rather, it should be considered a "warning sign".

I appreciate any feedback on this. 

I have already tried to find docs that answer these questions, but no luck so 
far. If you have good documentation, just send it to me. I love reading 
manuals.

Best regards,
  Szabolcs Horvath