Re: spamassassin and *compressed* Maildir

2021-05-21 Thread RW
On Fri, 21 May 2021 15:41:22 -0400
Clive Jacques wrote:

> I have a mail folder that I put false negatives in (i.e., spam which
> ends up in my inbox) and another for false positives (ham that ends
> up in my spam folder).  Each night I run sa-learn on each folder
> (sa-learn will munch on entire Maildirs) and also feed each message
> to spamassassin -r to report it.  So using zcat or gunzip -c will
> work for spamassassin -r, but not for sa-learn.

Note that spamassassin -r also trains Bayes, so reported messages are
learned as spam.


Re: spamassassin and *compressed* Maildir

2021-05-21 Thread Lucas Rolff
$ cat RMScaa8wVRnMfwqlQ0RxAzDjYGmIumlp1wlA8QNr8z.eml | sa-learn --spam
Learned tokens from 0 message(s) (1 message(s) examined)

Indeed does work from stdin
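
Since sa-learn accepts a message on stdin, a nightly script could decompress each Maildir file itself and pipe the result in. A rough Python sketch of that idea, assuming gzip-compressed files without a .gz extension; `read_message` and `feed_stdin` are hypothetical helper names, and the only sa-learn invocation assumed is the `sa-learn --spam` form shown above:

```python
import gzip
import subprocess
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_message(path):
    """Return the message bytes, decompressing if the file holds gzip data.

    Similar in spirit to `zcat -f`: plain files pass through unchanged.
    """
    data = Path(path).read_bytes()
    if data.startswith(GZIP_MAGIC):
        return gzip.decompress(data)
    return data


def feed_stdin(command, path):
    """Run `command` with the (decompressed) message on its stdin."""
    return subprocess.run(command, input=read_message(path),
                          capture_output=True, check=True)
```

For example, `feed_stdin(["sa-learn", "--spam"], some_file)` would hand one message to sa-learn. Starting one process per message is slow for thousands of files, so decompressing into a temporary directory and pointing sa-learn at that may still be the better fit for bulk training.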

- Lucas

From: Clive Jacques 
Date: Friday, 21 May 2021 at 21.41
To: "users@spamassassin.apache.org" 
Subject: Re: spamassassin and *compressed* Maildir

I have a mail folder that I put false negatives in (i.e., spam which ends up in 
my inbox) and another for false positives (ham that ends up in my spam folder). 
 Each night I run sa-learn on each folder (sa-learn will munch on entire 
Maildirs) and also feed each message to spamassassin -r to report it.  So using 
zcat or gunzip -c will work for spamassassin -r, but not for sa-learn.

Unless sa-learn can munch on stdin as well as files

-CJ

On Fri, May 21, 2021 at 3:28 PM Lucas Rolff <lu...@lucasrolff.com> wrote:
You can do `zcat -f` or `gunzip -c -f` and avoid having to have .gz extension, 
that way you can skip the rename step

Best Regards,
Lucas Rolff

From: Clive Jacques <westriverp...@gmail.com>
Date: Friday, 21 May 2021 at 21.04
To: "users@spamassassin.apache.org" <users@spamassassin.apache.org>
Subject: Re: spamassassin and *compressed* Maildir

That's confirmed.  sa-learn doesn't like compressed files.  I don't know if it 
will dine on compressed files with the correct extension (i.e., .gz).  
Unfortunately, when using compression with Maildir format, Dovecot doesn't seem 
to like to use extensions.  So, I copied the directory to a temporary location, 
decompressed the files and then set sa-learn on them.  Even getting gunzip to 
operate on the files was a pain because it only wants files with the .gz 
extension (so I had to rename all 6,000 of them first - using a utility like 
'rename').  I then did the same thing with about 9,000 hams.

There was much good news.  Learning proceeded at about the same pace, but syncing 
the journal to the database was much faster.  Maybe the tokens were smaller?  I 
verified that it seemed to work with --dump magic.

Then, all by itself, Spamassassin's bayes filtering was instantly much better.  
Stuff that was tripping BAYES_00 was suddenly popping BAYES_99.

Now, I just need to update my nightly learning/reporting script.

Still, a very nice result.

On Fri, May 21, 2021 at 11:30 AM Henrik K <h...@hege.li> wrote:
On Fri, May 21, 2021 at 10:54:54AM -0400, Clive Jacques wrote:
> Do spamassassin or sa-learn understand compressed files or compressed Maildir?

I believe sa-learn will automatically decompress if the files have .gz or
.bz2 extension, but yes Maildir files without extension will not work.

Should be easy to detect compressed Maildir files, perhaps file enhancement
request in bugzilla.


Re: spamassassin and *compressed* Maildir

2021-05-21 Thread Clive Jacques
I have a mail folder that I put false negatives in (i.e., spam which ends
up in my inbox) and another for false positives (ham that ends up in my
spam folder).  Each night I run sa-learn on each folder (sa-learn will
munch on entire Maildirs) and also feed each message to spamassassin -r to
report it.  So using zcat or gunzip -c will work for spamassassin -r, but
not for sa-learn.

Unless sa-learn can munch on stdin as well as files

-CJ

On Fri, May 21, 2021 at 3:28 PM Lucas Rolff  wrote:

> You can do `zcat -f` or `gunzip -c -f` and avoid having to have .gz
> extension, that way you can skip the rename step
>
>
>
> Best Regards,
>
> Lucas Rolff
>
>
>
> *From: *Clive Jacques 
> *Date: *Friday, 21 May 2021 at 21.04
> *To: *"users@spamassassin.apache.org" 
> *Subject: *Re: spamassassin and *compressed* Maildir
>
>
>
> That's confirmed.  sa-learn doesn't like compressed files.  I don't know
> if it will dine on compressed files with the correct extension (i.e.,
> .gz).  Unfortunately, when using compression with Maildir format, Dovecot
> doesn't seem to like to use extensions.  So, I copied the directory to a
> temporary location, decompressed the files and then set sa-learn on them.
> Even getting gunzip to operate on the files was a pain because it only
> wants files with the .gz extension (so I had to rename all 6,000 of them
> first - using a utility like 'rename').  I then did the same thing with
> about 9,000 hams.
>
>
>
> There was much good news.  Learning proceeded at about the same pace, but
> syncing the journal to the database was *much *faster.  Maybe the tokens
> were smaller?  I verified that it seemed to work with --dump magic.
>
>
>
> Then, all by itself, Spamassassin's bayes filtering was instantly much
> better.  Stuff that was tripping BAYES_00 was suddenly popping BAYES_99.
>
>
>
> Now, I just need to update my nightly learning/reporting script.
>
>
>
> Still, a very nice result.
>
>
>
> On Fri, May 21, 2021 at 11:30 AM Henrik K  wrote:
>
> On Fri, May 21, 2021 at 10:54:54AM -0400, Clive Jacques wrote:
> > Do spamassassin or sa-learn understand compressed files or compressed
> Maildir?
>
> I believe sa-learn will automatically decompress if the files have .gz or
> .bz2 extension, but yes Maildir files without extension will not work.
>
> Should be easy to detect compressed Maildir files, perhaps file enhancement
> request in bugzilla.
>
>


Re: spamassassin and *compressed* Maildir

2021-05-21 Thread Lucas Rolff
You can do `zcat -f` or `gunzip -c -f` and avoid having to have .gz extension, 
that way you can skip the rename step

Best Regards,
Lucas Rolff

From: Clive Jacques 
Date: Friday, 21 May 2021 at 21.04
To: "users@spamassassin.apache.org" 
Subject: Re: spamassassin and *compressed* Maildir

That's confirmed.  sa-learn doesn't like compressed files.  I don't know if it 
will dine on compressed files with the correct extension (i.e., .gz).  
Unfortunately, when using compression with Maildir format, Dovecot doesn't seem 
to like to use extensions.  So, I copied the directory to a temporary location, 
decompressed the files and then set sa-learn on them.  Even getting gunzip to 
operate on the files was a pain because it only wants files with the .gz 
extension (so I had to rename all 6,000 of them first - using a utility like 
'rename').  I then did the same thing with about 9,000 hams.

There was much good news.  Learning proceeded at about the same pace, but syncing 
the journal to the database was much faster.  Maybe the tokens were smaller?  I 
verified that it seemed to work with --dump magic.

Then, all by itself, Spamassassin's bayes filtering was instantly much better.  
Stuff that was tripping BAYES_00 was suddenly popping BAYES_99.

Now, I just need to update my nightly learning/reporting script.

Still, a very nice result.

On Fri, May 21, 2021 at 11:30 AM Henrik K <h...@hege.li> wrote:
On Fri, May 21, 2021 at 10:54:54AM -0400, Clive Jacques wrote:
> Do spamassassin or sa-learn understand compressed files or compressed Maildir?

I believe sa-learn will automatically decompress if the files have .gz or
.bz2 extension, but yes Maildir files without extension will not work.

Should be easy to detect compressed Maildir files, perhaps file enhancement
request in bugzilla.


Re: spamassassin and *compressed* Maildir

2021-05-21 Thread Clive Jacques
That's confirmed.  sa-learn doesn't like compressed files.  I don't know if
it will dine on compressed files with the correct extension (i.e., .gz).
Unfortunately, when using compression with Maildir format, Dovecot doesn't
seem to like to use extensions.  So, I copied the directory to a temporary
location, decompressed the files and then set sa-learn on them.  Even
getting gunzip to operate on the files was a pain because it only wants
files with the .gz extension (so I had to rename all 6,000 of them first -
using a utility like 'rename').  I then did the same thing with about 9,000
hams.

There was much good news.  Learning proceeded at about the same pace, but
syncing the journal to the database was *much *faster.  Maybe the tokens
were smaller?  I verified that it seemed to work with --dump magic.

Then, all by itself, Spamassassin's bayes filtering was instantly much
better.  Stuff that was tripping BAYES_00 was suddenly popping BAYES_99.

Now, I just need to update my nightly learning/reporting script.

Still, a very nice result.

On Fri, May 21, 2021 at 11:30 AM Henrik K  wrote:

> On Fri, May 21, 2021 at 10:54:54AM -0400, Clive Jacques wrote:
> > Do spamassassin or sa-learn understand compressed files or compressed
> Maildir?
>
> I believe sa-learn will automatically decompress if the files have .gz or
> .bz2 extension, but yes Maildir files without extension will not work.
>
> Should be easy to detect compressed Maildir files, perhaps file enhancement
> request in bugzilla.
>
>


Re: Detect Emoticons in Subject

2021-05-21 Thread RW
On Thu, 20 May 2021 19:39:06 +0100
RW wrote:

> /\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xBF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/


This includes the block mentioned by Bill Cole and is simplified a bit


/\xF0\x9F[\x98-\x99\xA4-\xA7\x8C-\x97][\x80-\xBF]|\xE2\x98[\xB9-\xBB]/


However, if you don't expect to get any legitimate mail with Asian
languages in the subject, you can probably get away with including all
4-byte UTF-8. Those code points are dominated by CJK, symbols, emojis
and dead languages.


/[\xF0-\xF7][\x80-\xBF]{3}|\xE2\x98[\xB9-\xBB]/
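
To see what byte-level patterns like these match, here is a small Python sketch (Python rather than SpamAssassin's Perl, purely for illustration). `EMOTICONS` is the Emoticons-block part of the longer pattern quoted earlier in this message and `BROAD` is the pattern just above; the variable names and sample subjects are made up:

```python
import re

# U+1F600..U+1F64F (the Emoticons block) as UTF-8 byte sequences.
EMOTICONS = re.compile(rb"\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])")
# Any 4-byte UTF-8 sequence, or U+2639..U+263B.
BROAD = re.compile(rb"[\xF0-\xF7][\x80-\xBF]{3}|\xE2\x98[\xB9-\xBB]")


def subject_bytes(subject):
    """Encode a subject the way a byte-oriented matcher would see it."""
    return subject.encode("utf-8")


# Sample subjects: an emoticon, 3-byte CJK characters, and plain ASCII.
for text in ("Win big \U0001F600", "\u4f60\u597d", "plain ASCII"):
    raw = subject_bytes(text)
    print(repr(text), bool(EMOTICONS.search(raw)), bool(BROAD.search(raw)))
```

Common CJK characters encode to three bytes in UTF-8, so the broad pattern leaves them alone; only code points above U+FFFF need four bytes.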


Re: spamassassin and *compressed* Maildir

2021-05-21 Thread Henrik K
On Fri, May 21, 2021 at 10:54:54AM -0400, Clive Jacques wrote:
> Do spamassassin or sa-learn understand compressed files or compressed Maildir?

I believe sa-learn will automatically decompress if the files have .gz or
.bz2 extension, but yes Maildir files without extension will not work.

Should be easy to detect compressed Maildir files, perhaps file enhancement
request in bugzilla.
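
Detecting compression from the file contents rather than the extension might look roughly like the following Python sketch. This is not SpamAssassin code; `load_message` and `maildir_messages` are hypothetical names, and the check relies only on the well-known leading bytes of gzip and bzip2 data:

```python
import bz2
import gzip
from pathlib import Path

# Leading bytes that identify each format: gzip starts with 1f 8b,
# bzip2 with the ASCII characters "BZh".
MAGIC = {
    b"\x1f\x8b": gzip.decompress,
    b"BZh": bz2.decompress,
}


def load_message(path):
    """Return message bytes, decompressing when the content looks compressed."""
    data = Path(path).read_bytes()
    for magic, decompress in MAGIC.items():
        if data.startswith(magic):
            return decompress(data)
    return data  # assume an ordinary uncompressed message


def maildir_messages(maildir):
    """Yield (path, bytes) for every file in the Maildir's cur/ and new/."""
    for sub in ("cur", "new"):
        folder = Path(maildir) / sub
        if folder.is_dir():
            for path in sorted(folder.iterdir()):
                if path.is_file():
                    yield path, load_message(path)
```

A stored message normally begins with a header line, so a false match on these leading bytes should be rare, but that is an assumption rather than a guarantee.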



spamassassin and *compressed* Maildir

2021-05-21 Thread Clive Jacques
Do spamassassin or sa-learn understand compressed files or compressed
Maildir?

I've been running spamassassin on my Ubuntu mail server for years very
successfully.  Recently, I've been experiencing a lot of difficulty and I'm
trying to figure it out.  Earlier this year we upgraded the server from
Trusty Tahr to Xenial (long time coming!) and some other stuff got
upgraded as well.  We run an IMAP server with Dovecot against a Maildir
formatted message store.  I noticed the message store was taking a fair
amount of space, so I decided to compress it with zlib (gz compression).

Pretty much since the upgrade (and simultaneous switch to compressed
Maildir) spamassassin has been doing a much worse job.  I upgraded from the
distribution version of spamassassin (3.4.2) to the most recent version
(3.4.6) but no real joy.  I keep a 'learn spam' folder to put false
negatives in (stuff that makes it into my inbox which ought not to), and
every night, run sa-learn on it and also spamassassin -r to report it.  I
started noticing that DCC was complaining on report that "missing message
body; fatal error".

I ran spamassassin -d -r to see what was happening and noticed that it
interacted with dcc using dccproc.  Maybe dccproc doesn't understand
compressed mail?  Well, if it doesn't then perhaps sa-learn doesn't
either.  That might explain why my bayes rules don't seem to be working
very well despite retraining.

-CJ


Re: KAM_SENDGRID and SPF_HELO_NONE

2021-05-21 Thread Kevin A. McGrail
Interesting for sure.  For me I saw the issue start to really get noticed
last February.

I think there might be correlation with a hack on their platform too.

I reached out to Twilio leadership with nothing but crickets too.

Here is a great cyber security reporter and an article from August 2020:
https://krebsonsecurity.com/2020/08/sendgrid-under-siege-from-hacked-accounts/

What's amazing to me is how much they've done to fix the problem. Oh wait,
they've done nothing...

-KAM


On Fri, May 21, 2021, 08:28 Jared Hall  wrote:

> Kevin A. McGrail wrote:
> > And that rule is probably designed to hit legitimate sendgrid emails.
> >
> > They have become a hacker and spammer haven over the last year and a
> > half approximately.
> >
> Damned straight.  I'd say more like 2.5 years, maybe 1.5 pre-pandemic
> years.
>
> SendGrid -> novel (at the time) Positive Delivery company.
> SendGrid -> API opens up for quasi-spam/newsletter delivery.
> SendGrid -> adds support for smaller ISPs and their infected customers.
>
> For my part, I made some changes to my rules in CHAOS to differentiate
> between the occurrence of a SendGrid header versus encapsulated SendGrid
> headers like you'll get when larger mail systems populate the References
> header for forwarding. Respectively, the rules set are JR_SGRID_DIRECT
> and JR_SGRID_FWD. At least that seems to be a little more effective for
> Comcast and BellSouth mail systems.
>
> You just haven't lived until you've seen endless mailserver rejects
> issued to SendGrid and SendGrid Partners  who are sending you Aaron
> Smith Sextortions or Emotet variants.   If I'm a hostile, nation-state
> actor,  I probably already have an account with SendGrid.
>
> Nobody should be using SendGrid; NEVER, EVER.  One thing is certain, if
> this matter is NOT addressed by the mail admins on this list, it WILL BE
> addressed by the US Department of Commerce.
>
> What started out as an interesting project has become a National
> Security risk.
>
>
> -- Jared Hall
>
>
>
>
>
>
>


Re: KAM_SENDGRID and SPF_HELO_NONE

2021-05-21 Thread Jared Hall

Kevin A. McGrail wrote:
> And that rule is probably designed to hit legitimate sendgrid emails.
>
> They have become a hacker and spammer haven over the last year and a
> half approximately.



Damned straight.  I'd say more like 2.5 years, maybe 1.5 pre-pandemic years.

SendGrid -> novel (at the time) Positive Delivery company.
SendGrid -> API opens up for quasi-spam/newsletter delivery.
SendGrid -> adds support for smaller ISPs and their infected customers.

For my part, I made some changes to my rules in CHAOS to differentiate 
between the occurrence of a SendGrid header versus encapsulated SendGrid 
headers like you'll get when larger mail systems populate the References 
header for forwarding. Respectively, the rules set are JR_SGRID_DIRECT 
and JR_SGRID_FWD. At least that seems to be a little more effective for 
Comcast and BellSouth mail systems.


You just haven't lived until you've seen endless mailserver rejects 
issued to SendGrid and SendGrid Partners  who are sending you Aaron 
Smith Sextortions or Emotet variants.   If I'm a hostile, nation-state 
actor,  I probably already have an account with SendGrid.


Nobody should be using SendGrid; NEVER, EVER.  One thing is certain, if 
this matter is NOT addressed by the mail admins on this list, it WILL BE 
addressed by the US Department of Commerce.


What started out as an interesting project has become a National 
Security risk.



-- Jared Hall








Re: KAM_SENDGRID and SPF_HELO_NONE

2021-05-21 Thread Matus UHLAR - fantomas

> Perhaps it's because Return-Path is null?
> Return-Path: <>

That's a different problem, apparently with your MTA->SA glue. The fact
that something added a non-null "X-Envelope-From:" header and something
(else?) added a null "Return-Path:" header indicates fundamental
breakage. Whether SA is seeing that or if it is a delivery artifact is
unclear.


On 20.05.21 18:24, Alex wrote:

Perhaps this is a problem with my amavis configuration? It appears all
quarantined messages have a null Return-Path header.


I have checked 2 of my installations and both have Return-Path equivalent to
X-Envelope-From: and recipients in X-Envelope-To:

I assume amavis only uses X-Envelope-* when picking mail from quarantine and
that Return-Path is not important.

Why it's empty, no idea.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Remember half the people you know are below average.


Re: Detect Emoticons in Subject

2021-05-21 Thread Henrik K
On Fri, May 21, 2021 at 09:53:36AM +0200, Tom Hendrikx wrote:
>
> Can someone explain why SA cannot support this type of syntax, or what would
> be needed to get it supported? IMHO it makes it a lot easier for end-users
> to understand a rule, and for rule developers to write or even contribute
> new UTF-8-related rules, so it might be worth the effort to get it
> supported?

Perl strings internally would have to be UTF8.  Mandatory prerequisite would
be normalize_charset 1 in SA.  Could be some cases where SA can't decode
mails properly to UTF8, so it's a question mark what happens then.
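
The general distinction can be illustrated outside of SpamAssassin. In this Python sketch (not SA's Perl internals), a character-level match only makes sense on decoded text, and something has to be decided for input that does not decode. `has_emoticon` is a hypothetical helper, and falling back to a byte-level pattern is just one possible choice:

```python
import re

# Character-level pattern: only meaningful on decoded text.
EMOTICON_CHARS = re.compile("[\U0001F600-\U0001F64F]")
# Byte-level equivalent (UTF-8) for input that could not be decoded.
EMOTICON_BYTES = re.compile(rb"\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])")


def has_emoticon(raw, charset="utf-8"):
    """Match on characters when `raw` decodes cleanly, else fall back to bytes."""
    try:
        text = raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Undecodable data or unknown charset: the "question mark" case.
        return bool(EMOTICON_BYTES.search(raw))
    return bool(EMOTICON_CHARS.search(text))
```

Any rule engine that switches to character semantics has to pick some behaviour for that except branch, which is part of why backwards compatibility needs thought.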

Some changes are coming already in 4.0, for example normalize_charset 1 will
be default.  But more complex internal/rule changes require a lot of thought
on how to maintain backwards compatibility.  I'm sure some people will still
run 3.4 for years to come.

Sorry to say but there are too few developers right now.  It's up to the
community to pick up the pace.



Re: Detect Emoticons in Subject

2021-05-21 Thread Tom Hendrikx

On 20-05-2021 18:19, RW wrote:
> On Thu, 20 May 2021 11:42:59 -0400
> Clive Jacques wrote:
>
> > Hi,
> >
> > I've been using SA a long time.  Lately, I'm getting more and more
> > spam with emoticons in the subject line.  I'd say about 90% of my
> > emails with emoticons in the subject are spam.  I'd like to create a
> > local rule which scores email with emoticons in the subject.
> >
> > # Local Rule for Emoticons in subject
> > subject EMOTICON_IN_SUBJECT  Subject =~ /\p{Emoticons}/
>
> The rule should start with "header", that's what's causing the lint
> failure.
>
> However, AFAIK, the rule still won't work because \p{Emoticons}
> isn't supported in spamassassin, which works on byte sequences. You
> need to rewrite it to match UTF-8 bytes.



I'm not a real fan of very complex regular expressions, as they tend to 
get hard to read/understand very quickly. This thread is a perfect 
example: the syntax that the OP proposed (/\p{Emoticons}/) seems 
perfectly readable, and all the actually working alternatives are, with 
all respect to the authors, a nightmare to decipher. Especially for 
users not really proficient in regular expressions, the OP's syntax is 
perfectly understandable and all the alternatives aren't.


I'm not really into the regex engine of perl/SA, so please correct if 
I'm wrong. The /\p{Emoticons}/ syntax seems to me a builtin feature of 
the regex spec/perl (as opposed to pseudo-code, displaying something 
that actually doesn't exist).


Can someone explain why SA cannot support this type of syntax, or what 
would be needed to get it supported? IMHO it makes it a lot easier for 
end-users to understand a rule, and for rule developers to write or even 
contribute new UTF-8-related rules, so it might be worth the effort to 
get it supported?


Thanks in advance,
Tom