Re: BayesStore MariaDB on EL9

2024-06-18 Thread Gerald Vogt

On 18.06.24 22:23, Bill Cole wrote:

On 2024-06-18 at 14:58:15 UTC-0400 (Tue, 18 Jun 2024 20:58:15 +0200)
Gerald Vogt is rumored to have said:


Hi,

for a test, I have increased the column length of token to binary(32) 
and used a test file to import containing a single token.


This time it went through. However, as I suspected, the token length 
is not 5 bytes. Token line from backup:


t    1    0    1718024618    027121926a

Hex representation of content in database:

MariaDB [spamassassin]> select hex(token) from bayes_token\G
*** 1. row ***
hex(token): 027121C2926A

1 row in set (0.000 sec)

Compared:

Original 02 71 21    92 6a
Database 02 71 21 C2 92 6A

C2 92 is the UTF-8 encoding of U+0092, thus basically the token is 
written in UTF-8 into the database.


That's odd... What is the character set of the database?


It is standard DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci 
just like the table.


Running sa-learn with DBI_TRACE=2 I can also see that it looks like it 
actually has the UTF-8 encoding already in there during parameter 
binding:


Binding parameters: INSERT INTO bayes_token
   (id, token, spam_count, ham_count, atime)
   VALUES ('43','^Bq!j','1','0','1718024618')
   ON DUPLICATE KEY UPDATE spam_count = GREATEST(spam_count + '1', 0),
   ham_count = GREATEST(ham_count + '0', 0),
   atime = GREATEST(atime, '1718024618')


Thus, I would say it's not an issue with the database.

Any idea?

Running spamassassin-3.4.6-5.el9.x86_64 on AlmaLinux 9.4.


First: upgrade to 4.0.1


Well, it's the RHEL packaged version. I don't really want to upgrade to 
a manually handled version.


There were substantial changes in how encoding was handled between 3.4.6 
and 4.0, and there is a substantial likelihood that any problem with 
encoding would not occur in 4.0 or later.


Yes, you are right. It works with 4.0.1.

I have looked into the source code and the reason became obvious pretty 
quickly, e.g. the part in _put_token in 3.4.6


https://github.com/apache/spamassassin/blob/4a1fe99da9296364be0c50f02d2a73b5af74207a/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L827

compared with this in trunk

https://github.com/apache/spamassassin/blob/8307bb22a7709125ab0f8e94fb7a271461944f61/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L997

4.0 explicitly binds the token as BINARY, while the default bind type is 
VARCHAR, I think. Thus, in 3.4.6 the token gets encoded automatically.
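
A rough analogy for why the bind type matters, sketched in Python with the standard sqlite3 module rather than Perl DBI and MariaDB (so this is not the SpamAssassin code, only an illustration of the same effect). The helper name stored_hex is made up for this example. The same five bytes are stored once as a text value and once as a binary value:

```python
import sqlite3

def stored_hex(value):
    """Insert one value into a throwaway table and return hex() of what was stored."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE bayes_token (token BLOB)")
        conn.execute("INSERT INTO bayes_token (token) VALUES (?)", (value,))
        (hexed,) = conn.execute("SELECT hex(token) FROM bayes_token").fetchone()
        return hexed
    finally:
        conn.close()

raw = bytes.fromhex("027121926a")   # the token from the backup file
as_text = raw.decode("latin-1")     # same bytes, but handled as a character string

print(stored_hex(as_text))  # bound as text: encoded as UTF-8 on the way in
print(stored_hex(raw))      # bound as binary: stored byte for byte
```

The text value ends up one byte longer because 0x92 becomes C2 92, which is the same expansion seen in the MariaDB table.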


This was added in

https://github.com/apache/spamassassin/commit/3dd8ea4ff51d50a72212ac8cbb2f6f8d443c3489

I'll open a bug with Red Hat and see if they either upgrade spamassassin 
in EL9 or backport something into 3.4.6.


Just for the fun of it, I have replaced the packaged file with the 4.0.1 
MySQL.pm file and then it works. Looking at the commit and the commit 
history after, I think the 4.0.1 MySQL.pm should work just fine in 3.4.6.


Anyway, we'll see what Red Hat does about this.

Thanks a lot!

Regards,

Gerald


Docs confusion and missing dependency on EL9

2024-06-18 Thread Gerald Vogt

Hi,

for testing I tried to install spamassassin 4.0.1 on EL9 (AlmaLinux 
9.4). I have noticed some dependencies are not mentioned on the INSTALL 
page:


I have had to install perl-ExtUtils-MakeMaker.noarch to run Makefile.PL
I have had to install perl-Archive-Tar.noarch to run sa-update.

Neither of those two is mentioned anywhere.

It also took me a while to find the instructions on how to install.

I started at https://spamassassin.apache.org/index.html

where "Click here to get started using SpamAssassin! " looked promising.

But at

https://cwiki.apache.org/confluence/display/SPAMASSASSIN/StartUsing

I have spent considerable time looking for where to download and how to 
actually install spamassassin, but eventually gave up. Only now have I 
found some instructions on the SingleUserUnixInstall page.


So I have circled back and checked the Download link from the top. There 
I can download the tar, get hints on Upgrading but still nothing on 
installation.


The Wiki and FAQ links from the top are not helpful either.

So eventually, I have found it on "Docs", pointing to the INSTALL file.

From experience, that is not really the first place I would look.

I would think the "Get Started" page should have a link to the Download 
and INSTALL page at the beginning. Downloading and installing seem to be 
the obvious first steps to get started.


The Download page should have a link for INSTALL like it already has for 
the Upgrade.


And I would say "Where to download" and "How to install" are pretty 
common FAQs, too.


I hope this helps.

Thanks,

Gerald





Re: BayesStore MariaDB on EL9

2024-06-18 Thread Bill Cole

On 2024-06-18 at 14:58:15 UTC-0400 (Tue, 18 Jun 2024 20:58:15 +0200)
Gerald Vogt is rumored to have said:


Hi,

for a test, I have increased the column length of token to binary(32) 
and used a test file to import containing a single token.


This time it went through. However, as I suspected, the token length 
is not 5 bytes. Token line from backup:


t   1   0   1718024618  027121926a

Hex representation of content in database:

MariaDB [spamassassin]> select hex(token) from bayes_token\G
*** 1. row ***
hex(token): 027121C2926A

1 row in set (0.000 sec)

Compared:

Original 02 71 21    92 6a
Database 02 71 21 C2 92 6A

C2 92 is the UTF-8 encoding of U+0092, thus basically the token is 
written in UTF-8 into the database.


That's odd... What is the character set of the database?

Running sa-learn with DBI_TRACE=2 I can also see that it looks like it 
actually has the UTF-8 encoding already in there during parameter 
binding:


Binding parameters: INSERT INTO bayes_token
   (id, token, spam_count, ham_count, atime)
   VALUES ('43','^Bq!j','1','0','1718024618')
   ON DUPLICATE KEY UPDATE spam_count = GREATEST(spam_count + '1', 0),
   ham_count = GREATEST(ham_count + '0', 0),
   atime = GREATEST(atime, '1718024618')


Thus, I would say it's not an issue with the database.

Any idea?

Running spamassassin-3.4.6-5.el9.x86_64 on AlmaLinux 9.4.


First: upgrade to 4.0.1

There were substantial changes in how encoding was handled between 3.4.6 
and 4.0, and there is a substantial likelihood that any problem with 
encoding would not occur in 4.0 or later.


I don't know exactly what the cause of the problem is (i.e. why is SA 
trying to write UTF-8 to the database?) but I'm quite sure that an 
official fix for 3.4.x will never happen.






Thanks,

Gerald

On 18.06.24 17:09, Gerald Vogt wrote:

Hi!

I am trying to use a mariadb database as bayesstore, but it fails to 
load tokens. Whenever it tries to insert something into bayes_token 
it fails with an error


dbg: bayes: _put_token: SQL error: Data too long for column 'token' at row 1


The table has been created as mentioned in

https://github.com/apache/spamassassin/blob/trunk/sql/bayes_mysql.sql

but the 5 byte binary isn't big enough. I have tried with sa-learn 
--restore as well as learning some spam mails. bayes_token remains 
empty.


MariaDB [spamassassin]> show create table bayes_token\G
*** 1. row ***
    Table: bayes_token
Create Table: CREATE TABLE `bayes_token` (
   `id` int(11) NOT NULL DEFAULT 0,
   `token` binary(5) NOT NULL,
   `spam_count` int(11) NOT NULL DEFAULT 0,
   `ham_count` int(11) NOT NULL DEFAULT 0,
   `atime` int(11) NOT NULL DEFAULT 0,
   PRIMARY KEY (`id`,`token`),
   KEY `bayes_token_idx1` (`id`,`atime`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci
1 row in set (0.000 sec)

Any idea what goes wrong here?

Thanks,

Gerald





--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo@toad.social and many *@billmail.scconsult.com 
addresses)

Not Currently Available For Hire


Re: BayesStore MariaDB on EL9

2024-06-18 Thread Gerald Vogt

Hi,

for a test, I have increased the column length of token to binary(32) 
and used a test file to import containing a single token.


This time it went through. However, as I suspected, the token length is 
not 5 bytes. Token line from backup:


t   1   0   1718024618  027121926a

Hex representation of content in database:

MariaDB [spamassassin]> select hex(token) from bayes_token\G
*** 1. row ***
hex(token): 027121C2926A
1 row in set (0.000 sec)

Compared:

Original 02 71 21    92 6a
Database 02 71 21 C2 92 6A

C2 92 is the UTF-8 encoding of U+0092, thus basically the token is 
written in UTF-8 into the database.
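
The expansion from five to six bytes can be reproduced with a short Python sketch. It assumes that each byte is treated as the character with the same code point (Latin-1) and then encoded as UTF-8; the function name reencode_as_utf8 is made up for this illustration and is not part of SpamAssassin:

```python
def reencode_as_utf8(token_hex: str) -> str:
    """Treat each byte as a Latin-1 character and re-encode the result as UTF-8."""
    raw = bytes.fromhex(token_hex)
    return raw.decode("latin-1").encode("utf-8").hex().upper()

original = "027121926a"                 # token as found in the backup file
stored = reencode_as_utf8(original)

print(stored)                           # 027121C2926A
print(len(bytes.fromhex(original)), "->", len(bytes.fromhex(stored)), "bytes")
```

Only bytes above 0x7F grow, so a token made of plain ASCII bytes would still fit into binary(5), while any token containing a high byte would not.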


Running sa-learn with DBI_TRACE=2 I can also see that it looks like it 
actually has the UTF-8 encoding already in there during parameter binding:


Binding parameters: INSERT INTO bayes_token
   (id, token, spam_count, ham_count, atime)
   VALUES ('43','^Bq!j','1','0','1718024618')
   ON DUPLICATE KEY UPDATE spam_count = GREATEST(spam_count + '1', 0),
   ham_count = GREATEST(ham_count + '0', 0),
   atime = GREATEST(atime, '1718024618')


Thus, I would say it's not an issue with the database.

Any idea?

Running spamassassin-3.4.6-5.el9.x86_64 on AlmaLinux 9.4.

Thanks,

Gerald

On 18.06.24 17:09, Gerald Vogt wrote:

Hi!

I am trying to use a mariadb database as bayesstore, but it fails to 
load tokens. Whenever it tries to insert something into bayes_token it 
fails with an error


dbg: bayes: _put_token: SQL error: Data too long for column 'token' at row 1


The table has been created as mentioned in

https://github.com/apache/spamassassin/blob/trunk/sql/bayes_mysql.sql

but the 5 byte binary isn't big enough. I have tried with sa-learn 
--restore as well as learning some spam mails. bayes_token remains empty.


MariaDB [spamassassin]> show create table bayes_token\G
*** 1. row ***
    Table: bayes_token
Create Table: CREATE TABLE `bayes_token` (
   `id` int(11) NOT NULL DEFAULT 0,
   `token` binary(5) NOT NULL,
   `spam_count` int(11) NOT NULL DEFAULT 0,
   `ham_count` int(11) NOT NULL DEFAULT 0,
   `atime` int(11) NOT NULL DEFAULT 0,
   PRIMARY KEY (`id`,`token`),
   KEY `bayes_token_idx1` (`id`,`atime`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci
1 row in set (0.000 sec)

Any idea what goes wrong here?

Thanks,

Gerald






BayesStore MariaDB on EL9

2024-06-18 Thread Gerald Vogt

Hi!

I am trying to use a mariadb database as bayesstore, but it fails to 
load tokens. Whenever it tries to insert something into bayes_token it 
fails with an error


dbg: bayes: _put_token: SQL error: Data too long for column 'token' at row 1

The table has been created as mentioned in

https://github.com/apache/spamassassin/blob/trunk/sql/bayes_mysql.sql

but the 5 byte binary isn't big enough. I have tried with sa-learn 
--restore as well as learning some spam mails. bayes_token remains empty.


MariaDB [spamassassin]> show create table bayes_token\G
*** 1. row ***
   Table: bayes_token
Create Table: CREATE TABLE `bayes_token` (
  `id` int(11) NOT NULL DEFAULT 0,
  `token` binary(5) NOT NULL,
  `spam_count` int(11) NOT NULL DEFAULT 0,
  `ham_count` int(11) NOT NULL DEFAULT 0,
  `atime` int(11) NOT NULL DEFAULT 0,
  PRIMARY KEY (`id`,`token`),
  KEY `bayes_token_idx1` (`id`,`atime`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci
1 row in set (0.000 sec)

Any idea what goes wrong here?

Thanks,

Gerald




Re: Sv: Re: Question about a rule

2024-06-18 Thread Laurent S.
I'd also strongly recommend adding boundaries: /\b(blah1|blah2|blah3)\b/i

Otherwise, you might have a whole *pano*ply of words that will cause 
legit mails to be marked as spam. You need to be super sure about poison pill 
rules, or in French - *pillu*le empoisonnée.
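
A small sketch of the difference the boundaries make, written in Python's re module instead of the Perl regexes SpamAssassin uses (for alternation and \b the two should behave the same here). The word list is shortened to two entries from the rule and the helper name hits is made up:

```python
import re

WORDS = ["pano", "pillu"]  # two of the words from the rule in this thread

unbounded = re.compile("|".join(WORDS), re.IGNORECASE)
bounded = re.compile(r"\b(" + "|".join(WORDS) + r")\b", re.IGNORECASE)

def hits(pattern, text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in pattern.finditer(text)]

legit = "A whole panoply of words"
print(hits(unbounded, legit))  # ['pano']
print(hits(bounded, legit))    # []
```

With the boundaries in place the rule still fires on the standalone word, just not on longer words that happen to contain it.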

Good luck.

On 18.06.24 13:35, Axb wrote:
> You need to enclose in brackets
> body LOCAL_BLAH   /(blah1|blah2|blah3)/i
> 
> On 6/18/24 13:05, Anders Gustafsson wrote:
>> Sure:
>>
>> body LOCAL_PORN_RULE   /kiimainen|naida|sexikäs|nussikas|nussia|pillu|pano|kinky|bdsm|pillua|x69-JOOGA/i
>> score LOCAL_PORN_RULE 8
>> describe LOCAL_PORN_RULE   This catches peter's porn spam
>>
>> Sorry again for mailing directly. No idea why it suggests the user and not 
>> users@
>>
> 



Re: Sv: Re: Question about a rule

2024-06-18 Thread Axb

You need to enclose in brackets
body LOCAL_BLAH   /(blah1|blah2|blah3)/i

On 6/18/24 13:05, Anders Gustafsson wrote:

Sure:

body LOCAL_PORN_RULE   /kiimainen|naida|sexikäs|nussikas|nussia|pillu|pano|kinky|bdsm|pillua|x69-JOOGA/i
score LOCAL_PORN_RULE 8
describe LOCAL_PORN_RULE   This catches peter's porn spam

Sorry again for mailing directly. No idea why it suggests the user and not 
users@





Re: Sv: Re: Question about a rule

2024-06-18 Thread Matus UHLAR - fantomas

On 18.06.24 14:05, Anders Gustafsson wrote:

body LOCAL_PORN_RULE   /kiimainen|naida|sexikäs|nussikas|nussia|pillu|pano|kinky|bdsm|pillua|x69-JOOGA/i
score LOCAL_PORN_RULE 8
describe LOCAL_PORN_RULE   This catches peter's porn spam

Sorry again for mailing directly. No idea why it suggests the user and not 
users@



I guess that the "sexikäs" causes trouble.
Do you use SA 4.0? That should be compatible with UTF-8.




Matus UHLAR - fantomas  2024-06-18 14:00 >>>

On 18.06.24 13:50, Anders Gustafsson wrote:

body LOCAL_PORN_RULE   /word1|word2.|x69-JOOGA/i
score LOCAL_PORN_RULE 8
describe LOCAL_PORN_RULE   This catches peter's porn spam

Funny thing is that it seems to trigger on messages that contain none of those 
words. I have removed the actual words so that my message will not be regarded 
as spam.

Wonder if it is that last word that matches some regexp??


This can happen with an incorrect regular expression.
Maybe if you posted it here, we could see the error.

run spamassassin -D < mail 2>/tmp/mail.err
and you should be able to see which string matched

Finally, SA recommends using multiple rules with small scores instead of
a single rule with a huge score.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
"To Boot or not to Boot, that's the question." [WD1270 Caviar]


Sv: Re: Question about a rule

2024-06-18 Thread Anders Gustafsson
Sure:

body LOCAL_PORN_RULE   /kiimainen|naida|sexikäs|nussikas|nussia|pillu|pano|kinky|bdsm|pillua|x69-JOOGA/i
score LOCAL_PORN_RULE 8
describe LOCAL_PORN_RULE   This catches peter's porn spam

Sorry again for mailing directly. No idea why it suggests the user and not 
users@

-- 
Kind regards

Anders Gustafsson, engineer
anders.gustafs...@pedago.fi  |  Support +358 18 12060  |  Direct +358 9 315 45 121  |  Mobile +358 40506 7099

Pedago interaktiv ab, Nygatan 7 B , AX-22100 MARIEHAMN, ÅLAND, FINLAND



>>> Matus UHLAR - fantomas  2024-06-18 14:00 >>>
On 18.06.24 13:50, Anders Gustafsson wrote:
>body LOCAL_PORN_RULE   /word1|word2.|x69-JOOGA/i
>score LOCAL_PORN_RULE 8
>describe LOCAL_PORN_RULE   This catches peter's porn spam
>
>Funny thing is that it seems to trigger on messages that contain none of those 
>words. I have removed the actual words so that my message will not be regarded 
>as spam.
>
>Wonder if it is that last word that matches some regexp??

This can happen with an incorrect regular expression.
Maybe if you posted it here, we could see the error.

run spamassassin -D < mail 2>/tmp/mail.err
and you should be able to see which string matched

Finally, SA recommends using multiple rules with small scores instead of 
a single rule with a huge score.


-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/ 
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
"They say when you play that M$ CD backward you can hear satanic messages."
"That's nothing. If you play it forward it will install Windows."


Re: Question about a rule

2024-06-18 Thread Matus UHLAR - fantomas

On 18.06.24 13:50, Anders Gustafsson wrote:

body LOCAL_PORN_RULE   /word1|word2.|x69-JOOGA/i
score LOCAL_PORN_RULE 8
describe LOCAL_PORN_RULE   This catches peter's porn spam

Funny thing is that it seems to trigger on messages that contain none of those 
words. I have removed the actual words so that my message will not be regarded 
as spam.

Wonder if it is that last word that matches some regexp??


This can happen with an incorrect regular expression.
Maybe if you posted it here, we could see the error.

run spamassassin -D < mail 2>/tmp/mail.err
and you should be able to see which string matched

Finally, SA recommends using multiple rules with small scores instead of 
a single rule with a huge score.



--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
"They say when you play that M$ CD backward you can hear satanic messages."
"That's nothing. If you play it forward it will install Windows."


Question about a rule

2024-06-18 Thread Anders Gustafsson
We have a rule that is supposed to catch various porn-related stuff:

body LOCAL_PORN_RULE   /word1|word2.|x69-JOOGA/i
score LOCAL_PORN_RULE 8
describe LOCAL_PORN_RULE   This catches peter's porn spam

Funny thing is that it seems to trigger on messages that contain none of those 
words. I have removed the actual words so that my message will not be regarded 
as spam.

Wonder if it is that last word that matches some regexp??
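
One way to see what such a rule actually latches onto is to try the pattern outside SpamAssassin. The sketch below uses Python's re module (not the Perl engine SA uses) with the placeholder pattern from the rule above; the helper name first_hit and the sample sentences are made up. Without anchors the alternatives match anywhere inside longer words, and the "." matches any character:

```python
import re

# Placeholder pattern copied from the rule above; the real words were removed by the poster.
RULE = re.compile(r"word1|word2.|x69-JOOGA", re.IGNORECASE)

def first_hit(text):
    """Return the substring that triggers the rule, or None if nothing matches."""
    match = RULE.search(text)
    return match.group(0) if match else None

print(repr(first_hit("Reset your Password2 now")))   # 'word2 '
print(repr(first_hit("Nothing suspicious here")))    # None
```

The first sentence contains none of the listed words as a word of its own, yet the rule fires on the tail of "Password2" plus the following space.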


-- 
Kind regards

Anders Gustafsson, engineer
anders.gustafs...@pedago.fi  |  Support +358 18 12060  |  Direct +358 9 315 45 121  |  Mobile +358 40506 7099

Pedago interaktiv ab, Nygatan 7 B , AX-22100 MARIEHAMN, ÅLAND, FINLAND