Re: BayesStore MariaDB on EL9

2024-06-24 Thread Gerald Vogt

Hi,

For your information, and for anyone who comes across this problem: I have 
opened an issue with Red Hat.


https://issues.redhat.com/browse/RHEL-43418

The fix will probably be backported, but that may take some time: maybe in 
9.5, possibly later.


We'll see...

Regards,

Gerald





Re: BayesStore MariaDB on EL9

2024-06-19 Thread Gerald Vogt

On 18.06.24 22:23, Bill Cole wrote:

On 2024-06-18 at 14:58:15 UTC-0400 (Tue, 18 Jun 2024 20:58:15 +0200)
Gerald Vogt is rumored to have said:


Hi,

As a test, I increased the length of the token column to binary(32) 
and imported a test file containing a single token.

This time it went through. However, as I suspected, the token length 
is not 5 bytes. Token line from backup:


t    1    0    1718024618    027121926a

Hex representation of content in database:

MariaDB [spamassassin]> select hex(token) from bayes_token\G
*** 1. row ***
hex(token): 027121C2926A
1 row in set (0.000 sec)

Compared:

Original 02 71 21    92 6a
Database 02 71 21 C2 92 6A

C2 92 is the UTF-8 encoding of U+0092, so the token is effectively 
being written to the database UTF-8 encoded.


That's odd... What is the character set of the database?


It is the standard DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci, 
just like the table.


Running sa-learn with DBI_TRACE=2, I can also see that the token already 
appears to be UTF-8 encoded during parameter binding:


Binding parameters: INSERT INTO bayes_token
   (id, token, spam_count, ham_count, atime)
   VALUES ('43','^Bq!j','1','0','1718024618')
   ON DUPLICATE KEY UPDATE spam_count = GREATEST(spam_count + '1', 0),
   ham_count = GREATEST(ham_count + '0', 0),
   atime = GREATEST(atime, '1718024618')


Thus, I would say it's not an issue with the database.

Any idea?

Running spamassassin-3.4.6-5.el9.x86_64 on AlmaLinux 9.4.


First: upgrade to 4.0.1


Well, it's the RHEL-packaged version. I don't really want to switch to 
a manually maintained installation.


There were substantial changes in how encoding was handled between 3.4.6 
and 4.0, and there is a substantial likelihood that any problem with 
encoding would not occur in 4.0 or later.


Yes, you are right. It works with 4.0.1.

I looked into the source code and the reason became obvious pretty 
quickly. For example, compare this part of _put_token in 3.4.6


https://github.com/apache/spamassassin/blob/4a1fe99da9296364be0c50f02d2a73b5af74207a/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L827

compared with this in trunk

https://github.com/apache/spamassassin/blob/8307bb22a7709125ab0f8e94fb7a271461944f61/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L997

4.0 explicitly tags the token as BINARY, while the default is VARCHAR, I 
think. So in 3.4.6 the token is automatically encoded.
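
The difference between binding a value as text and binding it as binary can be reproduced outside of SpamAssassin. The following is only an analogy, not the Perl DBI code from the linked commit: it assumes Python's built-in sqlite3 module as a stand-in for the MariaDB driver, and reuses the table name bayes_token purely for illustration. The same five token bytes are inserted once as a text string and once as raw bytes.

```python
import sqlite3

# In-memory SQLite database standing in for the real MariaDB table
# (an assumption for illustration; this is not the SpamAssassin schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bayes_token (id INTEGER, token BLOB)")

# The token from the backup line, as raw bytes.
raw = bytes.fromhex("027121926a")

# Bound as a text string: the driver stores it UTF-8 encoded,
# so the byte 0x92 (read as U+0092) turns into C2 92.
conn.execute("INSERT INTO bayes_token VALUES (1, ?)", (raw.decode("latin-1"),))

# Bound as bytes: the driver stores the value unchanged.
conn.execute("INSERT INTO bayes_token VALUES (2, ?)", (raw,))

rows = conn.execute(
    "SELECT id, hex(token) FROM bayes_token ORDER BY id"
).fetchall()
for row in rows:
    print(row)
# (1, '027121C2926A')
# (2, '027121926A')
```

Row 1 matches what ended up in the database with 3.4.6, row 2 is what the backup file contains.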


This was added in

https://github.com/apache/spamassassin/commit/3dd8ea4ff51d50a72212ac8cbb2f6f8d443c3489

I'll open a bug with Red Hat and see if they either upgrade spamassassin 
in EL9 or backport something into 3.4.6.


Just for fun, I replaced the packaged MySQL.pm with the one from 4.0.1, 
and then it works. Looking at the commit and the commit history after 
it, I think the 4.0.1 MySQL.pm should work just fine in 3.4.6.


Anyway, we'll see what Red Hat does about this.

Thanks a lot!

Regards,

Gerald


Docs confusion and missing dependency on EL9

2024-06-18 Thread Gerald Vogt

Hi,

For testing, I tried to install SpamAssassin 4.0.1 on EL9 (AlmaLinux 
9.4). I noticed that some dependencies are not mentioned on the INSTALL 
page:


I had to install perl-ExtUtils-MakeMaker.noarch to run Makefile.PL.
I had to install perl-Archive-Tar.noarch to run sa-update.

Neither of the two is mentioned anywhere.

It also took me a while to find the installation instructions.

I started at https://spamassassin.apache.org/index.html

where "Click here to get started using SpamAssassin!" looked promising.

But at

https://cwiki.apache.org/confluence/display/SPAMASSASSIN/StartUsing

I spent considerable time looking for where to download and how to 
actually install SpamAssassin, but eventually gave up. Only now have I 
found some instructions on the SingleUserUnixInstall page.


So I circled back and checked the Download link at the top. There 
I can download the tarball and get hints on upgrading, but there is still 
nothing on installation.


The Wiki and FAQ links from the top are not helpful either.

Eventually I found it under "Docs", which points to the INSTALL file.

From experience, that is not really the first place I would look.

I would think the "Get Started" page should have a link to the Download 
and INSTALL page at the beginning. Downloading and installing seem to be 
the obvious first steps to get started.


The Download page should have a link to INSTALL, like the one it already 
has for upgrading.


And I would say "Where to download" and "How to install" are pretty 
common FAQs, too.


I hope this helps.

Thanks,

Gerald





Re: BayesStore MariaDB on EL9

2024-06-18 Thread Gerald Vogt

Hi,

As a test, I increased the length of the token column to binary(32) 
and imported a test file containing a single token.

This time it went through. However, as I suspected, the token length is 
not 5 bytes. Token line from backup:


t   1   0   1718024618  027121926a

Hex representation of content in database:

MariaDB [spamassassin]> select hex(token) from bayes_token\G
*** 1. row ***
hex(token): 027121C2926A
1 row in set (0.000 sec)

Compared:

Original 02 71 21    92 6a
Database 02 71 21 C2 92 6A

C2 92 is the UTF-8 encoding of U+0092, so the token is effectively 
being written to the database UTF-8 encoded.
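
The byte comparison above can be reproduced in a few lines. This is a minimal sketch, assuming that somewhere before the INSERT each token byte is treated as a character (0x92 becomes U+0092) and the resulting string is then encoded as UTF-8. Where exactly that happens is not shown here.

```python
# The 5-byte token from the backup line, as raw bytes.
original = bytes.fromhex("027121926a")

# Assumption: each byte is read as one character (latin-1 style),
# then the string is encoded as UTF-8 on its way to the database.
as_text = original.decode("latin-1")
stored = as_text.encode("utf-8")

print(original.hex(" "))           # 02 71 21 92 6a
print(stored.hex(" "))             # 02 71 21 c2 92 6a
print(len(original), len(stored))  # 5 6
```

Any byte at or above 0x80 grows to two bytes this way, so the result no longer fits into a binary(5) column.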


Running sa-learn with DBI_TRACE=2, I can also see that the token already 
appears to be UTF-8 encoded during parameter binding:


Binding parameters: INSERT INTO bayes_token
   (id, token, spam_count, ham_count, atime)
   VALUES ('43','^Bq!j','1','0','1718024618')
   ON DUPLICATE KEY UPDATE spam_count = GREATEST(spam_count + '1', 0),
   ham_count = GREATEST(ham_count + '0', 0),
   atime = GREATEST(atime, '1718024618')


Thus, I would say it's not an issue with the database.

Any idea?

Running spamassassin-3.4.6-5.el9.x86_64 on AlmaLinux 9.4.

Thanks,

Gerald







BayesStore MariaDB on EL9

2024-06-18 Thread Gerald Vogt

Hi!

I am trying to use a MariaDB database as the Bayes store, but it fails to 
load tokens. Whenever it tries to insert something into bayes_token, it 
fails with this error:


dbg: bayes: _put_token: SQL error: Data too long for column 'token' at row 1

The table has been created as mentioned in

https://github.com/apache/spamassassin/blob/trunk/sql/bayes_mysql.sql

but the 5-byte binary column isn't big enough. I have tried sa-learn 
--restore as well as learning some spam mails. bayes_token remains empty.


MariaDB [spamassassin]> show create table bayes_token\G
*** 1. row ***
   Table: bayes_token
Create Table: CREATE TABLE `bayes_token` (
  `id` int(11) NOT NULL DEFAULT 0,
  `token` binary(5) NOT NULL,
  `spam_count` int(11) NOT NULL DEFAULT 0,
  `ham_count` int(11) NOT NULL DEFAULT 0,
  `atime` int(11) NOT NULL DEFAULT 0,
  PRIMARY KEY (`id`,`token`),
  KEY `bayes_token_idx1` (`id`,`atime`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci
1 row in set (0.000 sec)
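
As found later in this thread, the token is UTF-8 encoded before it reaches the database, which turns every byte at or above 0x80 into two bytes. A rough back-of-the-envelope sketch of how many tokens would then overflow binary(5) follows. It assumes that token bytes are uniformly distributed (they are hash-derived, but the uniformity is an assumption made here, not something measured).

```python
from fractions import Fraction

TOKEN_BYTES = 5  # width of the binary(5) token column

# Assumption: every byte value is equally likely. A byte below 0x80 stays
# one byte in UTF-8; a byte at or above 0x80 is expanded to two bytes.
p_byte_unchanged = Fraction(128, 256)

# The encoded token still fits only if none of its bytes is expanded.
p_fits = p_byte_unchanged ** TOKEN_BYTES
p_too_long = 1 - p_fits

print(p_fits, float(p_too_long))  # 1/32 0.96875
```

Under that assumption only about 1 in 32 tokens would still fit, which would be consistent with practically every insert attempt hitting the "Data too long" error.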

Any idea what goes wrong here?

Thanks,

Gerald