Hi,

For your information, and for anyone else who comes across this problem: I have opened an issue with Red Hat.

https://issues.redhat.com/browse/RHEL-43418

A fix will probably be backported, but that may take some time: maybe in 9.5, possibly later.

We'll see...

Regards,

Gerald

On 19.06.24 08:41, Gerald Vogt wrote:
On 18.06.24 22:23, Bill Cole wrote:
On 2024-06-18 at 14:58:15 UTC-0400 (Tue, 18 Jun 2024 20:58:15 +0200)
Gerald Vogt <v...@spamcop.net>
is rumored to have said:

Hi,

As a test, I increased the length of the token column to binary(32) and imported a test file containing a single token.

This time it went through. However, as I suspected, the token length is not 5 bytes. Token line from backup:

t    1    0    1718024618    027121926a

Hex representation of content in database:

MariaDB [spamassassin]> select hex(token) from bayes_token\G
*************************** 1. row ***************************
hex(token): 027121C2926A0000000000000000000000000000000000000000000000000000
1 row in set (0.000 sec)

Compared:

Original 02 71 21    92 6a
Database 02 71 21 C2 92 6A

C2 92 is the UTF-8 encoding of U+0092, so the token is effectively being written to the database UTF-8-encoded.
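The byte comparison above can be reproduced outside the database. The following is only a minimal Python sketch, not SpamAssassin code; the variable names are made up for illustration. It assumes that each byte of the token gets treated as a character (byte 0x92 becoming U+0092) and is then encoded as UTF-8, and it relies on BINARY(32) right-padding shorter values with 0x00 bytes.

```python
# Raw 5-byte token as it appears in the backup file (hex 027121926a).
original = bytes.fromhex("027121926a")

# Assumption: on the way to the database every byte is interpreted as a
# character (Latin-1 style, so 0x92 becomes U+0092) and then UTF-8 encoded.
as_text = original.decode("latin-1")
reencoded = as_text.encode("utf-8")

# A BINARY(32) column right-pads shorter values with 0x00 bytes.
stored = reencoded.ljust(32, b"\x00")

print(original.hex(" "))     # 02 71 21 92 6a
print(reencoded.hex(" "))    # 02 71 21 c2 92 6a
print(stored.hex().upper())  # should match the hex(token) output shown above
```

Only the single byte above 0x7F grows into two bytes, which is why the token ends up 6 bytes long and no longer fits a 5-byte column.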

That's odd... What is the character set of the database?

It is the standard DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci, just like the table.

Running sa-learn with DBI_TRACE=2, I can also see that the token already appears to be UTF-8 encoded at parameter binding time:

Binding parameters: INSERT INTO bayes_token
               (id, token, spam_count, ham_count, atime)
               VALUES ('43','^Bq!<U+0092>j','1','0','1718024618')
               ON DUPLICATE KEY UPDATE spam_count = GREATEST(spam_count + '1', 0),
                                       ham_count = GREATEST(ham_count + '0', 0),
                                       atime = GREATEST(atime, '1718024618')

Thus, I would say it's not an issue with the database.

Any idea?

Running spamassassin-3.4.6-5.el9.x86_64 on AlmaLinux 9.4.

First: upgrade to 4.0.1

Well, it's the RHEL-packaged version. I don't really want to switch to a manually maintained installation.

There were substantial changes in how encoding was handled between 3.4.6 and 4.0, and there is a substantial likelihood that any problem with encoding would not occur in 4.0 or later.

Yes, you are right. It works with 4.0.1.

I have looked into the source code, and the reason became obvious pretty quickly. See, for example, this part of _put_token in 3.4.6

https://github.com/apache/spamassassin/blob/4a1fe99da9296364be0c50f02d2a73b5af74207a/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L827

compared with this in trunk

https://github.com/apache/spamassassin/blob/8307bb22a7709125ab0f8e94fb7a271461944f61/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L997

4.0 explicitly tags the token as BINARY when binding it, whereas the default type is VARCHAR, I think. With the default, the token is treated as text and automatically encoded.

This was added in

https://github.com/apache/spamassassin/commit/3dd8ea4ff51d50a72212ac8cbb2f6f8d443c3489
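The effect of binding the same token as text versus as binary can be shown with a small analogy. This sketch uses Python's built-in sqlite3 module, not Perl DBI or MariaDB, so it only illustrates the general mechanism and says nothing about how DBD::mysql behaves internally. The table name is borrowed from the thread; everything else is made up for the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in; this is NOT MariaDB or DBD::mysql.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bayes_token (token BLOB)")

raw = bytes.fromhex("027121926a")

# Bound as text: the driver hands the value over as a UTF-8 encoded string,
# so the character U+0092 is stored as the two bytes C2 92.
conn.execute("INSERT INTO bayes_token VALUES (?)", (raw.decode("latin-1"),))

# Bound as binary: the five bytes are stored untouched.
conn.execute("INSERT INTO bayes_token VALUES (?)", (raw,))

rows = [r[0] for r in conn.execute(
    "SELECT hex(token) FROM bayes_token ORDER BY rowid")]
print(rows)  # ['027121C2926A', '027121926A']
```

The first row shows the same C2 92 expansion as the MariaDB output earlier in the thread, and the second row shows the token as it was intended to be stored.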

I'll open a bug with Red Hat and see whether they either upgrade SpamAssassin in EL9 or backport a fix into 3.4.6.

Just for the fun of it, I replaced the packaged file with the 4.0.1 MySQL.pm, and with that it works. Looking at the commit and the commit history after it, I think the 4.0.1 MySQL.pm should work just fine in 3.4.6.

Anyway, we'll see what Red Hat does about this.

Thanks a lot!

Regards,

Gerald
