On 2024-06-18 at 14:58:15 UTC-0400 (Tue, 18 Jun 2024 20:58:15 +0200)
Gerald Vogt <v...@spamcop.net>
is rumored to have said:

Hi,

As a test, I increased the column length of token to binary(32) and imported a test file containing a single token.

This time it went through. However, as I suspected, the token length is not 5 bytes. Token line from backup:

t       1       0       1718024618      027121926a
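
As a quick check, the backup line above can be split into its fields to confirm that the token itself is only 5 bytes long. This is a minimal sketch; the field meanings (record type, spam count, ham count, atime, hex-encoded token) are an assumption inferred from the values in the INSERT statement quoted later in this message, not taken from the SpamAssassin documentation.

```python
# Backup line copied from the message; fields are whitespace-separated.
line = "t       1       0       1718024618      027121926a"

# Assumed field order, inferred from the INSERT statement in the DBI trace:
# record type, spam_count, ham_count, atime, hex-encoded token.
kind, spam_count, ham_count, atime, token_hex = line.split()

# Decode the hex string into the raw token bytes.
token = bytes.fromhex(token_hex)

print(kind, spam_count, ham_count, atime)
print(len(token))  # length of the raw token in bytes
```

So the token in the backup fits a binary(5) column; the extra byte only appears on the way into the database.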

Hex representation of content in database:

MariaDB [spamassassin]> select hex(token) from bayes_token\G
*************************** 1. row ***************************
hex(token): 027121C2926A0000000000000000000000000000000000000000000000000000
1 row in set (0.000 sec)

Compared:

Original 02 71 21    92 6a
Database 02 71 21 C2 92 6A

C2 92 is the UTF-8 encoding of U+0092, so the token is effectively being written to the database as UTF-8.
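
The byte comparison above can be reproduced with a short sketch. It assumes, purely for illustration, that the 5 raw token bytes get interpreted as Latin-1 characters and are then encoded as UTF-8; this is one way such a result can arise, not a confirmed description of what SpamAssassin or the database driver does.

```python
# Raw 5-byte token from the backup file.
original = bytes.fromhex("027121926a")

# Assumption: each byte is treated as a Latin-1 character (0x92 -> U+0092) ...
as_text = original.decode("latin-1")

# ... and the resulting text is then encoded as UTF-8 before being stored.
stored = as_text.encode("utf-8")

print(stored.hex().upper())         # compare with the hex(token) output, minus zero padding
print(len(original), len(stored))   # raw length versus re-encoded length
```

Under that assumption, any token containing a byte of 0x80 or above grows by one byte per such byte, which would also explain the "Data too long" error against a binary(5) column.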

That's odd... What is the character set of the database?

Running sa-learn with DBI_TRACE=2, I can also see that the token already appears to be UTF-8 encoded at parameter binding:

Binding parameters: INSERT INTO bayes_token
               (id, token, spam_count, ham_count, atime)
               VALUES ('43','^Bq!<U+0092>j','1','0','1718024618')
ON DUPLICATE KEY UPDATE spam_count = GREATEST(spam_count + '1', 0), ham_count = GREATEST(ham_count + '0', 0), atime = GREATEST(atime, '1718024618')

Thus, I would say it's not an issue with the database.

Any idea?

Running spamassassin-3.4.6-5.el9.x86_64 on AlmaLinux 9.4.

First: upgrade to 4.0.1

Encoding handling changed substantially between 3.4.6 and 4.0, so there is a good chance that an encoding problem like this would not occur in 4.0 or later.

I don't know exactly what the cause of the problem is (i.e. why is SA trying to write UTF-8 to the database?) but I'm quite sure that an official fix for 3.4.x will never happen.




Thanks,

Gerald

On 18.06.24 17:09, Gerald Vogt wrote:
Hi!

I am trying to use a MariaDB database as the Bayes store, but it fails to load tokens. Whenever it tries to insert something into bayes_token, it fails with this error:

dbg: bayes: _put_token: SQL error: Data too long for column 'token' at row 1

The table has been created as mentioned in

https://github.com/apache/spamassassin/blob/trunk/sql/bayes_mysql.sql

but the 5-byte binary column isn't big enough. I have tried sa-learn --restore as well as learning some spam mails. bayes_token remains empty.

MariaDB [spamassassin]> show create table bayes_token\G
*************************** 1. row ***************************
        Table: bayes_token
Create Table: CREATE TABLE `bayes_token` (
   `id` int(11) NOT NULL DEFAULT 0,
   `token` binary(5) NOT NULL,
   `spam_count` int(11) NOT NULL DEFAULT 0,
   `ham_count` int(11) NOT NULL DEFAULT 0,
   `atime` int(11) NOT NULL DEFAULT 0,
   PRIMARY KEY (`id`,`token`),
   KEY `bayes_token_idx1` (`id`,`atime`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci
1 row in set (0.000 sec)

Any idea what goes wrong here?

Thanks,

Gerald




--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo@toad.social and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
