Re: BayesStore MariaDB on EL9
Hi,

for your information and anyone who comes across this problem: I have opened an issue with RedHat.

https://issues.redhat.com/browse/RHEL-43418

It probably will be backported, but may take some time, maybe in 9.5 or possibly later. We'll see...

Regards,
Gerald

On 19.06.24 08:41, Gerald Vogt wrote:
> On 18.06.24 22:23, Bill Cole wrote:
>> On 2024-06-18 at 14:58:15 UTC-0400 (Tue, 18 Jun 2024 20:58:15 +0200)
>> Gerald Vogt is rumored to have said:
>>> Hi,
>>>
>>> for a test, I have increased the column length of token to binary(32) and used a test file to import containing a single token. This time it went through. However, as I suspected, the token length is not 5 byte.
>>>
>>> Token line from backup:
>>>
>>> t 1 0 1718024618 027121926a
>>>
>>> Hex representation of content in database:
>>>
>>> MariaDB [spamassassin]> select hex(token) from bayes_token\G
>>> *** 1. row ***
>>> hex(token): 027121C2926A
>>> 1 row in set (0.000 sec)
>>>
>>> Compared:
>>>
>>> Original 02 71 21 92 6a
>>> Database 02 71 21 C2 92 6A
>>>
>>> C2 92 is the UTF-8 encoding of U+0092, thus basically the token is written in UTF-8 into the database.
>>
>> That's odd... What is the character set of the database?
>
> It is standard DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci just like the table.
>
>>> Running sa-learn with DBI_TRACE=2 I can also see that it looks like it actually has the UTF-8 encoding already in there during parameter binding:
>>>
>>> Binding parameters: INSERT INTO bayes_token
>>> (id, token, spam_count, ham_count, atime)
>>> VALUES ('43','^Bq!j','1','0','1718024618')
>>> ON DUPLICATE KEY UPDATE spam_count = GREATEST(spam_count + '1', 0),
>>> ham_count = GREATEST(ham_count + '0', 0),
>>> atime = GREATEST(atime, '1718024618')
>>>
>>> Thus, I would say it's not an issue with the database.
>>>
>>> Any idea?
>>>
>>> Running spamassassin-3.4.6-5.el9.x86_64 on AlmaLinux 9.4.
>>
>> First: upgrade to 4.0.1
>
> Well, it's the RHEL packaged version. I don't really want to upgrade to a manually handled version.
>
>> There were substantial changes in how encoding was handled between 3.4.6 and 4.0, and there is a substantial likelihood that any problem with encoding would not occur in 4.0 or later.
>
> Yes, you are right. It works with 4.0.1. I have looked into the source code and the reason became obvious pretty quickly, e.g. the part in _put_token in 3.4.6
>
> https://github.com/apache/spamassassin/blob/4a1fe99da9296364be0c50f02d2a73b5af74207a/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L827
>
> compared with this in trunk
>
> https://github.com/apache/spamassassin/blob/8307bb22a7709125ab0f8e94fb7a271461944f61/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L997
>
> 4.0 does specifically tag the token as BINARY while default is VARCHAR I think. Thus, it automatically encodes it. This was added in
>
> https://github.com/apache/spamassassin/commit/3dd8ea4ff51d50a72212ac8cbb2f6f8d443c3489
>
> I'll open a bug with redhat and see if they either upgrade spamassassin in EL9 or backport something into 3.4.6.
>
> Just for the fun of it, I have replaced the packaged file with the 4.0.1 MySQL.pm file and then it works. Looking at the commit and the commit history after, I think the 4.0.1 MySQL.pm should work just fine in 3.4.6.
>
> Anyway, we'll see what RedHat does about this.
>
> Thanks a lot!
>
> Regards,
> Gerald
Re: BayesStore MariaDB on EL9
On 18.06.24 22:23, Bill Cole wrote:
> On 2024-06-18 at 14:58:15 UTC-0400 (Tue, 18 Jun 2024 20:58:15 +0200)
> Gerald Vogt is rumored to have said:
>> Hi,
>>
>> for a test, I have increased the column length of token to binary(32) and used a test file to import containing a single token. This time it went through. However, as I suspected, the token length is not 5 byte.
>>
>> Token line from backup:
>>
>> t 1 0 1718024618 027121926a
>>
>> Hex representation of content in database:
>>
>> MariaDB [spamassassin]> select hex(token) from bayes_token\G
>> *** 1. row ***
>> hex(token): 027121C2926A
>> 1 row in set (0.000 sec)
>>
>> Compared:
>>
>> Original 02 71 21 92 6a
>> Database 02 71 21 C2 92 6A
>>
>> C2 92 is the UTF-8 encoding of U+0092, thus basically the token is written in UTF-8 into the database.
>
> That's odd... What is the character set of the database?

It is standard DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci just like the table.

>> Running sa-learn with DBI_TRACE=2 I can also see that it looks like it actually has the UTF-8 encoding already in there during parameter binding:
>>
>> Binding parameters: INSERT INTO bayes_token
>> (id, token, spam_count, ham_count, atime)
>> VALUES ('43','^Bq!j','1','0','1718024618')
>> ON DUPLICATE KEY UPDATE spam_count = GREATEST(spam_count + '1', 0),
>> ham_count = GREATEST(ham_count + '0', 0),
>> atime = GREATEST(atime, '1718024618')
>>
>> Thus, I would say it's not an issue with the database.
>>
>> Any idea?
>>
>> Running spamassassin-3.4.6-5.el9.x86_64 on AlmaLinux 9.4.
>
> First: upgrade to 4.0.1

Well, it's the RHEL packaged version. I don't really want to upgrade to a manually handled version.

> There were substantial changes in how encoding was handled between 3.4.6 and 4.0, and there is a substantial likelihood that any problem with encoding would not occur in 4.0 or later.

Yes, you are right. It works with 4.0.1. I have looked into the source code and the reason became obvious pretty quickly, e.g. the part in _put_token in 3.4.6

https://github.com/apache/spamassassin/blob/4a1fe99da9296364be0c50f02d2a73b5af74207a/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L827

compared with this in trunk

https://github.com/apache/spamassassin/blob/8307bb22a7709125ab0f8e94fb7a271461944f61/lib/Mail/SpamAssassin/BayesStore/MySQL.pm#L997

4.0 does specifically tag the token as BINARY while default is VARCHAR I think. Thus, it automatically encodes it. This was added in

https://github.com/apache/spamassassin/commit/3dd8ea4ff51d50a72212ac8cbb2f6f8d443c3489

I'll open a bug with redhat and see if they either upgrade spamassassin in EL9 or backport something into 3.4.6.

Just for the fun of it, I have replaced the packaged file with the 4.0.1 MySQL.pm file and then it works. Looking at the commit and the commit history after, I think the 4.0.1 MySQL.pm should work just fine in 3.4.6.

Anyway, we'll see what RedHat does about this.

Thanks a lot!

Regards,
Gerald
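The difference between binding the token as text and binding it as binary can be reproduced in miniature. The following is a minimal Python sketch, not SpamAssassin's code: it swaps in Python's built-in sqlite3 module for MariaDB and Perl DBI, and the table name bayes_token_demo is made up for illustration. It rests on the assumption, suggested by the hex dump in this thread, that a token handled as a character string gets encoded as UTF-8 on its way into storage, while a token handed over as raw bytes is stored unchanged.

```python
import sqlite3

# The 5-byte token from the backup line quoted in this thread.
raw_token = bytes.fromhex("027121926a")

conn = sqlite3.connect(":memory:")
# Hypothetical demo table; the real bayes_token schema has more columns.
conn.execute("CREATE TABLE bayes_token_demo (token BLOB NOT NULL)")

# Case 1: the token is passed as a character string (one character per byte).
# sqlite3 stores Python str values as UTF-8 text, so byte 0x92 becomes C2 92.
conn.execute(
    "INSERT INTO bayes_token_demo (token) VALUES (?)",
    (raw_token.decode("latin-1"),),
)

# Case 2: the token is passed as bytes, i.e. explicitly as binary data.
conn.execute("INSERT INTO bayes_token_demo (token) VALUES (?)", (raw_token,))

stored = [
    row[0]
    for row in conn.execute(
        "SELECT hex(token) FROM bayes_token_demo ORDER BY rowid"
    )
]
print(stored[0])  # 027121C2926A  (6 bytes, as seen in the database)
print(stored[1])  # 027121926A    (5 bytes, matches the backup)
```

The first row shows the same six-byte value as the MariaDB hex dump, the second keeps the original five bytes. How Perl DBI and the MariaDB driver decide between the two cases is not covered by this sketch.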
Docs confusion and missing dependency on EL9
Hi,

for testing I tried to install spamassassin 4.0.1 on EL9 (AlmaLinux 9.4). I have noticed some dependencies are not mentioned on the INSTALL page:

I have had to install perl-ExtUtils-MakeMaker.noarch to run Makefile.PL
I have had to install perl-Archive-Tar.noarch to run sa-update.

Those two are nowhere mentioned.

It also took me a while to find the instructions on how to install. I started at

https://spamassassin.apache.org/index.html

where "Click here to get started using SpamAssassin!" looked promising. But at

https://cwiki.apache.org/confluence/display/SPAMASSASSIN/StartUsing

I have spent considerable time looking for where to download and how to actually install spamassassin, but eventually gave up. Only now have I found some instructions on the SingleUserUnixInstall page.

So I have circled back and checked the Download link from the top. There I can download the tar and get hints on Upgrading, but still nothing on installation. The Wiki and FAQ links from the top are not helpful either.

So eventually, I have found it on "Docs", pointing to the INSTALL file. From experience, that is not really the first place I would look.

I would think the "Get Started" page should have a link to the Download and INSTALL page at the beginning. Downloading and installing seem to be the obvious first steps to get started.

The Download page should have a link for INSTALL like it already has for the Upgrade.

And I would say "Where to download" and "How to install" are pretty common FAQs, too.

I hope this helps.

Thanks,
Gerald
Re: BayesStore MariaDB on EL9
Hi,

for a test, I have increased the column length of token to binary(32) and used a test file to import containing a single token. This time it went through. However, as I suspected, the token length is not 5 byte.

Token line from backup:

t 1 0 1718024618 027121926a

Hex representation of content in database:

MariaDB [spamassassin]> select hex(token) from bayes_token\G
*** 1. row ***
hex(token): 027121C2926A
1 row in set (0.000 sec)

Compared:

Original 02 71 21 92 6a
Database 02 71 21 C2 92 6A

C2 92 is the UTF-8 encoding of U+0092, thus basically the token is written in UTF-8 into the database.

Running sa-learn with DBI_TRACE=2 I can also see that it looks like it actually has the UTF-8 encoding already in there during parameter binding:

Binding parameters: INSERT INTO bayes_token
(id, token, spam_count, ham_count, atime)
VALUES ('43','^Bq!j','1','0','1718024618')
ON DUPLICATE KEY UPDATE spam_count = GREATEST(spam_count + '1', 0),
ham_count = GREATEST(ham_count + '0', 0),
atime = GREATEST(atime, '1718024618')

Thus, I would say it's not an issue with the database.

Any idea?

Running spamassassin-3.4.6-5.el9.x86_64 on AlmaLinux 9.4.

Thanks,
Gerald

On 18.06.24 17:09, Gerald Vogt wrote:
> Hi!
>
> I am trying to use a mariadb database as bayesstore, but it fails to load tokens. Whenever it tries to insert something into bayes_token it fails with an error
>
> dbg: bayes: _put_token: SQL error: Data too long for column 'token' at row 1
>
> The table has been created as mentioned in
>
> https://github.com/apache/spamassassin/blob/trunk/sql/bayes_mysql.sql
>
> but the 5 byte binary isn't big enough.
>
> I have tried with sa-learn --restore as well as learning some spam mails. bayes_token remains empty.
>
> MariaDB [spamassassin]> show create table bayes_token\G
> *** 1. row ***
>        Table: bayes_token
> Create Table: CREATE TABLE `bayes_token` (
>   `id` int(11) NOT NULL DEFAULT 0,
>   `token` binary(5) NOT NULL,
>   `spam_count` int(11) NOT NULL DEFAULT 0,
>   `ham_count` int(11) NOT NULL DEFAULT 0,
>   `atime` int(11) NOT NULL DEFAULT 0,
>   PRIMARY KEY (`id`,`token`),
>   KEY `bayes_token_idx1` (`id`,`atime`)
> ) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci
> 1 row in set (0.000 sec)
>
> Any idea what goes wrong here?
>
> Thanks,
> Gerald
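The byte comparison in the message above can be checked without a database. This is a small Python sketch under one assumption: that somewhere before the INSERT each token byte is treated as a character with the same code point (as a latin1 interpretation would do) and the resulting string is then encoded as UTF-8. The helper name reencode_as_utf8 is made up for illustration.

```python
def reencode_as_utf8(token: bytes) -> bytes:
    """Treat each byte as one character (U+0000..U+00FF), then encode as UTF-8."""
    return token.decode("latin-1").encode("utf-8")


# Token from the backup line "t 1 0 1718024618 027121926a".
original = bytes.fromhex("027121926a")
stored = reencode_as_utf8(original)

print(original.hex(), len(original))  # 027121926a 5
print(stored.hex(), len(stored))      # 027121c2926a 6
```

Only bytes from 0x80 upward grow to two bytes, so a token consisting purely of ASCII bytes would still fit into binary(5). That would explain why the error depends on the token content.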
BayesStore MariaDB on EL9
Hi!

I am trying to use a mariadb database as bayesstore, but it fails to load tokens. Whenever it tries to insert something into bayes_token it fails with an error

dbg: bayes: _put_token: SQL error: Data too long for column 'token' at row 1

The table has been created as mentioned in

https://github.com/apache/spamassassin/blob/trunk/sql/bayes_mysql.sql

but the 5 byte binary isn't big enough.

I have tried with sa-learn --restore as well as learning some spam mails. bayes_token remains empty.

MariaDB [spamassassin]> show create table bayes_token\G
*** 1. row ***
       Table: bayes_token
Create Table: CREATE TABLE `bayes_token` (
  `id` int(11) NOT NULL DEFAULT 0,
  `token` binary(5) NOT NULL,
  `spam_count` int(11) NOT NULL DEFAULT 0,
  `ham_count` int(11) NOT NULL DEFAULT 0,
  `atime` int(11) NOT NULL DEFAULT 0,
  PRIMARY KEY (`id`,`token`),
  KEY `bayes_token_idx1` (`id`,`atime`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci
1 row in set (0.000 sec)

Any idea what goes wrong here?

Thanks,
Gerald