https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8315
Bug ID: 8315
Summary: BayesStore/SQL regression when using MySQL defaults
Product: Spamassassin
Version: SVN Trunk (Latest Devel Version)
Hardware: PC
OS: Linux
Status: NEW
Severity: minor
Priority: P2
Component: Learner
Assignee: [email protected]
Reporter: [email protected]
Target Milestone: Undefined
Created attachment 6000
--> https://bz.apache.org/SpamAssassin/attachment.cgi?id=6000&action=edit
fix the SQLite concat syntax
I noticed that my bayes learning was not working well. The biggest symptom was
lots of new tokens but very few hammy, neutral or spammy tokens. For instance:
X-Spam-TokenSummary: Tokens: new, 177; hammy, 0; neutral, 1; spammy, 1.
X-Spam-TokenSummary: Tokens: new, 178; hammy, 0; neutral, 0; spammy, 2.
X-Spam-TokenSummary: Tokens: new, 104; hammy, 1; neutral, 1; spammy, 0.
I've been using the following configuration:
bayes_store_module Mail::SpamAssassin::BayesStore::SQL
for a loooooooong time, probably 10+ years. A change in 2022[1] switched the
default SQL syntax to use "||" as the string concatenation operator. That's
evidently fine in SQLite, but MySQL by default[2] treats "||" as a logical OR
unless the PIPES_AS_CONCAT SQL mode is enabled. As a result, the generated SQL
returned a boolean value instead of a string for the token:
MariaDB [spamassassin]> SELECT SUBSTR(token || ' ', 1, 5), spam_count,
ham_count, atime from bayes_token limit 10;
+--------------------------------+------------+-----------+------------+
| SUBSTR(token || ' ', 1, 5)     | spam_count | ham_count | atime      |
+--------------------------------+------------+-----------+------------+
| 0                              |          0 |         1 | 1696434003 |
| 0                              |          0 |         3 | 1696434018 |
| 0                              |          0 |         6 | 1696441099 |
| 0                              |          0 |         1 | 1696434008 |
| 0                              |          0 |         2 | 1696440870 |
| 0                              |          0 |         3 | 1696440394 |
| 0                              |          0 |         1 | 1696434011 |
| 0                              |          0 |         2 | 1696445725 |
| 0                              |          0 |         1 | 1696441419 |
| 0                              |          0 |         1 | 1696433986 |
+--------------------------------+------------+-----------+------------+
Basically, the token was either 0 or 1.
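To illustrate the difference without a MySQL server, here is a rough Python
sketch using the standard sqlite3 module. It is an emulation, not a test against
MySQL: SQLite's "||" always concatenates, so the sketch substitutes SQLite's OR
operator to approximate how MySQL reads "||" when PIPES_AS_CONCAT is off. The
table layout and the sample tokens are simplified and hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (token, spam_count, ham_count, atime).
SAMPLE_ROWS = [
    ("abcde", 0, 1, 1696434003),
    ("fghij", 0, 3, 1696434018),
    ("klmno", 0, 6, 1696441099),
]


def fetch_tokens(padded_token_expr):
    """Return SUBSTR(<expr>, 1, 5) for every row of a throwaway bayes_token table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bayes_token "
        "(token TEXT, spam_count INTEGER, ham_count INTEGER, atime INTEGER)"
    )
    conn.executemany("INSERT INTO bayes_token VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    # The expression is interpolated only because this is a fixed demo string.
    rows = conn.execute(
        f"SELECT SUBSTR({padded_token_expr}, 1, 5) FROM bayes_token ORDER BY atime"
    ).fetchall()
    conn.close()
    return [str(row[0]) for row in rows]


# "||" as string concatenation (SQLite behaviour): the tokens survive.
concatenated = fetch_tokens("token || ' '")

# OR stands in for MySQL's default reading of "||": a boolean comes back.
logical_or = fetch_tokens("token OR ' '")

print(concatenated)
print(logical_or)
```

The first list should contain the original tokens, while the second should
contain only boolean-like values, matching the symptom in the query output
above.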
Then this loop in SpamAssassin/Plugin/Bayes.pm:
  foreach my $tokendata (@{$tokensdata}) {
    ...
    my ($token, $tok_spam, $tok_ham, $atime) = @{$tokendata};
    $pw{$token} = {...
  }
would only see $token as "0" or "1". Because %pw is keyed by token, it could
hold at *MOST* two distinct entries, which explains the low token counts I see
coming out of the database.
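The collapse described above can be sketched in Python, with a dict standing in
for the Perl %pw hash. This is only a model of the loop, not the actual
Bayes.pm code: the function name, the per-token field names and the sample rows
are all hypothetical, since the original snippet elides the hash contents.

```python
def build_token_map(rows):
    """Key per-token data by token, mimicking the %pw hash in the Perl loop."""
    pw = {}
    for token, tok_spam, tok_ham, atime in rows:
        # A later row with the same token overwrites the earlier entry.
        pw[token] = {"spam_count": tok_spam, "ham_count": tok_ham, "atime": atime}
    return pw


# Hypothetical rows as they would look if the tokens came back intact.
rows_expected = [
    ("abcde", 0, 1, 1696434003),
    ("fghij", 0, 3, 1696434018),
    ("klmno", 0, 6, 1696441099),
]

# The same rows as MySQL returned them, with every token replaced by "0".
rows_broken = [("0", spam, ham, atime) for _, spam, ham, atime in rows_expected]

print(len(build_token_map(rows_expected)))  # 3
print(len(build_token_map(rows_broken)))    # 1
```

With distinct tokens every row gets its own entry. Once the tokens degrade to
"0" or "1", all rows collapse into at most two entries, so nearly every token in
a message is reported as new.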
The issue can be worked around by using:
bayes_store_module Mail::SpamAssassin::BayesStore::MySQL
but I think it should probably be fixed in case other folks are using plain
"SQL", not "MySQL". A totally untested patch is attached.
1. https://svn.apache.org/viewvc?view=revision&revision=1899738
2. https://dev.mysql.com/doc/refman/8.4/en/sql-mode.html#sqlmode_pipes_as_concat