On 4/17/2013 10:15 PM, John Hardin wrote:
> On Wed, 17 Apr 2013, Ben Johnson wrote:
>
>> The first post on that page was the key. In particular, adding the
>> following to each MySQL "CREATE TABLE" statement:
>>
>> ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
>
> Please check the SpamAssassin bugzilla to see if this situation is
> already mentioned, and if not, add a bug. This seems pretty critical.

Mark Martinec opened three reports in relation to this issue (quoted
from the archive thread cited in my previous post):

[Bug 6624] BayesStore/MySQL.pm fails to update tokens due to MySQL
server bug (wrong count of rows affected)
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6624
(^^ Fixed in 3.4 ^^)

[Bug 6625] Bayes SQL schema treats bayes_token.token as char instead
of binary, fails chset checks
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6625
(^^ Fixed in 3.4 ^^)

[Bug 6626] Newer MySQL chokes on TYPE=MyISAM syntax
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6626
(^^ Fixed in 3.4 ^^)

My concern now is that I am on 3.3.1, with little control over upgrades.

I have read all three bug reports in their entirety, and Bug 6624 seems
to be a very legitimate concern. To quote Mark in the bug description:

> The effect of the bug with SpamAssassin is that tokens are only able
> to be inserted once, but their counts cannot increase, leading to
> terrible bayes results if the bug is not noticed. Also the conversion
> form db fails, as reported by Dave.
>
> Attached is a patch for lib/Mail/SpamAssassin/BayesStore/MySQL.pm to
> provide a workaround for the MySQL server bug, and improved debug logging.

How can I tell whether this bug does, in fact, affect me? Are my Bayes
results being crippled by it?

> It's possible that there's a good reason the default script still uses
> myISAM. If so, the documentation for this fix should at least be easier
> to find.
>

If there is a good reason, I have yet to discern what it might be. The
third bug above (Mark's comments, specifically) implies that there is no
particular reason for using MyISAM.

I have good reason for wanting to use the InnoDB storage engine, and I
have seen no performance hit as a result of doing so. (In fact,
performance seems better than with MyISAM in my scripted, once-a-day
training setup.)
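
On the Bug 6624 question above, one rough way to check might be to look
for tokens whose counts have ever risen above 1. This is only a sketch
under assumptions: it reads Mark's description as meaning that an
affected database would hold only counts of 0 or 1, and it assumes the
stock bayes_token table with spam_count and ham_count columns (those
names are not quoted anywhere in this thread). The helper names are
hypothetical, and sqlite3 stands in for MySQL purely so the example runs
without a server; any Python DB-API cursor connected to the real Bayes
database should work the same way.

```python
import sqlite3

# Assumed table/column names from the stock Bayes SQL schema; verify them
# against your own database before relying on this.
TOKEN_COUNT_QUERY = """
    SELECT COUNT(*),
           SUM(CASE WHEN spam_count > 1 OR ham_count > 1 THEN 1 ELSE 0 END)
    FROM bayes_token
"""


def token_count_summary(cursor):
    """Return (total_tokens, tokens_whose_count_exceeds_one)."""
    cursor.execute(TOKEN_COUNT_QUERY)
    total, incremented = cursor.fetchone()
    # SUM() yields NULL on an empty table, so normalise that to 0.
    return total, (incremented or 0)


def looks_stuck(total, incremented):
    """Heuristic only: tokens exist, but none was ever counted twice."""
    return total > 0 and incremented == 0


if __name__ == "__main__":
    # In-memory stand-in for the real database, with made-up rows.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE bayes_token (id INTEGER, token BLOB, "
        "spam_count INTEGER, ham_count INTEGER, atime INTEGER)"
    )
    cur.executemany(
        "INSERT INTO bayes_token VALUES (1, ?, ?, ?, 0)",
        [(b"aaaaa", 1, 0), (b"bbbbb", 0, 1), (b"ccccc", 1, 0)],
    )
    total, incremented = token_count_summary(cur)
    print(total, incremented, looks_stuck(total, incremented))
```

With thousands of messages learned, a healthy database would be expected
to contain many tokens with counts above 1, so a result of zero would be
suspicious. A non-zero result does not prove the bug is absent.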

The perfectly acceptable performance I'm observing could be because
a) the InnoDB-related resources allocated to MySQL are more than
sufficient, b) the schema that I used has a newly-added INDEX whereas
those prior to it did not, or c) I was sure to use the "MySQL" module
instead of the "SQL" module with my InnoDB setup:

bayes_store_module Mail::SpamAssassin::BayesStore::MySQL

The bottom line seems to be that for those who have settings like these
in their MySQL configurations

> default_storage_engine=InnoDB
> skip-character-set-client-handshake
> collation_server=utf8_unicode_ci
> character_set_server=utf8

it is absolutely necessary to include

ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

at the end of each CREATE TABLE statement (otherwise, the MySQL syntax
error results and all Bayes SELECT statements fail).

In any event, I'm a little concerned: while the majority of messages are
now tagged with BAYES_* hits, I am seeing the following debug output on
a significant percentage of messages ("cannot use bayes on this message;
not enough usable tokens found"):

# spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'
--------------------------------------------------------------
Apr 18 09:15:36.537 [21797] dbg: bayes: learner_new self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x4430388), bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
Apr 18 09:15:36.568 [21797] dbg: bayes: using username: amavis
Apr 18 09:15:36.568 [21797] dbg: bayes: learner_new: got store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x4779778)
Apr 18 09:15:36.580 [21797] dbg: bayes: database connection established
Apr 18 09:15:36.580 [21797] dbg: bayes: found bayes db version 3
Apr 18 09:15:36.581 [21797] dbg: bayes: Using userid: 1
Apr 18 09:15:36.781 [21797] dbg: bayes: corpus size: nspam = 6155, nham = 2342
Apr 18 09:15:36.787 [21797] dbg: bayes: tok_get_all: token count: 176
Apr 18 09:15:36.790 [21797] dbg: bayes: cannot use bayes on this message; not enough usable tokens found
Apr 18 09:15:36.790 [21797] dbg: bayes: not scoring message, returning undef
Apr 18 09:15:37.861 [21797] dbg: timing: total 2109 ms - init: 830 (39.4%), parse: 7 (0.4%), extract_message_metadata: 123 (5.9%), poll_dns_idle: 74 (3.5%), get_uri_detail_list: 2 (0.1%), tests_pri_-1000: 26 (1.3%), compile_gen: 155 (7.4%), compile_eval: 19 (0.9%), tests_pri_-950: 7 (0.3%), tests_pri_-900: 7 (0.3%), tests_pri_-400: 15 (0.7%), check_bayes: 10 (0.5%), tests_pri_0: 1018 (48.3%), dkim_load_modules: 25 (1.2%), check_dkim_signature: 3 (0.2%), check_dkim_adsp: 16 (0.7%), check_spf: 78 (3.7%), check_razor2: 91 (4.3%), check_pyzor: 430 (20.4%), tests_pri_500: 50 (2.4%)
--------------------------------------------------------------

I have searched for the string "cannot use bayes on this message; not
enough usable tokens found" and have not found anything authoritative
about what this message means, or whether it can be ignored or is
symptomatic of a larger Bayes problem.

Thank you,

-Ben
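
For anyone wanting to measure how often the "not enough usable tokens"
case occurs, here is a minimal sketch that pulls the relevant numbers
out of debug output like the above. It relies only on the three message
formats visible in that log; the function name and the dictionary keys
are hypothetical, and it makes no claim about why SpamAssassin decides
to skip a message.

```python
import re

# Patterns copied from the debug lines shown in the log excerpt.
CORPUS_RE = re.compile(r"bayes: corpus size: nspam = (\d+), nham = (\d+)")
TOKENS_RE = re.compile(r"bayes: tok_get_all: token count: (\d+)")
SKIPPED_MARKER = "bayes: cannot use bayes on this message"


def summarize_bayes_debug(text):
    """Extract corpus size, token count and the skip flag from -D output."""
    summary = {"nspam": None, "nham": None, "token_count": None, "skipped": False}
    for line in text.splitlines():
        corpus = CORPUS_RE.search(line)
        if corpus:
            summary["nspam"] = int(corpus.group(1))
            summary["nham"] = int(corpus.group(2))
            continue
        tokens = TOKENS_RE.search(line)
        if tokens:
            summary["token_count"] = int(tokens.group(1))
            continue
        if SKIPPED_MARKER in line:
            summary["skipped"] = True
    return summary


if __name__ == "__main__":
    # Three lines taken from the log excerpt in this message.
    sample = "\n".join([
        "Apr 18 09:15:36.781 [21797] dbg: bayes: corpus size: nspam = 6155, nham = 2342",
        "Apr 18 09:15:36.787 [21797] dbg: bayes: tok_get_all: token count: 176",
        "Apr 18 09:15:36.790 [21797] dbg: bayes: cannot use bayes on this message; "
        "not enough usable tokens found",
    ])
    print(summarize_bayes_debug(sample))
```

Running this over the captured output of "spamassassin -D -t" for a
batch of messages, and counting how many summaries have skipped set,
would give a concrete percentage to report instead of an impression.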