On 4/17/2013 10:15 PM, John Hardin wrote:
> On Wed, 17 Apr 2013, Ben Johnson wrote:
> 
>> The first post on that page was the key. In particular, adding the
>> following to each MySQL "CREATE TABLE" statement:
>>
>> ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
> 
> Please check the SpamAssassin bugzilla to see if this situation is
> already mentioned, and if not, add a bug. This seems pretty critical.

Mark Martinec opened three reports in relation to this issue (quoted
from the archive thread cited in my previous post):

[Bug 6624] BayesStore/MySQL.pm fails to update tokens due to
MySQL server bug (wrong count of rows affected)
  https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6624

(^^ Fixed in 3.4 ^^)

[Bug 6625] Bayes SQL schema treats bayes_token.token as char
instead of binary, fails chset checks
  https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6625

(^^ Fixed in 3.4 ^^)

[Bug 6626] Newer MySQL chokes on TYPE=MyISAM syntax
  https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6626

(^^ Fixed in 3.4 ^^)

My concern now is that I am on 3.3.1, with little control over upgrades.
I have read all three bug reports in their entirety and Bug 6624 seems
to be a very legitimate concern. To quote Mark in the bug description:

> The effect of the bug with SpamAssassin is that tokens are only able
> to be inserted once, but their counts cannot increase, leading to
> terrible bayes results if the bug is not noticed. Also the conversion
> form db fails, as reported by Dave.
> 
> Attached is a patch for lib/Mail/SpamAssassin/BayesStore/MySQL.pm to
> provide a workaround for the MySQL server bug, and improved debug logging.

How can I discern whether or not this bug does, in fact, affect me? Are
my Bayes results being crippled as a result of this bug?
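
One rough check I can think of, going only by Mark's description: if
counts can never increase, then no row in bayes_token should have a
spam_count or ham_count above 1, even after thousands of learned
messages. Below is a minimal Python sketch of that heuristic. It assumes
the stock bayes_token columns spam_count and ham_count, and it takes any
DB-API cursor already connected to the Bayes database. The function name
is my own invention, and a maximum of 1 would be suggestive, not proof.

```python
def counts_look_stuck(cursor) -> bool:
    """Heuristic: True if no token count in bayes_token exceeds 1.

    `cursor` is any DB-API 2.0 cursor connected to the Bayes database.
    Assumes the stock bayes_token columns spam_count and ham_count.
    """
    cursor.execute(
        "SELECT COUNT(*), MAX(spam_count), MAX(ham_count) FROM bayes_token"
    )
    total, max_spam, max_ham = cursor.fetchone()
    if not total:
        # An empty table says nothing about whether counts can increase.
        return False
    return max(max_spam or 0, max_ham or 0) <= 1
```

A True result on a database that has learned many messages would be
consistent with the behavior described in Bug 6624; a False result means
at least one token has been counted more than once.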

> It's possible that there's a good reason the default script still uses
> myISAM. If so, the documentation for this fix should at least be easier
> to find.
> 

If there is a good reason, I have yet to discern what it might be. The
third bug above (Mark's comments, specifically) implies that there is
no particular reason for using MyISAM.

I have good reason for wanting to use the InnoDB storage engine, and I
have seen no performance hit as a result of doing so. (In fact,
performance seems better than with MyISAM in my scripted, once-a-day
training setup.)

The perfectly acceptable performance I'm observing could be because a)
the InnoDB-related resources allocated to MySQL are more than
sufficient, b) the schema that I used has a newly-added INDEX whereas
those prior to it did not, or c) I was sure to use the "MySQL" module
instead of the "SQL" module with my InnoDB setup:

bayes_store_module              Mail::SpamAssassin::BayesStore::MySQL

The bottom line seems to be that for those who have settings like these
in their MySQL configurations

> default_storage_engine=InnoDB
> skip-character-set-client-handshake
> collation_server=utf8_unicode_ci
> character_set_server=utf8

it is absolutely necessary to include

ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

at the end of each CREATE TABLE statement (otherwise, a MySQL syntax
error results and all Bayes SELECT statements fail).
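
For anyone who would rather not edit the schema by hand, here is a
minimal Python sketch that rewrites the table options on every CREATE
TABLE statement in a schema file's text. It assumes each statement's
closing parenthesis starts its own line, as in ") TYPE=MyISAM;", and that
no semicolon appears between that parenthesis and the end of the
statement. The function and constant names are hypothetical.

```python
import re

TABLE_OPTIONS = "ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin"

# Matches one CREATE TABLE statement up to the closing parenthesis that
# starts a line, then any existing table options up to the semicolon.
_CREATE_TABLE = re.compile(
    r"(CREATE\s+TABLE\b.*?\n\))[^;]*;",
    re.IGNORECASE | re.DOTALL,
)


def add_table_options(schema_sql: str, options: str = TABLE_OPTIONS) -> str:
    """Replace the trailing options of each CREATE TABLE with `options`."""
    return _CREATE_TABLE.sub(
        lambda m: f"{m.group(1)} {options};",
        schema_sql,
    )
```

Any existing options (such as TYPE=MyISAM) are dropped and replaced, so
the output should be reviewed before it is loaded into MySQL.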

In any event, I'm a little concerned because while the majority of
messages are now tagged with BAYES_* hits, I am now seeing this debug
output on a significant percentage of messages ("cannot use bayes on
this message; not enough usable tokens found"):

# spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'

--------------------------------------------------------------
Apr 18 09:15:36.537 [21797] dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x4430388),
bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
Apr 18 09:15:36.568 [21797] dbg: bayes: using username: amavis
Apr 18 09:15:36.568 [21797] dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x4779778)
Apr 18 09:15:36.580 [21797] dbg: bayes: database connection established
Apr 18 09:15:36.580 [21797] dbg: bayes: found bayes db version 3
Apr 18 09:15:36.581 [21797] dbg: bayes: Using userid: 1
Apr 18 09:15:36.781 [21797] dbg: bayes: corpus size: nspam = 6155, nham
= 2342
Apr 18 09:15:36.787 [21797] dbg: bayes: tok_get_all: token count: 176
Apr 18 09:15:36.790 [21797] dbg: bayes: cannot use bayes on this
message; not enough usable tokens found
Apr 18 09:15:36.790 [21797] dbg: bayes: not scoring message, returning undef
Apr 18 09:15:37.861 [21797] dbg: timing: total 2109 ms - init: 830
(39.4%), parse: 7 (0.4%), extract_message_metadata: 123 (5.9%),
poll_dns_idle: 74 (3.5%), get_uri_detail_list: 2 (0.1%),
tests_pri_-1000: 26 (1.3%), compile_gen: 155 (7.4%), compile_eval: 19
(0.9%), tests_pri_-950: 7 (0.3%), tests_pri_-900: 7 (0.3%),
tests_pri_-400: 15 (0.7%), check_bayes: 10 (0.5%), tests_pri_0: 1018
(48.3%), dkim_load_modules: 25 (1.2%), check_dkim_signature: 3 (0.2%),
check_dkim_adsp: 16 (0.7%), check_spf: 78 (3.7%), check_razor2: 91
(4.3%), check_pyzor: 430 (20.4%), tests_pri_500: 50 (2.4%)
--------------------------------------------------------------

I have searched for the string "cannot use bayes on this message; not
enough usable tokens found" and have found nothing authoritative about
what this message means, or about whether it can be ignored or is
symptomatic of a larger Bayes problem.
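
To put a number on "a significant percentage", the debug output could be
summarized per message. Below is a minimal Python sketch that pulls the
corpus size, the token count, and the skip notice out of text like the
output above. It relies only on the message strings visible in that
output; the names BayesSummary and summarize_bayes_debug are
hypothetical.

```python
import re
from typing import NamedTuple, Optional


class BayesSummary(NamedTuple):
    nspam: Optional[int]
    nham: Optional[int]
    token_count: Optional[int]
    skipped: bool  # True if "not enough usable tokens found" was logged


def summarize_bayes_debug(debug_text: str) -> BayesSummary:
    """Summarize the bayes lines of one `spamassassin -D` run."""
    # Collapse all whitespace so lines wrapped by a mail client still match.
    text = " ".join(debug_text.split())
    corpus = re.search(r"corpus size: nspam = (\d+), nham = (\d+)", text)
    tokens = re.search(r"tok_get_all: token count: (\d+)", text)
    return BayesSummary(
        nspam=int(corpus.group(1)) if corpus else None,
        nham=int(corpus.group(2)) if corpus else None,
        token_count=int(tokens.group(1)) if tokens else None,
        skipped="not enough usable tokens found" in text,
    )
```

Running this over the saved debug output of a batch of messages and
counting how many have skipped set to True would give the actual
proportion.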

Thank you,

-Ben
