I guess this question is to Michael in particular.
I thought of a very simple optimization, but I can't test it yet as I am
still recovering from having had my computer in the shop for repairs. Can you say whether it makes sense and, if it does, try it?
In Bayes.pm, the subroutine scan gathers all the tokens, then uses map to call compute_prob_for_token once for each token in the message, and each of those calls results in a call to tok_get. compute_prob_for_token is written to allow for the possibility that the data has already been fetched and is passed in, but that isn't done, so there is one call to tok_get per token.
tok_get in BayesStoreSQL.pm contains
"SELECT spam_count, ham_count, atime
FROM bayes_token
WHERE username = ?
AND token = ?";

Would it be a lot more efficient in MySQL or other SQL engines if, once scan had all the tokens from the message, it could call a tok_get_all that used token IN ... to fetch all the tokens in the message in one select? scan could call tok_get_all, and then the map could pass each set of values to compute_prob_for_token when it calls it.
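
To make the idea concrete, here is a minimal sketch of how the batched query could be built. It is not code from BayesStoreSQL.pm: the helper names build_tok_get_all_query and index_token_rows are hypothetical, and defaulting missing tokens to zero counts is an assumption about what the caller wants. The token column is added to the SELECT because rows from an IN query come back in no particular order and have to be matched to their tokens again.

```perl
use strict;
use warnings;

# Hypothetical helper: build one SELECT covering every token in a message.
# Returns the SQL text followed by the bind values, or an empty list when
# there are no tokens to look up.
sub build_tok_get_all_query {
    my ($username, @tokens) = @_;
    return unless @tokens;

    # One "?" placeholder per token, so the values are still bound, not interpolated.
    my $placeholders = join(', ', ('?') x @tokens);

    my $sql = "SELECT token, spam_count, ham_count, atime
                 FROM bayes_token
                WHERE username = ?
                  AND token IN ($placeholders)";

    return ($sql, $username, @tokens);
}

# Hypothetical helper: index fetched rows by token. Tokens with no row in the
# table get (0, 0, 0) here; that default is an assumption, not the module's API.
sub index_token_rows {
    my ($rows, @tokens) = @_;
    my %by_token = map { ($_ => [0, 0, 0]) } @tokens;
    for my $row (@$rows) {
        my ($token, @values) = @$row;
        $by_token{$token} = \@values;    # [spam_count, ham_count, atime]
    }
    return \%by_token;
}

# Usage with made-up data. With a real handle the rows would come from
# something like $dbh->selectall_arrayref($sql, undef, @bind).
my @tokens = ('viagra', 'meeting', 'unseen');
my ($sql, @bind) = build_tok_get_all_query('sidney', @tokens);
my $rows = [ [ 'meeting', 1, 40, 1060000000 ], [ 'viagra', 55, 0, 1060000500 ] ];
my $values = index_token_rows($rows, @tokens);
print "$sql\n";
print "$_: @{ $values->{$_} }\n" for @tokens;
```

scan could then hand each token's values to compute_prob_for_token inside the existing map. A very long message might need the token list split into several IN queries, depending on the limits of the SQL engine.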
-- sidney
