On Sat, Mar 13, 2004 at 09:52:14AM +1300, Sidney Markowitz wrote:
> 
> I thought of a very simple optimization, but I can't test yet as I am
> still recovering from having had my computer in the shop for repairs. 
> Can you say if it makes sense and if it does try it?
> 
> In Bayes.pm, the subroutine scan gathers all the token thens uses map to
> call compute_prob_for_token once for each token in the message, which
> results in a call to tok_get. compute_prob_for_token is written to allow
> for the possibility that the data has already been fetched and is passed
> in, but that isn't done, so there is one call to tok_get per token.
> 
> tok_get in BayesStoreSQL.pm contains
> "SELECT spam_count, ham_count, atime
>    FROM bayes_token
>   WHERE username = ?
>     AND token = ?";
> 
> Would it be a lot more efficient in MySQL or other SQL engines if once
> scan had all the tokens from the message it could call a tok_get_all
> that used token IN ... to fetch all the tokens in the message in one
> select? scan could call tok_get_all and then the map could pass each set
> of values to compute_prob_for_token when it call it.

This certainly is a possibility.  I avoided that method for several
reasons:

1) Wanted to keep changes to Bayes.pm at a minimum.

2) The addition of the ability to pass token info into
   compute_prob_for_token was added very late in the game to support
   the dump function so wasn't really on the radar for the majority of
   the changes.

3) Using IN is problematic because you have to limit the number of
   elements you have, at least in Oracle, not totally sure about other
   databases.  Also, you can completely blow your query cache building
   up random dynamic select queries like this.  I suggest that if we
   do this then you limit it to some number X and do as many queries
   with X as the max that it takes to get all of the data.  It will
   take some experimentation to determine the optimal value for X.

4) Select is fast, it's update that is slow.

That said, 1 and 2 have no bearing now that the code is in place and
it can be tweaked at will.  3, well I already said how 3 is
overcome-able.  4 is still gonna be a problem, so perhaps it's worth
looking into it.

Give it a go, I'm interested to see how well it works.

Michael

Reply via email to