[Bug 3331] [review] Bayes option to keep original token as db data (not key).

bugzilla-daemon 14 Aug 2004 21:46:19 -0000

http://bugzilla.spamassassin.org/show_bug.cgi?id=3331






------- Additional Comments From [EMAIL PROTECTED]  2004-08-14 14:46 -------
I agree that the three-function approach sounds good; a single API that receives
different combinations of parameters depending on when it's called is bad API
design, in my opinion.

(explanation: it's effectively 3 different APIs with different results and args
depending on what method it's called from.  By mixing them into one plugin API,
you move the work of ensuring the right code is run, from the API-design level
to the user-implementation-code level; in terms of Rusty Russell's "interface
simplicity spectrum" that's from level 5 to level 7.

  http://sourcefrog.net/weblog/software/aesthetics/interface-levels.html
  http://www.ozlabs.com/~rusty/ols-2003-keynote/ols-keynote-2003.html

Plus, there could be cases where users don't *want* to know about scan operation
token use, just learn-op token use.  This allows them to override just one of
the APIs and ignore the other.) 

Also: regarding 'a single "get the tokens" method' -- either the plugin
implementor would have to track the state, or Bayes.pm would have to track the
same state.  I think it's better to let the plugin implementor do it. However,
here's a good way to avoid making it too hard -- add these APIs:

  - bayes_start_message($isspam, $generatedmsgid)
    - called when a single message is about to be learned
  - bayes_scan_tokens($toksref, $msgatime)
    - pass the list of tokens from that message, for a scan op
      (do we have $msgatime here?)
  - bayes_learn_tokens($toksref, $msgatime)
    - pass the list of tokens from that message, for a learn op
  - bayes_forget_tokens($toksref)
    - pass the list of tokens from that message, for a forget op
      ($msgatime is irrelevant)
  - bayes_finish_message()
    - called when that single message learning op has been completed
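
To illustrate the point about overriding just one of the APIs, here is a rough
sketch only.  The package names My::BayesHooks and My::LearnCounter are made up
for illustration, and the positional arguments simply follow the list above;
none of this is an existing SpamAssassin interface.  A base class supplies
no-op versions of all five hooks, and the plugin overrides only the learn hook:

```perl
use strict;
use warnings;

# Hypothetical base class: every proposed hook defaults to a no-op.
package My::BayesHooks;

sub new {
    my ($class) = @_;
    my $self = {};
    return bless $self, $class;
}

sub bayes_start_message  { return; }
sub bayes_scan_tokens    { return; }
sub bayes_learn_tokens   { return; }
sub bayes_forget_tokens  { return; }
sub bayes_finish_message { return; }

# Hypothetical plugin that cares about learn ops only.
package My::LearnCounter;
our @ISA = ('My::BayesHooks');

sub bayes_learn_tokens {
    my ($self, $toksref, $msgatime) = @_;
    # Count the tokens seen during learn operations.
    $self->{learned} += scalar @{$toksref};
    return;
}

package main;

my $plugin = My::LearnCounter->new;
$plugin->bayes_start_message(1, 'msgid-1');
$plugin->bayes_scan_tokens(['a', 'b'], time);        # inherited no-op
$plugin->bayes_learn_tokens(['a', 'b', 'c'], time);  # counted
$plugin->bayes_finish_message();
print "tokens learned: $plugin->{learned}\n";
```

The scan call is silently ignored because the plugin never had to know that
hook exists.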

The bayes_start_message() and bayes_finish_message() APIs allow the plugin to
know when a new message is being operated on.  So the call order would be, for
example:

    $plugin->bayes_start_message(...);
    $plugin->bayes_learn_tokens(...);   # dump the tokens
    $plugin->bayes_finish_message();
    $plugin->bayes_start_message(...);
    $plugin->bayes_forget_tokens(...);  # msg was previously learned
                                        # dump the tokens
    $plugin->bayes_learn_tokens(...);   # we've already dumped 'em, not again
    $plugin->bayes_finish_message();

So an implementor would just have to keep a single boolean that's set to 0
during bayes_start_message() and set to 1 during a foo_tokens() call -- then if
a later foo_tokens() call finds it already set to 1, it doesn't dump the tokens,
because that's already been done once on that msg.

Theo: dump plugin APIs would be useful, good point -- maybe let dump_data()
rewrite the line arbitrarily.  I'd suggest something like this API:

$line = dump_data({
        origline => ...,
        prob => ...,
        ts => ...,
        th => ...,
        atime => ...,
        encoded_tok => ...
    })

(in other words, give it the *unstringified* data as well so that it doesn't
have to parse them out of the line.)
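
As a rough illustration of why the unstringified fields help, here is a sketch
of one possible dump_data() that rebuilds the line as tab-separated output
without touching origline.  The sample values and the output layout are made up
for the example; they are not the real dump format:

```perl
use strict;
use warnings;

# Hypothetical implementation: build a new line from the individual
# fields instead of parsing them back out of origline.
sub dump_data {
    my ($args) = @_;
    return join("\t",
        sprintf('%.3f', $args->{prob}),
        $args->{ts},
        $args->{th},
        $args->{atime},
        $args->{encoded_tok},
    );
}

# Made-up sample values, for illustration only.
my $line = dump_data({
    origline    => '0.987 12 3 1092519979 viagra',
    prob        => 0.987,
    ts          => 12,
    th          => 3,
    atime       => 1092519979,
    encoded_tok => 'viagra',
});
print "$line\n";
```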

And then that's probably enough APIs for now ;)





