RE: Tokens, Tokenizers, and Classifiers

Marty Lamb Thu, 06 Mar 2003 09:30:29 -0800

On Thu, 2003-03-06 at 12:14, Andrew Rose wrote:
> One thing though.  You talk about a Classification being
> a string category and a % probability of the mail falling into that
> category.  However, you don't define which categories are valid or what
> their meaning is.  If the core TarProxy software (as opposed to a
> plugin) is to use the categories, you'll need to define which ones are
> valid along with semantics for them.


You're absolutely right.  That takes place in a module I haven't
discussed yet, but it's addressed.

> Also, I wonder if an array of
> Classifications should be returned.  That way, classify(Token token)
> would return e.g. {{"Spam", 0.65}, {"Clean", 0.40}, {"Worm", 0.05}}
> with an entry for each possible classification.  Otherwise, how would
> it know which classification to return?  Alternatively, you could
> simply make this a Spam classifier (rather than trying
> hyper-generality) in which case the return from classify(Token token)
> is just the probability of the message so far being a spam.

That's interesting.  I'm not sure how the additional information could
be used, but it's worth investigating.  I was working with the
assumption that a Classifier would return the most probable category.
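
To make the two options concrete, here is a minimal sketch in Java. It is
not TarProxy's actual API: the Classification class, its fields, and the
mostProbable helper are hypothetical names chosen for illustration. It
shows that a caller given the full list can still recover the single most
probable category, so returning the list loses nothing.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical value type: a category name plus a probability in [0, 1].
final class Classification {
    final String category;
    final double probability;

    Classification(String category, double probability) {
        this.category = category;
        this.probability = probability;
    }
}

// Hypothetical helper, not part of any existing TarProxy module.
final class Classifications {
    // Returns the entry with the highest probability.
    // Throws IllegalArgumentException if the list is empty.
    static Classification mostProbable(List<Classification> all) {
        return all.stream()
                .max(Comparator.comparingDouble((Classification c) -> c.probability))
                .orElseThrow(() -> new IllegalArgumentException("no classifications"));
    }
}
```

With the example values from the quoted message (Spam 0.65, Clean 0.40,
Worm 0.05), mostProbable would select the "Spam" entry.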

As for recognizing categories other than spam, the intent is to enable
different throttling behaviors.  For example, "worm" emails should
probably be throttled back far more than the more benign spam.  I'm not
sure how much of that is going to work in the initial version, but it
strikes me as being potentially useful.
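
As a rough illustration of per-category throttling, the sketch below maps
a category name to a delay. Everything in it is an assumption made for the
example: the ThrottlePolicy class, its methods, and the idea of expressing
throttling as a delay in milliseconds are hypothetical, not a description
of how TarProxy works.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical policy object: looks up a throttling delay by category.
final class ThrottlePolicy {
    private final Map<String, Long> delayMillisByCategory = new HashMap<>();
    private final long defaultDelayMillis;

    // defaultDelayMillis applies to any category with no explicit entry.
    ThrottlePolicy(long defaultDelayMillis) {
        this.defaultDelayMillis = defaultDelayMillis;
    }

    // Registers a delay for one category; returns this to allow chaining.
    ThrottlePolicy set(String category, long delayMillis) {
        delayMillisByCategory.put(category, delayMillis);
        return this;
    }

    // Returns the configured delay, or the default for unknown categories.
    long delayFor(String category) {
        return delayMillisByCategory.getOrDefault(category, defaultDelayMillis);
    }
}
```

A policy built as new ThrottlePolicy(0).set("Spam", 2000).set("Worm", 30000)
would slow "Worm" traffic more than "Spam" and leave other categories
undelayed. The numbers are placeholders, not recommendations.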

- Marty

-- 
Marty Lamb
Martian Software
<mlamb at martiansoftware dot com>
