RE: Tokens, Tokenizers, and Classifiers

Andrew Rose Thu, 06 Mar 2003 09:15:20 -0800

 
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Marty,


I've looked over the interface document quickly.  This looks basically
fine.  I have to admit that I'm unlikely to actually write a classifier
or tokenizer (because in a couple of weeks I'm moving from one end of
the country to the other) so I haven't given it the closest combing
that I might.  One thing, though.  You talk about a Classification being
a string category and a % probability of the mail falling into that
category.  However, you don't define which categories are valid or what
their meaning is.  If the core TarProxy software (as opposed to a
plugin) is to use the categories, you'll need to define which ones are
valid along with semantics for them.  Also, I wonder if an array of
Classifications should be returned.  That way, classify(Token token)
would return e.g. {{"Spam", 0.65}, {"Clean", 0.40}, {"Worm", 0.05}}
with an entry for each possible classification.  Otherwise, how would
the classifier know which single classification to return?
Alternatively, you could simply make this a spam classifier (rather
than attempting hyper-generality), in which case the return from
classify(Token token) is just the probability that the message so far
is spam.
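
To make the two options concrete, here is a rough Java sketch.  It is
only an illustration under assumptions, not the interface from the
document: Token and Classification are minimal stand-ins for the real
types, and MultiClassifier, SpamClassifier, exampleResults and
mostLikely are hypothetical names invented for this example.  The
probabilities are the ones from the example above; they are treated as
independent per-category scores, so they are not required to sum to 1.

```java
import java.util.Arrays;
import java.util.List;

public class ClassifierSketch {

    /** Hypothetical stand-in for the Token type in the interface document. */
    static final class Token {
        final String text;
        Token(String text) { this.text = text; }
    }

    /** A category name paired with a probability in the range [0, 1]. */
    static final class Classification {
        final String category;
        final double probability;
        Classification(String category, double probability) {
            this.category = category;
            this.probability = probability;
        }
    }

    /** Option 1 (assumed shape): one entry per known category. */
    interface MultiClassifier {
        List<Classification> classify(Token token);
    }

    /** Option 2 (assumed shape): spam only, returning P(spam so far). */
    interface SpamClassifier {
        double classify(Token token);
    }

    /** The example values from the message, one entry per category. */
    static List<Classification> exampleResults() {
        return Arrays.asList(
            new Classification("Spam", 0.65),
            new Classification("Clean", 0.40),
            new Classification("Worm", 0.05));
    }

    /** Returns the entry with the highest probability. */
    static Classification mostLikely(List<Classification> results) {
        if (results.isEmpty()) {
            throw new IllegalArgumentException("no classifications given");
        }
        Classification best = results.get(0);
        for (Classification c : results) {
            if (c.probability > best.probability) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A fixed-answer classifier, used only to exercise the interface.
        MultiClassifier fixed = token -> exampleResults();
        Classification best = mostLikely(fixed.classify(new Token("viagra")));
        System.out.println(best.category + " " + best.probability);

        // The spam-only alternative collapses the result to one number.
        SpamClassifier spamOnly = token -> 0.65;
        System.out.println(spamOnly.classify(new Token("viagra")));
    }
}
```

With the array form, the core software can pick the top category itself
(as mostLikely does here) or apply its own thresholds, but only if the
set of valid category names is defined somewhere.  The spam-only form
avoids that question entirely.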

Cheers,

Andrew

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

iQA/AwUBPmeB+wXqoqbqowOrEQJ9YwCgm09rvGs3pgbIWgTxB12d71A4KSwAoNkU
yp2Foz8Jx5XzsD0UGg3s6YPE
=E6vk
-----END PGP SIGNATURE-----
