Another possible meta-token that might help detect word salad (probably what Skip had in mind):
    percentage of unique word tokens that are not significant

Whether or not this would help classify word salad better is anyone's guess. I would hope that your own correspondents have some messages in the training set, so a larger fraction of their obscure words would be significant clues than you'd expect of random text from other sources.

Using a percentage rather than an absolute number may avoid bias towards large or small messages. Then again, having both percentage and total number versions of this meta-token may prove useful for some users' training sets, as their legitimate mail may tend towards large or small messages. If one version or the other is not useful for an end user, that meta-token will probably turn out to not be significant and will be excluded from the overall score.

Using meta-information is a little scary, since the underlying tokens already contribute to the overall spam score. I think the trick is to devise meta-tokens that describe overall message characteristics and are relatively independent of individual token scores.

--
Seth Goodman
_______________________________________________
spambayes-dev mailing list
http://mail.python.org/mailman/listinfo/spambayes-dev
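
For illustration only, here is a rough sketch of how the percentage and total-count meta-tokens described above might be generated. None of it is taken from the SpamBayes code: the function name, the token strings, the 0.5 probability assumed for unseen words, and the 0.1 significance cutoff are all assumptions made for the example. The percentage is rounded down to a multiple of ten and the count is reduced to its bit length, so that similar messages produce the same meta-token.

```python
def insignificant_meta_tokens(tokens, token_probs, min_strength=0.1, bucket=10):
    """Return two hypothetical meta-tokens for one message.

    tokens       -- word tokens extracted from the message
    token_probs  -- dict mapping a token to its trained spam probability
    min_strength -- assumed cutoff: a token whose probability lies within
                    this distance of 0.5 counts as "not significant"
    bucket       -- width of the percentage buckets
    """
    unique = set(tokens)
    if not unique:
        return []

    # Words never seen in training are assumed to score a neutral 0.5.
    n_insignificant = sum(
        1 for tok in unique
        if abs(token_probs.get(tok, 0.5) - 0.5) < min_strength
    )

    # Percentage version: independent of message size.
    pct = 100 * n_insignificant // len(unique)
    pct_bucket = pct // bucket * bucket

    # Total-number version: coarse size class (0, 1, 2-3, 4-7, 8-15, ...).
    size_class = n_insignificant.bit_length()

    return [
        "meta:pct-insignificant:%d" % pct_bucket,
        "meta:num-insignificant-bits:%d" % size_class,
    ]


example = insignificant_meta_tokens(
    ["viagra", "viagra", "meeting", "xylophone", "the"],
    {"viagra": 0.99, "meeting": 0.05, "the": 0.5},
)
print(example)
```

In this example two of the four unique words ("xylophone" and "the") are not significant, so the percentage token lands in the 50 bucket. Both tokens would then be trained on like any other token, and whichever one carries no signal for a given user should end up near 0.5 and drop out of the score.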
