[EMAIL PROTECTED] wrote on Saturday, February 03, 2007 3:17 PM -0600:

> Seth> Another possible meta-token that might help detect word salad
> Seth> (probably what Skip had in mind):
>
> Seth> percentage of unique word tokens that are not significant
>
> I see a chicken-and-egg situation developing when we try to compute
> these sort of numbers. Start with an empty database. Train on a ham
> message. No words are significant at that point, so having no
> significant word tokens is a hammy clue. Train on a spam. By
> definition all words in the database at this point are significant,
> so only words not yet seen will be deemed not significant.
It definitely has chicken-and-egg properties.

> Lather, rinse, repeat.
>
> Maybe after you're done training on all available messages you can
> toss all these percentage tokens and make a second pass over your
> messages computing only those tokens. Are there better ways to
> compute tokens such as this which depend on the contribution of
> other messages in the database?

I hope so. This is fundamentally different from drawing an inference from
previously observed word frequencies. Numeric-value meta-tokens are not the
result of binary experiments. They exist for every message, whether ham or
spam, and they are real numbers whose underlying distribution we don't know.
The problem is to estimate the probability that a message containing a token
with a given numeric value is ham or spam, based on the values of that token
observed in trained ham and spam.

This is a very raw idea, not even half-baked. I think the problem becomes
tractable if we assume the token values are Gaussian distributed, even if we
believe they aren't. It should then be possible to estimate the likelihood
that a given token value came from a spam message, based on the distribution
of that token's values in both trained ham and spam. If the distribution is
Gaussian, we only need to know the mean and variance of each population.

If this turns out to work at all, we wouldn't need much information in the
database. For each numeric-value token you model this way, you need at least
the mean and variance for each of ham and spam. To untrain a value, I think
you could get away with keeping only the intermediate values used to
calculate the variance; I vaguely recall there are two of them. If you want
to support arbitrary real values, these are all floats, with the possibility
that the intermediate variables need to be double precision.

--
Seth Goodman
_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev
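[Editor's note: the Gaussian scheme Seth sketches above can be made concrete
with a small Python illustration. The class name, labels, and training values
below are all hypothetical, not part of SpamBayes; it assumes equal class
priors and uses per-class running count, sum, and sum of squares as the
"intermediate values", so untraining is an exact reversal.]

```python
import math

class GaussianTokenModel:
    """Model one numeric meta-token as Gaussian within each class,
    keeping only count, sum(x), and sum(x^2) per class -- enough to
    recover mean and variance, and to untrain a value exactly."""

    def __init__(self):
        # per-class accumulators: [count, sum of values, sum of squared values]
        self.stats = {"ham": [0, 0.0, 0.0], "spam": [0, 0.0, 0.0]}

    def train(self, label, value):
        acc = self.stats[label]
        acc[0] += 1
        acc[1] += value
        acc[2] += value * value

    def untrain(self, label, value):
        # exact inverse of train(): subtract the same contributions
        acc = self.stats[label]
        acc[0] -= 1
        acc[1] -= value
        acc[2] -= value * value

    def _pdf(self, label, value):
        count, s, ss = self.stats[label]
        if count < 2:
            return None  # not enough data to estimate a variance
        mean = s / count
        var = (ss - s * s / count) / (count - 1)  # sample variance
        if var <= 0.0:
            var = 1e-9  # guard against a degenerate (zero) variance
        return math.exp(-(value - mean) ** 2 / (2.0 * var)) / \
            math.sqrt(2.0 * math.pi * var)

    def spamprob(self, value):
        """P(spam | value) assuming equal priors; 0.5 when undecidable."""
        ph = self._pdf("ham", value)
        ps = self._pdf("spam", value)
        if ph is None or ps is None or ph + ps == 0.0:
            return 0.5
        return ps / (ps + ph)
```

Usage: train a few ham values clustered low and spam values clustered high,
then score a new value; scores near the spam cluster come out close to 1,
scores near the ham cluster close to 0, and untraining a value restores the
accumulators to their prior state bit-for-bit (modulo float rounding).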