Sidney Markowitz writes:

> Scott A Crosby wrote:
> [... snip really neat Sean Quinlan trick ...]
>
> I like that! But thinking about it, I realize that much of the purpose
> of the trick is to optimize a case where collisions have to be
> accounted for, as in the case of archival storage and retrieval
> (Venti).
>
> We can get away with "lossy" behavior, as long as the statistics are
> "good enough". That means we can decide what probability of collision
> is acceptable, set the bit sizes accordingly, and not worry about
> collisions.
>
> I was going to go on about the probability of collisions, but
> something just occurred to me:
>
> The Bayes database is being used in two different ways. When we are
> calculating the spam probability, we are only interested in the most
> significant 15 tokens in the message, and we are not writing to the
> database. Is there any reason to have the majority of the tokens,
> which are rare, available in the database for that purpose? Why do we
> need a database with millions, or even hundreds of thousands, of
> tokens at all, except when we are performing a learn operation?

The issue is that those 1-hit tokens provide several percent of overall
accuracy -- they're very important. Even though any individual one may
not appear in most messages, as a general set of "hapax tokens" they are
common across all messages scanned. They are frequently very strong
tokens, too, which makes them useful and part of the top N tokens (150,
by the way) included in the calculation. In fact, the most common tokens
are often common in both ham and spam, making them useless for scanning
purposes.

> Can we run sa-learn in a batch mode and have it generate small
> databases that are used by the Bayes rules?

Well, we are trying to avoid "batch modes" ;)

--j.
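P.S. The "set the bit sizes accordingly" tradeoff Sidney mentions is just
the standard birthday bound. A rough sketch in Python (illustrative only --
the function name and the example numbers are made up here, not taken from
SpamAssassin's actual token store):

```python
import math

def collision_probability(n_tokens: int, hash_bits: int) -> float:
    """Approximate probability that at least two of n_tokens distinct
    tokens collide under a uniform hash_bits-bit hash.

    Uses the birthday-bound approximation 1 - exp(-n(n-1) / 2^(b+1)),
    which is accurate when n_tokens is small relative to 2^hash_bits.
    """
    space = 2.0 ** hash_bits
    return 1.0 - math.exp(-n_tokens * (n_tokens - 1) / (2.0 * space))

# Example: a database of one million tokens.
p40 = collision_probability(1_000_000, 40)  # some collisions likely
p64 = collision_probability(1_000_000, 64)  # collisions negligible
```

So for a given database size you pick the hash width that pushes the
collision probability below whatever error rate you have decided is
acceptable, and then stop worrying about collisions entirely.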
