Sidney Markowitz writes:
>Scott A Crosby wrote:
>[... snip really neat Sean Quinlan trick ...]
>
>I like that! But thinking about it I realize that much of the purpose of 
>the trick is to optimize a case where collisions have to be accounted 
>for, as in the case of archival storage and retrieval (Venti).
>
>We can get away with "lossy" behavior, as long as the statistics are 
>"good enough". What that means is that we can decide what probability of 
>collision is acceptable, set the bit sizes accordingly, and not worry 
>about collisions.
>
>I was going to go on about the probability of collisions, but something 
>just occurred to me:
>
>The bayes database is being used in two different ways. When we are 
>calculating the spam probability, we are only interested in the most 
>significant 15 tokens in the message, and we are not writing to the 
>database. Is there any reason to have the majority of the tokens which 
>are rare available in the database for that purpose? Why do we need a 
>database with millions, or even hundreds of thousands of tokens at all 
>except for when we are performing a learn operation?

The issue is that those 1-hit tokens account for several percent of
overall accuracy -- they're very important.  Any one of them rarely
shows up in a given message, but taken together as a set of "hapax
tokens", over all messages scanned, they are common.

They are frequently very strong tokens, too, which makes them useful and
puts them among the top N tokens (150 btw) included in the calculation.
In fact, the most common tokens are often common in both ham and spam,
making them useless for scanning purposes.
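
A rough sketch of why, assuming a naive per-token estimate (this is not
SpamAssassin's actual formula; the function names, the in-memory dict and
the counts are all made up for illustration).  Tokens are ranked by how
far their spam probability sits from the neutral 0.5, so a 1-hit token
seen only in spam outranks a very common token seen about equally in both:

```python
def token_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Naive per-token spam probability from raw hit counts (no smoothing)."""
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    return spam_rate / (spam_rate + ham_rate)


def most_significant(tokens, db, n_spam, n_ham, top_n=150):
    """Return the top_n (token, probability) pairs furthest from 0.5.

    db is a hypothetical mapping: token -> (spam_hits, ham_hits).
    """
    scored = []
    for tok in set(tokens):
        if tok not in db:
            continue  # never seen during learning, so no evidence either way
        spam_hits, ham_hits = db[tok]
        if spam_hits + ham_hits == 0:
            continue  # avoid dividing by zero on an empty entry
        p = token_probability(spam_hits, ham_hits, n_spam, n_ham)
        scored.append((abs(p - 0.5), tok, p))
    scored.sort(reverse=True)  # strongest evidence first
    return [(tok, p) for _, tok, p in scored[:top_n]]


# Illustrative counts only: two hapaxes and one very common token.
db = {"viagra": (1, 0), "meeting": (0, 1), "the": (900, 950)}
picked = most_significant(["the", "viagra", "meeting"], db,
                          n_spam=1000, n_ham=1000, top_n=2)
```

With these counts both hapaxes are picked and "the" is dropped, since its
probability comes out very close to 0.5.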

>Can we run salearn in a batch mode and have it generate small databases 
>that are used by the Bayes rules?

Well, we are trying to avoid "batch modes" ;)

--j.