-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Sidney Markowitz writes:
>Justin Mason wrote:
>> They are frequently very strong tokens, too, making them useful and part
>> of the top N tokens (150 btw) included in the calculation
>
>I'm talking about ranking tokens by strength. It would not matter how 
>common they are. What percent of all tokens in the db get picked as 
>being in the top 15 (or whatever we use) of any of the messages that are 
>looked at? How would it affect accuracy by not having the weakest N% of 
>tokens in the db available during the calculations?
>
>> the most common tokens are often common in both ham and spam, making
>> them useless for scanning purposes.
>
>Exactly. If they are useless why do we need them in the db that is used 
>when we are scanning? They are of course needed during training.

OK, that's an interesting idea.  Hmm... I've never tested that.
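
For anyone who wants to try measuring it, here is a rough sketch in Python
(not SpamAssassin code). It assumes a token's "strength" is the distance of
its spam probability from the neutral 0.5, and that the db is a plain dict
of token to probability. The function names and the toy data are made up
for illustration only.

```python
def strength(prob):
    """Distance of a token's spam probability from the neutral 0.5 (assumed metric)."""
    return abs(prob - 0.5)


def top_tokens(message_tokens, db, n):
    """Return the n strongest tokens of one message that are present in db."""
    known = {t for t in message_tokens if t in db}
    # Sort strongest first; break ties by token text so the result is stable.
    return sorted(known, key=lambda t: (-strength(db[t]), t))[:n]


def fraction_ever_used(messages, db, n):
    """Fraction of db tokens picked as a top-n token in at least one message."""
    used = set()
    for tokens in messages:
        used.update(top_tokens(tokens, db, n))
    return len(used) / len(db) if db else 0.0


def prune_weakest(db, percent):
    """Return a copy of db without the weakest `percent` of tokens by strength."""
    ranked = sorted(db, key=lambda t: (strength(db[t]), t))
    drop = set(ranked[: int(len(ranked) * percent / 100)])
    return {t: p for t, p in db.items() if t not in drop}


# Toy data, purely hypothetical.
db = {"viagra": 0.99, "meeting": 0.02, "the": 0.5, "and": 0.51, "free": 0.9}
messages = [["viagra", "the", "free"], ["meeting", "and", "the"]]

print(fraction_ever_used(messages, db, 2))
print(sorted(prune_weakest(db, 40)))
```

Running fraction_ever_used over a real corpus with n set to whatever the
scanner uses would answer the "what percent get picked" question, and
scoring the same corpus against db and against prune_weakest(db, N) would
show the effect on accuracy. The full db would still be kept for training.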

>> Well, we are trying to avoid "batch modes" ;)
>
>Sonic.net already has to do that to some degree to attempt to deal with 
>I/O requirements. Messages are tokenized, messages in the form of token 
>summaries are written to a spool, and then a separate process does the 
>learning. It should be optional, but for scalability it should be easy 
>to separate the processes of scanning for spam and doing the training. 
>Whether or not you call it a "batch mode".

Yes, maybe for super-high-volume setups a batch mode is unavoidable.
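
A minimal sketch of that kind of split, in Python: the scanner writes a
token summary per message into a spool directory, and a separate learner
drains it later. This is not how Sonic.net actually implements it (I don't
know their details); the JSON file layout, the function names, and the
in-memory counters standing in for the Bayes db are all assumptions.

```python
import json
import os
import tempfile
import uuid
from collections import Counter


def spool_summary(spool_dir, tokens, is_spam):
    """Scanner side: write one token summary into the spool."""
    record = {"is_spam": is_spam, "tokens": sorted(set(tokens))}
    # Write to a temp file first, then rename, so the learner never
    # sees a half-written summary.
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(record, fh)
    final_path = os.path.join(spool_dir, uuid.uuid4().hex + ".json")
    os.replace(tmp_path, final_path)
    return final_path


def learn_from_spool(spool_dir, spam_counts, ham_counts):
    """Learner side: consume every finished summary and update the counts."""
    processed = 0
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".json"):
            continue  # skip temp files still being written
        path = os.path.join(spool_dir, name)
        with open(path) as fh:
            record = json.load(fh)
        target = spam_counts if record["is_spam"] else ham_counts
        target.update(record["tokens"])
        os.remove(path)
        processed += 1
    return processed


# Hypothetical usage: the two halves would normally be separate processes.
with tempfile.TemporaryDirectory() as spool:
    spool_summary(spool, ["free", "viagra", "free"], is_spam=True)
    spool_summary(spool, ["meeting"], is_spam=False)
    spam, ham = Counter(), Counter()
    print(learn_from_spool(spool, spam, ham), dict(spam), dict(ham))
```

The point of the rename step is that scanning only ever does a cheap
append-style write, while all the db updates happen in the learner at
whatever pace the I/O allows.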

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFATmw/QTcbUG5Y7woRAp69AKDYZpo42fk8oR/tX1PVQIx2MNLYlQCgunNC
XEkWtY3I6DMrYkiPIyRH2/A=
=CJcm
-----END PGP SIGNATURE-----
