They are frequently very strong tokens, too, making them useful and part of the top N tokens (N is 150, by the way) included in the calculation.
I'm talking about ranking tokens by strength; it would not matter how common they are. What percent of all tokens in the db ever get picked as being in the top 15 (or whatever we use) of any of the messages that are looked at? And how would accuracy be affected if the weakest N% of tokens were not available in the db during the calculations?
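Here is a rough Python sketch of the measurement being proposed. The db layout (token -> (spam_count, ham_count)), the per-token probability formula, and all the names (token_db, TOP_K, etc.) are assumptions for illustration, not what the scanner actually does:

TOP_K = 15  # top-N strongest tokens per message used in the calculation

def spam_probability(token, token_db, n_spam, n_ham):
    # Naive per-token spam probability from the training counts.
    s, h = token_db.get(token, (0, 0))
    spam_freq = s / n_spam if n_spam else 0.0
    ham_freq = h / n_ham if n_ham else 0.0
    if spam_freq + ham_freq == 0.0:
        return 0.5  # unseen token carries no evidence either way
    return spam_freq / (spam_freq + ham_freq)

def strength(p):
    # A token's strength is its distance from the neutral 0.5.
    return abs(p - 0.5)

def tokens_ever_in_top_k(messages, token_db, n_spam, n_ham):
    # Which db tokens are ever among some message's TOP_K strongest?
    used = set()
    for msg_tokens in messages:
        ranked = sorted(
            (t for t in set(msg_tokens) if t in token_db),
            key=lambda t: strength(spam_probability(t, token_db, n_spam, n_ham)),
            reverse=True,
        )
        used.update(ranked[:TOP_K])
    return used

The percentage in question is then 100.0 * len(used) / len(token_db), and the pruning experiment is: drop the tokens whose strength ranks in the bottom N% db-wide, then re-measure accuracy on a held-out corpus.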
The most common tokens are often common in both ham and spam, which makes them useless for scanning purposes.
Exactly. If they are useless, why do we need them in the db that is used when we are scanning? They are, of course, needed during training.
Well, we are trying to avoid "batch modes" ;)
Sonic.net already has to do that to some degree to cope with the I/O requirements. Messages are tokenized, the messages (in the form of token summaries) are written to a spool, and then a separate process does the learning. It should be optional, but for scalability it should be easy to separate the process of scanning for spam from the process of doing the training, whether or not you call that a "batch mode".
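To make the split concrete, here is a minimal Python sketch of that architecture. The spool format, file name, and two-tuple count layout are invented for the example; they are not how Sonic.net's system actually works:

import json
from collections import Counter
from pathlib import Path

SPOOL = Path("token-summaries.spool")  # hypothetical spool location

def scan_and_spool(message_text, is_spam):
    # Scanner side: tokenize and append one token summary per message.
    # It only appends, so it never blocks on db writes.
    tokens = message_text.split()  # stand-in for the real tokenizer
    summary = {"is_spam": is_spam, "counts": Counter(tokens)}
    with SPOOL.open("a") as f:
        f.write(json.dumps(summary) + "\n")

def train_from_spool(token_db):
    # Trainer side, run as a separate process: fold the spooled
    # summaries into the token db, then clear the spool.
    if not SPOOL.exists():
        return
    with SPOOL.open() as f:
        for line in f:
            summary = json.loads(line)
            idx = 0 if summary["is_spam"] else 1
            for token, n in summary["counts"].items():
                counts = list(token_db.get(token, (0, 0)))
                counts[idx] += n
                token_db[token] = tuple(counts)
    SPOOL.unlink()  # in practice you would rotate, not delete

All the expensive db updates happen in the trainer, so the scanner's latency stays flat no matter how far behind the learning gets.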
-- sidney
