> [-] I don't like having a "Junk Suspects" folder. To me, this just means
> that I have two spam folders to manage instead of one. As a workaround, I
> have just set both spam and spam suspects to go to the same Junk folder. I
> just made this change, so I don't yet know if this will cause any problems,
> but I don't see why it should.

If you've left the thresholds at the defaults (which later messages indicate you have), you have effectively set both thresholds to 0.15. IOW, any message that scores more than 0.15 is spam. Any tokens that haven't been seen before score 0.5, so messages made up of unfamiliar tokens score much, much higher than your threshold. You'll end up with a *lot* of false positives.

If you're certain you don't want an unsure range, (a) consider a filter that isn't aimed at creating one, or (b) consider a threshold around 0.4 to 0.6.

As to whether there should be one or not:

Most users experience such a low false positive rate that they have no need to check messages classified as spam. This reduces the practical “spam workload”, defined as the percentage of messages needing manual checking, to around 2-5% of the total mail stream. Any system that exhibits a non-trivial false positive rate requires the user to check all messages classified as spam to ensure that valuable mail is not lost, dramatically reducing the value of the spam filtering technology.

SpamBayes allows the user to configure the size and position of the unsure range to ensure the number of messages classified as unsure is consistent with the user’s comfort level, training database and risk tolerance of false positives.

A remarkable property of chi-combining is that people have generally been sympathetic to its ‘unsure’ ratings: people usually agree that messages classed unsure really are hard to categorize. For example, commercial HTML email from a company you do business with is quite likely to score as unsure the first time the classifier sees such a message from a particular company. Spam and commercial email both use the language and devices of advertising heavily, so it is hard to tell them apart.

SpamBayes users typically experience no false positives; this is not from an inherent strength of SpamBayes over similar statistical (or other) filters, but as a result of the unsure range.
Essentially, the messages that would otherwise have been false positives are classified as unsure.

The advantage of this system is that the volume of mail that the user must scan to find errors (both false positives and false negatives) is greatly reduced; typically between one and five percent of messages are classified as unsure, which is generally much lower than the percentage of mail that is spam. As a result, users are more likely to take the time to scan the unsure folder than they would be to scan the entire spam folder, more able to identify the correct classification (rather than missing a false positive in a crowded spam folder) and more likely to appropriately train messages therein.

The disadvantage of this system is that the percentage of messages that are classified as unsure is typically higher than the combined percentage of false negative and false positive messages obtained when using a classifier that does not include an unsure range. In simple terms, more messages must be manually corrected, but fewer messages must be manually examined.

[Stolen, with minor adaption, from my papers:
http://www.ceas.cc/papers-2004/136.pdf
http://www.massey.ac.nz/~tameyer/research/spambayes/tameyer_trec_2005.pdf]

If you are concerned with possible false positives, then you will still scan your spam folder. However, there's scanning, and there's scanning. If there wasn't an unsure range, and the false positive rate was around 1%, then you'd have to carefully scan the spam folder. With a false positive rate close to 0%, you can quickly flick through the folder, probably just glancing at senders & subjects. Scanning through the unsure folder, since it contains many fewer messages than the spam/ham folders, is quick.

Would you rather spend 20 minutes scanning through the spam folder daily, or 5 minutes scanning the spam folder each week and 2 minutes scanning the unsure folder each day?

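To make the threshold point from the start of this reply concrete, here is a minimal sketch of a three-way split on a message score. It is not SpamBayes' own code: the function name, the 0.90 spam cutoff, the sample scores, and the exact handling of a score that lands right on a cutoff are all assumptions for illustration. Only the 0.15 figure and the 0.5 score for unseen tokens come from the discussion above.

```python
# Illustrative sketch only; not taken from SpamBayes itself.
# ham_cutoff=0.15 is the figure discussed above; spam_cutoff=0.90 and the
# sample scores are assumed values chosen for the example.

def classify(score, ham_cutoff, spam_cutoff):
    """Return 'ham', 'unsure' or 'spam' for a score between 0.0 and 1.0."""
    if score >= spam_cutoff:
        return "spam"
    if score >= ham_cutoff:
        return "unsure"
    return "ham"

sample_scores = {
    "known good": 0.02,
    "never seen before": 0.50,  # unseen tokens score 0.5
    "obvious spam": 0.99,
}

for name, score in sample_scores.items():
    # Separate thresholds: there is an unsure range between the two cutoffs.
    separate = classify(score, ham_cutoff=0.15, spam_cutoff=0.90)
    # Both folders merged: everything at or above 0.15 is treated as spam.
    merged = classify(score, ham_cutoff=0.15, spam_cutoff=0.15)
    print(f"{name:18} {score:.2f}  separate={separate:7} merged={merged}")
```

With separate cutoffs the never-seen-before message lands in unsure; with both cutoffs at 0.15 the same message is filed as spam, which is where the extra false positives would come from.
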
=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
