On 9/25/07, David <[EMAIL PROTECTED]> wrote:
> One feature that may be useful would be to let ASSP automatically scale
> how much spam/nonspam it collects, based on a couple of factors. It is
> often too easy to let the bayesian database get skewed to one side
> (usually heavy on the spam side) due to imbalanced collecting (such as
> 1:1, with a 80% spam rate).
>
> Perhaps ASSP could look at the rebuildrun.txt, see the value of the
> weighted norm then decide if it needs to adjust the collecting in one
> direction or another. Then it would also look at the Non-Local Mail
> Blocked (or another spam ratio indicator) to see how far it needs to
> skew the collecting (1:2, 1:4, I have close to 90% spam so I've been
> using 1:10 to get my corpus norm down from 3.5)
>
> For cases like mine where the corpus was heavily skewed, it would need
> to push the ratio even further (1:15, 1:20) then level out once the norm
> nears 1.0
>
> any thoughts?

I think this is an interesting idea. I have thought about this before, and have wondered whether some simple mathematics could be applied in rebuildspamdb.pl:

1) check what the spam/ham ratio is, and
2) automagically adjust freqNonSpam and/or freqSpam to compensate for severely skewed ratios.

-- ME2
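
To make the idea concrete, here is a rough sketch of the arithmetic in Python. It is not ASSP code and nothing here is an existing ASSP function: `suggest_collect_ratio`, `tolerance` and `max_boost` are made-up names, and it assumes that a norm above 1.0 means the corpus is spam-heavy and that "1:N" means collecting one spam for every N nonspam. How the norm is read from rebuildrun.txt, and how the result would map onto freqSpam/freqNonSpam, is left out on purpose.

```python
def suggest_collect_ratio(norm, spam_rate, tolerance=0.1, max_boost=2.0):
    """Suggest a (spam, nonspam) collecting ratio. Hypothetical helper, not part of ASSP.

    norm      -- corpus norm from the last rebuild (assumed: > 1.0 means spam-heavy)
    spam_rate -- fraction of incoming mail that is spam, e.g. 0.9
    """
    if not 0.0 < spam_rate < 1.0:
        raise ValueError("spam_rate must be strictly between 0 and 1")
    if norm <= 0.0:
        raise ValueError("norm must be positive")

    # Ratio that merely offsets the incoming mix: 90% spam is about 9 spam per nonspam.
    base = spam_rate / (1.0 - spam_rate)

    if abs(norm - 1.0) <= tolerance:
        # Corpus is close to balanced, so level out and just hold it steady.
        boost = 1.0
    else:
        # Push further in the needed direction, but clamp the push so a single
        # odd rebuild cannot swing the setting wildly (max_boost is an assumption).
        boost = min(max(norm, 1.0 / max_boost), max_boost)

    n = base * boost
    if n >= 1.0:
        return (1, round(n))      # collect 1 spam per n nonspam
    return (round(1.0 / n), 1)    # corpus is ham-heavy: favour spam instead


print(suggest_collect_ratio(norm=3.5, spam_rate=0.9))
print(suggest_collect_ratio(norm=1.0, spam_rate=0.9))
```

With David's numbers (about 90% spam, norm 3.5) this suggests pushing past the break-even ratio, then falling back to break-even once the norm is within `tolerance` of 1.0. The clamp is only one way to damp the correction; a smaller step per rebuild would work too.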
