Hello Nix,

Friday, January 9, 2004, 12:28:39 PM, you wrote:

N> On Thu, 8 Jan 2004, Robert Menschel uttered the following:
>> Yes, there are three reasons you might not want to use bigevil.
>>
>> 1) You like getting spam.
>>
>> 2) You run SA with a threshold level very different from the default 5.0
>> score, and don't have the time or ability to adjust the bigevil scores
>> accordingly.
>>
>> 3) You are an end-user whose only control is through the user_prefs file,
>> and therefore you cannot add additional rules to your SA processing.

N> 4) you prefer to have such a large collection of rule/score combinations
N> GAed before use, and consider a system that relies on some poor sod
N> manually maintaining a huge list of regexes, with (as far as I can tell)
N> decidedly ad-hoc hit-frequencies checking, to be a step backwards.

Grin. Actually, though I greatly support Chris' activity, I don't use
BigEvil myself, at least not yet. Mostly that's because I get so little
unflagged spam these days that BigEvil won't buy *my* systems significant
benefit, vs the cost of time involved in applying the file, changing the
rule scores, testing them for FPs. (Well, since I test BigEvil for FPs
anyway, I guess that last one doesn't count. :-)

However, your comments about the GA interest me.

As things have progressed, I'm beginning to see three classes of rules:

* Evil rules, which match ONLY spam, and which can't ever match ham.
  BigEvil is like this, at least theoretically, where the URI rules point
  only at web sites run by spammers and referenced by spammers.

* Additive rules, none of which are significant by themselves, but which
  flag spam in conjunction with many others. Popcorn is an example.

* All the other rules.

Evil rules, if/when guaranteed, can be scored at or above your spam
threshold.
An example from my personal files (where my spam threshold is 9):

uri       RM_u_530000x  /530000x\.net/i
describe  RM_u_530000x  body contains link to known spammer
score     RM_u_530000x  9.000  # 582s/0h of 81383 corpus

Additive rules should be analyzed for how many hits should flag spam. If
popcorn spam hits 7 popcorn rules, and no ham hits 6 popcorn rules, and
your spam threshold is 5, then a good score for each rule in the popcorn
family is 5/7. Or if you want to be conservative, 4.5/7, requiring
something else to hit to complete the spam flag.

All other rules should be subject to GA.

The above is my stance when being aggressively anti-spam.

However, the other half of the time, I'm conservatively anti-spam, and I
recognize that putting ALL the rules through a GA doesn't hurt the effort.
It may weaken some rules, but it strengthens the overall effort.

We can't all run a GA. I haven't even figured out how to do it yet. I've
gotten quite good at running mass-checks on rules and rule sets, and run
some mass-check on something almost daily. I also have a variety of
algorithms by which I determine what scores to use for which rules. But
those algorithms are based on flat corpus statistics, and not on any
evolutionary exploration of the scores themselves.

Lacking a simple way to run a GA, I find intelligent, if flat, one-pass
algorithms to be very useful. And the simplest of those does apply to
BigEvil -- if it's BigEvil, it's spam.

N> (No offence, Chris. You're doing a hell of a job, but it seems like
N> you're engaged in a Red Queen's race to me :) )

We all are. I still get an average of a dozen FNs a week. Based on that,
I'm not making any progress. I just can't seem to get past the 99.8%
accuracy point.

But compared to where I was back in May, 99.8% is heaven.

Bob Menschel
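
For anyone who wants to check the popcorn arithmetic above, here is a
minimal sketch in Python. It is not part of SpamAssassin or of any rule
set: the function name family_score is hypothetical, and reading "no ham
hits 6 popcorn rules" as "ham hits at most 5" is an assumption. It only
spreads the threshold evenly over the number of rules that spam is
expected to hit.

```python
def family_score(threshold: float, spam_hits: int) -> float:
    """Return the per-rule score at which `spam_hits` hits sum to `threshold`."""
    if spam_hits <= 0:
        raise ValueError("spam_hits must be a positive integer")
    return threshold / spam_hits


if __name__ == "__main__":
    spam_threshold = 5.0   # threshold used in the popcorn example
    hits_in_spam = 7       # popcorn spam hits 7 popcorn rules
    max_hits_in_ham = 5    # assumption: "no ham hits 6" means at most 5

    aggressive = family_score(spam_threshold, hits_in_spam)  # 5/7 per rule
    conservative = family_score(4.5, hits_in_spam)           # 4.5/7 per rule

    for label, per_rule in (("aggressive", aggressive),
                            ("conservative", conservative)):
        print(f"{label}: per-rule {per_rule:.3f}, "
              f"spam total {hits_in_spam * per_rule:.3f}, "
              f"worst-case ham total {max_hits_in_ham * per_rule:.3f}")
```

With the conservative score, seven popcorn hits alone stay just under the
threshold, so some other rule has to hit before the message is flagged.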
_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk