Hello Nix,

Friday, January 9, 2004, 12:28:39 PM, you wrote:

N> On Thu, 8 Jan 2004, Robert Menschel uttered the following:
>> Yes, there are three reasons you might not want to use bigevil.
>> 
>> 1) You like getting spam.
>> 
>> 2) You run SA with a threshold level very different from the default 5.0
>> score, and don't have the time or ability to adjust the bigevil scores
>> accordingly.
>> 
>> 3) You are an end-user whose only control is through the user_prefs file,
>> and therefore you cannot add additional rules to your SA processing.

N> 4) you prefer to have such a large collection of rule/score combinations
N> GAed before use, and consider a system that relies on some poor sod
N> manually maintaining a huge list of regexes, with (as far as I can tell)
N> decidedly ad-hoc hit-frequencies checking, to be a step backwards.

Grin.  Actually, though I greatly support Chris' activity, I don't use
BigEvil myself, at least not yet. Mostly that's because I get so little
unflagged spam these days that BigEvil won't buy *my* systems significant
benefit compared with the time it would cost to apply the file, change the
rule scores, and test them for FPs.

(Well, since I test BigEvil for FPs anyway, I guess that last one doesn't
count.  :-)

However, your comments about the GA interest me.  As things have
progressed, I'm beginning to see three classes of rules:
* Evil rules, which match ONLY spam, and which can't ever match ham.
  BigEvil is like this, at least theoretically, where the URI rules
  point only at web sites run by spammers and referenced by spammers.
* Additive rules, none of which are significant by themselves, but which
  flag spam in conjunction with many others. Popcorn is an example.
* All the other rules.

Evil rules, if/when guaranteed, can be scored at or above your spam
threshold. An example from my personal files (where my spam threshold is
9):
uri       RM_u_530000x           /530000x\.net/i
describe  RM_u_530000x           body contains link to known spammer
score     RM_u_530000x           9.000  # 582s/0h of 81383 corpus

Additive rules should be analyzed for how many hits should flag spam. If
popcorn spam hits 7 popcorn rules, no ham hits as many as 6 popcorn rules,
and your spam threshold is 5, then a good score for each rule in the
popcorn family is 5/7. Or if you want to be conservative, 4.5/7, requiring
something else to hit to complete the spam flag.
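
To make the arithmetic above concrete, here is a minimal Python sketch. The
function name and the "margin" parameter are my own illustration, not
anything from SA, and the worst-case ham figure of 5 hits is an assumption
based on reading "no ham hits 6" as "ham hits at most 5".

```python
def additive_rule_score(threshold, spam_hits, margin=0.0):
    """Per-rule score such that spam_hits rules together total
    (threshold - margin). A positive margin means the family alone
    cannot reach the threshold and something else must also hit."""
    if spam_hits <= 0:
        raise ValueError("spam_hits must be positive")
    return (threshold - margin) / spam_hits


THRESHOLD = 5.0   # spam threshold from the example
SPAM_HITS = 7     # popcorn rules hit by typical popcorn spam
MAX_HAM_HITS = 5  # assumption: worst-case ham hits fewer than 6 rules

aggressive = additive_rule_score(THRESHOLD, SPAM_HITS)             # 5/7
conservative = additive_rule_score(THRESHOLD, SPAM_HITS, 0.5)      # 4.5/7

print(f"aggressive:   {aggressive:.3f} per rule")    # 0.714
print(f"conservative: {conservative:.3f} per rule")  # 0.643

# Worst-case ham stays below the threshold on popcorn rules alone.
print(MAX_HAM_HITS * aggressive < THRESHOLD)         # True
```

With the conservative score, seven popcorn hits total 4.5, so at least one
other rule worth 0.5 or more has to fire before the message is flagged.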

All other rules should be subject to GA.

The above is my stance when being aggressively anti-spam.

However, the other half of the time, I'm conservatively anti-spam, and I
recognize that putting ALL the rules through a GA doesn't hurt the
effort. It may weaken some rules, but it strengthens the overall effort.

We can't all run a GA. I haven't even figured out how to do it yet. I've
gotten quite good at running mass-checks on rules and rule sets, and run
some mass-check on something almost daily. I also have a variety of
algorithms by which I determine what scores to use for which rules. But
those algorithms are based on flat corpus statistics, and not on any
evolutionary exploration of the scores themselves.

Lacking a simple way to run a GA, I find intelligent, if flat, one-pass
algorithms to be very useful. And the simplest of those does apply to
BigEvil -- if it's BigEvil, it's spam.
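
As a rough illustration of that simplest one-pass rule, the Python sketch
below reads a hit annotation in the "582s/0h of 81383 corpus" style used in
the score line earlier and assigns the full threshold only when a rule has
no ham hits. The function names and the min_spam_hits cutoff are
hypothetical, and this is not necessarily how anyone's real scoring scripts
work.

```python
import re

# Matches annotations such as "582s/0h of 81383 corpus".
HITS_RE = re.compile(r"(\d+)s/(\d+)h of (\d+) corpus")


def parse_hits(annotation):
    """Return (spam_hits, ham_hits, corpus_size) from a score comment."""
    match = HITS_RE.search(annotation)
    if match is None:
        raise ValueError(f"no hit counts found in: {annotation!r}")
    return tuple(int(group) for group in match.groups())


def evil_rule_score(spam_hits, ham_hits, threshold, min_spam_hits=1):
    """Flat one-pass rule: score at the threshold if the rule never hit
    ham and hit enough spam; otherwise return None so the rule can be
    scored some other way."""
    if ham_hits == 0 and spam_hits >= min_spam_hits:
        return threshold
    return None


spam, ham, total = parse_hits("9.000  # 582s/0h of 81383 corpus")
score = evil_rule_score(spam, ham, threshold=9.0)
print(spam, ham, total)  # 582 0 81383
print(score)             # 9.0
```

A rule with even one ham hit falls through (returns None here), which is
where the additive analysis or a GA would take over.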

N> (No offence, Chris. You're doing a hell of a job, but it seems like
N> you're engaged in a Red Queen's race to me :) )

We all are. I still get an average of a dozen FNs a week. Based on that,
I'm not making any progress. I just can't seem to get past the 99.8%
accuracy point. But compared to where I was back in May, 99.8% is heaven.

Bob Menschel





_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
