Experiment: I removed the current sliding/shrinking window code and
replaced it with this simple bit:

  my ($lo, $hi);
  if ($is_nice{$test}) {
    $hi = 0;
    $lo = $ranking{$test} * -4.5;
  }
  else {
    $lo = 0;
    $hi = $ranking{$test} * 4.5;
  }

This relies on the new RANKING code (which gives a reasonably good
distribution of RANKS from low to high).  I then took last night's
corpus submission results and ran a 10-fold cross-validation (10fcv):

BEFORE:

# TCR: 38.978094  SpamRecall: 98.376%  SpamPrec: 99.809%  FP: 0.16%  FN: 1.40%
# TCR: 39.409745  SpamRecall: 98.699%  SpamPrec: 99.750%  FP: 0.21%  FN: 1.12%
# TCR: 46.518954  SpamRecall: 98.693%  SpamPrec: 99.829%  FP: 0.15%  FN: 1.12%
# TCR: 43.135758  SpamRecall: 98.651%  SpamPrec: 99.804%  FP: 0.17%  FN: 1.16%
# TCR: 40.119504  SpamRecall: 98.491%  SpamPrec: 99.801%  FP: 0.17%  FN: 1.30%
# TCR: 41.669789  SpamRecall: 98.485%  SpamPrec: 99.821%  FP: 0.15%  FN: 1.30%
# TCR: 43.030230  SpamRecall: 98.491%  SpamPrec: 99.835%  FP: 0.14%  FN: 1.30%
# TCR: 43.879162  SpamRecall: 98.494%  SpamPrec: 99.843%  FP: 0.13%  FN: 1.30%
# TCR: 38.722524  SpamRecall: 98.556%  SpamPrec: 99.770%  FP: 0.20%  FN: 1.24%
# TCR: 42.063830  SpamRecall: 98.676%  SpamPrec: 99.787%  FP: 0.18%  FN: 1.14%

average TCR -> 41.752759

AFTER:

# TCR: 67.784762  SpamRecall: 99.213%  SpamPrec: 99.861%  FP: 0.12%  FN: 0.68%
# TCR: 76.040598  SpamRecall: 99.149%  SpamPrec: 99.907%  FP: 0.08%  FN: 0.73%
# TCR: 87.009780  SpamRecall: 99.174%  SpamPrec: 99.935%  FP: 0.06%  FN: 0.71%
# TCR: 85.340528  SpamRecall: 99.292%  SpamPrec: 99.907%  FP: 0.08%  FN: 0.61%
# TCR: 78.383260  SpamRecall: 99.118%  SpamPrec: 99.921%  FP: 0.07%  FN: 0.76%
# TCR: 76.038462  SpamRecall: 99.050%  SpamPrec: 99.926%  FP: 0.06%  FN: 0.82%
# TCR: 84.527316  SpamRecall: 99.056%  SpamPrec: 99.952%  FP: 0.04%  FN: 0.81%
# TCR: 77.193059  SpamRecall: 99.098%  SpamPrec: 99.921%  FP: 0.07%  FN: 0.78%
# TCR: 78.039474  SpamRecall: 99.070%  SpamPrec: 99.929%  FP: 0.06%  FN: 0.80%
# TCR: 79.080000  SpamRecall: 99.087%  SpamPrec: 99.929%  FP: 0.06%  FN: 0.79%

average TCR -> 78.9437239
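
For anyone who wants to re-check the averages, here is a minimal sketch
of the arithmetic.  It assumes the per-fold lines are available as plain
text in the format shown above; the helper name average_tcr is made up
for this example and is not part of the mass-check tools.

```perl
use strict;
use warnings;

# Per-fold result lines copied verbatim from the AFTER run above.
my $results = <<'END';
# TCR: 67.784762  SpamRecall: 99.213%  SpamPrec: 99.861%  FP: 0.12%  FN: 0.68%
# TCR: 76.040598  SpamRecall: 99.149%  SpamPrec: 99.907%  FP: 0.08%  FN: 0.73%
# TCR: 87.009780  SpamRecall: 99.174%  SpamPrec: 99.935%  FP: 0.06%  FN: 0.71%
# TCR: 85.340528  SpamRecall: 99.292%  SpamPrec: 99.907%  FP: 0.08%  FN: 0.61%
# TCR: 78.383260  SpamRecall: 99.118%  SpamPrec: 99.921%  FP: 0.07%  FN: 0.76%
# TCR: 76.038462  SpamRecall: 99.050%  SpamPrec: 99.926%  FP: 0.06%  FN: 0.82%
# TCR: 84.527316  SpamRecall: 99.056%  SpamPrec: 99.952%  FP: 0.04%  FN: 0.81%
# TCR: 77.193059  SpamRecall: 99.098%  SpamPrec: 99.921%  FP: 0.07%  FN: 0.78%
# TCR: 78.039474  SpamRecall: 99.070%  SpamPrec: 99.929%  FP: 0.06%  FN: 0.80%
# TCR: 79.080000  SpamRecall: 99.087%  SpamPrec: 99.929%  FP: 0.06%  FN: 0.79%
END

# Hypothetical helper: mean of every "# TCR: <number>" value in the text.
sub average_tcr {
    my ($text) = @_;
    my @tcr = $text =~ /^# TCR: ([\d.]+)/mg;
    die "no TCR lines found\n" unless @tcr;
    my $sum = 0;
    $sum += $_ for @tcr;
    return $sum / @tcr;
}

printf "average TCR -> %.7f\n", average_tcr($results);
```

Pasting the BEFORE lines into the here-doc instead gives the other
average the same way.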

Now, bearing in mind that we might not want to use RANK, since
inevitably some of those low-ranking rules will get removed and things
would get shifted around, this does suggest we should think about
something a bit more straightforward based on RANK or maybe S/O.
Whatever the current windowing system does, I think it is limiting the
scores a bit too much.  But it's not quite that simple...

I suspected a lot of the benefit came merely from lowering the minimum
score to always be 0 (whereas the current ranging code sometimes forces
a rule to be a specific non-zero number like 2.800 or something, which
is absurd), giving the perceptron a lot more freedom.
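
To make the "range always touches 0" point concrete, here is the
replacement snippet from the top of this message as a standalone
sketch.  The rule names and rank values are invented for illustration,
and it assumes ranks fall between 0 and 1, which nothing above
confirms; the wrapper sub score_range is also hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample data: names and rank values are made up.
my %ranking = (RULE_SPAMMY => 0.8, RULE_WEAK => 0.1, RULE_HAMMY => 0.6);
my %is_nice = (RULE_HAMMY => 1);

# Same logic as the snippet at the top, wrapped in a sub.
# Nice rules get a range of [rank * -4.5, 0]; all others get
# [0, rank * 4.5], so 0 is always one end of the range.
sub score_range {
    my ($test) = @_;
    my ($lo, $hi);
    if ($is_nice{$test}) {
        $hi = 0;
        $lo = $ranking{$test} * -4.5;
    }
    else {
        $lo = 0;
        $hi = $ranking{$test} * 4.5;
    }
    return ($lo, $hi);
}

for my $test (sort keys %ranking) {
    my ($lo, $hi) = score_range($test);
    printf "%-12s lo=%6.3f hi=%6.3f\n", $test, $lo, $hi;
}
```

With these made-up ranks no rule gets pinned to a single non-zero
value; the only thing the rank controls is how far the range extends
away from 0.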

This is where it gets interesting (Henry, thanks for the pointer in the
perceptron code)... the score ranges (mine or the original) aren't even
being used by the perceptron since that code got commented out somewhere
along the way.  The *only* effect of my change was that the scores.h
file went from about 431 non-mutable rules to 107.  The drop is because
the new ranging code I wrote doesn't do the crazy "thou shalt have a
score of 2.800" thing.
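
A rough sketch of why fewer pinned ranges means fewer non-mutable
rules.  This assumes a rule counts as non-mutable when its range has
zero width (lo equals hi); that is my reading of the behaviour
described here, not something checked against the code that generates
scores.h.  The rule names, ranges and the helper count_non_mutable are
all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical ranges as [lo, hi]; names and numbers are made up.
my %range = (
    RULE_A => [0,    3.6],   # free to move between 0 and 3.6
    RULE_B => [2.8,  2.8],   # pinned to a single non-zero value
    RULE_C => [-2.7, 0],     # nice rule, free to move below 0
    RULE_D => [0,    0],     # pinned at zero
);

# Assumption: zero-width range means the rule is non-mutable.
sub count_non_mutable {
    my ($ranges) = @_;
    return scalar grep { $_->[0] == $_->[1] } values %$ranges;
}

printf "%d of %d rules are non-mutable\n",
    count_non_mutable(\%range), scalar keys %range;
```

Under that assumption, any ranging scheme that stops collapsing ranges
to a single value shrinks the non-mutable count even if nothing else
reads the ranges.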

The improvement and the fix were not exactly accidental, but the fact
that replacing ugly, complicated code with clean, simple code fixed a
bug (a perfectly valid way to fix bugs) shows that we need to trim a
lot of fat.

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting
