hi Duncan -- that *is* good news ;) can you give a rough idea of what algorithm it uses?
I'm keen to see results once the "rules" are taken into account, btw,
as it's quite easy for machine-learning systems to overfit against our
training data in my experience otherwise, and/or to produce exploitable
"holes" by offering negative scores for easily-forged rules.

still, very cool!

--j.

Duncan Findlay writes:
> Good news, everyone!
>
> As part of our 4th year Math & Engineering Design Project, Steven Birk
> and I have been working to develop a better scoring algorithm for
> SpamAssassin.
>
> We've come across an algorithm that shows some great promise:
>
> Using the 3.2.0 logs:
>
> scoreset 0:
>
> # SUMMARY for threshold 5.0:
> # Correctly non-spam:  67528  99.97%
> # Correctly spam:     100519  84.41%
> # False positives:        22   0.03%
> # False negatives:     18564  15.59%
> # TCR(l=50): 6.055889  SpamRecall: 84.411%  SpamPrec: 99.978%
>
> # SUMMARY for threshold 3.5:
> # Correctly non-spam:  67446  99.85%
> # Correctly spam:     108479  91.10%
> # False positives:       104   0.15%
> # False negatives:     10604   8.90%
> # TCR(l=50): 7.534991  SpamRecall: 91.095%  SpamPrec: 99.904%
>
> scoreset 1:
>
> # SUMMARY for threshold 5.0:
> # Correctly non-spam:  67498  99.92%
> # Correctly spam:     112670  94.61%
> # False positives:        52   0.08%
> # False negatives:      6413   5.39%
> # TCR(l=50): 13.212360  SpamRecall: 94.615%  SpamPrec: 99.954%
>
> scoreset 2:
>
> # SUMMARY for threshold 5.0:
> # Correctly non-spam:  67517  99.95%
> # Correctly spam:     115916  97.34%
> # False positives:        33   0.05%
> # False negatives:      3167   2.66%
> # TCR(l=50): 24.721403  SpamRecall: 97.341%  SpamPrec: 99.972%
>
> scoreset 3:
>
> # SUMMARY for threshold 5.0:
> # Correctly non-spam:  67518  99.95%
> # Correctly spam:     117809  98.93%
> # False positives:        32   0.05%
> # False negatives:      1274   1.07%
> # TCR(l=50): 41.434586  SpamRecall: 98.930%  SpamPrec: 99.973%
>
> # SUMMARY for threshold 5.2:
> # Correctly non-spam:  67521  99.96%
> # Correctly spam:     117727  98.86%
> # False positives:        29   0.04%
> # False negatives:      1356   1.14%
> # TCR(l=50): 42.438703  SpamRecall: 98.861%  SpamPrec: 99.975%
>
> These are using the same training and validation sets as bug 5270. The
> run time is roughly of the same order of magnitude as the
> perceptron. (The slow bit is the analog of the logs-to-c script.)
>
> Clearly from the set 0 results, we need to tune the algorithm some
> more to get the threshold of 5.0 to be optimal.
>
> At this point, the algorithm breaks a number of our current score
> generation "rules", so there is room for improvement. (We're working
> on it).
>
> - Our handling of immutable rules is pretty much broken at this
>   point. (We assume all rules are mutable, evaluate the optimal
>   threshold value and scale our scores appropriately, and then only
>   update the mutable scores for evaluating against the validation
>   set. For our purposes, we also assumed BAYES_* is mutable.) I'm not
>   sure how hard this will be to fix, or if it's worth it.
>
> - We have no concept of max/min scores or score ranges. Many tests
>   get small negative scores and should simply be set to 0. We haven't
>   yet figured out what effect this has on the TCR. Also, some scores get
>   set really high -- i.e. BAYES_99 is scored 6.1 in scoreset 3. I'm not
>   sure people are comfortable with that. There's at least 2 ways we can
>   fix this -- adapting the algorithm to take into account min/max scores
>   (hard), simply capping the scores after they are generated (easy). A
>   quick look through the scores and score-ranges-from-freqs output
>   suggests that this will not hurt our performance all that much.
>
> Our project is due in a few weeks, and with any luck we'll have a
> complete new score generation system for SpamAssassin.
>
> --
> Duncan Findlay
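
For anyone comparing these numbers against other runs: the TCR, SpamRecall and SpamPrec figures in the quoted summaries can be reproduced from the four raw counts. The sketch below assumes the usual total cost ratio definition (number of spam messages divided by lambda times false positives plus false negatives), which does match the quoted figures; the function name `summarize` is made up for illustration and is not part of SpamAssassin.

```python
def summarize(tp, fp, fn, lam=50):
    """Return (tcr, recall, precision) from classification counts.

    tp  = spam correctly flagged ("Correctly spam")
    fp  = ham wrongly flagged    ("False positives")
    fn  = spam that got through  ("False negatives")
    lam = cost weight of one false positive relative to one false
          negative (the "l=50" in the summaries)
    """
    n_spam = tp + fn                     # total spam in the test set
    tcr = n_spam / (lam * fp + fn)       # assumed TCR definition
    recall = tp / n_spam                 # SpamRecall
    precision = tp / (tp + fp)           # SpamPrec
    return tcr, recall, precision


# Counts taken from the quoted scoreset 0, threshold 5.0 summary.
tcr, recall, precision = summarize(tp=100519, fp=22, fn=18564)
print(f"TCR(l=50): {tcr:.6f}  SpamRecall: {recall:.3%}  SpamPrec: {precision:.3%}")
```

Because lambda multiplies the false-positive count, a handful of extra FPs moves the TCR much more than the same number of extra FNs, which is worth keeping in mind when comparing the 5.0 and 3.5 thresholds above.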
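
The "easy" fix mentioned in the quoted message, capping scores after they are generated, could look roughly like the sketch below. Everything here other than the BAYES_99 score of 6.1 is hypothetical: the `cap_scores` helper, the second rule name, and the ranges are invented for illustration, and real ranges would have to come from the score-ranges-from-freqs output.

```python
def cap_scores(scores, ranges, default=(0.0, 5.0)):
    """Clamp each generated score into its allowed (low, high) range.

    scores  -- dict of rule name -> generated score
    ranges  -- dict of rule name -> (low, high); hypothetical values
    default -- range used for rules with no entry (also hypothetical)
    """
    capped = {}
    for rule, score in scores.items():
        low, high = ranges.get(rule, default)
        capped[rule] = min(max(score, low), high)
    return capped


generated = {
    "BAYES_99": 6.1,              # score quoted for scoreset 3
    "HYPOTHETICAL_RULE": -0.03,   # stand-in for a small negative score
}
allowed = {"BAYES_99": (0.0, 4.5)}  # made-up range, for illustration only

print(cap_scores(generated, allowed))
```

With a lower bound of 0, small negative scores are set to 0 as the message suggests. The capped scores would then need to be re-evaluated against the validation set to see what the cap does to the TCR.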
