Hi Duncan,
Thanks for your comments. I agree with the hindsight value of reuse,
and will try some post-processing of mass-check logs when submitted.
On Wed, 6 Jul 2005, Duncan Findlay wrote:
On Wed, Jul 06, 2005 at 07:48:36AM -0700, Peter Fritz wrote:
Some observations about mass-checks. Not sure if the instructions for
CorpusCleaning (on the wiki and previously in CORPUS_SUBMIT) are as
applicable to mass-check runs with --reuse as to full runs. My
understanding with --reuse is that if a network rule previously hit on a
message, it will be listed in the rules hit (spam.log) during
mass-checks but won't contribute to the recorded score of the message.
Hence, messages may have hit many network tests, but appear in the
spam.log with a low score, and therefore float to the top when reviewing
low-scoring spam, even though the original spam got a high score because
of network tests. Makes it hard to find false positives in the noise.
One solution to this would be to have the reuse flag record a more
accurate score in ham/spam.log for network tests, rather than zeroing
them out, but I don't have a robust way of doing this yet.
http://wiki.apache.org/spamassassin/CorpusCleaning
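
A rough sketch of the post-processing idea above: re-total each log line from the rules it lists, so that reused network hits count towards the score again. The line layout (flag, score, path, comma-separated rules, then optional metadata) is an assumption about the mass-check log, as are the helper names; EXAMPLE_NET_RULE and its score are made up for illustration, and paths containing spaces are not handled.

```python
from typing import Dict, List, NamedTuple


class LogEntry(NamedTuple):
    flag: str          # first column, kept verbatim
    score: float       # score as recorded in the log
    path: str          # message path or identifier
    rules: List[str]   # names of the rules listed as hit


def parse_line(line: str) -> LogEntry:
    """Parse one log line under the ASSUMED whitespace-separated layout."""
    parts = line.split()
    if len(parts) < 3:
        raise ValueError("unrecognised log line: %r" % line)
    # Fourth column is taken to be the rule list; anything after it is ignored.
    rules = [r for r in parts[3].split(",") if r] if len(parts) > 3 else []
    return LogEntry(parts[0], float(parts[1]), parts[2], rules)


def rescore(entry: LogEntry, scores: Dict[str, float]) -> float:
    """Sum the score of every listed rule; unknown rules count as 0."""
    return sum(scores.get(rule, 0.0) for rule in entry.rules)


if __name__ == "__main__":
    # BAYES_99 value is from the run below; the network rule is hypothetical.
    scores = {"BAYES_99": 1.886, "EXAMPLE_NET_RULE": 3.0}
    line = ". 1 /corpus/spam/0001 BAYES_99,EXAMPLE_NET_RULE time=1120660000"
    entry = parse_line(line)
    print(entry.path, entry.score, round(rescore(entry, scores), 3))
```

Sorting spam.log by the re-totalled value rather than the recorded one should push messages that only look low-scoring because of --reuse back down the review list.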
--reuse is a dirty hack, as much as Dan might claim otherwise. :-)
That actually isn't a problem I had thought of (more obvious ones come
to mind).
Guess these and other problems will be resolved as the technique matures.
Found some clarifying notes in bug 4136 about the value of mass-check
scores too. http://bugzilla.spamassassin.org/show_bug.cgi?id=4136
Finally, some observations from some limited mass-checks locally.
Make sure you're generating scoreset 3 results.
Bingo! I had overlooked symlinking config to config.set3.
Couldn't find a reference to this step in the masses directory (grep
"config\.set" ./masses), other than the "include config" in Makefile.
Results are better; I feel BAYES_99 is a bit low, but it's a good start. Time
to double check for FN/FPs in my corpus.
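
For anyone else who trips over the missing symlink step, here is a minimal sketch of it. It assumes config.setN files sit in the masses directory next to config; the function name and the safety checks are mine, not part of the masses tooling.

```python
import os


def point_config_at_scoreset(masses_dir: str, scoreset: int = 3) -> str:
    """Make <masses_dir>/config a symlink to config.set<scoreset>."""
    target = "config.set%d" % scoreset
    link = os.path.join(masses_dir, "config")
    if not os.path.exists(os.path.join(masses_dir, target)):
        raise FileNotFoundError("%s not found in %s" % (target, masses_dir))
    if os.path.islink(link):
        os.remove(link)                      # replace a stale symlink
    elif os.path.exists(link):
        # A regular file called config is left alone rather than clobbered.
        raise FileExistsError("%s exists and is not a symlink" % link)
    os.symlink(target, link)                 # relative link inside masses_dir
    return os.readlink(link)


if __name__ == "__main__":
    print(point_config_at_scoreset("masses"))
```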
# SUMMARY for threshold 5.0:
# Correctly non-spam:    314  92.90%
# Correctly spam:      14688  97.56%
# False positives:        24   7.10%
# False negatives:       368   2.44%
# Average score for spam: 21.294 ham: 1.3
# Average for false-pos: 7.230 false-neg: 3.0
# TOTAL:               15394 100.00%
score BAYES_00 -2.599 # not mutable
score BAYES_05 -0.413 # not mutable
score BAYES_40 -1.096 # not mutable
score BAYES_50 0.001 # not mutable
score BAYES_60 0.372 # not mutable
score BAYES_80 2.087 # not mutable
score BAYES_95 2.063 # not mutable
score BAYES_99 1.886 # not mutable
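
A quick sanity check on the summary figures above. The percentages are relative to each class (ham or spam), not to the overall total; the function and key names below are just for illustration.

```python
def summary_percentages(correct_ham, correct_spam, false_pos, false_neg):
    """Recompute the per-class percentages from the four raw counts."""
    total_ham = correct_ham + false_pos      # every ham message
    total_spam = correct_spam + false_neg    # every spam message
    return {
        "correct_ham_pct": round(100.0 * correct_ham / total_ham, 2),
        "correct_spam_pct": round(100.0 * correct_spam / total_spam, 2),
        "false_pos_pct": round(100.0 * false_pos / total_ham, 2),
        "false_neg_pct": round(100.0 * false_neg / total_spam, 2),
        "total": total_ham + total_spam,
    }


if __name__ == "__main__":
    # Counts taken from the SUMMARY block above.
    print(summary_percentages(314, 14688, 24, 368))
```

So the 7.10% false-positive rate is 24 out of only 338 ham messages, which is a small ham sample next to 15056 spam.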
Cheers,
PF