http://bugzilla.spamassassin.org/show_bug.cgi?id=2129
------- Additional Comments From [EMAIL PROTECTED]  2004-03-13 17:16 -------
Subject: Re: Bayes tweaks to test

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

OK, here's the results.

First pass:

base: current SVN

bug3118: with Henry's fix for bug 3118.  In order to test this, I used an
unbalanced corpus of 39987 ham and 23337 spam.

decomp: using "decomposing" tokens: namely if the token "Foo!" appears,
decompose that into "Foo!" "Foo" "foo!" and "foo".  In other words, make dup
tokens with nonalphanumerics and case stripped.

dhm1: "dual header map" variant 1: Dan's first suggestion above; mapping
"In-Reply-To" and "Message-Id" tokens into a shared token, so that a ref to
a previously-learned Message-Id in the IRT header will be a hit.

dhm2: similar for From, To and CC headers

dhm3: similar for X-Mailer and User-Agent headers

Then I threw in a couple of retests.  Some of our old tokenizer tweaks may be
smelling a little off by this stage, so they need a test.

ignmid: ignore Message-Id headers -- just testing this out, as it's a large
source of hapaxes.

Results:

base:     0.30/0.70  fp 3  fn 360  uh 193  us 3952  c 804.50
bug3118:  0.30/0.70  fp 2  fn 336  uh 207  us 4080  c 784.70
decomp:   0.30/0.70  fp 1  fn 324  uh 187  us 3981  c 750.80
dhm1:     0.30/0.70  fp 3  fn 344  uh 220  us 3867  c 782.70
dhm2:     0.30/0.70  fp 3  fn 343  uh 224  us 3709  c 766.30
dhm3:     0.30/0.70  fp 4  fn 342  uh 206  us 3886  c 791.20
ignmid:   0.30/0.70  fp 1  fn 383  uh 184  us 4020  c 813.40

(Don't forget -- compare all of these with "base", not with each other.
They're all complementary so far.)

Clearly decomp is a *big* win, by far!  "ignmid" is not so hot, as there's a
lot of missed spam as a result.  "bug3118" looks good overall.  dhm1 and dhm2
seem good, dhm3 borderline due to the new FP.

Test set 2:

try1: bug3118 + decomp + dhm1 + dhm2 -- ie best of previous run

try2: bug3118 + decomp + dhm1 + dhm2 + dhm3 -- giving dhm3 a second chance.

hdrs_no_num: try1, with an extra tweak; NO_NUMERIC_IN_HEADERS is turned on.
I suspect the decomposed numeric tokens (ie. "8139" -> "N:NNNN") added to
catch patterns, are no longer working well.

no_num: same as hdrs_no_num, but also with no numeric tokens in the message
body either.

Results:

hdrs_no_num:  0.30/0.70  fp 1  fn 266  uh 269  us 3804  c 683.30
no_num:       0.30/0.70  fp 1  fn 268  uh 260  us 3854  c 689.40
try1:         0.30/0.70  fp 2  fn 283  uh 238  us 3785  c 705.30
try2:         0.30/0.70  fp 2  fn 277  uh 251  us 3745  c 696.60

This time, try2 is looking good -- quite a bit better than try1.  Also,
clearly, dropping numeric tokens is now a good idea; both variants of that
are a clear improvement.

Test set 3:

combined: try2 + no_num.

combined:     0.30/0.70  fp 2  fn 260  uh 267  us 3826  c 689.30

So that's what's gone in as r9447.

I tried Dan's suggestion of looking up the dual-header-map tokens instead of
making dupe copies of them -- unfortunately it didn't work, getting bad
numbers, so I dropped that.  Combining them into 1 duplicate header gets
better accuracy for some reason.

I also updated the Bayes 10fold cross-validation scripts to work again with
current SVN, and wrote quite a bit more doco on how to run them.  Note that
"sa-learn --dump --dbpath" is required for these to work, so anyone who
removes that will have to fix them ;)

Next: I'll see if I can figure out a good invisible-text tweak.  I may have
to add a new rendering API for that, specifically for Bayes.

--j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAU7JqQTcbUG5Y7woRArjoAJwN2B3HltmR1VS1XIEQUtg34+CmNwCg4eXu
q0wcV3wSfPy2VRep1BklaZQ=
=/lK1
-----END PGP SIGNATURE-----

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
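
For anyone following along, here is a rough sketch of the three tokenizer tweaks described in the comment: "decomp", the dual header map, and the numeric shape tokens. It is in Python purely for illustration; SpamAssassin itself is Perl and this is not its code. The function names, the shared-name table and the "H<name>:" token prefix are all made up for the example, and treating From/To/CC (and X-Mailer/User-Agent) as one shared name each is only one reading of "similar".

```python
import re
from typing import List, Optional


def decompose(token: str) -> List[str]:
    """Return the token plus its punctuation-stripped and lowercased variants.

    Mirrors the "Foo!" -> "Foo!", "Foo", "foo!", "foo" example; duplicates
    and empty strings are dropped.
    """
    stripped = re.sub(r"[^A-Za-z0-9]", "", token)
    variants = [token, stripped, token.lower(), stripped.lower()]
    result: List[str] = []
    for variant in variants:
        if variant and variant not in result:
            result.append(variant)
    return result


def numeric_pattern(token: str) -> Optional[str]:
    """Map an all-digit token to a shape token, e.g. "8139" -> "N:NNNN"."""
    if re.fullmatch(r"[0-9]+", token):
        return "N:" + "N" * len(token)
    return None


# Hypothetical shared names for the "dual header map" variants.
SHARED_HEADER = {
    "in-reply-to": "MsgIdRef", "message-id": "MsgIdRef",   # dhm1
    "from": "Addr", "to": "Addr", "cc": "Addr",            # dhm2 (assumed grouping)
    "x-mailer": "Mua", "user-agent": "Mua",                # dhm3
}


def header_tokens(header: str, value: str) -> List[str]:
    """Emit the per-header token plus a duplicate under the shared name."""
    name = header.lower()
    tokens = [f"H{name}:{value}"]
    shared = SHARED_HEADER.get(name)
    if shared is not None:
        # Duplicate copy, so a learned Message-Id can match a later In-Reply-To.
        tokens.append(f"{shared}:{value}")
    return tokens
```

With this sketch, a Message-Id learned from one mail and the same id seen later in an In-Reply-To header produce an identical shared token, which is the "hit" the dhm1 description is after. The duplicate-copy approach (rather than a lookup) matches what the comment says ended up working better.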

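A note on reading the result lines: the message does not say how the "c" column is computed, but every row posted above is consistent with c = 10*fp + fn + 0.1*(uh + us). That weighting is inferred from the numbers, not taken from the test scripts, and it assumes fp/fn are false positives/negatives and uh/us are unsure ham/spam. A small Python check, with the rows copied from the message:

```python
# (fp, fn, uh, us, c) for each run, copied from the results above.
ROWS = {
    "base":        (3, 360, 193, 3952, 804.50),
    "bug3118":     (2, 336, 207, 4080, 784.70),
    "decomp":      (1, 324, 187, 3981, 750.80),
    "dhm1":        (3, 344, 220, 3867, 782.70),
    "dhm2":        (3, 343, 224, 3709, 766.30),
    "dhm3":        (4, 342, 206, 3886, 791.20),
    "ignmid":      (1, 383, 184, 4020, 813.40),
    "hdrs_no_num": (1, 266, 269, 3804, 683.30),
    "no_num":      (1, 268, 260, 3854, 689.40),
    "try1":        (2, 283, 238, 3785, 705.30),
    "try2":        (2, 277, 251, 3745, 696.60),
    "combined":    (2, 260, 267, 3826, 689.30),
}


def cost(fp: int, fn: int, uh: int, us: int) -> float:
    """Inferred weighting: an FP costs 10, an FN costs 1, an unsure costs 0.1."""
    return 10 * fp + fn + 0.1 * (uh + us)


def check_all() -> dict:
    """Return, per run, whether the inferred cost matches the posted c value."""
    return {
        name: abs(cost(fp, fn, uh, us) - c) < 0.005
        for name, (fp, fn, uh, us, c) in ROWS.items()
    }


if __name__ == "__main__":
    for name, ok in check_all().items():
        print(f"{name:12s} {'matches' if ok else 'differs'}")
```

Read this way, a single false positive weighs as much as ten missed spams, which is why dhm3's extra FP makes it "borderline" even though its fn count is lower than base.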