http://bugzilla.spamassassin.org/show_bug.cgi?id=4505
------- Additional Comments From [EMAIL PROTECTED] 2005-07-28 10:12 -------
well, we disagree ;) I'd appreciate some comments from the rest of the committers on how they feel about this one.

Here's a chat log between myself and H talking about it....

(09:49:33) henry: so about fixing up logs
(09:50:19) henry: I'd rather that we didn't because: 1) You've only removed errors from 10% of the logs. 2) You haven't removed the errors that both you and SA has made.
(09:50:25) henry: have made
(09:51:00) jm: please respond via mail on this one, I suspect I'm not the only one who disagrees ;)
(09:51:18) henry: sure
(09:51:56) jm: imo we need to try and get the logs as clean as poss, even if we're missing 90% of the FPs/FNs
(09:52:19) henry: we're just gaming the numbers
(09:52:32) jm: even if the perceptron is able to deal with some noise, the logs are used for other things (STATISTICS.txt) that cannot deal with noise
(09:52:36) henry: the learning algorithm would be useless if it couldn't work around a few mistakes
(09:52:58) jm: we're not gaming it -- we're using it to build something nearer a "gold standard" in Cormack terms
(09:53:13) henry: and what I'm saying is that by correcting errors in only one direction, STATISTICS.txt will be worse off than it was before
(09:53:24) henry: Cormack uses multiple classifiers to make his "gold standard"
(09:56:27) jm: why are we correcting errors only in 1 dir?
(09:56:31) jm: don't get that
(09:56:54) henry: you're not correcting entries where both you and SA have erred
(09:57:22) henry: so they look like TPs and TNs, but in fact they are FNs and FPs
(09:57:52) jm: ok. but it's still *better* than the current logs
(09:58:03) henry: I disagree
(09:58:03) jm: in that there are *less* FPs and FNs overall
(09:58:17) jm: even if there are still *some* FPs and FNs
(09:58:19) henry: there are indeed less FPs and FNs overall
(09:58:44) henry: but since we know how many errors we've seen, we can make some predictions about what's gone on in the other direction
(09:59:49) jm: I disagree that that's useful ;)
(09:59:58) jm: unless you want to fix the STATISTICS generating scripts as well...
(10:01:30) henry: well, here's the thing
(10:01:37) henry: from first look
(10:01:47) henry: it seems that people have about the same amount misclassified in each direction
(10:01:49) henry: that have been found
(10:02:42) henry: so you could hypothesise that there are plenty that have gone the other way
(10:03:29) henry: and that they are about the same proportion
(10:03:34) henry: maybe
(10:03:36) henry: I don't know
(10:04:07) henry: all that I can say is that by fixing based solely on the suspected mistakes of the classifier, we're biasing the results to make things look better than they are
(10:04:45) henry: and really.. at the end of the day, the numbers reflect how good the sample set is

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
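
For anyone who wants to put numbers on the two positions in the chat, here is a rough sketch of the bookkeeping. It is not taken from the SpamAssassin tools: the function name `error_counts`, its parameters, and the counts passed to it are all made up for illustration. It assumes a two-class corpus (ham/spam), so a message that both the hand labeller and the classifier get wrong shows up in the logs as an agreement, and it assumes the cleanup reviews every disagreement and fixes every label that turns out to be wrong.

```python
def error_counts(human_wrong, clf_wrong, both_wrong):
    """Hypothetical bookkeeping for a one-sided log cleanup.

    human_wrong: messages whose hand label is wrong
    clf_wrong:   messages the classifier really gets wrong
    both_wrong:  messages where both are wrong, so the classifier
                 agrees with the bad label and nothing is flagged
    """
    if not 0 <= both_wrong <= min(human_wrong, clf_wrong):
        raise ValueError("both_wrong must fit inside both error sets")

    # Label wrong, classifier right: logged as an FP/FN that is not real.
    only_human = human_wrong - both_wrong
    # Label right, classifier wrong: a real FP/FN, logged correctly.
    only_clf = clf_wrong - both_wrong

    return {
        # Wrong labels in the corpus before and after the cleanup.
        "label_errors_before": human_wrong,
        "label_errors_after": both_wrong,
        # FPs+FNs as the logs report them before and after the cleanup.
        "measured_before": only_human + only_clf,
        "measured_after": only_clf,
        # FPs+FNs the classifier actually makes.
        "true_errors": clf_wrong,
    }


# Made-up counts, chosen only to show the direction of each effect.
result = error_counts(human_wrong=100, clf_wrong=200, both_wrong=20)
for name, value in result.items():
    print(f"{name}: {value}")
```

With these invented counts the cleanup leaves fewer wrong labels in the corpus, and the reported FP/FN total ends up below the true one by exactly `both_wrong`. How large `both_wrong` is in real corpora is the open question; nothing in this sketch estimates it.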
