http://bugzilla.spamassassin.org/show_bug.cgi?id=2910
------- Additional Comments From [EMAIL PROTECTED] 2004-01-08 11:36 -------
oops -- a thread diverged on sa-dev without being cc'd to bugzilla-daemon. Here
it is:
Subject: Re: [Bug 2910] New: Fast SpamAssassin score learning tool.
From: Sidney Markowitz <[EMAIL PROTECTED]>
Date: Fri, 09 Jan 2004 07:23:13 +1300
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Henry Stern wrote:
> while the GA requires several hours to run on high-end
> machines, the perceptron requires only about 15 seconds
That's impressive. How close are the results to those of the GA? That's
actually two questions: 1) How close is the scoring that the perceptron
comes up with to the scoring that the GA comes up with? and 2) How much
difference in spam categorization results is there between using the
scores generated by the perceptron and those generated by the GA?
== sidney
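For illustration only, the two questions above can be turned into two separate measurements. This is a minimal sketch, not anything from the masses scripts: the function names, the rule names, and the dict-of-scores representation are all hypothetical, and the 5.0 threshold is assumed to match the default spam threshold.

```python
def score_distance(scores_a, scores_b):
    """Question 1: mean absolute per-rule score difference.

    Each argument is a hypothetical dict of rule name -> score;
    a rule missing from one set is treated as having score 0.
    """
    rules = set(scores_a) | set(scores_b)
    if not rules:
        return 0.0
    total = sum(abs(scores_a.get(r, 0.0) - scores_b.get(r, 0.0)) for r in rules)
    return total / len(rules)


def disagreement_rate(scores_a, scores_b, messages, threshold=5.0):
    """Question 2: fraction of messages the two score sets classify differently.

    `messages` is a list of sets of rule names hit by each message.
    The threshold of 5.0 is an assumption, not taken from the thread.
    """
    def is_spam(scores, hits):
        return sum(scores.get(r, 0.0) for r in hits) >= threshold

    if not messages:
        return 0.0
    differing = sum(
        1 for hits in messages
        if is_spam(scores_a, hits) != is_spam(scores_b, hits)
    )
    return differing / len(messages)
```

The two numbers can diverge: score sets that look quite different rule by rule may still classify nearly every message the same way, which is why they are worth reporting separately.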
From: "Henry Stern" <[EMAIL PROTECTED]>
Date: Thu, 8 Jan 2004 14:36:17 -0400 (10:36 PST)
To: "'Sidney Markowitz'" <[EMAIL PROTECTED]>,
<[EMAIL PROTECTED]>
> -----Original Message-----
> From: Sidney Markowitz [mailto:[EMAIL PROTECTED]]
>
> That's impressive. How close are the results to those of the GA? That's
> actually two questions: 1) How close is the scoring that the perceptron
> comes up with to the scoring that the GA comes up with? and 2) How much
> difference in spam categorization results is there between using the
> scores generated by the perceptron and those generated by the GA?
The original Perl implementation (with its own parser) was able to find a
scoreset that made fewer false positives and negatives than the GA on
2.60-set1. The much-faster C version uses the scripts in /masses to
generate C code. They have a lot of GA-related tweaks in them which need to
be turned off.
There also seem to be some "bad" rules that have been removed since the
masses were last run. The perceptron was able to find suitable scores for
them (quite low but non-zero) which reduced the number of false positives
and negatives. It might be worthwhile to put them back in and examine the
results.
I'm not going to burn the week's worth of CPU time to do a comparative
analysis of the two algorithms. However, if anyone feels like running a
10-fold cross validation with both the GA and several configurations of the
perceptron and then sending me the results, I will do the statistical
analysis part.
Henry
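To make the proposed comparison concrete, here is a rough sketch of perceptron-style score learning with a k-fold split. It is not Henry's code and does not use the masses scripts: the data layout (a set of rule indices per message plus a spam flag), the function names, the learning rate, and the fixed 5.0 threshold are all assumptions made for illustration.

```python
import random


def score(weights, hits):
    """Sum the scores of the rules a message hit."""
    return sum(weights[r] for r in hits)


def train_perceptron(messages, n_rules, threshold=5.0, rate=0.5,
                     epochs=50, seed=0):
    """Learn one score per rule with the classic perceptron update.

    `messages` is a list of (hits, is_spam) pairs, where `hits` is a set
    of rule indices. The threshold is held fixed (an assumption), so only
    the per-rule scores are adjusted.
    """
    rng = random.Random(seed)
    weights = [0.0] * n_rules
    order = list(messages)
    for _ in range(epochs):
        rng.shuffle(order)
        mistakes = 0
        for hits, is_spam in order:
            predicted = score(weights, hits) >= threshold
            if predicted != is_spam:
                # Push the scores of the rules that fired toward the
                # correct side of the threshold.
                step = rate if is_spam else -rate
                for r in hits:
                    weights[r] += step
                mistakes += 1
        if mistakes == 0:
            break  # every training message is classified correctly
    return weights


def count_errors(weights, messages, threshold=5.0):
    """Return (false_positives, false_negatives) on `messages`."""
    fp = sum(1 for hits, is_spam in messages
             if not is_spam and score(weights, hits) >= threshold)
    fn = sum(1 for hits, is_spam in messages
             if is_spam and score(weights, hits) < threshold)
    return fp, fn


def k_folds(messages, k=10):
    """Yield (train, test) pairs; every k-th message goes to the test fold."""
    for i in range(k):
        test = messages[i::k]
        train = [m for j, m in enumerate(messages) if j % k != i]
        yield train, test
```

A cross-validation run would train on each `train` split, call `count_errors` on the matching `test` split, and collect the per-fold counts for each algorithm so they can be compared statistically.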
Also, I've asked Henry to send in a CLA and one is on the way -- either fax or
post:
(11:32:19) Justin: oh BTW -- now that the code is posted, any chance we can get
a CLA? ;)
(11:32:27) Justin: so we can use it
(11:32:37) Henry: the CLA is signed and in my clipboard
(11:32:45) Justin: cool
(11:32:54) Henry: I'll head over to the post office
(11:33:00) Henry: and either fax it (if they do that) or mail it
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.