On Mon, Nov 21, 2005 at 10:39:41PM +1300, Sidney Markowitz wrote:
> Are you saying that there is no advantage in accuracy over using pure
> NaiveBayes, but you prefer to use SA-Train because it is simpler than
> ongoing incremental learning and the resulting model is smaller?
Well, the main reason for implementing SA-Train was that we were
using SpamAssassin at the institute, and creating user_prefs /
bayes_* files was easier than setting up a different system. Plus,
for someone with a background in ensemble learning, having different
levels of systems to tune is an interesting challenge. I actually
found out only later that SpamBayes is competitive.

Yes, the resulting model is smaller for SA, and with the spamc/spamd
daemon setup I would expect it to be more scalable than SpamBayes.
And no, incremental learning is much harder (but not impossible) for
SA-Train, which initially led to the batch-style learning approach,
and to the other results. I've tested just training the NB model
within SA, and to some extent it works, but it is unclear how far you
can go with that... at some point it is likely to break down, and the
rule weights have to be adapted.

Some other feedback: I have grown sceptical of the auto-whitelist,
especially for mailing lists which occasionally get spam. A single
spam will almost invariably lead to the next mail from that list
being lost. And the auto-training for Bayes never worked for me or
any of my colleagues - switching it on invariably trained on the
wrong mails and degraded the model significantly, so we switched it
off for everyone. Perhaps European mails are indeed significantly
different from US-American ones? It cannot be the language:
German-language spam is a recent fad, and not very big.
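
As a rough illustration of the auto-whitelist effect described above,
here is a minimal sketch of an averaging whitelist. It assumes the
adjusted score is the raw score pulled halfway toward the sender's
historical mean. The class name, the sender address, the scores and the
threshold of 5.0 are all hypothetical; this is not SpamAssassin's actual
code.

```python
# Simplified model of an averaging auto-whitelist (illustrative only).
class AutoWhitelist:
    def __init__(self, factor=0.5):
        self.factor = factor  # assumed strength of the pull toward the mean
        self.total = {}       # sender -> sum of raw scores seen so far
        self.count = {}       # sender -> number of messages seen so far

    def adjust(self, sender, raw_score):
        """Return the score after pulling it toward the sender's mean."""
        n = self.count.get(sender, 0)
        adjusted = raw_score
        if n > 0:
            mean = self.total[sender] / n
            adjusted = raw_score + (mean - raw_score) * self.factor
        # The raw (unadjusted) score is what enters the history.
        self.total[sender] = self.total.get(sender, 0.0) + raw_score
        self.count[sender] = n + 1
        return adjusted


THRESHOLD = 5.0  # hypothetical spam threshold

awl = AutoWhitelist()
first_ham = awl.adjust("list@example.org", 1.0)   # no history yet
spam = awl.adjust("list@example.org", 30.0)       # one spam via the list
next_ham = awl.adjust("list@example.org", 1.0)    # legitimate follow-up
next_ham_lost = next_ham >= THRESHOLD
```

In this toy history the legitimate follow-up is pushed above the
threshold by the one preceding spam. A longer clean history for the
sender would dilute the effect.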

In a way I have to thank SpamAssassin for a lot of inspiration and
ideas. I am now using SpamBayes mainly to see how well it fares and
to check incremental learning out for myself. As I said, there is
really not much difference in performance, although I suspect that
a lot of mail is needed to achieve this level and that pooling
multiple users (we used eight users from the institute) is necessary.


> How does performance of SpamAssassin used with SA-Train for rules and
> Bayes compare with using the same training set to train SpamBayes?
Perhaps the best way to answer this question is to refer to
TR-2005-20 (see alex.seewald.at/spam), p. 23, Fig. 5, which shows a
FP/FN rate trade-off curve (i.e. the X-axis is the FP rate on a
logarithmic scale, the Y-axis is the FN rate on a logarithmic scale).
SA-Train.pl actually implements SA-Simple and not SA-Tune, which is
the one shown, but SA-Simple and SA-Tune perform very similarly
(see Fig. 6).
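
For readers unfamiliar with such curves, the following is a minimal
sketch of how FP/FN trade-off points can be computed by sweeping a
decision threshold over classifier scores. The function name, the
scores, the labels and the thresholds are made up for illustration and
are not taken from the report.

```python
def fp_fn_curve(scores, is_spam, thresholds):
    """Return (threshold, fp_rate, fn_rate) for each threshold.

    A message is classified as spam when its score >= threshold.
    """
    n_ham = sum(1 for s in is_spam if not s)
    n_spam = sum(1 for s in is_spam if s)
    points = []
    for t in thresholds:
        # False positive: ham scored at or above the threshold.
        fp = sum(1 for sc, sp in zip(scores, is_spam) if not sp and sc >= t)
        # False negative: spam scored below the threshold.
        fn = sum(1 for sc, sp in zip(scores, is_spam) if sp and sc < t)
        points.append((t, fp / n_ham, fn / n_spam))
    return points


# Hypothetical scores: four ham messages followed by four spam messages.
scores = [0.5, 1.2, 3.0, 6.5, 4.0, 7.5, 9.0, 12.0]
is_spam = [False, False, False, False, True, True, True, True]
curve = fp_fn_curve(scores, is_spam, thresholds=[2.0, 5.0, 8.0])
```

Raising the threshold lowers the FP rate and raises the FN rate. Note
that points with a rate of exactly zero cannot be drawn on a
logarithmic axis.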


> Thank you. I'm glad that you won't mind if someone does decide to
> implement your idea this way and contribute it to the project. In the
> meantime, as I said we would be very happy to see you place a
> description and a link in the appropriate place in the SpamAssassin wiki.
Ok, I will do that.

Thanks,
Dr.techn. Alexander K. Seewald

Solutions for the 21st century   +43(664)1106886
------------------------------------------------
         Information wants to be free;
Information also wants to be expensive (S.Brant)
--------------- alex.seewald.at ----------------
