At 14:03 2004/03/27, Oliver Thalmann wrote:

> Is it possible somehow (has anyone already done it)
> to "merge" the 2 databases into 1 "master reference",
> which could then be recopied over the 2 independent
> databases, thus reverting to a common "opinion" on
> what is Bayesian spam and what is not?

In a multi-node scenario, where you have an array of content filtering boxes, you don't really want to be doing auto-learning on each box independently. As you point out, this leads to each box developing its own unique Bayes database, such that one box ends up better able to detect particular kinds of spam than the others. Another problem is that this makes mistake-based training more complicated, since you have to do your training corrections on the box that made the mistake.


A better solution is to just maintain one "master" database on one of these boxes, and make all of the other machines "slaves" that share the master's database. There are at least a couple of different ways to do this:

(1) You can use NFS, Samba, or some other remote file system to mount the master's database on each of the slaves. In read/write mode, all of the boxes can auto-learn, though the reliability of NFS locks may be an issue to contend with. All mistake-based training should take place on the master.

(2) You can choose to do all of your learning only on the master, and then use something like rsync to send read-only copies to the slaves at regular intervals. The slaves should have auto-learning disabled, and all mistake-based training should take place on the master.
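As a rough illustration of option (2), here is a minimal Python sketch that the master could run from cron to push its database to the slaves. The directory path and host names are hypothetical placeholders, it assumes rsync over passwordless SSH is already set up, and it is only one way of doing this, not a tested recipe.

```python
import subprocess

# Hypothetical values; adjust to your own setup.
BAYES_DIR = "/var/lib/spamassassin/bayes"  # directory holding the master's Bayes files
SLAVES = ["filter2.example.com", "filter3.example.com"]


def build_rsync_command(src_dir, host, dest_dir):
    """Return the rsync argv that copies src_dir's contents to host:dest_dir."""
    # -a preserves permissions and timestamps; the trailing slash on the
    # source copies the directory's contents rather than the directory itself.
    src = src_dir.rstrip("/") + "/"
    dest = f"{host}:{dest_dir.rstrip('/')}/"
    return ["rsync", "-a", src, dest]


def push_to_slaves(src_dir, hosts, dest_dir, run=subprocess.run):
    """Push the master's database to every slave; return the hosts that failed."""
    failed = []
    for host in hosts:
        result = run(build_rsync_command(src_dir, host, dest_dir))
        if result.returncode != 0:
            failed.append(host)
    return failed

# Example call from a scheduled job on the master:
#   push_to_slaves(BAYES_DIR, SLAVES, BAYES_DIR)
```

On the slaves you would pair this with "bayes_auto_learn 0" in local.cf, so that nothing writes to the copied database between pushes.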

If your users are "confirming" the status of the mail they receive as spam or ham somehow (e.g. with a quarantine management mechanism of some kind), then you don't need to use auto-learning at all. A script that runs at scheduled intervals can run sa-learn on the confirmed spam and confirmed ham in order to train your Bayes database. Copy that database (read-only) to all of your nodes, and you're pretty much done.
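A scheduled training script along those lines might look like the following Python sketch. The two directory names are hypothetical (they stand in for wherever your quarantine tool leaves confirmed mail), and it assumes sa-learn is on the PATH of the user running the job.

```python
import subprocess

# Hypothetical locations where the quarantine tool drops user-confirmed mail.
CONFIRMED_SPAM_DIR = "/var/quarantine/confirmed-spam"
CONFIRMED_HAM_DIR = "/var/quarantine/confirmed-ham"


def build_learn_commands(spam_dir, ham_dir):
    """Return the sa-learn invocations for one training pass."""
    return [
        ["sa-learn", "--spam", spam_dir],  # learn every message in spam_dir as spam
        ["sa-learn", "--ham", ham_dir],    # learn every message in ham_dir as ham
    ]


def train(spam_dir, ham_dir, run=subprocess.run):
    """Run each sa-learn command in turn; return False at the first failure."""
    for cmd in build_learn_commands(spam_dir, ham_dir):
        if run(cmd).returncode != 0:
            return False
    return True

# Example call from a scheduled job on the master:
#   train(CONFIRMED_SPAM_DIR, CONFIRMED_HAM_DIR)
```

Run it on the master only, and copy the database out to the other nodes once a pass succeeds.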


> Maybe this point of "common" Bayes learning will become
> "unimportant" (I don't remember the right word now) in
> SA 3.x (or was it 2.70), when the Bayes DB can be stored
> in a real SQL database, so multiple hosts should then be able to
> write to the same database?

Yes, that's one of the ideas behind moving the Bayes database to an SQL server--it can be more easily shared across an array of content filters, without having to manage filesystem-level sharing, lockfiles, copying databases, etc.



Robert LeBlanc <[EMAIL PROTECTED]>
Renaissoft, Inc.
Maia Mailguard <http://www.renaissoft.com/maia/>




