Robert LeBlanc wrote:

>
> A better solution is to just maintain one "master" database on one of these
> boxes, and make all of the other machines "slaves" that share the master's
> database.  There are at least a couple of different ways to do this:

Somehow, in a two-box scenario where the boxes are supposed to be "twins", I
don't like the idea of one of the two boxes being a "master" to the other.
If the master crashes, the slave will be somewhat lost (it loses the remote
fs mount, etc.), and it would need some local recovery procedure (use a local
backup of the bayes database, or such) in order to continue to work
correctly.

An independent "db master" node, which relies on rsync/file-copy methods to
synchronise the slaves, looks a lot more solid and fault-tolerant (if the
master crashes, the slaves just won't get any updated db).
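
To make the "file copy" half of this concrete, here is a rough Python sketch
of how a slave could install a freshly received copy: stage it next to the
live file, then rename it into place, so a copy that dies halfway never
leaves a truncated db behind. The function name and the paths are made up for
illustration, and it does not deal with SA having the old file open while the
swap happens.

```python
import os
import shutil
import tempfile


def install_db_copy(new_copy: str, live_path: str) -> None:
    """Replace live_path with new_copy without exposing a partial file."""
    live_dir = os.path.dirname(os.path.abspath(live_path))
    # Stage the copy in the same directory so the final rename stays on one
    # filesystem, which is what makes os.replace() atomic on POSIX.
    fd, staged = tempfile.mkstemp(dir=live_dir, prefix=".bayes-staging-")
    os.close(fd)
    try:
        # copy2 copies the content plus permission bits and timestamps.
        shutil.copy2(new_copy, staged)
        os.replace(staged, live_path)
    except BaseException:
        # Leave the old database untouched if anything goes wrong.
        if os.path.exists(staged):
            os.remove(staged)
        raise
```

If the master is down, nothing calls this and the slave simply keeps
filtering with the last db it received.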

The SQL bayes implementation should be very interesting for this (master-slave
replication done at the database level).

> (1) You can use NFS, Samba, or some other remote file system to mount the
> master's database on each of the slaves.  In read/write mode, all of the
> boxes can auto-learn, though the reliability of NFS locks may be an issue
> to contend with.  All mistake-based training should take place on the master.

See above (if the master crashes, the slave has to have some kind of recovery
scenario, and if using NFS, the slave, or at least its SA processes, be it
amavis or spamd, will probably hang until NFS times out, I think... what
happens after the timeout could be interesting).

>
> (2) You can choose to do all of your learning only on the master, and then
> use something like rsync to send read-only copies to the slaves at regular
> intervals.  The slaves should have auto-learning disabled, and all
> mistake-based training should take place on the master.

That would be an idea. In my scenario, this would mean placing a supplemental
SA/amavisd (slightly trimmed, without the DNS blacklists, for example, as
they are not used for the "learning score", if I remember right) on the
mailserver or another independent machine, which would rescan all email that
got through, learn it as ham/spam into its own master db, then periodically
copy the master db to both slaves.
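
The periodic copy could be something like the following Python sketch, run
from cron on the learning box. The host names and directories are
placeholders I made up, and it assumes rsync over ssh with key-based login is
already set up. It also ignores the question of copying the db while SA is
writing to it, which would need some care.

```python
import subprocess

# Hypothetical names: adjust to the real hosts and paths.
MASTER_DB_DIR = "/var/lib/spamassassin/bayes/"
SLAVES = ["mx1.example.com", "mx2.example.com"]
SLAVE_DB_DIR = "/var/lib/spamassassin/bayes/"


def rsync_command(src_dir, host, dest_dir):
    """Build the rsync call that mirrors src_dir to host:dest_dir."""
    # -a keeps permissions and timestamps; --delay-updates holds the new
    # files in a temporary directory and renames them into place at the end
    # of the transfer.
    return ["rsync", "-a", "--delay-updates", src_dir, f"{host}:{dest_dir}"]


def push_to_slaves(run=subprocess.run):
    """Push the master db to every slave; return the hosts that failed."""
    failed = []
    for host in SLAVES:
        result = run(rsync_command(MASTER_DB_DIR, host, SLAVE_DB_DIR))
        if result.returncode != 0:
            failed.append(host)
    return failed
```

Calling push_to_slaves() returns the list of slaves that could not be
reached; those just keep their previous db until the next run.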

Of course, it could (better solution?) also be an independent box which
receives a copy of every mail that got through (postfix's "always_bcc" could
be handy), just for scanning/learning purposes.

So at least, if this master crashes, both MXes still receive and filter
email; they just won't be up to date with the bayesian stuff (and if the 3rd
instance is the mailserver itself, they won't be able to relay the email
either, but who cares at this point).

I'll investigate these solutions...

>
> If your users are "confirming" the status of the mail they receive as spam
> or ham somehow (e.g. with a quarantine management mechanism of some kind),
> then you don't need to use auto-learning at all.  A script that runs at
> scheduled intervals can run sa-learn on the confirmed spam and confirmed
> ham in order to train your Bayes database.  Copy that database (read-only)
> to all of your nodes, and you're pretty much done.

Another idea: not having autolearn enabled on the 2 receiving boxes, and
somehow pulling out the (quarantined by amavisd) mails that would otherwise
have been autolearned by SA (how..?), then training bayes on both nodes
simultaneously with those mails via a cronjob.
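
The cronjob part might look roughly like this Python sketch, to be run on
each node. It assumes the mails have already been sorted into one spam and
one ham directory (the paths are invented, and how to get the mails out of
the amavisd quarantine is exactly the open question above). The only
sa-learn options it uses are --spam and --ham.

```python
import subprocess

# Hypothetical directories holding the messages pulled out of the quarantine.
SPAM_DIR = "/var/spool/bayes-train/spam"
HAM_DIR = "/var/spool/bayes-train/ham"


def training_commands(spam_dir, ham_dir):
    """Return the sa-learn invocations for one training run."""
    return [
        ["sa-learn", "--spam", spam_dir],
        ["sa-learn", "--ham", ham_dir],
    ]


def train(run=subprocess.run):
    """Run every training command; return True if all of them succeeded."""
    ok = True
    for cmd in training_commands(SPAM_DIR, HAM_DIR):
        if run(cmd).returncode != 0:
            ok = False
    return ok
```

A cron entry would call train() at the same interval on both boxes; since
both see the same mails, their dbs should stay close without any copying.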

>
> >maybe this point of "common" bayes learning will become
> >"unimportant" (i don't remember the right word now) in
> >SA 3.x (or was it 2.70) when the Bayes DB can be stored
> >in a real SQL database, multiple hosts should then be able to
> >write to the same database ?
>
> Yes, that's one of the ideas behind moving the Bayes database to an SQL
> server--it can be more easily shared across an array of content filters,
> without having to manage filesystem-level sharing, lockfiles, copying
> databases, etc.

Will it be possible to send the update writes (from multiple SA instances) to
one master db, but do the reads from (replicated & local) slave dbs?

