On Fri, 2009-02-13 at 15:21 -0500, Kris Deugau wrote:
> Lindsay Haisley wrote:
> > I have two servers.  Currently they're both running instances of spamd
> > with separate mysql databases, however I'd like to run both instances from
> > the same database on one of the servers. There are two ways to do this:
> > 
> > 1.  I can give the -d option to spamc where it's invoked in the mail
> > system, with the target being spamd on the master spamassassin server
> > via the VPN that connects the two boxes.  spamd is already configured to
> > listen to it.
> 
> Mm, I don't think this does what you're hoping.  spamd on any given 
> system will use the configured database (local or otherwise) - this is 
> **NOT** something the client can request.
> 
> From man spamc:
> 
>         -d host[,host2], --dest=host[,host2]
>             In TCP/IP mode, connect to spamd server on given host
>             (default: localhost).  Several hosts can be specified
>             if separated by commas.
> 
> This only affects which spamd server the client asks to process the 
> message;  it doesn't affect any aspect of the actual processing.

I think you misunderstand me.  If spamc on machine A is invoked with -d
<IP address of machine B>, then its mail is processed with whatever
databases and configurations are in effect for spamd on machine B.  This
is what the -d option is for.  The "actual processing" is done by spamd, whichever
instance (machine A or B) is addressed by the spamc client, so I do have
a choice here, and that's what I want to decide on.  spamc is basically
just a passive client which reads and writes emails and passes off the
job of spam processing to spamd, wherever it may be.

If spamc on machine B uses its local spamd instance (the same one
machine A is using) as a server, then the task I'm trying to do is
accomplished since both machines are ultimately using the same database.
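
To make approach 1 concrete, here is a rough sketch of how the spamc
invocation would differ on the two machines.  The VPN address is made
up for illustration, the helper name is hypothetical, and I'm assuming
spamd listens on its default port (783); only the -d and -p options
come from spamc itself.

```python
import shlex


def spamc_command(spamd_host, port=783):
    """Build a spamc argv that hands mail to spamd on spamd_host.

    spamd_host and port are illustrative; 783 is spamd's default port.
    """
    return ["spamc", "-d", spamd_host, "-p", str(port)]


# Hypothetical VPN address of machine B (the master spamd server).
MACHINE_B_VPN = "10.8.0.2"

# Machine A sends its mail across the VPN to spamd on machine B.
cmd_a = spamc_command(MACHINE_B_VPN)

# Machine B keeps talking to its own local spamd instance.
cmd_b = spamc_command("127.0.0.1")

print(shlex.join(cmd_a))
print(shlex.join(cmd_b))
```

Either way the same spamd instance, and therefore the same database,
ends up handling mail from both boxes.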

> > Does anyone with some experience with spamassassin know which of these
> > two approaches would be better?  Which would be fastest?  Which would be
> > most conservative of bandwidth between the boxes?
> 
> A lot depends on the hardware you're using.  If you're trying to squeeze 
> some last bits of performance out of a heavily-loaded system by 
> eliminating the SQL duplication, you'll probably have to tune the spamd 
> instances differently as well (eg, the system running MySQL won't be 
> able to support as many spamd children as the other one).  You haven't 
> said what's in MySQL for SA;  IME anything more than a couple of hundred 
> users sucks up too much IO for per-user Bayes and/or AWL (not to mention 
> the staggering disk requirements - even at today's disk prices).

The current load on what I've defined above as "machine B" is quite
manageable, and this is the box that's now handling over 90% of traffic
to probably a couple of hundred mailboxes on the system.  The MySQL
tables used by SA are at well less than a gig on a box that has close to
half a TB of drive space on it, and SA has been running there for over a
year.  The system load avg runs consistently under 1 except when
cron-initiated maintenance happens.

> The cluster I'm doing most of my SA tuning on these days currently has 3 
> machines running spamd, and a fourth running MySQL (and some other 
> unrelated services, otherwise it would run spamd as well).  Each machine 
> has the same SA config pointing to the same database on that fourth 
> machine - but clients don't see this, and can't affect it.
> 
> If the machines are not on the same local Ethernet segment, you're 
> probably better off leaving well enough alone, because any gains you 
> make in eliminating the SQL duplication will be lost waiting for data to 
> move across the network.  Or worse.

My intention here is to optimize administration, both for migration and
for those parts of SA for which I've programmed customer UIs.
Considering the number of checks involved in email by the MTA, what with
top level RBL checking (done by the MTA) and hitting SA twice, I don't
think waiting for one more transaction will be problematic.

Although I appreciate your advice, my question here is not _whether_ I
should do the integration, but which of the two methods of integrating
the databases will be most economical with bandwidth and other resources.

-- 
Lindsay Haisley       | "Everything works    |    Accredited
FMP Computer Services |       if you let it" |      by the
512-259-1190          |    (The Roadie)      |   Austin Better
http://www.fmp.com    |                      |  Business Bureau
