Re: Two servers, one database. A question

2009-02-15 Thread Martin Gregorie
On Sun, 2009-02-15 at 02:05 +0100, Karsten Bräckelmann wrote:
 Lindsay, if you end up doing some benchmarking, please let us know. I
 wouldn't be surprised if you're actually the first one to do this across
 the Internet. :)
 
Just a thought. Since getting message sizes and counts on traffic
between a client and server isn't the easiest thing to do unless they're
already instrumented to collect this information, the best approach may
be two pronged:

1) write a Perl or awk script that processes /var/log/maillog.* and
gathers message size statistics. The regex 'spamd.*bytes.$' will pick
the relevant log lines and the message size is the second to last field.
It would count messages in size bands, e.g. 0-10KB, 10-100KB,
100KB-1MB, 1MB-250MB and over 250MB, to get some size and frequency
statistics.

2) Pick a message from each band and run it through spamc manually while
using Wireshark to capture both spamc-spamd traffic and spamd-MySQL
traffic. Combining the message sizes and counts from the two streams
should give you enough information to correctly size the traffic flows. 
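
As a rough illustration of step 1, the log scan might look like the Python sketch below. The regex and the "second to last field" rule are taken from the description above; the band boundaries in binary KB/MB, the band labels and the function names are assumptions made for the example, not anything SpamAssassin ships.

```python
import re
from collections import Counter

# Regex from the message above: picks spamd result lines that end in "bytes."
SPAMD_LINE = re.compile(r"spamd.*bytes.$")

KB = 1024
MB = 1024 * KB

# (exclusive upper bound in bytes, label); the last band is open-ended.
BANDS = [
    (10 * KB, "0-10KB"),
    (100 * KB, "10-100KB"),
    (1 * MB, "100KB-1MB"),
    (250 * MB, "1MB-250MB"),
    (float("inf"), "over 250MB"),
]


def band_for(size):
    """Return the label of the size band that `size` (in bytes) falls into."""
    for upper, label in BANDS:
        if size < upper:
            return label
    return BANDS[-1][1]


def size_histogram(lines):
    """Count spamd-scanned messages per size band from maillog lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip()
        if not SPAMD_LINE.search(line):
            continue
        fields = line.split()
        try:
            size = int(fields[-2])  # message size is the second to last field
        except (IndexError, ValueError):
            continue  # line matched the regex but has an unexpected layout
        counts[band_for(size)] += 1
    return counts
```

Calling `size_histogram(open("/var/log/maillog"))` on each log file and summing the counters would give the per-band message counts that step 2 needs.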


Question to developers on this list: Why is a message that exceeds the
maximum size skipped entirely? Is there a case for passing its headers
through spamd and then combining the returned headers with the body in
spamc? It would give a bit more protection and doesn't look too
difficult to do since spamd is already capable of handling just the
headers.
 

Martin




Re: Two servers, one database. A question

2009-02-14 Thread Bob Proulx
Kris Deugau wrote:
 John Hardin wrote:
 The question is which is better, sending the message body (spamc -  
 spamd traffic) or database queries (spamd - mysql traffic) over the  
 expensive link?

 I would bet on Bayes/userpref queries being more efficient than the  
 spamc/spamd traffic.

I like that you are asking the question.  But I hate to guess at which
is better though.  The weakest benchmark data point is better than the
strongest guess.  Too often I have taken my best guess and been wrong.
In this case I would guess the opposite would be more efficient, that
the one spamc-spamd connection per message would be more efficient
than the many mysql queries per message, which is why I bring this up.
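
One way to collect such a data point is to time the same message through spamc under each configuration. The Python sketch below is only an illustration: the helper names are hypothetical, and it measures wall-clock time for any command, so the actual spamc invocation is left to the caller.

```python
import statistics
import subprocess
import time


def time_command(argv, stdin_bytes=b"", runs=5):
    """Run argv `runs` times, feeding stdin_bytes, and return wall-clock timings."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            input=stdin_bytes,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,  # a non-zero exit status is not treated as an error here
        )
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings (seconds) to a few comparable numbers."""
    return {
        "runs": len(timings),
        "median": statistics.median(timings),
        "worst": max(timings),
    }
```

For example, `summarize(time_command(["spamc", "-d", "boxb.example.net"], raw_message))` could be run once with a remote spamd and once with a local spamd that talks to the remote MySQL server; the host name here is a placeholder.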

Bob


Re: Two servers, one database. A question

2009-02-14 Thread Lindsay Haisley
On Sat, 2009-02-14 at 15:04 -0600, Bob Proulx wrote:
  I would bet on Bayes/userpref queries being more efficient than the
  spamc/spamd traffic.
 
 I like that you are asking the question.  But I hate to guess at which
 is better though.  The weakest benchmark data point is better than the
 strongest guess.  Too often I have taken my best guess and been wrong.
 In this case I would guess the opposite would be more efficient, that
 the one spamc-spamd connection per message would be more efficient
 than the many mysql queries per message, which is why I bring this up.

Well that's something to consider.  I had hoped when I subscribed to
this list to ask this question that I'd find people, possibly SA
developers on it, who had benchmarked the options I presented for
decision and could give me some definitive answers based on this, but it
appears that this isn't the case.  Instead I've found several people of
good will who don't seem to know a whole lot more about SA than I do,
but have given me some good points to think about.

Do you have any idea where I might inquire to get advice from people
with more precise knowledge?

-- 
Lindsay Haisley   | Everything works|Accredited
FMP Computer Services |   if you let it |  by the
512-259-1190  |(The Roadie)  |   Austin Better
http://www.fmp.com|  |  Business Bureau



Re: Two servers, one database. A question

2009-02-14 Thread Michael Parker


On Feb 14, 2009, at 3:47 PM, Lindsay Haisley wrote:


On Sat, 2009-02-14 at 15:04 -0600, Bob Proulx wrote:

I would bet on Bayes/userpref queries being more efficient than the
spamc/spamd traffic.

I like that you are asking the question.  But I hate to guess at which
is better though.  The weakest benchmark data point is better than the
strongest guess.  Too often I have taken my best guess and been wrong.
In this case I would guess the opposite would be more efficient, that
the one spamc-spamd connection per message would be more efficient
than the many mysql queries per message, which is why I bring this up.

Well that's something to consider.  I had hoped when I subscribed to
this list to ask this question that I'd find people, possibly SA
developers on it, who had benchmarked the options I presented for
decision and could give me some definitive answers based on this, but it
appears that this isn't the case.  Instead I've found several people of
good will who don't seem to know a whole lot more about SA than I do,
but have given me some good points to think about.

Do you have any idea where I might inquire to get advice from people
with more precise knowledge?



This is the best place.  It's not a common setup, so I doubt that
anyone really knows the correct answer.


One data point I'll add is that spamc has a compress mode that might  
be useful (spamc -z).  Also, it would take a little work on your end  
but you can also pass in --headers to further reduce the spamc/spamd
traffic.  Check out the spamc man page for more info.
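
A minimal sketch of how those options might be combined, written in Python for illustration. The `-d`, `-z` and `--headers` flags are the ones named in this thread; the host name, the function names and the use of zlib to preview the savings are assumptions, and the ratio spamc actually achieves may differ.

```python
import subprocess
import zlib


def spamc_command(host, compress=True, headers_only=True):
    """Build a spamc argument list for scanning on a remote spamd."""
    argv = ["spamc", "-d", host]
    if compress:
        argv.append("-z")  # compress the message sent to spamd
    if headers_only:
        argv.append("--headers")  # spamd sends back only the rewritten headers
    return argv


def scan(message, host):
    """Pipe one raw message (bytes) through spamc and return spamc's stdout."""
    result = subprocess.run(
        spamc_command(host),
        input=message,
        stdout=subprocess.PIPE,
        check=False,
    )
    return result.stdout


def compression_ratio(message):
    """Rough preview of the -z saving for a non-empty message (zlib default level)."""
    return len(zlib.compress(message)) / len(message)
```

Running `compression_ratio()` over one sample message per size band gives a first idea of whether `-z` is worth enabling before anything is changed in the mail system.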


One other thing related to MySQL.  I've never personally done it but  
I'm certain there are ways you could use MySQL proxy or perhaps even  
federated tables to manage this sort of thing.  MySQL proxy has lots  
of different functions, I'm sure compression is either one of them or  
at least something that can be easily bolted on.


Michael






Re: Two servers, one database. A question

2009-02-14 Thread Karsten Bräckelmann
On Sat, 2009-02-14 at 17:07 -0600, Michael Parker wrote:
 On Feb 14, 2009, at 3:47 PM, Lindsay Haisley wrote:

  Well that's something to consider.  I had hoped when I subscribed to
  this list to ask this question that I'd find people, possibly SA
  developers on it, who had benchmarked the options I presented for
  decision and could give me some definitive answers based on this, but it
  appears that this isn't the case.  Instead I've found several people of
  good will who don't seem to know a whole lot more about SA than I do,
  but have given me some good points to think about.

Being a SA dev doesn't necessarily imply any need to use SQL based
storage. Let alone scanning on an off-site server. :)  I, for one,
don't. So take it with a grain of salt.

  Do you have any idea where I might inquire to get advice from people
  with more precise knowledge?
 
 This is the best place.  It's not a common setup, so I doubt that
 anyone really knows the correct answer.
 
 One data point I'll add is that spamc has a compress mode that might  
 be useful (spamc -z).  Also, it would take a little work on your end  
 but you can also pass in --headers to further reduce the spamc/spamd
 traffic.  Check out the spamc man page for more info.

Ah, good one -- I forgot about the -z option, otherwise I would have
chipped in before. The headers option is something I was thinking about
already. This basically reduces the traffic from 2 times the mail stream
(as mentioned), to one times.

Regarding SQL traffic and Bayes -- tokenizing a message into unique
tokens, then adding the SQL overhead. Would that really be less than the
raw average message? Another thing to keep in mind is latency, iff there
are multiple queries involved. Versus the single round-trip of spamc.
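
The trade-off described above can be put into a back-of-the-envelope model. The Python sketch below is only that: the function names are hypothetical, compression is ignored, and every input (message size, header size, token count, bytes per token lookup, number of round trips, round-trip time) is a placeholder that would have to come from real measurements such as a packet capture.

```python
def link_bytes_remote_spamd(msg_bytes, header_bytes, headers_only=False):
    """Scenario 1: the message crosses the link to spamd and a reply comes back.

    Without --headers the reply is roughly the whole message again (2x);
    with --headers only the headers come back (roughly 1x plus headers).
    """
    returned = header_bytes if headers_only else msg_bytes
    return msg_bytes + returned


def link_bytes_remote_mysql(unique_tokens, bytes_per_token):
    """Scenario 2: only the Bayes token lookups cross the link."""
    return unique_tokens * bytes_per_token


def added_latency(round_trips, rtt_seconds):
    """Network wait added per message if the round trips are not overlapped."""
    return round_trips * rtt_seconds
```

With made-up inputs of a 20000-byte message, 2000 bytes of headers, 500 unique tokens and 60 bytes per token lookup, the three byte counts can be compared directly; the interesting question is how they shift once measured values replace the placeholders.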

On the other hand, there is manageability. A single spamd is easier than
keeping two in sync. Probably not too challenging, though. ;)

To throw in another crack idea: What about consolidating the MXs? And
then internally forwarding the already processed messages?


Lindsay, if you end up doing some benchmarking, please let us know. I
wouldn't be surprised if you're actually the first one to do this across
the Internet. :)

  guenther


-- 
char *t=\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4;
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;il;i++){ i%8? c=1:
(c=*++x); c128  (s+=h); if (!(h=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: Two servers, one database. A question

2009-02-13 Thread Andre


On Thu, 12 Feb 2009, Lindsay Haisley wrote:

 I have two servers.  Currently they're both running instances of spamd
 with separate mysql databases, however I'd like to run both instances from
 the same database on one of the servers. There are two ways to do this:

 1.  I can give the -d option to spamc where it's invoked in the mail
 system, with the target being spamd on the master spamassassin server
 via the VPN that connects the two boxes.  spamd is already configured to
 listen to it.

I'd prefer the above for the following reason: you only need to worry
about a single spamassassin server (as long as it can hold up to the
load). You prevent inconsistencies when upgrading etc.


 2.  I can let spamc invoke spamd on the local system but set the various
 dsn params in secrets.cf to point to the MySQL database on the master
 spamassassin server.  The mysql server on this box is already listening
 for queries from the other system via the VPN that connects them.

 Does anyone with some experience with spamassassin know which of these
 two approaches would be better?  Which would be fastest?  Which would be
 most conservative of bandwidth between the boxes?

'Fastest' depends on the load on the servers.
Bandwidth will depend on how large your average message is, and what you
store in the database (user prefs, awl, bayes...)

-andre






Re: Two servers, one database. A question

2009-02-13 Thread Kris Deugau

Lindsay Haisley wrote:

I have two servers.  Currently they're both running instances of spamd
with separate mysql databases, however I'd like to run both instances from
the same database on one of the servers. There are two ways to do this:

1.  I can give the -d option to spamc where it's invoked in the mail
system, with the target being spamd on the master spamassassin server
via the VPN that connects the two boxes.  spamd is already configured to
listen to it.


Mm, I don't think this does what you're hoping.  spamd on any given 
system will use the configured database (local or otherwise) - this is 
**NOT** something the client can request.


From man spamc:

   -d host[,host2], --dest=host[,host2]
   In TCP/IP mode, connect to spamd server on given host
   (default: localhost).  Several hosts can be specified
   if separated by commas.

This only affects which spamd server the client asks to process the 
message;  it doesn't affect any aspect of the actual processing.



2.  I can let spamc invoke spamd on the local system but set the various
dsn params in secrets.cf to point to the MySQL database on the master
spamassassin server.  The mysql server on this box is already listening
for queries from the other system via the VPN that connects them.


If all you're looking to do is use a single MySQL instance, then this is 
your only choice.



Does anyone with some experience with spamassassin know which of these
two approaches would be better?  Which would be fastest?  Which would be
most conservative of bandwidth between the boxes?


A lot depends on the hardware you're using.  If you're trying to squeeze 
some last bits of performance out of a heavily-loaded system by 
eliminating the SQL duplication, you'll probably have to tune the spamd 
instances differently as well (eg, the system running MySQL won't be 
able to support as many spamd children as the other one).  You haven't 
said what's in MySQL for SA;  IME anything more than a couple of hundred 
users sucks up too much IO for per-user Bayes and/or AWL (not to mention 
the staggering disk requirements - even at today's disk prices).


The cluster I'm doing most of my SA tuning on these days currently has 3 
machines running spamd, and a fourth running MySQL (and some other 
unrelated services, otherwise it would run spamd as well).  Each machine 
has the same SA config pointing to the same database on that fourth 
machine - but clients don't see this, and can't affect it.


If the machines are not on the same local Ethernet segment, you're 
probably better off leaving well enough alone, because any gains you 
make in eliminating the SQL duplication will be lost waiting for data to 
move across the network.  Or worse.


-kgd


Re: Two servers, one database. A question

2009-02-13 Thread Lindsay Haisley
On Fri, 2009-02-13 at 15:24 -0600, Lindsay Haisley wrote:
 Although I appreciate your advice, my question here is not _whether_ I
 should do the integration, but which of the two methods of integrating
 the databases will be most efficient of bandwidth and other resources.

After thinking about it, Kris, I do think you're right about the choice,
although not for the reasons you gave.  spamc must pass an entire copy
of each email over the Internet to spamd on the 2nd box.  If I keep the
SA configurations synchronized between boxes, then the only thing which
needs to be shared across the Internet is Bayes processing, plus several
per-user choices as represented in the userpref table.  This _seems_ on
the face of it more efficient than passing off the entire email traffic,
which would have to transit the Internet connection between the boxes
twice.

-- 
Lindsay Haisley   | Everything works|Accredited
FMP Computer Services |   if you let it |  by the
512-259-1190  |(The Roadie)  |   Austin Better
http://www.fmp.com|  |  Business Bureau



Re: Two servers, one database. A question

2009-02-13 Thread Lindsay Haisley
On Fri, 2009-02-13 at 15:21 -0500, Kris Deugau wrote:
 Lindsay Haisley wrote:
  I have two servers.  Currently they're both running instances of spamd
  with separate mysql databases, however I'd like to run both instances from
  the same database on one of the servers. There are two ways to do this:
  
  1.  I can give the -d option to spamc where it's invoked in the mail
  system, with the target being spamd on the master spamassassin server
  via the VPN that connects the two boxes.  spamd is already configured to
  listen to it.
 
 Mm, I don't think this does what you're hoping.  spamd on any given 
 system will use the configured database (local or otherwise) - this is 
 **NOT** something the client can request.
 
  From man spamc:
 
 -d host[,host2], --dest=host[,host2]
 In TCP/IP mode, connect to spamd server on given host
 (default: localhost).  Several hosts can be specified
 if separated by commas.
 
 This only affects which spamd server the client asks to process the 
 message;  it doesn't affect any aspect of the actual processing.

I think you misunderstand me.  If spamc on machine A is invoked with -d
IP address of machine B then spamc will use whatever databases and
configurations are in effect for spamd on machine B.  This is what the
-d option is for.  The actual processing is done by spamd, whichever
instance (machine A or B) is addressed by the spamc client, so I do have
a choice here, and that's what I want to decide on.  spamc is basically
just a passive client which reads and writes emails and passes off the
job of spam processing to spamd, wherever it may be.

If spamc on machine B uses its local spamd instance (the same one
machine A is using) as a server, then the task I'm trying to do is
accomplished since both machines are ultimately using the same database.

  Does anyone with some experience with spamassassin know which of these
  two approaches would be better?  Which would be fastest?  Which would be
  most conservative of bandwidth between the boxes?
 
 A lot depends on the hardware you're using.  If you're trying to squeeze 
 some last bits of performance out of a heavily-loaded system by 
 eliminating the SQL duplication, you'll probably have to tune the spamd 
 instances differently as well (eg, the system running MySQL won't be 
 able to support as many spamd children as the other one).  You haven't 
 said what's in MySQL for SA;  IME anything more than a couple of hundred 
 users sucks up too much IO for per-user Bayes and/or AWL (not to mention 
 the staggering disk requirements - even at today's disk prices).

The current load on what I've defined above as machine B is quite
manageable, and this is the box that's now handling over 90% of traffic
to probably a couple of hundred mailboxes on the system.  The MySQL
tables used by SA are at well less than a gig on a box that has close to
half a TB of drive space on it, and SA has been running there for over a
year.  The system load avg runs consistently under 1 except when
cron-initiated maintenance happens.

 The cluster I'm doing most of my SA tuning on these days currently has 3 
 machines running spamd, and a fourth running MySQL (and some other 
 unrelated services, otherwise it would run spamd as well).  Each machine 
 has the same SA config pointing to the same database on that fourth 
 machine - but clients don't see this, and can't affect it.
 
 If the machines are not on the same local Ethernet segment, you're 
 probably better off leaving well enough alone, because any gains you 
 make in eliminating the SQL duplication will be lost waiting for data to 
 move across the network.  Or worse.

My intention here is to optimize administration, both for migration and
for those parts of SA for which I've programmed customer UIs.
Considering the number of checks involved in email by the MTA, what with
top level RBL checking (done by the MTA) and hitting SA twice, I don't
think waiting for one more transaction will be problematic.

Although I appreciate your advice, my question here is not _whether_ I
should do the integration, but which of the two methods of integrating
the databases will be most efficient of bandwidth and other resources.

-- 
Lindsay Haisley   | Everything works|Accredited
FMP Computer Services |   if you let it |  by the
512-259-1190  |(The Roadie)  |   Austin Better
http://www.fmp.com|  |  Business Bureau



Re: Two servers, one database. A question

2009-02-13 Thread Kris Deugau

Lindsay Haisley wrote:

I think you misunderstand me.  If spamc on machine A is invoked with -d
IP address of machine B then spamc will use whatever databases and
configurations are in effect for spamd on machine B.  This is what the
-d option is for.  The actual processing is done by spamd, whichever
instance (machine A or B) is addressed by the spamc client, so I do have
a choice here, and that's what I want to decide on.  spamc is basically
just a passive client which reads and writes emails and passes off the
job of spam processing to spamd, wherever it may be.

If spamc on machine B uses its local spamd instance (the same one
machine A is using) as a server, then the task I'm trying to do is
accomplished since both machines are ultimately using the same database.


*read* *reread*  Ah, I think I see what you're asking.

I read that you were asking about whether/how to consolidate two 
separate MySQL instances each serving a local spamd on the same machine, 
to a single MySQL instance serving both machines' spamd.



The current load on what I've defined above as machine B is quite
manageable, and this is the box that's now handling over 90% of traffic
to probably a couple of hundred mailboxes on the system.  The MySQL
tables used by SA are at well less than a gig on a box that has close to
half a TB of drive space on it, and SA has been running there for over a
year.  The system load avg runs consistently under 1 except when
cron-initiated maintenance happens.


Ah.  *hardware status == overkill*  *g*


Although I appreciate your advice, my question here is not _whether_ I
should do the integration, but which of the two methods of integrating
the databases will be most efficient of bandwidth and other resources.


*read* *reread again*  I'm getting confused again.  What components do 
you have running on which systems, and what are you trying to consolidate?


-kgd


Re: Two servers, one database. A question

2009-02-13 Thread Kris Deugau

Lindsay Haisley wrote:

On Fri, 2009-02-13 at 15:24 -0600, Lindsay Haisley wrote:

Although I appreciate your advice, my question here is not _whether_ I
should do the integration, but which of the two methods of integrating
the databases will be most efficient of bandwidth and other resources.


After thinking about it, Kris, I do think you're right about the choice,
although not for the reasons you gave.  spamc must pass an entire copy
of each email over the Internet to spamd on the 2nd box.  If I keep the
SA configurations synchronized between boxes, then the only thing which
needs to be shared across the Internet is Bayes processing, plus several
per-user choices as represented in the userpref table.  This _seems_ on
the face of it more efficient than passing off the entire email traffic,
which would have to transit the Internet connection between the boxes
twice.


*nod*  I don't know what kind of data size the Bayes SQL queries run, 
but it probably averages out somewhere close to an order of magnitude 
less than the full email.


I think I misread your original email, and I'm still not sure I 
understand exactly what your current configuration is, and what you're 
trying to achieve though.


-kgd


Re: Two servers, one database. A question

2009-02-13 Thread John Hardin

On Fri, 13 Feb 2009, Kris Deugau wrote:


 Although I appreciate your advice, my question here is not _whether_ I
 should do the integration, but which of the two methods of integrating
 the databases will be most efficient of bandwidth and other resources.


*read* *reread again*  I'm getting confused again.  What components do you 
have running on which systems, and what are you trying to consolidate?


If I may try:

The question is which is better, sending the message body (spamc - spamd 
traffic) or database queries (spamd - mysql traffic) over the expensive 
link?


--
 John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
 jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Insofar as the police deter by their presence, they are very, very
  good. Criminals take great pains not to commit a crime in front of
  them. -- Jeffrey Snyder
---
 9 days until George Washington's 277th Birthday


Re: Two servers, one database. A question

2009-02-13 Thread Lindsay Haisley
On Fri, 2009-02-13 at 14:27 -0800, John Hardin wrote:
 If I may try:
 
 The question is which is better, sending the message body (spamc - spamd 
 traffic) or database queries (spamd - mysql traffic) over the expensive 
 link?

Implicit point well made :-)  I think I agree with you.

-- 
Lindsay Haisley   | Everything works|Accredited
FMP Computer Services |   if you let it |  by the
512-259-1190  |(The Roadie)  |   Austin Better
http://www.fmp.com|  |  Business Bureau



Re: Two servers, one database. A question

2009-02-13 Thread Lindsay Haisley
On Fri, 2009-02-13 at 17:26 -0500, Kris Deugau wrote:
 *nod*  I don't know what kind of data size the Bayes SQL queries run, 
 but it probably averages out somewhere close to an order of magnitude 
 less than the full email.
 
 I think I misread your original email, and I'm still not sure I 
 understand exactly what your current configuration is, and what you're 
 trying to achieve though.

Currently I have two servers, A and B.  B is the older of the two and
currently hosts _most_ of the mail accounts.  They are functionally
identical boxes.

Currently _both_ are running spamd and _both_ have AWL/Bayes/userpref
database tables on MySQL which are accessed locally and identically by
the spamd instance on each box.

My objective is only to unify the database tables supporting Bayes and
user preferences so that there's only one set of MySQL tables for the
users on both boxes.  Whether this involves the use of two spamd daemons
or one is the question.

Scenario 1:  spamc on box A communicates _over the network_ with spamd
on box B, which uses its _local_ config and Bayes/userpref database to do
its work.

Scenario 2:  spamc on box A communicates with a _local_ spamd, which
accesses local config files but uses a MySQL connection _over the
network_ to box A to access the Bayes/userpref database.

Sorry if I wasn't entirely clear before.  I hope this clarifies the
choice, which looks at this point as if I'd be better off with #2.
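
For reference, the spamd-side settings for scenario 2 might look roughly like the fragment below. This is a sketch, not a tested configuration: the host name, database name, user name and password are placeholders, and the file it lives in (local.cf, or the secrets.cf mentioned earlier in the thread) depends on the local setup. All three DSNs point at the box that hosts MySQL.

```
# Bayes tokens in MySQL on the remote box
bayes_store_module        Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn             DBI:mysql:spamassassin:mysql-master.example.net:3306
bayes_sql_username        sa_user
bayes_sql_password        CHANGE_ME

# Per-user preferences (userpref table)
user_scores_dsn           DBI:mysql:spamassassin:mysql-master.example.net:3306
user_scores_sql_username  sa_user
user_scores_sql_password  CHANGE_ME

# Auto-whitelist (AWL)
auto_whitelist_factory    Mail::SpamAssassin::SQLBasedAddrList
user_awl_dsn              DBI:mysql:spamassassin:mysql-master.example.net:3306
user_awl_sql_username     sa_user
user_awl_sql_password     CHANGE_ME
```

With this in place on both boxes, only the database host differs from a purely local setup, which is what makes scenario 2 a configuration change rather than a change to how spamc is invoked.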

-- 
Lindsay Haisley   | Everything works|Accredited
FMP Computer Services |   if you let it |  by the
512-259-1190  |(The Roadie)  |   Austin Better
http://www.fmp.com|  |  Business Bureau



Re: Two servers, one database. A question - a correction.

2009-02-13 Thread Lindsay Haisley
On Fri, 2009-02-13 at 16:51 -0600, Lindsay Haisley wrote:
 Scenario 2:  spamc on box A communicates with a _local_ spamd, which
 accesses local config files but uses a MySQL connection _over the
 network_ to box A to access the Bayes/userpref database.

Sorry, this should read:

Scenario 2:  spamc on box A communicates with a _local_ spamd, which
accesses local config files but uses a MySQL connection _over the
network_ to box B to access the Bayes/userpref database.
-

My bad.

-- 
Lindsay Haisley   | Everything works|Accredited
FMP Computer Services |   if you let it |  by the
512-259-1190  |(The Roadie)  |   Austin Better
http://www.fmp.com|  |  Business Bureau



Re: Two servers, one database. A question

2009-02-13 Thread Kris Deugau

John Hardin wrote:

If I may try:

The question is which is better, sending the message body (spamc - 
spamd traffic) or database queries (spamd - mysql traffic) over the 
expensive link?


Yeah, after going back and forth I think I've finally got that.  *g*

I would bet on Bayes/userpref queries being more efficient than the 
spamc/spamd traffic.


-kgd


Re: Two servers, one database. A question

2009-02-13 Thread Lindsay Haisley
On Fri, 2009-02-13 at 18:11 -0500, Kris Deugau wrote:
 I would bet on Bayes/userpref queries being more efficient than the 
 spamc/spamd traffic.

I think we have a consensus here :-)  I didn't get any definitive
answers here but the folks who responded made me think about the problem
a little more intelligently.

Thanks!

-- 
Lindsay Haisley   | Everything works|Accredited
FMP Computer Services |   if you let it |  by the
512-259-1190  |(The Roadie)  |   Austin Better
http://www.fmp.com|  |  Business Bureau



Two servers, one database. A question

2009-02-12 Thread Lindsay Haisley
I have two servers.  Currently they're both running instances of spamd
with separate mysql databases, however I'd like to run both instances from
the same database on one of the servers. There are two ways to do this:

1.  I can give the -d option to spamc where it's invoked in the mail
system, with the target being spamd on the master spamassassin server
via the VPN that connects the two boxes.  spamd is already configured to
listen to it.

2.  I can let spamc invoke spamd on the local system but set the various
dsn params in secrets.cf to point to the MySQL database on the master
spamassassin server.  The mysql server on this box is already listening
for queries from the other system via the VPN that connects them.

Does anyone with some experience with spamassassin know which of these
two approaches would be better?  Which would be fastest?  Which would be
most conservative of bandwidth between the boxes?

-- 
Lindsay Haisley   | Everything works|Accredited
FMP Computer Services |   if you let it |  by the
512-259-1190  |(The Roadie)  |   Austin Better
http://www.fmp.com|  |  Business Bureau