Re: Two servers, one database. A question
On Sun, 2009-02-15 at 02:05 +0100, Karsten Bräckelmann wrote:
> Lindsay, if you end up doing some benchmarking, please let us know. I
> wouldn't be surprised if you're actually the first one to do this across
> the Internet. :)

Just a thought. Getting message sizes and counts for traffic between a client and server isn't easy unless they're already instrumented to collect this information, so the best approach may be two-pronged:

1) Write a Perl or awk script that processes /var/log/maillog.* and gathers message size statistics. The regex 'spamd.*bytes.$' will pick the relevant log lines, and the message size is the second-to-last field. It would count messages in size bands, e.g. 0-10KB, 10-100KB, 100KB-1MB, 1MB-250MB, >250MB, to get some size and frequency statistics.

2) Pick a message from each band and run it through spamc manually while using Wireshark to capture both spamc-spamd traffic and spamd-MySQL traffic.

Combining the message sizes and counts from the two streams should give you enough information to correctly size the traffic flows.

Question to developers on this list: Why is a message that exceeds the maximum size skipped entirely? Is there a case for passing its headers through spamd and then combining the returned headers with the body in spamc? It would give a bit more protection and doesn't look too difficult to do, since spamd is already capable of handling just the headers.

Martin
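A rough sketch of step 1, written in Python rather than the Perl or awk suggested above. It assumes the spamd result lines end in "<size> bytes.", so that the size is the second-to-last whitespace-separated field as described. The function names and the band boundaries in bytes (1KB taken as 1024 bytes) are choices made here for illustration only.

```python
import re
import sys
from collections import Counter

KB = 1024
MB = 1024 * KB

# Upper bounds (exclusive) for the size bands named in the message.
BANDS = [
    ("0-10KB", 10 * KB),
    ("10-100KB", 100 * KB),
    ("100KB-1MB", 1 * MB),
    ("1MB-250MB", 250 * MB),
]
LAST_BAND = ">250MB"

# The pattern suggested in the message: a spamd line ending in "bytes."
SPAMD_BYTES = re.compile(r"spamd.*bytes.$")


def band_for(size):
    """Return the label of the band that a message of `size` bytes falls in."""
    for label, upper in BANDS:
        if size < upper:
            return label
    return LAST_BAND


def size_histogram(lines):
    """Count matching log lines per size band."""
    counts = Counter()
    for line in lines:
        line = line.rstrip()
        if not SPAMD_BYTES.search(line):
            continue
        fields = line.split()
        try:
            size = int(fields[-2])  # second-to-last field is the size
        except (IndexError, ValueError):
            continue  # line matched but did not have the assumed layout
        counts[band_for(size)] += 1
    return counts


if __name__ == "__main__":
    totals = Counter()
    for path in sys.argv[1:]:  # e.g. /var/log/maillog.*
        with open(path, errors="replace") as log:
            totals.update(size_histogram(log))
    for label in [band[0] for band in BANDS] + [LAST_BAND]:
        print(f"{label:>10}  {totals[label]}")
```

Rotated logs that have been compressed would need to be decompressed first; the sketch reads plain text files only.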
Re: Two servers, one database. A question
On Sat, 2009-02-14 at 17:07 -0600, Michael Parker wrote:
> On Feb 14, 2009, at 3:47 PM, Lindsay Haisley wrote:
> > Well that's something to consider. I had hoped when I subscribed to
> > this list to ask this question that I'd find people, possibly SA
> > developers on it, who had benchmarked the options I presented for
> > decision and could give me some definitive answers based on this, but it
> > appears that this isn't the case. Instead I've found several people of
> > good will who don't seem to know a whole lot more about SA than I do,
> > but have given me some good points to think about.

Being a SA dev doesn't necessarily imply any need to use SQL based storage. Let alone scanning on an off-site server. :) I, for one, don't. So take it with a grain of salt.

> > Do you have any idea where I might inquire to get advice from people
> > with more precise knowledge?
>
> This is the best place. It's not a common setup, so I doubt that
> anyone really knows the correct answer.
>
> One data point I'll add is that spamc has a compress mode that might
> be useful (spamc -z). Also, it would take a little work on your end
> but you can also pass in --headers to further reduce the spamc/spamd
> traffic. Check out the spamc man page for more info.

Ah, good one -- I forgot about the -z option, otherwise I would have chipped in before. The headers option is something I was thinking about already. This basically reduces the traffic from two times the mail stream (as mentioned) to one times.

Regarding SQL traffic and Bayes -- tokenizing a message into unique tokens, then adding the SQL overhead. Would that really be less than the raw average message?

Another thing to keep in mind is latency, iff there are multiple queries involved. Versus the single round-trip of spamc.

On the other hand, there is manageability. A single spamd is easier than keeping two in sync. Probably not too challenging, though. ;)

To throw in another crack idea: What about consolidating the MXs? And then internally forwarding the already processed messages?

Lindsay, if you end up doing some benchmarking, please let us know. I wouldn't be surprised if you're actually the first one to do this across the Internet. :)

guenther
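The bandwidth and latency points in this message can be put into a back-of-the-envelope model. This is a sketch under simplifying assumptions, not a measurement: it ignores protocol overhead and -z compression, treats the spamd reply as either the full message or just its headers, and assumes each SQL query costs one full sequential round trip. All function and parameter names are invented here.

```python
def spamc_link_bytes(message_bytes, header_bytes, headers_only=False):
    """Approximate bytes crossing the link when spamc talks to a remote spamd.

    The request always carries the whole message. The reply carries the
    whole message again, or only the headers when --headers is used.
    """
    reply_bytes = header_bytes if headers_only else message_bytes
    return message_bytes + reply_bytes


def round_trip_wait_ms(round_trips, rtt_ms):
    """Time spent purely waiting on the network for sequential round trips."""
    return round_trips * rtt_ms
```

With made-up numbers, a 10,000-byte message with 1,000 bytes of headers gives 20,000 bytes on the link without --headers and 11,000 with it. The per-message SQL byte count and query count are the unknowns that only a capture of real traffic would supply.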
Re: Two servers, one database. A question
On Feb 14, 2009, at 3:47 PM, Lindsay Haisley wrote:
> On Sat, 2009-02-14 at 15:04 -0600, Bob Proulx wrote:
> > > I would bet on Bayes/userpref queries being more efficient than the
> > > spamc/spamd traffic.
> >
> > I like that you are asking the question. But I hate to guess at which
> > is better though. The weakest benchmark data point is better than the
> > strongest guess. Too often I have taken my best guess and been wrong.
> > In this case I would guess the opposite would be more efficient, that
> > the one spamc-spamd connection per message would be more efficient
> > than the many mysql queries per message, which is why I bring this up.
>
> Well that's something to consider. I had hoped when I subscribed to
> this list to ask this question that I'd find people, possibly SA
> developers on it, who had benchmarked the options I presented for
> decision and could give me some definitive answers based on this, but it
> appears that this isn't the case. Instead I've found several people of
> good will who don't seem to know a whole lot more about SA than I do,
> but have given me some good points to think about.
>
> Do you have any idea where I might inquire to get advice from people
> with more precise knowledge?

This is the best place. It's not a common setup, so I doubt that anyone really knows the correct answer.

One data point I'll add is that spamc has a compress mode that might be useful (spamc -z). Also, it would take a little work on your end but you can also pass in --headers to further reduce the spamc/spamd traffic. Check out the spamc man page for more info.

One other thing related to MySQL. I've never personally done it, but I'm certain there are ways you could use MySQL proxy or perhaps even federated tables to manage this sort of thing. MySQL proxy has lots of different functions; I'm sure compression is either one of them or at least something that can be easily bolted on.

Michael
Re: Two servers, one database. A question
On Sat, 2009-02-14 at 15:04 -0600, Bob Proulx wrote: > > I would bet on Bayes/userpref queries being more efficient than > the > > spamc/spamd traffic. > > I like that you are asking the question. But I hate to guess at which > is better though. The weakest benchmark data point is better than the > strongest guess. Too often I have taken my best guess and been wrong. > In this case I would guess the opposite would be more efficient, that > the one spamc-spamd connection per message would be more efficient > than the many mysql queries per message, which is why I bring this up. Well that's something to consider. I had hoped when I subscribed to this list to ask this question that I'd find people, possibly SA developers on it, who had benchmarked the options I presented for decision and could give me some definitive answers based on this, but it appears that this isn't the case. Instead I've found several people of good will who don't seem to know a whole lot more about SA than I do, but have given me some good points to think about. Do you have any idea where I might inquire to get advice from people with more precise knowledge? -- Lindsay Haisley | "Everything works|Accredited FMP Computer Services | if you let it" | by the 512-259-1190 |(The Roadie) | Austin Better http://www.fmp.com| | Business Bureau
Re: Two servers, one database. A question
Kris Deugau wrote: > John Hardin wrote: >> The question is which is better, sending the message body (spamc <-> >> spamd traffic) or database queries (spamd <-> mysql traffic) over the >> expensive link? > > I would bet on Bayes/userpref queries being more efficient than the > spamc/spamd traffic. I like that you are asking the question. But I hate to guess at which is better though. The weakest benchmark data point is better than the strongest guess. Too often I have taken my best guess and been wrong. In this case I would guess the opposite would be more efficient, that the one spamc-spamd connection per message would be more efficient than the many mysql queries per message, which is why I bring this up. Bob
Re: Two servers, one database. A question
On Fri, 2009-02-13 at 18:11 -0500, Kris Deugau wrote: > I would bet on Bayes/userpref queries being more efficient than the > spamc/spamd traffic. I think we have a consensus here :-) I didn't get any definitive answers here but the folks who responded made me think about the problem a little more intelligently. Thanks! -- Lindsay Haisley | "Everything works|Accredited FMP Computer Services | if you let it" | by the 512-259-1190 |(The Roadie) | Austin Better http://www.fmp.com| | Business Bureau
Re: Two servers, one database. A question
John Hardin wrote: If I may try: The question is which is better, sending the message body (spamc <-> spamd traffic) or database queries (spamd <-> mysql traffic) over the expensive link? Yeah, after going back and forth I think I've finally got that. I would bet on Bayes/userpref queries being more efficient than the spamc/spamd traffic. -kgd
Re: Two servers, one database. A question - a correction.
On Fri, 2009-02-13 at 16:51 -0600, Lindsay Haisley wrote: > Scenario 2: spamc on box A communicates with a _local_ spamd, which > accesses local config files but uses a MySQL connection _over the > network_ to box A to access the Bayes/userpref database. Sorry, this should read: Scenario 2: spamc on box A communicates with a _local_ spamd, which accesses local config files but uses a MySQL connection _over the network_ to box >>B<< to access the Bayes/userpref database. - My bad. -- Lindsay Haisley | "Everything works|Accredited FMP Computer Services | if you let it" | by the 512-259-1190 |(The Roadie) | Austin Better http://www.fmp.com| | Business Bureau
Re: Two servers, one database. A question
On Fri, 2009-02-13 at 17:26 -0500, Kris Deugau wrote:
> *nod* I don't know what kind of data size the Bayes SQL queries run,
> but it probably averages out somewhere close to an order of magnitude
> less than the full email.
>
> I think I misread your original email, and I'm still not sure I
> understand exactly what your current configuration is, and what you're
> trying to achieve though.

Currently I have two servers, A and B. B is the older of the two and currently hosts _most_ of the mail accounts. They are functionally identical boxes. Currently _both_ are running spamd and _both_ have AWL/Bayes/userpref database tables on MySQL which are accessed locally and identically by the spamd instance on each box.

My objective is only to unify the database tables supporting Bayes and user preferences so that there's only one set of MySQL tables for the users on both boxes. Whether this involves the use of two spamd daemons or one is the question.

Scenario 1: spamc on box A communicates _over the network_ with spamd on box B, which uses its _local_ config and Bayes/userpref database to do its work.

Scenario 2: spamc on box A communicates with a _local_ spamd, which accesses local config files but uses a MySQL connection _over the network_ to box A to access the Bayes/userpref database.

Sorry if I wasn't entirely clear before. I hope this clarifies the choice, which looks at this point as if I'd be better off with #2.

-- 
Lindsay Haisley       | "Everything works    |     Accredited
FMP Computer Services |       if you let it" |       by the
512-259-1190          |    (The Roadie)      |    Austin Better
http://www.fmp.com    |                      |   Business Bureau
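On the client side, the two scenarios differ only in how spamc is invoked. A small sketch, assuming the -d, -z and --headers options of spamc discussed elsewhere in this thread; the helper name and the host name `boxb.example` are hypothetical. In scenario 2 spamc runs without -d, and the difference lies in spamd's SQL configuration, which this sketch does not cover.

```python
def spamc_argv(remote_host=None, compress=False, headers_only=False):
    """Build a spamc command line for a local or remote spamd."""
    argv = ["spamc"]
    if remote_host is not None:
        argv += ["-d", remote_host]   # scenario 1: spamd on the other box
    if compress:
        argv.append("-z")             # compress the message sent to spamd
    if headers_only:
        argv.append("--headers")      # reply rewrites the headers only
    return argv


# Scenario 1: the message crosses the link, so trim the traffic where possible.
scenario_1 = spamc_argv(remote_host="boxb.example", compress=True, headers_only=True)

# Scenario 2: local spamd; the shared database is reached by spamd itself.
scenario_2 = spamc_argv()
```

The resulting list could be handed to `subprocess.run(argv, input=raw_message, capture_output=True)` from whatever glue invokes spamc, though in most mail setups the equivalent flags would simply be written into the existing spamc call.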
Re: Two servers, one database. A question
On Fri, 2009-02-13 at 14:27 -0800, John Hardin wrote:
> If I may try:
>
> The question is which is better, sending the message body (spamc <-> spamd
> traffic) or database queries (spamd <-> mysql traffic) over the expensive
> link?

Implicit point well made :-) I think I agree with you.

-- 
Lindsay Haisley       | "Everything works    |     Accredited
FMP Computer Services |       if you let it" |       by the
512-259-1190          |    (The Roadie)      |    Austin Better
http://www.fmp.com    |                      |   Business Bureau
Re: Two servers, one database. A question
On Fri, 13 Feb 2009, Kris Deugau wrote:
> > Although I appreciate your advice, my question here is not _whether_ I
> > should do the integration, but which of the two methods of integrating
> > the databases will be most efficient of bandwidth and other resources.
>
> I'm getting confused again. What components do you have running on which
> systems, and what are you trying to consolidate?

If I may try:

The question is which is better, sending the message body (spamc <-> spamd traffic) or database queries (spamd <-> mysql traffic) over the expensive link?

-- 
John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
Insofar as the police deter by their presence, they are very, very
good. Criminals take great pains not to commit a crime in front of
them. -- Jeffrey Snyder
---
9 days until George Washington's 277th Birthday
Re: Two servers, one database. A question
Lindsay Haisley wrote:
> On Fri, 2009-02-13 at 15:24 -0600, Lindsay Haisley wrote:
> > Although I appreciate your advice, my question here is not _whether_ I
> > should do the integration, but which of the two methods of integrating
> > the databases will be most efficient of bandwidth and other resources.
>
> After thinking about it, Kris, I do think you're right about the choice,
> although not for the reasons you gave. spamc must pass an entire copy of
> each email over the Internet to spamd on the 2nd box. If I keep the SA
> configurations synchronized between boxes, then the only thing which
> needs to be shared across the Internet is Bayes processing, plus several
> per-user choices as represented in the userpref table. This _seems_ on
> the face of it more efficient than passing off the entire email traffic,
> which would have to transit the Internet connection between the boxes
> twice.

*nod* I don't know what kind of data size the Bayes SQL queries run, but it probably averages out somewhere close to an order of magnitude less than the full email.

I think I misread your original email, and I'm still not sure I understand exactly what your current configuration is, and what you're trying to achieve though.

-kgd
Re: Two servers, one database. A question
Lindsay Haisley wrote:
> I think you misunderstand me. If spamc on machine A is invoked with -d
> then spamc will use whatever databases and configurations are in effect
> for spamd on machine B. This is what the -d option is for. The "actual
> processing" is done by spamd, whichever instance (machine A or B) is
> addressed by the spamc client, so I do have a choice here, and that's
> what I want to decide on. spamc is basically just a passive client which
> reads and writes emails and passes off the job of spam processing to
> spamd, wherever it may be. If spamc on machine B uses its local spamd
> instance (the same one machine A is using) as a server, then the task
> I'm trying to do is accomplished since both machines are ultimately
> using the same database.

Ah, I think I see what you're asking. I read that you were asking about whether/how to consolidate two separate MySQL instances each serving a local spamd on the same machine, to a single MySQL instance serving both machines' spamd.

> The current load on what I've defined above as "machine B" is quite
> manageable, and this is the box that's now handling over 90% of traffic
> to probably a couple of hundred mailboxes on the system. The MySQL
> tables used by SA are at well less than a gig on a box that has close to
> half a TB of drive space on it, and SA has been running there for over a
> year. The system load avg runs consistently under 1 except when
> cron-initiated maintenance happens.

Ah. "hardware status == overkill"

> Although I appreciate your advice, my question here is not _whether_ I
> should do the integration, but which of the two methods of integrating
> the databases will be most efficient of bandwidth and other resources.

I'm getting confused again. What components do you have running on which systems, and what are you trying to consolidate?

-kgd
Re: Two servers, one database. A question
On Fri, 2009-02-13 at 15:24 -0600, Lindsay Haisley wrote:
> Although I appreciate your advice, my question here is not _whether_ I
> should do the integration, but which of the two methods of integrating
> the databases will be most efficient of bandwidth and other resources.

After thinking about it, Kris, I do think you're right about the choice, although not for the reasons you gave. spamc must pass an entire copy of each email over the Internet to spamd on the 2nd box. If I keep the SA configurations synchronized between boxes, then the only thing which needs to be shared across the Internet is Bayes processing, plus several per-user choices as represented in the userpref table. This _seems_ on the face of it more efficient than passing off the entire email traffic, which would have to transit the Internet connection between the boxes twice.

-- 
Lindsay Haisley       | "Everything works    |     Accredited
FMP Computer Services |       if you let it" |       by the
512-259-1190          |    (The Roadie)      |    Austin Better
http://www.fmp.com    |                      |   Business Bureau
Re: Two servers, one database. A question
On Fri, 2009-02-13 at 15:21 -0500, Kris Deugau wrote:
> Lindsay Haisley wrote:
> > I have two servers. Currently they're both running instances of spamd
> > with separate mysql databases, however I'd like to run both instances from
> > the same database on one of the servers. There are two ways to do this:
> >
> > 1. I can give the -d option to spamc where it's invoked in the mail
> > system, with the target being spamd on the master spamassassin server
> > via the VPN that connects the two boxes. spamd is already configured to
> > listen to it.
>
> Mm, I don't think this does what you're hoping. spamd on any given
> system will use the configured database (local or otherwise) - this is
> **NOT** something the client can request.
>
> From man spamc:
>
>     -d host[,host2], --dest=host[,host2]
>         In TCP/IP mode, connect to spamd server on given host
>         (default: localhost). Several hosts can be specified
>         if separated by commas.
>
> This only affects which spamd server the client asks to process the
> message; it doesn't affect any aspect of the actual processing.

I think you misunderstand me. If spamc on machine A is invoked with -d then spamc will use whatever databases and configurations are in effect for spamd on machine B. This is what the -d option is for. The "actual processing" is done by spamd, whichever instance (machine A or B) is addressed by the spamc client, so I do have a choice here, and that's what I want to decide on. spamc is basically just a passive client which reads and writes emails and passes off the job of spam processing to spamd, wherever it may be. If spamc on machine B uses its local spamd instance (the same one machine A is using) as a server, then the task I'm trying to do is accomplished since both machines are ultimately using the same database.

> > Does anyone with some experience with spamassassin know which of these
> > two approaches would be better? Which would be fastest? Which would be
> > most conservative of bandwidth between the boxes?
>
> A lot depends on the hardware you're using. If you're trying to squeeze
> some last bits of performance out of a heavily-loaded system by
> eliminating the SQL duplication, you'll probably have to tune the spamd
> instances differently as well (eg, the system running MySQL won't be
> able to support as many spamd children as the other one). You haven't
> said what's in MySQL for SA; IME anything more than a couple of hundred
> users suck up too much IO for per-user Bayes and/or AWL (not to mention
> the staggering disk requirements - even at today's disk prices).

The current load on what I've defined above as "machine B" is quite manageable, and this is the box that's now handling over 90% of traffic to probably a couple of hundred mailboxes on the system. The MySQL tables used by SA are at well less than a gig on a box that has close to half a TB of drive space on it, and SA has been running there for over a year. The system load avg runs consistently under 1 except when cron-initiated maintenance happens.

> The cluster I'm doing most of my SA tuning on these days currently has 3
> machines running spamd, and a fourth running MySQL (and some other
> unrelated services, otherwise it would run spamd as well). Each machine
> has the same SA config pointing to the same database on that fourth
> machine - but clients don't see this, and can't affect it.
>
> If the machines are not on the same local Ethernet segment, you're
> probably better off leaving well enough alone, because any gains you
> make in eliminating the SQL duplication will be lost waiting for data to
> move across the network. Or worse.

My intention here is to optimize administration, both for migration and for those parts of SA for which I've programmed customer UIs. Considering the number of checks involved in email by the MTA, what with top level RBL checking (done by the MTA) and hitting SA twice, I don't think waiting for one more transaction will be problematic.

Although I appreciate your advice, my question here is not _whether_ I should do the integration, but which of the two methods of integrating the databases will be most efficient of bandwidth and other resources.

-- 
Lindsay Haisley       | "Everything works    |     Accredited
FMP Computer Services |       if you let it" |       by the
512-259-1190          |    (The Roadie)      |    Austin Better
http://www.fmp.com    |                      |   Business Bureau
Re: Two servers, one database. A question
Lindsay Haisley wrote: I have two servers. Currently they're both running instances of spamd with separate mysql databases, however I'd like run both instances from the same database on one of the servers. There are two ways to do this: 1. I can give the -d option to spamc where it's invoked in the mail system, with the target being spamd on the master spamassassin server via the VPN that connects the two boxes. spamd is already configured to listen to it. Mm, I don't think this does what you're hoping. spamd on any given system will use the configured database (local or otherwise) - this is **NOT** something the client can request. From man spamc: -d host[,host2], --dest=host[,host2] In TCP/IP mode, connect to spamd server on given host (default: localhost). Several hosts can be specified if separated by commas. This only affects which spamd server the client asks to process the message; it doesn't affect any aspect of the actual processing. 2. I can let spamc invoke spamd on the local system but set the various dsn params in secrets.cf to point to the MySQL database on the master spamassassin server. The mysql server on this box is already listening for queries from the other system via the VPN that connects them. If all you're looking to do is use a single MySQL instance, then this is your only choice. Does anyone with some experience with spamassassin know which of these two approaches would be better? Which would be fastest? Which would be most conservative of bandwidth between the boxes? A lot depends on the hardware you're using. If you're trying to squeeze some last bits of performance out of a heavily-loaded system by eliminating the SQL duplication, you'll probably have to tune the spamd instances differently as well (eg, the system running MySQL won't be able to support as many spamd children as the other one). 
You haven't said what's in MySQL for SA; IME anything more than a couple of hundred users suck up too much IO for per-user Bayes and/or AWL (not to mention the staggering disk requirements - even at today's disk prices). The cluster I'm doing most of my SA tuning on these days currently has 3 machines running spamd, and a fourth running MySQL (and some other unrelated services, otherwise it would run spamd as well). Each machine has the same SA config pointing to the same database on that fourth machine - but clients don't see this, and can't affect it. If the machines are not on the same local Ethernet segment, you're probably better off leaving well enough alone, because any gains you make in eliminating the SQL duplication will be lost waiting for data to move across the network. Or worse. -kgd
Re: Two servers, one database. A question
On Thu, 12 Feb 2009, Lindsay Haisley wrote: > I have two servers. Currently they're both running instances of spamd > with separate mysql databases, however I'd like run both instances from > the same database on one of the servers. There are two ways to do this: > > 1. I can give the -d option to spamc where it's invoked in the mail > system, with the target being spamd on the master spamassassin server > via the VPN that connects the two boxes. spamd is already configured to > listen to it. I'd prefer the above for the following reason: you only need to worry about a single spamassassin server (as long as it can hold up to the load). You prevent inconsistencies when upgrading etc. > > 2. I can let spamc invoke spamd on the local system but set the various > dsn params in secrets.cf to point to the MySQL database on the master > spamassassin server. The mysql server on this box is already listening > for queries from the other system via the VPN that connects them. > > Does anyone with some experience with spamassassin know which of these > two approaches would be better? Which would be fastest? Which would be > most conservative of bandwidth between the boxes? 'Fastest' depends on the load on the servers. Bandwidth will depend on how large your average message is, and what you store in the database (user prefs, awl, bayes...) -andre > > -- > Lindsay Haisley | "Everything works|Accredited > FMP Computer Services | if you let it" | by the > 512-259-1190 |(The Roadie) | Austin Better > http://www.fmp.com| | Business Bureau >