Re: Bayes and MySQL - does it actually work?
On Dec 23, 2011, at 8:15 AM, Robert Schetterer wrote:
> On 23.12.2011 02:45, Marc Perkel wrote:
> This is handling ~250K messages/day, although with some tweaks to serialize mail delivery a little more to level off the extreme peaks in messages/second it should probably be able to handle a lot more volume.
>
> We also have several SA instances - on the inbound side, the first pass has ~25 of the top-scoring only-hits-spam rules (mostly DNSBLs) to skim off the junk that would usually score 15+ on a full ruleset. Anything that gets past that is then passed to a full SA instance with a long list of local rules targeted at the ones reported as missed spam by customers. That first pass tags more than 80% of the junk for far less processing cost than feeding it all through the full ruleset.

We are processing 300k+ mails per day (peaks up to 1M/day) with 3 mail servers + 1 dedicated MySQL server replicated to one old server and, so far, we haven't seen any performance degradation from using Bayes with the MySQL InnoDB engine. The mail servers are dual-socket Xeon servers with 8G RAM, while the MySQL server is a dual-socket Xeon with 48G RAM, but SA Bayes is not the most used database on that server. We are using amavisd-new instead of spamd.

However, we did see some degradation when we moved to the new MySQL server, and some tweaking helped:

- correctly sizing the InnoDB engine
- optimizing MySQL buffer sizes
- disabling the RAID battery autolearn period
- optimizing the I/O scheduler
- optimizing kernel network settings
- optimizing the kernel swappiness level
- using Mail::SpamAssassin::BayesStore::MySQL instead of Mail::SpamAssassin::BayesStore::SQL
- manually pruning auto-whitelist data and bayes data

Currently our MySQL bayes data has over 2M tokens in place and we don't see any performance impact on SpamAssassin. Our backups run against the replicated database, so there is no performance impact on our primary MySQL server.

I don't have any numbers to compare MySQL and PostgreSQL, but I believe that newer versions of MySQL and its derivatives (Percona Server etc.) have improved quite a lot compared to older ones.

regards,
Jernej
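The two-pass arrangement described in the quoted text above (a cheap first pass with a handful of high-scoring rules, then the full ruleset only for what survives) can be sketched roughly as follows. This is an illustration in Python, not SpamAssassin configuration: the function name, the rule representation, and both thresholds are assumptions made for the example.

```python
def two_pass_scan(message, cheap_rules, full_rules,
                  cheap_threshold=15.0, spam_threshold=5.0):
    """Run a cheap first pass; only run the full ruleset if it does not trip.

    Rules are (predicate, score) pairs. Both thresholds are illustrative
    assumptions, not values taken from any real installation.
    """
    cheap_score = sum(score for rule, score in cheap_rules if rule(message))
    if cheap_score >= cheap_threshold:
        # Obvious junk: tag it here and skip the expensive pass entirely.
        return ("spam", cheap_score, "first-pass")

    full_score = sum(score for rule, score in full_rules if rule(message))
    verdict = "spam" if full_score >= spam_threshold else "ham"
    return (verdict, full_score, "full")


# Hypothetical rules: one DNSBL-style check, one body check.
cheap_rules = [(lambda m: m.get("on_dnsbl", False), 20.0)]
full_rules = [(lambda m: "cheap pills" in m.get("body", ""), 6.0)]

print(two_pass_scan({"on_dnsbl": True, "body": "hi"}, cheap_rules, full_rules))
print(two_pass_scan({"body": "cheap pills here"}, cheap_rules, full_rules))
```

The saving comes from the early return: messages that trip the first pass never reach the full ruleset.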
Re: Bayes and MySQL - does it actually work?
On Wed, Dec 21, 2011 at 01:10:27PM -0500, Kris Deugau wrote:
> Marc Perkel wrote:
>> I've been trying for a long time to get bayes/mysql to actually work. Running a dedicated server with MySQL. Several servers running SA configured to talk to it. I'm running big servers with lots of ram and raid 0 flash drives for speed. Also using InnoDB. I'm beginning to wonder if it is ever going to work and if someone is going to fix it?
>
> I'm not sure what official testing has been done, but some testing I did about a year ago when upgrading the SA cluster here showed pretty much the same IO load for a global Bayes no matter what combination of MyISAM, InnoDB, generic SQL, or MySQL-specific SA modules I used. Enabling MySQL replication also bogged things down pretty badly.
>
> Performance with the database on physical disks simply wasn't keeping up with more than about double the average message rate (if that...), so I fell back to the "good enough" setup of putting the SA database on a RAMdisk, and tweaking the MySQL init script to reload the database on startup. A database dump is done once a day, about a half-hour after a Bayes expiry run.
>
> This is handling ~250K messages/day, although with some tweaks to serialize mail delivery a little more to level off the extreme peaks in messages/second it should probably be able to handle a lot more volume.

I guess it still boils down to basics. No matter what the database server is used for, the same principles apply. If you have slooow disks, then things are going to be slow.

Ideally you should compile the newest MySQL by hand. Older versions don't use the new, faster InnoDB Plugin codebase.

Disk / fsync() is almost always the bottleneck. If you don't have critical stuff in the same database, look at all the relevant options (innodb_flush_log_at_trx_commit=0, sync_binlog=0 etc). You could even run a separate instance for SA only, with all the fastest options. Similar options probably exist for replication (speed vs reliability), but I have no experience with that.

You can also tune the default schema. Drop the atime index; it's pointless when using manual expiry. If you have a simple global bayes, change the id column to tinyint; it will cut your database size in half. I've also changed spam_count and ham_count to smallint, since I don't have that much traffic.

Since these issues pop up here every now and then, I guess SA needs its own tutorial/howto for MySQL tuning..
Re: Bayes and MySQL - does it actually work?
On Fri, 23 Dec 2011 13:29:00 +0200, Henrik K wrote:
> Since these issues pop up here every now and then, I guess SA needs own tutorial/howto for MySQL tuning..

Googling mysqltuner was a help for me, even though I don't have much traffic here:

http://www.google.dk/search?aq=f&sourceid=chrome&ie=UTF-8&q=mysqltuner

Can SA use MySQL Cluster, btw? Spread InnoDB over several MySQL Cluster databases, where the cluster itself handles the sync?

Merry Xmas, btw
Re: dccproc/dccifd error
On 12/22/11 9:44 PM, dar...@chaosreigns.com wrote:
> On 12/22, dar...@chaosreigns.com wrote:
> The author did say "I believe it is entirely upward compatible." in November, which was well after the DCC 1.3.140 release, so it probably works.
>
> I'd be interested to hear how that works if you try it. Might be worth posting the results to that bug.

Found the issue; it is twofold.

#1, the upstream email provider is adding X-DCC-Metrics headers (but they are disconnected from the global DCC network).

#2, bug.. yep, bug. Vernon (author of DCC) will investigate and fix it, and update the SA Bugzilla soon.

(So, yes, this would be a bug in 3.4 if released, but it only shows up under one certain condition.)

--
Michael Scheidell, CTO
SECNAP Network Security Corporation
Re: Bayes and MySQL - does it actually work?
On 23/12/11 11:29, Henrik K wrote:
>> Performance with the database on physical disks simply wasn't keeping up with more than about double the average message rate (if that...), so I fell back to the "good enough" setup of putting the SA database on a RAMdisk,
>
> I guess it still boils down to basics. No matter what the database server is used for, same principles apply. If you have slooow disks, then things are going to be slow.

As I understand it, if the MySQL query cache is tuned appropriately, then most of the queries should not be touching disk anyway?

--
Mike Cardwell
https://grepular.com/ https://twitter.com/mickeyc
Professional http://cardwellit.com/ http://linkedin.com/in/mikecardwell
PGP.mit.edu 0018461F/35BC AF1D 3AA2 1F84 3DC3 B0CF 70A5 F512 0018 461F
Re: Bayes and MySQL - does it actually work?
On Fri, Dec 23, 2011 at 02:10:16PM +, spamassas...@lists.grepular.com wrote:
> On 23/12/11 11:29, Henrik K wrote:
>>> Performance with the database on physical disks simply wasn't keeping up with more than about double the average message rate (if that...), so I fell back to the "good enough" setup of putting the SA database on a RAMdisk,
>>
>> I guess it still boils down to basics. No matter what the database server is used for, same principles apply. If you have slooow disks, then things are going to be slow.
>
> As I understand it, if the MySQL query cache is tuned appropriately, then most of the queries should not be touching disk anyway?

Enabling the query cache will probably (marginally) slow things down. Bayes queries are extremely random, so there's nothing to cache. Any write to the table will invalidate the caches anyway. And those writes happen every time a token is read (atime is updated).
Re: Bayes and MySQL - does it actually work?
I don't believe any kind of SQL database is the best choice for Bayes (which involves simple keyed lookups).

We use Dan Bernstein's cdb file format with great success. Each user has his or her own CDB file, and there is also a sitewide file containing 5.7 million tokens.

The CDB software uses mmap() to map the CDB file into memory. As long as your server has lots of memory, the OS's memory management system keeps heavily-used CDB files in memory... no arcane tuning required. [Actually, this is the key for any kind of fast Bayes lookup: Build a server with huge gobs of memory. :)]

I realize SpamAssassin does not use CDB files for Bayes. But if the developers are looking for a new back-end, I highly recommend CDB for its excellent performance.

The only downside to CDB is that incremental updates are not possible. To train, you need to rebuild the entire CDB file. For us, that's an acceptable tradeoff, but YMMV.

Regards,
David.
Re: Bayes and MySQL - does it actually work?
On 23/12/11 14:20, Henrik K wrote:
>> As I understand it, if the MySQL query cache is tuned appropriately, then most of the queries should not be touching disk anyway?
>
> Enabling query cache will probably (marginally) slow things down. Bayes queries are extremely random, so there's nothing to cache. Any write to the table will invalidate caches anyway. And those writes happen every time a token is read (atime is updated).

To stop the query cache being invalidated, it would probably be better if the writes were queued and then done in batches. Can SpamAssassin handle this sort of queue internally, or would some sort of additional technology be required?

I don't know what the point of the atime data is, but is there any need to update the atime on every read? Could that write be skipped if the atime is already within a certain period of time? I.e., if the atime has already been updated in the last 5 minutes, is there any point in doing it again?

--
Mike Cardwell
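The "skip the write if the atime is recent" idea raised above can be sketched in a few lines of Python. This is only an in-memory model of the proposal, not SpamAssassin code: the dict stands in for the token table, and the function name and the 300-second default are assumptions for illustration.

```python
def touch_tokens(token_atimes, seen_tokens, now, min_interval=300):
    """Update atime only for known tokens whose stored atime is stale.

    token_atimes: dict mapping token -> last access time (epoch seconds);
                  a stand-in for the real token table.
    min_interval: seconds that must pass before atime is written again
                  (300 mirrors the "last 5 minutes" example; an assumption).
    Returns the number of writes actually performed.
    """
    stale = [t for t in seen_tokens
             if t in token_atimes and now - token_atimes[t] >= min_interval]
    for token in stale:
        token_atimes[token] = now
    return len(stale)


atimes = {"a": 1000, "b": 1200}
writes = touch_tokens(atimes, ["a", "b", "c"], now=1400)
print(writes, atimes)
```

In this model a token read many times within the interval costs one write instead of one per read, which is the saving the question is after.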
Re: Bayes and MySQL - does it actually work?
On 23/12/11 14:25, David F. Skoll wrote:
> The only downside to CDB is that incremental updates are not possible. To train, you need to rebuild the entire CDB file. For us, that's an acceptable tradeoff, but YMMV.

Another major downside to this approach, compared to using MySQL, is that it doesn't allow you to access the same bayes db from multiple machines at the same time. Unless I'm mistaken..?

--
Mike Cardwell
Re: Bayes and MySQL - does it actually work?
On Fri, 23 Dec 2011 15:05:42 + spamassas...@lists.grepular.com wrote:
> Another major downside to this approach compared to using MySQL, is that it doesn't allow you to access the same bayes db from multiple machines at the same time. Unless I'm mistaken..?

You're correct. We rsync the CDB files around to our scanners. In this way, your available disk bandwidth scales up with the number of scanners.

For setups with a large number of scanners where the rsyncs get annoying, we have an experimental Bayes server that takes a token list and returns the probability. It works pretty well.

Regards,
David.
Re: Bayes and MySQL - does it actually work?
On Fri, Dec 23, 2011 at 03:03:09PM +, spamassas...@lists.grepular.com wrote:
> On 23/12/11 14:20, Henrik K wrote:
>>> As I understand it, if the MySQL query cache is tuned appropriately, then most of the queries should not be touching disk anyway?
>>
>> Enabling query cache will probably (marginally) slow things down. Bayes queries are extremely random, so there's nothing to cache. Any write to the table will invalidate caches anyway. And those writes happen every time a token is read (atime is updated).
>
> To stop the query cache being invalidated, it would probably be better if the writes were queued and then done in batches. Can SpamAssassin handle this sort of queue internally, or would some sort of additional technology be required?

You need to consider that tokens are looked up in batches of 50 or so (token in ('token1','token2','token3'...)). Since MySQL caches/hashes the query _exactly_ as written, it's unlikely you'll ever get two identical SQL clauses.

> I don't know what the point of the atime data is, but is there any need to update the atime on every read? Could that write be skipped if the atime is already within a certain period of time? Ie, if the atime has already been updated in the last 5 minutes, is there any point in doing it again?

That's a question worth entering into Bugzilla. I doubt it would even make a difference if the time frame were 1 day. After all, the only point of atime is to expire very old unused tokens. Would be fun to benchmark if I had time.
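To make the argument above concrete, here is a toy Python model of a cache that keys results on the exact query text and drops everything whenever the table is written. It is a simplification for illustration, not MySQL's implementation; the class, the vocabulary size, and the query text are assumptions, and only the batch size of 50 comes from the message.

```python
import random


class ExactTextQueryCache:
    """Toy cache: keyed by the exact SQL text; any table write empties it."""

    def __init__(self):
        self.entries = set()
        self.hits = 0
        self.misses = 0

    def select(self, sql):
        if sql in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries.add(sql)

    def write(self):
        # A write to the table invalidates every cached result for it.
        self.entries.clear()


def simulate(messages, vocab_size=100_000, batch_size=50,
             update_atime=True, seed=0):
    """Issue one random token-batch lookup per message; return (hits, misses)."""
    rng = random.Random(seed)
    cache = ExactTextQueryCache()
    for _ in range(messages):
        tokens = rng.sample(range(vocab_size), batch_size)
        sql = "SELECT ... WHERE token IN (%s)" % ",".join(map(str, tokens))
        cache.select(sql)
        if update_atime:
            cache.write()  # the atime update that follows every read
    return cache.hits, cache.misses


print(simulate(1000))  # → (0, 1000)
```

With an atime write after every lookup the cache is always empty when the next query arrives, so it can never hit. Even with `update_atime=False`, a hit would need two messages to produce the same 50 tokens in the same order.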
Re: Bayes and MySQL - does it actually work?
On Fri, 23 Dec 2011, spamassas...@lists.grepular.com wrote:
> On 23/12/11 14:25, David F. Skoll wrote:
>> The only downside to CDB is that incremental updates are not possible. To train, you need to rebuild the entire CDB file. For us, that's an acceptable tradeoff, but YMMV.
>
> Another major downside to this approach compared to using MySQL, is that it doesn't allow you to access the same bayes db from multiple machines at the same time. Unless I'm mistaken..?

Each machine would have its own copy of the latest database. Learning would be to a master that is not being read, and that master would be periodically distributed to SA hosts.

--
John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
"Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never does quite what I want. I wish Christopher Robin was here."
    -- Peter da Silva in a.s.r
---
2 days until Christmas
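The "learn on a master, then hand out read-only copies" pattern described above depends on readers never seeing a half-written file. A minimal Python sketch of that publish step follows. It uses a JSON file as a stand-in for a real CDB file, and the function names are hypothetical; the part being illustrated is writing to a temporary file and renaming it into place.

```python
import json
import os
import tempfile


def publish_snapshot(tokens, path):
    """Rebuild the whole database file and swap it into place atomically.

    JSON stands in for the real on-disk format (an assumption for this
    sketch). Readers opening `path` see the old or the new file, never a
    partial one, because os.replace() is atomic within one filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".bayes-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(tokens, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_snapshot(path):
    """Read the current snapshot, as a scanner host would."""
    with open(path) as handle:
        return json.load(handle)
```

A scanner host that receives the file by rsync gets the same guarantee only if the transfer also writes to a temporary name and renames, which is rsync's default behaviour when `--inplace` is not used.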
Re: dccproc/dccifd error
On 12/23, Michael Scheidell wrote:
> #2, bug.. yep, bug. Vernon (author of DCC) will investigate and fix it, and update the SA BUGzilla soon.
>
> (so, yes, this would be a bug in 3.4 if released, but only shows up under one certain condition)

Please post the bug to https://issues.apache.org/SpamAssassin/ so we can keep track of it, and make sure 3.4.0 doesn't get released with it.

--
"Life is either a daring adventure or it is nothing at all." - Helen Keller
http://www.ChaosReigns.com
Re: dccproc/dccifd error
I am going to update the original bug with the patch. I'll have Mark look at it first.

--
Michael Scheidell, CTO
SECNAP Network Security

-----Original message-----
From: dar...@chaosreigns.com
To: Michael Scheidell <michael.scheid...@secnap.com>
Cc: users@spamassassin.apache.org
Sent: Fri, Dec 23, 2011 17:28:28 GMT+00:00
Subject: Re: dccproc/dccifd error

> On 12/23, Michael Scheidell wrote:
>> #2, bug.. yep, bug. Vernon (author of DCC) will investigate and fix it, and update the SA BUGzilla soon.
>>
>> (so, yes, this would be a bug in 3.4 if released, but only shows up under one certain condition)
>
> Please post the bug to https://issues.apache.org/SpamAssassin/ so we can keep track of it, and make sure 3.4.0 doesn't get released with it.
Am i sending spam?
http://pastebin.com/78gUdaCj

--
Regards
Lars Ebeling
Rentier
http://leopg9.no-ip.org
"I am not young enough to know everything." -- Oscar Wilde
Re: Am i sending spam?
On Fri, 23 Dec 2011 22:10:22 +0100 Lars Ebeling <lars.ebel...@leopg9.no-ip.org> wrote:
> http://pastebin.com/78gUdaCj

You are not sending spam. Someone on the machine SR1S4.mesa.gmu.edu [129.174.112.124] connected to your machine and said:

    HELO leopg9.no-ip.org

In other words, the HELO domain was faked. We automatically block mail from anyone who HELOs as our machine (unless it really *is* from our machine, of course!)

Regards,
David.
Re: Am i sending spam?
On Fri, 23 Dec 2011, David F. Skoll wrote:
> On Fri, 23 Dec 2011 22:10:22 +0100 Lars Ebeling <lars.ebel...@leopg9.no-ip.org> wrote:
>> http://pastebin.com/78gUdaCj
>
> You are not sending spam. Someone on the machine SR1S4.mesa.gmu.edu [129.174.112.124] connected to your machine and said:
>
>     HELO leopg9.no-ip.org
>
> In other words, the HELO domain was faked. We automatically block mail from anyone who HELOs as our machine (unless it really *is* from our machine, of course!)

Not to mention the fact that the IP address is listed in cbl.abuseat.org as a malware source, and that message.bat attachment looks -very- suspicious.

Do you have any kind of AV running in your mail system? The original of that message gets identified as "Worm.Mydoom.M FOUND" by ClamAV. We run ClamAV as an input milter filter ahead of SpamAssassin; no sense wasting time/cycles on known viri. ;)

--
Dave Funk                                  University of Iowa
dbfunk (at) engineering.uiowa.edu          College of Engineering
319/335-5751 FAX: 319/384-0549             1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
Re: Am i sending spam?
On Fri, 23 Dec 2011, David B Funk wrote:
> On Fri, 23 Dec 2011, David F. Skoll wrote:
>> You are not sending spam. Someone on the machine SR1S4.mesa.gmu.edu [129.174.112.124] connected to your machine and said:
>>
>>     HELO leopg9.no-ip.org
>>
>> In other words, the HELO domain was faked. We automatically block mail from anyone who HELOs as our machine (unless it really *is* from our machine, of course!)
>
> Not to mention the fact that IP addr is listed in cbl.abuseat.org as a malware source and that message.bat attachment looks -very- suspicious.
>
> Do you have any kind of AV running in your mail system? The original of that message gets identified as "Worm.Mydoom.M FOUND" by ClamAV. We run ClamAV as an input milter filter ahead of spamassasin, no sense wasting time/cycles on known viri. ;)

One additional odd thing about that message: that IP address ([129.174.112.124]) is listed in multiple DNSBLs (e.g. cbl.abuseat.org, zen.spamhaus) but gets a whitelist rating from hostkarma.junkemailfilter.com. So if I were to actually believe hostkarma, I wouldn't have filtered that message at all. ;(

Does anybody actually believe hostkarma's whitelist ratings? I've seen lots of blatant spammers get whitelisted. I used to report them to Marc, but gave up when, after I reported a whitelisted malware/phish message, he replied 'looks ok to me'.

--
Dave Funk
Re: Am i sending spam?
----- Original Message -----
From: David F. Skoll <d...@roaringpenguin.com>
To: users@spamassassin.apache.org
Sent: Friday, December 23, 2011 10:14 PM
Subject: Re: Am i sending spam?

> You are not sending spam. Someone on the machine SR1S4.mesa.gmu.edu [129.174.112.124] connected to your machine and said:
>
>     HELO leopg9.no-ip.org
>
> In other words, the HELO domain was faked. We automatically block mail from anyone who HELOs as our machine (unless it really *is* from our machine, of course!)

How do you do that?

> Regards,
> David.
Re: Am i sending spam?
On Fri, 23 Dec 2011 23:13:43 +0100 Lars Ebeling <lars.ebel...@leopg9.no-ip.org> wrote:
>> We automatically block mail from anyone who HELOs as our machine (unless it really *is* from our machine, of course!)
>
> how do you do that?

We use MIMEDefang, which lets you code tests like that in Perl. (So this is done outside of SpamAssassin, but you may be able to hack a SpamAssassin rule to do it too.)

Regards,
David.
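The logic of such a test is small. Below is a minimal sketch in Python rather than Perl, and it does not use MIMEDefang's API: the function name, its parameters, and the example network are all hypothetical. It shows only the decision being described, which is to reject a client that announces one of our own names unless it connects from one of our own addresses.

```python
import ipaddress


def helo_is_forged(helo, client_ip, our_names, our_networks):
    """Return True when an outside client HELOs with one of our own names.

    our_names:    hostnames/domains that belong to us (assumed input).
    our_networks: CIDR strings for machines allowed to use those names.
    """
    ip = ipaddress.ip_address(client_ip)
    if any(ip in ipaddress.ip_network(net) for net in our_networks):
        return False  # it really *is* one of our machines
    normalized = helo.strip().rstrip(".").lower()
    return normalized in {name.lower() for name in our_names}


# The case from this thread: an outside host announcing the recipient's name.
# 192.0.2.0/24 is a documentation range used here as a placeholder.
print(helo_is_forged("leopg9.no-ip.org", "129.174.112.124",
                     {"leopg9.no-ip.org"}, ["192.0.2.0/24"]))  # → True
```

In a real filter the result would be turned into an SMTP rejection at HELO or RCPT time, which depends on the MTA or milter in use.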
Re: Am i sending spam?
On 12/23/2011 4:23 PM, David F. Skoll wrote:
> On Fri, 23 Dec 2011 23:13:43 +0100 Lars Ebeling <lars.ebel...@leopg9.no-ip.org> wrote:
>>> We automatically block mail from anyone who HELOs as our machine (unless it really *is* from our machine, of course!)
>>
>> how do you do that?
>
> We use MIMEDefang which lets you code tests like that in Perl. (So this is done outside of SpamAssassin, but you may be able to hack a SpamAssassin rule to do it too.)
>
> Regards, David.

In Exim, I do the following:

    # kill off the folks that use OUR ip's in HELO Nice and Early.
    drop message = Forged IP detected in HELO: $sender_helo_name
         hosts = !+relay_from_hosts
         !authenticated = *
         condition = ${if \
             eq{$sender_helo_name}{$interface_address}{yes}{no}}

    # Forged hostname - HELOs as my own hostname or domain (early as well)
    drop message = Forged hostname detected in HELO: $sender_helo_name
         hosts = !+relay_from_hosts
         !authenticated = *
         condition = ${lookup {$sender_helo_name} \
             lsearch{/usr/local/etc/exim/checkfiles/our_host_names}{yes}{no}}

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: l...@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893
Re: Am i sending spam?
On Fri, 23 Dec 2011, David F. Skoll wrote:
> On Fri, 23 Dec 2011 23:13:43 +0100 Lars Ebeling <lars.ebel...@leopg9.no-ip.org> wrote:
>>> We automatically block mail from anyone who HELOs as our machine (unless it really *is* from our machine, of course!)
>>
>> how do you do that?
>
> We use MIMEDefang which lets you code tests like that in Perl. (So this is done outside of SpamAssassin, but you may be able to hack a SpamAssassin rule to do it too.)

Ideally this sort of check should be done at the incoming MTA (MX) level, before the message ever gets handed to SA. Right up front, do your HELO, DNS, and DNSBL checks on the opening connection and reject right there. Why let spam in the front door if you know you're going to reject it later?

Thus these sorts of tests are MTA-specific. You need to know what your MTA is and check the appropriate FAQs, lists, and config resources for your MTA.

--
Dave Funk
Re: Am i sending spam?
* Lars Ebeling <lars.ebel...@leopg9.no-ip.org>:
>> You are not sending spam. Someone on the machine SR1S4.mesa.gmu.edu [129.174.112.124] connected to your machine and said:
>>
>>     HELO leopg9.no-ip.org
>>
>> In other words, the HELO domain was faked. We automatically block mail from anyone who HELOs as our machine (unless it really *is* from our machine, of course!)
>
> how do you do that?

In Postfix:

    smtpd_recipient_restrictions =
        ...
        permit_mynetworks
        reject_unauth_destination
        ...
        check_helo_access pcre:/etc/postfix/helo.chk
        ...

    # /etc/postfix/helo.chk
    /^mail\.state-of-mind\.de$/      550 hostname abuse: mail.state-of-mind.de
    /^state-of-mind\.de$/            550 domainname abuse: state-of-mind.de
    /^194\.126\.158\.24$/            550 IP address abuse: 194.126.158.24
    /^\[194\.126\.158\.24\]$/        550 IP address abuse: [194.126.158.24]
    /^[0-9.]+$/                      550 RFC 2821 compliance error

HTH,
p@rick

--
state of mind ()
http://www.state-of-mind.de
Franziskanerstraße 15        Phone +49 89 3090 4664
81669 München                Fax +49 89 3090 4666
Amtsgericht München          Partnerschaftsregister PR 563
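To see which HELO strings the table above would reject, the same patterns can be tried out in Python. This is only a rough emulation for experimenting with the regular expressions: it assumes first-match-wins evaluation and case-insensitive matching (which, as far as I know, is the Postfix pcre table default), and `check_helo` is a hypothetical helper, not a Postfix facility.

```python
import re

# Patterns and actions copied from the helo.chk example above.
HELO_RULES = [
    (r"^mail\.state-of-mind\.de$", "550 hostname abuse: mail.state-of-mind.de"),
    (r"^state-of-mind\.de$", "550 domainname abuse: state-of-mind.de"),
    (r"^194\.126\.158\.24$", "550 IP address abuse: 194.126.158.24"),
    (r"^\[194\.126\.158\.24\]$", "550 IP address abuse: [194.126.158.24]"),
    (r"^[0-9.]+$", "550 RFC 2821 compliance error"),
]


def check_helo(helo):
    """Return the action of the first matching rule, or None if none match."""
    for pattern, action in HELO_RULES:
        if re.search(pattern, helo, re.IGNORECASE):
            return action
    return None


for name in ["mail.state-of-mind.de", "10.1.2.3", "[10.1.2.3]", "mail.example.org"]:
    print(name, "->", check_helo(name))
```

The last rule rejects bare IP literals without brackets, while a bracketed literal that is not the server's own address passes this table.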