Re: per-user or global bayes (was: HUGE bayes DB (non-sitewide) advice?)
bump

--- Michael Monnerie <[EMAIL PROTECTED]> wrote:

> > My users are quite happy with overall markup of the spam. We
> > occasionally get a HAM marked as SPAM. We have an odd client base
> > though.
>
> The question is: when to use global and when per-user bayes?
>
> On our server, we have people of different languages, communicating with
> different countries all over the world, in different areas (advertising,
> production, IT, etc.). I thought in that case a per-user bayes would be
> much better, as viagra is something good for the one, but bad for the
> other.
>
> What's the general recommendation for bayes?

__
Yahoo! FareChase: Search multiple travel sites in one click.
http://farechase.yahoo.com
Re: HUGE bayes DB (non-sitewide) advice?
> > Just a follow-up to my own brain-lapse:
> >
> > If you define a custom user scores query like this:
> >
> > user_scores_sql_custom_query SELECT preference, value FROM
> > spamassassin_settings WHERE username = _USERNAME_ OR username = '!GLOBAL'
> > OR username = CONCAT('@', _DOMAIN_) ORDER BY username ASC
> >
> > Then you can easily decide to use bayes on a per-domain basis for one or
> > more of your domains (and still have per-user bayes for all other
> > domains). A sample insert row into the settings table, then, would be:
> >
> > INSERT INTO spamassassin_settings (username, preference, value) VALUES
> > ('@example.com', 'bayes_sql_override_username', 'example.com');
> >
> > So everyone in the example.com domain shares all bayes information,
> > which is placed under the username "example.com".
>
> is that in the FAQ? because it certainly sounds like a cool tip for
> Bayes/SQL users.

I don't think so. One other thing to note about this setup: I think I caught
the idea of using !GLOBAL from someone's how-to a while back (IIRC, the
manual suggests @GLOBAL); this way the global settings can be ordered in the
query to always override any per-domain settings.

> (there should really be a section of the FAQ dedicated to that stuff.)

Would be nice.
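The !GLOBAL-vs-@GLOBAL trick above comes down to byte order in the query's ORDER BY username ASC clause. A minimal sketch of just the sort behavior (which row ultimately "wins" depends on how SpamAssassin applies the returned preference rows in order; this only shows where each row lands):

```python
# '!' is 0x21, '@' is 0x40, and letters sort after both, so an
# ORDER BY username ASC puts '!GLOBAL' rows ahead of per-domain
# ('@example.com') and per-user rows.
rows = ["user@example.com", "@example.com", "!GLOBAL"]
print(sorted(rows))
# -> ['!GLOBAL', '@example.com', 'user@example.com']
# With '@GLOBAL' instead, the global row would sort right next to the
# per-domain rows, since both begin with '@'.
```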
Re: HUGE bayes DB (non-sitewide) advice?
email builder writes:

> Just a follow-up to my own brain-lapse:
>
> If you define a custom user scores query like this:
>
> user_scores_sql_custom_query SELECT preference, value FROM
> spamassassin_settings WHERE username = _USERNAME_ OR username = '!GLOBAL'
> OR username = CONCAT('@', _DOMAIN_) ORDER BY username ASC
>
> Then you can easily decide to use bayes on a per-domain basis for one or
> more of your domains (and still have per-user bayes for all other
> domains). A sample insert row into the settings table, then, would be:
>
> INSERT INTO spamassassin_settings (username, preference, value) VALUES
> ('@example.com', 'bayes_sql_override_username', 'example.com');
>
> So everyone in the example.com domain shares all bayes information, which
> is placed under the username "example.com".

is that in the FAQ? because it certainly sounds like a cool tip for
Bayes/SQL users.

(there should really be a section of the FAQ dedicated to that stuff.)

--j.
Re: HUGE bayes DB (non-sitewide) advice?
> > Well, I know there have to be some admins out there who have a lot of
> > users and do not use sitewide bayes.. RIGHT? See original email snippet
> > at bottom.
> >
> > * Other ideas:
> >   - increase system memory as much as possible
> >   - per-domain Bayes instead of per-user???
>
> This might be our 2nd best choice (unless there is a good
> bayes_expiry_max_db_size solution), but I don't see anything in the manual
> about the syntax of bayes_sql_override_username. The manual mentions
> "grouping", but gives no examples of how I could, for instance, group
> bayes data by domain (my usernames are in the form [EMAIL PROTECTED]).

Just a follow-up to my own brain-lapse:

If you define a custom user scores query like this:

user_scores_sql_custom_query SELECT preference, value FROM
spamassassin_settings WHERE username = _USERNAME_ OR username = '!GLOBAL' OR
username = CONCAT('@', _DOMAIN_) ORDER BY username ASC

Then you can easily decide to use bayes on a per-domain basis for one or more
of your domains (and still have per-user bayes for all other domains). A
sample insert row into the settings table, then, would be:

INSERT INTO spamassassin_settings (username, preference, value) VALUES
('@example.com', 'bayes_sql_override_username', 'example.com');

So everyone in the example.com domain shares all bayes information, which is
placed under the username "example.com".

> > - cluster Bayes DB???
>
> This apparently is not an option, since clustered MySQL databases are kept
> entirely in memory. We don't have any 10GB RAM machines sadly :)
>
> From the MySQL manual:
>
> In-memory storage: All data stored in each data node is kept in memory on
> the node's host computer. For each data node in the cluster, you must have
> available an amount of RAM equal to the size of the database times the
> number of replicas, divided by the number of data nodes. Thus, if the
> database takes up 1 gigabyte of memory, and you wish to set up the cluster
> with 4 replicas and 8 data nodes, a minimum of 500 MB memory will be
> required per node. Note that this is in addition to any requirements for
> the operating system and any other applications that might be running on
> the host.
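The RAM formula quoted from the MySQL manual is easy to sanity-check. A quick back-of-the-envelope helper (not SA code; the 160GB case is the worst-case figure discussed earlier in the thread):

```python
def ndb_ram_per_node(db_size_gb: float, replicas: int, data_nodes: int) -> float:
    """RAM each MySQL Cluster data node needs, per the quoted formula:
    database size * number of replicas / number of data nodes."""
    return db_size_gb * replicas / data_nodes

# The manual's own example: 1 GB database, 4 replicas, 8 data nodes.
print(ndb_ram_per_node(1.0, 4, 8))    # 0.5 (GB), i.e. 500 MB per node

# The thread's worry: even a modest cluster for a 160 GB bayes DB
# (2 replicas, 4 data nodes) needs 80 GB of RAM on every node.
print(ndb_ram_per_node(160.0, 2, 4))  # 80.0
```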
RE: HUGE bayes DB (non-sitewide) advice?
Thanks a lot for checking, Gary!

--- "Gary W. Smith" <[EMAIL PROTECTED]> wrote:

> You're right, my guy gave me the size of bayes + awl. The real number is
> 14.5mb (with an overhead of 3.2mb).
>
> Not sure, that's just what phpmyadmin is reporting. I'll check again. I
> can't remember if the DB is in double byte or not. One of my guys tweaked
> it for some other little databases on the same box.
>
> > >> > Our production database for a large number of emails (but using
> > >> > site wide) is about 40mb.
> > >>
> > >> What is your bayes_expiry_max_db_size set to? Do you feel that it
> > >> has been enough to effectively capture your various user email
> > >> habits?
> > >
> > > Default.
> >
> > How can you be running the default value, when the manual says that
> > 150,000 tokens is only 8MB? How do you end up with 40MB of data?:
> >
> > bayes_expiry_max_db_size (default: 150000)
> > What should be the maximum size of the Bayes tokens database? When
> > expiry occurs, the Bayes system will keep either 75% of the maximum
> > value, or 100,000 tokens, whichever has a larger value. 150,000 tokens
> > is roughly equivalent to an 8Mb database file.
RE: HUGE bayes DB (non-sitewide) advice?
You're right, my guy gave me the size of bayes + awl. The real number is
14.5mb (with an overhead of 3.2mb).

-----Original Message-----
From: Gary W. Smith [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 09, 2005 9:00 AM
To: email builder; users@spamassassin.apache.org
Subject: RE: HUGE bayes DB (non-sitewide) advice?

Not sure, that's just what phpmyadmin is reporting. I'll check again. I
can't remember if the DB is in double byte or not. One of my guys tweaked it
for some other little databases on the same box.

-----Original Message-----
From: email builder [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 09, 2005 1:54 AM
To: Gary W. Smith; users@spamassassin.apache.org
Subject: RE: HUGE bayes DB (non-sitewide) advice?

>> > Our production database for a large number of emails (but using site
>> > wide) is about 40mb.
>>
>> What is your bayes_expiry_max_db_size set to? Do you feel that it has
>> been enough to effectively capture your various user email habits?
>
> Default.

How can you be running the default value, when the manual says that 150,000
tokens is only 8MB? How do you end up with 40MB of data?:

bayes_expiry_max_db_size (default: 150000)
What should be the maximum size of the Bayes tokens database? When expiry
occurs, the Bayes system will keep either 75% of the maximum value, or
100,000 tokens, whichever has a larger value. 150,000 tokens is roughly
equivalent to an 8Mb database file.
RE: HUGE bayes DB (non-sitewide) advice?
Not sure, that's just what phpmyadmin is reporting. I'll check again. I
can't remember if the DB is in double byte or not. One of my guys tweaked it
for some other little databases on the same box.

-----Original Message-----
From: email builder [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 09, 2005 1:54 AM
To: Gary W. Smith; users@spamassassin.apache.org
Subject: RE: HUGE bayes DB (non-sitewide) advice?

>> > Our production database for a large number of emails (but using site
>> > wide) is about 40mb.
>>
>> What is your bayes_expiry_max_db_size set to? Do you feel that it has
>> been enough to effectively capture your various user email habits?
>
> Default.

How can you be running the default value, when the manual says that 150,000
tokens is only 8MB? How do you end up with 40MB of data?:

bayes_expiry_max_db_size (default: 150000)
What should be the maximum size of the Bayes tokens database? When expiry
occurs, the Bayes system will keep either 75% of the maximum value, or
100,000 tokens, whichever has a larger value. 150,000 tokens is roughly
equivalent to an 8Mb database file.
Re: HUGE bayes DB (non-sitewide) advice?
email builder wrote:
> Well, I know there have to be some admins out there who have a lot of
> users and do not use sitewide bayes.. RIGHT? See original email snippet
> at bottom.

I believe that there are a few running bayes in a similar configuration. It
certainly is a tough problem.

I believe your tuning ideas are on the mark and are certainly outside the
scope of what SpamAssassin is doing, unless of course there is something in
the code that we could do more efficiently. In that case I highly encourage
you to read the MySQL website; there is lots of great documentation there,
and there are also several books that can help with the tuning. Don't be
afraid to ask the MySQL community for help; I'm sure they would gladly offer
tuning advice. The key to just about any database tuning, once you've
exhausted all the various config params, is going to be hardware. More
memory and more spindles will do wonders.

The other way to look at this is (as someone else mentioned, I believe) the
possibility that in the long run it just won't be possible to run
efficiently with such a large database. In that case you can create a custom
storage module (the API is documented) that can handle it. Of course, it's
open source, so you can always do this yourself, or it is possible that a
few amongst us would be willing to contract to come up with a storage module
that better fits your needs.

The main thing is that these are all good discussions, and I myself highly
encourage them. It is very hard to test these types of deployments in
development, so anytime there is one it is a learning experience, at least
for me.

Michael
Re: HUGE bayes DB (non-sitewide) advice?
Gary W. Smith wrote:
> Just my $0.02 but if it's in MySQL then you really don't need to expire
> each one. You can write a custom script that will do this. When you break
> it down, expire is really just finding those tokens that are beyond the
> threshold where id=x and time=y. The resultant would be "where time=x".

Don't be fooled into thinking it is this easy, it's not. When running an
expire you have to not only delete the unused tokens but you have to also
update individual's variables. While it is possible to go around the API and
do it yourself, unless you know EXACTLY what you are doing I don't recommend
this route.

Michael
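Michael's warning can be made concrete. A minimal sketch of what a hand-rolled expire has to do, using a heavily simplified, assumed table layout (loosely modeled on SA's bayes_token/bayes_vars tables, not the real schema) in SQLite for illustration: the DELETE alone is not enough; the per-user token count has to be corrected in the same pass.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- simplified stand-ins for SA's bayes_token / bayes_vars tables
    CREATE TABLE bayes_token (id INTEGER, token TEXT, atime INTEGER);
    CREATE TABLE bayes_vars  (id INTEGER PRIMARY KEY, token_count INTEGER);
""")
con.executemany("INSERT INTO bayes_token VALUES (1, ?, ?)",
                [("t1", 100), ("t2", 200), ("t3", 900)])
con.execute("INSERT INTO bayes_vars VALUES (1, 3)")

def expire_tokens(con, user_id, atime_cutoff):
    """Delete tokens last touched before the cutoff AND fix up the
    user's token_count; doing only the DELETE leaves bayes_vars lying."""
    cur = con.execute(
        "DELETE FROM bayes_token WHERE id = ? AND atime < ?",
        (user_id, atime_cutoff))
    con.execute(
        "UPDATE bayes_vars SET token_count = token_count - ? WHERE id = ?",
        (cur.rowcount, user_id))
    con.commit()

expire_tokens(con, 1, 500)
print(con.execute(
    "SELECT token_count FROM bayes_vars WHERE id = 1").fetchone()[0])  # 1
```

The real storage module does more bookkeeping than this (newest/oldest atimes, ham/spam counts), which is exactly why Michael recommends staying behind the API.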
Re: HUGE bayes DB (non-sitewide) advice?
email builder wrote:
> How can you be running the default value, when the manual says that
> 150,000 tokens is only 8MB? How do you end up with 40MB of data?:
>
> bayes_expiry_max_db_size (default: 150000)
> What should be the maximum size of the Bayes tokens database? When expiry
> occurs, the Bayes system will keep either 75% of the maximum value, or
> 100,000 tokens, whichever has a larger value. 150,000 tokens is roughly
> equivalent to an 8Mb database file.

The documentation is very Berkeley DB specific; that no doubt refers to the
DBM size of 150k tokens.

Michael
Re: HUGE bayes DB (non-sitewide) advice?
> > I guess the relevant point for this thread is that I don't necessarily
> > think that this is the silver bullet as implied. Even if you use a
> > high-availability clustering technology that can mirror writes and
> > reads, you are STILL dealing with the possibility of a database that is
> > just massive. Processing this size of database will still be disk-bound
> > unless you have an unheard-of amount of memory; I don't think there's
> > any reason to think that clustering the problem will make it go away.
> >
> > So I still wonder if anyone has any musings on my earlier questions?
>
> A few spamassassin hacks could help.
> 1. Have multiple mysql servers, split your users into A-J, K-S, T-Z or
> smaller units and distribute them over different servers, with some HA /
> failover mechanism (possibly drbd).
> 2. Have 2 levels of bayes, one large global and the other smaller per
> user, if that's possible. Of course SA will need to be changed to use
> both bayes'. This way you could have 2 large servers for the global bayes
> db and 2 for the per-user bayes dbs.
>
> Also see if this SQL failover patch can help you in any way.
> http://issues.apache.org/SpamAssassin/show_bug.cgi?id=2197

Thanks for the good thoughts. Sounds like the ultimate answer is that not
many people are using per-user Bayes, at least at this level, and that any
"solutions" are yet to be realized in practice. I don't think we've got the
resources or time to contribute any SA patches, but the food for thought is
very much appreciated!

> Finally to speed up the database have a look at this; the people at
> wikimedia / livejournal seem to be happy using it.
> http://www.danga.com/memcached/

That's very cool. I'll *definitely* be keeping this one in mind.
RE: HUGE bayes DB (non-sitewide) advice?
>> > Our production database for a large number of emails (but using site
>> > wide) is about 40mb.
>>
>> What is your bayes_expiry_max_db_size set to? Do you feel that it has
>> been enough to effectively capture your various user email habits?
>
> Default.

How can you be running the default value, when the manual says that 150,000
tokens is only 8MB? How do you end up with 40MB of data?:

bayes_expiry_max_db_size (default: 150000)
What should be the maximum size of the Bayes tokens database? When expiry
occurs, the Bayes system will keep either 75% of the maximum value, or
100,000 tokens, whichever has a larger value. 150,000 tokens is roughly
equivalent to an 8Mb database file.
Re: per-user or global bayes (was: HUGE bayes DB (non-sitewide) advice?)
On Wednesday, 9 November 2005 08:04 Gary W. Smith wrote:
> My users are quite happy with overall markup of the spam. We occasionally
> get a HAM marked as SPAM. We have an odd client base though.

The question is: when to use global and when per-user bayes?

On our server, we have people of different languages, communicating with
different countries all over the world, in different areas (advertising,
production, IT, etc.). I thought in that case a per-user bayes would be much
better, as viagra is something good for the one, but bad for the other.

What's the general recommendation for bayes?

mfg zmi
--
// Michael Monnerie, Ing.BSc --- it-management Michael Monnerie
// http://zmi.at Tel: 0660/4156531 Linux 2.6.11
// PGP Key: "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952 F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net Key-ID: 0x70545879
Re: HUGE bayes DB (non-sitewide) advice?
email builder wrote:
> > > In-memory storage: All data stored in each data node is kept in
> > > memory on the node's host computer. For each data node in the
> > > cluster, you must have available an amount of RAM equal to the size
> > > of the database times the number of replicas,
> >
> > This refers to the first line: "In-memory storage". Of course you can't
> > do that with 160GB DBs. You can still cluster - look at DRBD
> > http://www.drbd.org/
>
> I guess the relevant point for this thread is that I don't necessarily
> think that this is the silver bullet as implied. Even if you use a
> high-availability clustering technology that can mirror writes and reads,
> you are STILL dealing with the possibility of a database that is just
> massive. Processing this size of database will still be disk-bound unless
> you have an unheard-of amount of memory; I don't think there's any reason
> to think that clustering the problem will make it go away.
>
> So I still wonder if anyone has any musings on my earlier questions?

A few spamassassin hacks could help.

1. Have multiple mysql servers, split your users into A-J, K-S, T-Z or
smaller units and distribute them over different servers, with some HA /
failover mechanism (possibly drbd).

2. Have 2 levels of bayes, one large global and the other smaller per user,
if that's possible. Of course SA will need to be changed to use both bayes'.
This way you could have 2 large servers for the global bayes db and 2 for
the per-user bayes dbs.

Also see if this SQL failover patch can help you in any way.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=2197

Finally to speed up the database have a look at this; the people at
wikimedia / livejournal seem to be happy using it.
http://www.danga.com/memcached/

Hope that helps,
- dhawal
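The first suggestion (splitting users across several MySQL servers by username range) needs a deterministic user-to-server map in front of SA. A minimal sketch; the hostnames and letter ranges here are made up for illustration:

```python
# Hypothetical shard map: route each bayes user to one of several MySQL
# servers by the first letter of the username, per the A-J / K-S / T-Z idea.
SHARDS = {
    ("a", "j"): "mysql-aj.example.net",
    ("k", "s"): "mysql-ks.example.net",
    ("t", "z"): "mysql-tz.example.net",
}

def shard_for(username: str) -> str:
    first = username.lower()[0]
    for (lo, hi), host in SHARDS.items():
        if lo <= first <= hi:
            return host
    # usernames starting with digits or punctuation fall back to one server
    return "mysql-aj.example.net"

print(shard_for("bob@example.com"))    # mysql-aj.example.net
print(shard_for("wendy@example.com"))  # mysql-tz.example.net
```

The catch, as noted in the thread, is that SA itself has no such routing layer, so this would have to live in a patched user_scores lookup or an external proxy.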
RE: HUGE bayes DB (non-sitewide) advice?
Sorry, only answered part of the question. My users are quite happy with
overall markup of the spam. We occasionally get a HAM marked as SPAM. We
have an odd client base though.

-----Original Message-----
From: email builder [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 08, 2005 8:58 PM
To: Gary W. Smith; users@spamassassin.apache.org
Subject: RE: HUGE bayes DB (non-sitewide) advice?

> Our production database for a large number of emails (but using site
> wide) is about 40mb.

What is your bayes_expiry_max_db_size set to? Do you feel that it has been
enough to effectively capture your various user email habits?
RE: HUGE bayes DB (non-sitewide) advice?
Default.

Gary

-----Original Message-----
From: email builder [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 08, 2005 8:58 PM
To: Gary W. Smith; users@spamassassin.apache.org
Subject: RE: HUGE bayes DB (non-sitewide) advice?

> Our production database for a large number of emails (but using site
> wide) is about 40mb.

What is your bayes_expiry_max_db_size set to? Do you feel that it has been
enough to effectively capture your various user email habits?
RE: HUGE bayes DB (non-sitewide) advice?
>> > Our production database for a large number of emails (but using site
>> > wide) is about 40mb.
>>
>> What is your bayes_expiry_max_db_size set to? Do you feel that it has
>> been enough to effectively capture your various user email habits?
Re: HUGE bayes DB (non-sitewide) advice?
> > In-memory storage:
> > All data stored in each data node is kept in memory on the node's host
> > computer. For each data node in the cluster, you must have available an
> > amount of RAM equal to the size of the database times the number of
> > replicas,
>
> This refers to the first line: "In-memory storage". Of course you can't
> do that with 160GB DBs. You can still cluster - look at DRBD
> http://www.drbd.org/

I guess the relevant point for this thread is that I don't necessarily think
that this is the silver bullet as implied. Even if you use a
high-availability clustering technology that can mirror writes and reads,
you are STILL dealing with the possibility of a database that is just
massive. Processing this size of database will still be disk-bound unless
you have an unheard-of amount of memory; I don't think there's any reason to
think that clustering the problem will make it go away.

So I still wonder if anyone has any musings on my earlier questions?
RE: HUGE bayes DB (non-sitewide) advice?
I'd also throw www.linux-ha.org into the mix. We use that to manage the
cluster for the SA database and use DRBD for the filesystem. We also use the
same concept for the backend email stores as well. It's more open source to
complement this open source.

-----Original Message-----
From: Michael Monnerie [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 08, 2005 9:48 AM
To: users@spamassassin.apache.org
Subject: Re: HUGE bayes DB (non-sitewide) advice?

On Tuesday, 8 November 2005 03:38 email builder wrote:
> In-memory storage:
> All data stored in each data node is kept in memory on the node's host
> computer. For each data node in the cluster, you must have available an
> amount of RAM equal to the size of the database times the number of
> replicas,

This refers to the first line: "In-memory storage". Of course you can't do
that with 160GB DBs. You can still cluster - look at DRBD
http://www.drbd.org/

mfg zmi
--
// Michael Monnerie, Ing.BSc --- it-management Michael Monnerie
// http://zmi.at Tel: 0660/4156531 Linux 2.6.11
// PGP Key: "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952 F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net Key-ID: 0x70545879
Re: HUGE bayes DB (non-sitewide) advice?
On Tuesday, 8 November 2005 03:38 email builder wrote:
> In-memory storage:
> All data stored in each data node is kept in memory on the node's host
> computer. For each data node in the cluster, you must have available an
> amount of RAM equal to the size of the database times the number of
> replicas,

This refers to the first line: "In-memory storage". Of course you can't do
that with 160GB DBs. You can still cluster - look at DRBD
http://www.drbd.org/

mfg zmi
--
// Michael Monnerie, Ing.BSc --- it-management Michael Monnerie
// http://zmi.at Tel: 0660/4156531 Linux 2.6.11
// PGP Key: "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952 F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net Key-ID: 0x70545879
Re: HUGE bayes DB (non-sitewide) advice?
On Tuesday, 8 November 2005 03:50 email builder wrote:
> From what I understand, MySQL cluster design is such that the data nodes
> keep all the table data in memory, which would not be feasible in a 160GB
> scenario...

No. Cluster means: take two machines of the same config, and mirror them.
It's kind of like RAID-1, just for a whole server. DRBD is one tool for
this.

mfg zmi
--
// Michael Monnerie, Ing.BSc --- it-management Michael Monnerie
// http://zmi.at Tel: 0660/4156531 Linux 2.6.11
// PGP Key: "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952 F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net Key-ID: 0x70545879
RE: HUGE bayes DB (non-sitewide) advice?
We run a linux-ha cluster. Works out well.

-----Original Message-----
From: email builder [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 07, 2005 6:51 PM
To: users@spamassassin.apache.org
Subject: RE: HUGE bayes DB (non-sitewide) advice?

From what I understand, MySQL cluster design is such that the data nodes
keep all the table data in memory, which would not be feasible in a 160GB
scenario...
RE: HUGE bayes DB (non-sitewide) advice?
> Just my $0.02 but if it's in MySQL then you really don't need to expire
> each one. You can write a custom script that will do this. When you break
> it down, expire is really just finding those tokens that are beyond the
> threshold where id=x and time=y. The resultant would be "where time=x".

Right. Are there any scripts already out there that do this?

> But even then, you would only trim it down to a manageable size per user.
> Our production database for a large number of emails (but using site
> wide) is about 40mb.

What is your bayes_expiry_max_db_size? Quite a bit larger than default I
take it.

> Even if you stuck with non-MySQL based databases (such as Berkeley DB)
> you'd still have 160gb of aggregate data files. If you truly need
> independent DBs for each user (whether file based or MySQL) I'd recommend
> building a big MySQL cluster and managing it that way. We currently
> manage a MySQL cluster (with mirrored 300gb drives and DRBD replication)
> that houses a whopping 80mb of MySQL data.

From what I understand, MySQL cluster design is such that the data nodes
keep all the table data in memory, which would not be feasible in a 160GB
scenario...

> I don't think this helps you much, just an opinion.

I appreciate it nonetheless!
Re: HUGE bayes DB (non-sitewide) advice?
> Well, I know there have to be some admins out there who have a lot of
> users and do not use sitewide bayes.. RIGHT? See original email snippet
> at bottom.
>
> * Other ideas:
>   - increase system memory as much as possible
>   - per-domain Bayes instead of per-user???

This might be our 2nd best choice (unless there is a good
bayes_expiry_max_db_size solution), but I don't see anything in the manual
about the syntax of bayes_sql_override_username. The manual mentions
"grouping", but gives no examples of how I could, for instance, group bayes
data by domain (my usernames are in the form [EMAIL PROTECTED]).

> - cluster Bayes DB???

This apparently is not an option, since clustered MySQL databases are kept
entirely in memory. We don't have any 10GB RAM machines sadly :)

From the MySQL manual:

In-memory storage: All data stored in each data node is kept in memory on
the node's host computer. For each data node in the cluster, you must have
available an amount of RAM equal to the size of the database times the
number of replicas, divided by the number of data nodes. Thus, if the
database takes up 1 gigabyte of memory, and you wish to set up the cluster
with 4 replicas and 8 data nodes, a minimum of 500 MB memory will be
required per node. Note that this is in addition to any requirements for the
operating system and any other applications that might be running on the
host.
RE: HUGE bayes DB (non-sitewide) advice?
Just my $0.02 but if it's in MySQL then you really don't need to expire each
one. You can write a custom script that will do this. When you break it
down, expire is really just finding those tokens that are beyond the
threshold where id=x and time=y. The resultant would be "where time=x".

But even then, you would only trim it down to a manageable size per user.
Our production database for a large number of emails (but using site wide)
is about 40mb.

Even if you stuck with non-MySQL based databases (such as Berkeley DB) you'd
still have 160gb of aggregate data files. If you truly need independent DBs
for each user (whether file based or MySQL) I'd recommend building a big
MySQL cluster and managing it that way. We currently manage a MySQL cluster
(with mirrored 300gb drives and DRBD replication) that houses a whopping
80mb of MySQL data.

I don't think this helps you much, just an opinion.

Gary Wayne Smith

-----Original Message-----
From: email builder [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 07, 2005 10:56 AM
To: [EMAIL PROTECTED]; users@spamassassin.apache.org
Subject: Re: HUGE bayes DB (non-sitewide) advice?

Well, I know there have to be some admins out there who have a lot of users
and do not use sitewide bayes.. RIGHT? See original email snippet at bottom.

I'll start the ball rolling with what few tweaks we've made, although they
are not enough; we desperately need more ideas to make this viable.

* bayes_auto_expire is turned on; cronning the expiry of 20K+ accounts
  every night seems outrageous
* bayes_expiry_max_db_size is at its default value; if 20K accounts used
  the maximum allowable space, then, we'd have a 160GB bayes DB. If 8MB is
  considered sufficient for a whole domain for some people, then perhaps we
  can reduce this size for per-user bayes...??
* MySQL tuning for InnoDB: pretty much straight from the MySQL manual...
  - multiple data files (approx 10G each)
  - innodb_flush_log_at_trx_commit=0 because it's faster and we don't care
    about Bayes data enough that the risk of losing one second of data is
    fine
  - innodb_buffer_pool_size as large as we can handle, but even if this was
    3 or more GB, it's only a fraction of a 160GB database
  - innodb_additional_mem_pool_size=20M because that's what we saw for
    their "big" example, although I am wondering in particular about the
    value of increasing this one
  - innodb_log_file_size 25% of innodb_buffer_pool_size
* Other ideas:
  - increase system memory as much as possible
  - per-domain Bayes instead of per-user???
  - cluster Bayes DB???
  - revert to MyISAM -- will this help THAT much?

> I'm wondering if anyone out there hosts a large number of users with
> per-USER bayes (in MySQL)? Our user base is varied enough that we do not
> feel bayes would be effective if done site-wide. Some people like their
> spammy newsletters, some are geeks who would deeply resent someone
> training newsletters to be ham.
>
> As a result of this, however, we are currently burdened with an 8GB (!
> yep, you read it right) bayes database (more than 20K users having mail
> delivered). We went to InnoDB when we upgraded to 3.1 per the upgrade
> doc's recommendation, so that also means things are a bit slower.
> Watching mytop, most all the activity we get is from bayes inserts, which
> is not surprising, and is probably the cause of why we get a lot of
> iowait, trying to keep writing to an 8G tablespace...
>
> We've tuned the InnoDB some, but performance is still not all that good
> -- is there anyone out there who runs a system like this?
>
> * What kinds of MySQL tuning are people using to help cope?
> * Are there any SA settings to help alleviate performance problems?
> * If we want to walk away from per-user bayes, is the only option to go
>   site-wide? What other options are there?
Re: HUGE bayes DB (non-sitewide) advice?
Well, I know there have to be some admins out there who have a lot of users and do not use sitewide bayes... RIGHT? See the original email, quoted at the bottom.

I'll start the ball rolling with what few tweaks we've made, although they are not enough; we desperately need more ideas to make this viable.

* bayes_auto_expire is turned on; cronning the expiry of 20K+ accounts every night seems outrageous.
* bayes_expiry_max_db_size is at its default value; if 20K accounts used the maximum allowable space, we'd have a 160GB bayes DB. If 8MB is considered sufficient for a whole domain for some people, then perhaps we can reduce this size for per-user bayes...??
* MySQL tuning for InnoDB: pretty much straight from the MySQL manual...
  - multiple data files (approx 10G each)
  - innodb_flush_log_at_trx_commit=0, because it's faster and we don't care about Bayes data enough to worry about losing one second of it
  - innodb_buffer_pool_size as large as we can handle, though even 3GB or more is only a fraction of a 160GB database
  - innodb_additional_mem_pool_size=20M, because that's what we saw in their "big" example, although I wonder in particular whether increasing this one is worth it
  - innodb_log_file_size at 25% of innodb_buffer_pool_size
* Other ideas:
  - increase system memory as much as possible
  - per-domain Bayes instead of per-user???
  - cluster the Bayes DB???
  - revert to MyISAM -- will this help THAT much?

> I'm wondering if anyone out there hosts a large number of users with
> per-USER bayes (in MySQL)? Our user base is varied enough that we do
> not feel bayes would be effective if done site-wide. Some people like
> their spammy newsletters, some are geeks who would deeply resent
> someone training newsletters to be ham.
>
> As a result of this, however, we are currently burdened with an 8GB
> (! yep, you read it right) bayes database (more than 20K users having
> mail delivered). We went to InnoDB when we upgraded to 3.1 per the
> upgrade doc's recommendation, so that also means things are a bit
> slower. Watching mytop, almost all the activity we get is from bayes
> inserts, which is not surprising, and is probably why we get a lot of
> iowait, trying to keep writing to an 8G tablespace...
>
> We've tuned the InnoDB some, but performance is still not all that
> good -- is there anyone out there who runs a system like this?
>
> * What kinds of MySQL tuning are people using to help cope?
> * Are there any SA settings to help alleviate performance problems?
> * If we want to walk away from per-user bayes, is the only option to
>   go site-wide? What other options are there?
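For reference, the tweaks listed above would translate into a my.cnf fragment along these lines. This is only a sketch: the sizes are illustrative placeholders for a hypothetical box, not the poster's actual values.

```ini
[mysqld]
# Multiple data files, approx 10G each, last one autoextending
innodb_data_file_path = ibdata1:10G;ibdata2:10G:autoextend

# Don't flush the log to disk on every commit; losing up to ~1s of
# Bayes data on a crash is acceptable here
innodb_flush_log_at_trx_commit = 0

# As large as RAM allows -- still only a fraction of a 160GB DB
innodb_buffer_pool_size = 1G

# Taken from the MySQL manual's "big server" example
innodb_additional_mem_pool_size = 20M

# Roughly 25% of the buffer pool
innodb_log_file_size = 256M
```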
Re: HUGE bayes DB (non-sitewide) advice?
On Friday, 4 November 2005 at 21:04, email builder wrote:
> *SOMEONE* out there has to be doing something like this, no???

I would be interested in that, too.

regards, zmi
--
// Michael Monnerie, Ing.BSc --- it-management Michael Monnerie
// http://zmi.at Tel: 0660/4156531 Linux 2.6.11
// PGP Key: "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952 F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net Key-ID: 0x70545879
RE: HUGE bayes DB (non-sitewide) advice?
> >>> As a result of this, however, we are currently burdened with an
> >>> 8GB (! yep, you read it right) bayes database (more than 20K
> >>> users having mail delivered).
> >>
> >> Consider using bayes_expiry_max_db_size in conjunction with
> >> bayes_auto_expire
> >
> > "Using"? So you are saying you use non-sitewide bayes but you limit
> > your max DB size to something much smaller than the default? Care
> > to share your settings?
>
> No, I use sitewide bayes.
>
> > We left these at their defaults (not unintentionally). If we have
> > 20K users, the default max of 150,000 tokens at roughly 8MB comes
> > out to 160GB. We have the disk space, but just not sure if we have
> > the tuning it would take to handle a DB of that size. What I am
> > looking for is tuning help or other ideas on how to achieve some
> > reasonable level of bayes personalization without drowning our DB
> > resources.
>
> For optimum performance you probably want the bayes database to fit
> into RAM, along with all of your spamassassin objects and anything
> else on the server.
>
> You might consider buying a dedicated Bayes DB server with 4 GB of
> RAM, and cutting bayes_expiry_max_db_size in half. That should do it.

That should do it today (actually, the database is now 9GB), but not when it has grown to 160GB. I appreciate the tips, but what I am looking for is MySQL tuning advice and thoughts/ideas/other approaches to having at least somewhat personalized Bayes stores for well over 20K users. *SOMEONE* out there has to be doing something like this, no???

> If the DB fits into RAM, the SQL engine should be able to make
> transactional changes in RAM and lazily spool them to the disk
> without forcing other transactions to wait.
RE: HUGE bayes DB (non-sitewide) advice?
>> email builder wrote:
>>> As a result of this, however, we are currently burdened with an
>>> 8GB (! yep, you read it right) bayes database (more than 20K users
>>> having mail delivered).
>>
>> Consider using bayes_expiry_max_db_size in conjunction with
>> bayes_auto_expire
>
> "Using"? So you are saying you use non-sitewide bayes but you limit
> your max DB size to something much smaller than the default? Care to
> share your settings?

No, I use sitewide bayes.

> We left these at their defaults (not unintentionally). If we have
> 20K users, the default max of 150,000 tokens at roughly 8MB comes
> out to 160GB. We have the disk space, but just not sure if we have
> the tuning it would take to handle a DB of that size. What I am
> looking for is tuning help or other ideas on how to achieve some
> reasonable level of bayes personalization without drowning our DB
> resources.

For optimum performance you probably want the bayes database to fit into RAM, along with all of your spamassassin objects and anything else on the server.

You might consider buying a dedicated Bayes DB server with 4 GB of RAM, and cutting bayes_expiry_max_db_size in half. That should do it.

If the DB fits into RAM, the SQL engine should be able to make transactional changes in RAM and lazily spool them to the disk without forcing other transactions to wait.

--
Matthew.van.Eerde (at) hbinc.com  805.964.4554 x902
Hispanic Business Inc./HireDiversity.com  Software Engineer
RE: HUGE bayes DB (non-sitewide) advice?
--- [EMAIL PROTECTED] wrote:
> email builder wrote:
> > As a result of this, however, we are currently burdened with an
> > 8GB (! yep, you read it right) bayes database (more than 20K users
> > having mail delivered).
>
> Consider using bayes_expiry_max_db_size in conjunction with
> bayes_auto_expire

"Using"? So you are saying you use non-sitewide bayes but you limit your max DB size to something much smaller than the default? Care to share your settings?

We left these at their defaults (not unintentionally). If we have 20K users, the default max of 150,000 tokens at roughly 8MB comes out to 160GB. We have the disk space, but just not sure if we have the tuning it would take to handle a DB of that size. What I am looking for is tuning help or other ideas on how to achieve some reasonable level of bayes personalization without drowning our DB resources.

Thanks
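The back-of-the-envelope arithmetic in the thread (20K users, each capped at 150,000 tokens costing roughly 8MB on disk, which the posters round up to 160GB) is easy to sanity-check; a quick sketch using the thread's own estimates:

```python
# Worst-case Bayes DB sizing: per-user on-disk cap times user count.
# The 8MB-per-user figure is the thread's estimate for the default
# bayes_expiry_max_db_size cap of 150,000 tokens, not a measurement.
users = 20_000
bytes_per_user = 8 * 1024**2          # ~8MB at the default token cap

total_gb = users * bytes_per_user / 1024**3
print(f"worst case: {total_gb:.0f} GB")      # prints: worst case: 156 GB

# Halving bayes_expiry_max_db_size roughly halves the worst case
print(f"halved cap: {total_gb / 2:.0f} GB")  # prints: halved cap: 78 GB
```

The thread's "160GB" is the same figure rounded with 1000-based units; either way the point stands: the store is far too large to fit in RAM on 2005-era hardware.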
RE: HUGE bayes DB (non-sitewide) advice?
email builder wrote:
> As a result of this, however, we are currently burdened with an
> 8GB (! yep, you read it right) bayes database (more than 20K users
> having mail delivered).

Consider using bayes_expiry_max_db_size in conjunction with bayes_auto_expire.

--
Matthew.van.Eerde (at) hbinc.com  805.964.4554 x902
Hispanic Business Inc./HireDiversity.com  Software Engineer
HUGE bayes DB (non-sitewide) advice?
Hi all,

I'm wondering if anyone out there hosts a large number of users with per-USER bayes (in MySQL)? Our user base is varied enough that we do not feel bayes would be effective if done site-wide. Some people like their spammy newsletters; some are geeks who would deeply resent someone training newsletters to be ham.

As a result of this, however, we are currently burdened with an 8GB (! yep, you read it right) bayes database (more than 20K users having mail delivered). We went to InnoDB when we upgraded to 3.1 per the upgrade doc's recommendation, so that also means things are a bit slower. Watching mytop, almost all the activity we get is from bayes inserts, which is not surprising, and is probably why we get a lot of iowait, trying to keep writing to an 8G tablespace...

Oh, and we let bayes do its token cleanup on the spot (sorry, not remembering the config setting name right now), not at night, since a small lag in delivery is acceptable, but figuring out how to run an absolutely huge cleanup by cron every night in this scenario seems like it'd really kill the DB (and we'd have to run sa-learn once for every single user, right... ugh).

We've tuned the InnoDB some, but performance is still not all that good -- is there anyone out there who runs a system like this?

* What kinds of MySQL tuning are people using to help cope?
* Are there any SA settings to help alleviate performance problems?
* If we want to walk away from per-user bayes, is the only option to go site-wide? What other options are there?
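For anyone investigating how a store like this breaks down per user: assuming the stock SpamAssassin 3.1 SQL schema, which keeps one row of counters per Bayes user in the bayes_vars table, an (untested) query along these lines should show which accounts dominate the database:

```sql
-- Sketch only, against the stock 3.1 bayes_vars schema:
-- find the 20 Bayes users holding the most tokens.
SELECT username, token_count, spam_count, ham_count
FROM bayes_vars
ORDER BY token_count DESC
LIMIT 20;
```

That can help decide whether a lower bayes_expiry_max_db_size, or moving a few heavy domains to shared per-domain Bayes, would shrink the store meaningfully.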