Re: HUGE bayes DB (non-sitewide) advice?

2005-11-10 Thread email builder
> > Just a follow-up to my own brain-lapse:
> > 
> > If you define a custom user scores query like this:
> > 
> > user_scores_sql_custom_query SELECT preference, value FROM
> > spamassassin_settings WHERE username = _USERNAME_ OR username = '!GLOBAL' OR
> > username = CONCAT('@', _DOMAIN_) ORDER BY username ASC
> > 
> > Then you can easily decide to use bayes on a per-domain basis for one or
> more
> > of your domains (and still have per-user bayes for all other domains).  A
> > sample insert row into the settings table, then, would be:
> > 
> > INSERT INTO spamassassin_settings (username, preference, value) VALUES
> > ('@example.com', 'bayes_sql_override_username', 'example.com');
> > 
> > So everyone in the example.com domain shares all bayes information which
> is
> > placed under the username "example.com".
> 
> is that in the FAQ?  because it certainly sounds like a cool tip for
> Bayes/SQL users.

I don't think so.  One other thing to note about this setup:

I think I picked up the idea of using !GLOBAL from someone's how-to a while
back (IIRC, the manual suggests @GLOBAL); this way the global settings can be
ordered in the query to always override any per-domain settings.
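The ordering trick relies on string comparison: '!' (0x21) sorts before '@' (0x40), and both sort before alphanumeric usernames, so ORDER BY username ASC groups global rows, then domain rows, then per-user rows. A quick sanity check of that assumption, using a plain Python byte-wise sort as a stand-in (MySQL's default collation may order punctuation differently):

```python
# '!' (0x21) < '@' (0x40) < letters in byte-wise ordering, so with
# ORDER BY username ASC the !GLOBAL row comes first, the @domain row
# second, and the per-user row last.
rows = ["user@example.com", "@example.com", "!GLOBAL"]
print(sorted(rows))  # ['!GLOBAL', '@example.com', 'user@example.com']
```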
 
> (there should really be a section of the FAQ dedicated to that stuff.)

Would be nice.




__ 
Yahoo! FareChase: Search multiple travel sites in one click.
http://farechase.yahoo.com


Re: HUGE bayes DB (non-sitewide) advice?

2005-11-09 Thread Justin Mason


email builder writes:
> Just a follow-up to my own brain-lapse:
> 
> If you define a custom user scores query like this:
> 
> user_scores_sql_custom_query SELECT preference, value FROM
> spamassassin_settings WHERE username = _USERNAME_ OR username = '!GLOBAL' OR
> username = CONCAT('@', _DOMAIN_) ORDER BY username ASC
> 
> Then you can easily decide to use bayes on a per-domain basis for one or more
> of your domains (and still have per-user bayes for all other domains).  A
> sample insert row into the settings table, then, would be:
> 
> INSERT INTO spamassassin_settings (username, preference, value) VALUES
> ('@example.com', 'bayes_sql_override_username', 'example.com');
> 
> So everyone in the example.com domain shares all bayes information which is
> placed under the username "example.com".

is that in the FAQ?  because it certainly sounds like a cool tip for
Bayes/SQL users.

(there should really be a section of the FAQ dedicated to that stuff.)

--j.



Re: HUGE bayes DB (non-sitewide) advice?

2005-11-09 Thread email builder

> > Well, I know there have to be some admins out there who have a lot of
> users
> > and do not use sitewide bayes.. RIGHT?  See original email snippet at
> > bottom.
> 
> 
> 
> > * Other ideas:
> > - increase system memory as much as possible
> > - per-domain Bayes instead of per-user???
> 
> This might be our 2nd best choice (unless there is a good
> bayes_expiry_max_db_size solution), but I don't see anything in the manual
> about the syntax of bayes_sql_override_username.  The manual mentions
> "grouping", but gives no examples of how I could, for instance, group bayes
> data by domain (my usernames are in the form [EMAIL PROTECTED]).

Just a follow-up to my own brain-lapse:

If you define a custom user scores query like this:

user_scores_sql_custom_query SELECT preference, value FROM
spamassassin_settings WHERE username = _USERNAME_ OR username = '!GLOBAL' OR
username = CONCAT('@', _DOMAIN_) ORDER BY username ASC

Then you can easily decide to use bayes on a per-domain basis for one or more
of your domains (and still have per-user bayes for all other domains).  A
sample insert row into the settings table, then, would be:

INSERT INTO spamassassin_settings (username, preference, value) VALUES
('@example.com', 'bayes_sql_override_username', 'example.com');

So everyone in the example.com domain shares all bayes information which is
placed under the username "example.com".
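A self-contained sketch of how that lookup resolves, using Python's sqlite3 as a stand-in for MySQL (CONCAT is replaced by the || operator, and the minimal table schema is inferred from the INSERT above, not copied from a real SpamAssassin setup):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Minimal guess at the settings table, based on the columns used above.
db.execute("""CREATE TABLE spamassassin_settings
              (username TEXT, preference TEXT, value TEXT)""")
db.execute("""INSERT INTO spamassassin_settings VALUES
              ('@example.com', 'bayes_sql_override_username', 'example.com')""")

# The custom query with _USERNAME_ / _DOMAIN_ substituted as parameters.
rows = db.execute("""SELECT preference, value FROM spamassassin_settings
                     WHERE username = ? OR username = '!GLOBAL'
                        OR username = '@' || ?
                     ORDER BY username ASC""",
                  ("alice@example.com", "example.com")).fetchall()
print(rows)  # [('bayes_sql_override_username', 'example.com')]
```

Any user in the domain matches the '@example.com' row, so all of them get the same bayes_sql_override_username and end up sharing one Bayes dataset.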


 
> > - cluster Bayes DB???
> 
> This apparently is not an option, since clustered MySQL databases are kept
> entirely in memory.  We don't have any 10GB RAM machines sadly  :)
> 
> From the MySQL manual:
> 
> In-memory storage:
> 
> All data stored in each data node is kept in memory on the node's host
> computer. For each data node in the cluster, you must have available an
> amount of RAM equal to the size of the database times the number of
> replicas,
> divided by the number of data nodes. Thus, if the database takes up 1
> gigabyte of memory, and you wish to set up the cluster with 4 replicas and
> 8
> data nodes, a minimum of 500 MB memory will be required per node. Note that
> this is in addition to any requirements for the operating system and any
> other applications that might be running on the host.
> 






RE: HUGE bayes DB (non-sitewide) advice?

2005-11-09 Thread email builder
Thanks a lot for checking, Gary!


--- "Gary W. Smith" <[EMAIL PROTECTED]> wrote:

> You're right, my guy gave me the size of bayes + awl.  The real number
> is 14.5mb. (with an overhead of 3.2mb).
> 
> 
> Not sure, that's just what phpmyadmin is reporting.  I'll check again.
> I can't remember if the DB is in double byte or not.  One of my guys
> tweaked it for some other little databases on the same box.
> 
> 
> >> > Our production database for a large number of emails (but using
> site
> >> > wide) is about 40mb.  
> >> 
> >> What is your bayes_expiry_max_db_size set to?  Do you feel that it
> has
> >> been
> >> enough to effectively capture your various user email habits?
> > 
> > Default.
> > 
> 
> 
> How can you be running the default value, when the manual says that
> 150,000 tokens is only 8MB?  How do you end up with 40MB of data?:
> 
> bayes_expiry_max_db_size (default: 150000)
> What should be the maximum size of the Bayes tokens database? When expiry
> occurs, the Bayes system will keep either 75% of the maximum value, or
> 100,000 tokens, whichever has a larger value. 150,000 tokens is roughly
> equivalent to an 8MB database file.
> 






RE: HUGE bayes DB (non-sitewide) advice?

2005-11-09 Thread Gary W. Smith
You're right, my guy gave me the size of bayes + awl.  The real number
is 14.5mb. (with an overhead of 3.2mb).

-Original Message-
From: Gary W. Smith [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 09, 2005 9:00 AM
To: email builder; users@spamassassin.apache.org
Subject: RE: HUGE bayes DB (non-sitewide) advice?

Not sure, that's just what phpmyadmin is reporting.  I'll check again.
I can't remember if the DB is in double byte or not.  One of my guys
tweaked it for some other little databases on the same box.

-Original Message-
From: email builder [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 09, 2005 1:54 AM
To: Gary W. Smith; users@spamassassin.apache.org
Subject: RE: HUGE bayes DB (non-sitewide) advice?


>> > Our production database for a large number of emails (but using
site
>> > wide) is about 40mb.  
>> 
>> What is your bayes_expiry_max_db_size set to?  Do you feel that it
has
>> been
>> enough to effectively capture your various user email habits?
> 
> Default.
> 


How can you be running the default value, when the manual says that
150,000 tokens is only 8MB?  How do you end up with 40MB of data?:

bayes_expiry_max_db_size (default: 150000)
What should be the maximum size of the Bayes tokens database? When expiry
occurs, the Bayes system will keep either 75% of the maximum value, or
100,000 tokens, whichever has a larger value. 150,000 tokens is roughly
equivalent to an 8MB database file.






RE: HUGE bayes DB (non-sitewide) advice?

2005-11-09 Thread Gary W. Smith
Not sure, that's just what phpmyadmin is reporting.  I'll check again.
I can't remember if the DB is in double byte or not.  One of my guys
tweaked it for some other little databases on the same box.

-Original Message-
From: email builder [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 09, 2005 1:54 AM
To: Gary W. Smith; users@spamassassin.apache.org
Subject: RE: HUGE bayes DB (non-sitewide) advice?


>> > Our production database for a large number of emails (but using
site
>> > wide) is about 40mb.  
>> 
>> What is your bayes_expiry_max_db_size set to?  Do you feel that it
has
>> been
>> enough to effectively capture your various user email habits?
> 
> Default.
> 


How can you be running the default value, when the manual says that
150,000 tokens is only 8MB?  How do you end up with 40MB of data?:

bayes_expiry_max_db_size (default: 150000)
What should be the maximum size of the Bayes tokens database? When expiry
occurs, the Bayes system will keep either 75% of the maximum value, or
100,000 tokens, whichever has a larger value. 150,000 tokens is roughly
equivalent to an 8MB database file.






Re: HUGE bayes DB (non-sitewide) advice?

2005-11-09 Thread Michael Parker

email builder wrote:

Well, I know there have to be some admins out there who have a lot of users
and do not use sitewide bayes.. RIGHT?  See original email snippet at
bottom.


I believe there are a few running bayes in a similar configuration.  It 
certainly is a tough problem.  I believe your tuning ideas are on the 
mark and are certainly outside the scope of what SpamAssassin is doing, 
unless of course there is something in the code that we could do more 
efficiently.  For the tuning itself, I highly encourage you to read the 
MySQL website (there is lots of great documentation there), and there 
are several books that can help as well.  Don't be afraid to ask the 
MySQL community for help; I'm sure they would gladly offer tuning 
advice.  The key to just about any database tuning, once you've 
exhausted the various config params, is going to be hardware: more 
memory and more spindles will do wonders.


The other way to look at this (as someone else mentioned, I believe) is 
the possibility that in the long run it just won't be possible to run 
efficiently with such a large database.  In that case you can create a 
custom storage module that can handle it; the API is documented.  Of 
course, it's open source, so you can always do this yourself, or it is 
possible that a few amongst us would be willing to contract to come up 
with a storage module that better fits your needs.


The main thing is that these are all good discussions, and I myself 
highly encourage them.  It is very hard to test these types of 
deployments in development, so anytime there is one it is a learning 
experience, at least for me.


Michael


Re: HUGE bayes DB (non-sitewide) advice?

2005-11-09 Thread Michael Parker

Gary W. Smith wrote:

Just my $0.02 but if it's in MySQL then you really don't need to expire
each one.  You can write a custom script that will do this.  When you
break it down, expire is really just finding those tokens that are
beyond the threshold where id=x and time=y.  The resultant would be
"where time=x".
  


Don't be fooled into thinking it is this easy; it's not.  When running 
an expire you have to not only delete the unused tokens but also update 
each user's variables.  While it is possible to go around the API and 
do it yourself, unless you know EXACTLY what you are doing I don't 
recommend this route.
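To make the warning concrete, here is a hypothetical sketch (invented table and column names, Python sqlite3 as a stand-in; not SpamAssassin's real schema) of why a bare DELETE is only half of an expire — the per-user bookkeeping has to move in step:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Invented stand-ins for the real Bayes tables; not SpamAssassin's schema.
db.execute("CREATE TABLE tokens (user_id INT, token TEXT, atime INT)")
db.execute("CREATE TABLE user_vars (user_id INT, token_count INT)")
db.executemany("INSERT INTO tokens VALUES (?, ?, ?)",
               [(1, "t1", 100), (1, "t2", 900), (1, "t3", 950)])
db.execute("INSERT INTO user_vars VALUES (1, 3)")

cutoff = 500  # expire tokens not touched since this time

# The "easy" part: delete the stale tokens for the user.
deleted = db.execute("DELETE FROM tokens WHERE user_id = ? AND atime < ?",
                     (1, cutoff)).rowcount

# The part a naive script skips: keep the per-user counters in sync,
# otherwise later expiry decisions work from stale numbers.
db.execute("UPDATE user_vars SET token_count = token_count - ? "
           "WHERE user_id = ?", (deleted, 1))

count = db.execute("SELECT token_count FROM user_vars WHERE user_id = 1"
                   ).fetchone()[0]
print(count)  # 2
```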


Michael



Re: HUGE bayes DB (non-sitewide) advice?

2005-11-09 Thread Michael Parker

email builder wrote:

How can you be running the default value, when the manual says that 150,000
tokens is only 8MB?  How do you end up with 40MB of data?:

bayes_expiry_max_db_size (default: 150000)
What should be the maximum size of the Bayes tokens database? When expiry
occurs, the Bayes system will keep either 75% of the maximum value, or
100,000 tokens, whichever has a larger value. 150,000 tokens is roughly
equivalent to an 8MB database file.


The documentation is very Berkeley DB specific; that figure no doubt refers 
to the DBM size at 150k tokens.


Michael


Re: HUGE bayes DB (non-sitewide) advice?

2005-11-09 Thread email builder
> > 
> > I guess the relevant point for this thread is that I don't necessarily
> think
> > that this is the silver bullet as implied.  Even if you use a
> > high-availability clustering technology that can mirror writes and reads,
> you
> > are STILL dealing with the possibility of a database that is just
> massive. 
> > Processing this size of database will still be disk-bound unless you have
> an
> > unheard-of amount of memory; I don't think there's any reason to think
> that
> > clustering the problem will make it go away.
> > 
> > So I still wonder if anyone has any musings on my earlier questions?
> 
> A few spamassassin hacks could help.
> 1. Have multiple MySQL servers: split your users into A-J, K-S, T-Z or 
> smaller units and distribute them over different servers, with some HA / 
> failover mechanism (possibly DRBD).
> 2. Have two levels of bayes, one large global and one smaller per user, 
> if that's possible. Of course SA would need to be changed to use both 
> bayes databases. This way you could have two large servers for the global 
> bayes db and two for the per-user bayes dbs.
> 
> Also see if this SQL failover patch can help you in any way.
> http://issues.apache.org/SpamAssassin/show_bug.cgi?id=2197

Thanks for the good thoughts.  Sounds like the ultimate answer is that not
many people are using per-user Bayes, at least at this level, and that any
"solutions" are yet to be realized in practice.  I don't think we've got the
resources or time to contribute any SA patches, but the food for thought is
very much appreciated!
 
> Finally to speed up the database have a look at this, the people at 
> wikimedia / livejournal seem to be happy using it.
> http://www.danga.com/memcached/

That's very cool.  I'll *definitely* be keeping this one in mind.







RE: HUGE bayes DB (non-sitewide) advice?

2005-11-09 Thread email builder

>> > Our production database for a large number of emails (but using site
>> > wide) is about 40mb.  
>> 
>> What is your bayes_expiry_max_db_size set to?  Do you feel that it has
>> been
>> enough to effectively capture your various user email habits?
> 
> Default.
> 


How can you be running the default value, when the manual says that 150,000
tokens is only 8MB?  How do you end up with 40MB of data?:

bayes_expiry_max_db_size (default: 150000)
What should be the maximum size of the Bayes tokens database? When expiry
occurs, the Bayes system will keep either 75% of the maximum value, or
100,000 tokens, whichever has a larger value. 150,000 tokens is roughly
equivalent to an 8MB database file.
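The retention rule quoted from the manual, restated as arithmetic (a sketch; the function name is mine):

```python
def tokens_kept_after_expiry(max_db_size_tokens):
    # Keep 75% of the maximum, or 100,000 tokens, whichever is larger.
    return max(int(max_db_size_tokens * 0.75), 100_000)

print(tokens_kept_after_expiry(150_000))  # 112500 kept at the default size
print(tokens_kept_after_expiry(100_000))  # 100000: the floor kicks in
```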






Re: HUGE bayes DB (non-sitewide) advice?

2005-11-08 Thread Dhawal Doshy

email builder wrote:

In-memory storage:
All data stored in each data node is kept in memory on the node's
host computer. For each data node in the cluster, you must have
available an amount of RAM equal to the size of the database times
the number of replicas,


This refers to the first line: "In-memory storage". Of course you can't 
do that with 160GB DBs. You can still cluster - look at DRBD 
http://www.drbd.org/



I guess the relevant point for this thread is that I don't necessarily think
that this is the silver bullet as implied.  Even if you use a
high-availability clustering technology that can mirror writes and reads, you
are STILL dealing with the possibility of a database that is just massive. 
Processing this size of database will still be disk-bound unless you have an

unheard-of amount of memory; I don't think there's any reason to think that
clustering the problem will make it go away.

So I still wonder if anyone has any musings on my earlier questions?


A few spamassassin hacks could help.
1. Have multiple MySQL servers: split your users into A-J, K-S, T-Z or 
smaller units and distribute them over different servers, with some HA / 
failover mechanism (possibly DRBD).
2. Have two levels of bayes, one large global and one smaller per user, 
if that's possible. Of course SA would need to be changed to use both 
bayes databases. This way you could have two large servers for the global 
bayes db and two for the per-user bayes dbs.


Also see if this SQL failover patch can help you in any way.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=2197

Finally to speed up the database have a look at this, the people at 
wikimedia / livejournal seem to be happy using it.

http://www.danga.com/memcached/

Hope that helps,
- dhawal


RE: HUGE bayes DB (non-sitewide) advice?

2005-11-08 Thread Gary W. Smith
Sorry, only answered part of the question.  My users are quite happy
with overall markup of the spam.  We occasionally get a HAM marked as
SPAM.  We have an odd client base though.



-Original Message-
From: email builder [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 08, 2005 8:58 PM
To: Gary W. Smith; users@spamassassin.apache.org
Subject: RE: HUGE bayes DB (non-sitewide) advice?

> Our production database for a large number of emails (but using site
> wide) is about 40mb.  

What is your bayes_expiry_max_db_size set to?  Do you feel that it has
been
enough to effectively capture your various user email habits?





RE: HUGE bayes DB (non-sitewide) advice?

2005-11-08 Thread Gary W. Smith
Default.

Gart

-Original Message-
From: email builder [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 08, 2005 8:58 PM
To: Gary W. Smith; users@spamassassin.apache.org
Subject: RE: HUGE bayes DB (non-sitewide) advice?

> Our production database for a large number of emails (but using site
> wide) is about 40mb.  

What is your bayes_expiry_max_db_size set to?  Do you feel that it has
been
enough to effectively capture your various user email habits?





RE: HUGE bayes DB (non-sitewide) advice?

2005-11-08 Thread email builder
> Our production database for a large number of emails (but using site
> wide) is about 40mb.  

What is your bayes_expiry_max_db_size set to?  Do you feel that it has been
enough to effectively capture your various user email habits?





Re: HUGE bayes DB (non-sitewide) advice?

2005-11-08 Thread email builder

> > In-memory storage:
> > All data stored in each data node is kept in memory on the node's
> > host computer. For each data node in the cluster, you must have
> > available an amount of RAM equal to the size of the database times
> > the number of replicas,
> 
> This refers to the first line: "In-memory storage". Of course you can't 
> do that with 160GB DBs. You can still cluster - look at DRBD 
> http://www.drbd.org/

I guess the relevant point for this thread is that I don't necessarily think
that this is the silver bullet as implied.  Even if you use a
high-availability clustering technology that can mirror writes and reads, you
are STILL dealing with the possibility of a database that is just massive. 
Processing this size of database will still be disk-bound unless you have an
unheard-of amount of memory; I don't think there's any reason to think that
clustering the problem will make it go away.

So I still wonder if anyone has any musings on my earlier questions?






RE: HUGE bayes DB (non-sitewide) advice?

2005-11-08 Thread Gary W. Smith
I'd also throw www.linux-ha.org into the mix.  We use that to manage
the cluster for the SA database and use DRBD for the filesystem.  We
use the same concept for the backend email stores as well.

It's more open source to complement this open source.  

-Original Message-
From: Michael Monnerie [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 08, 2005 9:48 AM
To: users@spamassassin.apache.org
Subject: Re: HUGE bayes DB (non-sitewide) advice?

On Dienstag, 8. November 2005 03:38 email builder wrote:
> In-memory storage:
> All data stored in each data node is kept in memory on the node's
> host computer. For each data node in the cluster, you must have
> available an amount of RAM equal to the size of the database times
> the number of replicas,

This refers to the first line: "In-memory storage". Of course you can't 
do that with 160GB DBs. You can still cluster - look at DRBD 
http://www.drbd.org/

mfg zmi
-- 
// Michael Monnerie, Ing.BSc  ---   it-management Michael Monnerie
// http://zmi.at   Tel: 0660/4156531  Linux 2.6.11
// PGP Key:   "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952  F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net Key-ID: 0x70545879


Re: HUGE bayes DB (non-sitewide) advice?

2005-11-08 Thread Michael Monnerie
On Dienstag, 8. November 2005 03:38 email builder wrote:
> In-memory storage:
> All data stored in each data node is kept in memory on the node's
> host computer. For each data node in the cluster, you must have
> available an amount of RAM equal to the size of the database times
> the number of replicas,

This refers to the first line: "In-memory storage". Of course you can't 
do that with 160GB DBs. You can still cluster - look at DRBD 
http://www.drbd.org/

mfg zmi
-- 
// Michael Monnerie, Ing.BSc  ---   it-management Michael Monnerie
// http://zmi.at   Tel: 0660/4156531  Linux 2.6.11
// PGP Key:   "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952  F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net Key-ID: 0x70545879




Re: HUGE bayes DB (non-sitewide) advice?

2005-11-08 Thread Michael Monnerie
On Dienstag, 8. November 2005 03:50 email builder wrote:
> From what I understand, MySQL cluster design is such that the data
> nodes keep all the table data in memory, which would not be feasible
> in a 160GB scenario...

No. Cluster means: take two machines of the same config, and mirror them. 
It's a kind of RAID-1, just for a whole server. DRBD is one tool for this.

mfg zmi
-- 
// Michael Monnerie, Ing.BSc  ---   it-management Michael Monnerie
// http://zmi.at   Tel: 0660/4156531  Linux 2.6.11
// PGP Key:   "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952  F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net Key-ID: 0x70545879




RE: HUGE bayes DB (non-sitewide) advice?

2005-11-07 Thread Gary W. Smith
We run a linux-ha cluster.  Works out well.



-Original Message-
From: email builder [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 07, 2005 6:51 PM
To: users@spamassassin.apache.org
Subject: RE: HUGE bayes DB (non-sitewide) advice?


From what I understand, MySQL cluster design is such that the data nodes
keep all the table data in memory, which would not be feasible in a 160GB
scenario...


RE: HUGE bayes DB (non-sitewide) advice?

2005-11-07 Thread email builder
> Just my $0.02 but if it's in MySQL then you really don't need to expire
> each one.  You can write a custom script that will do this.  When you
> break it down, expire is really just finding those tokens that are
> beyond the threshold where id=x and time=y.  The resultant would be
> "where time=x".

Right.  Are there any scripts already out there that do this?

> But even then, you would only trim it down to a manageable size per user.
> Our production database for a large number of emails (but using site
> wide) is about 40mb.  

What is your bayes_expiry_max_db_size?  Quite a bit larger than default I
take it.
 
> Even if you stuck with non-MySQL based databases (such as Berkeley DB)
> you'd still have 160gb of aggregate data files.  If you truly need
> independent DBs for each user (whether file based or MySQL) I'd
> recommend building a big MySQL cluster and managing it that way.  We
> currently manage a MySQL cluster (with mirrored 300gb drives and DRBD
> replication) that houses a whopping 80mb of MySQL data.  

From what I understand, MySQL cluster design is such that the data nodes keep
all the table data in memory, which would not be feasible in a 160GB
scenario...
 
> I don't think this helps you much, just an opinion.

I appreciate it nonetheless!





Re: HUGE bayes DB (non-sitewide) advice?

2005-11-07 Thread email builder
> Well, I know there have to be some admins out there who have a lot of users
> and do not use sitewide bayes.. RIGHT?  See original email snippet at
> bottom.



> * Other ideas:
> - increase system memory as much as possible
> - per-domain Bayes instead of per-user???

This might be our 2nd best choice (unless there is a good
bayes_expiry_max_db_size solution), but I don't see anything in the manual
about the syntax of bayes_sql_override_username.  The manual mentions
"grouping", but gives no examples of how I could, for instance, group bayes
data by domain (my usernames are in the form [EMAIL PROTECTED]).

> - cluster Bayes DB???

This apparently is not an option, since clustered MySQL databases are kept
entirely in memory.  We don't have any 10GB RAM machines sadly  :)

From the MySQL manual:

In-memory storage:

All data stored in each data node is kept in memory on the node's host
computer. For each data node in the cluster, you must have available an
amount of RAM equal to the size of the database times the number of replicas,
divided by the number of data nodes. Thus, if the database takes up 1
gigabyte of memory, and you wish to set up the cluster with 4 replicas and 8
data nodes, a minimum of 500 MB memory will be required per node. Note that
this is in addition to any requirements for the operating system and any
other applications that might be running on the host.
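The manual's sizing rule, restated as arithmetic (a sketch; the function name is mine, and 1 GB is taken as 1024 MB):

```python
def ndb_ram_per_node_mb(db_size_mb, replicas, data_nodes):
    # RAM per data node = database size * replicas / data nodes
    return db_size_mb * replicas / data_nodes

# The manual's example: a 1 GB database, 4 replicas, 8 data nodes.
print(ndb_ram_per_node_mb(1024, 4, 8))        # 512.0 (rounded to "500 MB")

# The 160 GB per-user-bayes case from this thread, same layout:
print(ndb_ram_per_node_mb(160 * 1024, 4, 8))  # 81920.0, i.e. 80 GB per node
```

Which is why the lack of big-RAM machines rules it out: even spread over 8 data nodes, each node would need tens of gigabytes of RAM.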








RE: HUGE bayes DB (non-sitewide) advice?

2005-11-07 Thread Gary W. Smith
Just my $0.02 but if it's in MySQL then you really don't need to expire
each one.  You can write a custom script that will do this.  When you
break it down, expire is really just finding those tokens that are
beyond the threshold where id=x and time=y.  The resultant would be
"where time=x".

But even then, you would only trim it down to a manageable size per user.
Our production database for a large number of emails (but using site
wide) is about 40mb.  

Even if you stuck with non-MySQL based databases (such as Berkeley DB)
you'd still have 160gb of aggregate data files.  If you truly need
independent DBs for each user (whether file based or MySQL) I'd
recommend building a big MySQL cluster and managing it that way.  We
currently manage a MySQL cluster (with mirrored 300gb drives and DRBD
replication) that houses a whopping 80mb of MySQL data.  

I don't think this helps you much, just an opinion.

Gary Wayne Smith


-Original Message-
From: email builder [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 07, 2005 10:56 AM
To: [EMAIL PROTECTED]; users@spamassassin.apache.org
Subject: Re: HUGE bayes DB (non-sitewide) advice?

Well, I know there have to be some admins out there who have a lot of
users
and do not use sitewide bayes.. RIGHT?  See original email snippet
at
bottom.

I'll start the ball rolling with what few tweaks we've made, although
they
are not enough; we desperately need more ideas to make this viable.

* bayes_auto_expire is turned on; cronning the expiry of 20K+ accounts
every
night seems outrageous

* bayes_expiry_max_db_size is at its default value; if 20K accounts used
the
maximum allowable space, then, we'd have a 160GB bayes DB.  If 8MB is
considered sufficient for a whole domain for some people, then perhaps
we can
reduce this size for per-user bayes...??

* MySQL tuning for InnoDB: pretty much straight from the MySQL manual...

- multiple data files (approx 10G each)
- innodb_flush_log_at_trx_commit=0 because it's faster and we don't
care
about Bayes data enough that the risk of losing one second of data is
fine
- innodb_buffer_pool_size as large as we can handle, but even if
this was
3 or more GB, it's only a fraction of a 160GB database
- innodb_additional_mem_pool_size=20M because that's what we saw for
their "big" example, although I am wondering in particular about the
value of
increasing this one
- innodb_log_file_size 25% of innodb_buffer_pool_size

* Other ideas:
- increase system memory as much as possible
- per-domain Bayes instead of per-user???
- cluster Bayes DB???
- revert to MyISAM -- will this help THAT much?


>   I'm wondering if anyone out there hosts a large number of users with
> per-USER bayes (in MySQL)?  Our user base is varied enough that we do
not
> feel bayes would be effective if done site-wide.  Some people like
their
> spammy newsletters, some are geeks who would deeply resent someone
training
> newsletters to be ham.
> 
>   As a result of this, however, we are currently burdened with an
8GB(!
> yep,
> you read it right) bayes database (more than 20K users having mail
> delivered).  We went to InnoDB when we upgraded to 3.1 per the upgrade
> doc's
> recommendation, so that also means things are a bit slower.  Watching
> mytop,
> most all the activity we get is from bayes inserts, which is not
> surprising,
> and is probably the cause of why we get a lot of iowait, trying to
keep
> writing to an 8G tablespace...
> 
>   We've tuned the InnoDB some, but performance is still not all that
good
> --
> is there anyone out there who runs a system like this?  
> 
>   * What kinds of MySQL tuning are people using to help cope?
>   * Are there any SA settings to help alleviate performance problems?
>   * If we want to walk away from per-user bayes, is the only option to
go
> site-wide?  What other options are there?






Re: HUGE bayes DB (non-sitewide) advice?

2005-11-07 Thread email builder
Well, I know there have to be some admins out there who have a lot of users
and do not use sitewide bayes.. RIGHT?  See original email snippet at
bottom.

I'll start the ball rolling with what few tweaks we've made, although they
are not enough; we desperately need more ideas to make this viable.

* bayes_auto_expire is turned on; cronning the expiry of 20K+ accounts every
night seems outrageous

* bayes_expiry_max_db_size is at its default value; if 20K accounts used the
maximum allowable space, then, we'd have a 160GB bayes DB.  If 8MB is
considered sufficient for a whole domain for some people, then perhaps we can
reduce this size for per-user bayes...??

* MySQL tuning for InnoDB: pretty much straight from the MySQL manual... 
- multiple data files (approx 10G each)
- innodb_flush_log_at_trx_commit=0 because it's faster and we don't care
about Bayes data enough that the risk of losing one second of data is fine
- innodb_buffer_pool_size as large as we can handle, but even if this was
3 or more GB, it's only a fraction of a 160GB database
- innodb_additional_mem_pool_size=20M because that's what we saw for
their "big" example, although I am wondering in particular about the value of
increasing this one
- innodb_log_file_size 25% of innodb_buffer_pool_size
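Collected into a my.cnf fragment (a sketch of the settings listed above, with illustrative sizes; the 3GB buffer pool and ~10G data files are this poster's ballpark figures, not general recommendations):

```ini
[mysqld]
# multiple data files, approx 10G each
innodb_data_file_path = ibdata1:10G;ibdata2:10G:autoextend
# trade ~1s of Bayes-data durability for write speed
innodb_flush_log_at_trx_commit = 0
# as large as the box allows; still a fraction of a 160GB database
innodb_buffer_pool_size = 3072M
innodb_additional_mem_pool_size = 20M
# roughly 25% of innodb_buffer_pool_size
innodb_log_file_size = 768M
```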

* Other ideas:
- increase system memory as much as possible
- per-domain Bayes instead of per-user???
- cluster Bayes DB???
- revert to MyISAM -- will this help THAT much?


>   I'm wondering if anyone out there hosts a large number of users with
> per-USER bayes (in MySQL)?  Our user base is varied enough that we do not
> feel bayes would be effective if done site-wide.  Some people like their
> spammy newsletters, some are geeks who would deeply resent someone training
> newsletters to be ham.
> 
>   As a result of this, however, we are currently burdened with an 8GB(!
> yep,
> you read it right) bayes database (more than 20K users having mail
> delivered).  We went to InnoDB when we upgraded to 3.1 per the upgrade
> doc's
> recommendation, so that also means things are a bit slower.  Watching
> mytop,
> most all the activity we get is from bayes inserts, which is not
> surprising,
> and is probably why we get a lot of iowait trying to keep writing to an 8G
> tablespace...
> 
>   We've tuned the InnoDB some, but performance is still not all that good
> --
> is there anyone out there who runs a system like this?  
> 
>   * What kinds of MySQL tuning are people using to help cope?
>   * Are there any SA settings to help alleviate performance problems?
>   * If we want to walk away from per-user bayes, is the only option to go
> site-wide?  What other options are there?






Re: HUGE bayes DB (non-sitewide) advice?

2005-11-04 Thread Michael Monnerie
On Freitag, 4. November 2005 21:04 email builder wrote:
> *SOMEONE* out there has to be doing
> something like this, no???

I would be interested in that, too.

mfg zmi
-- 
// Michael Monnerie, Ing.BSc  ---   it-management Michael Monnerie
// http://zmi.at   Tel: 0660/4156531  Linux 2.6.11
// PGP Key:   "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952  F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net Key-ID: 0x70545879




RE: HUGE bayes DB (non-sitewide) advice?

2005-11-04 Thread email builder
> >>>   As a result of this, however, we are currently burdened with an
> >>> 8GB(! yep, you read it right) bayes database (more than 20K users
> >>> having mail delivered).
> >> 
> >> Consider using bayes_expiry_max_db_size in conjunction with
> >> bayes_auto_expire
> > 
> > "Using"?  So you are saying you use non-sitewide bayes but you limit
> > your max DB size to something much smaller than the default?  Care to
> > share your settings?
> 
> No, I use sitewide bayes.
> 
> > We left these at their defaults (not unintentionally).  If we have
> > 20K users, the default max of 150,000 tokens at roughly 8MB comes out
> > to 160GB.  We have the disk space, but just not sure if we have the
> > tuning it would take to handle a DB of that size.  What I am looking
> > for is tuning help or other ideas on how to achieve some reasonable
> > level of bayes personalization without drowning our DB resources.
> 
> For optimum performance you probably want the bayes database to fit into
> RAM, along with all of your spamassassin objects and anything else on the
> server.
> 
> You might consider buying a dedicated Bayes DB server with 4 GB of RAM, and
> cutting bayes_expiry_max_db_size in half.  That should do it.

That should do it today (actually, the database is now 9GB), but not when it
has grown to 160GB.

I appreciate the tips, but what I am looking for is MySQL tuning advice and
thoughts/ideas/other approaches to having at least somewhat personalized
Bayes stores for well over 20K users.  *SOMEONE* out there has to be doing
something like this, no???

 
> If the DB fits into RAM, the SQL engine should be able to make
> transactional changes in RAM and lazily spool them to the disk without
> forcing other transactions to wait.






RE: HUGE bayes DB (non-sitewide) advice?

2005-11-03 Thread Matthew.van.Eerde
>> email builder wrote:
>>>   As a result of this, however, we are currently burdened with an
>>> 8GB(! yep, you read it right) bayes database (more than 20K users
>>> having mail delivered).
>> 
>> Consider using bayes_expiry_max_db_size in conjunction with
>> bayes_auto_expire
> 
> "Using"?  So you are saying you use non-sitewide bayes but you limit
> your max DB size to something much smaller than the default?  Care to
> share your settings?

No, I use sitewide bayes.

> We left these at their defaults (not unintentionally).  If we have
> 20K users, the default max of 150,000 tokens at roughly 8MB comes out
> to 160GB.  We have the disk space, but just not sure if we have the
> tuning it would take to handle a DB of that size.  What I am looking
> for is tuning help or other ideas on how to achieve some reasonable
> level of bayes personalization without drowning our DB resources.

For optimum performance you probably want the bayes database to fit into RAM, 
along with all of your spamassassin objects and anything else on the server.

You might consider buying a dedicated Bayes DB server with 4 GB of RAM, and 
cutting bayes_expiry_max_db_size in half.  That should do it.
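
In local.cf terms, that halving would look something like this (illustrative
values; note bayes_expiry_max_db_size counts tokens, not bytes):

```
bayes_auto_expire        1       # let SA expire old tokens opportunistically
bayes_expiry_max_db_size 75000   # half the 150,000-token default
```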

If the DB fits into RAM, the SQL engine should be able to make transactional 
changes in RAM and lazily spool them to the disk without forcing other 
transactions to wait.

-- 
Matthew.van.Eerde (at) hbinc.com   805.964.4554 x902
Hispanic Business Inc./HireDiversity.com   Software Engineer


RE: HUGE bayes DB (non-sitewide) advice?

2005-11-03 Thread email builder


--- [EMAIL PROTECTED] wrote:

> email builder wrote:
> >   As a result of this, however, we are currently burdened with an
> > 8GB(! yep, you read it right) bayes database (more than 20K users
> > having mail delivered).
> 
> Consider using bayes_expiry_max_db_size in conjunction with
> bayes_auto_expire

"Using"?  So you are saying you use non-sitewide bayes but you limit your max
DB size to something much smaller than the default?  Care to share your
settings?

We left these at their defaults (not unintentionally).  If we have 20K users,
the default max of 150,000 tokens at roughly 8MB comes out to 160GB.  We have
the disk space, but just not sure if we have the tuning it would take to
handle a DB of that size.  What I am looking for is tuning help or other
ideas on how to achieve some reasonable level of bayes personalization
without drowning our DB resources.
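
A quick sanity check of that arithmetic (a sketch; the ~8MB-per-user figure
is the rough on-disk size of a full default-cap store mentioned above):

```python
# Rough capacity estimate for per-user Bayes at default settings.
# Assumes ~8 MB per user at the default 150,000-token cap.
users = 20_000
mb_per_user = 8

total_gb = users * mb_per_user / 1024
print(f"{total_gb:.0f} GB")      # ~156 GB, i.e. the "160GB" figure above

# Halving bayes_expiry_max_db_size would roughly halve that:
print(f"{total_gb / 2:.0f} GB")  # ~78 GB
```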

Thanks






RE: HUGE bayes DB (non-sitewide) advice?

2005-11-03 Thread Matthew.van.Eerde
email builder wrote:
>   As a result of this, however, we are currently burdened with an
> 8GB(! yep, you read it right) bayes database (more than 20K users
> having mail delivered).

Consider using bayes_expiry_max_db_size in conjunction with bayes_auto_expire

-- 
Matthew.van.Eerde (at) hbinc.com   805.964.4554 x902
Hispanic Business Inc./HireDiversity.com   Software Engineer