Re: simultaneous sa-learn processes
Chavdar Videff wrote:
> Hi List,
>
> Our mailserver serves about 100 users. Our config:
> Sendmail+Procmail+SpamAssassin.
>
> The question is: if I got it right, we should run sa-learn for each user in order to benefit from Bayes. We intend to run a cron job for each user and do it at night by supplying a daily snapshot of our spam and ham collections to sa-learn. Can our mailserver handle it (256 MB RAM, Celeron 400 MHz)?
>
> A weekly collection run for 1 user usually eats 100% of the CPU. My concern is whether the system is going to crash or just do the job more slowly, and how many sa-learn tasks we could run simultaneously with our setup.
>
> All hints will be appreciated, for we have scheduled an initial load for 16 users of the big collection of spam received so far.
>
> Thanks guys
> Chavdar Videff

What kind of Bayes DB are you using? We use MySQL here and haven't seen sa-learn use that much CPU. I've manually run up to 10 processes at once without any noticeable slowing of the machine (P2 450 MHz, 256 MB).

--
Thanks,
James
RE: simultaneous sa-learn processes
JamesDR wrote:
> Chavdar Videff wrote:
>> Hi List,
>>
>> Our mailserver serves about 100 users. Our config:
>> Sendmail+Procmail+SpamAssassin.
>>
>> The question is: if I got it right, we should run sa-learn for each user in order to benefit from Bayes. We intend to run a cron job for each user and do it at night by supplying a daily snapshot of our spam and ham collections to sa-learn. Can our mailserver handle it (256 MB RAM, Celeron 400 MHz)?

Why would you want to set up Bayes on a per-user basis if you are going to feed it system-wide hams and spams? Feeding it system-wide hams is especially odd.

>> A weekly collection run for 1 user usually eats 100% of the CPU. My concern is whether the system is going to crash or just do the job more slowly, and how many sa-learn tasks we could run simultaneously with our setup.

Systems shouldn't crash under high load, so that's not a real concern. If it does happen, you have a more serious problem elsewhere. What would be more of a concern is how it is going to affect the other processes running on your system. Slower is not a problem, but if you really load your box with a lot of processes, you might start seeing time-outs.

>> All hints will be appreciated, for we have scheduled an initial load for 16 users of the big collection of spam received so far.

If you are going to learn spam and ham for 16 users simultaneously, and want to keep running your mailserver/SpamAssassin too (I take it you also have a virus scanner running somewhere), I would at least run the sa-learn processes under nice to keep them from stalling more essential services. But, depending on your system setup (OS, DB, etc.), you might also want to cut down a little on the number of processes run simultaneously.

>> Thanks guys
>> Chavdar Videff
>
> What kind of Bayes DB are you using? We use MySQL here and haven't seen sa-learn use that much CPU. I've manually run up to 10 processes at once without any noticeable slowing of the machine (P2 450 MHz, 256 MB).
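The advice above (run sa-learn under nice, and limit how many run at once) could be sketched roughly as follows. This is only an illustration under stated assumptions: the user names and folder paths are hypothetical, and running sa-learn as each user via `sudo -H -u` is an assumption about the setup, not something described in the thread.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Tuple


def build_learn_command(user: str, kind: str, folder: str,
                        niceness: int = 19) -> List[str]:
    """Build a niced sa-learn command that runs as `user`.

    Assumption: sudo is available and per-user Bayes data lives in
    the user's home directory (hence -H to set HOME).
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return [
        "nice", "-n", str(niceness),   # lowest priority; inherited by children
        "sudo", "-H", "-u", user,
        "sa-learn", f"--{kind}", folder,
    ]


def learn_all(jobs: Iterable[Tuple[str, str, str]],
              max_parallel: int = 2) -> List[int]:
    """Run (user, kind, folder) jobs, at most `max_parallel` at a time.

    Returns the exit code of each sa-learn run, in job order.
    """
    def run(job: Tuple[str, str, str]) -> int:
        return subprocess.run(build_learn_command(*job)).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run, jobs))


# Hypothetical usage (paths and users are placeholders):
# learn_all([("alice", "spam", "/home/alice/Mail/spam"),
#            ("alice", "ham", "/home/alice/Mail/ham")], max_parallel=2)
```

Keeping `max_parallel` small trades a longer nightly run for less contention with the mail server, which seems the safer choice on a 256 MB machine.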
Re: simultaneous sa-learn processes
On Monday 11 July 2005 14:50, JamesDR wrote:
> Chavdar Videff wrote:
>> Hi List,
>>
>> Our mailserver serves about 100 users. Our config:
>> Sendmail+Procmail+SpamAssassin.
>>
>> The question is: if I got it right, we should run sa-learn for each user in order to benefit from Bayes. We intend to run a cron job for each user and do it at night by supplying a daily snapshot of our spam and ham collections to sa-learn. Can our mailserver handle it (256 MB RAM, Celeron 400 MHz)?
>>
>> A weekly collection run for 1 user usually eats 100% of the CPU. My concern is whether the system is going to crash or just do the job more slowly, and how many sa-learn tasks we could run simultaneously with our setup.
>>
>> All hints will be appreciated, for we have scheduled an initial load for 16 users of the big collection of spam received so far.
>>
>> Thanks guys
>> Chavdar Videff
>
> What kind of Bayes DB are you using? We use MySQL here and haven't seen sa-learn use that much CPU. I've manually run up to 10 processes at once without any noticeable slowing of the machine (P2 450 MHz, 256 MB).

I guess it is BerkeleyDB, the default installation on Debian. The interesting part is that while testing the cron job on one user, the CPU slowdown was not noticeable.

Chavdar Videff
Re: simultaneous sa-learn processes
Chavdar Videff wrote on Mon, 11 Jul 2005 13:40:14 +0300:
> If I got it right, we should run sa-learn for each user in order to benefit from Bayes. We intend to run a cron job for each user and do it at night by supplying a daily snapshot of our spam and ham collections to sa-learn.

Do I understand you correctly? You use Bayes for each user, but you want to sa-learn the same daily corpus for each of them? This means the only difference between the users' Bayes DBs will be auto-learned mail or mail learned by those users themselves (if any of that is possible/allowed with your setup). That doesn't look too useful to me. If most of the DB content is the same, then you could just use a site-wide DB.

Also, Bayes gets better the more mail it gets. If your users don't get much mail, their individual Bayes DBs won't be very effective. I'm all for using site-wide Bayes unless your users get really a lot of mail (I'd say at least 100 mails per user per day).

Kai

--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de http://msie.winware.org
Re: simultaneous sa-learn processes
On Monday 11 July 2005 15:31, Kai Schaetzl wrote:
> Chavdar Videff wrote on Mon, 11 Jul 2005 13:40:14 +0300:
>> If I got it right, we should run sa-learn for each user in order to benefit from Bayes. We intend to run a cron job for each user and do it at night by supplying a daily snapshot of our spam and ham collections to sa-learn.
>
> Do I understand you correctly? You use Bayes for each user, but you want to sa-learn the same daily corpus for each of them? This means the only difference between the users' Bayes DBs will be auto-learned mail or mail learned by those users themselves (if any of that is possible/allowed with your setup). That doesn't look too useful to me. If most of the DB content is the same, then you could just use a site-wide DB.
>
> Also, Bayes gets better the more mail it gets. If your users don't get much mail, their individual Bayes DBs won't be very effective. I'm all for using site-wide Bayes unless your users get really a lot of mail (I'd say at least 100 mails per user per day).
>
> Kai

I thought it was installed site-wide; however, the only Bayes DBs I find on the system are in each user's ~/.spamassassin folder. And indeed, the only way I can make Bayes learn is by teaching it on a per-user basis.

For quite a few months I collected spam and fed it to sa-learn, and finally, reading this list, I realized that all I did was teach root's database. Nobody else benefited from Bayes, which was screwed up because it had auto-learned a lot of spam as ham.

If there is a way to set up a single Bayes database I would prefer that, for the scenario I am posting about (running 100 sa-learns at night) does not make me happy.

Thanks
Chavdar
Re: simultaneous sa-learn processes
Chavdar Videff wrote on Mon, 11 Jul 2005 16:13:44 +0300:
> If there is a way to set up a single Bayes database I would prefer that

There is one; just look in the SA documentation (the documentation for local.cf should do).

Kai
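For readers looking for the relevant settings: a site-wide Bayes database is usually configured with the `bayes_path` and `bayes_file_mode` options in local.cf. The fragment below is only a sketch; the directory is a hypothetical example, and it must exist and be writable by every account that runs SpamAssassin or sa-learn. Check the configuration documentation for your SpamAssassin version before relying on it.

```
# local.cf (on Debian typically /etc/spamassassin/local.cf)
use_bayes 1

# Hypothetical shared location; "bayes" is the file name prefix
# (files such as bayes_toks and bayes_seen are created from it).
bayes_path /var/spool/spamassassin/bayes

# Let all users read and write the shared database files.
bayes_file_mode 0777
```

With a shared path like this, a single nightly sa-learn run trains the one database used by all users, instead of 100 separate runs.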
Re: simultaneous sa-learn processes
From: Chavdar Videff [EMAIL PROTECTED]
> On Monday 11 July 2005 14:50, JamesDR wrote:
>> Chavdar Videff wrote:
>>> Hi List,
>>>
>>> Our mailserver serves about 100 users. Our config:
>>> Sendmail+Procmail+SpamAssassin.
>>>
>>> The question is: if I got it right, we should run sa-learn for each user in order to benefit from Bayes. We intend to run a cron job for each user and do it at night by supplying a daily snapshot of our spam and ham collections to sa-learn. Can our mailserver handle it (256 MB RAM, Celeron 400 MHz)?
>>>
>>> A weekly collection run for 1 user usually eats 100% of the CPU. My concern is whether the system is going to crash or just do the job more slowly, and how many sa-learn tasks we could run simultaneously with our setup.
>>>
>>> All hints will be appreciated, for we have scheduled an initial load for 16 users of the big collection of spam received so far.
>>>
>>> Thanks guys
>>> Chavdar Videff
>>
>> What kind of Bayes DB are you using? We use MySQL here and haven't seen sa-learn use that much CPU. I've manually run up to 10 processes at once without any noticeable slowing of the machine (P2 450 MHz, 256 MB).
>
> I guess it is BerkeleyDB, the default installation on Debian. The interesting part is that while testing the cron job on one user, the CPU slowdown was not noticeable.

If you are feeding individual per-user Bayes databases, feed each one with the ham and spam samples submitted by that particular user for HER Bayes. If you have them all working off the same Bayes corpus, then there is little or no gain from using per-user Bayes.

{^_^}
Re: simultaneous sa-learn processes
Hello Chavdar,

Monday, July 11, 2005, 3:40:14 AM, you wrote:

CV> Hi List,
CV> Our mailserver serves about 100 users. Our config:
CV> Sendmail+Procmail+SpamAssassin.
CV> The question is:
CV> If I got it right, we should run sa-learn for each user in order to benefit
CV> from bayes. We intend to run a cron job for each user and do it at night by
CV> supplying a daily snapshot of our spam and ham collections to sa-learn.
CV> Can our mailserver handle it (256 MB RAM, Celeron 400 MHz)?
CV> A weekly collection run for 1 user usually eats 100% of CPU load. My concern
CV> is whether the system is going to crash or just do the job slower and if you
CV> can point out how many sa-learn tasks we could run simultaneously with our
CV> setup.
CV> All hints will be appreciated, for we scheduled an initial load for 16 users
CV> of the big collection of spam received so far.

As indicated in another email, doing a user-level learn of system-wide collected ham/spam doesn't make much sense. And if you take your current system-wide collection and sa-learn it 100 times, you'll use 100 times more resources than learning it once.

On the other hand, if you meant that you'd sa-learn each individual user's ham/spam for that user only and then move on to the next, then provided you do these sequentially, one after the other (not all 100 at once), you should not increase your peak system load at all; the run simply takes longer. (You will increase your disk usage, since each user's database takes up some disk space.)

As discussed in a couple of Bugzilla entries, you should probably limit the size of your sa-learn runs: keep each one to a few hundred emails, or maybe a few megabytes combined. A massive sa-learn run of thousands of emails, dozens of megabytes in one go, can bring a resource-limited system to its knees.

Bob Menschel
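The suggestion to cap each sa-learn run could be sketched as a small batching helper. This is a minimal illustration, not a tested deployment script: the limits of 300 messages and 3 MB are assumed stand-ins for "a few hundred emails" and "a few meg", and the folder layout (one message per file) is an assumption.

```python
import os
from typing import Iterable, Iterator, List, Tuple


def list_messages(folder: str) -> List[Tuple[str, int]]:
    """Return (path, size_in_bytes) for each regular file in `folder`.

    Assumes one message per file (for example a Maildir "cur" directory).
    """
    with os.scandir(folder) as entries:
        return sorted((e.path, e.stat().st_size)
                      for e in entries if e.is_file())


def batch_messages(messages: Iterable[Tuple[str, int]],
                   max_count: int = 300,
                   max_bytes: int = 3 * 1024 * 1024) -> Iterator[List[str]]:
    """Group message paths into batches bounded by count and total size.

    A single message larger than `max_bytes` still gets a batch of its own.
    """
    batch: List[str] = []
    size = 0
    for path, nbytes in messages:
        # Close the current batch before it would exceed either limit.
        if batch and (len(batch) >= max_count or size + nbytes > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(path)
        size += nbytes
    if batch:
        yield batch


# Hypothetical usage: learn one user's spam sequentially, batch by batch.
# import subprocess
# for paths in batch_messages(list_messages("/home/alice/Mail/spam/cur")):
#     subprocess.run(["nice", "sa-learn", "--spam", *paths], check=True)
```

Running the batches one after another keeps only a single sa-learn process active at a time, which matches the sequential approach described above.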