Re: Bayes Questions
Andrew,

Andrew Ott wrote:
> Also, is there any way to see the count of spam and ham messages that are
> in the bayes database? I can't seem to find any info on that. I want to
> make sure there are a lot in there before I turn the bayes rules on.

If you run

spamassassin --lint -D

you should see a line that says something like:

debug: bayes corpus size: nspam = 1, nham = 5000

nspam is the number of spam messages and nham is the number of ham messages it has learned.

HTH
Dan
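For anyone who wants to check those counts from a script, here is a minimal Python sketch that pulls nspam and nham out of the debug line quoted in this reply. The regular expression is an assumption based only on that one sample line; the wording of the debug output may differ between SpamAssassin versions, and the function name is made up for illustration.

```python
import re

# Pattern modelled on the single debug line quoted in the message above.
# Assumption: other SpamAssassin versions may word this line differently.
CORPUS_RE = re.compile(
    r"bayes corpus size:\s*nspam\s*=\s*(\d+),\s*nham\s*=\s*(\d+)"
)


def parse_corpus_size(debug_output):
    """Return (nspam, nham) from debug text, or None if no line matches."""
    match = CORPUS_RE.search(debug_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = "debug: bayes corpus size: nspam = 1, nham = 5000"
print(parse_corpus_size(sample))
```

Feed it the captured output of "spamassassin --lint -D" (the debug lines may arrive on stderr rather than stdout) and compare both numbers against whatever minimum you want before enabling the bayes rules.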
Bayes Questions
For those of you running large sites (we have about 12,000 users, with 210,000 messages a day), what do you have for bayes_expiry_max_db_size?

Also, is there any way to see the count of spam and ham messages that are in the bayes database? I can't seem to find any info on that. I want to make sure there are a lot in there before I turn the bayes rules on.

Thank you.
Andrew
Re: simultaneous sa-learn processes
Hello Chavdar,

Monday, July 11, 2005, 3:40:14 AM, you wrote:

CV> Hi List,
CV> Our mailserver serves about 100 users. Our config:
CV> Sendmail+Procmail+SpamAssassin.
CV> The question is:
CV> If I got it right, we should run sa-learn for each user in order to benefit
CV> from bayes. We intend to run a cron job for each user and do it at night by
CV> supplying a daily snapshot of our spam and ham collections to sa-learn.
CV> Can our mailserver handle it (256 MB RAM, Celeron 400 MHz)?
CV> A weekly collection run for 1 user usually eats 100% of CPU load. My concern
CV> is whether the system is going to crash or just do the job slower and if you
CV> can point out how many sa-learn tasks could we run simultaneously with our
CV> setup.
CV> All hints will be appreciated, for we scheduled an initial load for 16 users
CV> of the big collection of spam received so far.

As indicated in another email, doing a user-level learn of system-wide collected ham/spam doesn't make much sense. And if you take your current system-wide collection and sa-learn it 100 times, you'll use 100 times more resources than learning it once.

On the other hand, if you meant that you'd sa-learn each individual user's ham/spam for that user only, then move to the next, then provided you do these one after the other sequentially (not all 100 at once), you should not increase your system load at all. (You will increase your disk storage, since each user's database will take up some disk space.)

As discussed in a couple of Bugzilla entries, you should probably limit the size of your sa-learn runs -- limit them to a few hundred emails at a time, or maybe a few meg combined size. A massive sa-learn run of thousands of emails, dozens of meg in one run, can bring a resource-limited system to its knees.

Bob Menschel
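The advice about limiting each sa-learn run can be sketched as a small batching helper. This is only an illustration in Python: the limits of 300 messages and 4 MB are my own stand-ins for "a few hundred emails" and "a few meg", and the function name is hypothetical. It only groups message names; running sa-learn on each batch, one batch after another, is left to the caller.

```python
def batch_messages(messages, max_count=300, max_bytes=4 * 1024 * 1024):
    """Group (name, size_in_bytes) pairs into batches for separate runs.

    A batch is closed as soon as adding the next message would exceed
    either limit. A single message larger than max_bytes still gets a
    batch of its own rather than being dropped.
    """
    batches = []
    current = []
    current_bytes = 0
    for name, size in messages:
        too_many = len(current) >= max_count
        too_big = current_bytes + size > max_bytes
        if current and (too_many or too_big):
            batches.append(current)
            current = []
            current_bytes = 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Tiny demonstration with made-up names and sizes.
demo = [("msg1", 1), ("msg2", 1), ("msg3", 1)]
print(batch_messages(demo, max_count=2))
```

Processing the batches sequentially keeps the peak load at one modest sa-learn run at a time, which is the point of the suggestion above.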
Re: Fedora changed SpamAssassin default level to 7?
Justin Mason wrote:
> fyi, if you're using Fedora Core -- http://blog.dave.org.uk/archives/000715.html
> totally unconfirmed, but worth noting in case that really is the case.

My copy of Fedora Core 4 has "required_hits 5" in local.cf using the distribution's RPM for SpamAssassin. rpm -Va made no complaints about the file. Just to be sure, I uninstalled it, checked that local.cf was gone, and reinstalled it via yum. Standard defaults.

It looks to me like something other than Fedora Core was messing with his config.

--
Kelson Vibber
SpeedGate Communications
Re: Bypass URI check
[EMAIL PROTECTED] wrote:
> Hi All,
> I have received a few messages like the following. This asks the receiver
> to copy and paste the link into their web browser. Since the href is
> missing, there is no URI check. That sucks, because the URIBL is my best
> friend right now (love black). We are close to marking it and URIBL would
> have definitely got it over. Any ideas on handling this?

SpamAssassin 3.1.0 will catch these. Depending on your environment you could consider running 3.1.0-pre3.

Daryl
Re: update on floating dividing score between spam and ham messages
Joe Flowers wrote:
> BTW, if anyone knows a command line program that can easily run through a
> bunch of mbox files and tell how many messages are in them, I will report
> back how many ham and how many spam messages that I have fed to bayes.
> It's far from perfect, but it may offer some interesting info regarding
> the 100:1 (fn:fp) ratio.

I usually do this:

grep -c "^From " filename

It's not perfect, since it's theoretically possible for someone to have a line in their message that starts with From (to provide an example -- see if your mbox-generating program escapes that line!), but it's usually enough.

--
Kelson Vibber
SpeedGate Communications
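The same count can be done in Python with one extra check: only count a "From " line when it is the first line of the file or follows a blank line. This is a sketch, not a full mbox parser, and the function name is made up. An unescaped "From " line directly after a blank line in a message body will still be miscounted, exactly as with grep.

```python
def count_mbox_messages(lines):
    """Count mbox separator lines in an iterable of text lines.

    A line counts only if it starts with "From " (note the space) and is
    either the first line or directly preceded by a blank line.
    """
    count = 0
    previous_blank = True  # the very first line may start a message
    for line in lines:
        if previous_blank and line.startswith("From "):
            count += 1
        previous_blank = line.strip() == ""
    return count


# Made-up sample: two messages, plus a body line that begins with "From ".
sample = [
    "From alice@example.org Mon Jul 11 11:18:53 2005\n",
    "Subject: one\n",
    "\n",
    "Hello,\n",
    "From now on this is body text\n",
    "\n",
    "From bob@example.org Mon Jul 11 11:19:10 2005\n",
    "Subject: two\n",
    "\n",
    "Bye\n",
]
print(count_mbox_messages(sample))
```

To count a real file, pass an open file object, for example open(path, errors="replace"), and sum the results over all your mbox files.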
Fedora changed SpamAssassin default level to 7?
fyi, if you're using Fedora Core -- http://blog.dave.org.uk/archives/000715.html totally unconfirmed, but worth noting in case that really is the case. --j.
Re: (repost) bayes_ignore_from with wildcard ?
Matt Kettler wrote: Although by looking at _check_whitelist, I wonder if it works the way the docs say. The docs claim it's file glob and not regex, but _check_whitelist looks a lot like it does a regex. _check_whitelist does use a regexp to do the matching but the config parser (add_to_addrlist() and add_to_addrlist_rcvd()) only passes file glob style expressions. Any other regexp style metacharacters are escaped. Daryl
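To illustrate what "file glob in, regexp out, everything else escaped" means in practice, here is a small Python sketch of that general technique. It is not SpamAssassin's actual code (which is Perl); the function name and the choice of "*" and "?" as the only wildcards are assumptions made for the example.

```python
import re


def glob_to_regex(pattern):
    """Turn a file-glob style address pattern into an anchored regex.

    Only "*" and "?" are treated as wildcards. Every other character,
    including regex metacharacters such as ".", is escaped so that it
    matches literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any run of characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$")


matcher = glob_to_regex("*@example.com")
print(bool(matcher.match("user@example.com")))
print(bool(matcher.match("user@exampleXcom")))
```

The second check fails because the "." in the pattern was escaped, so it can only match a literal dot. That is the behaviour the docs promise for glob-style entries, even though the matching itself is done with a regexp.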
Re: update on floating dividing score between spam and ham messages
> BTW, if anyone knows a command line program that can easily run through a
> bunch of mbox files and tell how many messages are in them, I will report
> back how many ham and how many spam messages that I have fed to bayes.

Well, I thought this might give some good stats on the FP:FN ratio, but I forgot I manually fed Bayes at the very beginning of the SA 3.02 install to get it kick-started immediately. So, counting those messages won't give anything accurate :(

Initially, I thought I was feeding Bayes just the FPs and FNs, but I forgot about the initial feeding.
Help debugging spamc/spamd
Hi, We recently changed some of our network topology so that we are temporarily connecting with spamc to spamd over a regular external network connection (we usually keep it inside our LAN, but this is a temporary thing... don't ask). Unfortunately, spamd stops (mostly) responding it seems. I can watch spamc sitting and waiting on the MTA by using "ps ax | grep spam" but I don't see anything happening on the spamd server except for once every 15 minutes or so, a few messages will process (there are hundreds a minute to process). I'm not sure where it would be choking. I ran spamd in the foreground (-D), painstakingly read all the debug info for a couple messages, and nothing looked bad. When messages DID scan, they took no more than a second or two, so I don't think there are DNS issues, but I don't know where else to look. Things just seem to stop processing suddenly; I don't get it. Anyone have hints?
Re: update on floating dividing score between spam and ham messages
Kai Schaetzl wrote on Mon, 11 Jul 2005 22:31:29 +0200: > With the default of 5 we get almost none, not even one per day. That was about FPs. Wrong. We don't get *any* FPs. We do not get even one *FN* per day. Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com IE-Center: http://ie5.de & http://msie.winware.org
Re: Performance: files or SQL?
Cami wrote:
> SQL simply doesn't scale very well for bayes. We have a serverfarm of
> 12 spamassassin servers and storing bayes in SQL. We see on average
> about 4000 queries per second. The MySQL server has been optimized
> to hell and back and is running on high-end hardware, but just simply
> doesn't scale as more and more mail begins to roll in.

I'd be interested in your setup (MySQL and SA). I have no problems getting 4200+ queries per second on a single processor machine, with spamd running on the same box. It's not even sweating that hard.

When you are at peak, what is your average scantime per msg?

Michael
Re: update on floating dividing score between spam and ham messages
jdow wrote:
> A few weeks ago I'd have said "Easy, Ducky!" Then I ran into DoveCot
> that uses an indexed almost "mbox" file. There is no way to do it
> other than "good guess". However, for a traditional UNIX mbox file
> you should be able to nail it perfectly simply looking for the "From"
> feature. The dirt stupid "mail" utility looks for a blank line
> followed by a line that starts with "From". All other lines that
> start with From are supposed to be escaped to ensure accurate
> detection. DoveCot skips this blank line feature sometimes. "mail"
> does not like this. I have not yet seen any indication that SA is
> upset with this, however.

Just to be pedantic, it's actually (IIRC) a double newline followed by "From " (note the space! It's important.) Many mail-handling apps will actually parse the From-space "header" in more detail, "just in case".

grep "^From " |wc -l

typically gives an accurate count; procmail at least is bright enough to escape message body lines such that they don't break this.

-kgd
--
Get your mouse off of there! You don't know where that email has been!
Re: update on floating dividing score between spam and ham messages
Loren Wilton wrote on Mon, 11 Jul 2005 11:30:07 -0700: > Which of course means that by picking the ratio value you can pick pretty > much any fp/fn ratio you want. Only if the distribution was equal. Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com IE-Center: http://ie5.de & http://msie.winware.org
Re: update on floating dividing score between spam and ham messages
Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400:

> We are very glad and happy about this concept and implementation.

Well, the big question is: how many of your spam messages score between the default 5 and your "floating score"? If it is many, there's obviously something wrong with your setup: your spam is not scoring high enough. Additionally, it means that your Bayes auto-learn will feed less spam to learn than it could, because your overall spam score is way too low.

Our average ham score is indeed around -2, as yours is. And it's a very high peak; there are more mails at -2 than all other ham mails combined. However, our spam score peak is *way* higher than yours is: it "flattens" between 18 and 30, so the average is somewhere around 25 or so. (I deduced that from looking at the raw figures, not by calculating a median or average.) I consider your average spam score of 6 as *extremely* bad from a detection standpoint.

With a score of 0.5 I would get a *considerable* amount of ham scored as spam. With the default of 5 we get almost none, not even one per day. I doubt that your rate of FPs is nearly nonexistent with a spam threshold of 0.5. There *must* be a considerable rate of FPs, you just don't hear about it.

I think the general approach on this list is to make spam score as spammy as possible. That's what we do as well. Instead of driving spam to the sky you are trying to find some non-existing "barrier" which may indeed float, because tomorrow's messages score differently than yesterday's. It does not float at all in the long run. And it exists *only* in the long run. It may throw off the next day's detection quite heavily, since there's no guarantee spam and ham look the same the next day or even float around that point. It's not even a statistical figure; you deliberately set it to 30%, probably because you get too much spam if you set it higher. That's bad, really bad detection ...

If much of your spam is lower than 5, then the spam detection rate of your SA is quite bad. You should improve that instead of trying to find a barrier which gives you the best FP:FN ratio. It may indeed give you the best ratio with your bad setup, but not the lowest FP rate and probably not the best ratio compared with a setup that drives spam to the sky.

I see your approach as an interesting way of optimizing the threshold when you don't get optimal scores. But you would be better off optimizing the scores.

BTW: what exactly does "normalized" mean in this context?

Kai

--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org
Re: SA 2.63 vs 2.64
On Sun, 10 Jul 2005, Matthias Fuhrmann wrote:

[...]
> # jm: do not...
>
> the lines from Bayes.pm fit the error messages. didn't check
> PerMsgStatus.pm, but i guess it's the same issue.
> can someone explain the difference or the impact on the problem described
> above?
>
> what about replacing the line of 2.64 with the old working one from 2.63?
> hope i'm not too wrong, since i've been debugging for some hours now :)

just in case someone starts bothering: i've upgraded to 3.0.4, and surprisingly there were only some rules to fix and the bayes db, which we had to convert. best of all, the error messages from 2.64 are gone and the syslog output is now a lot more verbose, very nice :)

regards,
Matthias
Re: Performance: files or SQL?
Mike Jackson wrote:
> On my personal server, I'm running SA 3.0.4 with the user prefs, Bayes,
> and AWL in a MySQL database (mostly because it would be "cooler" that
> way). On my employer's server, I'm running the same SA version, but with
> file-based DBs and user prefs. We're going to be rolling out doing
> filtering for all our mailboxes (several hundred) as opposed to opt-in
> (as we're doing it now on about 20 accounts). I know I could do
> benchmarks myself, but I wanted to get your impressions if there's a
> performance improvement using SQL for storage (user prefs, Bayes, AWL)
> rather than files. Thanks.

SQL simply doesn't scale very well for bayes. We have a serverfarm of 12 spamassassin servers and are storing bayes in SQL. We see on average about 4000 queries per second. The MySQL server has been optimized to hell and back and is running on high-end hardware, but it just simply doesn't scale as more and more mail begins to roll in.

Cami
Re: procmail: Could not create INET socket on 127.0.0.1:783: Permission denied
From: <[EMAIL PROTECTED]>
> Hello,
>
> I set up spamassassin to work with procmail according to instructions.
> Here is what is in ~/.procmailrc:
>
> #SPAM ASSASSIN SECTION
>
> :0fw: spamd.lock
> * < 256000
> | /usr/sbin/spamd
                ^
The spamd tool is run as a daemon. You want spamc here. On Mandrake, RedHat, and I believe SUSE, start spamd with the "chkconfig spamassassin on; service spamassassin start" mantra if "service spamassassin status" does not report that it is running.

> :0:
> * ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
> almost-certainly-spam
>
> :0:
> * ^X-Spam-Status: Yes
> probably-spam
>
> :0
> * ^^rom[ ]
> {
>   LOG="*** Dropped F off From_ header! Fixing up. "
>
>   :0 fhw
>   | sed -e '1s/^/F/'
> }
>
> #===END SPAM ASSASSIN SECTION==

{^_^}
procmail: Could not create INET socket on 127.0.0.1:783: Permission denied
Hello,

I set up spamassassin to work with procmail according to instructions. Here is what is in ~/.procmailrc:

#SPAM ASSASSIN SECTION

:0fw: spamd.lock
* < 256000
| /usr/sbin/spamd

:0:
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
almost-certainly-spam

:0:
* ^X-Spam-Status: Yes
probably-spam

:0
* ^^rom[ ]
{
  LOG="*** Dropped F off From_ header! Fixing up. "

  :0 fhw
  | sed -e '1s/^/F/'
}

#===END SPAM ASSASSIN SECTION==

However, spamd is failing to run. This is what I see in the procmail log:

procmail: Unlocking "/home/user/.lockmail"
procmail: [19772] Mon Jul 11 11:18:53 2005
procmail: Match on "< 256000"
procmail: Locking "spamd.lock"
procmail: Executing "/usr/sbin/spamd"
Could not create INET socket on 127.0.0.1:783: Permission denied (IO::Socket::INET: Permission denied)
procmail: [19772] Mon Jul 11 11:18:54 2005
procmail: Program failure (13) of "/usr/sbin/spamd"
Performance: files or SQL?
On my personal server, I'm running SA 3.0.4 with the user prefs, Bayes, and AWL in a MySQL database (mostly because it would be "cooler" that way). On my employer's server, I'm running the same SA version, but with file-based DBs and user prefs. We're going to be rolling out doing filtering for all our mailboxes (several hundred) as opposed to opt-in (as we're doing it now on about 20 accounts). I know I could do benchmarks myself, but I wanted to get your impressions if there's a performance improvement using SQL for storage (user prefs, Bayes, AWL) rather than files. Thanks. Mike Jackson Tech Administrator, Datahost www.datahost.com
Re: update on floating dividing score between spam and ham messages
A few weeks ago I'd have said "Easy, Ducky!" Then I ran into DoveCot that uses an indexed almost "mbox" file. There is no way to do it other than "good guess". However, for a traditional UNIX mbox file you should be able to nail it perfectly simply looking for the "From" feature. The dirt stupid "mail" utility looks for a blank line followed by a line that starts with "From". All other lines that start with From are supposed to be escaped to ensure accurate detection. DoveCot skips this blank line feature sometimes. "mail" does not like this. I have not yet seen any indication that SA is upset with this, however.

{^_^}

- Original Message -
From: "Joe Flowers" <[EMAIL PROTECTED]>

> Matt:
>
> I know you know a lot more about this than I do, but for what it's
> worth, your impressions/intuitions are very close to mine.
> Originally back in April, I started off using the "average of the
> means", but that let through way too much spam.
>
> So, what I have now is it set to 30% above the average spam score, which
> is 20% below the "average of the means".
> The assumption being that the optimal spot is somewhere between the two
> averages.
>
> Also, that nasty drop off that produces a lot of FPs is in my intuition
> too and as of yet, we haven't run into it.
>
> Now, if the two curves could be slid apart wider so that there is a big
> deadzone,... Although, without upgrading to a newer version of SA, I
> don't see how I can expect much better results.
>
> BTW, if anyone knows a command line program that can easily run through a
> bunch of mbox files and tell how many messages are in them, I will
> report back how many ham and how many spam messages that I have fed to
> bayes. It's far from perfect, but it may offer some interesting info
> regarding the 100:1 (fn:fp) ratio.
>
> Joe
>
> Matt Kettler wrote:
>
> >Joe Flowers wrote:
> >
> >>Matt Kettler wrote:
> >>
> >>>The only problem I see with this approach is that it treats false
> >>>positives and false negatives as being equally bad.
> >>
> >>We do get many more false negatives than false positives, even though we
> >>don't get false positives very often - they are rare.
> >>We certainly don't get 1 fp for every fn.
> >>
> >>>In general, you're adjusting the score bias so the number of FP's and
> >>>FNs are approximately equal.
> >>
> >>This is not what we are seeing in practice. It's not even close to 50-50.
> >
> >Based on JM's comments about the score distribution for hams being non-linear,
> >this makes sense. If the distribution was linear for both you'd get 50/50 by
> >dividing the score between the two means.
> >
> >Since the ham is going to have a pretty sharp drop-off somewhere slightly above
> >its mean your split score approach won't be as bad as 1:1, but it's also likely
> >to not be as good as 100:1 which the 5.0 threshold should get you.
> >
> >It's an interesting concept, and it would be very interesting to graph out FP vs
> >FN rates against thresholds.
> >
> >This graph from JM's post is real data:
> >http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
> >
> >But it doesn't go below 5.0. It would be interesting to see how those curves
> >continue as you approach 0.
> >
> >This graph is a good conceptual one in the "normal" sense of numbers:
> >http://taint.org/xfer/2005/score-dist-doodle.gif
> >
> >That graph would suggest that somewhere below 5.0 there is a threshold at which
> >the ham FP rate gets MUCH worse in a very sudden way. However, there's no score
> >associated. I'd venture to guess that your "average of the means" is going to
> >wind up picking something near, but just above that threshold.
> >
> >That's a bit of an intuitive guess, but also it has some roots in reality. The
> >average score of a ham message on a curve like that is going to wind up being
> >somewhere in the middle of that nasty drop off. By biasing just above that you
> >should bring yourself into the second part of the curve, where decreases in
> >score have a somewhat modest impact on FP rate.
RE: SURBL, SA 3.0.4, and firewalls
> All it needs is port 53 TCP and UDP open (outbound). How you do that
> depends on what firewall product you use. A bit of Google with what
> ports on what product will yield what you should need.

One thing to note... if your firewall is proxying for you, make sure it doesn't think it's authoritative for the 127.0.0.X stuff. Ours did, and when it got a reply back from the SURBL servers with a result of 127.0.0.10, for example, the firewall actually returned NXDOMAIN because it saw that the results were in a domain it was authoritative for, and discarded them as invalid.

johnS
Re: update on floating dividing score between spam and ham messages
jdow wrote: > The greater the separation the > better the results for a decision point between them. > But anything you can do that widens the > typical score distribution between ham and spam is a good thing. Amen
Re: update on floating dividing score between spam and ham messages
> There's another thing worth noting -- the SpamAssassin score distribution
> for hams and spams isn't even.

I don't see that those particular curve shapes necessarily invalidate this method, although they do bias the method somewhat. The two curves are essentially smooth curves with no major dips or bumps in them, so it is possible to select a ratio without getting inversions in the ratio as the selector moves from left to right.

You may have to be careful when calculating the ratio, given that ham goes to effectively zero above a certain value. But n:0 and 3.45n:0 are still perfectly valid ratios to deal with, even if one of the terms is zero.

Loren
Re: update on floating dividing score between spam and ham messages
Matt:

I know you know a lot more about this than I do, but for what it's worth, your impressions/intuitions are very close to mine. Originally back in April, I started off using the "average of the means", but that let through way too much spam.

So, what I have now is it set to 30% above the average spam score, which is 20% below the "average of the means". The assumption being that the optimal spot is somewhere between the two averages.

Also, that nasty drop off that produces a lot of FPs is in my intuition too and as of yet, we haven't run into it.

Now, if the two curves could be slid apart wider so that there is a big deadzone,... Although, without upgrading to a newer version of SA, I don't see how I can expect much better results.

BTW, if anyone knows a command line program that can easily run through a bunch of mbox files and tell how many messages are in them, I will report back how many ham and how many spam messages that I have fed to bayes. It's far from perfect, but it may offer some interesting info regarding the 100:1 (fn:fp) ratio.

Joe

Matt Kettler wrote:

> Joe Flowers wrote:
>
> > Matt Kettler wrote:
> >
> > > The only problem I see with this approach is that it treats false
> > > positives and false negatives as being equally bad.
> >
> > We do get many more false negatives than false positives, even though we
> > don't get false positives very often - they are rare.
> > We certainly don't get 1 fp for every fn.
> >
> > > In general, you're adjusting the score bias so the number of FP's and
> > > FNs are approximately equal.
> >
> > This is not what we are seeing in practice. It's not even close to 50-50.
>
> Based on JM's comments about the score distribution for hams being non-linear,
> this makes sense. If the distribution was linear for both you'd get 50/50 by
> dividing the score between the two means.
>
> Since the ham is going to have a pretty sharp drop-off somewhere slightly above
> its mean your split score approach won't be as bad as 1:1, but it's also likely
> to not be as good as 100:1 which the 5.0 threshold should get you.
>
> It's an interesting concept, and it would be very interesting to graph out FP vs
> FN rates against thresholds.
>
> This graph from JM's post is real data:
> http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
>
> But it doesn't go below 5.0. It would be interesting to see how those curves
> continue as you approach 0.
>
> This graph is a good conceptual one in the "normal" sense of numbers:
> http://taint.org/xfer/2005/score-dist-doodle.gif
>
> That graph would suggest that somewhere below 5.0 there is a threshold at which
> the ham FP rate gets MUCH worse in a very sudden way. However, there's no score
> associated. I'd venture to guess that your "average of the means" is going to
> wind up picking something near, but just above that threshold.
>
> That's a bit of an intuitive guess, but also it has some roots in reality. The
> average score of a ham message on a curve like that is going to wind up being
> somewhere in the middle of that nasty drop off. By biasing just above that you
> should bring yourself into the second part of the curve, where decreases in
> score have a somewhat modest impact on FP rate.
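The "graph out FP vs FN rates against thresholds" idea in the quoted message can be tried on your own corpus with a few lines of Python. This is a rough sketch: the score lists below are invented purely to make the example run, the function name is hypothetical, and it assumes a message is tagged as spam when its score is greater than or equal to the threshold.

```python
def error_rates(ham_scores, spam_scores, threshold):
    """Return (fp_rate, fn_rate) for one threshold.

    A false positive is a ham message scoring at or above the threshold;
    a false negative is a spam message scoring below it.
    """
    false_pos = sum(1 for score in ham_scores if score >= threshold)
    false_neg = sum(1 for score in spam_scores if score < threshold)
    return false_pos / len(ham_scores), false_neg / len(spam_scores)


# Invented scores, for illustration only; substitute scores from your logs.
ham = [-2.5, -2.0, -1.0, 0.6, 3.0]
spam = [1.0, 4.0, 6.0, 12.0, 25.0]

for threshold in (0.5, 2.0, 5.0):
    fp_rate, fn_rate = error_rates(ham, spam, threshold)
    print(threshold, fp_rate, fn_rate)
```

Running this over a fine grid of thresholds and plotting the two rates would show where, for a given site, the FP rate starts to climb sharply as the threshold drops toward 0.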
Re: update on floating dividing score between spam and ham messages
> > score of -2.1532284. I have the dividing line "set" at 30% of the
> > distance between the average ham score and average spam score (30% above
> > the average ham score). So, the dividing line is currently floating
> > around 0.55416414.
>
> The only problem I see with this approach is that it treats false positives and
> false negatives as being equally bad.

Matt, isn't he actually treating an FP as ~2x as bad as an FN? He has the divider set to 30%, so it is biased in one direction or the other.

Which of course means that by picking the ratio value you can pick pretty much any fp/fn ratio you want.

Loren
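The dividing line quoted above is plain linear interpolation between the two averages, and it can be checked with the figures given elsewhere in this thread (average spam score 6.8714134, average ham score -2.1532284). A short Python sketch, with a made-up function name:

```python
def floating_threshold(ham_mean, spam_mean, fraction=0.30):
    """Place the threshold a given fraction of the way from ham to spam."""
    return ham_mean + fraction * (spam_mean - ham_mean)


ham_mean = -2.1532284   # normalized average ham score from the thread
spam_mean = 6.8714134   # normalized average spam score from the thread

threshold = floating_threshold(ham_mean, spam_mean)
midpoint = floating_threshold(ham_mean, spam_mean, fraction=0.50)
print(round(threshold, 8), round(midpoint, 7))
```

The distance between the means is 9.0246418, and 30% of that is 2.70739254, so the threshold comes out at about 0.55416414, matching the quoted figure. With fraction=0.50 the same function gives the "average of the means", roughly 2.359.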
Re: update on floating dividing score between spam and ham messages
From: "Matt Kettler" <[EMAIL PROTECTED]>

> Joe Flowers wrote:
> > I don't know if this will help anyone or not, but I wanted to report
> > back just in case.
> >
> > In early April, I completely unhinged the dividing line between what SA
> > score is used to mark a message as spam or ham (5.00 = default). This
> > allows the system and this dividing line to drift "freely" to anywhere
> > that SA will allow, without bound. This anti-spam setup has worked
> > consistently much much better the whole time than in any previous
> > implementation that we have done and with very little maintenance. We
> > are very happy with it and are looking forward to implementing future SA
> > versions in the same fashion.
> >
> > I'm not exactly sure the following numbers represent the whole time
> > since April, but they should be pretty close.
> >
> > We've had 360,922 spam messages and 396,983 ham messages with a
> > normalized average spam score of 6.8714134 and a normalized average ham
> > score of -2.1532284. I have the dividing line "set" at 30% of the
> > distance between the average ham score and average spam score (30% above
> > the average ham score). So, the dividing line is currently floating
> > around 0.55416414.
>
> The only problem I see with this approach is that it treats false positives and
> false negatives as being equally bad.
>
> In general, you're adjusting the score bias so the number of FP's and FNs are
> approximately equal. Although STATISTICS*.txt would suggest that this boundary
> occurs somewhere near 2.0, your own local biases could change this considerably.
>
> SA's normal scoreset is evolved with the concept that it's better to have 99
> false negatives than 1 false positive. The concept here is most people use
> scripts to move their spam into a separate folder, or auto delete it. With that
> going on, a FP is potentially lost valid email, whereas a FN is a minor
> inconvenience.

Operating experience here seems to indicate that the SA score evolution is not optimum. What you want to do is create a brassiere curve for the markups for ham and spam. The greater the separation, the better the results for a decision point between them. The bias to prevent false negatives probably means you do not want the decision point right in the center. But anything you can do that widens the typical score distribution between ham and spam is a good thing. It makes the decision point less sensitive to set and the overall error rates lower.

I think this is part of the reason I have so much success on a box vastly overloaded with SARE and other rules. The good rules pile one on the other until it's VERY clear what is ham and what is spam. (It surely would be nice if there were some really good indications of "not spam". However, nothing has ever appeared other than absence of hits on spam-sign.)

{^_^}
RE: sa-learn on a wide site HOWTO ?
> Forget about this. Most of your users will only report spams,
> not ham, they're going to screw the bayes database. As a
> consequence, you'll have more spam, or more fp.
>
> You should find another solution or educate your users (but
> it takes too much time) so they feed correctly the bayesian filters.

I've heard this many times, but my experience thus far hasn't borne it out. We've got SA w/Bayes running site-wide on our 400-user system and Bayes_99 is consistently our highest-scoring test systemwide, even outscoring the various SBL and URIBL tests. That said, the Ham corpus is entirely my own; I don't bother to have my users submit anything but Spam. This works surprisingly well, so I guess I have good Ham. :)

My method is simple and fairly manual. I have my users put Spam in an Exchange Public Folder (substitute shared IMAP folder if you're using a more standard e-mail server) and copy them down into a local MBOX. Thunderbird is handy for this. I upload the MBOX file to the SA server, run sa-learn, and it's done. Initially I had to do this fairly often, but once I had it well trained and enough SARE rules in place it became less of an issue. I now run it only every other month or so.

Bayes covers a number of corner-cases that aren't covered by rules, so it's an important part of my overall strategy. It's also handy to train in new spam that hasn't hit the URIBLs or other rules yet, much easier than writing custom rules. Bayes hasn't given any false positives that I'm aware of in the last year, despite the theoretical skew that ought to be introduced by using everyone's Spam and only my Ham. I cannot tell you why, but it works and it works well.

Aaron Grewell
Network Administrator
University of Washington Bothell
Re: simultaneous sa-learn processes
From: "Chavdar Videff" <[EMAIL PROTECTED]>

> On Monday 11 July 2005 14:50, JamesDR wrote:
> > Chavdar Videff wrote:
> > > Hi List,
> > >
> > > Our mailserver serves about 100 users. Our config:
> > > Sendmail+Procmail+SpamAssassin.
> > > The question is:
> > > If I got it right, we should run sa-learn for each user in order to
> > > benefit from bayes. We intend to run a cron job for each user and do it
> > > at night by supplying a daily snapshot of our spam and ham collections to
> > > sa-learn. Can our mailserver handle it (256 MB RAM, Celeron 400 MHz)?
> > > A weekly collection run for 1 user usually eats 100% of CPU load. My
> > > concern is whether the system is going to crash or just do the job slower
> > > and if you can point out how many sa-learn tasks could we run
> > > simultaneously with our setup.
> > > All hints will be appreciated, for we scheduled an initial load for 16
> > > users of the big collection of spam received so far.
> > >
> > > Thanks guys
> > >
> > > Chavdar Videff
> >
> > What kind of Bayes db are you using? We use MySQL here and haven't seen
> > SA-Learn use up that much cpu... I've run it manually up to 10 processes
> > at once without any noticeable slowing of the machine. (p2 450mhz, 256mb)
>
> I guess it is BerkeleyDB, the default installation on Debian. The interesting
> part is that while testing cron on one user the cpu fall was not noticeable.

If you are feeding individual per-user Bayes databases, feed each one with the ham and spam samples submitted by that particular user for HER Bayes. If you have them all working off the same Bayes corpus, then there is little or no gain from using per-user Bayes.

{^_^}
Re: update on floating dividing score between spam and ham messages
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 the real-world figures can be seen for various thresholds in the rules/STATISTICS*.txt files... - --j. Matt Kettler writes: > Joe Flowers wrote: > > Matt Kettler wrote: > > > >> The only problem I see with this approach is that it treats false > >> positives and > >> false negatives as being equally bad. > >> > >> > > > > We do get many more false negatives than false positives, even though we > > don't get false positives very often - they are rare. > > We certainly don't get 1 fp for every fn. > > > >> In general, you're adjusting the score bias so the number of FP's and > >> FNs are > >> approximately equal. > > > > > > This is not what we are seeing in practice. It's not even close to 50-50. > > > > Based on JM's comments about the score distribution for hams being non-linear, > this makes sense. If the distribution was linear for both you'd get 50/50 by > dividing the score between the two means. > > Since the ham is going to have a pretty sharp drop-off somewhere slightly > above > it's mean your split score approach won't be as bad as 1:1, but it's also > likely > to not be as good as 100:1 which the 5.0 threshold should get you. > > It's an interesting concept, and it would be very interesting to graph out FP > vs > FN rates against thresholds. > > This graph from JM's post is real data: > http://spamassassin.apache.org/presentations/HEANet_2002/img12.html > > But it doesn't go below 5.0. It would be interesting to see how those curves > continue as you approach 0. > > This graph is a good conceptual one in the "normal" sense of numbers: > http://taint.org/xfer/2005/score-dist-doodle.gif > > That graph would suggest that somewhere below 5.0 there is a threshold at > which > the ham FP rate gets MUCH worse in a very sudden way. However, there's no > score > associated. I'd venture to guess that your "average of the means" is going to > wind up picking something near, but just above that threshold. 
> > That's a bit of an intuitive guess, but also it has some roots in reality. The > average score of a ham message on a curve like that is going to wind up being > somewhere in the middle of that nasty drop off. By biasing just above that you > should bring yourself into the second part of the curve, where decreases in > score have a somewhat modest impact on FP rate. -BEGIN PGP SIGNATURE- Version: GnuPG v1.2.5 (GNU/Linux) Comment: Exmh CVS iD8DBQFC0q8dMJF5cimLx9ARAuLrAKCQnoc8eo2rAvIDYIWX0DfW/T0NZgCePoyH WZS8C6aamuWZ3H6C6n8k2n0= =Hruw -END PGP SIGNATURE-
Re: How can I correctly detect these spams?
I repeat myself ;-) > It seems you are not using *any* custom rules. You may want to check out > RDJ and SARE. Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com IE-Center: http://ie5.de & http://msie.winware.org
Re: simultaneous sa-learn processes
Chavdar Videff wrote on Mon, 11 Jul 2005 16:13:44 +0300: > If there is a way to set up a single bayes database I would prefer that There is one, just look in the SA documentation. (documentation for local.cf should do.) Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com IE-Center: http://ie5.de & http://msie.winware.org
Re: update on floating dividing score between spam and ham messages
Joe Flowers wrote: > Matt Kettler wrote: > >> The only problem I see with this approach is that it treats false >> positives and >> false negatives as being equally bad. >> >> > > We do get many more false negatives than false positives, even though we > don't get false positives very often - they are rare. > We certainly don't get 1 fp for every fn. > >> In general, you're adjusting the score bias so the number of FP's and >> FNs are >> approximately equal. > > > This is not what we are seeing in practice. It's not even close to 50-50. > Based on JM's comments about the score distribution for hams being non-linear, this makes sense. If the distribution were linear for both you'd get 50/50 by dividing the score between the two means. Since the ham is going to have a pretty sharp drop-off somewhere slightly above its mean, your split score approach won't be as bad as 1:1, but it's also likely to not be as good as the 100:1 which the 5.0 threshold should get you. It's an interesting concept, and it would be very interesting to graph out FP vs FN rates against thresholds. This graph from JM's post is real data: http://spamassassin.apache.org/presentations/HEANet_2002/img12.html But it doesn't go below 5.0. It would be interesting to see how those curves continue as you approach 0. This graph is a good conceptual one in the "normal" sense of numbers: http://taint.org/xfer/2005/score-dist-doodle.gif That graph would suggest that somewhere below 5.0 there is a threshold at which the ham FP rate gets MUCH worse in a very sudden way. However, there's no score associated. I'd venture to guess that your "average of the means" is going to wind up picking something near, but just above that threshold. That's a bit of an intuitive guess, but also it has some roots in reality. The average score of a ham message on a curve like that is going to wind up being somewhere in the middle of that nasty drop off. By biasing just above that you should bring yourself into the second part of the curve, where decreases in score have a somewhat modest impact on FP rate.
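The "graph FP vs FN rates against thresholds" idea above can be tried on any locally scored corpus. The following is only a minimal sketch: the function name `error_rates` and both score lists are invented for illustration, and it assumes a message is treated as spam when its score is at or above the threshold.

```python
def error_rates(ham_scores, spam_scores, threshold):
    """Return (fp_rate, fn_rate) for one candidate spam threshold.

    Assumption: a message is flagged as spam when score >= threshold.
    A flagged ham is a false positive; an unflagged spam is a false negative.
    """
    fp = sum(1 for s in ham_scores if s >= threshold)
    fn = sum(1 for s in spam_scores if s < threshold)
    return fp / len(ham_scores), fn / len(spam_scores)


# Hypothetical scores, made up purely to exercise the function.
ham = [-2.6, -2.1, -1.9, -0.5, 0.3, 1.2, 2.8]
spam = [1.5, 3.9, 4.7, 6.2, 7.8, 9.4, 12.0, 15.3]

for threshold in (0.5, 2.0, 5.0):
    fp_rate, fn_rate = error_rates(ham, spam, threshold)
    print(f"threshold {threshold:4.1f}: FP {fp_rate:.1%}  FN {fn_rate:.1%}")
```

Running the same loop over real per-message scores, with a finer grid of thresholds, would give the two curves discussed here, including the region below 5.0.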
Re: update on floating dividing score between spam and ham messages
Thanks Jason! That's good, new info for me. That'll help me *at the very least* visualize what I am trying to do a little better. I've been very curious to know what the rough shapes of those graphs look like.

Joe

Justin Mason wrote:
> There's another thing worth noting -- the SpamAssassin score distribution for hams and spams isn't even. If you draw a graph of hams and spams, plotting the number of mails in each category as the vertical axis and the score they get as the horizontal axis, you don't get a simple pair of intersecting straight lines.
>
> Instead, since we have many more mark-as-spam rules than mark-as-ham, and due to how the perceptron attempts to optimise for the 5.0 threshold, what happens is that you have two different lines. The ham line is a sigmoid curve, that starts high in the negative area, and curves down to almost 0 at the 5.0 threshold mark. The spam line, by contrast, is a straight line.
>
> http://taint.org/xfer/2005/score-dist-doodle.gif is a doodle to illustrate this, or take a look at http://spamassassin.apache.org/presentations/HEANet_2002/img12.html for real-world graphs of this data from 2002 -- although graphing the inverse.
>
> Very interesting approach though!
>
> --j.
Re: update on floating dividing score between spam and ham messages
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 There's another thing worth noting -- the SpamAssassin score distribution for hams and spams isn't even. If you draw a graph of hams and spams, plotting the number of mails in each category as the vertical axis and the score they get as teh horizontal axis, you don't get a simple pair of intersecting straight lines. Instead, since we have many more mark-as-spam rules than mark-as-ham, and due to how the perceptron attempts to optimise for the 5.0 threshold, what happens is that you have two different lines. The ham line is a sigmoid curve, that starts high in the negative area, and curves down to almost 0 at the 5.0 threshold mark. The spam line, by contrast, is a straight line. http://taint.org/xfer/2005/score-dist-doodle.gif is a doodle to illustrate this, or take a look at http://spamassassin.apache.org/presentations/HEANet_2002/img12.html for real-world graphs of this data from 2002 -- although graphing the inverse. Very interesting approach though! - --j. Joe Flowers writes: > Matt Kettler wrote: > > >The only problem I see with this approach is that it treats false positives > >and > >false negatives as being equally bad. > > > > > > We do get many more false negatives than false positives, even though we > don't get false positives very often - they are rare. > We certainly don't get 1 fp for every fn. > > >In general, you're adjusting the score bias so the number of FP's and FNs are > >approximately equal. > > > > This is not what we are seeing in practice. It's not even close to 50-50. > > >Although STATISTICS*.txt would suggest that this boundary > >occurs somewhere near 2.0, your own local biases could change this > >considerably. > > > > > >SA's normal scoreset is evolved with the concept that it's better to have 99 > >false negatives than 1 false positive. > > > > We are very glad and happy about this concept and implementation. 
> > >The concept here is most people use > >scripts to move their spam into a separate folder, or auto delete it. With > >that > >going on, a FP is potentially lost valid email, whereas a FN is a minor > >inconvenience. > > > > > > Yes We work hard to inform our users and to actively solicit their > feedback on how the system is working and to lookout for the system > misplacing emails, especially valid ones. I know it's still not perfect > > >For any site that considers FPs to be "not too bad" because all mail is > >manually > >examined anyway, lowering the score threshold may be a workable thing. > > > >However, other sites that auto-delete such messages may have considerable > >problems if they lower the threshold > > > > > > YES! -BEGIN PGP SIGNATURE- Version: GnuPG v1.2.5 (GNU/Linux) Comment: Exmh CVS iD8DBQFC0qYfMJF5cimLx9ARAp+YAJ0X7eoijcnMOE+3WkOlfQQEzasjwgCfZp9B TdyM6BfLga48fgif1AzBW7U= =qdan -END PGP SIGNATURE-
Re: sa-learn on a wide site HOWTO ?
On 16:56, Mon 11 Jul 05, Karl.Oulmi wrote: > Hi, > > I always have a box with postfix/amavis and SpamAssassin running. > Now, I'd like to run sa-learn so that my users (~500) can teach Spam & Ham > to SpamAssassin. > > The idea is the following. > On every mail passed through my mailserver, a header or a footer is > added to the mail with a mailto link that lets my users tell > SpamAssassin whether the mail is spam or not. Forget about this. Most of your users will only report spam, not ham, so they're going to screw up the Bayes database. As a consequence, you'll have more spam, or more FPs. You should find another solution or educate your users (but it takes too much time) so they feed the Bayesian filters correctly.
Re: update on floating dividing score between spam and ham messages
Matt Kettler wrote:
> The only problem I see with this approach is that it treats false positives and false negatives as being equally bad.

We do get many more false negatives than false positives, even though we don't get false positives very often - they are rare. We certainly don't get 1 fp for every fn.

> In general, you're adjusting the score bias so the number of FP's and FNs are approximately equal.

This is not what we are seeing in practice. It's not even close to 50-50.

> Although STATISTICS*.txt would suggest that this boundary occurs somewhere near 2.0, your own local biases could change this considerably.
>
> SA's normal scoreset is evolved with the concept that it's better to have 99 false negatives than 1 false positive.

We are very glad and happy about this concept and implementation.

> The concept here is most people use scripts to move their spam into a separate folder, or auto delete it. With that going on, a FP is potentially lost valid email, whereas a FN is a minor inconvenience.

Yes. We work hard to inform our users and to actively solicit their feedback on how the system is working and to look out for the system misplacing emails, especially valid ones. I know it's still not perfect.

> For any site that considers FPs to be "not too bad" because all mail is manually examined anyway, lowering the score threshold may be a workable thing.
>
> However, other sites that auto-delete such messages may have considerable problems if they lower the threshold

YES!
RE: spamassassin with GORDANO
> Does anyone know If I can use Spammain with GMS (Gordano > Mail Software for Linux) In theory, you could use MailScanner as a proxy in front of GMS to run SpamAssassin before the message gets to GMS. And, if I recall correctly (I haven't used GMS for several years), I think you can use their MML scripting language to run an application, so you should be able to run SpamAssassin from there and replace the original message with the tagged version if your MML scripting skills are adequate for doing that. Bret
Re: update on floating dividing score between spam and ham messages
Joe Flowers wrote: > I don't know if this will help anyone or not, but I wanted to report > back just in case. > > In early April, I completely unhinged the dividing line between what SA > score is used to mark a message as spam or ham (5.00 = default). This > allows the system and this dividing line to drift "freely" to anywhere > that SA will allow, without bound. This anti-spam setup has worked > consistently much much better the whole time than in any previous > implementation that we have done and with very little maintenance. We > are very happy with it and are looking forward to implementing future SA > versions in the same fashion. > > I'm not exactly sure the following numbers represent the whole time > since April, but they should be pretty close. > > We've had 360,922 spam messages and 396,983 ham messages with a > normalized average spam score of 6.8714134 and a normalized average ham > score of -2.1532284. I have the dividing line "set" at 30% of the > distance between the average ham score and average spam score (30% above > the average ham score). So, the dividing line is currently floating > around 0.55416414. The only problem I see with this approach is that it treats false positives and false negatives as being equally bad. In general, you're adjusting the score bias so the number of FP's and FNs are approximately equal. Although STATISTICS*.txt would suggest that this boundary occurs somewhere near 2.0, your own local biases could change this considerably. SA's normal scoreset is evolved with the concept that it's better to have 99 false negatives than 1 false positive. The concept here is most people use scripts to move their spam into a separate folder, or auto delete it. With that going on, a FP is potentially lost valid email, whereas a FN is a minor inconvenience. For any site that considers FPs to be "not too bad" because all mail is manually examined anyway, lowering the score threshold may be a workable thing. However, other sites that auto-delete such messages may have considerable problems if they lower the threshold.
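The arithmetic behind the floating dividing line quoted above is a plain linear interpolation between the two averages. A small sketch that reproduces it; the function name `floating_threshold` and the `fraction` parameter are made up here, only the two averages and the 30% figure come from the post.

```python
def floating_threshold(avg_ham, avg_spam, fraction=0.30):
    """Place the dividing line `fraction` of the way from the average
    ham score up towards the average spam score."""
    return avg_ham + fraction * (avg_spam - avg_ham)


# Averages as reported in the quoted message.
avg_ham = -2.1532284
avg_spam = 6.8714134

# 30% of the distance above the average ham score.
print(round(floating_threshold(avg_ham, avg_spam), 6))
```

With these inputs the result agrees with the reported figure of roughly 0.554. Raising `fraction` moves the line towards the spam average, which trades false positives for false negatives.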
RE: Bypass URI check
Title: Bypass URI check I'm thinking it may be time for SARE to look at this phrase: "then copy // paste the below page into your window: " I'll see what I can do with it. --Chris (I also love the black ;) -Original Message-From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]Sent: Monday, July 11, 2005 10:42 AMTo: users@spamassassin.apache.orgSubject: Bypass URI check Hi All, I have received a few messages like the following. This asks the receiver to copy and past the link into their web browser. Since the href is missing, there is no URI check. That sucks, because the URIBL is my best friend right now (love black). We are close to marking it and URIBL would have definitely got it over. Any ideas on handling this? Microsoft Mail Internet Headers Version 2.0 Received: from .atco.ca ([xxx.xxx.10.122]) by .atco.com with Microsoft SMTPSVC(5.0.2195.6713); Mon, 11 Jul 2005 08:01:29 -0600 Received: from .atco.ca ([xxx.xxx.10.122]) by .atco.ca (SMSSMTP 4.0.0.59) with SMTP id M2005071108012819018 ; Mon, 11 Jul 2005 08:01:28 -0600 Received: from [58.224.196.19] (helo=xxx.xxx.10.122) by .atco.ca with smtp (Exim ) id 1Dryqd-0001kd-Sa; Mon, 11 Jul 2005 08:01:28 -0600 X-Orcpt: rfc727;zmailer-log Message-ID: <[EMAIL PROTECTED]> Date: Mon, 11 Jul 2005 10:58:46 -0400 From: "Joan Kerry " To: "Joan Kerry " <[EMAIL PROTECTED]> Cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Appetie Suppresant Mon, 11 Jul 2005 07:04:46 -0800 MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="--4.AMLHjl9J5.pLVdgIYrJD9" X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on .atco.ca X-Spam-Level: X-Spam-Status: No, score=4.3 required=5.0 tests=J_CHICKENPOX_55,MANGLED_STOP, RCVD_HELO_IP_MISMATCH,RCVD_NUMERIC_HELO autolearn=disabled version=3.0.3 Return-Path: [EMAIL PROTECTED] X-OriginalArrivalTime: 11 Jul 2005 14:01:29.0016 (UTC) FILETIME=[08D53380:01C58621] 4.AMLHjl9J5.pLVdgIYrJD9 Content-Type: text/plain; format=flowed; charset=iso-8859-15 
Content-Transfer-Encoding: 7Bit 4.AMLHjl9J5.pLVdgIYrJD9 Content-Type: text/plain; format=flowed; charset=iso-8859-15 Content-Transfer-Encoding: 7Bit 4.AMLHjl9J5.pLVdgIYrJD9-- [EMAIL PROTECTED] Do you find it difficult to cut down on delicious foods filled with carbs? such as pasta,cakes,breads,potato chips and ice cream? if you are one of the people...then copy // paste the below page into your window: slimfat.info Kindly, Joan Kerry - 2refrain: s-t-o-p.info
sa-learn on a wide site HOWTO ?
Hi, I always have a box with postfix/amavis and SpamAssassin running. Now, I'd like to run sa-learn so that my users (~500) can teach Spam & Ham to SpamAssassin. The idea is the following. On every mail passed through my mailserver, a header or a footer is added to the mail with a mailto link that lets my users tell SpamAssassin whether the mail is spam or not. Has anybody ever implemented this solution? Does anyone have a howto or a good URL on this subject? Many thanks KARL :)
Re: Rule: envelope to <> header to - help?
Michael W Cocke wrote:
> Does anyone have a rule to check the envelope To: against the header
> to: ? I'm sure that there's a reason why it's allowed to be different,
> but it doesn't apply here, and almost half of the spam that gets thru
> everything else would get stopped by that.

No. It's generally not possible, as SA does not have access to the envelope.

Also, bear in mind that there are LOTS of reasons why it would be allowed to be different, and your location is not likely to be an exception, despite what you think. For example, all posts sent to any mailing list, including this mailing list, will mismatch.

Unless your site does ALL of the following, you'll have mismatches:

-No users subscribe to ANY mailing lists, including listservs as well as commercial newsletters, etc.
-No users receive mail from anyone that uses bcc.
-No users may have mail redirected from another account (ie: auto-forward from yahoo)

(need I go on?)

In general the only systems that won't get mismatches between the envelope and the to: are systems that don't receive any internet mail except single-user to single-user messages. And that's got to be strict, NO other internet email but single-user to single-user.
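To see why such a check misfires, here is a rough sketch of the comparison being asked for. It assumes some other component (the MTA, for instance) hands over the envelope recipient separately, since, as noted above, SA itself does not have it. The function name and all addresses are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses


def envelope_matches_header(envelope_rcpt, raw_message):
    """Return True if the envelope recipient appears in To: or Cc:."""
    msg = message_from_string(raw_message)
    header_values = msg.get_all("To", []) + msg.get_all("Cc", [])
    header_addrs = {addr.lower() for _name, addr in getaddresses(header_values)}
    return envelope_rcpt.lower() in header_addrs


# A direct, single-user to single-user message: the addresses agree.
direct = "From: bob@example.org\nTo: Alice <alice@example.com>\n\nhi\n"

# A mailing-list post: delivered to alice, but addressed to the list.
list_post = "From: bob@example.org\nTo: users@lists.example.net\n\nhi\n"

print(envelope_matches_header("alice@example.com", direct))     # True
print(envelope_matches_header("alice@example.com", list_post))  # False
```

The second case is perfectly legitimate mail, so a rule built on this comparison would penalise every list post, bcc and forwarded message.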
Bypass URI check
Title: Bypass URI check Hi All, I have received a few messages like the following. This asks the receiver to copy and past the link into their web browser. Since the href is missing, there is no URI check. That sucks, because the URIBL is my best friend right now (love black). We are close to marking it and URIBL would have definitely got it over. Any ideas on handling this? Microsoft Mail Internet Headers Version 2.0 Received: from .atco.ca ([xxx.xxx.10.122]) by .atco.com with Microsoft SMTPSVC(5.0.2195.6713); Mon, 11 Jul 2005 08:01:29 -0600 Received: from .atco.ca ([xxx.xxx.10.122]) by .atco.ca (SMSSMTP 4.0.0.59) with SMTP id M2005071108012819018 ; Mon, 11 Jul 2005 08:01:28 -0600 Received: from [58.224.196.19] (helo=xxx.xxx.10.122) by .atco.ca with smtp (Exim ) id 1Dryqd-0001kd-Sa; Mon, 11 Jul 2005 08:01:28 -0600 X-Orcpt: rfc727;zmailer-log Message-ID: <[EMAIL PROTECTED]> Date: Mon, 11 Jul 2005 10:58:46 -0400 From: "Joan Kerry " To: "Joan Kerry " <[EMAIL PROTECTED]> Cc: [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Appetie Suppresant Mon, 11 Jul 2005 07:04:46 -0800 MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="--4.AMLHjl9J5.pLVdgIYrJD9" X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on .atco.ca X-Spam-Level: X-Spam-Status: No, score=4.3 required=5.0 tests=J_CHICKENPOX_55,MANGLED_STOP, RCVD_HELO_IP_MISMATCH,RCVD_NUMERIC_HELO autolearn=disabled version=3.0.3 Return-Path: [EMAIL PROTECTED] X-OriginalArrivalTime: 11 Jul 2005 14:01:29.0016 (UTC) FILETIME=[08D53380:01C58621] 4.AMLHjl9J5.pLVdgIYrJD9 Content-Type: text/plain; format=flowed; charset=iso-8859-15 Content-Transfer-Encoding: 7Bit 4.AMLHjl9J5.pLVdgIYrJD9 Content-Type: text/plain; format=flowed; charset=iso-8859-15 Content-Transfer-Encoding: 7Bit 4.AMLHjl9J5.pLVdgIYrJD9-- [EMAIL PROTECTED] Do you find it difficult to cut down on delicious foods filled with carbs? such as pasta,cakes,breads,potato chips and ice cream? 
if you are one of the people...then copy // paste the below page into your window: slimfat.info Kindly, Joan Kerry - 2refrain: s-t-o-p.info
Re: SURBL & SA 3.0.4
Dr Robert Young wrote: > Is there a particular "port" and/or "protocol (TCP/UDP) that must be > opened on any firewalls that might be on the network for the plugin to > work? You don't "need" to open any ports, however you must be able to resolve DNS queries. In general you can test it by using "host www.spamassassin.org". If you get an answer back, DNS works. If not, DNS doesn't. In general your nameserver must be able to perform queries to port 53 as a UDP client. If your firewall is stateful, you only need to open it in the outbound direction (if you've locked down outbound traffic at all). If it's a stateless packet filter, then you'll need to open both directions. You can set which nameservers your SA box will use in /etc/resolv.conf.
Re: (repost) bayes_ignore_from with wildcard ?
At 04:43 AM 7/11/2005, [EMAIL PROTECTED] wrote:
> Hello,
> Does anyone know if this will work:
> bayes_ignore_from [EMAIL PROTECTED]
> The docs don't say specifically if this kind of directive is allowed.
> They do say that this kind of thing will work for whitelist_from.

We all got your message the first time.

No, I don't know. But from a casual glance at the code, it should work.

conf.pm builds a list named bayes_ignore_from.

bayes.pm calls:

    $ignore = $PMS->check_from_in_list('bayes_ignore_from') || $PMS->check_to_in_list('bayes_ignore_to');

check_from_in_list is actually implemented in EvalTests.pm:

    sub check_from_in_list {
      my ($self,$list) = @_;
      my $list_ref = $self->{conf}{$list};
      warn "Could not find list $list" unless defined $list_ref;
      foreach my $addr (all_from_addrs $self) {
        return 1 if _check_whitelist $self $list_ref, $addr;
      }
      return 0;
    }

_check_whitelist is the same comparison function the black and whitelists use, so it should work the same.

Although by looking at _check_whitelist, I wonder if it works the way the docs say. The docs claim it's file glob and not regex, but _check_whitelist looks a lot like it does a regex.
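One possible way to reconcile "the docs say file glob" with "the code looks like a regex" is that glob patterns are often translated into anchored regular expressions before matching. The sketch below shows that general technique only; it has not been checked against the SpamAssassin source, and the function names and example.com addresses are made up for illustration.

```python
import re


def glob_to_regex(pattern):
    """Translate a file-glob style address pattern into an anchored regex.

    Only '*' (any run of characters) and '?' (any single character) act
    as wildcards; every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)


def address_in_list(address, patterns):
    """Return True if the address matches any glob pattern in the list."""
    return any(glob_to_regex(p).match(address) for p in patterns)


ignore_from = ["*@example.com"]  # hypothetical wildcard entry
print(address_in_list("newsletter@example.com", ignore_from))     # True
print(address_in_list("someone@example.com.evil.net", ignore_from))  # False
```

If the real comparison works along these lines, a wildcard entry would behave the same way for bayes_ignore_from as it does for whitelist_from, which is what the reading of the code above suggests.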
Re: simultaneous sa-learn processes
On Monday 11 July 2005 15:31, Kai Schaetzl wrote: > Chavdar Videff wrote on Mon, 11 Jul 2005 13:40:14 +0300: > > If I got it right, we should run sa-learn for each user in order to > > benefit from bayes. We intend to run a cron job for each user and do it > > at night by supplying a daily snapshot of our spam and ham collections to > > sa-learn. > > Do I understand you correctly? You use Bayes for each user, but you want to > sa-learn each of them the same daily corpus? This means the only difference > in the user's Bayes db's will be auto-learned mail or mail learned by those > users (if anything of that is possible/allowed with your setup). Doesn't > look too useful to me. If most of the db content is the same then you could > just use a site-wide db. Also, Bayes gets better the more mail it gets. If > your users don't get much mail their individual Bayes db's won't be very > effective. I'm all for using site-wide Bayes unless your users get really a > lot of mail (I'd say at least 100 mails per user per day). > > Kai I thought it was installed site-wide, however the only bayes db's I find on the system are in each user's ~/.spamassassin folder. And indeed, the only way I can make bayes learn is by teaching it on a per-user basis. For quite a few months I collected spam, fed it to sa-learn and finally, reading this list, realized that all I did was teach root's database. Everybody else did not benefit from bayes, which was screwed up because of autolearning a lot of spam as ham. If there is a way to set up a single bayes database I would prefer that, for the scenario I am posting about does not make me happy (running 100 sa-learns at night). Thanks Chavdar
Re: simultaneous sa-learn processes
Chavdar Videff wrote on Mon, 11 Jul 2005 13:40:14 +0300: > If I got it right, we should run sa-learn for each user in order to benefit > from bayes. We intend to run a cron job for each user and do it at night by > supplying a daily snapshot of our spam and ham collections to sa-learn. Do I understand you correctly? You use Bayes for each user, but you want to sa-learn each of them the same daily corpus? This means the only difference in the user's Bayes db's will be auto-learned mail or mail learned by those users (if anything of that is possible/allowed with your setup). Doesn't look too useful to me. If most of the db content is the same then you could just use a site-wide db. Also, Bayes gets better the more mail it gets. If your users don't get many mail their individual Bayes db's won't be very effective. I'm all for using site-wide Bayes unless you users get really a lot of mail (I'd say at least 100 mails per user per day). Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.conactive.com IE-Center: http://ie5.de & http://msie.winware.org
Re: simultaneous sa-learn processes
On Monday 11 July 2005 14:50, JamesDR wrote: > Chavdar Videff wrote: > > Hi List, > > > > Our mailserver server serves about 100 users. Our config: > > Sendmail+Procmail+SpamAssassin. > > The question is: > > If I got it right, we should run sa-learn for each user in order to > > benefit from bayes. We intend to run a cron job for each user and do it > > at night by supplying a daily snapshot of our spam and ham collections to > > sa-learn. Can our mailserver handle it (256 MB RAM, Celeron 400 Mhz)? > > A weekly collection run for 1 user usually eats 100% of CPU load. My > > concern is whether the system is going to crash or just do the job slower > > and if you can point out how many sa-learn tasks could we run > > simultaneously with our setup. > > All hints will be appreciated, for we scheduled an initial load for 16 > > users of the big collection of spam received so far. > > > > Thanks guys > > > > Chavdar Videff > > What kind of Bayes db are you using? We use MySQL here and haven't seen > SA-Learn use up that much cpu... I've run it manually up to 10 processes > at once without any noticeable slowing of the machine. (p2 450mhz, 256mb) I guess it is BerkeleyDB, the default installation on Debian. The interesting part is that while testing cron on one user the cpu fall was not noticeable. Chavdar Videff
RE: simultaneous sa-learn processes
JamesDR wrote: > Chavdar Videff wrote: >> Hi List, >> >> Our mailserver server serves about 100 users. Our config: >> Sendmail+Procmail+SpamAssassin. >> The question is: >> If I got it right, we should run sa-learn for each user in order to >> benefit from bayes. We intend to run a cron job for each user and do >> it at night by supplying a daily snapshot of our spam and ham >> collections to sa-learn. Can our mailserver handle it (256 MB RAM, >> Celeron 400 Mhz)? Why would you want to set up Bayes on a per-user basis if you are going to feed it system-wide hams and spams? Feeding it system-wide hams is especially odd. >> A weekly collection run for 1 user usually eats 100% of CPU load. My >> concern is whether the system is going to crash or just do the job >> slower and if you can point out how many sa-learn tasks could we run >> simultaneously with our setup. Systems shouldn't crash under high load, so that's not a real concern. If it does happen, you have a more serious problem elsewhere. What would be more of a concern is how it is going to affect other processes running on your system. Slower is not a problem, but if you really put the load on your box from a lot of processes, you might start seeing time-outs. >> All hints will be appreciated, for we scheduled an initial load for >> 16 users of the big collection of spam received so far. If you are going to simultaneously learn spam and ham for 16 users, and want to keep running your mailserver/spamassassin too (I take it you also have a virus scanner running somewhere), I would consider at least running the sa-learn processes under nice to keep them from stalling more essential services. But, depending on your system setup (OS, DB, etc.) you might want to cut down a little on the number of processes run simultaneously. >> >> Thanks guys >> >> Chavdar Videff >> >> > What kind of Bayes db are you using? We use MySQL here and > haven't seen SA-Learn use up that much cpu... I've run it > manually up to 10 processes at once without any noticeable > slowing of the machine. (p2 450mhz, 256mb)
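The "run sa-learn under nice, and not too many at once" advice can be sketched as a small wrapper that builds one low-priority command per batch and runs the batches strictly one after another. This is an illustration under assumptions, not a tested deployment script: the batch size of 200, the niceness of 19 and the file names are arbitrary choices, and only the `--spam` / `--ham` switches of sa-learn are used.

```python
import subprocess


def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_learn_commands(message_files, kind, batch_size=200, niceness=19):
    """Build one `nice ... sa-learn` command line per batch of files."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return [
        ["nice", "-n", str(niceness), "sa-learn", f"--{kind}", *batch]
        for batch in chunked(message_files, batch_size)
    ]


def run_sequentially(commands):
    """Run the commands one at a time, so only one sa-learn is ever active."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Hypothetical file list: 450 messages become three batches (200, 200, 50).
commands = build_learn_commands([f"spam/msg{i}" for i in range(450)], "spam")
print(len(commands))
```

Calling `run_sequentially(commands)` from a nightly cron job would keep the load to a single low-priority process at a time, at the cost of a longer total run.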
Re: simultaneous sa-learn processes
Chavdar Videff wrote: Hi List, Our mailserver server serves about 100 users. Our config: Sendmail+Procmail+SpamAssassin. The question is: If I got it right, we should run sa-learn for each user in order to benefit from bayes. We intend to run a cron job for each user and do it at night by supplying a daily snapshot of our spam and ham collections to sa-learn. Can our mailserver handle it (256 MB RAM, Celeron 400 Mhz)? A weekly collection run for 1 user usually eats 100% of CPU load. My concern is whether the system is going to crash or just do the job slower and if you can point out how many sa-learn tasks could we run simultaneously with our setup. All hints will be appreciated, for we scheduled an initial load for 16 users of the big collection of spam received so far. Thanks guys Chavdar Videff What kind of Bayes db are you using? We use MySQL here and haven't seen SA-Learn use up that much cpu... I've run it manually up to 10 processes at once without any noticeable slowing of the machine. (p2 450mhz, 256mb) -- Thanks, James
simultaneous sa-learn processes
Hi List, Our mailserver server serves about 100 users. Our config: Sendmail+Procmail+SpamAssassin. The question is: If I got it right, we should run sa-learn for each user in order to benefit from bayes. We intend to run a cron job for each user and do it at night by supplying a daily snapshot of our spam and ham collections to sa-learn. Can our mailserver handle it (256 MB RAM, Celeron 400 Mhz)? A weekly collection run for 1 user usually eats 100% of CPU load. My concern is whether the system is going to crash or just do the job slower and if you can point out how many sa-learn tasks could we run simultaneously with our setup. All hints will be appreciated, for we scheduled an initial load for 16 users of the big collection of spam received so far. Thanks guys Chavdar Videff
Re: How can I filter this kind of spam?
Kai Schaetzl wrote: Michael Moyse wrote on Fri, 08 Jul 2005 17:55:32 +0100: To me it looks like a duck and sounds like a duck I'm probably wrong and missing something here because I'm no expert so I'm happy to be enlightened. Ok, I enlighten you ;-) I hope I'm not wrong. Now that I look again at the headers it turns out I was wrong as well, see below. From the headers: Received: (qmail 10812 invoked by uid 567); 5 Jul 2005 12:03:20 - Received: from 65.33.195.76 by host1 (envelope-from <[EMAIL PROTECTED]>, uid 502) with qmail-scanner-1.25 (clamdscan: 0.86.1/967. spamassassin: 3.0.4. Clear:RC:0(65.33.195.76):SA:0(0.0/1.5):. Processed in 0.44071 secs); 05 Jul 2005 12:03:20 - Received: from unknown (HELO ss) (65.33.195.76) by 0 with SMTP; 5 Jul 2005 12:03:19 - 65.33.195.76 = 76.195.33.65.cfl.res.rr.com ! Received: from vitalmex.com.mx (mail1.vitalmex.com.mx [148.223.241.181]) by 76.195.33.65.cfl.res.rr.com (Pastfix) with ESMTP id 0456EDBA28 for <[EMAIL PROTECTED]>; Tue, 05 Jul 2005 05:21:23 -0700 The mail went: vitalmex -> Roadrunner (Po/astfix) -> boom-edv.de (qmail) The last Received line looks forged (Pastfix), there's also no SMTP running at 76.195.33.65.cfl.res.rr.com (=no open/abusable relay). This suggests that the mail was sent out directly from that roadrunner account and the last Received plus all vitalmex stuff is completely forged. Also, a spammer which abused a Roadrunner account would obviously not send openly from his own MX and giving you a return-path which leads back to him. So, what you actually have to block is .rr.com and not .vitalmex.com.mx or .mx. This mail would have never reached us, because we already block all of .rr.com :-) Kai Cool! Thanks for the explanation
bayes_ignore_from with wildcard ?
Hello, Does anyone know if this will work: bayes_ignore_from [EMAIL PROTECTED] The docs don't say specifically if this kind of directive is allowed. They do say that this kind of thing will work for whitelist_from. Regards, Devin