Re: Bayes DB does not grow anymore
GRP Productions wrote on Fri, 18 Mar 2005 10:38:29 +0200:

It seems SURBL is now enabled by default. It has also changed its name to URIDNSBL :-)

SURBL refers generally to those xx_SURBL rules and to URIDNSBL, since the only other distributed rule is SBL and SURBL started it all.

I do not use SARE rules (although I am trying to find time to look at them, as I am aware of their credibility). I use Gray's rules (http://files.grayonline.id.au), they seem quite efficient.

I wasn't aware of that site, but now that I have visited it, I remember I visited it at least once before. Use whatever works for you. After all, all this stuff isn't done to make you try things out again and again but to help you focus your time on the important things.

I understand what you say. The point is, what should be the criteria to understand if the time for an expiration has come? I mean, taking only the size into consideration could be a problem. What if some old tokens are still common nowadays in spam mail?

This is not a problem. Expiry isn't done by addition time, but by access time (short: atime). So, items which didn't occur recently drop to the end of the db and get removed by expiry. There's always the chance that old tokens which haven't been seen for a long time come back. But the chance is slimmer the older the atime of that token is. There's probably some statistical curve algorithm which could be used to determine the best break point. Because of the way dbx databases work expiry can't be done this way, though.

As I told you, since my last post I have reset everything. It seems to me it works fine, and it learns rapidly. It gives me no reason not to trust it, to the degree that I have set my SA score to be more or less equal with the BAYES_99 score (around 8).

Your BAYES_99 score is 8? I would never do this. The general rule is that no single rule should be able to mark a message as ham or spam. That cries for false positives.

Of course I keep doing mistake-based learning, but most of the time I feed it with 'subjective' spam mail (i.e. mail that my users don't want to receive, but is definitely not spam).

What kind of mail is that? Newsletters they once subscribed to and don't like anymore? They should unsubscribe instead of declaring it as spam.

Kai

--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de http://msie.winware.org
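The point about expiry working on access time rather than addition time can be illustrated with a toy model. This is only a sketch under stated assumptions: the token store, the `touch` and `expire_by_atime` helpers and the numbers are all hypothetical, and this is not how SpamAssassin actually picks its expiry cut-off.

```python
def touch(tokens, token, now):
    """Record that `token` was seen at time `now` (refreshes its atime)."""
    tokens[token] = now


def expire_by_atime(tokens, max_age):
    """Keep only tokens whose atime is within `max_age` seconds of the newest atime."""
    if not tokens:
        return {}
    cutoff = max(tokens.values()) - max_age
    return {tok: atime for tok, atime in tokens.items() if atime >= cutoff}


# Hypothetical token store: token -> last access time in seconds.
tokens = {"viagra": 1000, "mortgage": 400, "meeting": 950}

# "mortgage" has not been seen for a long time, so expiry drops it.
kept = expire_by_atime(tokens, max_age=100)

# If the old token shows up again before expiry, its atime is refreshed
# and it survives, no matter how long ago it was first added.
touch(tokens, "mortgage", 1010)
kept_after_touch = expire_by_atime(tokens, max_age=100)
```

An old token that is still common in current spam keeps getting its atime refreshed, which is why size-driven expiry does not throw it away.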
Re: Bayes DB does not grow anymore
From: Kai Schaetzl [EMAIL PROTECTED]

to the degree that I have set my SA score to be more or less equal with the BAYES_99 score (around 8).

Your BAYES_99 score is 8? I would never do this. The general rule is that no single rule should be able to mark a message as ham or spam. That cries for false positives.

I'd not do that with Bayes scores. However, there are a few rules that are ironclad spam detectors here and they get VERY high scores. They are unique to me and uniquely usable by me so I don't bother to pass them along. (I have a string of wrong names associated with products people spam me about that I use to send a score well over 5 to SA. And I have some additional PayPal antispam of my own which involves some fancy dancing with meta rules that get an automatic 105 to make sure they never get through to anything but my spam folder.)

I do scan the spam folder, though. If I didn't scan it I'd not be so vicious about some of my spam scores.

{^_-}
Re: Bayes DB does not grow anymore
Thanks for the offer. You can send it to the email address I use for this list, or you could just send me an FTP URL for retrieval.

Sorry I did not find the time to do this, but I will try to send it during the weekend.

Oh, yes. You need to have SURBL switched on via the init.pre (I think it's off by default) and you should use custom rules. I use a set of carefully chosen rulesets mostly from SARE and updated via rulesdujour and some more rules of my own accumulated over time.

It seems SURBL is now enabled by default. It has also changed its name to URIDNSBL :-) I do not use SARE rules (although I am trying to find time to look at them, as I am aware of their credibility). I use Gray's rules (http://files.grayonline.id.au), they seem quite efficient.

I think on a heavy traffic machine it's preferable to have it off, especially when using MailScanner. Otherwise the expiry can kick in at random times every few hours (you can set a minimum time, though, for instance one day). Some people run a scheduled expiry three times a day. That's advice which often comes up on the MailScanner list (which is a very helpful list, btw). It depends on how often you need it (whether it reaches the limit you want to hold more often or not). Starting with one expiry per night should be fine, but you should occasionally expire manually and look at the output, in case there are problems.

No. One should get rid of really old tokens, they are only ballast in the db. I don't know how a big db behaves on a busy site. Ours contain 1 million tokens and have a size of 40 MB. They work very well with no resource hogging. But I have only a few thousand messages running through each of our servers, there's probably none which gets more than 10,000 a day. If you get 100,000 it may be different.

I understand what you say. The point is, what should be the criteria to understand if the time for an expiration has come? I mean, taking only the size into consideration could be a problem. What if some old tokens are still common nowadays in spam mail?

You could say it doesn't matter, it will be started again and recognize all the bad stuff. In that sense, we could just stop maintaining Bayes completely.

That's what we do. I only learn messages which were categorized wrong. Not by Bayes, but by SA. Most messages which get a score lower than 5 still get a BAYES_99, which means that Bayes identifies them all. Nevertheless, I learn these messages because they are spam and it reassures Bayes that they are spam. BTW: I have set BAYES_99 to 3.0, because it's so accurate for us.

As I told you, since my last post I have reset everything. It seems to me it works fine, and it learns rapidly. It gives me no reason not to trust it, to the degree that I have set my SA score to be more or less equal with the BAYES_99 score (around 8). Of course I keep doing mistake-based learning, but most of the time I feed it with 'subjective' spam mail (i.e. mail that my users don't want to receive, but is definitely not spam). I monitor it constantly and I am happy about it.

No problem :-) I tend to be a bit snappy on first messages which look to me like the author could have done a bit more research, but once we are over that stage I hope I can give some good advice based on my experience.

I have to admit that our communication was valuable to me, I learned so much about how the whole thing works. Once again, I appreciate it.

Greg
Re: Bayes DB does not grow anymore
GRP Productions wrote on Mon, 14 Mar 2005 00:32:42 +0200:

You are right, I am using MailWatch. I just posted this output to be easy for one to see the actual dates without having to convert.

That's okay, the problem just is one cannot be sure how accurate it is. Knowing that you use MS would have been useful, anyway :-) (BTW: my version of MailWatch can't show this, do you use a CVS version?)

Here is the actual output:

# /usr/bin/sa-learn -p /opt/MailScanner/etc/spam.assassin.prefs.conf --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 49740 0 non-token data: nspam
0.000 0 47167 0 non-token data: nham
0.000 0 123325 0 non-token data: ntokens

I didn't look at this closely before, but I think this ratio indicates a problem. For instance, this is from our own mail server (just getting our own mail, not our clients'):

0.000 0 30089 0 non-token data: nspam
0.000 0 12515 0 non-token data: nham
0.000 0 1001630 0 non-token data: ntokens

See the number of tokens, we have ten times yours with less learned mail. That means that our db has many more tokens to qualify an email as ham or spam. Also your hold time is quite low, it's about a month. I think we have tokens from even a year ago. That's maybe a bit too much, but I strongly suggest upping your bayes_expiry_max_db_size to something like 500,000 or so. Since you have a much higher flux of messages than we have on that machine you are literally burning your db to uselessness.

No it isn't. This is exactly the point I mentioned.

But you didn't prove it ;-)

But as I said earlier, sa-learn claims it has learned, even from the web interface: SA Learn: Learned from 1 message(s) (1 message(s) examined).

And you learned by specifying the config file? I suspect that you are at least occasionally using two SA configurations, the one coming with MS and the one coming with SA.

This is getting more suspicious: there is no bayes_journal file!

Oh. Still possible, though. You don't need to have one, but on high volume systems it's highly recommended. Check your SA config (wherever it is :-) for bayes_learn_to_journal 1. I don't know if it is 1 by default, though. What do you have starting with bayes in your config file?

-rw-rw-rw- 1 root nobody 1236 Mar 14 00:22 bayes.mutex
-rw-rw-rw- 1 root nobody 10452992 Mar 14 00:22 bayes_seen
-rw-rw-rw- 1 root nobody 5509120 Mar 14 00:02 bayes_toks

bayes_seen is quite big. I haven't ever seen it bigger than bayes_toks on our systems. But maybe that's normal for high volume systems, I don't know. On the MailScanner list many people complain about very big bayes_seen files. Someone else on this list should comment on the size.

I can assure you no one has touched anything inside this directory. If this is the reason for the problems I've been facing, is there a way to recreate the file without having to lose my current data? (perhaps by copying the above files somewhere, execute sa-learn --clear and some time later restore the above files?)

Don't know if this would be of any help. As I said, I suspect you are using at least two different bayes dbs. At least when you do it from the command line. Run an updatedb and then locate bayes (this may not locate all files, for instance not in /var !). MS, of course, can only use one and doesn't have a chance of confusing that, so when it uses SA that learns and checks the same db. And so far that part seems to be okay (except for the bigger size of bayes_seen, but as I said, this may be normal for your setup, I really don't know). But you burn your tokens too fast. At least that's what I think.

Kai
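The ratio Kai points at can be made explicit by dividing the token count by the number of learned messages. The sketch below uses only the nspam, nham and ntokens values quoted in this thread; the helper name `tokens_per_message` is made up for illustration, and the ratio is a rough indicator rather than an official SpamAssassin metric.

```python
def tokens_per_message(nspam, nham, ntokens):
    """Average number of stored tokens per learned message."""
    learned = nspam + nham
    return ntokens / learned if learned else 0.0


# Values from the two `--dump magic` outputs quoted above.
busy_server = tokens_per_message(nspam=49740, nham=47167, ntokens=123325)
small_server = tokens_per_message(nspam=30089, nham=12515, ntokens=1001630)

print(f"busy server:  {busy_server:.2f} tokens per learned message")
print(f"small server: {small_server:.2f} tokens per learned message")
```

Roughly one token per learned message on the busy server against more than twenty on the smaller one is consistent with the suspicion that tokens are being expired almost as fast as they are learned.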
Re: Bayes DB does not grow anymore
That's okay, the problem just is one cannot be sure how accurate it is. Knowing that you use MS would have been useful, anyway :-) (BTW: my version of MailWatch can't show this, do you use a CVS version?)

Indeed, this is the CVS version :-)

See the number of tokens, we have ten times yours with less learned mail. That means that our db has many more tokens to qualify an email as ham or spam. Also

This is perhaps because I have been using only 'mistake-based' training (i.e. training only when false classification happens). However, this used to work fine.

your hold time is quite low, it's about a month. I think we have tokens from even a year ago. That's maybe a bit too much, but I strongly suggest upping your bayes_expiry_max_db_size to something like 500,000 or so. Since you have a much higher flux of messages than we have on that machine you are literally burning your db to uselessness.

So what would you suggest? I certainly don't want to lose everything that has been learned till now.

And you learned by specifying the config file? I suspect that you are at least occasionally using two SA configurations, the one coming with MS and the one coming with SA.

Nope, there is definitely only the one coming with MS. I never use SA from the command line anyway.

Oh. Still possible, though. You don't need to have one, but on high volume systems it's highly recommended. Check your SA config (wherever it is :-) for bayes_learn_to_journal 1. I don't know if it is 1 by default, though. What do you have starting with bayes in your config file?

# grep bayes /opt/MailScanner/etc/spam.assassin.prefs.conf
# be created as /var/spool/spamassassin/bayes_msgcount, etc.
#bayes_path /var/spool/spamassassin/bayes
#bayes_file_mode 0600
bayes_path /var/spool/MailScanner/bayes/bayes
bayes_file_mode 0666
# MailScanner: big bayes_toks.new files wasting space.
bayes_auto_expire 0
bayes_expiry_max_db_size 50
bayes_ignore_header X-MailScanner
bayes_ignore_header X-MailScanner-SpamCheck
bayes_ignore_header X-MailScanner-SpamScore
bayes_ignore_header X-MailScanner-Information
# use_bayes 0

Don't know if this would be of any help. As I said, I suspect you are using at least two different bayes dbs. At least when you do it from the command line. Run an updatedb and then locate bayes (this may not locate all files, for instance not in /var !).

I think there is only one.

MS, of course, can only use one and doesn't have a chance of confusing that, so when it uses SA that learns and checks the same db. And so far that part seems to be okay (except for the bigger size of bayes_seen, but as I said, this may be normal for your setup, I really don't know). But you burn your tokens too fast. At least that's what I think.

If I get it, you mean that the tokens are lost very quickly? I think I am confused: if Bayes works with tokens, why does it need nspam and nham? Or are they just counters? In general, do you think that setting bayes_expiry_max_db_size would be enough? One final thing: why, even if I manually expire, does the date of last expiration remain old?
Re: Bayes DB does not grow anymore
GRP Productions wrote on Mon, 14 Mar 2005 03:41:40 +0200:

Indeed, this is the CVS version :-)

I have been trying to get something from CVS for several days now, no luck.

This is perhaps because I have been using only 'mistake-based' training (i.e. training only when false classification happens). However, this used to work fine.

Bayes needs constant training, but this doesn't mean it needs any manual training. Once it's up and running and well-greased it should take care of itself by auto-learning (bayes_auto_learn 1, don't know if on by default). About 70 or 80% of our spam and ham (especially the spam) is autolearned.

your hold time is quite low, it's about a month. I think we have tokens from even a year ago. That's maybe a bit too much, but I strongly suggest upping your bayes_expiry_max_db_size to something like 500,000 or so. Since you have a much higher flux of messages than we have on that machine you are literally burning your db to uselessness.

So what would you suggest? I certainly don't want to lose everything that has been learned till now.

Actually, with those few tokens you won't lose much if you throw it away ;-) As I said, upping that should help, no need to throw it away unless you think that's easier (if most spam you get scores at BAYES_50 it might be better to start over than to convince the db that it's spam).

Nope, there is definitely only the one coming with MS. I never use SA from the command line anyway.

Well, let's go back: you sa-learn a message, it says it learned, you dump magic and see there's no change, you look in the directory and there's no journal. There *has* to be at least one additional Bayes db. Or something happens which I haven't heard of in my about three years of using SA+Bayes. What's the output of sa-learn --dump magic? Don't specify a config file!

bayes_path /var/spool/MailScanner/bayes/bayes

and what's in your /etc/mail/spamassassin/local.cf?

bayes_auto_expire 0

ok, that means it won't expire. Of course, if it doesn't grow this isn't necessary ... ;-)

bayes_expiry_max_db_size 50

I assume you just added/changed that?

If I get it, you mean that the tokens are lost very quickly?

Yes. However, now that I know that your bayes_auto_expire is off we have a different case? Since when has it been off? Since Feb. 11 as your dump magic suggests? Your oldest token is Feb. 2. So that either means you started the db that day or you are burning your tokens in 10 days. That's one problem; upping to a higher ceiling, as you already did, should take care of that. The other problem is that it's apparently not growing. One of the reasons is, of course, that you only learn by mistake. So, how often is that done? How many do you actually add this way? The second part of this other problem is that even if you learn it doesn't seem to learn. I don't see another possibility other than that it uses different dbs.

I think I am confused: if Bayes works with tokens, why does it need nspam and nham? Or are they just counters?

It's just the number of spam and ham messages you learned to it. Yes, it's more or less informational only.

In general, do you think that setting bayes_expiry_max_db_size would be enough?

To cure the fast expiration, yes, but you didn't expire for the last 30 days, anyway.

One final thing: why, even if I manually expire, does the date of last expiration remain old?

Same reason as above: you work on different dbs. What does the expire output show?

Kai
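The check Kai walks through (dump magic, learn one message, dump magic again, compare the counters) can be scripted once both outputs have been saved as text. This is a minimal sketch: the `counters` and `learning_registered` helpers are hypothetical names, the "after" snapshot below is invented purely to exercise the comparison, and the line format is assumed to match the dumps quoted in this thread.

```python
import re

# Matches e.g. "0.000 0 49740 0 non-token data: nspam" and captures the
# third column plus the counter name.
COUNTER_RE = re.compile(
    r"^\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (nspam|nham|ntokens)\s*$",
    re.MULTILINE,
)


def counters(dump_text):
    """Extract nspam, nham and ntokens from `sa-learn --dump magic` text."""
    return {name: int(value) for value, name in COUNTER_RE.findall(dump_text)}


def learning_registered(before_text, after_text):
    """True if either message counter grew between the two snapshots."""
    before, after = counters(before_text), counters(after_text)
    return after["nspam"] > before["nspam"] or after["nham"] > before["nham"]


before = """\
0.000 0 49740 0 non-token data: nspam
0.000 0 47167 0 non-token data: nham
0.000 0 123325 0 non-token data: ntokens
"""

# Invented snapshot: what a successful learn of one spam might look like.
after_ok = before.replace("49740", "49741")

print(learning_registered(before, before))    # counters unchanged
print(learning_registered(before, after_ok))  # nspam grew by one
```

If sa-learn reports "Learned from 1 message(s)" but the two snapshots are identical, the learn most likely went into a different database than the one being dumped, which is the scenario suspected above.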
Re: Bayes DB does not grow anymore
I have been trying to get something from CVS for several days now, no luck.

Send me your email address in private ([EMAIL PROTECTED]) and I will send it to you.

Bayes needs constant training, but this doesn't mean it needs any manual training. Once it's up and running and well-greased it should take care of itself by auto-learning (bayes_auto_learn 1, don't know if on by default). About 70 or 80% of our spam and ham (especially the spam) is autolearned.

I will probably start again from scratch. One point: do you think I should put custom rules inside /etc/mail/spamassassin, or is the default installation enough?

Actually, with those few tokens you won't lose much if you throw it away ;-) As I said, upping that should help, no need to throw it away unless you think that's easier (if most spam you get scores at BAYES_50 it might be better to start over than to convince the db that it's spam).

I'll probably do it.

bayes_auto_expire 0
bayes_expiry_max_db_size 50

I assume you just added/changed that?

Yes, I just added this. Should auto_expire always remain at 0? Also, do you think it would be better if the db NEVER expired? Would this value of 50 achieve that? I don't want to come to work some day and see my tokens were lost again :-( In general, should I do as you said, i.e. trust the autolearn system and never use sa-learn again, given that I do not have the time to do full training?

Thanks for giving me so much of your time, and being so patient with my silly questions.

Best regards,
Greg
Re: Bayes DB does not grow anymore
GRP Productions wrote on Sun, 13 Mar 2005 11:21:12 +0200:

for some days now my bayesian DB does not seem to grow. Its size remains stable. It is read with no problems by SA 3.0.2, but nothing new is written. I send an email to myself, it is classified as BAYES_50. I sa-learn it as spam, send it again, and it is still BAYES_50 (I expected to see it as BAYES_99).

This doesn't prove anything. sa-learn --dump magic shows you what's inside. Also, Bayes is not a checksum system like Razor, that's its strength. If you learn something to it, that means that it extracts tokens (short pieces) from the message and adjusts its internal probability for them being ham or spam by a certain factor. Or, if it doesn't know that token yet, it adds it. That the size doesn't grow can have several reasons, for instance expiry or the fact that the db format seems to have some air in it, so that it grows in jumps and not continually.

Kai
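The difference between a checksum system and token-based learning can be shown with a deliberately simplified model. Everything here is an assumption made for illustration: the `ToyBayes` class, its whitespace tokenizer and its per-token ratio are not SpamAssassin's actual tokenizer or probability formula, which are considerably more involved.

```python
from collections import Counter


class ToyBayes:
    """Toy token store: counts in how many spam/ham messages each token appeared."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.nspam = 0
        self.nham = 0

    def learn(self, text, is_spam):
        # Tokens are just lowercase words here; unknown tokens get added,
        # known tokens get their count adjusted.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def ntokens(self):
        return len(set(self.spam) | set(self.ham))

    def spamminess(self, token):
        """Crude per-token spam probability; 0.5 means 'no evidence either way'."""
        s = self.spam[token] / self.nspam if self.nspam else 0.0
        h = self.ham[token] / self.nham if self.nham else 0.0
        return 0.5 if s + h == 0 else s / (s + h)


db = ToyBayes()
db.learn("cheap viagra now", is_spam=True)
db.learn("meeting now", is_spam=False)
```

In this model a single learned message only nudges the statistics of its tokens, so re-sending the same message does not necessarily flip its classification the way a checksum match would.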
Re: Bayes DB does not grow anymore
This doesn't prove anything. sa-learn --dump magic shows you what's inside. Also, Bayes is not a checksum system like Razor, that's its strength. If you learn something to it, that means that it extracts tokens (short pieces) from the message and adjusts its internal probability for them being ham or spam by a certain factor. Or, if it doesn't know that token yet, it adds it. That the size doesn't grow can have several reasons, for instance expiry or the fact that the db format seems to have some air in it, so that it grows in jumps and not continually.

Perhaps I have not been clear enough. It's not only that the files' size is constant. I am pasting the output of dump magic, and I have to explain that the nham and nspam values have been the same for many days now. This is not normal, since we are talking about a very busy server (more than 4,000 messages per day). This behaviour has not always been the case, it used to work fine. If I send myself a message from Yahoo with the subject 'Viagra sex teen' and other nice words, I certainly do not want it to pass. Bayes classifies it as 50% spam. I tried to sa-learn --forget and then re-learn, it still is BAYES_50. The nham and nspam values used to increase very rapidly (sometimes by 200-300 per day). No errors are produced. I wouldn't have noticed this particular problem, but during the last days more spam than usual started getting through. Also, I tried to force an expiration many times, but as you can see the expiration did not take place. It's definitely not a file permission issue.

Thanks

Number of Spam Messages: 49,740
Number of Ham Messages: 47,167
Number of Tokens: 123,325
Oldest Token: Wed, 2 Feb 2005 06:37:53 +0200
Newest Token: Sat, 12 Mar 2005 16:07:30 +0200
Last Journal Sync: Fri, 11 Feb 2005 18:03:10 +0200
Last Expiry: Fri, 11 Feb 2005 15:45:34 +0200
Last Expiry Reduction Count: 3,475 tokens
Re: Bayes DB does not grow anymore
That is the output of --dump magic? I haven't ever seen it formatted that nicely. I assume you skipped the first line, but the expire atime delta is also missing. So, where did you get this from? Not directly from sa-learn --dump magic, I'd say. You are running SA through some interface? You should have said something about the whereabouts of your installation.

You are right, I am using MailWatch. I just posted this output to be easy for one to see the actual dates without having to convert. Here is the actual output:

# /usr/bin/sa-learn -p /opt/MailScanner/etc/spam.assassin.prefs.conf --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 49740 0 non-token data: nspam
0.000 0 47167 0 non-token data: nham
0.000 0 123325 0 non-token data: ntokens
0.000 0 1107319073 0 non-token data: oldest atime
0.000 0 1110636450 0 non-token data: newest atime
0.000 0 1108137790 0 non-token data: last journal sync atime
0.000 0 1108129534 0 non-token data: last expiry atime
0.000 0 804361 0 non-token data: last expire atime delta
0.000 0 3475 0 non-token data: last expire reduction count

Ok. Get the values. Then learn a message to it. Make sure it says that it actually learned, then check the values again. Is either the spam or ham count increased by one or not?

No it isn't. This is exactly the point I mentioned. But as I said earlier, sa-learn claims it has learned, even from the web interface: SA Learn: Learned from 1 message(s) (1 message(s) examined).

Ok, this finally looks a bit suspicious. No sync and no expire for a month. If it doesn't sync you don't get new tokens. Check in your bayes directory how big your bayes_journal is. I'd think it's quite big. Do a sync now. (Please don't do it via an interface, do it on the command line.) What's the output? Is the journal gone and the number of tokens increased now? If so, you need to investigate why it doesn't sync anymore. Also do an expire then.

This is getting more suspicious: there is no bayes_journal file!

# ll /var/spool/MailScanner/bayes/
total 11780
drwxrwxrwx 2 root nobody 4096 Mar 14 00:22 .
drwxr-xr-x 4 root nobody 4096 Mar 13 11:55 ..
-rw-rw-rw- 1 root nobody 1236 Mar 14 00:22 bayes.mutex
-rw-rw-rw- 1 root nobody 10452992 Mar 14 00:22 bayes_seen
-rw-rw-rw- 1 root nobody 5509120 Mar 14 00:02 bayes_toks

I can assure you no one has touched anything inside this directory. If this is the reason for the problems I've been facing, is there a way to recreate the file without having to lose my current data? (perhaps by copying the above files somewhere, execute sa-learn --clear and some time later restore the above files?)

Thanks for your help
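The raw atime values in the dump are Unix epoch seconds, so the dates MailWatch displayed can be recovered with a few lines of Python. The sketch below assumes the column layout shown in the dump in this message (the third column holds the value); `parse_dump_magic` and `as_utc` are illustrative helper names, not part of SpamAssassin.

```python
from datetime import datetime, timezone

# Relevant lines copied from the dump above.
DUMP = """\
0.000 0 1107319073 0 non-token data: oldest atime
0.000 0 1110636450 0 non-token data: newest atime
0.000 0 1108137790 0 non-token data: last journal sync atime
0.000 0 1108129534 0 non-token data: last expiry atime
0.000 0 804361 0 non-token data: last expire atime delta
"""


def parse_dump_magic(text):
    """Map each 'non-token data' name to its integer value (third column)."""
    fields = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue
        columns, name = line.split("non-token data:", 1)
        fields[name.strip()] = int(columns.split()[2])
    return fields


def as_utc(epoch_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


magic = parse_dump_magic(DUMP)
oldest = as_utc(magic["oldest atime"])
last_expiry = as_utc(magic["last expiry atime"])

# Time span covered by the tokens, and the atime delta used at the last expiry.
span_days = (magic["newest atime"] - magic["oldest atime"]) / 86400
delta_days = magic["last expire atime delta"] / 86400

print("oldest token:", oldest.isoformat())
print("last expiry: ", last_expiry.isoformat())
print(f"token span:   {span_days:.1f} days")
print(f"expire delta: {delta_days:.1f} days")
```

The times come out in UTC, two hours behind the +0200 values in the MailWatch listing. The token span of a bit over a month and the expire delta of roughly nine days are the figures the later replies refer to as a low hold time and tokens being burned in about ten days.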