Re: Bayes DB does not grow anymore

2005-03-23 Thread Kai Schaetzl
GRP Productions wrote on Fri, 18 Mar 2005 10:38:29 +0200:

 It seems SURBL is now enabled by default. It has also changed its name to 
 URIDNSBL :-)

SURBL generally refers to those xx_SURBL rules and to URIDNSBL, since the only
other distributed rule is SBL and SURBL started it all.

 I do not use SARE rules (although I am trying to find time to 
 look at them, as I am aware of their credibility). I use Gray's rules 
 (http://files.grayonline.id.au), they seem quite efficient.

I wasn't aware of that site, but now that I visited it, I remember I visited it 
at least once. Use whatever works for you. After all, all this stuff isn't done 
to make you try out again and again but to help you focus your time on the 
important things.

 I understand what you say. The point is, what should be the criteria to 
 understand if the time for an expiration has come? I mean, supposing we take 
 only the size in consideration, could be a problem. What if some old tokens 
 are still common nowadays in spam mail?

This is not a problem. Expiry isn't done by addition time, but by access time 
(short: atime). So, items which didn't occur recently drop to the end of the 
db and get removed by expiry. There's always the chance that old tokens which 
haven't been seen for a long time come back. But the chance is slimmer the 
older the atime of that token is. There's probably some statistical curve 
algorithm which could be used to determine the best break point. Because of 
the way dbx databases work expiry can't be done this way, though.
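
To make the atime idea concrete, here is a minimal Python sketch. It is not SpamAssassin's actual expiry code: the token store is a hypothetical dict mapping each token to its last access time, and the function name and parameters are invented for illustration. Tokens with the oldest atime are dropped first until the store fits a target size.

```python
import heapq

def expire_by_atime(tokens, max_size):
    """Return a copy of `tokens` holding only the `max_size` most recently
    accessed entries. `tokens` maps token -> atime (Unix seconds)."""
    if len(tokens) <= max_size:
        return dict(tokens)
    keep = heapq.nlargest(max_size, tokens.items(), key=lambda kv: kv[1])
    return dict(keep)

# Hypothetical store: the atime is refreshed whenever a token is seen in mail.
tokens = {
    "viagra": 1110636450,    # seen recently
    "mortgage": 1110000000,
    "oldword": 1107319073,   # not seen for weeks
}
kept = expire_by_atime(tokens, max_size=2)
print(sorted(kept))  # ['mortgage', 'viagra']
```

A token that is still common in spam keeps getting its atime refreshed, so it stays near the front and survives expiry regardless of when it was first added.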

 As I told you, since my last post I have reset everything.  It seems to me 
 it works fine, and it learns rapidly. It gives me no reason not to trust it, 
 in a degree I have set my SA score to be more or less equal with the 
 BAYES_99 score (around 8).

Your BAYES_99 score is 8? I would never do this. General rule is that no single 
rule should be able to mark a message as ham or spam. That cries for false 
positives.
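
A small Python sketch of why one dominant score is risky. It treats scoring as a plain sum of rule scores compared against a threshold, which is a simplification; the rule name HYPOTHETICAL_RULE, its score, and the threshold of 5.0 in the second case are assumptions made for the example. The value 3.0 is the BAYES_99 score Kai mentions using elsewhere in this thread.

```python
def is_spam(hits, scores, required):
    """Sum the scores of the rules that hit and compare with the threshold."""
    return sum(scores[name] for name in hits) >= required

# Setup described above: threshold and BAYES_99 both around 8,
# so one Bayes misjudgement is enough to flag a message.
print(is_spam(["BAYES_99"], {"BAYES_99": 8.0}, required=8.0))  # True

# BAYES_99 at 3.0 against an assumed threshold of 5.0:
# Bayes alone is not enough, it needs support from another rule.
scores = {"BAYES_99": 3.0, "HYPOTHETICAL_RULE": 2.5}
print(is_spam(["BAYES_99"], scores, required=5.0))                       # False
print(is_spam(["BAYES_99", "HYPOTHETICAL_RULE"], scores, required=5.0))  # True
```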

 Of course I keep doing mistake-based learning, 
 but most of the times I feed it with 'subjective' spam mail (ie. mail that 
 my users don't want to receive, but is definitely not spam).

What kind of mail is that? Newsletters they once subscribed to and don't like 
anymore? They should unsubscribe instead of declaring it as spam.


Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de  http://msie.winware.org





Re: Bayes DB does not grow anymore

2005-03-23 Thread jdow
From: Kai Schaetzl [EMAIL PROTECTED]

  in a degree I have set my SA score to be more or less equal with the
  BAYES_99 score (around 8).

 Your BAYES_99 score is 8? I would never do this. General rule is that no
 single rule should be able to mark a message as ham or spam. That cries for
 false positives.

I'd not do that with Bayes scores. However, there are a few rules that
are iron clad spam detectors here and they get VERY high scores. They
are unique to me and uniquely usable by me so I don't bother to pass
them along. (I have a string of wrong names associated with products
people spam me about that I use to send a score well over 5 to SA. And
I have some additional PayPal antispam of my own which involve some
fancy dancing with meta rules that get an automatic 105 to make sure
they never get through to anything but my spam folder.) I do scan the
spam folder, though. If I didn't scan it I'd not be so vicious about
some of my spam scores.

{^_-}




Re: Bayes DB does not grow anymore

2005-03-18 Thread GRP Productions
 Thanks for the offer. You can send it to the email address I use for this
 list, or you could just send me an FTP URL for retrieval.

Sorry I did not find the time to do this, but I will try to send it during
the weekend.

 Oh, yes. You need to have SURBL switched on via the init.pre (I think it's
 off by default) and you should use custom rules. I use a set of carefully
 chosen rulesets mostly from SARE and updated via rulesdujour and some more
 rules of my own accumulated over time.

It seems SURBL is now enabled by default. It has also changed its name to
URIDNSBL :-) I do not use SARE rules (although I am trying to find time to
look at them, as I am aware of their credibility). I use Gray's rules
(http://files.grayonline.id.au), they seem quite efficient.

 I think on a heavy traffic machine it's preferable to have it off,
 especially when using MailScanner. Otherwise the expiry can kick in at
 random times every few hours (you can set a minimum time, though, f.i. one
 day). Some people run a scheduled expiry three times a day. That's advice
 which often comes up on the MailScanner list (which is a very helpful list,
 btw).

 Depends on how often you need it (whether it reaches the limit you want to
 hold more often or not). Starting with one expiry per night should be fine,
 but you should occasionally expire manually and look at the output, in case
 there are problems.

 No. One should get rid of really old tokens, they are only ballast in the
 db. I don't know how a big db behaves on a busy site. Ours contain 1 million
 tokens and have a size of 40 MB. They work very well with no resource
 hogging. But I have only a few thousand messages running thru each of our
 servers, there's probably none which gets more than 10,000 a day. If you get
 100,000 it may be different.

I understand what you say. The point is, what should be the criteria to
understand if the time for an expiration has come? I mean, supposing we take
only the size in consideration, could be a problem. What if some old tokens
are still common nowadays in spam mail? You could say it doesn't matter, it
will be started again and recognize all the bad stuff. In that sense, we
could just stop maintaining Bayes completely.

 That's what we do. I only learn messages which were categorized wrong. Not
 by Bayes, but by SA. Most messages which get a score lower than 5 still get
 a BAYES_99 which means that Bayes identifies them all. Nevertheless, I learn
 these messages because they are spam and it reassures Bayes that they are
 spam.
 BTW: I have set BAYES_99 to 3.0, because it's so accurate for us.

As I told you, since my last post I have reset everything.  It seems to me
it works fine, and it learns rapidly. It gives me no reason not to trust it,
in a degree I have set my SA score to be more or less equal with the
BAYES_99 score (around 8). Of course I keep doing mistake-based learning,
but most of the times I feed it with 'subjective' spam mail (ie. mail that
my users don't want to receive, but is definitely not spam). I monitor it
constantly and I am happy about it.

 No problem :-) I tend to be a bit snappy on first messages which look to me
 like the author could have done a bit more research, but once we are over
 that stage I hope I can give some good advice based on my experience.

I have to admit that our communication was valuable to me, I learned so much
about how the whole thing works. Once again, I appreciate it.

Greg
_
Express yourself instantly with MSN Messenger! Download today it's FREE! 
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/



Re: Bayes DB does not grow anymore

2005-03-14 Thread Kai Schaetzl
GRP Productions wrote on Mon, 14 Mar 2005 00:32:42 +0200:

 You are right, I am using MailWatch. I just posted this output to be easy 
 for one to see the actual dates without having to convert.

That's okay, the problem just is one cannot be sure how accurate it is. Knowing 
that you use MS would have been useful, anyway :-)
(BTW: my version of Mailwatch can't show this, do you use a CVS version?)

 Here is the 
 actual output: 
  
 # /usr/bin/sa-learn -p /opt/MailScanner/etc/spam.assassin.prefs.conf --dump magic
 0.000  0  3  0  non-token data: bayes db version 
 0.000  0  49740  0  non-token data: nspam 
 0.000  0  47167  0  non-token data: nham 
 0.000  0 123325  0  non-token data: ntokens

I didn't look at this closely before, but I think this ratio indicates a 
problem, f.i. this is from our own mail server (just getting our own mail, not 
our clients'):

0.000  0  30089  0  non-token data: nspam
0.000  0  12515  0  non-token data: nham
0.000  0 1001630  0  non-token data: ntokens

See the number of tokens: we have ten times yours with less learned mail. That
means that our db has many more tokens to qualify an email as ham or spam. Also
your hold time is quite low, it's about a month. I think we have tokens from
even a year ago. That's maybe a bit too much, but I strongly suggest upping
your bayes_expiry_max_db_size to something like 500000 or so. Since you have a
much higher flux of messages than we have on that machine you are literally
burning your db to uselessness.
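
The difference between the two databases is easier to see as tokens retained per learned message. A quick Python sketch using only the nspam, nham and ntokens figures quoted above; the helper name is made up for the example, and the ratio is just a rough indicator, not a SpamAssassin metric.

```python
def tokens_per_message(nspam, nham, ntokens):
    """Tokens currently held in the db per message ever learned."""
    return ntokens / (nspam + nham)

grp = tokens_per_message(nspam=49740, nham=47167, ntokens=123325)
kai = tokens_per_message(nspam=30089, nham=12515, ntokens=1001630)

print(f"GRP: {grp:.1f} tokens per learned message")  # GRP: 1.3 ...
print(f"Kai: {kai:.1f} tokens per learned message")  # Kai: 23.5 ...
```

A db that keeps only about one token per learned message has discarded almost everything it was taught, which is consistent with tokens being expired too quickly.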

 No it isn't. This is exactly the point I mentioned.

But you didn't prove it ;-)

 But as I said earlier, 
 sa-learn claims it has learned, even from the web interface: 
 SA Learn: Learned from 1 message(s) (1 message(s) examined). 

And you learned by specifying the config file? I suspect that you are at least 
occasionally using two SA configurations, the one coming with MS and the one 
coming with SA.

 This is getting more suspicious: there is no bayes_journal file! 

Oh. Still possible, though. You don't need to have one, but on high volume 
systems it's highly recommended. Check your SA config (wherever it is :-) for 
bayes_learn_to_journal 1. I don't know if it is 1 by default, though. What do 
you have starting with bayes in your config file?

 -rw-rw-rw-  1 root nobody 1236 Mar 14 00:22 bayes.mutex 
 -rw-rw-rw-  1 root nobody 10452992 Mar 14 00:22 bayes_seen 
 -rw-rw-rw-  1 root nobody  5509120 Mar 14 00:02 bayes_toks 

bayes_seen is quite high. I haven't ever seen that it is higher than bayes_toks 
on our systems. But maybe that's normal for high volume systems, I don't know. 
On the Mailscanner list many people complain about very big bayes_seen files. 
Someone else on this list should comment on the size.

 I can assure you no one has touched anything inside this directory. If this 
 is the reason for the problems I've been facing, is there a way to recreate 
 the file without having to lose my current data? (perhaps by copying the 
 above files somewhere, execute sa-learn --clear and some time later restore 
 the above files?)

Don't know if this would be of any help. As I said, I suspect you are using at 
least two different bayes dbs. At least when you do it from the command line. 
Run an updatedb and then locate bayes (this may not locate all files, f.i. 
not in /var !).
MS, of course, can only use one and doesn't have a chance of confusing that, so 
when it uses SA that learns and checks the same db. And so far that part seems 
to be okay (except for the bigger size of bayes_seen, but as I said, this may 
be normal for your setup, I really don't know). But you burn your tokens too 
fast. At least that's what I think.
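
As an alternative to updatedb and locate, a short Python sketch that walks a directory tree and reports every directory containing a bayes_toks file. The function name is hypothetical, and the demo runs against a temporary tree with made-up directory names so that it can be tried anywhere; on a real system the roots to search (for example /var and the home directories) would be an assumption about where SA might keep its data.

```python
import os
import tempfile

def find_bayes_dbs(roots):
    """Yield each directory under `roots` that contains a bayes_toks file."""
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            if "bayes_toks" in filenames:
                yield dirpath

# Demo on a throwaway tree holding two separate Bayes dbs.
with tempfile.TemporaryDirectory() as tmp:
    for parts in (("ms", "bayes"), ("user", ".spamassassin")):
        d = os.path.join(tmp, *parts)
        os.makedirs(d)
        open(os.path.join(d, "bayes_toks"), "w").close()
    found = sorted(os.path.relpath(d, tmp) for d in find_bayes_dbs([tmp]))

print(found)  # two entries means two candidate dbs
```

If more than one directory turns up, learning and scanning may be using different databases.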


Kai






Re: Bayes DB does not grow anymore

2005-03-14 Thread GRP Productions
 That's okay, the problem just is one cannot be sure how accurate it is.
 Knowing that you use MS would have been useful, anyway :-)
 (BTW: my version of Mailwatch can't show this, do you use a CVS version?)

Indeed, this is the CVS version :-)

 See the number of tokens: we have ten times yours with less learned mail.
 That means that our db has many more tokens to qualify an email as ham or
 spam. Also

This is perhaps because I have been using only 'mistake-based' training (ie
training only when false classification happens). However this used to work
fine.

 your hold time is quite low, it's about a month. I think we have tokens
 from even a year ago. That's maybe a bit too much, but I strongly suggest
 upping your bayes_expiry_max_db_size to something like 500000 or so. Since
 you have a much higher flux of messages than we have on that machine you
 are literally burning your db to uselessness.

So what would you suggest? I certainly don't want to lose everything that has
been learned till now.

 And you learned by specifying the config file? I suspect that you are at
 least occasionally using two SA configurations, the one coming with MS and
 the one coming with SA.

Nope, there is definitely only the one coming with MS. I never use SA from
the command line anyway.

 Oh. Still possible, though. You don't need to have one, but on high volume
 systems it's highly recommended. Check your SA config (wherever it is :-)
 for bayes_learn_to_journal 1. I don't know if it is 1 by default, though.
 What do you have starting with bayes in your config file?

# grep bayes /opt/MailScanner/etc/spam.assassin.prefs.conf
# be created as /var/spool/spamassassin/bayes_msgcount, etc.
#bayes_path /var/spool/spamassassin/bayes
#bayes_file_mode 0600
bayes_path  /var/spool/MailScanner/bayes/bayes
bayes_file_mode 0666
# MailScanner: big bayes_toks.new files wasting space.
bayes_auto_expire 0
bayes_expiry_max_db_size 50
bayes_ignore_header X-MailScanner
bayes_ignore_header X-MailScanner-SpamCheck
bayes_ignore_header X-MailScanner-SpamScore
bayes_ignore_header X-MailScanner-Information
# use_bayes 0

 Don't know if this would be of any help. As I said, I suspect you are using
 at least two different bayes dbs. At least when you do it from the command
 line. Run an updatedb and then locate bayes (this may not locate all files,
 f.i. not in /var !).

I think there is only one.

 MS, of course, can only use one and doesn't have a chance of confusing that,
 so when it uses SA that learns and checks the same db. And so far that part
 seems to be okay (except for the bigger size of bayes_seen, but as I said,
 this may be normal for your setup, I really don't know). But you burn your
 tokens too fast. At least that's what I think.

If I get it, you mean that the tokens are lost very quickly? I think I am
confused: if bayes works with tokens, why does it need nspam and nham? Or
are they just counters?

In general, do you think that setting bayes_expiry_max_db_size would be
enough?
One final thing: Why, even if I manually expire, does the date of last
expiration remain old?




Re: Bayes DB does not grow anymore

2005-03-14 Thread Kai Schaetzl
GRP Productions wrote on Mon, 14 Mar 2005 03:41:40 +0200:

 Indeed, this is the CVS version :-) 

I have been trying to get something from CVS for several days now, no luck.

 This is perhaps because I have been using only 'mistake-based' training (ie 
 training only when false classification happens). However this used to work 
 fine. 

Bayes needs constant training, but this doesn't mean it needs any manual 
training. Once it's up and running and well-greased it should take care of 
itself by auto-learning (bayes_auto_learn 1, don't know if on by default). 
About 70 or 80% of our spam and ham (especially the spam) is autolearned.

  
 your hold time is quite low, it's about a month. I think we have tokens
 from even a year ago. That's maybe a bit too much, but I strongly suggest
 upping your bayes_expiry_max_db_size to something like 500000 or so. Since
 you have a much higher flux of messages than we have on that machine you
 are literally burning your db to uselessness.

 So what would you suggest? I certainly don't want to lose everything that
 has been learned till now.

Actually, with those few tokens you won't lose much if you throw it away ;-) 
As I said upping that should help, no need to throw it away unless you think 
that's easier (if most spam you get scores at BAYES_50 it might be better to 
start over than to convince the db that it's spam).

 Nope, there is definitely only the one coming with MS. I never use SA from 
 the command line anyway.

Well, let's go back:
you sa-learn a message, it says it learned, you dump magic and see there's no 
change, you look in the directory and there's no journal. There *has* to be at 
least one additional Bayes db. Or something happens which I haven't heard of in 
my about three years of using SA+Bayes. What's the output of sa-learn --dump 
magic? Don't specify a config file!
 
 bayes_path  /var/spool/MailScanner/bayes/bayes 

and what's in your /etc/mail/spamassassin/local.conf?

 bayes_auto_expire 0

OK, that means it won't expire. Of course, if it doesn't grow this isn't
necessary ... ;-)

 bayes_expiry_max_db_size 50

I assume you just added/changed that?

 If I get it you mean that the tokens are lost very quickly?

Yes. However, now that I know that your bayes_auto_expire is off we have a
different case. Since when has it been off? Since Feb. 11 as your dump magic
suggests? Your oldest token is Feb. 2. So that either means you started the
db that day or you are burning your tokens in 10 days. That's one problem;
upping to a higher ceiling, as you already did, should take care of that. The
other problem is that it's apparently not growing. One of the reasons is, of
course, that you only learn by mistake. So, how often is that done? How many
do you actually add this way? The second part of this other problem is that
even if you learn it doesn't seem to learn. I don't see any other possibility
than that it uses different dbs.

 I think I am confused: if bayes works with tokens, why does it need nspam
 and nham? Or are they just counters?

It's just the number of spam and ham messages you learned to it. Yes, it's more 
or less informational only.

  
 In general, do you think that setting bayes_expiry_max_db_size would be 
 enough? 

To cure the fast expiration, yes, but you didn't expire for the last 30 days, 
anyway.

 One final thing: Why even if i manually expire, the date of last expiration 
 remains old?

Same reason as above: you work on different dbs. What does the expire output 
show?


Kai






Re: Bayes DB does not grow anymore

2005-03-14 Thread GRP Productions
 I have been trying to get something from CVS for several days now, no luck.

Send me your email in private ([EMAIL PROTECTED]) to send it to you.

 Bayes needs constant training, but this doesn't mean it needs any manual
 training. Once it's up and running and well-greased it should take care of
 itself by auto-learning (bayes_auto_learn 1, don't know if on by default).
 About 70 or 80% of our spam and ham (especially the spam) is autolearned.

I will probably start again from scratch. One point: Do you think I should
put custom rules inside /etc/mail/spamassassin or the default installation
is enough?

 Actually, with those few tokens you won't lose much if you throw it away
 ;-) As I said upping that should help, no need to throw it away unless you
 think that's easier (if most spam you get scores at BAYES_50 it might be
 better to start over than to convince the db that it's spam).

I'll probably do it.

 bayes_auto_expire 0
 bayes_expiry_max_db_size 50
 I assume you just added/changed that?

Yes I just added this. Should auto_expire remain always at 0? Also, do you
think it would be better if the db NEVER expired? Would this value of 50
achieve that? I don't want to come at work some day and see my tokens were
lost again :-(

In general, should I do as you said, ie. trust the autolearn system and
never use sa-learn again, given that I do not have the time to do full
training?

Thanks for giving me so much of your time, and being so patient with my 
silly questions.
Best regards,
Greg




Re: Bayes DB does not grow anymore

2005-03-13 Thread Kai Schaetzl
GRP Productions wrote on Sun, 13 Mar 2005 11:21:12 +0200:

 for some days now my bayesian DB does not seem to grow. Its size remains 
 stable. It is read with no problems by SA 3.0.2, but nothing new is written. 
 I send an email to me, it is classified as BAYES_50. I sa-learn it as spam, 
 send it again, and it is still BAYES_50 (I expected to see it as BAYES_99).


This doesn't prove anything. sa-learn --dump magic shows you what's inside. 
Also, Bayes is not a checksum system like Razor, that's its strength. If you 
learn something to it that means that it extracts tokens (short pieces) from 
the message and adjusts its internal probability for them being ham or spam by 
a certain factor. Or if it doesn't know that token yet it adds it.
That the size doesn't grow can have several reasons, f.i. expiry or the fact 
that the db format seems to have some air in it, so that it grows in jumps 
and not continually.
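
The difference from a checksum system can be shown with a toy Python sketch. This is not SpamAssassin's algorithm or tokenizer: the class, its methods and the simple frequency ratio are all invented for illustration. It only demonstrates the two effects described above, namely that learning adjusts per-token statistics and that unknown tokens get added.

```python
import re
from collections import Counter

class ToyBayes:
    """Toy per-token spam/ham counter (illustrative only)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.nspam = 0
        self.nham = 0

    @staticmethod
    def tokenize(text):
        # Each distinct word counts once per message.
        return set(re.findall(r"[a-z0-9$']+", text.lower()))

    def learn(self, text, is_spam):
        counts = self.spam if is_spam else self.ham
        counts.update(self.tokenize(text))  # unknown tokens are added here
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def token_prob(self, token):
        """Spam share of this token's sightings; 0.5 if never seen."""
        s = self.spam[token] / max(self.nspam, 1)
        h = self.ham[token] / max(self.nham, 1)
        return 0.5 if s + h == 0 else s / (s + h)

db = ToyBayes()
db.learn("cheap pills online", is_spam=True)
db.learn("meeting notes for monday", is_spam=False)
db.learn("cheap flights for the meeting", is_spam=False)

print(db.token_prob("pills"))    # 1.0, only ever seen in spam
print(db.token_prob("cheap"))    # about 0.67, seen in both
print(db.token_prob("unknown"))  # 0.5, no information
```

Because the verdict comes from many token probabilities rather than a fingerprint of the whole message, learning one message shifts the numbers a little instead of guaranteeing that the same message is flagged next time.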

Kai






Re: Bayes DB does not grow anymore

2005-03-13 Thread GRP Productions
 This doesn't prove anything. sa-learn --dump magic shows you what's inside.
 Also, Bayes is not a checksum system like Razor, that's its strength. If you
 learn something to it that means that it extracts tokens (short pieces) from
 the message and adjusts its internal probability for them being ham or spam
 by a certain factor. Or if it doesn't know that token yet it adds it.
 That the size doesn't grow can have several reasons, f.i. expiry or the fact
 that the db format seems to have some air in it, so that it grows in jumps
 and not continually.

Perhaps I have not been clear enough. It's not only that the files' size is
constant. I am pasting the output of dump magic, and I have to explain that
the nham and nspam values have been the same for many days now. This is not
normal, since we are talking about a very busy server (more than 4,000
messages per day). This behaviour has not always been the case, it used to
work fine. If I send to myself a message from Yahoo, with subject 'Viagra
sex teen' and other nice words, I certainly do not want it to pass.
Bayes classifies it as 50% spam. I tried to sa-learn --forget, and then
re-learn, still is BAYES_50. The nham and nspam values used to increase very
rapidly (sometimes by a value of 200-300 per day). No errors are produced. I
wouldn't have noticed the particular problem, but fortunately during the
last days we started having more spam than usual passing. Also, I
tried to force an expiration many times, but as you can see the expiration
did not take place. It's definitely not a file permission issue.

Thanks
Number of Spam Messages:      49,740
Number of Ham Messages:       47,167
Number of Tokens:             123,325
Oldest Token:                 Wed, 2 Feb 2005 06:37:53 +0200
Newest Token:                 Sat, 12 Mar 2005 16:07:30 +0200
Last Journal Sync:            Fri, 11 Feb 2005 18:03:10 +0200
Last Expiry:                  Fri, 11 Feb 2005 15:45:34 +0200
Last Expiry Reduction Count:  3,475 tokens



Re: Bayes DB does not grow anymore

2005-03-13 Thread GRP Productions
 That is the output of --dump magic? I haven't ever seen it formatted that
 nicely. I assume you skipped the first line, but the expire atime delta is
 also missing. So, where did you get this from? Not directly from sa-learn
 --dump magic I'd say. You are running SA thru some interface? You should
 have said something about the whereabouts of your installation.

You are right, I am using MailWatch. I just posted this output to be easy
for one to see the actual dates without having to convert. Here is the
actual output:

# /usr/bin/sa-learn -p /opt/MailScanner/etc/spam.assassin.prefs.conf --dump magic
0.000  0  3  0  non-token data: bayes db version
0.000  0  49740  0  non-token data: nspam
0.000  0  47167  0  non-token data: nham
0.000  0 123325  0  non-token data: ntokens
0.000  0 1107319073  0  non-token data: oldest atime
0.000  0 1110636450  0  non-token data: newest atime
0.000  0 1108137790  0  non-token data: last journal sync atime
0.000  0 1108129534  0  non-token data: last expiry atime
0.000  0 804361  0  non-token data: last expire atime delta
0.000  0   3475  0  non-token data: last expire reduction count
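
For anyone who wants to convert the raw values by hand, a small Python sketch. The atime fields are Unix timestamps; the +0200 offset is taken from the dates shown in this thread, and reading the "last expire atime delta" as a duration in seconds is an assumption based on its size.

```python
from datetime import datetime, timedelta, timezone

TZ = timezone(timedelta(hours=2))  # +0200, the offset used in this thread

magic = {
    "oldest atime": 1107319073,
    "newest atime": 1110636450,
    "last journal sync atime": 1108137790,
    "last expiry atime": 1108129534,
}
for name, ts in magic.items():
    stamp = datetime.fromtimestamp(ts, TZ).strftime("%Y-%m-%d %H:%M:%S %z")
    print(f"{name:24} {stamp}")

# Assumed to be a duration in seconds rather than a timestamp.
delta_days = 804361 / 86400
print(f"last expire atime delta: {delta_days:.1f} days")  # 9.3 days
```

The converted dates line up with the MailWatch display (2 Feb, 12 Mar and 11 Feb 2005), and the delta works out to a little over nine days.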

 Ok. Get the values. Then learn a message to it. Make sure it says that it
 actually learned, then check the values again. Is either the spam or ham
 count increased by one or not?

No it isn't. This is exactly the point I mentioned. But as I said earlier,
sa-learn claims it has learned, even from the web interface:
SA Learn: Learned from 1 message(s) (1 message(s) examined).
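
The before/after check can be scripted. A minimal Python sketch that parses the text of two --dump magic runs and reports which counters moved; the function names are made up, the regular expression assumes the column layout shown in this thread, and the "after" values below are invented to show what a successful learn might look like.

```python
import re

LINE = re.compile(r"^\s*\S+\s+\d+\s+(\d+)\s+\d+\s+non-token data: (.+?)\s*$")

def parse_magic(text):
    """Map each 'non-token data' label to its integer value."""
    out = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            out[m.group(2)] = int(m.group(1))
    return out

def changed(before, after, keys=("nspam", "nham", "ntokens")):
    """Return the counters that differ, with the size of the change."""
    return {k: after[k] - before[k] for k in keys if after[k] != before[k]}

before = parse_magic("""\
0.000  0  49740  0  non-token data: nspam
0.000  0  47167  0  non-token data: nham
0.000  0 123325  0  non-token data: ntokens
""")

# Invented second dump: one spam learned, a dozen new tokens.
after = dict(before, nspam=before["nspam"] + 1, ntokens=before["ntokens"] + 12)

print(changed(before, before))  # {} means nothing was written to this db
print(changed(before, after))   # {'nspam': 1, 'ntokens': 12}
```

An empty result after sa-learn reports success is the symptom described here: the learn went somewhere other than the db being dumped, or was never written.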

 Ok, this finally looks a bit suspicious. No sync and no expire for a month.
 If it doesn't sync you don't get new tokens. Check in your bayes directory
 how big your bayes_journal is. I'd think it's quite big. Do a sync now.
 (Please don't do it via an interface, do it on the command line.) What's the
 output? Is the journal gone and the number of tokens increased now? If so,
 you need to investigate why it doesn't sync anymore. Also do an expire then.

This is getting more suspicious: there is no bayes_journal file!
# ll /var/spool/MailScanner/bayes/
total 11780
drwxrwxrwx  2 root nobody 4096 Mar 14 00:22 .
drwxr-xr-x  4 root nobody 4096 Mar 13 11:55 ..
-rw-rw-rw-  1 root nobody 1236 Mar 14 00:22 bayes.mutex
-rw-rw-rw-  1 root nobody 10452992 Mar 14 00:22 bayes_seen
-rw-rw-rw-  1 root nobody  5509120 Mar 14 00:02 bayes_toks
I can assure you no one has touched anything inside this directory. If this 
is the reason for the problems I've been facing, is there a way to recreate 
the file without having to lose my current data? (perhaps by copying the 
above files somewhere, execute sa-learn --clear and some time later restore 
the above files?)

Thanks for your help