Your approach to teaching it should be fine.  What you should do is 
periodically feed it some ham and spam so that hams don't start becoming FPs.
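For example, something along these lines (just a sketch; the mbox paths are 
placeholders for wherever you keep your hand-sorted mail):

    # retrain periodically from hand-sorted mailboxes (example paths)
    sa-learn --ham  --mbox /path/to/ham.mbox
    sa-learn --spam --mbox /path/to/spam.mbox
    sa-learn --dump magic    # sanity-check the nspam/nham counters afterwards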
 
What I have had a problem with, solved by Bayes, is that the SARE rules, in 
conjunction with the stock rules and some of the RBLs, generate a score very 
close to the threshold for some hams.  Sometimes they sneak over the top.  
When Bayes is in operation it solves the problem, because most of the ham 
lands in the 0-10% probability Bayes rules, giving it a negative score (and we 
have lowered those scores a little further for our particular environment).
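For illustration, the lowering is just a score line in local.cf; the value 
here is made up, and the low-probability rule names vary between SA versions:

    # push strongly hammy mail further below the threshold (example value)
    score BAYES_00 -5.0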
 
I reset Bayes a couple of weeks back while I was offsite to fix another 
problem, and during the time it was waiting to train, many hams got marked as 
spam, going over the threshold by only ~0.3 points.  We fixed this temporarily 
by raising the threshold while Bayes was learning.  (I didn't have access to 
my spam/ham databases from where I was, so I couldn't train it until I got 
back.)
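The temporary fix was just a one-line threshold bump, something like this in 
local.cf (the value here is made up):

    # raise the spam threshold while Bayes is still learning (example value)
    required_hits 7.0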
 
So, here is what I would do: get your ham/spam mboxes in order, stop spamd, 
tarball the existing Bayes files, delete them, restart spamd (so it will 
create fresh files), and then train it with your ham/spam.  That way you don't 
have to wait until Bayes has learned the 200 hams and 200 spams it needs 
before it starts firing.
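Roughly, assuming a per-user Bayes db in the default ~/.spamassassin location 
and an init script for spamd (adjust both for your setup):

    /etc/init.d/spamd stop                               # or however you stop spamd
    tar czf bayes-backup.tar.gz ~/.spamassassin/bayes_*  # keep the old db, just in case
    rm ~/.spamassassin/bayes_*                           # tokens, seen db, journal
    /etc/init.d/spamd start                              # fresh files get created
    sa-learn --ham  --mbox /path/to/ham.mbox             # then train from your corpora
    sa-learn --spam --mbox /path/to/spam.mbox
    sa-learn --dump magic                                # confirm nspam/nham are past 200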
 
BTW, there should be a document somewhere on the "Proper Care and Feeding of 
Bayes for a Long and Healthy Life".  If there isn't, we should create one.
 
Gary

________________________________

From: news on behalf of Alt Thomy
Sent: Tue 7/13/2004 6:19 AM
To: [EMAIL PROTECTED]
Subject: Bayes issues



Hi,
my bayes looks like this:
0.000          0          2          0  non-token data: bayes db version
0.000          0       4588          0  non-token data: nspam
0.000          0      15006          0  non-token data: nham
0.000          0     148621          0  non-token data: ntokens
0.000          0 1088644104          0  non-token data: oldest atime
0.000          0 1089366749          0  non-token data: newest atime
0.000          0 1089366089          0  non-token data: last journal sync atime
0.000          0 1089335321          0  non-token data: last expiry atime
0.000          0     691200          0  non-token data: last expire atime delta
0.000          0       7297          0  non-token data: last expire reduction count

I had been using it for a long time with only SA's autolearn, and recently I
started training it manually. Basically, I train it only on false positives
and false negatives (mistake-based learning). It seems to work fine, properly
classifying spam and ham messages. Is my whole approach incorrect?
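Concretely, my mistake-based pass is just the two obvious sa-learn calls (the
mbox names are placeholders); as I understand it, sa-learn notices when a
message was previously learned the other way and relearns it:

    sa-learn --ham  --mbox fp.mbox   # hams that were tagged as spam
    sa-learn --spam --mbox fn.mbox   # spams that slipped through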

Also, based on the above numbers of ham and spam, and considering that
sa-learn's man page says that above 5,000 messages there is no significant
improvement, how much more should I let it grow?
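From the man page I gather the growth is bounded by token expiry anyway; if I
read the option right, something like this in local.cf caps it (150,000 tokens
should be the default):

    # limit the token database; expiry trims it back down (default value shown)
    bayes_expiry_max_db_size 150000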

However, my experience says that, with a large number of SA rules in use, it
would not be a problem to empty it, as the rules will most probably identify
the spam anyway. All I have to do is keep training at the same frequency I do
now (i.e. it doesn't really matter if already manually learned spams and hams
are lost; my workload stays the same). It's a strange approach, but it works
for me (I get about 4,000 messages per day, of which about 40% is spam).
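Emptying it, if I ever do it, would just be a single command:

    sa-learn --clear    # wipe the Bayes database and start fresh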

I would appreciate any comments.
Regards,
Alty
