Re: BAYES question
Joe Acquisto-j4 wrote on 2013-04-27 13:37:
>> Very interesting. However, I don't see any BAYES_xx markings in the headers at all.

On 27.04.13 19:00, Joe Acquisto-j4 wrote:
> I seem to have not stated my query clearly, as several have suggested this. Or, it was perfectly understood, but I am not comprehending. I don't want to know how to see the tokens, etc. (I do, but already know how). I was curious about this BAYES_xx thing, which I gather is something I should see in a message header.

In one of your former e-mails you were complaining about spam hitting BAYES_50. What has changed since? Did you clear the bad BAYES database? Look at it again: check folder and file permissions, and use sa-learn --dump magic to see whether it contains enough ham and spam.

-- Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Re: BAYES question
On 27.04.2013 04:54, Karsten Bräckelmann wrote:
> And it is good advice to keep the initial training corpora to a ratio roughly resembling your ham/spam ratio, or maybe 1/1. (At this point, we're approaching voodoo. Learning 10 times more ham than spam is most likely to be a bad choice, though.)

I don't see any problem with having a corpus like this:

0.000          0      28252          0  non-token data: nspam
0.000          0     187579          0  non-token data: nham

I have no problems with Bayes whatsoever.

-- There's small choice in rotten apples. -- William Shakespeare, The Taming of the Shrew
Re: BAYES question
On 04/27/2013 10:59 AM, Jari Fredriksson wrote: 27.04.2013 04:54, Karsten Bräckelmann kirjoitti: And it is good advice to keep the initial training corpora to a ratio roughly assembling your ham/spam ratio, or maybe 1/1. (At this point, we're approaching woodoo. Learning 10 times more ham than spam is most likely to be a bad choice, though.) I don't see any problem with having a corpus like this: 0.000 0 28252 0 non-token data: nspam 0.000 0 187579 0 non-token data: nham I have no problems with Bayes whatsoever. how many users? domains? Can hardly be a heavily spammed setup or it would look more like: 0.000 07762525 0 non-token data: nspam 0.000 04171794 0 non-token data: nham (a week's worth of tokens)
Re: BAYES question
27.04.2013 12:03, Axb kirjoitti: On 04/27/2013 10:59 AM, Jari Fredriksson wrote: 27.04.2013 04:54, Karsten Bräckelmann kirjoitti: And it is good advice to keep the initial training corpora to a ratio roughly assembling your ham/spam ratio, or maybe 1/1. (At this point, we're approaching woodoo. Learning 10 times more ham than spam is most likely to be a bad choice, though.) I don't see any problem with having a corpus like this: 0.000 0 28252 0 non-token data: nspam 0.000 0 187579 0 non-token data: nham I have no problems with Bayes whatsoever. how many users? domains? Can hardly be a heavily spammed setup or it would look more like: 0.000 07762525 0 non-token data: nspam 0.000 04171794 0 non-token data: nham (a week's worth of tokens) Only me for SPAM HAM and my colleagues for spam. While I try and collect spam wherever I can, the amount of spam has been dropped big time during the couple of years. My boss seems to draw most of the spam of my sources ;) The ham corpus contains also many List-Id (mailing lists). That means they are included in my Bayes training, not in my ruleqa. And I do skim them thru, and move possible spam from them to my spam corpus (not to ruleqa though). -- For a light heart lives long. -- Shakespeare, Love's Labour's Lost signature.asc Description: OpenPGP digital signature
Re: BAYES question
> . . . Do train those, which have a Bayesian probability close(r) to 0.5. Or even worse, have a Bayesian probability contrary to the overall score, or actual classification. Training the plethora of spam hitting BAYES_99 might not be a mistake. But it is pretty likely, to *not* improve general SA performance. You're training Bayes. Not SpamAssassin.

Very interesting. However, I don't see any BAYES_xx markings in the headers at all. I assumed that is because it is not scoring yet, due to low samples. Or some other reason. How do I find that number?

joe a.
Re: BAYES question
>> Do train those, which have a Bayesian probability close(r) to 0.5. Or even worse, have a Bayesian probability contrary to the overall score, or actual classification. Training the plethora of spam hitting BAYES_99 might not be a mistake. But it is pretty likely, to *not* improve general SA performance. You're training Bayes. Not SpamAssassin.

On 27.04.13 07:37, Joe Acquisto-j4 wrote:
> Very interesting. However, I don't see any BAYES_xx markings in the headers at all. I assumed that is because it is not scoring yet, due to low samples. Or some other reason. How do I find that number?

sa-learn --dump magic, of course. You need at least 200 hams and 200 spams for BAYES to start firing.

At the beginning, you can train ANY mail. Later, it's easier just to correct mail where Bayes misfired (ham not hitting BAYES_00 and spam not hitting BAYES_99).

-- Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
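As a rough illustration of the check described above, here is a minimal Python sketch that reads the nspam and nham counters out of `sa-learn --dump magic` output and compares them with the 200/200 minimum mentioned in this thread. The function names and the sample text are made up for the example; the column layout is assumed from the dumps quoted elsewhere in the thread.

```python
import re

MIN_TRAINED = 200  # per the thread: at least 200 ham and 200 spam


def parse_magic(dump_text):
    """Return a dict such as {'nspam': 28252, 'nham': 187579} from dump output."""
    counts = {}
    for line in dump_text.splitlines():
        # Assumed layout: prob, spam-count, value, atime, "non-token data: <name>"
        m = re.match(r"\s*\S+\s+\d+\s+(\d+)\s+\d+\s+non-token data: (.+?)\s*$", line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts


def bayes_ready(counts, minimum=MIN_TRAINED):
    """True once both ham and spam counts have reached the minimum."""
    return counts.get("nspam", 0) >= minimum and counts.get("nham", 0) >= minimum


# Sample text copied from the figures posted in this thread.
SAMPLE = """\
0.000          0      28252          0  non-token data: nspam
0.000          0     187579          0  non-token data: nham
"""

counts = parse_magic(SAMPLE)
print(counts, bayes_ready(counts))
```

In practice the text would come from running sa-learn --dump magic as the same user that SpamAssassin scans mail as, since that is the database that matters.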
Re: BAYES question
Joe Acquisto-j4 skrev den 2013-04-27 01:38: path-to-ham as one might feed missed spam, sa-learn --spam path-to-spam yes, but if you sort based on scores there is no point in using bayes in the first place only thing that is important is to feed what is spam and what is ham to learning -- senders that put my email into body content will deliver it to my own trashcan, so if you like to get reply, dont do it
Re: BAYES question
Joe Acquisto-j4 wrote on 2013-04-27 13:37:
> Very interesting. However, I don't see any BAYES_xx markings in the headers at all.

How is your Bayes set up? What does sa-learn --dump magic give?

> I assumed that is because it is not scoring yet, due to low samples. Or some other reason.

That could be the reason. Another possibility is that learning and scanning are done by different users, each with their own Bayes data.

> How do I find that number?

sa-learn --dump magic
Re: BAYES question
Jari Fredriksson wrote on 2013-04-27 10:59:
> 0.000          0      28252          0  non-token data: nspam
> 0.000          0     187579          0  non-token data: nham
> I have no problems with Bayes whatsoever.

This is a well-working MTA setup, not a Bayes problem :)
Re: BAYES question
Hello John,

Saturday, April 27, 2013, 12:50:34 AM, you wrote:

JH> Simple rule: train any ham that doesn't hit BAYES_00.

??? What about ham that hits BAYES_00 and shows autolearn=no ?

-- Best regards, Niamh mailto:ni...@fullbore.co.uk
Re: BAYES question
Niamh Holding wrote on 2013-04-27 18:25:
> What about ham that hits BAYES_00 and shows autolearn=no ?

If it is actually spam, run sa-learn --spam msg. Otherwise the above is fine: there is no need to learn it if it has already been learned as ham.
Re: BAYES question
On 27.04.2013 18:24, Benny Pedersen wrote:
> Jari Fredriksson wrote on 2013-04-27 10:59:
>> 0.000          0      28252          0  non-token data: nspam
>> 0.000          0     187579          0  non-token data: nham
>> I have no problems with Bayes whatsoever.
> this is an good working mta setup, not a bayes problem :)

My MTA does not reject anything. And I collect all spam from Gmail and other sources just to get spam. I love spam.
Re: BAYES question
On Fri, 26 Apr 2013, Joe Acquisto-j4 wrote: So, I could just feed a bunch of good mail, to --ham, and spam that is correctly marked as spam as well as missed spam, to --spam? Correct; the important part is that what you train with must be *correctly classified* - training a ham as spam is not helpful... :) Hang onto that as part of your base corpus in case you need to retrain. -- John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Your mouse has moved. Your Windows Operating System must be relicensed due to this hardware change. Please contact Microsoft to obtain a new activation key. If this hardware change results in added functionality you may be subject to additional license fees. Your system will now shut down. Thank you for choosing Microsoft. --- 331 days since the first successful private support mission to ISS (SpaceX)
Re: BAYES question
On Sat, 27 Apr 2013, Niamh Holding wrote:
> Hello John,
> Saturday, April 27, 2013, 12:50:34 AM, you wrote:
> JH> Simple rule: train any ham that doesn't hit BAYES_00.
> ??? What about ham that hits BAYES_00 and shows autolearn=no ?

If a ham hits BAYES_00 that means the Bayes system did a good job of recognizing it. Training on it won't hurt, but it won't help much either.

-- John Hardin KA7OHZ http://www.impsec.org/~jhardin/
Re: BAYES question
On Sat, 2013-04-27 at 11:59 +0300, Jari Fredriksson wrote: 27.04.2013 04:54, Karsten Bräckelmann kirjoitti: And it is good advice to keep the initial training corpora to a ratio roughly assembling your ham/spam ratio, or maybe 1/1. (At this point, we're approaching woodoo. Learning 10 times more ham than spam is most likely to be a bad choice, though.) I don't see any problem with having a corpus like this: I don't see a problem there, either. And if you re-read the complete paragraph, carefully avoiding overvaluing the voodoo marked comment, you might realize I even suggested it. In your case. You mentioned, you do not get much spam anyway. Moreover, you also include mailing-lists in your ham corpus, whereas the average user likely doesn't even participate on a single list. Point being, am I correct in assuming these numbers roughly reflect your ham/spam ratio? 0.000 0 28252 0 non-token data: nspam 0.000 0 187579 0 non-token data: nham I have no problems with Bayes whatsoever. -- char *t=\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4; main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;il;i++){ i%8? c=1: (c=*++x); c128 (s+=h); if (!(h=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
Re: BAYES question
On 27.04.2013 23:15, Karsten Bräckelmann wrote:
> Point being, am I correct in assuming these numbers roughly reflect your ham/spam ratio?
>
> 0.000          0      28252          0  non-token data: nspam
> 0.000          0     187579          0  non-token data: nham

Yes. I want more spam, but it nowadays tries to evade me, dunno why.
Re: BAYES question
Hi, To feed ham to bayes, should one only user mis-flagged mail, or may one use unflagged (below 5) mail? Expressed differently, can one feed good messages, sa-learn --ham path-to-ham as one might feed missed spam, sa-learn --spam path-to-spam You can train hams that have scored high (i.e. misclassified hams) and you can proactively train low-scoring mail to try to avoid problems in the first place. If there are some spam messages with BAYES_00, and the database needs to be corrected, is it best to just learn it as spam, or use --forget, then --spam? I just grepped the quarantine and there were a handful of BAYES_00 with overall scores between 6 and 10. Thanks, Alex
Re: BAYES question
On 4/27/2013 at 1:20 PM, John Hardin jhar...@impsec.org wrote: On Fri, 26 Apr 2013, Joe Acquisto-j4 wrote: So, I could just feed a bunch of good mail, to --ham, and spam that is correctly marked as spam as well as missed spam, to --spam? Correct; the important part is that what you train with must be *correctly classified* - training a ham as spam is not helpful... :) Hang onto that as part of your base corpus in case you need to retrain. -- John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Your mouse has moved. Your Windows Operating System must be relicensed due to this hardware change. Please contact Microsoft to obtain a new activation key. If this hardware change results in added functionality you may be subject to additional license fees. Your system will now shut down. Thank you for choosing Microsoft. --- 331 days since the first successful private support mission to ISS (SpaceX) Thanks. I have created YASF (yet another shared folder) to assist in this adventure. joe a.
Re: BAYES question
On 4/27/2013 at 11:17 AM, Benny Pedersen m...@junc.eu wrote: Joe Acquisto-j4 skrev den 2013-04-27 13:37: Very interesting. However, I don't see any BAYES_xx markings in the headers at all. how is you bayes setup ? what gives sa-learn --dump magic ? I assumed that is because it is not scoring yet, due to low samples. Or some other reason. that could be the reason, others might be diff users bayes learning How do I find that number? --dump magic I seem to have not stated my query clearly, as several have suggested this. Or, it was perfectly understood, but I am not comprehending. I don't want to know how to see the tokens, etc (I do, but already know how). I was curious about this BAYES_xx thing, which I gather is something I should see in a message header. joe a.
Re: BAYES question
On Sat, 27 Apr 2013, Alex wrote:
> Hi,
>>> To feed ham to bayes, should one only use mis-flagged mail, or may one use unflagged (below 5) mail? Expressed differently, can one feed good messages, sa-learn --ham path-to-ham as one might feed missed spam, sa-learn --spam path-to-spam
>> You can train hams that have scored high (i.e. misclassified hams) and you can proactively train low-scoring mail to try to avoid problems in the first place.
> If there are some spam messages with BAYES_00, and the database needs to be corrected, is it best to just learn it as spam, or use --forget, then --spam? I just grepped the quarantine and there were a handful of BAYES_00 with overall scores between 6 and 10.

Just re-learn it as spam; that automatically forgets that it was ham. --forget is only useful to completely remove that message from the database.

-- John Hardin KA7OHZ http://www.impsec.org/~jhardin/
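To picture why a separate --forget step is unnecessary before re-learning, here is a toy Python model of the bookkeeping. It is only an illustration of the behaviour described above, not SpamAssassin's actual implementation: the class, its methods, and the message id are all hypothetical, and real Bayes also adjusts per-token counts, which this sketch leaves out.

```python
class ToyBayesStore:
    """Tracks which message was learned as what, plus nspam/nham totals."""

    def __init__(self):
        self.seen = {}  # message id -> "ham" or "spam"
        self.counts = {"ham": 0, "spam": 0}

    def learn(self, msg_id, label):
        """Learn a message; a message learned the other way is switched over."""
        previous = self.seen.get(msg_id)
        if previous == label:
            return False  # already learned this way, nothing to do
        if previous is not None:
            self.counts[previous] -= 1  # implicit forget of the old label
        self.seen[msg_id] = label
        self.counts[label] += 1
        return True

    def forget(self, msg_id):
        """Remove a message from the store entirely."""
        previous = self.seen.pop(msg_id, None)
        if previous is not None:
            self.counts[previous] -= 1
        return previous is not None


store = ToyBayesStore()
store.learn("msg-1", "ham")   # wrongly learned as ham
store.learn("msg-1", "spam")  # re-learning as spam undoes the ham entry
print(store.counts)
```

The point of the sketch is that the second learn call leaves the ham count at zero without any explicit forget.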
Re: BAYES question
On Sat, 27 Apr 2013, Joe Acquisto-j4 wrote:
> I don't want to know how to see the tokens, etc. (I do, but already know how). I was curious about this BAYES_xx thing, which I gather is something I should see in a message header.

Yes, the BAYES_## are rules that would show up in the hit-rules list in the processed message's headers, assuming Bayes is working.

-- John Hardin KA7OHZ http://www.impsec.org/~jhardin/
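For anyone wanting to check a stored message for that hit, a small Python sketch follows. It assumes the message carries an X-Spam-Status header with a tests= list, which is common but depends on how SpamAssassin is integrated (some glue software rewrites or omits it). The sample message is invented for illustration.

```python
import re
from email import message_from_string


def bayes_rules(raw_message):
    """Return the BAYES_nn rule names listed in X-Spam-Status, if any."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    # The header is often folded across lines; the search does not care.
    return re.findall(r"\bBAYES_\d+\b", status)


# Invented sample; rule names and scores mirror a report quoted in this archive.
RAW = (
    "X-Spam-Status: No, score=-0.1 required=5.0 tests=BAYES_40,\n"
    "\tFORGED_RCVD_HELO autolearn=no version=3.3.1\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

print(bayes_rules(RAW))
```

An empty list means no BAYES rule fired on that message, which is what one would expect while the database is still below the training minimum.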
Re: BAYES question
On Sat, 2013-04-27 at 19:00 -0400, Joe Acquisto-j4 wrote:
>>> Very interesting. However, I don't see any BAYES_xx markings in the headers at all. I assumed that is because it is not scoring yet, due to low samples. Or some other reason.
>> that could be the reason, others might be diff users bayes learning
>>> How do I find that number?
>> --dump magic
> I seem to have not stated my query clearly, as several have suggested this. Or, it was perfectly understood, but I am not comprehending. I don't want to know how to see the tokens, etc. (I do, but already know how).

You assumed Bayes might not be working due to low samples, and asked how to find that number. Are you not asking for the number of ham and spam learned?

sa-learn --dump magic

See the result of this command for the number of spam and ham learned (nspam and nham respectively).

You must run that command as the user SA runs as when scanning incoming mail. That might be the recipient's system user, or a site-wide user, depending on your setup. Obviously, the initial Bayes training needs to be done as that very user (or users), too. Which user(s) is that? Do you use site-wide or per-user configuration?

> I was curious about this BAYES_xx thing, which I gather is something I should see in a message header.

The BAYES_nn headers are rules reflecting the Bayesian probability of the mail on a scale between ham (0.00) and spam (1.00). The two-digit number is the probability expressed in percent.

As has been pointed out at least twice in this thread, Bayes will not start working until at least 200 ham and 200 spam have been trained. How many have you trained so far? (Hint: output of the above command.)

Also, of course, Bayes needs to be enabled. It is by default, though you might want to cross-check with your site and/or user configuration. See the section Learning Options in the docs.
http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html
Re: BAYES question
On Fri, 26 Apr 2013, Joe Acquisto-j4 wrote: To feed ham to bayes, should one only user mis-flagged mail, or may one use unflagged (below 5) mail? Expressed differently, can one feed good messages, sa-learn --ham path-to-ham as one might feed missed spam, sa-learn --spam path-to-spam You can train hams that have scored high (i.e. misclassified hams) and you can proactively train low-scoring mail to try to avoid problems in the first place. Simple rule: train any ham that doesn't hit BAYES_00. -- John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- The Tea Party wants to remove the Crony from Crony Capitalism. OWS wants to remove Capitalism from Crony Capitalism. -- Astaghfirullah --- 330 days since the first successful private support mission to ISS (SpaceX)
Re: BAYES question
On 4/26/2013 at 7:50 PM, John Hardin jhar...@impsec.org wrote: On Fri, 26 Apr 2013, Joe Acquisto-j4 wrote: To feed ham to bayes, should one only user mis-flagged mail, or may one use unflagged (below 5) mail? Expressed differently, can one feed good messages, sa-learn --ham path-to-ham as one might feed missed spam, sa-learn --spam path-to-spam You can train hams that have scored high (i.e. misclassified hams) and you can proactively train low-scoring mail to try to avoid problems in the first place. Simple rule: train any ham that doesn't hit BAYES_00. -- John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ Well, right now, there are no bayes hits at all. I cleared bayes to re-train, after correcting for a botched initial scheme. While I am getting a fair amount of missed spam, there is very little mis-classified. So I am looking for a way to speed up learning. So, I could just feed a bunch of good mail, to --ham, and spam that is correctly marked as spam as well as missed spam, to --spam? or do I need a rest? joe a.
Re: BAYES question
On Fri, 2013-04-26 at 21:25 -0400, Joe Acquisto-j4 wrote:
> Well, right now, there are no bayes hits at all. I cleared bayes to re-train, after correcting for a botched initial scheme. While I am getting a fair amount of missed spam, there is very little mis-classified. So I am looking for a way to speed up learning.

Initial training. Train on existing, verified corpora.

> So, I could just feed a bunch of good mail, to --ham, and spam that is correctly marked as spam as well as missed spam, to --spam?

Yes. Bayes by default will not be used for scoring (it does learn, though), unless at least 200 spam and 200 ham have been learned. So by training, you can have Bayes kick in earlier.

Ham usually does not change much over time. Spam does, significantly. Training 1000 ham received over the last months, years, whatever, thus generally is OK. You'd want to limit the time span for training spam, though.

And it is good advice to keep the initial training corpora to a ratio roughly resembling your ham/spam ratio, or maybe 1/1. (At this point, we're approaching voodoo. Learning 10 times more ham than spam is most likely to be a bad choice, though.)

> or do I need a rest?

Dunno. Got a beer near you?
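The initial-training advice above could be scripted along these lines. This is a hedged Python sketch: it only builds the sa-learn command lines (using the --ham and --spam options discussed in the thread) and offers a helper for limiting spam to a recent time window. The directory names and the 90-day window are placeholders, not recommendations from the thread.

```python
import shlex

DAY = 86400  # seconds in a day


def sa_learn_command(kind, path):
    """Build the argument list for one sa-learn run; kind is 'ham' or 'spam'."""
    if kind not in ("ham", "spam"):
        raise ValueError("kind must be 'ham' or 'spam'")
    return ["sa-learn", "--" + kind, path]


def recent_enough(mtime, now, max_age_days):
    """True if a message last modified at `mtime` falls inside the window."""
    return now - mtime <= max_age_days * DAY


if __name__ == "__main__":
    # Placeholder corpus locations; both must contain verified, hand-sorted mail.
    for kind, path in (("ham", "corpus/ham"), ("spam", "corpus/spam")):
        print(" ".join(shlex.quote(arg) for arg in sa_learn_command(kind, path)))
```

The commands are printed rather than executed so they can be reviewed first, and they would need to be run as the user SpamAssassin scans mail as.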
Re: BAYES question
On Fri, 2013-04-26 at 19:38 -0400, Joe Acquisto-j4 wrote:
> To feed ham to bayes, should one only use mis-flagged mail, or may one use unflagged (below 5) mail?

The Bayesian classifier is a subsystem mostly independent from SA.

Most SA rules are rather white or black. Match, or don't. And scored according to the probability of actually distinguishing ham from spam. The higher the absolute score of a given rule, the higher the probability to be ham (negative score) or spam (positive score). Mere hints, but not reliable indicators, have low scores.

For a scoring system like SA, this is generically true. With different, varying scales.

It is correct for single rules. "Dunno" would be a rule's score of zero. The higher the score, the more spammy it is.

It is correct for the overall, resulting score of a message. The "dunno" tipping point is 5 by default. A message scoring 4.5 is more likely ham, though you'd better not bet on it.

And it also is correct for the Bayes subsystem, with a notable scale of its own -- ranging from 0 (ham) to 1 (spam), with 0.5 being a big fat shrug. The BAYES_nn rules and their scores are set accordingly. BAYES_50 really should have no score.

Back to the question, and explaining why I mentioned the above. "Mis-flagged mail", false positives and false negatives, do exist on multiple levels. The OP mentioned it with respect to the *overall* score. And asked about *Bayes* training.

Training Bayes, first and foremost, helps Bayes only. In the end, it might make a significant difference overall, sure. However, when it comes to the question whether training Bayes might help... Look at the Bayesian probability. Not the overall SA score.

Do train those, which have a Bayesian probability close(r) to 0.5. Or even worse, have a Bayesian probability contrary to the overall score, or actual classification. Training the plethora of spam hitting BAYES_99 might not be a mistake. But it is pretty likely, to *not* improve general SA performance.

You're training Bayes. Not SpamAssassin.
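The selection rule above (train what Bayes is unsure or wrong about, skip what it already gets right) can be sketched in Python as follows. The function name and the 0.9 confidence cut-off are assumptions made for the example, not values from SpamAssassin or from the thread.

```python
def worth_training(bayes_prob, is_spam, confident=0.9):
    """Decide whether training a verified message is likely to help Bayes.

    bayes_prob: Bayesian spam probability, 0.0 (ham) to 1.0 (spam)
    is_spam:    the message's verified, true classification
    confident:  hypothetical cut-off above which Bayes already "knows"
    """
    if not 0.0 <= bayes_prob <= 1.0:
        raise ValueError("bayes_prob must be between 0 and 1")
    # How strongly Bayes already leans toward the correct class.
    confidence = bayes_prob if is_spam else 1.0 - bayes_prob
    return confidence < confident


# Spam near BAYES_99: training adds little.
print(worth_training(0.999, is_spam=True))
# Spam that Bayes shrugs at, or thinks is ham: train it.
print(worth_training(0.5, is_spam=True), worth_training(0.02, is_spam=True))
```

Note that the decision uses only the Bayesian probability and the verified class, never the overall SpamAssassin score, which is the distinction the message above draws.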
Re: Bayes Question
Craig wrote:
> Hello All-
> My bayes database seems to have problems and I would like suggestions on how to correct it. Here is my issue: I take any spam email from my users and run the following commands
> a. spamassassin -R name of spam file to check
> b. spamassassin -r name of spam file to check
> c. sa-learn --forget name of spam file to check
> d. sa-learn --spam name of spam file to check

1) Running forget before training is redundant. SpamAssassin is smart enough to realize when it is retraining a message that was previously learned the wrong way and compensate correctly.

2) Running sa-learn --spam after spamassassin -r is redundant, unless you've set bayes_learn_during_report to 0.

So really, you only need to do a and b. I assume you're trying to over-do-it on purpose, but did want to point out what parts are redundant for clarity's sake.

> I re-run an email (spamassassin -D -t name of spam file to check name of spam file to check.txt) to check all is well, that bayes learned the email as spam. Today after running the above I still have several messages with the following output info:
>
> Content analysis details:   (-0.1 points, 5.0 required)
>
>  pts rule name              description
> ---- ---------------------- --------------------------------------------------
>  0.1 FORGED_RCVD_HELO       Received: contains a forged HELO
> -0.2 BAYES_40               BODY: Bayesian spam probability is 20 to 40%
>                             [score: 0.2729]
>
> Thoughts?

My first thought is: what was its bayes score *before* you trained?

Training a single message as spam will not guarantee that it will immediately get a high bayes score. If most of the tokens in the message strongly match a large volume of nonspam training, it will take a similar volume of spam training to overcome it. Otherwise one mis-trained message could wildly upset your whole bayes database, causing large numbers of mis-marked messages.

If you really are seeing this problem a lot, you might want to take some of the spams and run them through spamassassin -D bayes in order to get the individual tokens and their scores printed out. (Previously you used spamassassin -D -t, which isn't the same. That's general debugging, but doesn't enable detailed bayes debugging.)
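To relate a raw score such as 0.2729 to the rule name in the report, here is a minimal Python sketch of the bucket mapping. The ranges are the ones I recall from the stock rule descriptions (BAYES_40 being "20 to 40%" matches the report quoted above); treat the table and the handling of exact boundary values as assumptions and check them against the rules shipped with your version. Newer releases may define additional buckets not listed here.

```python
from bisect import bisect_left

# (upper bound of the range, rule name); assumed from the stock rule descriptions.
BUCKETS = [
    (0.01, "BAYES_00"),
    (0.05, "BAYES_05"),
    (0.20, "BAYES_20"),
    (0.40, "BAYES_40"),
    (0.60, "BAYES_50"),
    (0.80, "BAYES_60"),
    (0.95, "BAYES_80"),
    (0.99, "BAYES_95"),
    (1.00, "BAYES_99"),
]
UPPER_BOUNDS = [upper for upper, _ in BUCKETS]


def bayes_rule(prob):
    """Map a Bayesian spam probability (0.0 to 1.0) to a BAYES_nn rule name."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # First bucket whose upper bound is >= prob (upper bound treated as inclusive).
    return BUCKETS[bisect_left(UPPER_BOUNDS, prob)][1]


print(bayes_rule(0.2729))  # the score from the report above
```

So a message sitting at 0.2729 after one round of training is still on the ham side of the scale, which fits the explanation that a single training run cannot outweigh a large volume of contrary tokens.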
Re: Bayes question
M. Lewis wrote: I recently lost a hard drive and have had to setup everything again. I'm seeing a fair amount of spam that is getting through my filters. From what I can see in the headers of messages, bayes does not seem to be used at all. I'm reasonable sure this is the reason I'm seeing spam. If I do #spamassassin -t -D spam.txt I can clearly see bayes is being used. Suggestions for what to check? Thanks for any ideas. M sa-learn --dump magic What does it say? -- Steve
Re: Bayes question
M. Lewis wrote: Thanks Steve, # sa-learn --dump magic 0.000 0 3 0 non-token data: bayes db version 0.000 0 57468 0 non-token data: nspam 0.000 0 16419 0 non-token data: nham 0.000 0 181931 0 non-token data: ntokens 0.000 0 1139892654 0 non-token data: oldest atime 0.000 0 1140583854 0 non-token data: newest atime 0.000 0 0 0 non-token data: last journal sync atime 0.000 0 1140584727 0 non-token data: last expiry atime 0.000 0 691200 0 non-token data: last expire atime delta 0.000 0 1510 0 non-token data: last expire reduction count Please keep replies on the list I was wondering if you'd had enough ham and spam to get past the minimums. Looks like you have. How about posting the output from spamassassin -D --lint -- Steve
Re: Bayes question
Sorry, I am in the habit of 'reply' as opposed to 'reply all'.

I see no 'obvious' errors in spamassassin -D --lint, which was the first thing I checked.

Shortly before you asked about the 'sa-learn --dump magic', I found this message from Matt:

http://marc.theaimsgroup.com/?l=spamassassin-users&m=113327783327806&w=2

I did this and now I'm seeing bayes markups. So hopefully it was just a perms issue that is now resolved.

Thanks,
Mike

Steven Stern wrote:
> M. Lewis wrote:
>> Thanks Steve,
>>
>> # sa-learn --dump magic
>> 0.000          0          3          0  non-token data: bayes db version
>> 0.000          0      57468          0  non-token data: nspam
>> 0.000          0      16419          0  non-token data: nham
>> 0.000          0     181931          0  non-token data: ntokens
>> 0.000          0 1139892654          0  non-token data: oldest atime
>> 0.000          0 1140583854          0  non-token data: newest atime
>> 0.000          0          0          0  non-token data: last journal sync atime
>> 0.000          0 1140584727          0  non-token data: last expiry atime
>> 0.000          0     691200          0  non-token data: last expire atime delta
>> 0.000          0       1510          0  non-token data: last expire reduction count
>
> Please keep replies on the list.
>
> I was wondering if you'd had enough ham and spam to get past the minimums. Looks like you have. How about posting the output from spamassassin -D --lint
Re: bayes question (sa-learn)
Philipp Snizek wrote:
> [...] However, I fear SA learns that headers coming from my internal MTA could be spam and so causing false results on real spam.

Exactly. Forwarding e-mail breaks the original header information and has to be avoided.

> What experiences have you made or how have you solved this? (e.g. by setting up an IMAPd on the spamgateway?)

You can configure an imapd wherever you want. There are many tools out there to fetch IMAP mailboxes into a local maildir/mbox/anything, which can then be used by sa-learn. I use Cyrus and have two folders, spamreport and hamreport, which are shared among my users, who have write but no read access. Even if I ignored those folders, my users would just be happy to give feedback and contribute. ;-)

-- CU, Patrick.
Re: Bayes question
Robert Swan wrote:
> I have a pair of SpamAssassin servers filtering e-mail (SpamAssassin 3.0.4, spamd/spamc, Postfix, Red Hat 9). I was wondering if I could share the Bayes database between the two servers rather than having each with its own and having to do the sa-learn process twice. Any thoughts?

Yes... Use the Bayes (My|Postgre)SQL modules; see the docs on how to set this up.

-- Thanks, James
RE: Bayes question
I attempted to do that once, with a network file system, but it didn't seem to know how to handle the locking properly. I know I did something wrong, so if anyone else has a solution, I'd also be happy to hear it! :-)

-Alan Fullmer [EMAIL PROTECTED] www.xnote.com www.zoobuh.com

From: Robert Swan [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 27, 2005 12:22 PM
To: users@spamassassin.apache.org
Subject: Bayes question

> I have a pair of SpamAssassin servers filtering e-mail (SpamAssassin 3.0.4, spamd/spamc, Postfix, Red Hat 9). I was wondering if I could share the Bayes database between the two servers rather than having each with its own and having to do the sa-learn process twice. Any thoughts?
RE: Bayes question
Boy... anytime I've done some kind of network file sharing across a system or two, I have never done it for good performance reasons... only convenience sakes. And even then, never large files. Almost a decade ago when I was performing massive COBOL database conversions to load data into flat files to be imported into a relational database, I noticed a significant decrease in performance of the machine that is accessing remotely stored files. It was far easier/faster to auto-ftp the half a gigabyte of information to another machine so that it could have the information *local* and therefore it can access the data extremely quickly. Depending on the machine and it's resources, I'd expect it to slow down it's processing between 25-40% on the average. If the data remained on a remote machine, then the CPU has to use it's resources to handle the resources on the remote file system as if it's a part of it's own. It is then at the whim of a NFS file system handle that may or may not stay fresh. Even if the machines are separated by a couple feet of cable .. for me .. back then ... NFS wasn't reliable enough for me to be able to bank on it being up. Because when the remote NFS file handle went stale, it caused the local machine to hang and drag. Maybe NFS is better now than back then... I don't know. The machine doesn't make a network *call* to the other machine to borrow it's resources, it uses it's own resources to access the remote files as if they are local yet, it does it over a network cable rather than the typical high-speed of motherboard's bus that would access the local hard drive. So... the only way I'd do this in this day and age would be to have the kind of hardware that you could build a multi-node supercomputer where they all share the same hard drive over a fiber optic network with lightning quick hard disks on the server node as it shares its resources with the worker nodes. 
In that case, the networking element has been removed from the equation as the slowest link in the chain of events. On Wed, July 27, 2005 16:37, Alan Fullmer said: I attempted to do that once, with a network file system, but it didn't seem to know how to handle the locking properly. I know I did something wrong, so if anyone else has a solution, I'd also be happy to hear it! :-) -- Tyler Nally [EMAIL PROTECTED]
Re: Bayes question
Alan Fullmer wrote: I attempted to do that once, with a network file system, but it didn’t seem to know how to handle the locking properly. I know I did something wrong, so if anyone else has a solution, I’d also be happy to hear it! :-) As JamesDR suggested: do it right, use SQL. It's a database that's *designed* to be accessed remotely. Trying to share a DB_File based database over NFS is asking for poor performance and trouble.
Re: Bayes question
Joe Zitnik wrote: I apologize if this has been asked before, but I need some clarification. If I have autolearn for ham set to 0, and the default BAYES_00 score assigns mail a negative value, and a spam message comes through with enough good text in it to give it a BAYES_00 and therefore a negative value BUT it is not a message that has been learned before, is there the potential for that mail to be learned as ham based on the negative BAYES score assigned it? No. It's 100% impossible, as the bayes autolearner makes its judgments based on the score the message would have gotten if bayes were disabled. Avoiding that kind of self-feedback is exactly why this is done. (Note that calculating the score as if bayes were disabled also involves calculating the score using scoreset 0 or 1 instead of 2 or 3.) The autolearner also ignores any userconf flagged rules, such as white and blacklists.
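The rule described above can be sketched in a few lines. This is only an illustration of the idea, not SpamAssassin's code: the rule names other than BAYES_00, the scores, the threshold values and the comparison operators are all assumptions, and the switch to scoreset 0 or 1 mentioned in the reply is not modelled.

```python
# Illustrative sketch only. Rule names (apart from BAYES_00), scores,
# thresholds and comparison operators are assumptions for this example.

def autolearn_score(hits):
    """Sum hit scores, skipping Bayes rules and user-configured rules."""
    return sum(
        score
        for name, score, is_userconf in hits
        if not name.startswith("BAYES_") and not is_userconf
    )


def autolearn_decision(hits, ham_threshold=0.0, spam_threshold=12.0):
    """Return 'ham', 'spam' or 'no' based on the Bayes-free score."""
    score = autolearn_score(hits)
    if score >= spam_threshold:
        return "spam"
    if score < ham_threshold:
        return "ham"
    return "no"


# A spam message with enough "good" text to hit BAYES_00.
hits = [
    ("BAYES_00", -2.6, False),                 # ignored: Bayes rule
    ("HYPOTHETICAL_URI_RULE", 3.0, False),     # counted
    ("HYPOTHETICAL_WHITELIST", -100.0, True),  # ignored: userconf rule
]

print(autolearn_score(hits))
print(autolearn_decision(hits))
```

With these made-up numbers the negative BAYES_00 score and the whitelist entry never enter the autolearn decision, so the message is not learned as ham.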
Re: bayes question
On Mon, Jan 10, 2005 at 04:22:03PM -0500, Sunny Forro wrote: Help! I know this has got to be the number 1 question. But I haven't had any luck with it: Actually, it doesn't happen that often these days. I'm getting: Bayes: bayes db version 2 is not able to be used, aborting! errors. I followed the instructions in UPGRADE, i.e. I shut down all running processes and verified there were no locks. Ran sa-learn --rebuild. Installed SpamAssassin 3.0.2. Ran sa-learn --sync and got the bayes db version 2 warning. Read the FAQ and found the post saying it's just a warning that'll go away. Ran sa-learn --sync again, same problem. Waited 2 days (about 1500 messages) and still receive the same problem. My bayes db files (bayes.seen bayes.toks) are having their file dates and sizes kept up to date, but I still receive this error. I'm running SA 3.0.2 with MailScanner 4.37.7 (although I get the error no matter how I run SA, including spamassassin -t), on FreeBSD 4.9p13 with Sendmail 8.13.2. http://wiki.apache.org/spamassassin/BayesUpgradeError Not clear from the above if you read that, but at the end it talks about sending the -D output. Without that, there isn't much that can be done. Michael
RE: bayes question
either. Elmer Steve Forro III (Sunny) Assistant Manager of Information Systems Compco Industries 400 West Railroad Street Columbiana, OH 44511 Phone: (330) 482-0200 x229 Fax: (330) 482-6429 Cell:(330) 240-6611 Email: [EMAIL PROTECTED] Web: http://www.compcoind.com/ -Original Message- From: Michael Parker [mailto:[EMAIL PROTECTED] Sent: Monday, January 10, 2005 4:30 PM To: Sunny Forro Cc: users@spamassassin.apache.org Subject: Re: bayes question On Mon, Jan 10, 2005 at 04:22:03PM -0500, Sunny Forro wrote: Help! I know this has got to be the number 1 question. But I haven't had any luck with it: Actually, it doesn't happen that often these days. I'm getting: Bayes: bayes db version 2 is not able to be used, aborting! errors. I followed the instructions in UPGRADE, i.e. I shut down all running processes and verified there were no locks. Ran sa-learn --rebuild. Installed SpamAssassin 3.0.2. Ran sa-learn --sync and got the bayes db version 2 warning. Read the FAQ and found the post saying it's just a warning that'll go away. Ran sa-learn --sync again, same problem. Waited 2 days (about 1500 messages) and still receive the same problem. My bayes db files (bayes.seen bayes.toks) are having their file dates and sizes kept up to date, but I still receive this error. I'm running SA 3.0.2 with MailScanner 4.37.7 (although I get the error no matter how I run SA, including spamassassin -t), on FreeBSD 4.9p13 with Sendmail 8.13.2. http://wiki.apache.org/spamassassin/BayesUpgradeError Not clear from the above if you read that, but at the end it talks about sending the -D output. Without that, there isn't much that can be done. Michael
Re: bayes question
On Mon, Jan 10, 2005 at 04:50:57PM -0500, Sunny Forro wrote: debug: bayes: found bayes db version 2 bayes: bayes db version 2 is not able to be used, aborting! at /usr/local/lib/perl5/site_perl/5.8.4/Mail/SpamAssassin/BayesStore/DBM.pm line 160. Ok, yeah, this is just a warning, not an error. Forget that it says aborting; it is just aborting the check for whether scanning is available. debug: bayes: found bayes db version 2 debug: bayes: detected bayes db format 2, upgrading debug: bayes: upgrading database format from v2 to v3 Now we get to the meat of the matter. Here is where we finally open up a read/write connection and force the upgrade. And it looks like it finishes just fine. If you run sa-learn -D --sync again, does it show the same upgrading message? You're running this command as root, I assume. Are you using a bayes_path config option? Are you using a sitewide bayes? Are you possibly just seeing this message multiple times as you run it as different users? Michael
Re: bayes question
In the future, please be sure to CC the list as well, so it can get dumped into the archives for future use. On Mon, Jan 10, 2005 at 06:13:16PM -0500, Sunny Forro wrote: Michael, I am running it as root. I get the error every time I run sa-learn -D --sync, and I don't get bayes checking with spamassassin. I haven't been running it with a bayes_path option; my old SpamAssassin used /root/.spamassassin as the db path. This is a sitewide setup; it's used to filter emails coming in for some charitable organizations hosted on this box. I effectively get the same exact output every time I run sa-learn -D --sync, with the exception of the number of tokens it ties to the db file. It still says upgrading database from version 2 to version 3 every time. Very odd. It is possible that there is some sort of db corruption that is causing a strange failure. Are there any extra files in /root/.spamassassin? Here are a few stabs in the dark that may or may not help. Try setting bayes_path and bayes_file_mode and running the sync again. Read up on sitewide bayes on the wiki. You could try to do a sa-learn --backup and then a sa-learn --restore to see if that fixes the problem. Did you move this db from another machine? Maybe it is a Berkeley DB library conflict? Perhaps a db_dump and db_load (see wiki for info) would help. For that matter, you might try a sa-learn --import first and see if that helps. Worst case, blow away the database files and start from scratch. Michael
Re: Bayes question
Chuck Campbell wrote: On Mon, Dec 20, 2004 at 12:56:43PM -0600, Steve Bondy wrote: For example, the default score in 2.6.x for BAYES_90 is either 2.454 or 2.101. If that's the only rule you hit, and your threshold is above those numbers, it will come through. But what if you repeatedly learn the message(s) in question as spam? Shouldn't bayes start to give it higher scores? If it becomes a near perfect match, it should get a bayes_99, right? true, but by default BAYES_99 alone still won't mark a message as spam. the default BAYES_99 score is either 4.07 or 1.886, and the default for spam is 5.0. also bayes won't learn the *exact* same message repeatedly. if it's already seen a message it won't process it at all. i'm not sure if it works off the message-id or a hash of the message content. i set BAYES_99 to a very high score for my personal setup, because i have never seen a legit message yet that triggered that rule. -jsd-
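The arithmetic in the reply above can be written out as a short sketch. The BAYES_99 score of 4.07 and the 5.0 threshold are the figures quoted in the message; the second rule name and its score are made up for illustration, and the "greater than or equal" comparison is an assumption of this sketch.

```python
# Sketch of the threshold arithmetic from the message above.
# HYPOTHETICAL_RULE and its score are invented for the example.

REQUIRED_SCORE = 5.0  # default spam threshold quoted in the thread


def is_spam(rule_scores, required=REQUIRED_SCORE):
    """Return (flagged, total) for a mapping of rule name to score."""
    total = sum(rule_scores.values())
    return total >= required, total


# BAYES_99 alone stays below the threshold.
flagged_alone, total_alone = is_spam({"BAYES_99": 4.07})
print(flagged_alone, total_alone)

# Any other rule worth about a point tips the message over.
flagged_both, total_both = is_spam({"BAYES_99": 4.07, "HYPOTHETICAL_RULE": 1.5})
print(flagged_both)
```

This is why raising the BAYES_99 score locally, as described above, changes the outcome for messages that hit no other rule.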
Re: Bayes question
On Mon, Dec 20, 2004 at 08:28:45PM -0800, Jon Drukman wrote: also bayes won't learn the *exact* same message repeatedly. if it's already seen a message it won't process it at all. i'm not sure if it works off the message-id or a hash of the message content. Just for clarification, it's a SHA1 hash of several message headers and a section of the body. It's not (anymore) simply the Message-Id header. :) -- Randomly Generated Tagline: Let's start by ... spelling the word correctly... - Roxanne Tisch
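A minimal sketch of that idea follows: hash a few headers plus the start of the body, and skip learning when the hash has been seen before. The message does not say which headers or how much of the body are used, so the header names, the body length and the helper names here are assumptions, not SpamAssassin's actual choices.

```python
import hashlib

# Assumptions for illustration: the headers hashed and the amount of body
# text included are NOT taken from SpamAssassin; the thread only says
# "several message headers and a section of the body".


def pseudo_msgid(headers, body, header_names=("Date", "Received"), body_chars=1024):
    """Build a SHA1-based identifier from selected headers and the body start."""
    digest = hashlib.sha1()
    for name in header_names:
        digest.update(headers.get(name, "").encode("utf-8"))
        digest.update(b"\0")  # separator so adjacent fields cannot blur together
    digest.update(body[:body_chars].encode("utf-8"))
    return digest.hexdigest()


seen = set()  # stands in for the bayes_seen database


def learn_once(headers, body):
    """Return True if the message is new and was recorded, False if already seen."""
    msgid = pseudo_msgid(headers, body)
    if msgid in seen:
        return False
    seen.add(msgid)
    return True


msg_headers = {"Date": "Mon, 20 Dec 2004 12:00:00 -0600", "Received": "from example.invalid"}
print(learn_once(msg_headers, "Buy now"))  # first time: learned
print(learn_once(msg_headers, "Buy now"))  # exact repeat: skipped
```

The same spam text arriving later with different headers would produce a different identifier, so it would be learned as a new message.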
RE: Bayes question
Just because you learn something as spam doesn't mean it will be blocked. SA will add a score to the message based on the bayes rules, but if the bayes rules are the only ones that get hit, and they score less than your threshold, it won't keep the stuff out. For example, the default score in 2.6.x for BAYES_90 is either 2.454 or 2.101. If that's the only rule you hit, and your threshold is above those numbers, it will come through. -Original Message- From: Chuck Campbell [mailto:[EMAIL PROTECTED] Sent: Monday, December 20, 2004 12:02 PM To: SpamAssassin Users Subject: Bayes question Lately I've been seeing lots of very similar spams get through my 2.6.3 setup. I don't run autolearn, but I save my spam and ham daily, and run them through sa-learn --spam and --ham respectively. I'm puzzled why a spam I've manually learned via sa-learn keeps coming through. What can I check to ensure things are working properly? BTW, I know I should upgrade, but time isn't available right now, and this setup is catching more than 99.5 percent of the spam coming in. I'm just curious about bayes not working as expected any longer, although it still catches LOTS of others, so that can't be it completely... baffled, -chuck
Re: Bayes question
On Mon, Dec 20, 2004 at 12:56:43PM -0600, Steve Bondy wrote: Just because you learn something as spam doesn't mean it will be blocked. SA will add a score to the message based on the bayes rules, but if the bayes rules are the only ones that get hit, and they score less than your threshold, it won't keep the stuff out. For example, the default score in 2.6.x for BAYES_90 is either 2.454 or 2.101. If that's the only rule you hit, and your threshold is above those numbers, it will come through. But what if you repeatedly learn the message(s) in question as spam? Shouldn't bayes start to give it higher scores? If it becomes a near perfect match, it should get a bayes_99, right? -chuck
RE: Bayes question
I'm no expert on Bayes, but as far as I know, repeatedly learning the same message over and over again doesn't do you any good. Once the tokens are in there, that's it. The bayes score goes up as more tokens in the message match. Someone please correct me if I'm wrong, and confirm if I'm right... It would help me out too. Steve -Original Message- From: Chuck Campbell [mailto:[EMAIL PROTECTED] Sent: Monday, December 20, 2004 3:54 PM To: Steve Bondy Cc: SpamAssassin Users Subject: Re: Bayes question On Mon, Dec 20, 2004 at 12:56:43PM -0600, Steve Bondy wrote: Just because you learn something as spam doesn't mean it will be blocked. SA will add a score to the message based on the bayes rules, but if the bayes rules are the only ones that get hit, and they score less than your threshold, it won't keep the stuff out. For example, the default score in 2.6.x for BAYES_90 is either 2.454 or 2.101. If that's the only rule you hit, and your threshold is above those numbers, it will come through. But what if you repeatedly learn the message(s) in question as spam? Shouldn't bayes start to give it higher scores? If it becomes a near perfect match, it should get a bayes_99, right? -chuck
Re: Bayes question
On Mon, Dec 20, 2004 at 04:13:44PM -0600, Steve Bondy wrote: I'm no expert on Bayes, but as far as I know, repeatedly learning the same message over and over again doesn't do you any good. Once the tokens are in there, that's it. The bayes score goes up as more tokens in the message match. It's not the same message... exactly. It is the same spam, coming from many different senders, each with a unique message ID. I keep getting more of them, and I keep learning them with sa-learn. I'm just not getting SA to notice them as spam. -chuck
Re: Bayes question
On Mon, Dec 20, 2004 at 04:18:58PM -0600, Chuck Campbell wrote: It's not the same message... exactly. It is the same spam, coming from many different senders, each with a unique message ID. I keep getting more of them, and I keep learning them with sa-learn. I'm just not getting SA to notice them as spam. What rules are hitting? Is BAYES_99 one of them? Michael
RE: Bayes question
On Mon, Dec 20, 2004 at 04:13:44PM -0600, Steve Bondy wrote: I'm no expert on Bayes, but as far as I know, repeatedly learning the same message over and over again doesn't do you any good. Once the tokens are in there, that's it. The bayes score goes up as more tokens in the message match. It's not the same message... exactly. It is the same spam, coming from many different senders, each with a unique message ID. I keep getting more of them, and I keep learning them with sa-learn. I'm just not getting SA to notice them as spam. -chuck So the message content is the same, but coming from different sources?
RE: Bayes question
So, what happens when you take these two overlapping databases and combine them is that certain tokens (those that have overlap) are then double counted. This makes the database, at least according to the bayes model SA is using, statistically invalid. Using this reasoning, the tokens that overlap can be identified as belonging to the same message, because the message hashes are the same. Therefore it should be possible to detect the tokens that would be double counted and to discard them. If you can do this, then surely the database remains statistically correct and can be safely merged?
Re: Bayes question
Michael, I understood the dangers behind the theory - I'll get into the analysis of all the bayes databases later on. I guess the only way to do it cleanly is to feed the same HAM+SPAM messages to all the bayes learning mechanisms... Thanks for your time, Ricardo
Re: Bayes question
According to the docs, --restore is destructive (in the sense it destroys the previous contents of the database). Would you guys be interested in such a feature? I plan to use a generic bayes DB (which is maintained by our tech team), and merge it with each client's own DB (which would result in a highly accurate, well-trained bayes mechanism). Anyone care to share your thoughts on this? TIA, Ricardo
Re: Bayes question
On Sat, Dec 04, 2004 at 10:46:22AM +, Ricardo Oliveira wrote: According to the docs, --restore is destructive (in the sense it destroys the previous contents of the database). Would you guys be interested in such a feature? I plan to use a generic bayes DB (which is maintained by our tech team), and merge it with each client's own DB (which would result in a highly accurate, well-trained bayes mechanism). Anyone care to share your thoughts on this? No, this is not a good idea; please don't make a tool like this generally available. Here is the reason: When you learn tokens from a message, those tokens are added to the database, or if they already exist their counts are increased, either as spam or ham depending on how you are learning. At the same time a notation is made that you learned that message by storing, in later versions, a pseudo message id (it's basically the SHA1 hash of several pieces of data that should be unique) so that bayes will not re-learn the tokens from that message. When you take two different bayes databases that have been learning separately for any length of time, you are bound to have overlap in the messages they learned. Everyone gets the same spam, and if the database is from someone you do business with, have a relationship with, or share the same interests with, you are bound to have ham overlap as well. So, what happens when you take these two overlapping databases and combine them is that certain tokens (those that have overlap) are then double counted. This makes the database, at least according to the bayes model SA is using, statistically invalid. Now, that being said, let's say you did an analysis and found that the two databases had no overlap, or at least very little (I have no idea what very little would mean in this case). You could probably convince yourself, and it's math and statistics so I'm horrible at it but I'd bet some folks on this list could provide a formula, that the amount of overlap is statistically insignificant.
If you could do that then you could combine the databases, in which case I leave it as an exercise to the reader. When calculating overlap it is VERY important to remember this: the pseudo message ids that are stored in the seen database changed in the middle of the 3.0 development cycle. So, if you used bayes in SA in a version 3.0 you will have mixed message ids in your database. In this case it may be difficult to determine how much overlap your databases have. If you do write such a tool, I ask that you not make it available. There are several issues that someone attempting this should study carefully, and a simple tool makes it too easy to ignore those issues, which could lead to a broken bayes database in the end. Michael
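The double-counting problem described above can be shown with a toy model. The data layout below (a token counter, a learned-message count and a set of seen ids) is an assumption made for this sketch and is not the on-disk format of the bayes database; the token names and message ids are invented.

```python
from collections import Counter

# Toy model only: this layout is NOT SpamAssassin's storage format.


def merge_naive(db_a, db_b):
    """Add token counts and message counts together, as a simple merge tool would."""
    return {
        "spam_tokens": db_a["spam_tokens"] + db_b["spam_tokens"],
        "nspam": db_a["nspam"] + db_b["nspam"],
        "seen": db_a["seen"] | db_b["seen"],
    }


def overlap_ratio(seen_a, seen_b):
    """Share of the smaller seen set that also appears in the other one."""
    smaller = min(len(seen_a), len(seen_b))
    return len(seen_a & seen_b) / smaller if smaller else 0.0


# Both sites learned spam message "m1"; site B also learned "m2".
db_a = {"spam_tokens": Counter({"offer": 1}), "nspam": 1, "seen": {"m1"}}
db_b = {"spam_tokens": Counter({"offer": 2, "pills": 1}), "nspam": 2, "seen": {"m1", "m2"}}

merged = merge_naive(db_a, db_b)
print(merged["spam_tokens"]["offer"], merged["nspam"], len(merged["seen"]))
print(overlap_ratio(db_a["seen"], db_b["seen"]))
```

After the merge the counts claim three learned spams and three occurrences of "offer", while only two distinct messages were ever learned. The aggregated token counts no longer say which message contributed what, so the overlap can be measured from the seen ids but cannot be subtracted back out without the original messages. The overlap measure also assumes both databases use the same pseudo message id scheme, which is the caveat raised above.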
Re: Bayes question
What about joining several databases together? I'd like to use a general bayes DB, and join it with some clients' particular DBs. TIA, Ricardo
Re: Bayes question
On Fri, 3 Dec 2004 19:37:05 +, Ricardo Oliveira [EMAIL PROTECTED] wrote: What about joining several databases together? I'd like to use a general bayes DB, and join it with some clients' particular DBs. TIA, Ricardo Never tried it, but it should be possible with sa-learn --backup and sa-learn --restore. Mike
Re: Bayes question
By the way - are the bayes databases on disk portable (in the sense I could import or copy them to another server and use them accordingly)? Thanks in advance
Re: Bayes question
On Thu, 2 Dec 2004 22:27:05 +, Ricardo Oliveira [EMAIL PROTECTED] wrote: By the way - are the bayes databases on disk portable (in the sense I could import or copy them to another server and use them accordingly)? Thanks in advance I haven't had a problem doing that, moving from one Sparc to another. Mike
Re: Bayes question
Austin Weidner wrote: Really trying to figure out bayes. Auto learn is set up, and my headers are showing autolearn=spam However, when I do sa-learn --dump magic, there are zero spams and zero hams. By using the -D (debug) option, I can see sa-learn is looking at: debug: bayes: 17216 tie-ing to DB file R/O /root/.spamassassin/bayes_toks debug: bayes: 17216 tie-ing to DB file R/O /root/.spamassassin/bayes_seen When I get a new spam, these files are NOT being updated. The files being updated are in: /var/spool/mqueue/.spamassassin How do I sort this out? Autolearn seems to be feeding the files in the mqueue directory, but sa-learn (and therefore I would think spamassassin itself) wants it in /root/.spamassassin This is a MailScanner/SA installation. I've tried to set the path in the spam.assassin.prefs.conf file to: bayes_path /root/.spamassassin/bayes bayes_file_mode 0660 But this didn't do anything. In fact, when I did this, autolearn=spam stopped showing up in headers. Any ideas? Did you create a softlink of local.cf in /etc/mail/spamassassin to your spam.assassin.prefs.conf? Whichever bayes path you set in local.cf, SpamAssassin will follow that path. -- Regards, Rakesh B. Pal Emergic CleanMail Team. Netcore Solutions Pvt. Ltd.
Re: Bayes question
At 01:58 AM 11/23/2004 -0500, Austin Weidner wrote: Really trying to figure out bayes. Auto learn is set up, and my headers are showing autolearn=spam However, when I do sa-learn --dump magic, there are zero spams and zero hams. By using the -D (debug) option, I can see sa-learn is looking at: debug: bayes: 17216 tie-ing to DB file R/O /root/.spamassassin/bayes_toks debug: bayes: 17216 tie-ing to DB file R/O /root/.spamassassin/bayes_seen When I get a new spam, these files are NOT being updated. The files being updated are in: /var/spool/mqueue/.spamassassin What's happening is that your mail is being processed as a non-root user, probably something like mail, sendmail, or some similar user that has mqueue as its homedir. You can look at the owner of the bayes files to see what user it is running as. Probably the best option is to use sa-learn parameters to tell it where the db is: sa-learn --dbpath /var/spool/mqueue/.spamassassin Also be sure to chown those files back to their original owner when you're done.
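Checking the owner of the bayes files, as suggested above, can be scripted. The following is a small sketch for a POSIX system; the helper names are made up, the `bayes_*` file name pattern is taken from the debug output quoted in the thread, and the demo runs against a temporary directory rather than a real mail spool.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical helpers for illustration; POSIX only (uses numeric user ids).


def bayes_file_owners(directory):
    """Map each bayes_* file in the directory to the uid that owns it."""
    return {p.name: p.stat().st_uid for p in sorted(Path(directory).glob("bayes_*"))}


def owned_by_other_user(directory, uid=None):
    """List bayes_* files not owned by the given uid (default: current user)."""
    uid = os.geteuid() if uid is None else uid
    return [name for name, owner in bayes_file_owners(directory).items() if owner != uid]


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("bayes_toks", "bayes_seen"):
        Path(demo_dir, name).touch()
    print(bayes_file_owners(demo_dir))
    print(owned_by_other_user(demo_dir))
```

Pointing `bayes_file_owners` at /var/spool/mqueue/.spamassassin would show which uid the mail scanner writes the database as. Any file reported by `owned_by_other_user` after a manual sa-learn run as root is a candidate for the chown mentioned above.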