Re: spamassassin learn spam
Hi Harald,

Yes, I run this as the root user. But which user do I have to run sa-learn --spam as if I use amavisd? By default you can't switch to the amavis user to run the learn command.

Cheers
Re: spamassassin learn spam
On 08.05.20 09:27, supp...@mmarzouki.de wrote:
> I have SpamAssassin on my CentOS 7 system. Sometimes I receive spam mails and I would like to learn these mails as spam with sa-learn --spam, but it doesn't seem to work, because the spam score is the same before and after.
>
> What I did: I have a spam mail in my inbox in Maildir format. When I check the spam score with spamassassin < $spam_mail, I get a score of 1.0. I learned this mail as spam with sa-learn --spam $spam_mail and the system confirmed this. If I check the mail again with spamassassin, I get the same score, 1.0. Is this normal? I think the score should be over the required spam score.

You need to train at least 200 spams and 200 hams before Bayes starts kicking in.

Your system may use a different Bayes database; e.g. systems using amavis use the Bayes database in the amavis user's directory.

One spam may sometimes not be enough to change the resulting score.

Post your X-Spam headers. I have added these lines to my user_prefs file:

add_header all Report _REPORT_
add_header all Languages _LANGUAGES_
add_header all tokens-spam _SPAMMYTOKENS(25,short)_
add_header all tokens-ham _HAMMYTOKENS(25,short)_
add_header all tokens-sum _TOKENSUMMARY_
add_header all countries _RELAYCOUNTRY_

My X-Spam-Report contains lines like these, which show how my Bayes works:

* -0.0 BAYES_20 BODY: Bayes spam probability is 5 to 20%
*      [score: 0.1545]

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
On the other hand, you have different fingers.
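A quick way to see whether the database has reached the 200 spam / 200 ham minimum mentioned above is to read the nspam and nham counters from sa-learn --dump magic. The following is only a minimal sketch: the function name bayes_ready is made up for illustration, and it assumes the dump format shown elsewhere in this thread (count in the third column, label in the last).

```shell
#!/bin/sh
# bayes_ready is a hypothetical helper name, not part of SpamAssassin.
# It reads `sa-learn --dump magic` output on stdin and reports whether both
# message counters have reached the minimum (default 200, as quoted above).
bayes_ready() {
    awk -v min="${1:-200}" '
        $NF == "nspam" { nspam = $3 }   # third column holds the counter
        $NF == "nham"  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d ", nspam, nham
            if (nspam + 0 >= min + 0 && nham + 0 >= min + 0)
                print "bayes-active"
            else
                print "bayes-inactive"
        }'
}

# Typical use, run as the account that owns the Bayes database
# (for example the amavis user on an amavisd setup):
#   sa-learn --dump magic | bayes_ready
```

If the counters do not move after sa-learn --spam, the training most likely went into a different user's database than the one the scanner reads.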
Re: Spamassassin Learn
From: Gene Heskett [EMAIL PROTECTED]
> On Tuesday 07 February 2006 15:27, Clay Davis wrote:
>> Does anyone have any good techniques for capturing a sample of ham that can be used as the ham corpus? I'm in a corporate environment and am not keen on the idea of intercepting non-spam messages. I will if I have to, but was hoping someone had a better idea.
>
> I wouldn't have too guilty a conscience on that subject, because generally you won't be reading very much other than intercepted spam. There may be an FP in there occasionally, but you'll soon learn to catch those and feed them to the ham learner, and hence move them to the correct mailbox folder. In other words, to make an omelette, you normally have to break a few eggs. What you accidentally read in an FP should be treated with the usual amount of salt and otherwise forgotten.

Intercept some ham, feed it through SpamAssassin's sa-learn, forget to store it on the way out. You don't have to know WHAT you trained with. You just have to know it's ham.

Now, if you are in a corporate environment and don't have a strong email policy, you'd best do that first. Then you can sample the email, with some discretion, legally and properly to get a test set of ham messages. It MAY even be good corporate policy to save, for at least a short time, all incoming and outgoing emails. 3 months to 6 months may be OK. This will be handy if an employee is caught engaging in illegal activities and must be terminated for cause, for example.

Just make sure that the company has a firm and clear email policy with regard to permissible uses, and notify the employees that the company reserves the right to read emails in and out. If you don't, your company could face some interesting times if the fit hits the shan.

{^_^}

>> Regards, Clay
>>
>> On 2/7/2006 at 3:16 pm, in message [EMAIL PROTECTED], Matt Kettler [EMAIL PROTECTED] wrote:
>>> [EMAIL PROTECTED] wrote:
>>>> Can you just feed spamassassin spam or do you need to give it ham also? I read the docs and it didn't say you had to feed it ham. I then read another doc and it said you should feed it equal amounts of spam and ham.
>>>
>>> Yes, you really should feed it both. You also should strive for a 1:1 ratio of spam and nonspam, but don't kill yourself to get there. SA's use of chi-squared combining makes it very tolerant of wild imbalances in training. However, the closer you are to a 1:1 ratio the better SA will be able to distinguish tokens that are present in both kinds of mail and ignore them. So this is a worthwhile goal to strive for as long as it doesn't become a burden.
>>>
>>> My current training ratio is about 7:1 spam:nonspam, but in the past it's been as bad as 20:1. Both of those are very far off from equal amounts, but the imbalance has never caused me any problems.
>>>
>>> From my sa-learn --dump magic output as of today:
>>> 0.000 0 995764 0 non-token data: nspam
>>> 0.000 0 145377 0 non-token data: nham
>>>
>>> That works out to a ratio of 6.85:1
>
> -- 
> Cheers, Gene
> People having trouble with vz bouncing email to me should add the word 'online' between the 'verizon', and the dot which bypasses vz's stupid bounce rules. I do use spamassassin too. :-)
> Yahoo.com and AOL/TW attorneys please note, additions to the above message by Gene Heskett are: Copyright 2006 by Maurice Eugene Heskett, all rights reserved.
Re: Spamassassin Learn
jdow wrote:
> From: Matt Kettler [EMAIL PROTECTED]
>> (Note I use mailscanner, hence the odd log syntax)
>>
>> grep "is spam," /var/log/maillog | wc -l
>> 3434
>> grep "is spam," /var/log/maillog | grep "autolearn=spam" | wc -l
>> 2766
>> grep "is spam," /var/log/maillog | grep "autolearn=not spam" | wc -l
>> 0
>
> snip
>
> I wonder if he has greylisting turned on.

I do, I don't know about Jim.

(Note: my greylisting configuration isn't entirely conventional. I only greylist certain hosts using regex rules in milter-greylist's ACLs. I greylist APNIC, LACNIC, dynamic-looking hostnames, and hosts with no RDNS.)
Re: Spamassassin Learn
On Tue, Feb 07, 2006 at 10:05:05PM -0500, Matt Kettler wrote:
> For reference, these are the only rules in a stock SA 3.1.0 that can give you a negative learning score:
>
> score HABEAS_ACCREDITED_COI 0 -8.0 0 -8.0
> score RCVD_IN_BSP_TRUSTED 0 -4.3 0 -4.3
> score HABEAS_ACCREDITED_SOI 0 -4.3 0 -4.3
> score ALL_TRUSTED -1.360 -1.440 -1.665 -1.800
> score RCVD_IN_IADB_VOUCHED 0 -1.825 0 -2.200
> score HABEAS_CHECKED 0 -0.2 0 -0.2
> score RCVD_IN_BSP_OTHER 0 -0.1 0 -0.1
> score NO_RELAYS -0.001
> score NO_RECEIVED -0.001
> score DK_VERIFIED -0.001
> score SPF_PASS -0.001
> score SPF_HELO_PASS -0.001
> score HASHCASH_20 -0.500
> score HASHCASH_21 -0.700
> score HASHCASH_22 -1.000
> score HASHCASH_23 -2.000
> score HASHCASH_24 -3.000
> score HASHCASH_25 -4.000
> score HASHCASH_HIGH -5.000

The hashcash scores don't seem to be triggering learning for me, for some reason...

X-Spam-Status: No, score=-5.0 required=5.0 tests=AWL,BAYES_00,
    FORGED_RCVD_HELO,HASHCASH_HIGH autolearn=no version=3.1.0

grep threshold .spamassassin/user_prefs | grep -v '#'
bayes_auto_learn_threshold_spam 5.0
bayes_auto_learn_threshold_nonspam 0.0

Or is that because of the AWL rule?

-- 
Jim C. Nasby, Database Architect [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828
Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?
Re: Spamassassin Learn
Jim C. Nasby wrote:
> The hashcash scores don't seem to be triggering learning for me, for some reason...

They generally won't. Three things must happen for hashcash to fire:

1) you need a loadplugin Mail::SpamAssassin::Plugin::Hashcash command in your init.pre
2) you need a hashcash_accept command with the recipient address in your config files.
3) the sender needs to generate a hashcash hash when sending the message.

It sounds like you've done 1, but you probably haven't done 2.

> X-Spam-Status: No, score=-5.0 required=5.0 tests=AWL,BAYES_00,
>     FORGED_RCVD_HELO,HASHCASH_HIGH autolearn=no version=3.1.0
>
> grep threshold .spamassassin/user_prefs | grep -v '#'
> bayes_auto_learn_threshold_spam 5.0
> bayes_auto_learn_threshold_nonspam 0.0
>
> Or is that because of the AWL rule?

Hmm, well, you have to ignore the AWL and BAYES scores when figuring out the autolearner. So you have FORGED_RCVD_HELO and HASHCASH_HIGH. That would leave a score of -4.865, which confused me for a bit...

However, looking in the config files, HASHCASH rules have the userconf flag. This means that the autolearner will ignore these rules too, as SA will treat them as a user-configured whitelist.

So, this message had an autolearner score of +0.135 from the FORGED_RCVD_HELO.
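The arithmetic above can be reproduced with a few lines of awk. This is only a sketch of the reasoning, not SpamAssassin's actual implementation: the function name autolearn_score is invented, the input format (rule, score, tflags per line) is an assumption, and the BAYES_00 and AWL scores and their tflags in the usage comment are illustrative placeholders. Only the +0.135, the -5.000 and the userconf flag on HASHCASH come from the messages above.

```shell
#!/bin/sh
# autolearn_score is a hypothetical helper. stdin: one "RULE SCORE [tflags...]"
# line per rule hit. Rules flagged learn, noautolearn or userconf are left out
# of the sum, mirroring the explanation above.
autolearn_score() {
    awk '
        {
            skip = 0
            for (i = 3; i <= NF; i++)
                if ($i == "learn" || $i == "noautolearn" || $i == "userconf")
                    skip = 1
            if (!skip) sum += $2
        }
        END { printf "%.3f\n", sum }'
}

# Example input for the message discussed above (BAYES_00 and AWL values
# are placeholders, not taken from the thread):
#   FORGED_RCVD_HELO 0.135
#   HASHCASH_HIGH -5.000 nice userconf
#   BAYES_00 -2.599 nice learn
#   AWL 2.464 userconf noautolearn
```

With HASHCASH_HIGH excluded, only FORGED_RCVD_HELO contributes, which is why the message was neither learned as ham (threshold 0.0) nor as spam.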
Re: Spamassassin Learn
On Wed, Feb 08, 2006 at 11:29:36AM -0500, Matt Kettler wrote:
> However, looking in the config files, HASHCASH rules have the userconf flag. This means that the Autolearner will also ignore these rules too, as SA will treat it as a user configured whitelist. So, this message had an autolearner score of +0.135 from the FORGED_RCVD_HELO.

Ahh, so hashcash scores don't actually count towards learning. Should maybe be changed...?

BTW, I was reading http://article.gmane.org/gmane.mail.spam.hashcash/803 last night, and I'm wondering if there's been any progress on a way to enable hashcash without requiring users to supply emails they receive stamps for?

-- 
Jim C. Nasby, Database Architect [EMAIL PROTECTED]
Re: Spamassassin Learn
Jim C. Nasby wrote:
> On Wed, Feb 08, 2006 at 11:29:36AM -0500, Matt Kettler wrote:
>> However, looking in the config files, HASHCASH rules have the userconf flag. This means that the Autolearner will also ignore these rules too, as SA will treat it as a user configured whitelist. So, this message had an autolearner score of +0.135 from the FORGED_RCVD_HELO.
>
> Ahh, so hashcash scores don't actually count towards learning. Should maybe be changed...?

I'm not entirely sure. Part of me thinks it's a good idea to not count it, since it does effectively behave a bit like a user-configured whitelist. I mean, if you start accepting hashcash for learning, then you probably should also accept whitelist_from_spf.

Realistically, hashcash doesn't provide any proof the sender isn't a spammer. It merely provides proof they are willing to burn some CPU time to send you an email.

In the era of spammers using enormous botnets, a little CPU time really costs a spammer very little. They're much more limited by network bandwidth than available CPU power when they control 10,000+ infected PCs, each with a cable/dsl uplink speed of 128k-1mbit to send spam with.

> BTW, I was reading http://article.gmane.org/gmane.mail.spam.hashcash/803 last night, and I'm wondering if there's been any progress on a way to enable hashcash without requiring users to supply emails they receive stamps for?

The hashcash_accept command accepts file-glob style wildcards, so this should work:

hashcash_accept *

or safer:

hashcash_accept [EMAIL PROTECTED]

The problem with wildcards is that a spammer doesn't need to compute a hash on a per-recipient basis. They merely need to do it on a per-message basis, which makes it much less expensive for a spammer to use.
Re: Spamassassin Learn (hashcash)
Matt Kettler wrote:
> Jim C. Nasby wrote:
>> On Wed, Feb 08, 2006 at 11:29:36AM -0500, Matt Kettler wrote:
>>> However, looking in the config files, HASHCASH rules have the userconf flag. This means that the Autolearner will also ignore these rules too, as SA will treat it as a user configured whitelist. So, this message had an autolearner score of +0.135 from the FORGED_RCVD_HELO.
>>
>> Ahh, so hashcash scores don't actually count towards learning. Should maybe be changed...?
>
> I'm not entirely sure.. Part of me thinks it's a good idea to not count it, since it does effectively behave a bit like a user-configured whitelist.

Also, let's face it. Hashcash is almost completely unused, so this is a lot of worry over something very rare.

Since 1/1/2006 I have received mail with hashcash signatures from exactly 5 persons. Only 2 of those persons sent mail directly to me and had hashcash signatures for my address.

Summary of persons who have used hashcash posting to lists (names and the public list they were on only; I don't want to re-post people's email addresses on lists they don't subscribe to):

Alex B. (uribl-discuss)
John D. (uribl-discuss)
Andrew D. (sa-talk)
Jim N. (sa-talk)
rogelio a. (dansguardian)

Direct to me:

Andrew D.
Jim N.

Both of the above were sending emails regarding sa-talk postings.

Since the only people who sent me emails with hashcash for my address were discussing spamassassin, it would have been counterproductive for me to use hashcash in autolearning. Since SA discussions often contain spam quotes, it's best not to intentionally take steps that will learn such messages as ham. The benefit you get from learning it will ultimately be counter-balanced by the mis-learning of the occasional spam quote.
Re: Spamassassin Learn
On Wed, Feb 08, 2006 at 11:49:09AM -0500, Matt Kettler wrote:
> Jim C. Nasby wrote:
>> On Wed, Feb 08, 2006 at 11:29:36AM -0500, Matt Kettler wrote:
>>> However, looking in the config files, HASHCASH rules have the userconf flag. This means that the Autolearner will also ignore these rules too, as SA will treat it as a user configured whitelist. So, this message had an autolearner score of +0.135 from the FORGED_RCVD_HELO.
>>
>> Ahh, so hashcash scores don't actually count towards learning. Should maybe be changed...?
>
> I'm not entirely sure.. Part of me thinks it's a good idea to not count it, since it does effectively behave a bit like a user-configured whitelist. I mean, if you start accepting hashcash for learning, then you probably should also accept whitelist_from_spf. Realistically, hashcash doesn't provide any proof the sender isn't a spammer. It merely provides proof they are willing to burn some CPU time to send you an email.

Sure, but I think it warrants a small negative learn score. I'd expect that real spam would have plenty enough positive score to ensure that it didn't get learned. Of course I guess part of this is that the default learn ham score of 0.1 is probably too high...

> In the era of spammers using enormous botnets a little CPU time really costs a spammer very little. They're much more limited by network bandwidth than available CPU power when they control 10,000+ infected PCs each with a cable/dsl uplink speed of 128k-1mbit to send spam with.

True, but if they start burning that kind of CPU generating postage, the owner of the machine is more likely to notice something's wrong...

>> BTW, I was reading http://article.gmane.org/gmane.mail.spam.hashcash/803 last night, and I'm wondering if there's been any progress on a way to enable hashcash without requiring users to supply emails they receive stamps for?
>
> The hashcash_accept command accepts file-glob style wildcards, so this should work:
>
> hashcash_accept *
>
> or safer:
>
> hashcash_accept [EMAIL PROTECTED]
>
> The problem with wildcards is that a spammer doesn't need to compute a hash on a per-recipient basis. They merely need to do it on a per-message basis, which makes it much less expensive for a spammer to use.

Yeah, I was specifically wondering about getting it into the default config. It seems like it would be a very useful tool if more people used it, and having it work by default in SA would undoubtedly go a long way towards getting people to use it.

BTW, there were 3 proposals in that thread to combat generating one stamp per email.

-- 
Jim C. Nasby, Database Architect [EMAIL PROTECTED]
Re: Spamassassin Learn
Jim C. Nasby wrote:
> On Wed, Feb 08, 2006 at 11:49:09AM -0500, Matt Kettler wrote:
>> Jim C. Nasby wrote:
>>> On Wed, Feb 08, 2006 at 11:29:36AM -0500, Matt Kettler wrote:
>>>> However, looking in the config files, HASHCASH rules have the userconf flag. This means that the Autolearner will also ignore these rules too, as SA will treat it as a user configured whitelist. So, this message had an autolearner score of +0.135 from the FORGED_RCVD_HELO.
>>>
>>> Ahh, so hashcash scores don't actually count towards learning. Should maybe be changed...?
>>
>> I'm not entirely sure.. Part of me thinks it's a good idea to not count it, since it does effectively behave a bit like a user-configured whitelist. I mean, if you start accepting hashcash for learning, then you probably should also accept whitelist_from_spf. Realistically, hashcash doesn't provide any proof the sender isn't a spammer. It merely provides proof they are willing to burn some CPU time to send you an email.
>
> Sure, but I think it warrants a small negative learn score.

Does it? A negative learning score is a VERY powerful thing. VERY powerful. Someone who can forge a negative learning score can poison your bayes database rather quickly.

Currently SA only accepts negative learning scores for things which actually attest to the fact that this specific sender is not a spammer. SA doesn't even trust the user's own whitelists for this purpose, because too many users do whitelist_from *

>> In the era of spammers using enormous botnets a little CPU time really costs a spammer very little. They're much more limited by network bandwidth than available CPU power when they control 10,000+ infected PCs each with a cable/dsl uplink speed of 128k-1mbit to send spam with.
>
> True, but if they start burning that kind of CPU generating postage the owner of the machine is more likely to notice something's wrong...

Surely you're joking. The average user would only notice if their computer became sluggish and unresponsive. If you do the hashes in a low-priority thread, the user interface responsiveness will never be affected.

Take the distributed.net client as an example. It burns tons of CPU, and the average user wouldn't realize it was there.

Sure, the user could detect it with a processor usage monitor. However, if they were clueful enough to detect CPU load by using the task manager, they'd be clueful enough to avoid infection in the first place, or at least realize they'd infected themselves and clean it up asap.

Remember, the botnets are largely built from users who are infected by email viruses. Thus for the most part we are dealing with users that will open a .pif file attached to an email with a body saying nothing but "Please read the document." and a subject "Re: document" (a netsky/somefool variant).
Re: Spamassassin Learn
Jim C. Nasby writes:
> On Wed, Feb 08, 2006 at 11:29:36AM -0500, Matt Kettler wrote:
>> However, looking in the config files, HASHCASH rules have the userconf flag. This means that the Autolearner will also ignore these rules too, as SA will treat it as a user configured whitelist. So, this message had an autolearner score of +0.135 from the FORGED_RCVD_HELO.
>
> Ahh, so hashcash scores don't actually count towards learning. Should maybe be changed...?

Nah. The idea is that rules where users can conceivably configure SpamAssassin to induce FNs or FPs should be ignored for purposes of auto-learning; we've seen *many* cases where an accidental whitelisting of spam (for example) polluted the Bayes db. This was put in place to avoid that problem.

> BTW, I was reading http://article.gmane.org/gmane.mail.spam.hashcash/803 last night, and I'm wondering if there's been any progress on a way to enable hashcash without requiring users to supply emails they receive stamps for?

Not yet; none of us are keen to get into that argument^Wdiscussion ;)

--j.
RE: Spamassassin Learn
[EMAIL PROTECTED] wrote:
> Can you just feed spamassassin spam or do you need to give it ham also? I read the docs and it didn't say you had to feed it ham. I then read another doc and it said you should feed it equal amounts of spam and ham.

You need to feed it both. I wouldn't worry too much about the ratios, but the Bayes scoring won't take effect until you have learned at least 200 ham and 200 spam.

-- 
Bowie
Re: Spamassassin Learn
It needs 200 of each to even start working on sa-learned email. After that I feed it representative amounts of ham and spam, in the ratio it comes in.

[EMAIL PROTECTED] wrote:
> Can you just feed spamassassin spam or do you need to give it ham also? I read the docs and it didn't say you had to feed it ham. I then read another doc and it said you should feed it equal amounts of spam and ham.
Re: Spamassassin Learn
[EMAIL PROTECTED] wrote:
> Can you just feed spamassassin spam or do you need to give it ham also? I read the docs and it didn't say you had to feed it ham. I then read another doc and it said you should feed it equal amounts of spam and ham.

Yes, you really should feed it both. You also should strive for a 1:1 ratio of spam and nonspam, but don't kill yourself to get there. SA's use of chi-squared combining makes it very tolerant of wild imbalances in training. However, the closer you are to a 1:1 ratio the better SA will be able to distinguish tokens that are present in both kinds of mail and ignore them. So this is a worthwhile goal to strive for as long as it doesn't become a burden.

My current training ratio is about 7:1 spam:nonspam, but in the past it's been as bad as 20:1. Both of those are very far off from equal amounts, but the imbalance has never caused me any problems.

From my sa-learn --dump magic output as of today:

0.000 0 995764 0 non-token data: nspam
0.000 0 145377 0 non-token data: nham

That works out to a ratio of 6.85:1
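The ratio can be computed straight from the dump rather than by hand. A small sketch, assuming the sa-learn --dump magic layout shown above; spam_ham_ratio is just an illustrative name, not an existing tool.

```shell
#!/bin/sh
# spam_ham_ratio is a hypothetical helper. It reads `sa-learn --dump magic`
# output on stdin and prints the spam:ham training ratio as "N.NN:1".
spam_ham_ratio() {
    awk '
        $NF == "nspam" { nspam = $3 }
        $NF == "nham"  { nham  = $3 }
        END {
            if (nham + 0 == 0) { print "no ham learned yet"; exit }
            printf "%.2f:1\n", nspam / nham
        }'
}

# Typical use:
#   sa-learn --dump magic | spam_ham_ratio
```

Fed the two counters quoted above (995764 and 145377), this prints 6.85:1.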
Re: Spamassassin Learn
Does anyone have any good techniques for capturing a sample of ham that can be used as the ham corpus? I'm in a corporate environment and am not keen on the idea of intercepting non-spam messages. I will if I have to, but was hoping someone had a better idea.

Regards, Clay

On 2/7/2006 at 3:16 pm, in message [EMAIL PROTECTED], Matt Kettler [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
>> Can you just feed spamassassin spam or do you need to give it ham also? I read the docs and it didn't say you had to feed it ham. I then read another doc and it said you should feed it equal amounts of spam and ham.
>
> Yes, you really should feed it both. You also should strive for a 1:1 ratio of spam and nonspam, but don't kill yourself to get there. SA's use of chi-squared combining makes it very tolerant of wild imbalances in training. However, the closer you are to a 1:1 ratio the better SA will be able to distinguish tokens that are present in both kinds of mail and ignore them. So this is a worthwhile goal to strive for as long as it doesn't become a burden.
>
> My current training ratio is about 7:1 spam:nonspam, but in the past it's been as bad as 20:1. Both of those are very far off from equal amounts, but the imbalance has never caused me any problems.
>
> From my sa-learn --dump magic output as of today:
> 0.000 0 995764 0 non-token data: nspam
> 0.000 0 145377 0 non-token data: nham
>
> That works out to a ratio of 6.85:1
Re: Spamassassin Learn
This is what automatic training attempts to solve. If you are reliably nailing spam with your current setup you can experiment with the automatic learning. But I'd widen the score ranges a little, as far as is practical for your mail mix.

{^_^}

----- Original Message -----
From: Clay Davis [EMAIL PROTECTED]
> Does anyone have any good techniques for capturing a sample of ham that can be used as the ham corpus? I'm in a corporate environment and am not keen on the idea of intercepting non-spam messages. I will if I have to, but was hoping someone had a better idea.
>
> Regards, Clay
>
> On 2/7/2006 at 3:16 pm, in message [EMAIL PROTECTED], Matt Kettler [EMAIL PROTECTED] wrote:
>> [EMAIL PROTECTED] wrote:
>>> Can you just feed spamassassin spam or do you need to give it ham also? I read the docs and it didn't say you had to feed it ham. I then read another doc and it said you should feed it equal amounts of spam and ham.
>>
>> Yes, you really should feed it both. You also should strive for a 1:1 ratio of spam and nonspam, but don't kill yourself to get there. SA's use of chi-squared combining makes it very tolerant of wild imbalances in training. However, the closer you are to a 1:1 ratio the better SA will be able to distinguish tokens that are present in both kinds of mail and ignore them. So this is a worthwhile goal to strive for as long as it doesn't become a burden.
>>
>> My current training ratio is about 7:1 spam:nonspam, but in the past it's been as bad as 20:1. Both of those are very far off from equal amounts, but the imbalance has never caused me any problems.
>>
>> From my sa-learn --dump magic output as of today:
>> 0.000 0 995764 0 non-token data: nspam
>> 0.000 0 145377 0 non-token data: nham
>>
>> That works out to a ratio of 6.85:1
Re: Spamassassin Learn
On Tue, Feb 07, 2006 at 03:16:57PM -0500, Matt Kettler wrote:
> My current training ratio is about 7:1 spam:nonspam, but in the past it's been as bad as 20:1. Both of those are very far off from equal amounts, but the imbalance has never caused me any problems.
>
> From my sa-learn --dump magic output as of today:
> 0.000 0 995764 0 non-token data: nspam
> 0.000 0 145377 0 non-token data: nham

Interesting... it appears I actually need to do a better job of training spam!

sa-learn --dump magic | grep am
0.000 0 98757 0 non-token data: nspam
0.000 0 255134 0 non-token data: nham

I just changed bayes_auto_learn_threshold_spam to 5.0, we'll see what that does...

-- 
Jim C. Nasby, Database Architect [EMAIL PROTECTED]
Re: Spamassassin Learn
Jim C. Nasby wrote:
> On Tue, Feb 07, 2006 at 03:16:57PM -0500, Matt Kettler wrote:
>> My current training ratio is about 7:1 spam:nonspam, but in the past it's been as bad as 20:1. Both of those are very far off from equal amounts, but the imbalance has never caused me any problems.
>>
>> From my sa-learn --dump magic output as of today:
>> 0.000 0 995764 0 non-token data: nspam
>> 0.000 0 145377 0 non-token data: nham
>
> Interesting... it appears I actually need to do a better job of training spam!
>
> sa-learn --dump magic | grep am
> 0.000 0 98757 0 non-token data: nspam
> 0.000 0 255134 0 non-token data: nham
>
> I just changed bayes_auto_learn_threshold_spam to 5.0, we'll see what that does...

Actually, you can't ever set the threshold below 6.0. SA has a hard-coded requirement of at least 3.0 header points and 3.0 body points before it will autolearn as spam. Therefore, any setting below 6 is moot, because the two 3.0 requirements can't both be met without a score of at least 6.

I would also check to make sure you don't have a lot of spam coming in that's getting autolearned as ham. (Note: the learner's idea of score is very different from the final message score, so a message CAN be tagged as spam and still get autolearned as ham.)
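The interaction between the configured threshold and the two 3.0-point minimums can be written out as a small decision function. This is a sketch of the rule as described above, not SpamAssassin's code; would_autolearn_spam and its argument order are made up for illustration, and the exact comparison operators are an assumption.

```shell
#!/bin/sh
# would_autolearn_spam is a hypothetical helper.
#   $1 = autolearn score        $2 = header points
#   $3 = body points            $4 = bayes_auto_learn_threshold_spam
# Prints "yes" only if the score reaches the threshold AND both the header
# and body points reach the hard-coded 3.0 minimum described above.
would_autolearn_spam() {
    awk -v s="$1" -v h="$2" -v b="$3" -v t="$4" 'BEGIN {
        if (s + 0 >= t + 0 && h + 0 >= 3 && b + 0 >= 3)
            print "yes"
        else
            print "no"
    }'
}

# A threshold of 5.0 does not help a 5.5-point message, because header and
# body points cannot both be at least 3.0 when they add up to 5.5:
#   would_autolearn_spam 5.5 3.0 2.5 5.0
```

In other words, lowering bayes_auto_learn_threshold_spam from 6.0 to 5.0 changes nothing in practice; the 3.0 + 3.0 floor is what gates learning.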
Re: Spamassassin Learn
From: Jim C. Nasby [EMAIL PROTECTED]
> On Tue, Feb 07, 2006 at 03:16:57PM -0500, Matt Kettler wrote:
>> My current training ratio is about 7:1 spam:nonspam, but in the past it's been as bad as 20:1. Both of those are very far off from equal amounts, but the imbalance has never caused me any problems.
>>
>> From my sa-learn --dump magic output as of today:
>> 0.000 0 995764 0 non-token data: nspam
>> 0.000 0 145377 0 non-token data: nham
>
> Interesting... it appears I actually need to do a better job of training spam!
>
> sa-learn --dump magic | grep am
> 0.000 0 98757 0 non-token data: nspam
> 0.000 0 255134 0 non-token data: nham
>
> I just changed bayes_auto_learn_threshold_spam to 5.0, we'll see what that does...

If you have the option, manually train the spam for a while. If the threshold is set too low for autolearning spam, you will find yourself with a mangled database that has a high percentage of actual ham learned as spam. That is not a good thing.

You might actually lower the ham threshold, as well. It looks like you might be at risk of learning spam as ham. (And in fact may have done this already to a high degree.)

{^_^}
Re: Spamassassin Learn
On Tue, Feb 07, 2006 at 04:40:40PM -0500, Matt Kettler wrote:
> I would also check to make sure you don't have a lot of spam coming in that's getting autolearned as ham. (note: the learner's idea of score is very different than the final message score, so a message CAN be tagged as spam, and still get autolearned as ham)

What would be the easiest way to do that? Grep through my caughtspam maildir?

-- 
Jim C. Nasby, Database Architect [EMAIL PROTECTED]
Re: Spamassassin Learn
Jim C. Nasby wrote:
> On Tue, Feb 07, 2006 at 04:40:40PM -0500, Matt Kettler wrote:
>> I would also check to make sure you don't have a lot of spam coming in that's getting autolearned as ham. (note: the learner's idea of score is very different than the final message score, so a message CAN be tagged as spam, and still get autolearned as ham)
>
> What would be the easiest way to do that? Grep through my caughtspam maildir?

That would be the way I'd check: grep for autolearn=ham
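A possible way to run that check over a whole folder is sketched below. The function name find_spam_learned_as_ham is invented here, and the sketch assumes each message is a separate file (as in a Maildir) carrying an X-Spam-Status header added by SpamAssassin; the autolearn= part may sit on a folded continuation line, so the two conditions are tested separately.

```shell
#!/bin/sh
# find_spam_learned_as_ham is a hypothetical helper.
#   $1 = directory holding one message per file (e.g. a Maildir folder)
# Prints every file that is tagged "X-Spam-Status: Yes" and also contains
# "autolearn=ham", i.e. spam that was nevertheless autolearned as ham.
find_spam_learned_as_ham() {
    grep -rl 'autolearn=ham' "$1" 2>/dev/null | while IFS= read -r f; do
        if grep -q '^X-Spam-Status: Yes' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Typical use:
#   find_spam_learned_as_ham caughtspam/
```

An empty result means no tagged spam in that folder was autolearned as ham.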
Re: Spamassassin Learn
On Tue, Feb 07, 2006 at 05:02:25PM -0500, Matt Kettler wrote:
> Jim C. Nasby wrote:
>> On Tue, Feb 07, 2006 at 04:40:40PM -0500, Matt Kettler wrote:
>>> I would also check to make sure you don't have a lot of spam coming in that's getting autolearned as ham. (note: the learner's idea of score is very different than the final message score, so a message CAN be tagged as spam, and still get autolearned as ham)
>>
>> What would be the easiest way to do that? Grep through my caughtspam maildir?
>
> That would be the way I'd check.. grep for autolearn=ham

Nothing autolearned. Interesting... I know I've fed my sent mail as ham, but I'm pretty sure I only did that once or twice... Guess I'll see how the numbers change with the low autolearn threshold...

-- 
Jim C. Nasby, Database Architect [EMAIL PROTECTED]
Re: Spamassassin Learn
> Does anyone have any good techniques for capturing a sample of ham that can be used as the ham corpus? I'm in a corporate environment and am not keen on the idea of intercepting non-spam messages. I will if I have to, but was hoping someone had a better idea.

Depending on your MTA/MDA, you might be able to do it on the fly so that an actual copy of the message isn't necessary. For instance, if the messages pass through procmail, learn them just before delivery if the X-Spam-Status header isn't set to yes.

Oh, and make sure you pass the --no-sync flag to sa-learn, then schedule the syncing for sometime during off-peak hours.
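As a rough illustration of that idea, here is a shell sketch rather than an actual procmail recipe. The helper name learn_ham_if_clean and the SA_LEARN override variable are inventions for this example; --ham, --no-sync and --sync are the sa-learn options referred to above. How the function gets called from your MDA is left open.

```shell
#!/bin/sh
# learn_ham_if_clean is a hypothetical helper.
#   $1 = path to a single message file
# SA_LEARN is an invented override: set it to "echo" to preview the command
# instead of running sa-learn.
learn_ham_if_clean() {
    # Look only at the header section (everything up to the first blank line).
    if sed '/^$/q' "$1" | grep -qi '^X-Spam-Status: Yes'; then
        return 0    # tagged as spam: do not learn it as ham
    fi
    ${SA_LEARN:-sa-learn} --ham --no-sync "$1"
}

# Later, during off-peak hours (for example from cron), flush the journal:
#   sa-learn --sync
```

Note that this trusts the existing tagging completely, so any false negative that slips through untagged would be learned as ham; that is the same trade-off the autolearner makes.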
Re: Spamassassin Learn
Jim C. Nasby wrote:
> On Tue, Feb 07, 2006 at 05:02:25PM -0500, Matt Kettler wrote:
>> Jim C. Nasby wrote:
>>> On Tue, Feb 07, 2006 at 04:40:40PM -0500, Matt Kettler wrote:
>>>> I would also check to make sure you don't have a lot of spam coming in that's getting autolearned as ham. (note: the learner's idea of score is very different than the final message score, so a message CAN be tagged as spam, and still get autolearned as ham)
>>>
>>> What would be the easiest way to do that? Grep through my caughtspam maildir?
>>
>> That would be the way I'd check.. grep for autolearn=ham
>
> Nothing autolearned.

Nothing autolearned at all, or nothing autolearned as ham?

Are there any autolearn strings? Are they all autolearn=no? Are there any decent number that are autolearn=failed or autolearn=disabled?
Re: Spamassassin Learn
On Tue, Feb 07, 2006 at 01:45:48PM -0800, jdow wrote:
From: Jim C. Nasby [EMAIL PROTECTED]
On Tue, Feb 07, 2006 at 03:16:57PM -0500, Matt Kettler wrote:
My current training ratio is about 7:1 spam:nonspam, but in the past it's been as bad as 20:1. Both of those are very far off from equal amounts, but the imbalance has never caused me any problems. From my sa-learn --dump magic output as of today:
    0.000 0 995764 0 non-token data: nspam
    0.000 0 145377 0 non-token data: nham
Interesting... it appears I actually need to do a better job of training spam!
    sa-learn --dump magic | grep am
    0.000 0 98757 0 non-token data: nspam
    0.000 0 255134 0 non-token data: nham
I just changed bayes_auto_learn_threshold_spam to 5.0, we'll see what that does...
If you have the option, manually train the spam for a while. If the threshold is set too low for autolearning spam you will find yourself with a mangled database that has a high percentage of actual ham learned as spam. That is not a good thing. You might actually lower the ham threshold, as well. It looks like you might be at risk of learning spam as ham. (And in fact may have done this already to a high degree.)
See my other reply, which showed stats for all spam over 5 this month. The stats for last month are:
    grep -r autolearn oldspam/ | grep -v 'Binary file' | sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
    5862 no
    1225 spam
      24 unavailable
So based on this, I'd think it's not learning spam as ham... BTW, autolearn ham should be at its default setting...
What's interesting is that I get about 10-20 spams a day that are scored below 3, and another 30-50 a day that are between 3 and 5 (which go to my 'probablespam' folder). I send all of these to sa via spamassassin -r, so I would have thought that I'd have far more spam in the database than ham...
--
Jim C. Nasby, Database Architect [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828
Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?
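For anyone who wants to try the tally pipeline from this thread without pointing it at a real mailbox, here is a minimal sketch. The sample headers, the temporary directory and the function name tally_autolearn are made up for illustration; only the grep/sed/sort/uniq chain comes from the messages above, with a final awk step added to normalise the output.

```shell
# Count autolearn=... values found under a directory.
tally_autolearn() {
    # $1: directory to scan recursively (e.g. a maildir folder)
    grep -rh 'autolearn=' "$1" \
        | grep -v 'Binary file' \
        | sed -e 's/.*autolearn=\([^ ]*\).*/\1/' \
        | sort | uniq -c \
        | awk '{ print $2 "=" $1 }'
}

# Hypothetical sample messages, one X-Spam-Status header each.
demo_dir=$(mktemp -d)
printf '%s\n' 'X-Spam-Status: Yes, score=14.2 required=5.0 tests=URIBL_SBL autolearn=spam version=3.1.0' > "$demo_dir/msg1"
printf '%s\n' 'X-Spam-Status: Yes, score=6.1 required=5.0 tests=HTML_MESSAGE autolearn=no version=3.1.0' > "$demo_dir/msg2"
printf '%s\n' 'X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00 autolearn=ham version=3.1.0' > "$demo_dir/msg3"
printf '%s\n' 'X-Spam-Status: Yes, score=21.0 required=5.0 tests=URIBL_SBL autolearn=spam version=3.1.0' > "$demo_dir/msg4"

result=$(tally_autolearn "$demo_dir")
echo "$result"
rm -rf "$demo_dir"
```

On a real system the function would simply be called with the folder of interest, for example tally_autolearn caughtspam/.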
Re: Spamassassin Learn
Jim C. Nasby wrote:
Are there any autolearn strings? Are they all autolearn=no? Are there a decent number that are autolearn=failed or autolearn=disabled?
    grep -r autolearn caughtspam/ | grep -v 'Binary file' | sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
    1545 no
     140 spam
       4 unavailable
Fair enough, that at least suggests that the autolearner is working. However, that learning ratio is pretty low. Are you using network tests? Without DNSBLs it's often hard to get enough header points to cause spam learning.
(Note I use MailScanner, hence the odd log syntax.)
    grep "is spam," /var/log/maillog | wc -l
    3434
    grep "is spam," /var/log/maillog | grep autolearn=spam | wc -l
    2766
    grep "is spam," /var/log/maillog | grep "autolearn=not spam" | wc -l
    0
So I'm autolearning about 80% of my tagged spam as spam, and none as ham. I'm also autolearning about 38% of my nonspam as ham.
I'm using the default bayes_auto_learn_threshold_spam (12.0). I'm also using a modified bayes_auto_learn_threshold_nonspam (-0.01). I use this coupled with a series of custom rules with tiny negative scores (all -0.1). This makes nonspam learning something that has to be minimally earned, not just granted by virtue of a low score.
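The "about 80%" figure and the "pretty low" ratio it is being compared with can be reproduced from the counts quoted in this message. This is only a worked-arithmetic sketch; the helper name percent is invented here, and the numbers are copied from the output above.

```shell
# Print part/whole as a percentage with one decimal place.
percent() {
    # $1: part, $2: whole
    LC_ALL=C awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", 100 * part / whole }'
}

# MailScanner log counts from the message: 2766 of 3434 tagged spams autolearned.
matt_rate=$(percent 2766 3434)

# Counts from the caughtspam/ tally: 140 autolearned out of 1545 + 140 + 4 messages.
jim_rate=$(percent 140 $((1545 + 140 + 4)))

echo "autolearned as spam (MailScanner log): ${matt_rate}%"
echo "autolearned as spam (caughtspam/):     ${jim_rate}%"
```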
Re: Spamassassin Learn
On Tue, Feb 07, 2006 at 05:47:36PM -0500, Matt Kettler wrote: Jim C. Nasby wrote: On Tue, Feb 07, 2006 at 05:02:25PM -0500, Matt Kettler wrote: Jim C. Nasby wrote: On Tue, Feb 07, 2006 at 04:40:40PM -0500, Matt Kettler wrote: I would also check to make sure you don't have a lot of spam coming in that's getting autolearned as ham. (note: the learner's idea of score is very different than the final message score, so a message CAN be tagged as spam, and still get autolearned as ham) What would be the easiest way to do that? Grep through my caughtspam maildir? That would be the way I'd check.. grep for autolearn=ham Nothing autolearned. Nothing autolearned at all? or nothing autolearned as ham? Are there any autolearn strings? Are they all autolearn=no? are there any decent number that are autolearn=failed or autolearn=disabled? grep -r autolearn caughtspam/ | grep -v 'Binary file' | sed -e 's/.*autolearn=\([^ ]*\).*/\1/'|sort|uniq -c 1545 no 140 spam 4 unavailable -- Jim C. Nasby, Database Architect[EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 Windows: Where do you want to go today? Linux: Where do you want to go tomorrow? FreeBSD: Are you guys coming, or what?
Re: Spamassassin Learn
On Tue, Feb 07, 2006 at 06:17:20PM -0500, Matt Kettler wrote:
Jim C. Nasby wrote:
Are there any autolearn strings? Are they all autolearn=no? Are there a decent number that are autolearn=failed or autolearn=disabled?
    grep -r autolearn caughtspam/ | grep -v 'Binary file' | sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
    1545 no
     140 spam
       4 unavailable
Fair enough, that at least suggests that the autolearner is working. However, that learning ratio is pretty low. Are you using network tests? Without DNSBLs it's often hard to get enough header points to cause spam learning.
I believe so...
    grep loadplugin /usr/local/etc/mail/spamassassin/init.pre
    # loadplugin Mail::SpamAssassin::Plugin::RelayCountry
    loadplugin Mail::SpamAssassin::Plugin::URIDNSBL
    loadplugin Mail::SpamAssassin::Plugin::Hashcash
    loadplugin Mail::SpamAssassin::Plugin::SPF
    grep -v '#' ~/.spamassassin/user_prefs | grep -v whitelist
    bayes_auto_learn 1
    bayes_auto_learn_threshold_spam 5.0
This is basically a stock FreeBSD install from ports, if you're familiar...
--
Jim C. Nasby, Database Architect [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828
Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?
Re: Spamassassin Learn
Probably would work if you were running Linux. Jim C. Nasby wrote: On Tue, Feb 07, 2006 at 05:47:36PM -0500, Matt Kettler wrote: Chupacabra
Re: Spamassassin Learn
Jim C. Nasby wrote:
Are you using network tests? Without DNSBLs it's often hard to get enough header points to cause spam learning.
I believe so...
    grep loadplugin /usr/local/etc/mail/spamassassin/init.pre
    # loadplugin Mail::SpamAssassin::Plugin::RelayCountry
    loadplugin Mail::SpamAssassin::Plugin::URIDNSBL
    loadplugin Mail::SpamAssassin::Plugin::Hashcash
    loadplugin Mail::SpamAssassin::Plugin::SPF
None of that will tell you if DNSBLs are enabled. The DNSBLs aren't a plugin, they're a built-in that auto-enables itself if you have perl's Net::DNS installed. Try running spamassassin --lint -D and look for these lines:
    [18000] dbg: dns: is Net::DNS::Resolver available? yes
    [18000] dbg: dns: Net::DNS version: 0.48
This is basically a stock FreeBSD install from ports, if you're familiar...
Nope. I personally dislike distro packages and ports of any sort for tools that are rapidly updated.
Re: Spamassassin Learn
On Tue, Feb 07, 2006 at 05:36:56PM -0600, Jim C. Nasby wrote:
On Tue, Feb 07, 2006 at 06:17:20PM -0500, Matt Kettler wrote:
Jim C. Nasby wrote:
Are there any autolearn strings? Are they all autolearn=no? Are there a decent number that are autolearn=failed or autolearn=disabled?
    grep -r autolearn caughtspam/ | grep -v 'Binary file' | sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
    1545 no
     140 spam
       4 unavailable
Fair enough, that at least suggests that the autolearner is working. However, that learning ratio is pretty low. Are you using network tests? Without DNSBLs it's often hard to get enough header points to cause spam learning.
I believe so...
    grep loadplugin /usr/local/etc/mail/spamassassin/init.pre
    # loadplugin Mail::SpamAssassin::Plugin::RelayCountry
    loadplugin Mail::SpamAssassin::Plugin::URIDNSBL
    loadplugin Mail::SpamAssassin::Plugin::Hashcash
    loadplugin Mail::SpamAssassin::Plugin::SPF
    grep -v '#' ~/.spamassassin/user_prefs | grep -v whitelist
    bayes_auto_learn 1
    bayes_auto_learn_threshold_spam 5.0
Hmm... here's something interesting...
    grep -r autolearn pgsql/ | grep -v 'Binary file' | sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
    2010 ham
     198 no
      17 unavailable
So a big chunk of [EMAIL PROTECTED] email is being learned as ham. Looking further, I see...
    X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00 autolearn=ham version=3.1.0
ISTM that having the thresholds set up so that BAYES_00 scores low enough to autolearn is a BadThing, as it creates a positive feedback loop. :) I've added bayes_auto_learn_threshold_nonspam -2.6 to my personal config; we'll see if that helps.
--
Jim C. Nasby, Database Architect [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828
Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?
Re: Spamassassin Learn
On Tue, Feb 07, 2006 at 06:47:06PM -0500, Matt Kettler wrote:
Jim C. Nasby wrote:
Are you using network tests? Without DNSBLs it's often hard to get enough header points to cause spam learning.
I believe so...
    grep loadplugin /usr/local/etc/mail/spamassassin/init.pre
    # loadplugin Mail::SpamAssassin::Plugin::RelayCountry
    loadplugin Mail::SpamAssassin::Plugin::URIDNSBL
    loadplugin Mail::SpamAssassin::Plugin::Hashcash
    loadplugin Mail::SpamAssassin::Plugin::SPF
None of that will tell you if DNSBLs are enabled. The DNSBLs aren't a plugin, they're a built-in that auto-enables itself if you have perl's Net::DNS installed. Try running spamassassin --lint -D and look for these lines:
    [18000] dbg: dns: is Net::DNS::Resolver available? yes
    [18000] dbg: dns: Net::DNS version: 0.48
    spamassassin --lint -D | grep Net::DNS | grep -i version
    [50306] dbg: dns: Net::DNS version: 0.55
    [50306] dbg: diag: module installed: Net::DNS, version 0.55
--
Jim C. Nasby, Database Architect [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828
Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?
Re: Spamassassin Learn
On Tue, Feb 07, 2006 at 05:45:54PM -0600, mike wrote: Probably would work if you were running Linux. The problem isn't that it isn't working, the problem is that it's working too well. I guess maybe that's something you're not used to. :P -- Jim C. Nasby, Database Architect[EMAIL PROTECTED] Give your computer some brain candy! www.distributed.net Team #1828 Windows: Where do you want to go today? Linux: Where do you want to go tomorrow? FreeBSD: Are you guys coming, or what?
Re: Spamassassin Learn
Jim C. Nasby wrote: On Tue, Feb 07, 2006 at 05:45:54PM -0600, mike wrote: Probably would work if you were running Linux. The problem isn't that it isn't working, the problem is that it's working too well. I guess maybe that's something you're not used to. :P Something tells me if that were true you would not be in here asking questions but demoing howtos IE how to make SA work too well. Whatever that is supposed to mean.
Re: Spamassassin Learn
Jim C. Nasby wrote:
On Tue, Feb 07, 2006 at 05:36:56PM -0600, Jim C. Nasby wrote:
On Tue, Feb 07, 2006 at 06:17:20PM -0500, Matt Kettler wrote:
Jim C. Nasby wrote:
Are there any autolearn strings? Are they all autolearn=no? Are there a decent number that are autolearn=failed or autolearn=disabled?
    grep -r autolearn caughtspam/ | grep -v 'Binary file' | sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
    1545 no
     140 spam
       4 unavailable
Fair enough, that at least suggests that the autolearner is working. However, that learning ratio is pretty low. Are you using network tests? Without DNSBLs it's often hard to get enough header points to cause spam learning.
I believe so...
    grep loadplugin /usr/local/etc/mail/spamassassin/init.pre
    # loadplugin Mail::SpamAssassin::Plugin::RelayCountry
    loadplugin Mail::SpamAssassin::Plugin::URIDNSBL
    loadplugin Mail::SpamAssassin::Plugin::Hashcash
    loadplugin Mail::SpamAssassin::Plugin::SPF
    grep -v '#' ~/.spamassassin/user_prefs | grep -v whitelist
    bayes_auto_learn 1
    bayes_auto_learn_threshold_spam 5.0
Hmm... here's something interesting...
    grep -r autolearn pgsql/ | grep -v 'Binary file' | sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
    2010 ham
     198 no
      17 unavailable
So a big chunk of [EMAIL PROTECTED] email is being learned as ham. Looking further, I see...
    X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00 autolearn=ham version=3.1.0
ISTM that having the thresholds set up so that BAYES_00 scores low enough to autolearn is a BadThing, as it creates a positive feedback loop. :) I've added bayes_auto_learn_threshold_nonspam -2.6 to my personal config; we'll see if that helps.
Jim, Bayes is NOT used when calculating the autolearning score; that would promote self-feedback. As I said before, the autolearner's concept of score is VERY different from the final message score. Score contributions from Bayes, white/blacklists, and the AWL are all ignored by the autolearner.
It also looks up the individual rule scores from set 0 or 1 instead of 2 or 3. This is a MASSIVE difference.
However, the default autolearn threshold is 0.1. That's a POSITIVE threshold. To the autolearner that message scored 0 points. 0 is less than 0.1, so it learned as HAM.
I'd suggest re-adjusting your threshold, as a default SpamAssassin config will only VERY rarely generate a negative score to the autolearner. The only rules that can do it are BondedSender, Habeas COI/SOI and Hashcash. Hashcash is so rare it may as well not exist at present. BondedSender and Habeas are only used by large legitimate mailers, so none of your person-to-person mail will ever get autolearned in your current setup unless you know someone who uses Hashcash.
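The threshold behaviour described above can be sketched as a three-way comparison. This is a deliberately simplified model, not SpamAssassin's actual code: the function name autolearn_decision is invented, the exact handling of scores that land precisely on a threshold is an assumption, and the real autolearner applies further conditions that are not modelled here. The default thresholds (0.1 and 12.0) and Jim's values (-2.6 and 5.0) are taken from the thread.

```shell
# Simplified autolearn decision.
# $1: learner score (Bayes, AWL and white/blacklist points already excluded)
# $2: bayes_auto_learn_threshold_nonspam
# $3: bayes_auto_learn_threshold_spam
autolearn_decision() {
    LC_ALL=C awk -v score="$1" -v ham="$2" -v spam="$3" 'BEGIN {
        if (score + 0 < ham + 0)        print "ham"    # below the nonspam threshold
        else if (score + 0 >= spam + 0) print "spam"   # at or above the spam threshold (assumed boundary)
        else                            print "no"     # in between: not learned
    }'
}

# The BAYES_00-only message: the learner sees 0 points, the final score was -2.6.
echo "defaults (0.1 / 12.0):   $(autolearn_decision 0 0.1 12.0)"
echo "Jim's config (-2.6 / 5): $(autolearn_decision 0 -2.6 5.0)"
```

Under this model the message is learned as ham with the default thresholds, and nothing scoring 0 to the learner is learned once the nonspam threshold is pushed down to -2.6, which is the point being made about person-to-person mail.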
Re: Spamassassin Learn
From: Matt Kettler [EMAIL PROTECTED]
Jim C. Nasby wrote:
Are there any autolearn strings? Are they all autolearn=no? Are there a decent number that are autolearn=failed or autolearn=disabled?
    grep -r autolearn caughtspam/ | grep -v 'Binary file' | sed -e 's/.*autolearn=\([^ ]*\).*/\1/' | sort | uniq -c
    1545 no
     140 spam
       4 unavailable
Fair enough, that at least suggests that the autolearner is working. However, that learning ratio is pretty low. Are you using network tests? Without DNSBLs it's often hard to get enough header points to cause spam learning.
(Note I use MailScanner, hence the odd log syntax.)
    grep "is spam," /var/log/maillog | wc -l
    3434
    grep "is spam," /var/log/maillog | grep autolearn=spam | wc -l
    2766
    grep "is spam," /var/log/maillog | grep "autolearn=not spam" | wc -l
    0
So I'm autolearning about 80% of my tagged spam as spam, and none as ham. I'm also autolearning about 38% of my nonspam as ham.
I'm using the default bayes_auto_learn_threshold_spam (12.0). I'm also using a modified bayes_auto_learn_threshold_nonspam (-0.01). I use this coupled with a series of custom rules with tiny negative scores (all -0.1). This makes nonspam learning something that has to be minimally earned, not just granted by virtue of a low score.
I wonder if he has greylisting turned on. {^_^}
Re: Spamassassin Learn
On Tue, Feb 07, 2006 at 07:59:37PM -0500, Matt Kettler wrote:
Jim, Bayes is NOT used when calculating the autolearning score; that would promote self-feedback. As I said before, the autolearner's concept of score is VERY different from the final message score. Score contributions from Bayes, white/blacklists, and the AWL are all ignored by the autolearner. It also looks up the individual rule scores from set 0 or 1 instead of 2 or 3. This is a MASSIVE difference.
However, the default autolearn threshold is 0.1. That's a POSITIVE threshold. To the autolearner that message scored 0 points. 0 is less than 0.1, so it learned as HAM.
I'd suggest re-adjusting your threshold, as a default SpamAssassin config will only VERY rarely generate a negative score to the autolearner. The only rules that can do it are BondedSender, Habeas COI/SOI and Hashcash. Hashcash is so rare it may as well not exist at present. BondedSender and Habeas are only used by large legitimate mailers, so none of your person-to-person mail will ever get autolearned in your current setup unless you know someone who uses Hashcash.
Ahh, got it. Makes much more sense. :) So I guess either 0 or -0.1 makes the most sense?
--
Jim C. Nasby, Database Architect [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828
Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?
Re: Spamassassin Learn
Jim C. Nasby wrote:
On Tue, Feb 07, 2006 at 07:59:37PM -0500, Matt Kettler wrote:
Jim, Bayes is NOT used when calculating the autolearning score; that would promote self-feedback. As I said before, the autolearner's concept of score is VERY different from the final message score. Score contributions from Bayes, white/blacklists, and the AWL are all ignored by the autolearner. It also looks up the individual rule scores from set 0 or 1 instead of 2 or 3. This is a MASSIVE difference.
However, the default autolearn threshold is 0.1. That's a POSITIVE threshold. To the autolearner that message scored 0 points. 0 is less than 0.1, so it learned as HAM.
I'd suggest re-adjusting your threshold, as a default SpamAssassin config will only VERY rarely generate a negative score to the autolearner. The only rules that can do it are BondedSender, Habeas COI/SOI and Hashcash. Hashcash is so rare it may as well not exist at present. BondedSender and Habeas are only used by large legitimate mailers, so none of your person-to-person mail will ever get autolearned in your current setup unless you know someone who uses Hashcash.
Ahh, got it. Makes much more sense. :) So I guess either 0 or -0.1 makes the most sense?
0 makes the most sense, unless you add on negative-scoring rules.
With a default SA there's really no difference in autolearning threshold between -1.3 and -0.1, and very little difference between -0.001 and -100.0. Ignoring Hashcash due to its rarity, and omitting Bayes, the AWL, and all whitelists because they can't count:
There are 0 rules in SA that can get you a learning score at or below -8.001.
There are only 3 rules in SA that can get you a learning score at or below -2.3.
There are only 7 rules in SA that can get you a learning score at or below -0.1.
There are only 12 rules in SA that can get you a learning score at or below -0.001.
The difference between the 4 cases is more or less moot. You won't learn much ham at all.
Even if you consider Hashcash, that's only another 5 rules, and only applies when senders realize what Hashcash even is.
I run my boxes with -0.01 as a threshold, but I've added on about 30 simple body-text rules looking for industry terminology for my company's business and assigning -0.02 scores to them. This way I autolearn any business-related mail without any real chance of a spammer abusing them to whitelist himself. Even if a spam hit every single one of my rules, it would only knock 0.6 points off the spam score.
For reference, these are the only rules in a stock SA 3.1.0 that can give you a negative learning score:
    score HABEAS_ACCREDITED_COI 0 -8.0 0 -8.0
    score RCVD_IN_BSP_TRUSTED 0 -4.3 0 -4.3
    score HABEAS_ACCREDITED_SOI 0 -4.3 0 -4.3
    score ALL_TRUSTED -1.360 -1.440 -1.665 -1.800
    score RCVD_IN_IADB_VOUCHED 0 -1.825 0 -2.200
    score HABEAS_CHECKED 0 -0.2 0 -0.2
    score RCVD_IN_BSP_OTHER 0 -0.1 0 -0.1
    score NO_RELAYS -0.001
    score NO_RECEIVED -0.001
    score DK_VERIFIED -0.001
    score SPF_PASS -0.001
    score SPF_HELO_PASS -0.001
    score HASHCASH_20 -0.500
    score HASHCASH_21 -0.700
    score HASHCASH_22 -1.000
    score HASHCASH_23 -2.000
    score HASHCASH_24 -3.000
    score HASHCASH_25 -4.000
    score HASHCASH_HIGH -5.000
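The rule counts quoted in this message (0, 3, 7 and 12) can be re-derived from the score list with a short awk filter. This is a sketch under two stated assumptions: the learner is taken to read score set 1 (the second of the four values, i.e. network tests on and Bayes off), and a line with a single value is taken to apply to every set. The function name count_at_or_below is made up; the score lines are copied from the list above with the Hashcash rules left out, as the message does.

```shell
# Score lines from the message (Hashcash omitted).
rules='score HABEAS_ACCREDITED_COI 0 -8.0 0 -8.0
score RCVD_IN_BSP_TRUSTED 0 -4.3 0 -4.3
score HABEAS_ACCREDITED_SOI 0 -4.3 0 -4.3
score ALL_TRUSTED -1.360 -1.440 -1.665 -1.800
score RCVD_IN_IADB_VOUCHED 0 -1.825 0 -2.200
score HABEAS_CHECKED 0 -0.2 0 -0.2
score RCVD_IN_BSP_OTHER 0 -0.1 0 -0.1
score NO_RELAYS -0.001
score NO_RECEIVED -0.001
score DK_VERIFIED -0.001
score SPF_PASS -0.001
score SPF_HELO_PASS -0.001'

# Count rules whose set-1 score alone reaches a candidate threshold.
count_at_or_below() {
    # $1: candidate learning-score threshold
    printf '%s\n' "$rules" | LC_ALL=C awk -v t="$1" '
        { s = (NF >= 6) ? $4 : $3 }    # four values: take set 1; one value: use it
        s + 0 <= t + 0 { n++ }
        END { print n + 0 }'
}

for t in -8.001 -2.3 -0.1 -0.001; do
    echo "at or below $t: $(count_at_or_below "$t") rules"
done
```

The same arithmetic explains the 0.6 figure: about 30 custom rules at -0.02 each can remove at most 30 x 0.02 = 0.6 points.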
Re: Spamassassin Learn
On Tuesday 07 February 2006 15:27, Clay Davis wrote:
Does anyone have any good techniques for capturing a sample of ham that can be used as the ham corpus? I'm in a corporate environment and am not keen on the idea of intercepting non-spam messages. I will if I have to, but was hoping someone had a better idea.
I wouldn't have too guilty a conscience on that subject because generally, you won't be reading very much other than intercepted spam. There may be an FP in there occasionally, but you'll soon learn to catch those and feed them to the ham learner, and hence move them to the correct mailbox folder. In other words, to make an omelette, you normally have to break a few eggs. What you accidentally read in an FP should be treated with the usual amount of salt and otherwise forgotten.
Regards, Clay
On 2/7/2006 at 3:16 pm, in message [EMAIL PROTECTED], Matt Kettler [EMAIL PROTECTED] wrote:
[EMAIL PROTECTED] wrote:
Can you just feed spamassassin spam or do you need to give it ham also? I read the docs and it didn't say you had to feed it ham. I then read another doc and it said you should feed it equal amounts of spam and ham.
Yes, you really should feed it both. You also should strive for a 1:1 ratio of spam and nonspam, but don't kill yourself to get there. SA's use of chi-squared combining makes it very tolerant of wild imbalances in training. However, the closer you are to a 1:1 ratio the better SA will be able to distinguish tokens that are present in both kinds of mail and ignore them. So this is a worthwhile goal to strive for as long as it doesn't become a burden.
My current training ratio is about 7:1 spam:nonspam, but in the past it's been as bad as 20:1. Both of those are very far off from equal amounts, but the imbalance has never caused me any problems.
From my sa-learn --dump magic output as of today:
    0.000 0 995764 0 non-token data: nspam
    0.000 0 145377 0 non-token data: nham
That works out to a ratio of 6.85:1.
--
Cheers, Gene
People having trouble with vz bouncing email to me should add the word 'online' between the 'verizon', and the dot which bypasses vz's stupid bounce rules. I do use spamassassin too. :-)
Yahoo.com and AOL/TW attorneys please note, additions to the above message by Gene Heskett are: Copyright 2006 by Maurice Eugene Heskett, all rights reserved.
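The 6.85:1 figure can be checked directly from the two "non-token data" lines. The sketch below parses the sample output quoted in this thread rather than a live database; the function name spam_ham_ratio is invented for illustration.

```shell
# Sample lines copied from the sa-learn --dump magic output quoted above.
magic='0.000 0 995764 0 non-token data: nspam
0.000 0 145377 0 non-token data: nham'

# Read "sa-learn --dump magic" style lines on stdin and print nspam/nham.
spam_ham_ratio() {
    LC_ALL=C awk '
        $NF == "nspam" { nspam = $3 }   # third field holds the message count
        $NF == "nham"  { nham = $3 }
        END { if (nham > 0) printf "%.2f\n", nspam / nham }'
}

ratio=$(printf '%s\n' "$magic" | spam_ham_ratio)
echo "spam:ham = ${ratio}:1"
```

Against a real Bayes database the same function could be fed with sa-learn --dump magic | spam_ham_ratio, run as whichever user owns the database being checked.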