ham source for site-wide bayes?
I've set up SpamAssassin with a site-wide Bayes configuration. I have some spamtrap email addresses that feed fresh spam into Bayes for training from a cron job. However, from what I've read, Bayes needs ongoing ham as well as spam for training in order to work well. What's the usual method of supplying the ham? Does that have to be done manually (and how often?), or has anyone come up with a way to supply ham automatically? I have the spamtrap mailboxes that receive only spam, but all the real email addresses on the server receive a mix of ham and spam, which is why I need SpamAssassin in the first place :) I can't find anything in the SpamAssassin docs so far that explains a non-manual way of supplying ham. Have I missed something? Is there some sort of service where I can subscribe to an automatically updated ham corpus, like the ClamAV database? -Steve
Re: ham source for site-wide bayes?
On 5/20/2015 12:29 PM, Steve Rainwater wrote: I've set up spamassassin with a site-wide bayes configuration. I have some spamtrap email addresses that supply fresh spam into bayes for training on a cron job. However, from what I've read, bayes needs to have ongoing ham as well as spam for training in order to work well. What's the usual method of supplying the ham? Does that have to be done manually (how often?) or has anyone come up with a way to automatically supply ham. I have the spamtrap email boxes that receive spam-only but all the real email addresses on the server receive a mix of ham and spam, which is why I need spamassassin in the first place :) I can't find anything in spamassassin docs so far that explains a non-manual way of supplying ham. Have I missed something? Is there some sort of service where I can subscribe to an updated ham corpus automatically like with the clamav database? One way people often supply ham is to use sent items from your legit users. Regards, KAM
Re: ham source for site-wide bayes?
On 20.05.2015 18:29, Steve Rainwater wrote:
> I've set up spamassassin with a site-wide bayes configuration. I have some spamtrap email addresses that supply fresh spam into bayes for training on a cron job. However, from what I've read, bayes needs to have ongoing ham as well as spam for training in order to work well. What's the usual method of supplying the ham? Does that have to be done manually (how often?)

It doesn't have to be done; you *can* do it manually.

> or has anyone come up with a way to automatically supply ham.

It's called auto_learn [works for me]. You'll find all the details under LEARNING OPTIONS in https://spamassassin.apache.org/full/3.4.x/doc/Mail_SpamAssassin_Conf.txt

> I have the spamtrap email boxes that receive spam-only but all the real email addresses on the server receive a mix of ham and spam, which is why I need spamassassin in the first place :) I can't find anything in spamassassin docs so far that explains a non-manual way of supplying ham. Have I missed something? Is there some sort of service where I can subscribe to an updated ham corpus automatically like with the clamav database?

Your ham is specific to your traffic; you cannot inherit somebody else's ham and expect it to work nicely with your traffic. You'll soon read a dozen ways to do it. I'll add mine: I use autolearn AND feed bayes trap data to a 6GB Redis DB [works for me]

Axb
Re: Site-wide bayes and individual bayes
On 10 Oct 2014, at 06:49 , RW rwmailli...@googlemail.com wrote: And, if not, is it generally better to do sitewide? It's hard to say, there are advantages and disadvantages either way. OK, so specific example then. Small server with a few dozen email users spread over several domains. Almost none of these users does any spam training at all, the rest just delete unwanted messages (not even marking them as junk) or even worse, just ignore them. One user is very aggressive in marking Spam and in keeping the Inbox clear of all spam. I am of two minds. First, that everyone else would benefit from this user’s actions or, alternatively, that the user’s aggressive tagging will actually ‘poison’ the bayes db for the other users who maybe do not think that endless emails from pinterest or some political candidate are actually spam. -- You see, in this world there's two kinds of people, my friend: Those with loaded guns and those who dig. You dig.
Re: Site-wide bayes and individual bayes
On 12.10.2014 at 18:59, LuKreme wrote:
> On 10 Oct 2014, at 06:49, RW rwmailli...@googlemail.com wrote:
>>> And, if not, is it generally better to do sitewide?
>> It's hard to say, there are advantages and disadvantages either way.
> OK, so specific example then. Small server with a few dozen email users spread over several domains. Almost none of these users does any spam training at all, the rest just delete unwanted messages (not even marking them as junk) or even worse, just ignore them. One user is very aggressive in marking Spam and in keeping the Inbox clear of all spam. I am of two minds. First, that everyone else would benefit from this user’s actions or, alternatively, that the user’s aggressive tagging will actually ‘poison’ the bayes db for the other users who maybe do not think that endless emails from pinterest or some political candidate are actually spam.

If nobody trains their user-specific Bayes (like here), site-wide is the way to go, simply because until a user has flagged 200 ham messages his Bayes won't get used, regardless of how many messages were marked as spam.

Merging one user's aggressive training site-wide means you need to trust that user's actions: he needs to be careful and not just flag anything he doesn't want to see as spam.

If it is really one or two users like here, I would stay with a normal site-wide Bayes. I implemented that with IMAP shared folders, where those users see a ham/spam folder to move messages into and are advised to be careful that ham samples do not leak sensitive content. I review that stuff, save the eml messages to the training folders on the mailserver and call the sa-learn script. So far the result is nearly 100% over 8 weeks in production (99% of spam caught, no false positives).
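The review-then-train workflow described above could be scripted roughly as follows. This is only a sketch: the training root `/var/spool/sa-training` and the `spam`/`ham` subfolder names are hypothetical, and the poster's actual script is not shown in the thread. The only thing taken from SpamAssassin itself is that `sa-learn` accepts `--spam` or `--ham` followed by a file or directory.

```python
import shlex
from pathlib import Path


def build_sa_learn_commands(train_root, spam_dir="spam", ham_dir="ham"):
    """Return the sa-learn command lines for the reviewed training folders.

    train_root, spam_dir and ham_dir are hypothetical names for wherever
    the reviewed .eml files are saved on the mail server.
    """
    root = Path(train_root)
    return [
        ["sa-learn", "--spam", str(root / spam_dir)],
        ["sa-learn", "--ham", str(root / ham_dir)],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe to
    # try; pass each list to subprocess.run(cmd, check=True) to really train.
    for cmd in build_sa_learn_commands("/var/spool/sa-training"):
        print(shlex.join(cmd))
```

Run from cron as the user that owns the site-wide Bayes database, so the learned tokens end up in the shared store rather than in a per-user one.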
Re: Site-wide bayes and individual bayes
On 10/12/2014 9:59 AM, LuKreme wrote: On 10 Oct 2014, at 06:49 , RWrwmailli...@googlemail.com wrote: And, if not, is it generally better to do sitewide? It's hard to say, there are advantages and disadvantages either way. OK, so specific example then. Small server with a few dozen email users spread over several domains. Almost none of these users does any spam training at all, the rest just delete unwanted messages (not even marking them as junk) or even worse, just ignore them. One user is very aggressive in marking Spam and in keeping the Inbox clear of all spam. I am of two minds. First, that everyone else would benefit from this user’s actions or, alternatively, that the user’s aggressive tagging will actually ‘poison’ the bayes db for the other users who maybe do not think that endless emails from pinterest or some political candidate are actually spam. For starters your problem isn't SPAM it's HAM. You can get all the spam you want. Just parse the mail log file every day for a few weeks, looking for delivery attempts to nonexistent mailboxes. When you see repeated delivery attempts to a specific mailbox then create an email address on that nonexistent mailbox and redirect all the email into it into a spam box My experience is that once spammers think they have discovered an email address they will never leave it alone, they will send increasing amounts of spam to that address. If you are lucky enough to never have spammers trying to probe your server, you can create your honeypot email addresses, just make them up, and then take these email addresses and post them into the Unsubscribe links on spam. That is a good way to contaminate spammers mailing lists with honeypot addresses. A legitimate mailsender will ignore these, a spammer will happily pull addresses out of unsubscribe replies. That's your centralized spam source. 
Do this for a couple dozen nonexistent email addresses on your server domains and you will have all the input you want for the Bayes learner. By definition ANY email to a nonexistent address (not an old address that was closed down years ago) is unsolicited, AKA SPAM. As for desired political mail, on my servers I classify all of it as spam, I can think of maybe only 2 users over the last decade who have complained about not getting it and for those it's easy to do an all_spam_to to them and then tell them they will have to do their own spam filtering. Since overwhelmingly the political email I have seen coming in is the offensive conservative anti-women, anti-blacks, anti-latinos, beg for more money email, I have to say that I'm not particularly concerned about the wishes of customers who WANT that kind of mail - I'm quite happy if they go find another provider. And, naturally, that kind of email is never ever appropriate for a business and no employee in a business is ever going to dare complain to their bosses that they aren't getting it. If the politicos want to drown people in hate mail, they have paper mail to do it - might as well make them help reduce my taxes by subsidizing the US Post Office with their hate mail, that's about the only thing that's good about it. Anyway, as I said HAM is the problem. If you don't have large quantities of ham, Bayes won't work. Of course, nothing is preventing you from copying people's folders (if they are using IMAP) into one giant mailbox and using that as a HAM source. You can probably assume that if a user has gone to the trouble of saving mail to a folder that it is ham. Ted
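The mail-log scan Ted describes (looking for repeated delivery attempts to nonexistent mailboxes) might be sketched like this. The log format is an assumption: the code expects Postfix-style rejection lines that contain both "User unknown" and a `to=<address>` field, so the pattern will need adjusting for other MTAs. The `min_attempts` cut-off is also just an illustrative choice.

```python
import re
from collections import Counter

# Assumed log format: Postfix-style lines carrying "to=<addr>" and the text
# "User unknown" somewhere on the same line. Adjust for your own MTA.
RECIPIENT_RE = re.compile(r"\bto=<([^>]+)>")


def count_unknown_recipients(log_lines):
    """Count rejected delivery attempts per nonexistent recipient address."""
    counts = Counter()
    for line in log_lines:
        if "user unknown" not in line.lower():
            continue
        match = RECIPIENT_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


def honeypot_candidates(log_lines, min_attempts=3):
    """Return addresses probed at least min_attempts times, sorted by name."""
    counts = count_unknown_recipients(log_lines)
    return sorted(addr for addr, n in counts.items() if n >= min_attempts)


if __name__ == "__main__":
    with open("/var/log/mail.log", errors="replace") as log:
        for address in honeypot_candidates(log):
            print(address)
```

Addresses that belonged to real users in the past should be filtered out by hand before turning any candidate into a spamtrap, as the post itself points out.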
Re: Site-wide bayes and individual bayes
On Wed, 8 Oct 2014 15:26:25 -0600 LuKreme wrote:
> Is it possible to have a site-wide bayes AND individual bayes for some users (or all users)?

Not as things stand. You could use Bayes for one and a separate filter for the other.

> And, if not, is it generally better to do sitewide?

It's hard to say, there are advantages and disadvantages either way.

> And, is it possible to take all the individual bayes and combine them into a site-wide db?

It should be fairly straightforward to combine the results from running sa-learn --backup on multiple accounts. It's just a matter of combining the total ham/spam message counts and the counts for each token.
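A minimal sketch of the merge RW outlines, summing the message totals and the per-token counts from several `sa-learn --backup` dumps. The dump layout used here is an assumption from memory, not something stated in the thread: tab-separated lines of the form `v <value> <name>` for the totals, `t <spam> <ham> <atime> <token>` for tokens, and `s ...` for the "seen" message list. Check it against a real backup before relying on it. The sketch drops the `s` lines, and it will double-count any message that was learned into more than one of the databases.

```python
from collections import defaultdict


def merge_backups(backup_texts):
    """Merge several sa-learn --backup dumps into one dump string.

    Assumed (unverified) format: tab-separated 'v', 't' and 's' lines.
    'seen' entries ('s' lines) are not carried over in this sketch.
    """
    num_spam = 0
    num_ham = 0
    db_version = None
    tokens = defaultdict(lambda: [0, 0, 0])  # token -> [spam, ham, atime]

    for text in backup_texts:
        for line in text.splitlines():
            fields = line.split("\t")
            if fields[0] == "v" and len(fields) >= 3:
                # The name field may carry a trailing comment; keep word one.
                words = fields[2].split()
                name = words[0] if words else ""
                if name == "num_spam":
                    num_spam += int(fields[1])
                elif name == "num_nonspam":
                    num_ham += int(fields[1])
                elif name == "db_version":
                    db_version = fields[1]
            elif fields[0] == "t" and len(fields) >= 5:
                entry = tokens[fields[4]]
                entry[0] += int(fields[1])
                entry[1] += int(fields[2])
                entry[2] = max(entry[2], int(fields[3]))  # newest access time

    out = [
        f"v\t{db_version}\tdb_version # this must be the first line!!!",
        f"v\t{num_spam}\tnum_spam",
        f"v\t{num_ham}\tnum_nonspam",
    ]
    for token, (spam, ham, atime) in sorted(tokens.items()):
        out.append(f"t\t{spam}\t{ham}\t{atime}\t{token}")
    return "\n".join(out) + "\n"
```

The merged text could then be written to a file and loaded into the site-wide database with `sa-learn --restore`, ideally on a test copy first.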
Re: Site-wide bayes and individual bayes
On Fri, 10 Oct 2014, RW wrote:
> On Wed, 8 Oct 2014 15:26:25 -0600 LuKreme wrote:
>> Is it possible to have a site-wide bayes AND individual bayes for some users (or all users)?
> Not as things stand.

Not as things stand, possibly absent a hack like: any user who wants to use the site-wide bayes has symlinks to the shared bayes database files in their local dir. Not sure how well that would work in practice (locking if you autolearn), and it would be somewhat tedious to maintain.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
Maxim VI: If violence wasn’t your last resort, you failed to resort to enough of it.
---
862 days since the first successful private support mission to ISS (SpaceX)
Site-wide bayes and individual bayes
Is it possible to have a site-wide bayes AND individual bayes for some users (or all users)? And, if not, is it generally better to do sitewide? And, is it possible to take all the individual bayes and combine them into a site-wide db? -- You've got to dance like nobody's watching. - Kathy Mattea
Re: sa-learn site-wide bayes on Redis
On 20.08.2014 at 14:42, Axb wrote:
> On 08/20/2014 02:25 PM, Matteo Dessalvi wrote:
>> Hi all. I am managing a bunch of Linux MTAs which are placed in front of some Exchange servers. In such a configuration the Bayes filter is deployed site-wide. For a new deployment of these servers I am planning to use Redis as a centralized backend (previously the bayes db were just files saved on the disk). My question is: do I have to use a specific option to tell sa-learn that the bayes db is now hosted on Redis? Or sa-learn will use the info from the bayes_sql_dsn directive in my local.cf? Looking into the wiki: http://wiki.apache.org/spamassassin/SiteWideBayesSetup or into the sa-learn docs: http://spamassassin.apache.org/full/3.4.x/doc/sa-learn.html did not give me any clues.
>
> see http://svn.apache.org/repos/asf/spamassassin/trunk/contrib/HOWTO.Bayes-Redis/ hope that helps. This is not an official doc, so if you see anything that needs to be added/changed, pls let me know.

Hi! I'm reading bayes_redis.cf and I can see:

    #NOTE: We're not using authentication assuming the Redis server/port should not be reachable form the outside
    # You can add authentication once you've seen it work.

Does this mean that the example config simply doesn't include authentication options, or that SA doesn't support auth for Redis?

Marcin
Re: sa-learn site-wide bayes on Redis
I am pretty sure SA supports the Redis authentication mechanism. For my tests I have used the following line:

    bayes_sql_dsn server=127.0.0.1:6379;password=MySecretPWD;database=2

Matteo

On 21.08.2014 12:56, Marcin Mirosław wrote:
> Hi! I'm reading bayes_redis.cf and I can see:
>     #NOTE: We're not using authentication assuming the Redis server/port should not be reachable form the outside
>     # You can add authentication once you've seen it work.
> Does it means that this example config doesn't include authentication options or it means that SA doesn't support auth for redis? Marcin
Re: sa-learn site-wide bayes on Redis
On 21.08.2014 at 13:45, Matteo Dessalvi wrote:
> I am pretty sure SA support the Redis authentication mechanism. For my tests I have used the following line:
>     bayes_sql_dsn server=127.0.0.1:6379;password=MySecretPWD;database=2

Thanks Matteo, I should have tried first and only then written to the ML :) So now I did my own check. It looks like SA doesn't authenticate when it connects to Redis. It didn't work for me with your example, nor when I used

    bayes_sql_password password

When Redis requires a password, SA throws

    bayes: Redis failed: Redis error: ERR operation not permitted

and tcpdump also confirms that SA doesn't do AUTH. It's strange, because in Redis.pm I can see that authentication is supported. Now I'm wondering where I could have made a mistake in the configuration...

Thanks, Marcin
Re: sa-learn site-wide bayes on Redis
Which version of Redis are you using? I had some problems with the 2.4 version packaged by Debian, and I solved a similar problem by using a more recent version, like 2.7 or 2.8.

Matteo

On 21.08.2014 14:45, Marcin Mirosław wrote:
> On 21.08.2014 at 13:45, Matteo Dessalvi wrote:
>> I am pretty sure SA support the Redis authentication mechanism. For my tests I have used the following line:
>>     bayes_sql_dsn server=127.0.0.1:6379;password=MySecretPWD;database=2
> Thanks Matteo, firstly I should try then write to ML:) So now I did own check. It looks that SA doesn't authenticate when connects to redis. It didn't work for me with your example not when I used bayes_sql_password password When redis needs passowrd then SA throws bayes: Redis failed: Redis error: ERR operation not permitted, tcpdump also confirms that SA doesn't do AUTH. It's strange because in Redis.pm I can see that authentication is supported. Now I'm thinking where I could made mistake in configuration... Thanks, Marcin
Re: BayesStore::Redis can't do AUTH when Redis is <=2.6 (was: sa-learn site-wide bayes on Redis)
On 21.08.2014 at 15:20, Matteo Dessalvi wrote:
> Which version of Redis are you using? I did have some problems with the 2.4 version packaged by Debian and I did solve a similar problem using a more recent version, like the 2.7 or 2.8.

And you fixed my problem! Indeed, upgrading from redis-2.6.15 to 2.8.13 fixed the problem with AUTH not working. Thanks Matteo!
sa-learn site-wide bayes on Redis
Hi all. I am managing a bunch of Linux MTAs which are placed in front of some Exchange servers. In such a configuration the Bayes filter is deployed site-wide. For a new deployment of these servers I am planning to use Redis as a centralized backend (previously the bayes db were just files saved on the disk). My question is: do I have to use a specific option to tell sa-learn that the bayes db is now hosted on Redis? Or sa-learn will use the info from the bayes_sql_dsn directive in my local.cf? Looking into the wiki: http://wiki.apache.org/spamassassin/SiteWideBayesSetup or into the sa-learn docs: http://spamassassin.apache.org/full/3.4.x/doc/sa-learn.html did not give me any clues. Thanks in advance! Best regards, Matteo
Re: sa-learn site-wide bayes on Redis
On 08/20/2014 02:25 PM, Matteo Dessalvi wrote: Hi all. I am managing a bunch of Linux MTAs which are placed in front of some Exchange servers. In such a configuration the Bayes filter is deployed site-wide. For a new deployment of these servers I am planning to use Redis as a centralized backend (previously the bayes db were just files saved on the disk). My question is: do I have to use a specific option to tell sa-learn that the bayes db is now hosted on Redis? Or sa-learn will use the info from the bayes_sql_dsn directive in my local.cf? Looking into the wiki: http://wiki.apache.org/spamassassin/SiteWideBayesSetup or into the sa-learn docs: http://spamassassin.apache.org/full/3.4.x/doc/sa-learn.html did not give me any clues. see http://svn.apache.org/repos/asf/spamassassin/trunk/contrib/HOWTO.Bayes-Redis/ hope that helps. This is not an official doc, so if you see anything that needs to be added/changed, pls let me know.
Re: sa-learn site-wide bayes on Redis
No, unfortunately it does not help me. I already have a proper config file for SA to access Redis as backend and most of the configurations are done automatically through a Chef cookbook (Redis included). In the docs you pointed me there's nothing about the interaction between sa-learn and Redis. Best regards, Matteo On 20.08.2014 14:42, Axb wrote: see http://svn.apache.org/repos/asf/spamassassin/trunk/contrib/HOWTO.Bayes-Redis/ hope that helps. This is not an official doc, so if you see anything that needs to be added/changed, pls let me know.
Re: sa-learn site-wide bayes on Redis
    bayes_store_module Mail::SpamAssassin::BayesStore::Redis

tells SA to use the Redis backend. To sa-learn this is transparent, as with any other backend (DBD, SDBM, SQL). bayes_redis.cf shows which parameters are mandatory and which are optional.

On 08/20/2014 03:02 PM, Matteo Dessalvi wrote:
> No, unfortunately it does not help me. I already have a proper config file for SA to access Redis as backend and most of the configurations are done automatically through a Chef cookbook (Redis included). In the docs you pointed me there's nothing about the interaction between sa-learn and Redis. Best regards, Matteo
>
> On 20.08.2014 14:42, Axb wrote:
>> see http://svn.apache.org/repos/asf/spamassassin/trunk/contrib/HOWTO.Bayes-Redis/ hope that helps. This is not an official doc, so if you see anything that needs to be added/changed, pls let me know.
Re: sa-learn site-wide bayes on Redis
Ok, perfect! Thanks a lot! This is what I want to know and I was not so sure about. I may be wrong but it looks to me the fact that tools like sa-learn can access transparently the backends configured for SA is not exactly clear from the docs. It would be great if the wiki maintainers could add a short note somewhere in the pages regarding the SiteWide deployment or related topics. Best regards, Matteo On 20.08.2014 15:08, Axb wrote: bayes_store_module Mail::SpamAssassin::BayesStore::Redis tells SA to use the Redis backend. To sa-learn this becomes transparent, as with any other backed (DBD,SDBM,SQL) bayes_redis.cf shows what parameters are mandatory/optional
Re: sa-learn site-wide bayes on Redis
I so love top posters.

On 08/20/2014 03:33 PM, Matteo Dessalvi wrote:
> Ok, perfect! Thanks a lot! This is what I want to know and I was not so sure about. I may be wrong but it looks to me the fact that tools like sa-learn can access transparently the backends configured for SA is not exactly clear from the docs. It would be great if the wiki maintainers could add a short note somewhere in the pages regarding the SiteWide deployment or related topics. Best regards, Matteo
>
> On 20.08.2014 15:08, Axb wrote:
>> bayes_store_module Mail::SpamAssassin::BayesStore::Redis tells SA to use the Redis backend. To sa-learn this becomes transparent, as with any other backend (DBD,SDBM,SQL) bayes_redis.cf shows what parameters are mandatory/optional

Watch your memory usage: if you configure Redis to dump data from memory to file, it's safe to *double* the amount of memory you planned for Redis usage, as in my case:

    sa-learn --dump magic
    0.000          0          3          0  non-token data: bayes db version
    0.000          0   25218483          0  non-token data: nspam
    0.000          0   11919587          0  non-token data: nham

    # Memory
    used_memory:3637407032
    used_memory_human:3.39G
    used_memory_rss:4068585472
    used_memory_peak:3702485960
    used_memory_peak_human:3.45G
    used_memory_lua:205824
    mem_fragmentation_ratio:1.12
    mem_allocator:jemalloc-3.2.0

I keep at least 5 GB of free memory for the dump to file to avoid ugly swaps or crashes.

    free
                 total       used       free     shared    buffers     cached
    Mem:      14262648    5786664    8475984          0     162744    1343408
    -/+ buffers/cache:    4280512    9982136
    Swap:      2046968          0    2046968
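The "double it" rule of thumb above can be turned into a quick check against the memory section of Redis `INFO` output. This is only a sketch: the function names and the default factor of 2.0 are illustrative, and the input is the plain `key:value` text as pasted in the post (for example captured with `redis-cli info memory`).

```python
def parse_redis_info(text):
    """Parse 'key:value' lines from Redis INFO output into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and section headers such as '# Memory'.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        info[key] = value
    return info


def memory_to_plan_for(info_text, factor=2.0):
    """Return the bytes of RAM to budget: used_memory times a safety factor.

    factor=2.0 mirrors the advice in the post for setups where Redis dumps
    its data set to disk; it is a rule of thumb, not a Redis guarantee.
    """
    used = int(parse_redis_info(info_text)["used_memory"])
    return int(used * factor)


if __name__ == "__main__":
    sample = "# Memory\nused_memory:3637407032\nused_memory_human:3.39G\n"
    print(memory_to_plan_for(sample) / 2**30, "GiB")
```

With the figures quoted in the post, the budget comes out at a little under 7 GiB for a data set of about 3.4 GiB.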
Re: Site-wide Bayes
On Wed, 16 Dec 2009 09:36:12 -0500 Michael Scheidell scheid...@secnap.net wrote:
> On 12/16/09 9:27 AM, Thomas Harold wrote:
>> I'm guessing that you'd also want to change the autolearn thresholds to be stricter? Like only auto-learning if it scores below -2 or above +10? (That might be an amavisd-new feature.)
> I still use 0, but have the high score at +15.

The default is 0.1 IIRC, and I wouldn't recommend setting it lower without negative-scoring custom rules; it's set positive for good reasons. BAYES and userconf whitelisting rules don't count for autolearning, so if you set a negative threshold with the default rules, you rely on DNS whitelisting to define ham, the likes of HABEAS. Setting it at exactly 0.0 is also problematic, since the decision to learn is then commonly going to be determined by nominally scored rules that score 0.001 and -0.001.
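RW's point about a threshold of exactly 0.0 can be illustrated with a deliberately simplified model of the autolearn decision. This is a sketch, not SpamAssassin's real logic: the actual plugin computes a separate autolearn score that leaves out BAYES and user-configured whitelist rules and applies further checks before learning spam. The defaults of 0.1 and 12.0 are my recollection of `bayes_auto_learn_threshold_nonspam` and `bayes_auto_learn_threshold_spam`; verify them against your version's documentation.

```python
def autolearn_decision(score, ham_threshold=0.1, spam_threshold=12.0):
    """Simplified autolearn rule: 'ham', 'spam' or 'no' for a given score.

    score stands for the autolearn score (rules such as BAYES_* excluded).
    The default thresholds are assumed, not taken from the thread.
    """
    if score < ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return "no"


if __name__ == "__main__":
    # With the threshold at exactly 0.0, a single informational rule worth
    # +/-0.001 flips the outcome for an otherwise neutral message.
    for score in (-0.001, 0.001):
        print(score, autolearn_decision(score, ham_threshold=0.0))
    # With 0.1, both messages are treated the same way.
    for score in (-0.001, 0.001):
        print(score, autolearn_decision(score, ham_threshold=0.1))
```

Michael's "0 and +15" setting corresponds to `ham_threshold=0.0, spam_threshold=15.0` in this model.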
Re: Site-wide Bayes
On 12/17/2009 10:30 AM, RW wrote: On Wed, 16 Dec 2009 09:36:12 -0500 Michael Scheidellscheid...@secnap.net wrote: On 12/16/09 9:27 AM, Thomas Harold wrote: I'm guessing that you'd also want to change the autolearn thresholds to be stricter? Like only auto-learning if it scores below -2 or above +10? (That might be an amavisd-new feature.) I still use 0, but have the high score at +15. The default is 0.1 IIRC, and I wouldn't recommend setting it lower without negative-scoring custom rules - it's set positive for good reasons. BAYES and userconf whitelisting rules don't count for autolearning, so if you set a negative threshold with the default rules, you rely on DNS whitelisting to define ham - the likes of HABEOUS. Setting it at exactly 0.0 is also problematical since the decision to learn is commonly going to be determined by nominally scored rules that score 0.001 and -0.001. Looking at the wiki... http://wiki.apache.org/spamassassin/BasicConfiguration We're not using userconf whitelisting, our whitelisting is done by amavisd-new mappings (where we score specific domains/addresses with a small -2 to -5 score). The wiki, as it is currently, makes it sound like the +0.1 default for ham auto-learn is not conservative enough. And that the +6.0 default for auto-learning spam is too risky. (We run with -0.5 and +9.5 as our boundaries for auto-learning.)
Re: Site-wide Bayes
On 12/15/2009 11:55 AM, Michael Scheidell wrote: On 12/15/09 11:49 AM, Charles Gregory wrote: On Tue, 15 Dec 2009, Matt Garretson wrote: Heartily agreed. Site-wide bayes here (single database for 2000+ users) catches 40% of the spam here. But what is the FP rate? Is it safe for an ISP with a widely varied user base to use site-wide Bayes? I find that you should reduce scores on the high and low end (bayes_00 and bayes_95) and the 'meta rules' that might combine them also. (so, yes, an ISP, or for our hosted clients, we have modified the bayes scores. . if one client is a plastic surgeon, one is a stock broker, and one is a mortgage broker, each will be getting wildly different ham) setting up a 'per domain' bayes might work, might be tricky, especially if an inbound email is going to several domains, and only if you are doing B2B (commercial clients) I'm guessing that you'd also want to change the autolearn thresholds to be stricter? Like only auto-learning if it scores below -2 or above +10? (That might be an amavisd-new feature.)
Re: Site-wide Bayes
On 12/16/09 9:27 AM, Thomas Harold wrote:
> I'm guessing that you'd also want to change the autolearn thresholds to be stricter? Like only auto-learning if it scores below -2 or above +10? (That might be an amavisd-new feature.)

I still use 0, but have the high score at +15. Watch the output of 'sa-learn --dump magic': if you can keep the learned spam/ham ratio close to your site's actual spam vs. ham ratio, you should be OK.

--
Michael Scheidell, CTO
Phone: 561-999-5000, x 1259
SECNAP Network Security Corporation
* Certified SNORT Integrator
* 2008-9 Hot Company Award Winner, World Executive Alliance
* Five-Star Partner Program 2009, VARBusiness
* Best Anti-Spam Product 2008, Network Products Guide
* King of Spam Filters, SC Magazine 2008

This email has been scanned and certified safe by SpammerTrap(r). For Information please see http://www.secnap.com/products/spammertrap/
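One way to keep an eye on the learned spam/ham ratio is to parse the `nspam` and `nham` lines from `sa-learn --dump magic`. The sketch below assumes the output layout shown elsewhere in this archive (four numeric columns followed by `non-token data: <name>`); the function names are made up for illustration.

```python
def learned_counts(dump_text):
    """Return (nspam, nham) parsed from 'sa-learn --dump magic' output."""
    counts = {}
    for line in dump_text.splitlines():
        if "non-token data:" not in line:
            continue
        left, name = line.split("non-token data:", 1)
        fields = left.split()
        if len(fields) >= 3:
            # Assumed layout: the third numeric column holds the value.
            counts[name.strip()] = int(fields[2])
    return counts.get("nspam", 0), counts.get("nham", 0)


def spam_ham_ratio(dump_text):
    """Return learned spam per learned ham, or None if no ham was learned."""
    nspam, nham = learned_counts(dump_text)
    return nspam / nham if nham else None


if __name__ == "__main__":
    import subprocess

    output = subprocess.run(
        ["sa-learn", "--dump", "magic"],
        capture_output=True, text=True, check=True,
    ).stdout
    print(spam_ham_ratio(output))
```

Comparing this number with the spam/ham split seen in the mail logs over the same period shows whether training is drifting toward one class.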
Re: Site-wide Bayes (was: Spam from compromised web mails)
On Tue, 15 Dec 2009, Matt Garretson wrote: Heartily agreed. Site-wide bayes here (single database for 2000+ users) catches 40% of the spam here. But what is the FP rate? Is it safe for an ISP with a widely varied user base to use site-wide Bayes? - Charles
Re: Site-wide Bayes
On 12/15/09 11:49 AM, Charles Gregory wrote: On Tue, 15 Dec 2009, Matt Garretson wrote: Heartily agreed. Site-wide bayes here (single database for 2000+ users) catches 40% of the spam here. But what is the FP rate? Is it safe for an ISP with a widely varied user base to use site-wide Bayes? I find that you should reduce scores on the high and low end (bayes_00 and bayes_95) and the 'meta rules' that might combine them also. (so, yes, an ISP, or for our hosted clients, we have modified the bayes scores. . if one client is a plastic surgeon, one is a stock broker, and one is a mortgage broker, each will be getting wildly different ham) setting up a 'per domain' bayes might work, might be tricky, especially if an inbound email is going to several domains, and only if you are doing B2B (commercial clients) -- Michael Scheidell, CTO Phone: 561-999-5000, x 1259 *| *SECNAP Network Security Corporation * Certified SNORT Integrator * 2008-9 Hot Company Award Winner, World Executive Alliance * Five-Star Partner Program 2009, VARBusiness * Best Anti-Spam Product 2008, Network Products Guide * King of Spam Filters, SC Magazine 2008 _ This email has been scanned and certified safe by SpammerTrap(r). For Information please see http://www.secnap.com/products/spammertrap/ _
Re: Site-wide Bayes
On 12/15/2009 5:49 PM, Charles Gregory wrote: On Tue, 15 Dec 2009, Matt Garretson wrote: Heartily agreed. Site-wide bayes here (single database for 2000+ users) catches 40% of the spam here. But what is the FP rate? Is it safe for an ISP with a widely varied user base to use site-wide Bayes? from my experience, yes. the auto-fodder is just as diverse making Bayes very rugged and effective. You just need a good amount of ham traffic...
per-user and site-wide bayes databases together
Hi, I would like to have a per-user and a site-wide database side by side. AFAIK, right now I can have either one or the other. IMHE, the per-user database is more effective, especially for HAM, but a site-wide one would help improve SPAM detection (fewer false negatives) and help users with a low mail count. So, is this possible right now? (I don't think so, but had to ask.) I have no problem writing Perl code. If I have to implement/hack this, any tips on where to start or how to implement it are very welcome. Any opinions on why not to do this (or to do it) are also welcome. Raul Dias
RE: per-user and site-wide bayes databases together
If they say you can't, then this is how you'd do it. <g> (Training would need to be via scripts, not autolearn, I imagine.)

SpamAssassin uses Bayes via database queries. So, you rename the tables to something different, and define a view with the same name the table had. It will be queried by SA, but will return whatever you want the view to return. In this case, I'd guess it would be the union of the personal bayes and the site-wide bayes. You'd need to look into the actual columns to see if you must sum them for dups, but I imagine that would be pretty trivial logic. The only hack I see is to update the sa-learn process to use the correct (renamed) table names. Views are your friend!

Dan

ps: they are the folks who know SpamAssassin. I know squirrel (er, ah, Ess Que El).

-----Original Message-----
From: Raul Dias [mailto:[EMAIL PROTECTED]
Sent: Friday, January 26, 2007 1:13 PM
To: users@spamassassin.apache.org
Subject: per-user and site-wide bayes databases toghether

Hi, I would like to have side by side a per-user and a site-wide database. AFAIK, right now I can have either one or the other. IMHE, I think that the per-user database is more effective, specially for HAM, but a side wide one will help improve SPAM detection (lower false negatives) and improve users with low mail count. So, is this possible right now? (I dont think so, but had to ask.) I have no problem in writting perl code. If I have to implement/hack this, any tips on where to start or how to implement are very welcome. Any opinions in why to not do this (or to do this) are also welcome. Raul Dias
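Dan's view idea can be tried out in miniature with SQLite. Everything here is illustrative: the table names `bayes_token_user` and `bayes_token_site` and the three-column layout are hypothetical stand-ins, not SpamAssassin's real SQL schema, and whether SA would work unmodified against such a view has not been verified.

```python
import sqlite3

SCHEMA = """
CREATE TABLE bayes_token_user (token TEXT PRIMARY KEY,
                               spam_count INTEGER, ham_count INTEGER);
CREATE TABLE bayes_token_site (token TEXT PRIMARY KEY,
                               spam_count INTEGER, ham_count INTEGER);
-- The view takes over the name the lookups expect and sums duplicates.
CREATE VIEW bayes_token AS
SELECT token,
       SUM(spam_count) AS spam_count,
       SUM(ham_count)  AS ham_count
FROM (SELECT token, spam_count, ham_count FROM bayes_token_user
      UNION ALL
      SELECT token, spam_count, ham_count FROM bayes_token_site)
GROUP BY token;
"""


def build_demo():
    """Create an in-memory database with sample per-user and site-wide rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO bayes_token_user VALUES (?, ?, ?)",
                     [("viagra", 1, 0), ("invoice", 0, 7)])
    conn.executemany("INSERT INTO bayes_token_site VALUES (?, ?, ?)",
                     [("viagra", 40, 1), ("meeting", 2, 30)])
    return conn


def lookup(conn, token):
    """Return (spam_count, ham_count) for a token via the combined view."""
    return conn.execute(
        "SELECT spam_count, ham_count FROM bayes_token WHERE token = ?",
        (token,),
    ).fetchone()


if __name__ == "__main__":
    demo = build_demo()
    for word in ("viagra", "invoice", "meeting"):
        print(word, lookup(demo, word))
```

Because a plain view is read-only, training has to write to the underlying tables, which matches Dan's remark that sa-learn would need to be pointed at the renamed tables.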
Site-Wide Bayes Question
I have just set up a Sendmail server with MIMEDefang and SpamAssassin 3.0.1. This machine is a front-end box to my IMAP server. I am using a site wide bayes database. I am curious how other people are handling spam and ham with the bayes database. I have set up two accounts on the front-end server for a spam mailbox and a ham mailbox for sa-learn. If my users just forward the message to either one of those mailboxes, will sa-learn be able to properly register that e-mail? Or should the user be using redirect? Or since it has already been sent on to another mail server, is it worthless without the raw message? Thanks for any help you can offer me. Jeff
RE: Site-Wide Bayes Question
Jeff Grossman wrote: I have just set up a Sendmail server with MIMEDefang and SpamAssassin 3.0.1. This machine is a front-end box to my IMAP server. I have a similar setup but with Exchange 2000 as the IMAP server. I've created two public folders: FN: spam but not tagged FP: tagged but not spam Users drag and drop errors to the appropriate folder, preserving the headers If your IMAP server supports public folders, this may be the best way to go Otherwise you might consider having a pair of error folders inside each mailbox - then have a script with universal access to all mailboxes walk through each mailbox, pulling from the error folders only Matthew.van.Eerde (at) hbinc.com 805.964.4554 x902 Hispanic Business Inc./HireDiversity.com Software Engineer perl -emap{y/a-z/l-za-k/;print}shift Jjhi pcdiwtg Ptga wprztg,
Re: Site-Wide Bayes Question
[EMAIL PROTECTED] wrote: Jeff Grossman wrote: I have just set up a Sendmail server with MIMEDefang and SpamAssassin 3.0.1. This machine is a front-end box to my IMAP server. I have a similar setup but with Exchange 2000 as the IMAP server. I've created two public folders: FN: spam but not tagged FP: tagged but not spam Users drag and drop errors to the appropriate folder, preserving the headers If your IMAP server supports public folders, this may be the best way to go Otherwise you might consider having a pair of error folders inside each mailbox - then have a script with universal access to all mailboxes walk through each mailbox, pulling from the error folders only Thank you for the suggestions. Jeff
Site-wide bayes database, autolearn address
Hi,

Just upgraded to 3.0.1 running under qmail on OpenBSD and am happy to report no problems. However, whilst I was doing this, I had a few ideas. I've had a shufty through the archives for these but I didn't find an appropriate answer. I have 3 questions:

1. I would like to set up a site-wide bayes database that all mailboxes will use. This saves having to make every user learn their own spam and should improve the overall accuracy of the system. Is this particularly difficult to set up with an SQL backend? What happens if the database is unavailable? What is the performance hit on the database in these situations? We see around 2 messages a day on the server.

2. I would like to set up an automatic email address that people can send uncaught spam to, which will then be learnt as spam and put into the bayes database. Has anyone managed to do this? The problem I foresee is handling the forward-as-attachment and forward-inline styles that different mail clients use. Presumably we would need to make people forward them as attachments, then have a procmail script that handles all mail accordingly.

3. I see entries such as the following in the mail logs:

autolearn=ham
autolearn=spam
autolearn=unavailable
autolearn=none

Is there a spam score threshold that triggers the autolearning behaviour? Is the default sensible? Should it be a little lower? I see high-scored spam not being learned as such and wonder if this ought to be tweaked a little.

Gaby
--
Ha! Ha! Ha! Dislocation... - Phil Ken Sebben
[EMAIL PROTECTED] http://vanhegan.net
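On question 3, one way to judge whether the autolearn thresholds need tweaking is to count how often each autolearn= outcome actually appears in the logs. The sketch below does only that. The function name is made up for illustration, and it assumes nothing about the log format beyond the autolearn=<value> token quoted above. The thresholds themselves are set with the bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam options, described under LEARNING OPTIONS in the Mail::SpamAssassin::Conf documentation; check the defaults for your version there rather than trusting this note.

```python
import re
from collections import Counter

# Matches the autolearn=<value> token that appears in the log entries.
AUTOLEARN_RE = re.compile(r"autolearn=([a-z_]+)")


def tally_autolearn(lines):
    """Count autolearn=<value> occurrences across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(AUTOLEARN_RE.findall(line))
    return counts
```

Passing an open log file works, since a file object iterates line by line. A tally dominated by values other than ham and spam would suggest that few messages cross either threshold.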
Re: Site-wide bayes database, autolearn address
As for 1 and 3, I don't know, but 2 I did myself. The biggest problem you'll run into is that forwarding a message tinkers with its headers. I found a solution to this that doesn't require special scripts to strip the 'false' headers. We run SquirrelMail as a webmail front-end to courier-imap. I created a couple of buttons as an extension to the amavis-sa plugins in SquirrelMail. The buttons are "this is spam" and "this isn't spam". When a user clicks one of these, it actually moves the message (yes, at the OS level) from the mbox of the user who is viewing their email to my spam-only mailbox. Fortunately, courier is pretty tolerant of this type of abuse.

Keith
Re: Site-wide bayes database, autolearn address
Keith Hackworth wrote:
As for 1 and 3, I don't know, but 2, I did myself. Actually, the biggest problem you'll run into is that when you forward the message, it tinkers with the headers of the message. I found a solution to this that doesn't require special scripts to strip the 'false' headers.

Forwarding the email as an attachment may help, but as you say, it will rip out most of the headers. We do have SquirrelMail installed on our server, but not many of our users use it, preferring to POP from home. I suppose we could put up some instructions so that the user views the message source and pastes it into a web form, which would pipe it directly into sa-learn and then into the SQL bayes database. It's pernickety, but it would work, and it relies on the site-wide SQL database working.

Gaby
--
Ha! Ha! Ha! Dislocation... - Phil Ken Sebben
[EMAIL PROTECTED] http://vanhegan.net
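For the forward-as-attachment route, a handler on the reporting address would first have to pull the original message back out of the forward before anything reaches sa-learn. A minimal sketch using Python's standard email package follows. It assumes the client attaches the original as a message/rfc822 part, which not every client does, and the function name is hypothetical. Inline forwards would yield nothing here and would still need the paste-the-source approach.

```python
from email import message_from_bytes


def extract_forwarded(raw_bytes):
    """Return the raw bytes of every message/rfc822 part in a forwarded mail."""
    outer = message_from_bytes(raw_bytes)
    originals = []
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list
            # holding the embedded message, with its own headers and body.
            inner = part.get_payload(0)
            originals.append(inner.as_bytes())
    return originals
```

Each returned item could then be fed to sa-learn on standard input, for example with subprocess.run(["sa-learn", "--spam"], input=original, check=True). Whether the attached copy still carries the full original headers depends on the forwarding client, so it is worth checking a few samples by hand before trusting the results.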