Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On Thu, 15 Feb 2018 14:32:36 -0600 (CST) sha...@shanew.net wrote:

> I haven't checked the math in the Bayes plugin, but it explicitly
> mentions using the "chi-square probability combiner" which is
> described at http://www.linuxjournal.com/print.php?sid=6467
>
> Maybe I'm misunderstanding what that article describes, but I'm pretty
> sure what it boils down to is that when the occurrence of a token is
> too small (he uses the phrase "rare words") it can lead to
> probabilities at the extremes (like a token that occurs only once and
> is in spam, so its probability is 1). The way to address these
> extremely low or extremely high probabilities is to use the Fisher
> calculation (which is described in the second page of the article).

Tokens with low counts are detuned a bit, but not as much as you might think. In a database with a 1:1 ratio you get hapax token probabilities of 0.016 and 0.987; IIRC Robinson anticipated something much closer to neutral. This is similar to the defaults in spambayes and bogofilter, and I think at least one of the three projects would have derived them from optimization. My guess is that it's because enough low-count tokens are very strong, but short-lived, indicators that it's worth putting up with the noise.
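[Editor's note: the "detuning" of low-count tokens described above comes from Robinson's f(w) estimator, which shrinks a raw token probability toward a neutral prior in proportion to how little data backs it. A minimal sketch, assuming equal-sized ham and spam corpora; the strength s and prior x below are illustrative defaults, not SpamAssassin's actual constants:]

```python
def robinson_fw(spam_count, ham_count, s=1.0, x=0.5):
    """Robinson's f(w): shrink a raw token probability toward the
    prior x, with s acting like s pseudo-observations of the prior.
    s and x here are illustrative, not SpamAssassin's constants.
    Assumes equal-sized ham and spam corpora, so raw counts can
    stand in for frequencies."""
    n = spam_count + ham_count
    if n == 0:
        return x  # never-seen token: pure prior
    p = spam_count / n  # raw probability the token indicates spam
    return (s * x + n * p) / (s + n)

# A hapax (seen once, in spam) is pulled from 1.0 toward neutral...
print(robinson_fw(1, 0))    # 0.75 with s=1, x=0.5
# ...while a well-attested token barely moves.
print(robinson_fw(100, 0))  # ~0.995
```

With smaller s the shrinkage weakens, which is how a hapax can end up as strong as the 0.987 mentioned above rather than near neutral.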
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On Thu, 15 Feb 2018, RW wrote:

> On Thu, 15 Feb 2018 11:56:55 -0600 (CST) sha...@shanew.net wrote:
>> So, the sample size doesn't matter when calculating the probability
>> of a message being spam based on individual tokens, but it can
>> matter when we bring them all together to make a final calculation.
>
> It's not a matter of how they combine, smaller counts just lead to
> less accurate token probabilities.
>
> I'm not saying that it doesn't matter how much you train, I'm saying
> that if you have enough spam and enough ham Bayes is insensitive to
> the ratio.

I agree that past a certain minimum threshold, the ratio doesn't matter much. But as I understand it, larger sample size makes a difference.

I haven't checked the math in the Bayes plugin, but it explicitly mentions using the "chi-square probability combiner" which is described at http://www.linuxjournal.com/print.php?sid=6467

Maybe I'm misunderstanding what that article describes, but I'm pretty sure what it boils down to is that when the occurrence of a token is too small (he uses the phrase "rare words") it can lead to probabilities at the extremes (like a token that occurs only once and is in spam, so its probability is 1). The way to address these extremely low or extremely high probabilities is to use the Fisher calculation (which is described in the second page of the article).

Maybe this is where I'm making a logical leap that I shouldn't, but I think that "non-rare words" increasingly outnumber "rare words" as the sample size of messages (and thus tokens) increases.

-- 
Public key #7BBC68D9 at            | Shane Williams
http://pgp.mit.edu/                | System Admin - UT CompSci
=----------------------------------+----------------------------------
All syllogisms contain three lines | sha...@shanew.net
Therefore this is not a syllogism  | www.ischool.utexas.edu/~shanew
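[Editor's note: the "chi-square probability combiner" from the Linux Journal article works by running Fisher's method in both directions (spam-like and ham-like) and averaging the two results. A self-contained sketch, simplified from Robinson's article; this is illustrative and not SpamAssassin's exact implementation:]

```python
from math import exp, log

def chi2q(x2, df):
    """Survival probability P(X >= x2) for a chi-square variable
    with an even number df of degrees of freedom (closed form)."""
    m = x2 / 2.0
    term = exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_square_combine(probs):
    """Combine per-token spam probabilities a la Robinson.
    Fisher's method: under a 'random text' null hypothesis,
    -2 * sum(ln p) follows a chi-square distribution with 2n dof."""
    n = len(probs)
    spamminess = chi2q(-2.0 * sum(log(p) for p in probs), 2 * n)
    hamminess = chi2q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
    # Indicator in [0, 1]; 0.5 means "can't decide" (BAYES_50 territory).
    return (1.0 + spamminess - hamminess) / 2.0

print(chi_square_combine([0.99, 0.97, 0.95]))  # close to 1: spammy
print(chi_square_combine([0.2, 0.8]))          # 0.5: undecided
```

Note the relevance to "rare words": a single hapax with probability 1.0 would make log(1.0 - p) blow up, which is exactly why the probabilities are detuned away from the extremes before combining.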
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On Thu, 15 Feb 2018 20:16:24 +0100 Reindl Harald wrote:

> On 15.02.2018 at 20:10, RW wrote:
>> I'm not saying that it doesn't matter how much you train, I'm saying
>> that if you have enough spam and enough ham Bayes is insensitive to
>> the ratio
>
> but not when the ratio differs in magnitudes like the values from the
> OP not more, and not less

Based on the mathematics of "I reckon", and your database going off the rails after (by your own admission) you mistrained it. Actually the ratio was only 4:1, which isn't all that big.
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On Thu, 15 Feb 2018 11:56:55 -0600 (CST) sha...@shanew.net wrote: > On Thu, 15 Feb 2018, RW wrote: > > > As I said, Bayes is based on frequencies. > > > > If a token occurs in 10% of ham and 0.5% of spam based on 10,000 > > hams and 10,000 spams, what do you think is likely to happen to > > those percentages with 10,000 hams and 1,000,000 spams? > > ... > So, the sample size doesn't matter when calculating the probability of > a message being spam based on individual tokens, but it can matter > when we bring them all together to make a final calculation. It's not a matter of how they combine, smaller counts just lead to less accurate token probabilities. I'm not saying that it doesn't matter how much you train, I'm saying that if you have enough spam and enough ham Bayes is insensitive to the ratio.
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On Thu, 15 Feb 2018 19:24:14 +0100 Reindl Harald wrote:

> On 15.02.2018 at 19:20, RW wrote:
>> On Thu, 15 Feb 2018 17:15:47 +0100
>> You are talking about ultra-rare tokens here, the chances of these
>> dominating a classification is negligible
>
> it is not - in 2015 i had to purge "in doubt" a few days of training
> because an unreasonable amount of ham was classified as BAYES_50 or
> even tagged BAYES_00 instead, and we talk about a bayes with around
> 100,000 samples in total, where with your logic you would not expect
> it to get biased within a few days - yes, that was training mistakes
> for sure - but when you are able to bias a bayes with a few years of
> corpus within a few days your examples are wrong

I have no idea what you are talking about, how it's relevant, or what you did wrong, but it doesn't trump mathematics.
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On Thu, 15 Feb 2018 17:15:47 +0100 Reindl Harald wrote:

> On 15.02.2018 at 17:01, RW wrote:
>> On Thu, 15 Feb 2018 00:01:18 +0100 Reindl Harald wrote:
>>> On 14.02.2018 at 23:07, RW wrote:
>>>> My point is that an imbalance doesn't create a bias
>>>
>>> wrong - what you tried to say was "doesn't necessarily create a
>>> bias" - but in fact when the imbalance is too big *it does*
>>>
>>> simply thinking about how bayes works makes that clear: each word
>>> is a token with a ham/spam counter - when you have 1 Mio of one
>>> type and 1 of the other type, guess how that counter starts to get
>>> biased
>>
>> As I said, Bayes is based on frequencies.
>>
>> If a token occurs in 10% of ham and 0.5% of spam based on 10,000
>> hams and 10,000 spams, what do you think is likely to happen to
>> those percentages with 10,000 hams and 1,000,000 spams?
>
> the 10% and 0.5% is just an unbacked assumption

It's not an assumption, it's an example.

> what if every word except a few relevant ones of the spam mail, and
> so every token, exists in a relevant percent of your 1.4 Mio ham
> samples, and so 90% of every token has a high ham-counter

You are talking about ultra-rare tokens here; the chances of these dominating a classification are negligible.
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On Thu, 15 Feb 2018, RW wrote:

> On Thu, 15 Feb 2018 00:01:18 +0100 Reindl Harald wrote:
>> On 14.02.2018 at 23:07, RW wrote:
>>> My point is that an imbalance doesn't create a bias
>>
>> wrong - what you tried to say was "doesn't necessarily create a
>> bias" - but in fact when the imbalance is too big *it does*
>>
>> simply thinking about how bayes works makes that clear: each word
>> is a token with a ham/spam counter - when you have 1 Mio of one
>> type and 1 of the other type, guess how that counter starts to get
>> biased
>
> As I said, Bayes is based on frequencies.
>
> If a token occurs in 10% of ham and 0.5% of spam based on 10,000
> hams and 10,000 spams, what do you think is likely to happen to
> those percentages with 10,000 hams and 1,000,000 spams?

Perhaps it would help to state Bayes' formula explicitly. The probability that a message is spam given a specific token is equal to:

(the probability of the token occurring in spam) times (the probability that a message is spam), divided by (the probability of the token occurring in all messages)

The important feature of this formula is that every value being operated on is a probability, so if a given token occurs in .5% of 10,000 spams, we would expect it to occur in .5% of 100,000 or 1,000,000. If that assumption is true, and the .5% probability doesn't change, the resulting calculated probability also doesn't change.

For actual spam detection, this is complicated by the fact that we end up with a whole stack of calculated probabilities, one per token (including the probabilities that a message is non-spam given specific tokens), and we have to take all of them into account to calculate a final probability. In this process, it's not unusual that some individual calculated probabilities "matter" more than others, and one basis for how much weight a particular probability gets is how much we can trust it. Here's where the 10,000 vs. 1,000,000 comes into play, because we can rely on the .5% probability out of 1,000,000 samples more than we can the .5% probability out of 10,000 samples, and both of those are better than a .5% probability out of 100 samples (that said, the difference in trust increases more between 100 samples and 10,000 samples than between 10,000 and 1,000,000, due to diminishing returns).

So, the sample size doesn't matter when calculating the probability of a message being spam based on individual tokens, but it can matter when we bring them all together to make a final calculation.

-- 
Public key #7BBC68D9 at            | Shane Williams
http://pgp.mit.edu/                | System Admin - UT CompSci
=----------------------------------+----------------------------------
All syllogisms contain three lines | sha...@shanew.net
Therefore this is not a syllogism  | www.ischool.utexas.edu/~shanew
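[Editor's note: the per-token scale-invariance discussed in this exchange is easy to check numerically. If a token keeps the same *frequency* in spam, multiplying the spam corpus by 100 leaves the per-token result untouched. A sketch using the corpus sizes and rates from the example above, and assuming a 50/50 prior as typical naive-Bayes filters do (so the raw training ratio cancels out):]

```python
def p_spam_given_token(spam_with_token, n_spam, ham_with_token, n_ham):
    """Bayes' rule for a single token, with a 50/50 spam prior so
    only the token's frequency in each class matters."""
    p_t_spam = spam_with_token / n_spam  # token frequency in spam
    p_t_ham = ham_with_token / n_ham     # token frequency in ham
    return p_t_spam / (p_t_spam + p_t_ham)

# Token in 0.5% of spam and 10% of ham -- a hammy token:
small = p_spam_given_token(50, 10_000, 1_000, 10_000)
# Same frequencies after training 100x more spam:
large = p_spam_given_token(5_000, 1_000_000, 1_000, 10_000)
print(small, large)  # identical: the corpus ratio cancels out
```

What changes with more training is not this value but how well the measured frequencies approximate the true ones, which is the "trust" point made above.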
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On Thu, 15 Feb 2018 00:01:18 +0100 Reindl Harald wrote:

> On 14.02.2018 at 23:07, RW wrote:
>> My point is that an imbalance doesn't create a bias
>
> wrong - what you tried to say was "doesn't necessarily create a bias"
> - but in fact when the imbalance is too big *it does*
>
> simply thinking about how bayes works makes that clear: each word is
> a token with a ham/spam counter - when you have 1 Mio of one type and
> 1 of the other type, guess how that counter starts to get biased

As I said, Bayes is based on frequencies.

If a token occurs in 10% of ham and 0.5% of spam based on 10,000 hams and 10,000 spams, what do you think is likely to happen to those percentages with 10,000 hams and 1,000,000 spams?
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On Wed, 14 Feb 2018 16:20:30 +0100 Matus UHLAR - fantomas wrote:

>>> On Tue, 13 Feb 2018 21:02:46 + Horváth Szabolcs wrote:
>>>> One more question: is there a recommended ham to spam ratio? 1:1?
>
>> On 14.02.18 15:09, RW wrote:
>>> No, this is a myth. Bayes computes token probabilities from a
>>> token's frequencies in spam and ham, so it all scales through. If
>>> you have 2000 ham and 200 spam the problem is too few spams, not a
>>> bad ratio.
>
> my experience says you will need more ham than spam, because you want
> to get rid of false positives (ham marked as spam) much more than of
> false negatives.

My point is that an imbalance doesn't create a bias.
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On 02/14/2018 09:20 AM, Matus UHLAR - fantomas wrote:

>>> On Tue, 13 Feb 2018 21:02:46 + Horváth Szabolcs wrote:
>>>> One more question: is there a recommended ham to spam ratio? 1:1?
>
>> On 14.02.18 15:09, RW wrote:
>>> No, this is a myth. Bayes computes token probabilities from a
>>> token's frequencies in spam and ham, so it all scales through. If
>>> you have 2000 ham and 200 spam the problem is too few spams, not a
>>> bad ratio.
>
> my experience says you will need more ham than spam, because you want
> to get rid of false positives (ham marked as spam) much more than of
> false negatives.

This is also my experience.

> what really matters is how many FPs/FNs you have; you can decrease
> their probability by training anything too far from BAYES_00 for ham
> and BAYES_99 for spam

Correct. You want to get ham hitting BAYES_00 and spam hitting BAYES_80, BAYES_95, BAYES_99, or BAYES_999, which mine does very well.

A problem I have found is that you shouldn't blindly train all spam as spam. I have some spam hitting BAYES_00 because it truly could be ham based on the body contents, but it's spam because it was unsolicited email from someone "cold" emailing for a meeting or something. In this case, I block the sender and report it to SpamCop and other abuse contacts so the account can hopefully be blocked/locked/disabled. If I had trained my Bayes with this email as spam, then legit email could hit BAYES_99. That is why my nightly process to train my Bayes DB in redis learns ham first, then spam second. This seems to be the best order in my experience.

-- 
David Jones
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
> On Tue, 13 Feb 2018 21:02:46 + Horváth Szabolcs wrote:
>> One more question: is there a recommended ham to spam ratio? 1:1?

On 14.02.18 15:09, RW wrote:
> No, this is a myth. Bayes computes token probabilities from a
> token's frequencies in spam and ham, so it all scales through. If
> you have 2000 ham and 200 spam the problem is too few spams, not a
> bad ratio.

my experience says you will need more ham than spam, because you want to get rid of false positives (ham marked as spam) much more than of false negatives.

what really matters is how many FPs/FNs you have; you can decrease their probability by training anything too far from BAYES_00 for ham and BAYES_99 for spam

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
LSD will make your ECS screen display 16.7 million colors
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On Tue, 13 Feb 2018 21:02:46 + Horváth Szabolcs wrote:

> One more question: is there a recommended ham to spam ratio? 1:1?

No, this is a myth. Bayes computes token probabilities from a token's frequencies in spam and ham, so it all scales through. If you have 2000 ham and 200 spam the problem is too few spams, not a bad ratio.

Theoretically there is a case for new training to match the ratio that's already in the database, because then a new token will get a token probability that reflects its frequencies in recent mail. But I wouldn't worry about that; it's hard to stick to, and probably minor.
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
They cannot (do not want to, do not have the know-how to) study the e-mails, and therefore they cannot build a reliable corpus. All they can do is trust the ability of their users to study their own e-mails well enough to do the job, hence the mess with ham/spam when feeding the Bayesian filter.

They need to consult with a lawyer, fix their paperwork, hire people who can teach them everything they need to know, and invest at least two years full-time in the process. They cannot just install CentOS and SA and hope the Bayesian filters do their job by magic. It just does not work that way.

Sent from ProtonMail Mobile

On Wed, Feb 14, 2018 at 05:48, Bill Cole wrote:

> [...]
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On 13 Feb 2018, at 9:33, Horváth Szabolcs wrote:

> This is a production mail gateway serving since 2015. I saw that a
> few messages (both hams and spams) automatically learned by
> amavisd/spamassassin. Today's statistics:
>
> 3616 autolearn=ham
> 10076 autolearn=no
> 2817 autolearn=spam
> 134 autolearn=unavailable

That's quite high for spam, ham, AND "unavailable" (which indicates something wrong with the Bayes subsystem, usually transient.) This seems like a recipe for a mis-learning disaster. For comparison, my 2018 autolearn counts:

spam: 418
ham: 15018
unavailable: 166
no: 129555

I also manually train any spam that gets through to me (the biggest spam target,) a small number of spams reported by others, and 'trap' hits. A wide variety of ham is harder to get for training but I have found it useful to give users a well-documented and simple way to help. One way is to look at what happens to mail AFTER delivery, which can indicate that a message is ham without needing an admin to try to make a determination based on content. The simplest one is to learn anything users mark as $NotJunk as ham. Another is to create an "Archive" mailbox for every user and learn as ham anything that has been moved there, a day after it is moved.

The most important factor (especially in jurisdictions where human examination of email is a problem) is to tell users how to protect their email and then do what you tell them, robotically. In the US, Canada, and *SOME* of the EU, this is not risky. However, I have been told by people in *SOME* EU countries that they can't even robotically scan ANY mail content, so you shouldn't take my advice as authoritative: I'm not even a lawyer in the US, much less Hungary...

> I think I have no control over what is learnt automatically.

Yes, you do. Run "perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold" for details. You can set the learning thresholds, which control what gets learned. The defaults (0.1 and 12) mis-learn far too much spam as ham and not enough spam. I use -0.2 and 6, which means I don't autolearn a lot, but everything I autolearn as ham has at least one hit on a substantial "nice" rule or 2 hits on weak ones. There's a lot of vehemence against autolearn expressed here but not a lot of evidence that it operates poorly when configured wisely. The defaults are NOT wise.

> Let's just assume for a moment that 1.4M ham-samples are valid.

Bad assumption. Your Bayes checks are uncertain about mail you've told SA is definitely spam. That's broken. It's a sort of breakage that cannot exist if you do not have a large quantity of spam that has been learned as ham.

> Is there a ham:spam ratio I should stick to?

No.

> I presume if we have a 1:1 ratio then future messages won't be
> considered as spam as well.

The ham:spam ratio in the Bayes DB or its autolearning is not a generally useful metric. 1:1 is not magically good and neither is any other ratio, even with reference to a single site's mailstream. A very large ratio *on either side* indicates a likely problem in what is being learned, but you can't correlate the ratio to any particularly wrong bias in Bayes scoring. It is an inherently chaotic relationship.

Factors that actually matter are correctness of learning, sample quality, and currency. You can control how current your Bayes DB is (USE AUTO-EXPIRE) but the other two factors are never going to be perfect.
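[Editor's note: the thresholds discussed above map onto the AutoLearnThreshold plugin's settings. A local.cf sketch using the non-default values given above (the defaults are 0.1 and 12.0); check "perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold" for your version:]

```
# local.cf -- autolearn tuning (defaults shown in comments)
bayes_auto_learn 1

# Learn as ham only when the score is below this (default 0.1).
# A negative value demands hits on "nice" rules before learning ham.
bayes_auto_learn_threshold_nonspam -0.2

# Learn as spam when the score is above this (default 12.0).
bayes_auto_learn_threshold_spam 6.0
```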
RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On Tue, 13 Feb 2018, Horváth Szabolcs wrote:

>> 3. populate the ham database
>
> That's the tricky part. As I mentioned earlier, I don't really want
> end-users involved in this.

You might be able to find a few that are somewhat technically competent and don't mind their ham samples being manually reviewed.

> One more question: is there a recommended ham to spam ratio? 1:1?

I suggest "try to match your ham:spam ratio at your MTA before filtering", but others may have different advice. Generally: the more *reliable* data you can feed Bayes, the better it does.

> I'm thinking that if you see my "populating the ham database
> automatically with the outgoing emails" idea as complete nonsense,
> then I would find sysadmin resources to collect 2000 legit emails and
> train those mails as hams, but I cannot allocate 2 work-hours/day for
> months. (Also I'm not sure if 2000 legit emails are enough for
> training)

2000 is enough to start, but it would have to be ongoing as the nature of mail changes over time. Generally, training on misclassifications is what you do after the initial training. So if a ham drops into a user's quarantine folder, you'd want to train that as ham.

-- 
John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Windows Genuine Advantage (WGA) means that now you use your computer
at the sufferance of Microsoft Corporation. They can kill it remotely
without your consent at any time for any reason; it also shuts down
in sympathy when the servers at Microsoft crash.
-----------------------------------------------------------------------
9 days until George Washington's 286th Birthday
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
John Hardin wrote on 2018-02-14 02:28:

> Properly training your Bayes and increasing the score for BAYES_80,
> BAYES_95, and BAYES_99 and BAYES_999

score BAYES_999 5000

/me hides, could not resist :=)
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On Tue, 13 Feb 2018, David Jones wrote:

> Properly training your Bayes and increasing the score for BAYES_80,
> BAYES_95, and BAYES_99 and BAYES_999 is the best bet on this one.

-- 
John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Windows Genuine Advantage (WGA) means that now you use your computer
at the sufferance of Microsoft Corporation. They can kill it remotely
without your consent at any time for any reason; it also shuts down
in sympathy when the servers at Microsoft crash.
-----------------------------------------------------------------------
9 days until George Washington's 286th Birthday
RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam
Hello,

David Jones [mailto:djo...@ena.com] wrote:

> With non-English email flow, it's more challenging. If no RBLs hit,
> then you really must train your Bayes properly, which requires some
> way to accurately determine the ham and spam. You must keep a copy of
> the ham and spam corpora and be allowed to review suspicious email.

I really appreciate you taking the time to help with this. Yes, I can confirm that we usually have issues with Hungarian spams. English spams are often caught by the default rules.

As far as I understood today, I need to re-build the bayes database from scratch:

1. turn off autolearning

2. populate the spam database

The guys behind the http://artinvoice.hu/spams/ site are doing excellent work; they publish caught spams in mbox format. I checked, and many of the spam e-mails that were sent for investigation are in their mbox.

3. populate the ham database

That's the tricky part. As I mentioned earlier, I don't really want end-users involved in this. And I don't have the necessary resources to do it manually. I assume I can hack something into the mailflow to copy all outgoing e-mails to a separate mailbox and - assuming every outgoing e-mail is ham - have these mails learnt. Would that do it?

End-users are working in a heavily controlled environment (both technically and legally); in the last ten years, we haven't experienced spams sent from inside. That's why I would blindly trust outgoing emails as hams.

One more question: is there a recommended ham to spam ratio? 1:1?

I'm thinking that if you see my "populating the ham database automatically with the outgoing emails" idea as complete nonsense, then I would find sysadmin resources to collect 2000 legit emails and train those mails as hams, but I cannot allocate 2 work-hours/day for months. (Also, I'm not sure if 2000 legit emails are enough for training.)

Best regards,
Szabolcs Horvath
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On 02/13/2018 11:45 AM, Horváth Szabolcs wrote:

> Reindl Harald [mailto:h.rei...@thelounge.net] wrote:
>>> I think I have no control over what is learnt automatically.
>> surely, don't do autolearning at all
>
> This is a mail gateway for multiple companies. I'm not supposed to
> read e-mails on that, or pick mails that can be used for learning
> ham. And I can't ask users to use a "ham" mailbox, because they are
> not IT experts; sometimes they have problems with simple mail
> forwarding.

If you aren't allowed to check specific emails with a suspicious subject, or that are reported as spam by your users, there's no way you can do your job of accurately filtering email.

> Without autolearning and without the help of the end-users, I can't
> build a proper ham bayes database, can I?

SA's autolearning doesn't use the results from BAYES_* rules, since that could make incorrect training even worse, so you are going to have to build local rules or get help from RBLs and other SA plugins to get to the autolearning thresholds.

With non-English email flow, it's more challenging. If no RBLs hit, then you really must train your Bayes properly, which requires some way to accurately determine the ham and spam. You must keep a copy of the ham and spam corpora and be allowed to review suspicious email. Can you set up a split copy of the email that redacts the recipient or anonymizes it enough to allow for review? If not, your filtering is not going to be accurate.

-- 
David Jones
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On 02/13/2018 11:24 AM, Horváth Szabolcs wrote:

> David Jones [mailto:djo...@ena.com] wrote:
>> There should be many more rule hits than just these 3. It looks like
>> network tests aren't happening. Can you post the original email to
>> pastebin.com with minimal redacting so the rest of us can run it
>> through our SA to see how it scores to help with suggestions?
>
> Thanks for taking time to answer. Here it is:
> https://pastebin.com/5XZ5kbus

My SA instance would have blocked it, but the 2 rules that did it won't apply to your mail flow based on language and non-US relays. Properly training your Bayes and increasing the score for BAYES_80, BAYES_95, and BAYES_99 is the best bet on this one. It might take some local content rules, but I can't read the subject or body. :)

Content analysis details: (10.2 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------
 5.2 BAYES_99               BODY: Bayes spam probability is 99 to 100%
                            [score: 0.9926]
 0.0 HTML_IMAGE_RATIO_08    BODY: HTML has a low ratio of text to image
                            area
 2.8 UNWANTED_LANGUAGE_BODY BODY: Message written in an undesired
                            language
 0.0 HTML_MESSAGE           BODY: HTML included in message
 2.2 ENA_RELAY_NOT_US       Relayed from outside the US and not on
                            whitelists
 0.0 ENA_BAD_SPAM           Spam hitting really bad rules.

This brings up a good point: we need help with non-English masscheckers and SA rules.

The sending mail server 79.96.0.147 is not listed on any major RBLs and it has proper FCrDNS. I can't tell the envelope-from domain, but it must not have an SPF record. Definitely no DMARC record for fiok.com. The "IdeaSmtpServer" might be something to investigate for its relationship to spam, to see if it's an indicator worthy of a local rule. The domain in the Message-ID might be worth checking against other spam to see if that is a pattern worth a local rule. If there are unique body phrases or misspellings, then that is definitely something to put into a local rule to add a point or two in the future.

>> I suspect there needs to be some MTA tuning in front of SA along
>> with some SA tuning that is mentioned on this list every couple of
>> months -- add extra RBLs, add KAM.cf, enable some SA plugins, etc.
>
> Oops. I'm a new member on this list. Could you please tell us which
> customizations you mean? I already looked at KAM.cf; it doesn't
> really help in this situation. We're using a lot of RBLs.

-- 
David Jones
RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam
Reindl Harald [mailto:h.rei...@thelounge.net] wrote:

>> This is a mail gateway for multiple companies. I'm not supposed to
>> read e-mails on that, or pick mails that can be used for learning
>> ham
>
> how did you then manage 1.4 Mio ham-samples in your biased corpus

Looks like in this amavisd-spamassassin combo, it automatically learnt a lot of ham (which weren't hams):

Feb 11 03:37:31 amavis[20024]: (20024-06) spam-tag, -> , No, score=-0.099 tagged_above=- required=4 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=0.001] autolearn=ham

I never configured autolearning; I assume it came with this CentOS setup. Man spamassassin says bayes_auto_learn has a default value of 1.

>> Without autolearning and without the help of the end-users, I can't
>> build a proper ham bayes database, can I?
>
> surely, or don't you and people around you which can help don't send
> and receive mails?

I don't want to get into this "fight", but end-users have limited IT knowledge. They are 100% Outlook users (forwarding inline and attached always confuses them). If I really want this, I need a user-proof, one-click solution like Gmail's "spam" and "not spam" buttons, which magically saves e-mails to the proper technical mailbox (which is then reviewed by the admins, who train SA). With Outlook users and the Exchange internal MTAs, my options are limited.

So, if I understood correctly, you all agree that the bayesian database is f* up; let's start with a new one, autolearn turned off, and train SA from scratch with both ham and spam mails.

Best regards
Szabolcs
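[Editor's note: the "new database, autolearn off, retrain from scratch" plan above boils down to something like the following sketch; the mbox paths are illustrative, and the sa-learn commands must run as the user that owns the Bayes DB (here, the amavis user):]

```
# local.cf -- explicitly disable autolearning (bayes_auto_learn defaults to 1)
bayes_auto_learn 0

# Then, from the shell, wipe and retrain:
#   sa-learn --clear
#   sa-learn --spam /path/to/spam.mbox
#   sa-learn --ham  /path/to/ham.mbox
#   sa-learn --dump magic     # verify nspam/nham counts afterwards
```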
RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam
Reindl Harald [mailto:h.rei...@thelounge.net] wrote: >> I think I have no control over what is learnt automatically. > surely, don't do autolearning at all This is a mail gateway for multiple companies. I'm not supposed to read e-mails on that, or pick out mails that can be used for learning ham. And I can't ask users to use a "ham" mailbox, because they are not IT experts; sometimes they have problems with a simple mail forward. Without autolearning and without the help of the end-users, I can't build a proper ham bayes database, can I? Best regards Szabolcs
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On Tue, 13 Feb 2018, Horváth Szabolcs wrote: After: pts rule name description -- -- 0.0 HTML_IMAGE_RATIO_08 BODY: HTML has a low ratio of text to image area 0.0 HTML_MESSAGE BODY: HTML included in message 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.5000] BAYES_50 is "can't decide". Version: spamassassin-3.3.2-4.el6.rfx.x86_64 $ sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/ 0.000 0 3 0 non-token data: bayes db version 0.000 0 338770 0 non-token data: nspam 0.000 0 1460807 0 non-token data: nham That ratio is really suspicious. I'd expect something closer to 1:1 or even a bit heavier on spam. It *seems* that you have spam trained as ham; that would explain BAYES_50 with that much in the BAYES database. My questions are: 1) is there any chance to change spamassassin settings to mark similar messages as SPAM in the future? bayes_50 with 0.8 points are really-really low. No, it's not. "BAYES_50" is "I can't decide" and increasing the score for that implies "I can't decide" means "spam". That's not justified. Don't adjust the score of BAYES_50. It's recommended (if possible) to retain the training corpora so that it can be reviewed and retrained from scratch if necessary. Your admin is manually vetting user-submitted training messages. Are they retained after being trained? You might consider reviewing the training corpus and retraining Bayes from scratch. Another note: the "before" result: Before: spamassassin -D -t ...with *no* BAYES hits at all (not even BAYES_50) suggests your SA is *not* using the database whose statistics you reported above. First: verify which Bayes database your SA install is using, and that it is the one you're training into and getting those stats from. -- John Hardin KA7OHZ http://www.impsec.org/~jhardin/ jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Maxim IX: Never turn your back on an enemy. 
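[Editor's note] One way to do that verification is to run a message through SA with Bayes debugging enabled and compare against the database you train into; the dbpath below is the one quoted in this thread, and `sample.eml` is a placeholder:

```shell
# Show which Bayes database spamassassin opens while scoring a message
spamassassin -D bayes -t < sample.eml 2>&1 | grep -i bayes

# Compare with the database you have been training into
sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/
```

If the paths differ, `bayes_path` can be set in local.cf so that scanning and training use the same database.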
--- 9 days until George Washington's 286th Birthday
RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam
Hello, David Jones [mailto:djo...@ena.com] wrote: > There should be many more rule hits than just these 3. It looks like > network tests aren't happening. > Can you post the original email to pastebin.com with minimal redacting > so the rest of us can run it through our SA to see how it scores to help > with suggestions? Thanks for taking the time to answer. Here it is: https://pastebin.com/5XZ5kbus > I suspect there needs to be some MTA tuning in front of SA along with > some SA tuning that is mentioned on this list every couple of months -- > add extra RBLs, add KAM.cf, enable some SA plugins, etc. Oops. I'm a new member on this list. Could you please tell us which customizations you mean? I already looked at KAM.cf; it doesn't really help in this situation. We're using a lot of RBLs. > > It only assigns 0.8. (required_hits around 4.0) > You are certainly free to set a local score higher if you want but that is > probably not the main resolution to this issue. I agree. > > Version: spamassassin-3.3.2-4.el6.rfx.x86_64 > This is very old and no longer supported. Why not upgrade to 3.4.x? Because CentOS 6 ships with this version. When the infrastructure was built, there was no CentOS 7 around. Migration between major versions is still not an easy thing to do. > > My questions are: > > 1) is there any chance to change spamassassin settings to mark similar > > messages as SPAM in the future? > > bayes_50 with 0.8 points are really-really low. > > > You should be hitting BAYES_95, BAYES_99, and BAYES_999 on these really > bad emails with proper training which would give it a higher probability > and thus a higher score. I agree. Can't wait to see what your results are on this e-mail. Best regards Szabolcs Horvath
Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
On 02/13/2018 07:55 AM, Horváth Szabolcs wrote: Dear members, A user repeatedly sends us spam messages to train SA. Training - at the moment - requires manual intervention: an administrator verifies that it's really spam, then issues sa-learn. The user then thinks the process is done, and that the next time the same email arrives, it will automatically be marked as spam. However, that doesn't happen. Before: spamassassin -D -t There should be many more rule hits than just these 3. It looks like network tests aren't happening. Can you post the original email to pastebin.com with minimal redacting so the rest of us can run it through our SA to see how it scores to help with suggestions? I suspect there needs to be some MTA tuning in front of SA along with some SA tuning that is mentioned on this list every couple of months -- add extra RBLs, add KAM.cf, enable some SA plugins, etc. It only assigns 0.8. (required_hits around 4.0) You are certainly free to set a local score higher if you want but that is probably not the main resolution to this issue. Version: spamassassin-3.3.2-4.el6.rfx.x86_64 This is very old and no longer supported. Why not upgrade to 3.4.x? $ sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/ 0.000 0 3 0 non-token data: bayes db version 0.000 0 338770 0 non-token data: nspam 0.000 0 1460807 0 non-token data: nham 0.000 0 187804 0 non-token data: ntokens 0.000 0 1512318030 0 non-token data: oldest atime 0.000 0 1518524875 0 non-token data: newest atime 0.000 0 1518524876 0 non-token data: last journal sync atime 0.000 0 1518508126 0 non-token data: last expiry atime 0.000 0 43238 0 non-token data: last expire atime delta 0.000 0 136970 0 non-token data: last expire reduction count I obviously see that nspam is increased after the sa-learn. 
When I tried to understand what was happening, I found the following: # https://wiki.apache.org/spamassassin/BayesInSpamAssassin The Bayesian classifier in Spamassassin tries to identify spam by looking at what are called tokens; words or short character sequences that are commonly found in spam or ham. If I've handed 100 messages to sa-learn that have the phrase penis enlargement and told it that those are all spam, when the 101st message comes in with the words penis and enlargement, the Bayesian classifier will be pretty sure that the new message is spam and will increase the spam score of that message. My questions are: 1) is there any chance to change spamassassin settings to mark similar messages as SPAM in the future? bayes_50 with 0.8 points are really-really low. You should be hitting BAYES_95, BAYES_99, and BAYES_999 on these really bad emails with proper training which would give it a higher probability and thus a higher score. I know that I'm able to write custom rules based on e-mail body content but I flattered myself that sa-learn would do that by manipulating the bayes database. I suspect that after the MTA and SA are tuned, this would be blocked without requiring a local custom rule but I would need to see the rule hits on my SA platform before I could say for sure. Sometimes it does require a header or body rule combined with other hits in a local custom meta rule to block them. 2) or should I tell users that the learning process doesn't necessarily mean that future messages will be flagged as SPAM? Rather, it should be considered a "warning sign". I appreciate any feedback on this. I already tried to find docs that answer these questions, but no luck so far. If you have good documentation, please send it to me. I love reading manuals. Best regards, Szabolcs Horvath -- David Jones
RE: Train SA with e-mails 100% proven spams and next time it should be marked as spam
Reindl Harald [mailto:h.rei...@thelounge.net] wrote: > > However, that doesn't happen. > > 0.000 0 338770 0 non-token data: nspam > > 0.000 0 1460807 0 non-token data: nham > what do you expect when you train 4 times more ham than spam? > frankly you "flooded" your bayes with 1.4 Mio ham-samples and i thought > our 140k total corpus is large - don't forget that ham messages are > typically larger than junk trying to point you with some words to a URL > > 108897 SPAM > 31492 HAM This is a production mail gateway in service since 2015. I saw that a few messages (both ham and spam) were automatically learned by amavisd/spamassassin. Today's statistics: 3616 autolearn=ham 10076 autolearn=no 2817 autolearn=spam 134 autolearn=unavailable I think I have no control over what is learnt automatically. Let's just assume for a moment that the 1.4M ham samples are valid. Is there a ham:spam ratio I should stick to? I presume that even with a 1:1 ratio, future messages still won't necessarily be considered spam. Regards Szabolcs
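[Editor's note] The ratio Harald refers to can be read straight off the nspam/nham lines of `sa-learn --dump magic`. A small awk sketch, fed here from the counts quoted in this thread for illustration (normally you would pipe the live dump into it):

```shell
# Compute the ham:spam ratio from `sa-learn --dump magic` output;
# field 3 of the nspam/nham lines holds the message count.
awk '/non-token data: nspam/ {s=$3}
     /non-token data: nham/  {h=$3}
     END {printf "ham:spam = %.1f:1\n", h/s}' <<'EOF'
0.000 0 338770 0 non-token data: nspam
0.000 0 1460807 0 non-token data: nham
EOF
# → ham:spam = 4.3:1
```

On the live system the here-document would be replaced by `sa-learn --dump magic --dbpath /var/spool/amavisd/.spamassassin/ | awk ...`.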
Train SA with e-mails 100% proven spams and next time it should be marked as spam
Dear members, A user repeatedly sends us spam messages to train SA. Training - at the moment - requires manual intervention: an administrator verifies that it's really spam, then issues sa-learn. The user then thinks the process is done, and that the next time the same email arrives, it will automatically be marked as spam. However, that doesn't happen. Before: spamassassin -D -t https://wiki.apache.org/spamassassin/BayesInSpamAssassin The Bayesian classifier in Spamassassin tries to identify spam by looking at what are called tokens; words or short character sequences that are commonly found in spam or ham. If I've handed 100 messages to sa-learn that have the phrase penis enlargement and told it that those are all spam, when the 101st message comes in with the words penis and enlargement, the Bayesian classifier will be pretty sure that the new message is spam and will increase the spam score of that message. My questions are: 1) is there any chance to change spamassassin settings to mark similar messages as SPAM in the future? bayes_50 with 0.8 points are really-really low. I know that I'm able to write custom rules based on e-mail body content but I flattered myself that sa-learn would do that by manipulating the bayes database. 2) or should I tell users that the learning process doesn't necessarily mean that future messages will be flagged as SPAM? Rather, it should be considered a "warning sign". I appreciate any feedback on this. I already tried to find docs that answer these questions, but no luck so far. If you have good documentation, please send it to me. I love reading manuals. Best regards, Szabolcs Horvath
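[Editor's note] The manual per-message workflow described in this post can be checked end to end roughly like this; the message path is a placeholder and the dbpath is the one quoted elsewhere in the thread:

```shell
MSG=/path/to/verified-spam.eml          # placeholder: the vetted message
DB=/var/spool/amavisd/.spamassassin     # dbpath from this thread

# Train the vetted message as spam
sa-learn --spam --dbpath "$DB" "$MSG"

# Confirm the nspam counter increased
sa-learn --dump magic --dbpath "$DB" | grep nspam

# Re-test the same message; a well-trained database should now
# produce a high BAYES_* rule hit in the report
spamassassin -t < "$MSG" | grep -i bayes
```

If the re-test still shows BAYES_50 (or no BAYES rule at all), the scanner is likely reading a different database than the one being trained, which is the mismatch discussed later in the thread.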