Re: recent update to __STYLE_GIBBERISH_1 leads to 100% CPU usage
Hi,

seems to work. I had to add "score __STYLE_GIBBERISH_1 0" to my SA config to make your mail pass.

Markus

On 2019-05-28 10:31, Stoiko Ivanov wrote:
> Hello,
>
> with a recent update to the ruleset, we're encountering certain mails which cause the rule evaluation to use 100% CPU.
>
> The effect was reproduced with Proxmox Mailgateway 5.2 (running SpamAssassin 3.4.2) and Ubuntu 19.04 (also running SpamAssassin 3.4.2).
>
> After some debugging the issue was narrowed down to the rule __STYLE_GIBBERISH_1. I'm attaching a rather small sample mail which reproduces the issue.
>
> Since this is my first post to this list I'm pretty sure I forgot some detail, so please don't hesitate to ask for any further information needed.
>
> Thanks and Best Regards,
> stoiko
recent update to __STYLE_GIBBERISH_1 leads to 100% CPU usage
Hello,

with a recent update to the ruleset, we're encountering certain mails which cause the rule evaluation to use 100% CPU.

The effect was reproduced with Proxmox Mailgateway 5.2 (running SpamAssassin 3.4.2) and Ubuntu 19.04 (also running SpamAssassin 3.4.2).

After some debugging the issue was narrowed down to the rule __STYLE_GIBBERISH_1. I'm attaching a rather small sample mail which reproduces the issue.

Since this is my first post to this list I'm pretty sure I forgot some detail, so please don't hesitate to ask for any further information needed.

Thanks and Best Regards,
stoiko

Attached sample mail:

Date: Mon, 27 May 2019 12:39:10 +0200 (CEST)
From: TEST
To:
Message-ID: <0.test.example.com>
Subject: Modification
MIME-Version: 1.0
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 8bit

Modification
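When reproducing a report like this one, it can help to cap the run time so a runaway rule does not pin a core indefinitely. The following is only a sketch: it assumes GNU coreutils `timeout` is installed, and `sample.eml` and the 60-second limit are placeholders, not values from the report.

```shell
#!/bin/sh
# Run a command with a wall-clock limit.
# Assumes GNU coreutils `timeout`, which exits with status 124 when the limit is hit.
run_with_limit() {
    limit="$1"
    shift
    timeout "$limit" "$@"
}

# Hypothetical usage (sample.eml stands in for the attached sample mail):
#   run_with_limit 60 spamassassin -t < sample.eml > /dev/null
#   [ $? -eq 124 ] && echo "scan did not finish within 60 seconds"
```

An exit status of 124 then indicates that the scan was cut off rather than completed.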
Re: my spamassassin has serious config problems
Hi,

Zimbra is working as it should and I won't replace it with other software :-)

I started to deploy a new configuration, but when I discovered that the auto-learn internal to SpamAssassin was active (by mistake) we decided that the best solution is to zap the database and load only our corpus.

I'm in the process of splitting the corpus into "training messages" and "test messages" to check that the training is OK, on another server, of course.

I think that on the 29th we will proceed on the production server. I will report the results.

The zimbra man loaded a big corpus of spam that arrived on some mailboxes and you can see the 1.000.000 nspam present in the db... I think I won't let him load them again... Only "approved" spam must be used to feed the bayes engine...

Francesco

On Tue, May 28, 2019 at 7:48 PM Matus UHLAR - fantomas wrote:
> On 28.05.19 15:34, hg user wrote:
> >I did some more research and I think I have to report the new discovery so that the thread can be useful to other Readers.
> >
> >First:
> >0.000 0 5232 0 non-token data: nspam
> >0.000 0 70408 0 non-token data: nham
> >0.000 0 388070 0 non-token data: ntokens
> >nspam and nham values are definitively the number of messages learnt.
> >
> >Second:
> >I saw that nham increased every few seconds. I discovered that bayes_auto_learn was enabled !
> >My situation yesterday:
> >0.000 0 1042011 0 non-token data: nspam
> >0.000 0 66472 0 non-token data: nham
> >0.000 0 663479 0 non-token data: ntokens
> >My situation now:
> >0.000 0 1042049 0 non-token data: nspam
> >0.000 0 71228 0 non-token data: nham
> >0.000 0 1040661 0 non-token data: ntokens
> >
> >So, at least, I now know that the system is feeding the bayes engine with some new data and that in this way the results can change.
> >
> >Third:
> >in 72_active.cf there are a lot of bayes_ignore_header directives, but they don't include the ones added by my commercial antivirus. Should I create a patch?
> >
> >Fourth:
> >I added a dbg statement to bayes.pm, sub tokenize, to print the tokens it extracts from the message.
> >I agree with some, I don't with others. I'd like to know if there is some doc that lists why tokens are extracted this way (some notes are in the source code)
> >I discovered that probably some words should be added to the stopwords list but there is no way to do it in a configuration file, I should modify spamassassin code directly...
> >
> >To end:
> >I think that the only way to proceed now is to nuke the bayes db and start from scratch:
> >- setup bayes configuration correctly
> >- double check the corpus to be correctly classified
> >- run sa-learn
>
> Do you still use Zimbra? if so, have you configured Zimbra?
> Did you consult your Zimbra-man?
>
> >For the "setup bayes configuration correctly" step I accept your contributions :-) I excluded all the headers of my antivirus and internal/external/trusted.
>
> --
> Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
> Warning: I wish NOT to receive e-mail advertising to this address.
> Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
> Spam is for losers who can't get business any other way.
Re: Is Bayes forgetting ?
Attempting to answer all your questions related to Zimbra:

Zimbra doesn't do any training until it is kicked off by cron for the zimbra user... Training a message as spam from the MUA interface sends the email to a special spam account... Training messages as not spam goes to a special ham account. You can pull those emails or use your admin interface to see what the users have trained before actual training occurs as a consequence. Look at your zimbra crontab entry to see when zmtrainsa runs, which pulls the emails and trains them.

With everything in Zimbra, make sure you are the zimbra user before running any SA commands.

If you want to train vast amounts of email as spam or ham, you can use zmtrainsa and point it to a directory of messages. zmtrainsa is just a wrapper to get your Zimbra SpamAssassin environment correct, so that would show you sa-learn and the correct options (i.e. paths), etc.

Given this is really Zimbra specific, it is probably better to ask this on the Zimbra administrator forums if you have specific questions, so the rest of the community can learn.

Note: it doesn't require anything special to run spamassassin in debug mode to test rules... just make sure you do it as the zimbra user so your environment is established.

HTH,
Jim

On May 27, 2019, at 5:13 AM, hg user wrote:
> Ok, sorry for my bad answers...
> I test spamassassin results using a command line:
> spamassassin -t /path/file
> sometimes I add -D to look at the debug logs.
> spamassassin is actually used via amavis, in a zimbra setup.
> To teach sa-learn and to test the results via command line I use zimbra user.
>
> On Mon, May 27, 2019 at 1:18 PM Matus UHLAR - fantomas <uh...@fantomas.sk> wrote:
>> On 27.05.19 12:51, hg user wrote:
>> >the Linux user is the same.
>>
>> the same as what?
>>
>> >Bayes db is on Linux.
>>
>> seems I wasn't clear at my question:
>> How do you use spamassassin? milter, amavis, procmail filter, postfix filter ... ?
>>
>> --
>> Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
>> Warning: I wish NOT to receive e-mail advertising to this address.
>> Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
>> Eagles may soar, but weasels don't get sucked into jet engines.
Re: Spamd error message "spamd: error: addr is not a string"
On 28 May 2019, at 14:55, Emanuel Gonzalez wrote:
> Hello,
>
> i updated perl via yum.
>
> # spamassassin --version
> SpamAssassin version 3.4.2
> running on Perl version 5.16.3
>
> CentOS Linux release 7.6.1810 (Core) x86_64
>
> I see this error:
>
> May 28 15:40:03 server spamd[17267]: spamd: error: addr is not a string at /usr/share/perl5/vendor_perl/IO/Socket/IP.pm line 685.
>
> any ideas.?

Run "yum list installed perl-Socket" to verify that you have version 2.010-4.el7. If not, upgrade (or install) the perl-Socket package.

There are other *possible* sources of that error, so you may find it useful to update all of your Perl packages to the latest versions in the CentOS 7 repository.
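The version check suggested in the reply above could be scripted roughly as follows. This is a sketch, not a definitive procedure: the helper name `socket_pkg_ok` is made up for illustration, and the script only compares against the 2.010-4.el7 build named in the reply.

```shell
#!/bin/sh
# Succeeds when the given version-release string starts with the build
# named in the reply (2.010-4.el7). socket_pkg_ok is a hypothetical helper.
socket_pkg_ok() {
    case "$1" in
        2.010-4.el7*) return 0 ;;
        *)            return 1 ;;
    esac
}

# Query the installed package only where rpm is available.
if command -v rpm >/dev/null 2>&1; then
    installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' perl-Socket 2>/dev/null || true)
    if socket_pkg_ok "$installed"; then
        echo "perl-Socket matches 2.010-4.el7"
    else
        echo "perl-Socket reports '$installed'; consider 'yum update perl-Socket'"
    fi
fi
```

A mismatch only means the installed build differs from the one mentioned in the reply; as noted there, other Perl packages may also be involved.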
Spamd error message "spamd: error: addr is not a string"
Hello,

i updated perl via yum.

# spamassassin --version
SpamAssassin version 3.4.2
running on Perl version 5.16.3

CentOS Linux release 7.6.1810 (Core) x86_64

I see this error:

May 28 15:40:03 server spamd[17267]: spamd: error: addr is not a string at /usr/share/perl5/vendor_perl/IO/Socket/IP.pm line 685.

any ideas.?
Re: my spamassassin has serious config problems
On 28.05.19 15:34, hg user wrote:
> I did some more research and I think I have to report the new discovery so that the thread can be useful to other Readers.
>
> First:
> 0.000 0 5232 0 non-token data: nspam
> 0.000 0 70408 0 non-token data: nham
> 0.000 0 388070 0 non-token data: ntokens
> nspam and nham values are definitively the number of messages learnt.
>
> Second:
> I saw that nham increased every few seconds. I discovered that bayes_auto_learn was enabled !
> My situation yesterday:
> 0.000 0 1042011 0 non-token data: nspam
> 0.000 0 66472 0 non-token data: nham
> 0.000 0 663479 0 non-token data: ntokens
> My situation now:
> 0.000 0 1042049 0 non-token data: nspam
> 0.000 0 71228 0 non-token data: nham
> 0.000 0 1040661 0 non-token data: ntokens
>
> So, at least, I now know that the system is feeding the bayes engine with some new data and that in this way the results can change.
>
> Third:
> in 72_active.cf there are a lot of bayes_ignore_header directives, but they don't include the ones added by my commercial antivirus. Should I create a patch?
>
> Fourth:
> I added a dbg statement to bayes.pm, sub tokenize, to print the tokens it extracts from the message.
> I agree with some, I don't with others. I'd like to know if there is some doc that lists why tokens are extracted this way (some notes are in the source code)
> I discovered that probably some words should be added to the stopwords list but there is no way to do it in a configuration file, I should modify spamassassin code directly...
>
> To end:
> I think that the only way to proceed now is to nuke the bayes db and start from scratch:
> - setup bayes configuration correctly
> - double check the corpus to be correctly classified
> - run sa-learn

Do you still use Zimbra? if so, have you configured Zimbra?
Did you consult your Zimbra-man?

> For the "setup bayes configuration correctly" step I accept your contributions :-) I excluded all the headers of my antivirus and internal/external/trusted.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Spam is for losers who can't get business any other way.
Re: my spamassassin has serious config problems
On Tue, 28 May 2019 15:42:05 + David Jones wrote:
> On 5/27/19 5:13 PM, hg user wrote:
> > The server was installed and configured by a "zimbra man", a person I fully trust. Since I manage a commercial antivirus/antispam solution that is not properly working for the italian language, I was tasked to join the project in order to understand if we could switch from the proprietary solution to spamassassin.
> >
> > I'm now in the process of double-checking the configuration of spamassassin and feeding the bayes engine...
> >
> > Testing the system I noticed that spamassassin logged the internal MTAs (including the antivirus server) as external and I asked *the zimbra man* to correct the configuration. He replied it was not necessary. Sorry I didn't specify I asked the person in charge of the system.
> >
> > In the end, I need to think about the answer of RW: spamassassin is run by amavis but with no internal servers defined, it uses my internal one as the external. Received header needs some more care, and probably also the list of stop words should be expanded. Probably there is a ratio behind some decisions taken by the developers, but I can't fully understand how the destination address can help on whether a message is spam or not, at least not 6 times.
>
> The internal_networks and trusted_networks are _very important_ to be set correctly for a number of reasons, not just Bayes. This gives SA the proper "view" to the outside/Internet. Keep in mind internal_networks is not literally your RFC 1918 internal_networks and the trusted_networks are not only ones that you managed/control.
>
> Internal_networks is any public or private IP space that you trust to not forge the Received or synthetic received headers like X-Originating-IP.

Not really, internal_networks is there to establish which relay is the last-external.

> Trusted_networks can be external/public networks that you know won't change or forge the Received or synthetic received headers.
>
> I have recently added all Google and Office 365 IP blocks to my trusted_networks to better detect last-external client IPs.

You need to add them to internal_networks to affect the last-external.

> This allows for deep Received header inspection since I know that Google and Microsoft aren't going to forge those headers. Very interesting information comes out into the open as a result of this.

You always get deep checks because allowing spammers to forge their way into blocklists and other spam tests is harmless. Adding external addresses to trusted_networks prevents unnecessary blocklist look-ups and allows whitelist tests to run on the first-trusted relay.
Re: my spamassassin has serious config problems
On 5/27/19 5:13 PM, hg user wrote:
> The server was installed and configured by a "zimbra man", a person I fully trust. Since I manage a commercial antivirus/antispam solution that is not properly working for the italian language, I was tasked to join the project in order to understand if we could switch from the proprietary solution to spamassassin.
>
> I'm now in the process of double-checking the configuration of spamassassin and feeding the bayes engine...
>
> Testing the system I noticed that spamassassin logged the internal MTAs (including the antivirus server) as external and I asked *the zimbra man* to correct the configuration. He replied it was not necessary. Sorry I didn't specify I asked the person in charge of the system.
>
> In the end, I need to think about the answer of RW: spamassassin is run by amavis but with no internal servers defined, it uses my internal one as the external. Received header needs some more care, and probably also the list of stop words should be expanded. Probably there is a ratio behind some decisions taken by the developers, but I can't fully understand how the destination address can help on whether a message is spam or not, at least not 6 times.

The internal_networks and trusted_networks are _very important_ to be set correctly for a number of reasons, not just Bayes. This gives SA the proper "view" to the outside/Internet. Keep in mind internal_networks is not literally your RFC 1918 internal_networks and the trusted_networks are not only ones that you managed/control.

Internal_networks is any public or private IP space that you trust to not forge the Received or synthetic received headers like X-Originating-IP.

Trusted_networks can be external/public networks that you know won't change or forge the Received or synthetic received headers.

I have recently added all Google and Office 365 IP blocks to my trusted_networks to better detect last-external client IPs. This allows for deep Received header inspection since I know that Google and Microsoft aren't going to forge those headers. Very interesting information comes out into the open as a result of this.

P.S. To implement/try this extended trusted_networks, set the score for ALL_TRUSTED to -0.001 and disable it from shortcircuit'ing.

    score ALL_TRUSTED -0.001
    shortcircuit ALL_TRUSTED off

--
David Jones
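For anyone trying the two ALL_TRUSTED settings from the P.S., a small helper can add them without creating duplicate lines on repeated runs. This is a sketch under assumptions: `ensure_line` is a made-up helper name, and the local.cf path shown in the usage comment varies between installations.

```shell
#!/bin/sh
# Append a configuration line to a file unless an identical line already exists.
# ensure_line is a hypothetical helper, not part of SpamAssassin.
ensure_line() {
    file="$1"
    line="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Hypothetical usage; adjust the path to your installation:
#   ensure_line /etc/spamassassin/local.cf 'score ALL_TRUSTED -0.001'
#   ensure_line /etc/spamassassin/local.cf 'shortcircuit ALL_TRUSTED off'
#   spamassassin --lint
```

Running `spamassassin --lint` afterwards is a reasonable way to catch syntax mistakes before the change goes live.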
Re: my spamassassin has serious config problems
On Tue, 28 May 2019 15:34:06 +0200 hg user wrote:
> Fourth:
> I added a dbg statement to bayes.pm, sub tokenize, to print the tokens it extracts from the message.
> I agree with some, I don't with others. I'd like to know if there is some doc that lists why tokens are extracted this way (some notes are in the source code)
> I discovered that probably some words should be added to the stopwords list but there is no way to do it in a configuration file, I should modify spamassassin code directly...

The stoplist is just there to drop tokens that are deemed to be not worth using because they are likely to be neutral. Neutral tokens don't affect the result.

For testing purposes I'd suggest stripping any purely internal headers, except headers that contain envelope information, as Zimbra may be supplying this by other means.

If you can turn off auto-training and clear the database, I suggest you do that.
Re: bayes_path ignored
On Mon, 27 May 2019 19:38:55 -0500 Andy Howell wrote:
> On 5/27/19 6:43 PM, Reindl Harald wrote:
> > On 28.05.19 at 01:05, Andy Howell wrote:
> >> How do I get spamassassin to honor the setting of bayes_path in /etc/spamassassin/local.cf ?
> >>
> >> bayes_path /var/spool/postfix/spamassassin/bayes_db
> >>
> >> It never gets used. I can get sa-learn to create the files there by specifying the path:
> >>
> >> sa-learn --username=amavis --dbpath=/var/spool/postfix/spamassassin/bayes_db/ path>
> >
> > this line with the trailing slash is wrong anyways, the bayes_db is supposed to be the prefix of the files below /var/spool/postfix/spamassassin/
>
> Reindl Harald,
>
> I saw "path" and assumed it was a path, not a prefix. The --dbpath in the sa-learn command is a path.

Not exactly, if you specify a directory here then the default file prefix bayes is added automatically.

> I guess it should be /var/spool/postfix/spamassassin/bayes_db/bayes

Yes, if you omit the prefix here it's simply invalid and you fall back to a value of '~/.spamassassin/bayes'.

The '-u amavis' argument to sa-learn is wrong too, as this specifies a virtual mail user, not a unix user to run as. If you are not using per unix user files under ~/ you need to run sa-learn as the user that owns the global files.

Do not run with 'bayes_file_mode 0775' as suggested in the howto. This is an ugly and insecure kludge to allow unix users to share a single database.

Don't test as root as it will mask problems with permissions.
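To make the prefix behaviour discussed above concrete, here is a small sketch that prints the file names a given bayes_path value would lead to. It assumes the default DBM storage, which appends suffixes such as _toks and _seen to the prefix; `bayes_files` is a made-up helper name.

```shell
#!/bin/sh
# Print the database files implied by a bayes_path value.
# bayes_path is a path *prefix*: suffixes such as _toks and _seen are
# appended to it, so the last component is a file name stem, not a directory.
bayes_files() {
    prefix="$1"
    for suffix in toks seen; do
        printf '%s_%s\n' "$prefix" "$suffix"
    done
}

# With the value suggested in the thread:
bayes_files /var/spool/postfix/spamassassin/bayes_db/bayes
```

With a bayes_path ending in bayes_db instead, the same logic yields files named bayes_db_toks and bayes_db_seen directly under /var/spool/postfix/spamassassin/, which is the layout Reindl Harald describes.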
Re: my spamassassin has serious config problems
I did some more research and I think I have to report the new discovery so that the thread can be useful to other Readers.

First:
0.000 0 5232 0 non-token data: nspam
0.000 0 70408 0 non-token data: nham
0.000 0 388070 0 non-token data: ntokens
nspam and nham values are definitively the number of messages learnt.

Second:
I saw that nham increased every few seconds. I discovered that bayes_auto_learn was enabled !
My situation yesterday:
0.000 0 1042011 0 non-token data: nspam
0.000 0 66472 0 non-token data: nham
0.000 0 663479 0 non-token data: ntokens
My situation now:
0.000 0 1042049 0 non-token data: nspam
0.000 0 71228 0 non-token data: nham
0.000 0 1040661 0 non-token data: ntokens

So, at least, I now know that the system is feeding the bayes engine with some new data and that in this way the results can change.

Third:
in 72_active.cf there are a lot of bayes_ignore_header directives, but they don't include the ones added by my commercial antivirus. Should I create a patch?

Fourth:
I added a dbg statement to bayes.pm, sub tokenize, to print the tokens it extracts from the message.
I agree with some, I don't with others. I'd like to know if there is some doc that lists why tokens are extracted this way (some notes are in the source code)
I discovered that probably some words should be added to the stopwords list but there is no way to do it in a configuration file, I should modify spamassassin code directly...

To end:
I think that the only way to proceed now is to nuke the bayes db and start from scratch:
- setup bayes configuration correctly
- double check the corpus to be correctly classified
- run sa-learn

For the "setup bayes configuration correctly" step I accept your contributions :-) I excluded all the headers of my antivirus and internal/external/trusted.

Thanks
Francesco
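To track how these counters move over time, the three values can be pulled out of the dump with awk. This is a rough sketch: it assumes the layout shown in the post (count in the third whitespace-separated field, label in the last field), and `bayes_counts` is a made-up helper name.

```shell
#!/bin/sh
# Extract nspam, nham and ntokens from `sa-learn --dump magic` output on stdin.
# Assumes the count is the third field and the label is the last field.
bayes_counts() {
    awk '$NF == "nspam" || $NF == "nham" || $NF == "ntokens" { print $NF "=" $3 }'
}

# Sample lines copied from the post above.
printf '%s\n' \
    '0.000 0 5232 0 non-token data: nspam' \
    '0.000 0 70408 0 non-token data: nham' \
    '0.000 0 388070 0 non-token data: ntokens' | bayes_counts
```

On a live system the input would come from `sa-learn --dump magic | bayes_counts`, run as the user that owns the Bayes database.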
Re: my spamassassin has serious config problems
On 28.05.19 00:13, hg user wrote:
> The server was installed and configured by a "zimbra man", a person I fully trust. Since I manage a commercial antivirus/antispam solution that is not properly working for the italian language, I was tasked to join the project in order to understand if we could switch from the proprietary solution to spamassassin.
>
> I'm now in the process of double-checking the configuration of spamassassin and feeding the bayes engine...
>
> Testing the system I noticed that spamassassin logged the internal MTAs (including the antivirus server) as external and I asked *the zimbra man* to correct the configuration. He replied it was not necessary. Sorry I didn't specify I asked the person in charge of the system.

I believe that is not necessary, because Zimbra takes control itself and uses a modified SA source. If your "spamassassin" binary is not the one from Zimbra, that is apparently the reason why you have problems with the trust path configuration and also with the bayes database.

I don't recommend mixing usage of Zimbra's internal SA and SA installed from elsewhere.

> Unfortunately, spamassassin documentation is not really clear and asking google can be even more confusing... I found posts stating that nham/nspam reported by --dump magic are either tokens or messages... according to a test I did this afternoon, feeding 2 messages to sa-learn ham, those numbers are tokens.
>
> 0.000 0 5232 0 non-token data: nspam
> 0.000 0 70408 0 non-token data: nham
> 0.000 0 388070 0 non-token data: ntokens

I believe the first two are counts of mail, the last one is the count of tokens, and also that it's self-explanatory.

> I noticed that the nham counter kept increasing for several minutes after sa-learn ended, probably due to the --no-sync parameter... this could also explain why immediately after the sa-learn of the spam message bayes reported BAYES_50 and a few minutes later BAYES_00: the engine was still learning and as new tokens were recorded they changed the result.
>
> In the end, I need to think about the answer of RW: spamassassin is run by amavis but with no internal servers defined, it uses my internal one as the external. Received header needs some more care, and probably also the list of stop words should be expanded. Probably there is a ratio behind some decisions taken by the developers, but I can't fully understand how the destination address can help on whether a message is spam or not, at least not 6 times.
>
> Tomorrow I will try some -D bayes on different messages to try understand better what the plugin is doing, and I will try to read all the source code. Unfortunately I don't know perl...
>
> Probably the best solution is to change the configuration, zap the bayes db and sa-learn all the corpus I put apart

I recommend asking on the Zimbra forums when you are messing with Zimbra's bayes database and Zimbra SA settings.

> On Mon, May 27, 2019 at 8:06 PM Matus UHLAR - fantomas wrote:
> > On 27.05.19 18:04, hg user wrote:
> > >I was writing a message requesting advice on bayes_ignore_header since I was sure something was wrong when I decided to have a look at spamassassin -D bayes output... and I was shocked by what I saw !
> > >
> > >x-spam-relays-external lists all the hops of the message *including* internal servers and so x-spam-relays-internal is empty... I specifically asked to add the antivirus and other internal MTAs to the internal list...
> >
> > how?

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
The only substitute for good manners is fast reflexes.
Re: my spamassassin has serious config problems
> On Mon, 27 May 2019 18:04:35 +0200 hg user wrote:
> > I was writing a message requesting advice on bayes_ignore_header since I was sure something was wrong when I decided to have a look at spamassassin -D bayes output... and I was shocked by what I saw !
> >
> > x-spam-relays-external lists all the hops of the message *including* internal servers and so x-spam-relays-internal is empty

On 27.05.19 20:18, RW wrote:
> You can fix this by setting internal_networks and trusted_networks. However if SA is running from amavis it probably doesn't matter,

I believe that the Zimbra installation does have these configs set properly; he just must use Zimbra's settings, which I don't know how to manage.

The "spamassassin" binary uses $HOME/ while the Zimbra installation stores them in a directory that is not in $HOME of any user.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Quantum mechanics: The dreams stuff is made of.