Re: why does that mail not get any bayes-classification
On Sat, 11 Jun 2016 04:52:48 +0200 Reindl Harald wrote:

> On 10.06.2016 at 23:52, RW wrote:
> > On Fri, 10 Jun 2016 16:57:45 +0200 Reindl Harald wrote:
> >
> > > see attachment, no bayes tag at all, looks like a major bug
> > > somewhere
> >
> > In the absence of any debug it's hard to say.
>
> hence i attached the sample

An email is not debug. I can't run it on *your* system.

> > It is possible for no tokens to make it through the selection, in
> > which case there is no result. That's more likely than normal in
> > your case since you don't train on headers.
>
> if you would have looked at the message you would have seen that
> there is content and not only headers and it looks like the message
> has just incorrect mime-definitions (missing end headers)

Of course I looked at it. And I ran it through spamassassin. Aside from header tokens, what made it past the token selection on my database was only:

'marcus', 'Marcus', 'enclosed', 'invoice', 'business' and 'thank'

It's quite possible that all the body tokens in that email were in the neutral range on your system, which would cause Bayes to exit without producing a classification. In the absence of any debug against your database, there is nothing particularly suspicious here.
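The "neutral range" argument above can be illustrated with a small sketch. This is not SpamAssassin's Bayes code: the function names, the 0.3 strength threshold, and the plain averaging used to combine tokens are all assumptions chosen only to show why a message whose tokens all sit near 0.5 ends up with no result at all.

```python
from typing import Dict, Optional

def select_tokens(probs: Dict[str, float], min_strength: float = 0.3) -> Dict[str, float]:
    """Keep only tokens whose spam probability is far enough from neutral (0.5).

    min_strength is a hypothetical threshold for illustration.
    """
    return {tok: p for tok, p in probs.items() if abs(p - 0.5) >= min_strength}

def classify(probs: Dict[str, float]) -> Optional[float]:
    """Return a combined score, or None when no token survives selection."""
    selected = select_tokens(probs)
    if not selected:
        return None  # nothing to combine, so no classification is produced
    # Simple mean as a stand-in; a real Bayes implementation combines differently.
    return sum(selected.values()) / len(selected)

# Every token is close to neutral, so there is no result.
print(classify({"invoice": 0.55, "thank": 0.48, "business": 0.52}))
```

With a differently trained database the same tokens could fall outside the neutral band, which is why debug output against the affected database matters.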
Re: why does that mail not get any bayes-classification
On Sat, 11 Jun 2016, Reindl Harald wrote:

> On 10.06.2016 at 23:52, RW wrote:
> > On Fri, 10 Jun 2016 16:57:45 +0200 Reindl Harald wrote:
> >
> > > see attachment, no bayes tag at all, looks like a major bug
> > > somewhere
> >
> > In the absence of any debug it's hard to say.
>
> hence i attached the sample
>
> > It is possible for no tokens to make it through the selection, in
> > which case there is no result. That's more likely than normal in
> > your case since you don't train on headers.
>
> if you would have looked at the message you would have seen that
> there is content and not only headers and it looks like the message
> has just incorrect mime-definitions (missing end headers)
>
> since thunderbird shows the attachment as well as the mail content
> that would be a way for spammers to completely trick out SA

There may be a bug but I don't think it is in the SA distro. I took your sample and fed it to my SA kit. The first time through it hit BAYES_50. I then did a "sa-learn --spam < /tmp/ignored_by_bayes_stripped.eml" and retested it. It then hit BAYES_999. So I'd say standard SA + Bayes works on that message. Somebody at your site may have made some modifications to your SA that are causing you problems.

--
Dave Funk
University of Iowa College of Engineering
319/335-5751 FAX: 319/384-0549
1256 Seamans Center
Sys_admin/Postmaster/cell_admin
Iowa City, IA 52242-1527
#include
Better is not better, 'standard' is better. B{
Re: Where to find DETAIL for spamassassin default RULES
On 10 Jun 2016, at 3:09, jimimaseye wrote:

> REGEXP: I don't mind having a go at reading them (I have written some myself) but, as you know, even though some are easy and obvious sometimes it can be like reading music - a blur of blobs, dots and squiggles that take a lot of deciphering. Of course, many of them rely on 'functionality' of the plugins (which I can't say I would fully understand) and the understanding of a RULE structure (some are easy and obvious, some are very convoluted).
>
> (I recently developed this one from scratch. It's an RFC2822 email address validator:
>
> ^(?=.{1,64}@)("[^<>@\\]+"|(?!\.|.*\.(\.|@))[^<> @\\"]+)@(\[(\d{1,3}\.){3}\d{1,3}\]|\[IPv6:(?:[A-Fa-f\d]{1,4}:){7}[A-Fa-f\d]{1,4}\]|(?=.{1,255}$)((?!-|\.|\d+($|\.))[a-zA-Z\d-]{0,62}[a-zA-Z\d])(|\.(?!-|\.|\d+($|\.))[a-zA-Z\d-]{0,62}[a-zA-Z\d]){1,126})$
>
> Very proud of it too. )

So, you thought validating email addresses was a problem demanding a solution? And you "solved" it with a regular expression?

Congratulations on now having 2 problems. They should be very happy together.
Re: why does that mail not get any bayes-classification
On 10.06.2016 at 23:52, RW wrote:

> On Fri, 10 Jun 2016 16:57:45 +0200 Reindl Harald wrote:
>
> > see attachment, no bayes tag at all, looks like a major bug
> > somewhere
>
> In the absence of any debug it's hard to say.

hence i attached the sample

> It is possible for no tokens to make it through the selection, in
> which case there is no result. That's more likely than normal in
> your case since you don't train on headers.

if you would have looked at the message you would have seen that there is content and not only headers and it looks like the message has just incorrect mime-definitions (missing end headers)

since thunderbird shows the attachment as well as the mail content that would be a way for spammers to completely trick out SA
Re: Email with attachment caused 100% CPU usage.
On 10 Jun 2016, at 3:40, Merijn van den Kroonenberg wrote:

> > On 9 Jun 2016, at 0:53, Henrik K wrote:
> >
> > > Garbage text/plain is known problem..
> >
> > text/html too. From GMail.
> >
> > Last week I had a *perfectly legitimate* message with a 151KB logical
> > single line of HTML (QP encoded of course) freeze up a server scaled
> > for 10k users.
> > [snip]
>
> Are there publicly available mails which might cause these kinds of
> problems? It would be interesting for testing set-ups.

Not that I'm aware of. This particular case involved a customer organization who I wouldn't ever consider asking to share their mail for any reason, as they specialize in a field where the privacy issues would be insurmountable.

One could mock up such messages by hand rather easily: find & copy or create (maybe Word would be a choice tool for this...) a bloated HTML page, replace all the line breaks with spaces in a proper text editor. Attach that to an email message or just make it the sole body part of a pure text/html message.

It MAY be that this was a case where someone wrote something big-ish and highly-formatted inside GMail or Google Docs and mailed it so the structure is GMail's fault. It could also be that some MUA or document generation tool constructed the mail and it came through GMail via SMTP submission. I do not know which it is but I would bet on the latter, given Google's core expertise in working with HTML. My reflexes in these sorts of cases where I have to handle customer non-spam mail are to not look too closely at anything non-critical to solving the problem at hand and swiftly forget anything unimportant I happen to see (which gets easier every year...)
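The hand mock-up described above could also be scripted. The following is a minimal sketch using Python's standard email package; the function name, the addresses, the filler markup and the paragraph count are all made up for illustration, and it only approximates the kind of message discussed in this thread (one very long logical line of HTML, quoted-printable encoded).

```python
from email.message import EmailMessage

def build_long_line_html_message(paragraphs: int = 3000) -> EmailMessage:
    """Build a text/html message whose decoded body is one long logical line."""
    # Hypothetical filler: a bloated page with one paragraph per line.
    chunk = '<p style="margin:0;font-family:Arial">filler text for scan-time testing</p>\n'
    html = "<html><body>\n" + chunk * paragraphs + "</body></html>\n"
    # Replace every line break with a space, as suggested in the message above.
    one_line = html.replace("\n", " ").strip()

    msg = EmailMessage()
    msg["From"] = "sender@example.com"       # placeholder address
    msg["To"] = "recipient@example.com"      # placeholder address
    msg["Subject"] = "long-line HTML test message"
    # Quoted-printable soft-breaks the transport lines; the decoded text stays one line.
    msg.set_content(one_line, subtype="html", cte="quoted-printable")
    return msg

if __name__ == "__main__":
    message = build_long_line_html_message()
    with open("long_line_test.eml", "wb") as handle:
        handle.write(message.as_bytes())
```

The resulting file can then be fed to a test installation, never a production one, to observe scan times.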
Re: Email with attachment caused 100% CPU usage.
On 9 Jun 2016, at 1:40, Olivier wrote:

> Mark London writes:
>
> > On 6/8/2016 1:20 PM, John Hardin wrote:
> > > On Wed, 8 Jun 2016, Mark London wrote:
> > > > Hi - We received an email with several large postscript
> > > > attachments, and the content type was "text/plain". This caused
> > > > our spamassassin
>
> Sorry to jump in, but should SA trust the content-type or the file(1)
> type, or should it try to compare both and do something if they
> mismatch?

No. The root of this problem isn't a type mismatch. PostScript *IS* (or at least *can be*) plain text. If my recollection is correct, PS was originally specified to use "Base85" encoding for anything that went outside the "ascii85" (33-117) subset of ASCII, which just happened to be the digits used in Base85 encoding. Many Unix-ish machines have the 'atob' and 'btoa' utilities on them to do that encoding and decoding.

As long as a proper mail-safe transfer encoding is used (Base64 or QP) there is no limit on the decoded "line" length in text/plain (Thanks, Microsoft!) or PostScript, so an unanchored and/or imprudently loose regular expression can end up doing a LOT of false starts in a big chunk of PostScript that claims correctly to be text/plain with (probably) Q-P encoding used only to soft-break it into transport-safe lines.

The root of the problem is using unanchored and/or imprudently loose regular expression rules. If it's not an innocent PS today because there was a type mismatch, it will be correctly-typed HTML tomorrow.

A handy heuristic: if a rule does not start with '^' immediately followed by something restrictive (even '^.{0,80}' followed by a string of literals) or has '.*' anywhere, it's risky. This is probably most critical with 'rawbody' rules and rules that use the multiline option. It is important to understand that "rawbody" isn't the body parts of 'full' (pristine RFC2822 format with any CTE) but rather the text/* body parts with any Content-Transfer-Encoding transformation reversed.
Due to the odd way SA deals with line ends (there are WONTFIX bugs on it and probably a mention on some wiki page...) you basically have to use the multiline option to anchor to line-ends, and that can inadvertently land you in trouble.
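The heuristic from this message can be turned into a crude lint pass over local rule patterns. The sketch below is only an approximation under stated assumptions: the function name is invented, it checks for a leading '^' but does not judge whether what follows is "restrictive", and a literal escaped dot followed by '*' is flagged as well.

```python
def is_risky(pattern: str) -> bool:
    """Flag a rule pattern that is unanchored or contains an unbounded '.*'."""
    unanchored = not pattern.startswith("^")
    has_dot_star = ".*" in pattern
    return unanchored or has_dot_star

# Example patterns are made up for illustration.
for candidate in ("^.{0,80}enclosed invoice", "invoice.*attached", "^Subject: .*free"):
    print(f"{candidate!r}: {'risky' if is_risky(candidate) else 'looks bounded'}")
```

A pass like this only points at candidates for review; it says nothing about how a pattern actually behaves on a large single-line body.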
Re: Email with attachment caused 100% CPU usage.
On Fri, 2016-06-10 at 18:26 -0400, Bill Cole wrote:

> It will be interesting to see the stats on scantimes this week to see
> if my tightening up on sloppy rules has an impact. I expect it will,
> since I now have a concrete theory to explain that long tail out to 2
> minutes, which before now I've ignored as pure noise.

Thanks for the heads-up. It prodded me into scanning my local rule set for unguarded '.*'. I found a few - more or less what I expected and virtually all in my older rules - and limited them all to runs of up to 32 characters before running my spam corpus against them.

This showed only two instances of a rule now failing to fire. Finding the rule and changing it to allow up to 64 characters to match in the 'don't care' parts of the regex fixed that.

FWIW this is a rule that looks for two consecutive URLs in body text. I knew that some of these can be quite long, but since, in all the spam I've inspected, they were used as a more or less neat final line in the message, they probably don't exceed 32 chars by much. So yes, while I know that 64 chars is probably overkill, it's not so much overkill that it's worth further fiddling.

Martin
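The rewrite described above (capping each unguarded '.*' at a fixed run length) can be sketched as a mechanical substitution. The helper name and the two-URL pattern below are hypothetical and are not the rule from this message; the lookbehind skips a dot escaped by a single backslash but does not handle every escaping corner case, such as an escaped backslash directly before the dot.

```python
import re

# A '.' followed by '*' that is not preceded by a backslash.
_UNGUARDED_DOT_STAR = re.compile(r"(?<!\\)\.\*")

def bound_dot_star(pattern: str, limit: int = 32) -> str:
    """Replace each unguarded '.*' with a bounded '.{0,limit}' run."""
    return _UNGUARDED_DOT_STAR.sub(".{0,%d}" % limit, pattern)

# Hypothetical pattern looking for two URLs close together.
loose = r"https?://\S+.*https?://\S+"
print(bound_dot_star(loose))
print(bound_dot_star(loose, limit=64))
```

As in the message, the rewritten rules should be rerun against a known corpus afterwards, since a cap that is too tight makes a rule silently stop firing.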
Re: Email with attachment caused 100% CPU usage.
Merijn van den Kroonenberg wrote on 10/06/16 10:17 PM:

> What does this mean, can still a single operation take more than this
> time_limit?

There is a fundamental difficulty in perl that the built-in timer alarm facility cannot always interrupt the built-in regular expression matching facility. That means there is not a good way of protecting against a rule pattern that can take a very long time to process certain input being given that input. There is some effort to do that with the time_limit setting and spawning of individual child processes in spamd that can be individually killed if one of them gets hung up, but that only goes so far. More technical discussion of how to better deal with this probably would be better in the dev mailing list.

There have been a number of bug reports having to do with some "toxic" email bogging down SpamAssassin which were tracked down to some use of, for example, .* in a rule pattern. It can be tricky to fix a pattern to keep this from happening. When you write your own rules for local use you are more likely to run into this problem because you probably have less experience in writing patterns to avoid it and don't run your rules through the many test cases of our mass check system.

Sidney
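The general idea of isolating a match in a child process that can be killed can be shown outside of Perl. The sketch below is in Python and is only an analogy, not how spamd is implemented: the function name and the inline child script are assumptions, and the pathological pattern is a textbook backtracking example rather than a real rule.

```python
import subprocess
import sys
from typing import Optional

# Script run in the child: exit status 0 on match, 1 on no match.
_CHILD_SCRIPT = (
    "import re, sys\n"
    "pattern, text = sys.argv[1], sys.argv[2]\n"
    "sys.exit(0 if re.search(pattern, text) else 1)\n"
)

def match_with_timeout(pattern: str, text: str, seconds: float) -> Optional[bool]:
    """Return True or False for the match, or None if the child had to be killed."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", _CHILD_SCRIPT, pattern, text],
            timeout=seconds,  # on expiry the child is killed and TimeoutExpired is raised
        )
    except subprocess.TimeoutExpired:
        return None
    return done.returncode == 0

if __name__ == "__main__":
    print(match_with_timeout("invoice", "enclosed invoice", 10))
    # Nested quantifiers backtrack exponentially on this input.
    print(match_with_timeout(r"(a+)+$", "a" * 40 + "b", 1))
```

The cost is a process per guarded match, which is why this kind of protection tends to be applied per message or per worker rather than per rule.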
Re: Email with attachment caused 100% CPU usage.
On 10 Jun 2016, at 6:17, Merijn van den Kroonenberg wrote:

[...]

> From the manual:
>
>   This is a best-effort advisory setting, processing will not be
>   abruptly aborted at an arbitrary point in processing when the time
>   limit is exceeded, but only on reaching one of locations in the
>   program flow equipped with a time test. Currently equipped with the
>   test are the main checking loop, asynchronous DNS lookups, plugins
>   which are calling external programs. Rule evaluation is guarded by
>   starting a timer (alarm) on each set of compiled rules.

The last line is critical. The alarm isn't even on individual rules but rather on precompiled sets of rules. I already knew it was soft but that particular detail had not stuck in any of the many times I've read that man page. Thanks for quoting it.

> What does this mean, can still a single operation take more than this
> time_limit? But I guess the timer on the rules means the rules at
> least cannot take more than time_limit, right?

Nope: time_limit is in seconds and I had it set to 270. The default is 300, which is 1/2 the canonical SMTP EOD timeout but matches the canonical "server timeout" (how long a server should wait for a client command; see RFC 5321), and I'm sure there's misunderstanding on that distinction. Since under peak loads the plumbing of that particular system can add whole seconds and some clients can be a bit impatient, I gave it what I thought was very generous additional headroom although it really didn't need it. The actual normal spamd scantime distribution there is roughly 2/3 under 1 second, 90% under 2 seconds, 99% under 10 seconds, 99.9% under a minute. There's a long thin tail out to ~2 minutes, but until last week I'd never had anything actually hit the timeout that I can recall, and the 1st bad rule was hardly fresh. I think it is >8 years old and the others I found had not caused trouble in their multiple (3?) years of existence.
I figured out which rule was the proximal cause by running the message through the spamassassin script with rule debugging on so I could narrow down the bad rule based on what matched right before TIME_LIMIT_EXCEEDED. I didn't dtrace the process to nail it down, but my hypothesis is that ultimately Perl is calling its low-level internal equivalent of regexec() which, like POSIX regexec(), has no timeout facility: it runs until it matches or exhausts the input of starting points.

At the Perl level, I've never encountered any way to limit the time '=~' takes to operate, but I'm no Larry Wall so maybe there's some arcane way to do that. It seems clear that if Perl has such a feature to break out of an operator routine that is taking too much clock time, it has not been used in SpamAssassin. I'd bet on there NOT being such a feature and further, that the process doing the match might not even die immediately with a SIGKILL while inside that call.

What most annoys me about this is that the potential for blowing up systems with REs is the first thing one learns about them in a formal setting (rather than just by reading man pages above the BUGS section.) Back when I first got that warning the emphasis was on an ability of REs to compile to disastrous scale but back then that meant a few megabytes, and who cares about that today. However, I also got the warning decades ago that '.*' could cause a RE to take a long time if you didn't take care to limit your input size and write the RE to rule out most starting points fast, but again absolute sizes matter and until last week I'd not envisioned "optimizing" HTML by removing all formally unnecessary whitespace including line breaks. This is obviously somewhat rare, but it's apparently A Thing HTML Parsers Like and this was a big hunk of HTML, so I guess optimizing parsing was important...

It will be interesting to see the stats on scantimes this week to see if my tightening up on sloppy rules has an impact. I expect it will, since I now have a concrete theory to explain that long tail out to 2 minutes, which before now I've ignored as pure noise.
Re: why does that mail not get any bayes-classification
On Fri, 10 Jun 2016 16:57:45 +0200 Reindl Harald wrote:

> see attachment, no bayes tag at all, looks like a major bug somewhere

In the absence of any debug it's hard to say.

It is possible for no tokens to make it through the selection, in which case there is no result. That's more likely than normal in your case since you don't train on headers.
Re: SA bayes file db permission issue
On Fri, 2016-06-10 at 15:38 -0400, Joseph Brennan wrote:

> Look out for big-endian and little-endian, too. That affects
> databases. This bit us once when we copied a berkeley db from solaris
> to linux. Endian-ness is based on the cpu hardware, but apparently
> Macs and most hardware used for Linux (like Intel) are both
> little-endian-- so it is probably not the answer in this case.

Has to be an implementation difference in that case, e.g. UTF-8 vs ASCII, or somebody decided that using an int was wasteful and used a short instead.

> This is a nice test I found:
>
>   echo -n I | od -to2 | awk '{ print substr($2,6,1); exit}'
>
>   1 little-endian
>   0 big-endian

Very nice indeed. Thanks for posting it.

Martin
Re: SA bayes file db permission issue
On Fri, 10 Jun 2016 15:38:44 -0400 Joseph Brennan wrote:

> wrote:
>
> > The main database file is binary anyway.
>
> Look out for big-endian and little-endian, too. That affects
> databases. This bit us once when we copied a berkeley db from solaris
> to linux.

That may have changed; they are supposed to be compatible:

http://www.oracle.com/technetwork/database/berkeleydb/db-faq-095848.html

It's just a bit less efficient.

> Endian-ness is based on the cpu hardware, but apparently
> Macs and most hardware used for Linux (like Intel) are both
> little-endian-- so it is probably not the answer in this case.

IIRC older OS X Macs used big-endian PowerPC processors.
Re: SA bayes file db permission issue
wrote:

> The main database file is binary anyway.

Look out for big-endian and little-endian, too. That affects databases. This bit us once when we copied a berkeley db from solaris to linux.

Endian-ness is based on the cpu hardware, but apparently Macs and most hardware used for Linux (like Intel) are both little-endian-- so it is probably not the answer in this case.

This is a nice test I found:

  echo -n I | od -to2 | awk '{ print substr($2,6,1); exit}'

  1 little-endian
  0 big-endian

Joseph Brennan
Columbia U
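For anyone wondering why that one-liner works, here is a small sketch that mirrors it in Python. The function name is invented; the logic assumes, as od does for a lone trailing byte, that the single byte 'I' (0x49) is padded with a zero byte and then read as one 16-bit unit in host byte order, and that the sixth octal digit is what the awk step prints.

```python
import sys

def od_style_endian_digit() -> str:
    """Return '1' on a little-endian host and '0' on a big-endian one."""
    # 'I' followed by a zero pad byte, interpreted as a 16-bit word in host order.
    value = int.from_bytes(b"I\x00", sys.byteorder)
    # Little-endian: 0x0049 is octal 000111. Big-endian: 0x4900 is octal 044400.
    return format(value, "06o")[5]

print(od_style_endian_digit(), sys.byteorder)
```

In Python itself, sys.byteorder answers the question directly; the function only exists to show what the shell pipeline is doing.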
Re: How SA reactes to a bunch of garbage characters
On 09.06.16 10:43, Olivier wrote:

> For years I have had the FuzzyOcr plugin running, but it helps little,
> because it has its own list of words to keep updated.
>
> I am wondering if, instead of using that own list of words, the
> result was injected back into the body of the main message.

I raised this issue some years ago. The result was that pushing OCR-ed data back to SA for evaluating BAYES and other rules could cause trouble, because freely available OCR software was not very precise.

> Most of the time, what will be injected back is plain garbage:
> w_T___l_e?_ But other times the result is interesting, like a proper
> English sentence full of spam.

What exactly do you use for OCR? 10 years ago I made a comparison between gocr, ocrad and tesseract, where gocr gave the best results. Now, since Google sponsors tesseract development, the scanning looks much, much better, and I started thinking about trying that again.

> So how will SA react if I reinject the garbage? Will it just ignore it?

It would be nice to see the results. I'm mostly afraid about FUZZY_* rules...

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Save the whales. Collect the whole set.
Re: SA bayes file db permission issue
On Fri, 10 Jun 2016 00:08:01 +0100 Martin Gregorie wrote:

> On Thu, 2016-06-09 at 15:01 -0700, John Hardin wrote:
> > On Thu, 9 Jun 2016, Martin Gregorie wrote:
> >
> > > On Thu, 2016-06-09 at 16:54 -0400, Yu Qian wrote:
> > > > Ok, I found out. so the db files generated on Mac can not be
> > > > used on Linux. vice versa.
> > >
> > > Newline symbols differ: '\n' is 0x0a (LF) for Linux, 0x0d (CR) for
> > > Macs.
> >
> > WTF? I thought Mac's OS was based on Mach, which is an offshoot of
> > Unix?
>
> Since Macs used CR from their debut I thought this got carried over
> into OS X for file compatibility reasons. Seems that I was wrong
> (except for Excel for OS X, which still uses CR for CSV files).

The main database file is binary anyway.
Re: Email with attachment caused 100% CPU usage.
> On 10.06.2016 at 04:49, Bill Cole wrote:
>> On 9 Jun 2016, at 0:53, Henrik K wrote:
>>
>>> Garbage text/plain is known problem..
>>
>> text/html too. From GMail.
>>
>> Last week I had a *perfectly legitimate* message with a 151KB logical
>> single line of HTML (QP encoded of course) freeze up a server scaled for
>> 10k users. It did it slowly over a day, because it took a spamd child
>> ~20 minutes to scan
>
> why in the world do you allow a single spamd child to scan 20 minutes
> for a message and what happens if all your children have such mails to
> process - that's hardly scaled for 10k users on rainy days
>
> time_limit 20
>
> read the manual, it works like shortcircuit meaning all other rules
> already finished (RBL/URIBL in any case) will give their score and so
> you don't open the machine widely while stop easy DOS attacks with
> handcrafted mails

From the manual:

  This is a best-effort advisory setting, processing will not be abruptly aborted at an arbitrary point in processing when the time limit is exceeded, but only on reaching one of locations in the program flow equipped with a time test. Currently equipped with the test are the main checking loop, asynchronous DNS lookups, plugins which are calling external programs. Rule evaluation is guarded by starting a timer (alarm) on each set of compiled rules.

What does this mean, can still a single operation take more than this time_limit? But I guess the timer on the rules means the rules at least cannot take more than time_limit, right?

> if the server is not a feature-phone when you don't have a result within
> 20 seconds you hardly get one 5 minutes later (besides that in a proper
> setup rejecting based on the result the client doesn't wait that long and
> comes again and again)
Re: Email with attachment caused 100% CPU usage.
On 10.06.2016 at 04:49, Bill Cole wrote:

> On 9 Jun 2016, at 0:53, Henrik K wrote:
>
> > Garbage text/plain is known problem..
>
> text/html too. From GMail.
>
> Last week I had a *perfectly legitimate* message with a 151KB logical
> single line of HTML (QP encoded of course) freeze up a server scaled
> for 10k users. It did it slowly over a day, because it took a spamd
> child ~20 minutes to scan

why in the world do you allow a single spamd child to scan 20 minutes for a message and what happens if all your children have such mails to process - that's hardly scaled for 10k users on rainy days

time_limit 20

read the manual, it works like shortcircuit meaning all other rules already finished (RBL/URIBL in any case) will give their score and so you don't open the machine widely while stop easy DOS attacks with handcrafted mails

if the server is not a feature-phone when you don't have a result within 20 seconds you hardly get one 5 minutes later (besides that in a proper setup rejecting based on the result the client doesn't wait that long and comes again and again)
Re: Email with attachment caused 100% CPU usage.
> On 9 Jun 2016, at 0:53, Henrik K wrote:
>
>> Garbage text/plain is known problem..
>
> text/html too. From GMail.
>
> Last week I had a *perfectly legitimate* message with a 151KB logical
> single line of HTML (QP encoded of course) freeze up a server scaled for
> 10k users.
> [snip]

Are there publicly available mails which might cause these kinds of problems? It would be interesting for testing set-ups.
Re: Where to find DETAIL for spamassassin default RULES
Thanks for the replies, guys.

So in essence, there is no user-friendly method as there was before.

On 09/06/2016 14:19, Joe Quinn wrote:

> I have a bookmark in Firefox that points to
> http://ruleqa.spamassassin.org/?rule=%s&srcpath=&g=Change which is the
> status page for the nightly rule updates and is likely what you are
> looking for.
>
> I give it a keyword too, so I can type "ruleqa RULENAME" and it will
> replace the "%s" with whatever I type.

As for looking up and searching those nightly listings, it's true I can find an individual rule, but then I can't see how to drill into it and see its expression or detail - I can only see a load of links showing how effective it is in tests (it's not really what I was looking for). Am I missing something?

REGEXP: I don't mind having a go at reading them (I have written some myself) but, as you know, even though some are easy and obvious sometimes it can be like reading music - a blur of blobs, dots and squiggles that take a lot of deciphering. Of course, many of them rely on 'functionality' of the plugins (which I can't say I would fully understand) and the understanding of a RULE structure (some are easy and obvious, some are very convoluted).

(I recently developed this one from scratch. It's an RFC2822 email address validator:

^(?=.{1,64}@)("[^<>@\\]+"|(?!\.|.*\.(\.|@))[^<> @\\"]+)@(\[(\d{1,3}\.){3}\d{1,3}\]|\[IPv6:(?:[A-Fa-f\d]{1,4}:){7}[A-Fa-f\d]{1,4}\]|(?=.{1,255}$)((?!-|\.|\d+($|\.))[a-zA-Z\d-]{0,62}[a-zA-Z\d])(|\.(?!-|\.|\d+($|\.))[a-zA-Z\d-]{0,62}[a-zA-Z\d]){1,126})$

Very proud of it too. )