Re: Large (usually legitimate) HTML mails choking SA
Karsten Bräckelmann wrote:
> However, using (?:\s|\&nbsp;)* also does the trick. Yes, keeping the
> nasty asterisk quantifier. The difference is merely dropping the \n from
> the alternation, which is part of \s whitespace anyway.
>
> Wondering if this is a case where Perl fails to optimize out the \n.
> Which would result in an alternation with overlap...

Hmm. This may be a Perl-version-specific (or which-flags-Perl-was-built-with thing) then, because I've been adding \n on rawbody rules where I want to match multiple physical lines because \s *hasn't* been matching newlines - at least, not all the time.

-kgd
Re: Large (usually legitimate) HTML mails choking SA
On Fri, 2011-05-27 at 13:14 -0400, Kris Deugau wrote:
> Karsten Bräckelmann wrote:
> > Yes, that sounds like the culprit indeed is one or more custom rule. If
> > that "much faster" equals twice as fast,
>
> Probably closer to 4-6x; dual PIII/866 -> Core i3 3GHz.

Sure -- that "twice" assumption was just a quickly assumed lower bound, that still shows the dramatic difference of the custom rule burning a whopping 25 times the CPU.

> > Bisection is your friend.
> >
> > Go hunt down that bugger, that in conjunction with the specific sample
> > kills your performance. Once you found it, maybe you can post it?
>
> Seems to have been this:
>
> rawbody TOO_MANY_DIVS /(?:<[Dd][Ii][Vv]>(?:\s|\n|\&nbsp\;)*){6}/

Aha! Yes, that nesting of quantifiers sure looks like a prime candidate. Even though this isn't the pure evil form -- which would be to have two alternatives with overlap in sub-patterns. Or maybe it is. Frankly, not sure what exactly causes the RE to go berserk.

> Changing the * to {,100} drops the processing time down to ~8s.

Confirmed, grabbed your sample and this eliminates the issue.

However, using (?:\s|\&nbsp;)* also does the trick. Yes, keeping the nasty asterisk quantifier. The difference is merely dropping the \n from the alternation, which is part of \s whitespace anyway.

Wondering if this is a case where Perl fails to optimize out the \n. Which would result in an alternation with overlap...

--
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
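The overlap Karsten describes can be checked in isolation. The following is only a rough sketch in Python rather than Perl, so it says nothing about how Perl optimizes the pattern or about the timings reported in this thread. It assumes the last branch of the alternation is the HTML entity &nbsp;, and the names WITH_NEWLINE, WITHOUT_NEWLINE, newline_is_whitespace and same_verdict are made up for illustration. It shows that \s already covers \n (so the extra branch gives the engine two ways to match every newline) and that both variants reach the same verdict on two small samples.

```python
import re

# Assumption: the third branch of the rule is the entity "&nbsp;".
# Python's re needs no backslashes before "&" or ";".
WITH_NEWLINE = re.compile(r"(?:<[Dd][Ii][Vv]>(?:\s|\n|&nbsp;)*){6}")
WITHOUT_NEWLINE = re.compile(r"(?:<[Dd][Ii][Vv]>(?:\s|&nbsp;)*){6}")


def newline_is_whitespace() -> bool:
    """True if \\s alone matches a newline, i.e. the \\n branch is redundant."""
    return re.fullmatch(r"\s", "\n") is not None


def same_verdict(text: str) -> bool:
    """True if both variants agree on whether `text` has six divs in a row."""
    return bool(WITH_NEWLINE.search(text)) == bool(WITHOUT_NEWLINE.search(text))


samples = [
    # six <div> tags separated only by whitespace or &nbsp;
    "<div>\n\t<DIV>&nbsp;\n<div> <div>\n\n<div>&nbsp;&nbsp;<div>",
    # the run is interrupted by text after the second tag
    "<div>\n<div>text<div>\n<div>\n<div>\n<div>",
]

print(newline_is_whitespace())
print([same_verdict(s) for s in samples])
```

Since the two patterns accept the same input, dropping the \n branch should not change what the rule hits; it only removes one of the two routes by which a newline can be matched.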
Re: Large (usually legitimate) HTML mails choking SA
John Hardin wrote:
> On Thu, 26 May 2011, Kris Deugau wrote:
> > Whitelisting these once they're found lets them bypass SA altogether,
> > but in the meantime they get stuck in the mail queue. Has anyone got
> > any suggestions for decreasing the load SA imposes trying to process
> > one of these?
>
> Any possibility of getting a sample?

Eugh, that was *nasty*. Thoroughly anonymized version at http://www.deepnet.cx/~kdeugau/spamtools/nastyhtml.eml.

And the HTML is really, truly, *nasty*. I've never seen such a spectacular mess that's still legal HTML, even from Word or Frontpage.

And of course, because it's so nasty, I had to hand-edit it to anonymize it because otherwise any HTML editor would have cleaned it up >_<

-kgd
Re: Large (usually legitimate) HTML mails choking SA
Karsten Bräckelmann wrote:
> On Fri, 2011-05-27 at 10:38 -0400, Kris Deugau wrote:
> > Mmmm. I don't *think* so, but testing the message on a stock SA 3.3.1
> > took "only" a minute (on slow hardware) vs 13 (on my much faster desktop).
>
> The latter being the production system with the custom rules, or at
> least having an identical set of custom rules?

Yeah; I create the rules on my desktop (usually with an example spam on hand to make sure the rule hits what I intended it to hit), commit to svn, and periodically merge changes to a branch that's autopublished in something resembling the same way as the official stock rules and JM's SOUGHT rules.

> Yes, that sounds like the culprit indeed is one or more custom rule. If
> that "much faster" equals twice as fast,

Probably closer to 4-6x; dual PIII/866 -> Core i3 3GHz.

> Bisection is your friend.
>
> Go hunt down that bugger, that in conjunction with the specific sample
> kills your performance. Once you found it, maybe you can post it?

Seems to have been this:

rawbody TOO_MANY_DIVS /(?:<[Dd][Ii][Vv]>(?:\s|\n|\&nbsp\;)*){6}/
describe TOO_MANY_DIVS 6 or more tags in a row
score TOO_MANY_DIVS 0.75

Changing the * to {,100} drops the processing time down to ~8s.

I've got a number of similar rules for other "many logical/physical linebreaks with no content". I don't have a specific spample to point to just now, but from memory the original targets really did have a widely varying number of linebreaks or whitespace (logical or otherwise) in between the HTML tags, and I've been bitten before with applying bounds to matches (related rules for garbage HTML comments) not being *large* enough. O_o

This particular message has page after page of:

=09=09=09 =09=09=09 =09=09=09 =09 =09 =09

etc, with a few or tags for excitement.

-kgd
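The trade-off Kris mentions, a bound that turns out not to be *large* enough, can be made concrete with a small sketch. This is Python, not Perl, and it only illustrates matching behaviour, not SpamAssassin's timings. It assumes the filler alternation is whitespace plus the entity &nbsp;, and UNBOUNDED, BOUNDED, div_run and hits are hypothetical names. The bound is written as {0,100} on purpose: as far as I can tell from the perl5340delta notes, Perl releases before 5.34 did not accept an empty lower bound and treated {,100} as literal text, so it is worth double-checking what the {,100} variant actually matched on the Perl in use.

```python
import re

# Assumption: the filler between tags is whitespace or the entity "&nbsp;".
UNBOUNDED = re.compile(r"(?:<[Dd][Ii][Vv]>(?:\s|&nbsp;)*){6}")
BOUNDED = re.compile(r"(?:<[Dd][Ii][Vv]>(?:\s|&nbsp;){0,100}){6}")


def div_run(gap: str, count: int = 6) -> str:
    """Build `count` <div> tags, each followed by the same filler `gap`."""
    return ("<div>" + gap) * count


def hits(pattern: re.Pattern, text: str) -> bool:
    """True if the rule-like pattern fires anywhere in `text`."""
    return pattern.search(text) is not None


print(hits(BOUNDED, div_run("\t\t\t\n")))    # short filler: the bounded rule fires
print(hits(BOUNDED, div_run(" " * 100)))     # exactly at the bound: still fires
print(hits(BOUNDED, div_run(" " * 101)))     # one past the bound: missed
print(hits(UNBOUNDED, div_run(" " * 101)))   # the unbounded rule still fires
```

The third case is the one that bites: any message with more filler between tags than the chosen bound slips past the bounded rule, so the bound has to be picked from real samples.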
Re: Large (usually legitimate) HTML mails choking SA
On Fri, 2011-05-27 at 10:38 -0400, Kris Deugau wrote:
> Karsten Bräckelmann wrote:
> > > However, we've just had a couple of *legitimate* messages get stuck for
> > > essentially the same reason - a whole lot of pathologically bad HTML.
> >
> > Rings a bell. Such reports usually turned out to be caused by custom
> > rules. Any custom rawbody rules, in particular ones matching HTML tags,
>
> Yes, a few.
>
> > or otherwise prone to trigger RE backtracking? (That is, may consume
> > large sub-strings, before a following sub-pattern.)
>
> Mmmm. I don't *think* so, but testing the message on a stock SA 3.3.1
> took "only" a minute (on slow hardware) vs 13 (on my much faster desktop).

The latter being the production system with the custom rules, or at least having an identical set of custom rules?

Yes, that sounds like the culprit indeed is one or more custom rule. If that "much faster" equals twice as fast, your custom rules are taking 25(!) times as long as the complete stock rule-set, including all the parsing and stuff.

Bisection is your friend.

Go hunt down that bugger, that in conjunction with the specific sample kills your performance. Once you found it, maybe you can post it?

> I have a couple of instances of [a-z]+ and similar; is that effectively
> as troublesome as .+ or .*?

That on its own (i.e. not nested inside an alternation, etc) is very unlikely to be the issue, since it appears to be triggered by the HTML in the message.
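The bisection suggested above can be sketched roughly as follows. This is not a SpamAssassin tool, just an illustration in Python under two loud assumptions: exactly one rule is responsible for the slowdown, and there is some way to test a subset of rules against the sample. The function bisect_culprit and the predicate is_slow are hypothetical; in practice is_slow would have to enable only the given rules, scan the sample message, and compare the run time against a threshold.

```python
from typing import Callable, Sequence


def bisect_culprit(
    rules: Sequence[str],
    is_slow: Callable[[Sequence[str]], bool],
) -> str:
    """Return the single rule that makes `is_slow` true.

    Assumes exactly one culprit; with several interacting rules the
    result is only one of them.
    """
    if not rules:
        raise ValueError("need at least one rule to bisect")
    candidates = list(rules)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Keep whichever half still reproduces the slowdown.
        candidates = first if is_slow(first) else second
    return candidates[0]


# Stand-in predicate for demonstration only: pretend one named rule is slow.
rules = ["RULE_A", "RULE_B", "TOO_MANY_DIVS", "RULE_C", "RULE_D"]
fake_is_slow = lambda subset: "TOO_MANY_DIVS" in subset
print(bisect_culprit(rules, fake_is_slow))
```

Each round halves the candidate list, so even a few hundred custom rules need fewer than ten timed scans of the sample.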
Re: Large (usually legitimate) HTML mails choking SA
On 05/27, John Hardin wrote:
> Yes. "*" is "zero or more, unbounded" and "+" is "one or more, unbounded".
>
> It's much better to have an upper limit in body and rawbody rules,
> e.g. {0,80} or {1,80}
>
> The upper limit may need some experimentation to set in specific
> cases, but even so, {0,255} can be much less painful than *.

So somebody should (open a bug to) go through all the rules we provide and replace all instances of "*" with {0,255} and "+" with {1,255}?

> Header and URI texts are inherently fairly short so it's safer to
> use unbounded matches against them, but even so it's a good idea to

But still vulnerable to regex DoS

--
"I don't want to die... just yet... not while there's... women."
- J. Matthew Root, 8/23/02 (http://www.jmrart.com/)
http://www.ChaosReigns.com
Re: Large (usually legitimate) HTML mails choking SA
On Fri, 27 May 2011, Kris Deugau wrote:
> I have a couple of instances of [a-z]+ and similar; is that effectively
> as troublesome as .+ or .*?

Yes. "*" is "zero or more, unbounded" and "+" is "one or more, unbounded".

It's much better to have an upper limit in body and rawbody rules, e.g. {0,80} or {1,80}

The upper limit may need some experimentation to set in specific cases, but even so, {0,255} can be much less painful than *.

Header and URI texts are inherently fairly short so it's safer to use unbounded matches against them, but even so it's a good idea to simply get in the habit of always using bounded matches when writing rules.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhar...@impsec.org FALaholic #11174 pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
How can you reason with someone who thinks we're on a glidepath to a police state and yet their solution is to grant the government a monopoly on force? They are insane.
---
3 days until Memorial Day - honor those who sacrificed for our liberty
Re: Large (usually legitimate) HTML mails choking SA
On Fri, 27 May 2011 10:38:17 -0400 Kris Deugau wrote:
> I have a couple of instances of [a-z]+ and similar; is that
> effectively as troublesome as .+ or .*?

It could be, depending on what else is in the regex. There's a fairly nice Wikipedia article about evil regexes:

http://en.wikipedia.org/wiki/ReDoS#Evil_regexes

When I write SA rules, I never use the * or + operators. I always use something like {0,40} or {1,40} just to be on the safe side. (That still does not eliminate the possibility of exponential behaviour from bad regexes, but it does offer some protection against bad behaviour from unfortunate strings to be matched.)

Regards,

David.
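What the bound buys can be seen with a minimal sketch. It is Python rather than Perl, the names UNBOUNDED, BOUNDED and consumed are made up for illustration, and it shows only how much text each quantifier is allowed to swallow, not any timing. The less text a sub-pattern can consume, the less there is to give back one character at a time when a later part of the pattern fails.

```python
import re

UNBOUNDED = re.compile(r"[a-z]+")
BOUNDED = re.compile(r"[a-z]{1,40}")


def consumed(pattern: re.Pattern, text: str) -> int:
    """Characters the pattern swallows from the start of `text` (0 if no match)."""
    m = pattern.match(text)
    return m.end() if m else 0


long_run = "a" * 1000
print(consumed(UNBOUNDED, long_run))  # 1000
print(consumed(BOUNDED, long_run))    # 40
```

As David notes, this caps the cost per quantifier but does not by itself make a nested or overlapping pattern safe.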
Re: Large (usually legitimate) HTML mails choking SA
Karsten Bräckelmann wrote:
> On Thu, 2011-05-26 at 15:02 -0400, Kris Deugau wrote:
> > Every so often we get a message or two stuck in our inbound mail queue
> > because it took too long for SA to process during mail delivery.
> >
> > However, we've just had a couple of *legitimate* messages get stuck for
> > essentially the same reason - a whole lot of pathologically bad HTML.
>
> Rings a bell. Such reports usually turned out to be caused by custom
> rules. Any custom rawbody rules, in particular ones matching HTML tags,

Yes, a few.

> or otherwise prone to trigger RE backtracking? (That is, may consume
> large sub-strings, before a following sub-pattern.)

Mmmm. I don't *think* so, but testing the message on a stock SA 3.3.1 took "only" a minute (on slow hardware) vs 13 (on my much faster desktop).

I have a couple of instances of [a-z]+ and similar; is that effectively as troublesome as .+ or .*?

... Hm. I also notice I have more custom local rules than there are stock rules. I *really* need to get some testing infrastructure in place to trim that list down. O_o

-kgd
Re: "day old bread" DNSBL
Yes. URIBL_RHS_DOB is somewhat useful. It's not _very_ reliable alone though, so I use it with META rules that add points for combinations with other things that are common with URI-type spam.

It seems to hit much of the same things as fresh.spameatingmonkey.net

YMMV.

Ken

On 5/27/2011 3:17 AM, Andreas Schulze wrote:
> Hi all,
>
> yesterday I learned about "day old bread", a list of domains registered
> in the last five days.
> I found information from 2007:
> http://mail-archives.apache.org/mod_mbox/spamassassin-users/200704.mbox/<4615e4b7.5010...@inetmsg.com>
>
> Does anybody have current experience with it?
>
> Thanks
"day old bread" DNSBL
Hi all,

yesterday I learned about "day old bread", a list of domains registered in the last five days.

I found information from 2007:
http://mail-archives.apache.org/mod_mbox/spamassassin-users/200704.mbox/<4615e4b7.5010...@inetmsg.com>

Does anybody have current experience with it?

Thanks

--
Best regards
Andreas Schulze