Re: Large (usually legitimate) HTML mails choking SA

2011-05-27 Thread Kris Deugau

Karsten Bräckelmann wrote:

However, using (?:\s|&nbsp\;)* also does the trick. Yes, keeping the
nasty asterisk quantifier. The difference is merely dropping the \n from
the alternation, which is part of \s whitespace anyway.

Wondering if this is a case where Perl fails to optimize out the \n.
Which would result in an alternation with overlap...


Hmm.  This may be a Perl-version-specific (or 
which-flags-Perl-was-built-with) thing, then, because I've been adding \n 
to rawbody rules where I want to match multiple physical lines; 
\s *hasn't* been matching newlines - at least, not all the time.


-kgd


Re: Large (usually legitimate) HTML mails choking SA

2011-05-27 Thread Karsten Bräckelmann
On Fri, 2011-05-27 at 13:14 -0400, Kris Deugau wrote:
> Karsten Bräckelmann wrote:

> > Yes, that sounds like the culprit indeed is one or more custom rule. If
> > that "much faster" equals twice as fast,
> 
> Probably closer to 4-6x;  dual PIII/866 -> Core i3 3GHz.

Sure -- that "twice" was just a quickly assumed lower bound, which
still shows the dramatic difference: the custom rules burning a
whopping 25 times the CPU of the stock set.

> > Bisection is your friend.
> >
> > Go hunt down that bugger, that in conjunction with the specific sample
> > kills your performance. Once you found it, maybe you can post it?
> 
> Seems to have been this:
> 
> rawbody TOO_MANY_DIVS /(?:<[Dd][Ii][Vv]>(?:\s|\n|&nbsp\;)*){6}/

Aha! Yes, that nesting of quantifiers sure looks like a prime candidate.
Even though this isn't the pure evil form -- which would be to have two
alternatives with overlap in sub-patterns.

Or maybe it is. Frankly, not sure what exactly causes the RE to go
berserk.

> Changing the * to {,100} drops the processing time down to ~8s.

Confirmed, grabbed your sample and this eliminates the issue.

However, using (?:\s|&nbsp\;)* also does the trick. Yes, keeping the
nasty asterisk quantifier. The difference is merely dropping the \n from
the alternation, which is part of \s whitespace anyway.

Wondering if this is a case where Perl fails to optimize out the \n.
Which would result in an alternation with overlap...
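Since \n is already a member of the \s class, the two branches of that alternation overlap. A quick Python sketch of the overlap (Perl's \s behaves the same way for this purpose):

```python
import re

# \s already matches newline, so an alternation like (?:\s|\n) has
# overlapping branches -- every \n can match either side.
assert re.fullmatch(r"\s", "\n") is not None

# Both patterns accept the same strings; the second merely gives the
# backtracking engine two ways to match each newline.
text = "<div>\n \n </div>"
assert re.search(r"(?:\s)*</div>", text)
assert re.search(r"(?:\s|\n)*</div>", text)
```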


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: Large (usually legitimate) HTML mails choking SA

2011-05-27 Thread Kris Deugau

John Hardin wrote:

On Thu, 26 May 2011, Kris Deugau wrote:


Whitelisting these once they're found lets them bypass SA altogether,
but in the meantime they get stuck in the mail queue.

Has anyone got any suggestions for decreasing the load SA imposes
trying to process one of these?


Any possibility of getting a sample?


Eugh, that was *nasty*.

Thoroughly anonymized version at 
http://www.deepnet.cx/~kdeugau/spamtools/nastyhtml.eml.


And the HTML is really, truly, *nasty*.  I've never seen such a 
spectacular mess that's still legal HTML, even from Word or Frontpage.


And of course, because it's so nasty, I had to hand-edit it to anonymize 
it because otherwise any HTML editor would have cleaned it up   >_<


-kgd


Re: Large (usually legitimate) HTML mails choking SA

2011-05-27 Thread Kris Deugau

Karsten Bräckelmann wrote:

On Fri, 2011-05-27 at 10:38 -0400, Kris Deugau wrote:

Mmmm.  I don't *think* so, but testing the message on a stock SA 3.3.1
took "only" a minute (on slow hardware) vs. 13 minutes (on my much faster desktop).


The latter being the production system with the custom rules, or at
least having an identical set of custom rules?


Yeah;  I create the rules on my desktop (usually with an example spam on 
hand to make sure the rule hits what I intended it to hit), commit to 
svn, and periodically merge changes to a branch that's autopublished in 
something resembling the same way as the official stock rules and JM's 
SOUGHT rules.



Yes, that sounds like the culprit indeed is one or more custom rule. If
that "much faster" equals twice as fast,


Probably closer to 4-6x;  dual PIII/866 -> Core i3 3GHz.


Bisection is your friend.

Go hunt down that bugger, that in conjunction with the specific sample
kills your performance. Once you found it, maybe you can post it?


Seems to have been this:

rawbody TOO_MANY_DIVS   /(?:<[Dd][Ii][Vv]>(?:\s|\n|&nbsp\;)*){6}/
describe TOO_MANY_DIVS  6 or more <div> tags in a row
score TOO_MANY_DIVS 0.75

Changing the * to {,100} drops the processing time down to ~8s.
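The effect of such a bound can be sketched in Python (pattern adapted from the Perl rule above and simplified; the exact bound is a local judgment call):

```python
import re

# Bounded version of the inter-tag whitespace match: at most 100
# whitespace characters between consecutive <div> tags.
bounded = re.compile(r"(?:<[Dd][Ii][Vv]>\s{0,100}){6}")

sample = "<div>\n  " * 6 + "text"
assert bounded.search(sample)

# The bound caps how much one repetition can consume, which in turn
# caps how many ways the engine can split a long whitespace run.
long_gap = "<div>" + " " * 500
assert not bounded.search(long_gap * 6)
```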

I've got a number of similar rules for other "many logical/physical 
linebreaks with no content".  I don't have a specific spample to point 
to just now, but from memory the original targets really did have a 
widely varying number of linebreaks or whitespace (logical or otherwise) 
in between the HTML tags, and I've been bitten before with applying 
bounds to matches (related rules for garbage HTML comments) not being 
*large* enough.  O_o


This particular message has page after page of:

=09=09=09
=09=09=09
=09=09=09
=09
=09
=09

etc, with a few  or  tags for excitement.

-kgd


Re: Large (usually legitimate) HTML mails choking SA

2011-05-27 Thread Karsten Bräckelmann
On Fri, 2011-05-27 at 10:38 -0400, Kris Deugau wrote:
> Karsten Bräckelmann wrote:
> > > However, we've just had a couple of *legitimate* messages get stuck for
> > > essentially the same reason - a whole lot of pathologically bad HTML.
> >
> > Rings a bell. Such reports usually turned out to be caused by custom
> > rules. Any custom rawbody rules, in particular ones matching HTML tags,
> 
> Yes, a few.
> 
> > or otherwise prone to trigger RE backtracking? (That is, may consume
> > large sub-strings, before a following sub-pattern.)
> 
> Mmmm.  I don't *think* so, but testing the message on a stock SA 3.3.1 
> took "only" a minute (on slow hardware) vs. 13 minutes (on my much faster desktop).

The latter being the production system with the custom rules, or at
least having an identical set of custom rules?

Yes, that sounds like the culprit indeed is one or more custom rules. If
that "much faster" equals twice as fast, your custom rules are taking
25(!) times as long as the complete stock rule-set, including all the
parsing and stuff.

Bisection is your friend.

Go hunt down that bugger that, in conjunction with the specific sample,
kills your performance. Once you find it, maybe you can post it?
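The bisection idea can be sketched generically in Python (the `is_slow` callback here is a stand-in; in practice it would time a SpamAssassin run against the problem sample with only a subset of the custom rules loaded):

```python
def find_culprit(rules, is_slow):
    """Bisect a rule list to find the one rule that makes scanning slow.

    `is_slow(subset)` stands in for timing a scan of the problem
    message with only `subset` of the custom rules enabled.
    Assumes a single slow rule is responsible.
    """
    while len(rules) > 1:
        half = len(rules) // 2
        first = rules[:half]
        # Keep whichever half still reproduces the slowdown.
        rules = first if is_slow(first) else rules[half:]
    return rules[0]

# Toy check: pretend TOO_MANY_DIVS is the expensive rule.
rules = ["RULE_A", "RULE_B", "TOO_MANY_DIVS", "RULE_C"]
culprit = find_culprit(rules, lambda subset: "TOO_MANY_DIVS" in subset)
assert culprit == "TOO_MANY_DIVS"
```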


> I have a couple of instances of [a-z]+ and similar;  is that effectively 
> as troublesome as .+ or .*?

That on its own (i.e. not nested inside an alternation, etc) is very
unlikely to be the issue, since it appears to be triggered by the HTML
in the message.





Re: Large (usually legitimate) HTML mails choking SA

2011-05-27 Thread darxus
On 05/27, John Hardin wrote:
> Yes. "*" is "zero or more, unbounded" and "+" is "one or more, unbounded".
> 
> It's much better to have an upper limit in body and rawbody rules,
> e.g. {0,80} or {1,80}
> 
> The upper limit may need some experimentation to set in specific
> cases, but even so, {0,255} can be much less painful than *.

So somebody should (open a bug to) go through all the rules we provide
and replace all instances of "*" with {0,255} and "+" with {1,255}?

> Header and URI texts are inherently fairly short so it's safer to
> use unbounded matches against them, but even so it's good idea to

But still vulnerable to regex DoS.

-- 
"I don't want to die... just yet... not while there's... women."
- J. Matthew Root, 8/23/02 (http://www.jmrart.com/)
http://www.ChaosReigns.com


Re: Large (usually legitimate) HTML mails choking SA

2011-05-27 Thread John Hardin

On Fri, 27 May 2011, Kris Deugau wrote:

I have a couple of instances of [a-z]+ and similar;  is that effectively as 
troublesome as .+ or .*?


Yes. "*" is "zero or more, unbounded" and "+" is "one or more, unbounded".

It's much better to have an upper limit in body and rawbody rules, e.g. 
{0,80} or {1,80}


The upper limit may need some experimentation to set in specific cases, 
but even so, {0,255} can be much less painful than *.


Header and URI texts are inherently fairly short so it's safer to use 
unbounded matches against them, but even so it's a good idea to simply get 
in the habit of always using bounded matches when writing rules.


--
 John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
 jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  How can you reason with someone who thinks we're on a glidepath to
  a police state and yet their solution is to grant the government a
  monopoly on force? They are insane.
---
 3 days until Memorial Day - honor those who sacrificed for our liberty


Re: Large (usually legitimate) HTML mails choking SA

2011-05-27 Thread David F. Skoll
On Fri, 27 May 2011 10:38:17 -0400
Kris Deugau  wrote:

> I have a couple of instances of [a-z]+ and similar;  is that
> effectively as troublesome as .+ or .*?

It could be, depending on what else is in the regex.  There's a fairly
nice Wikipedia article about evil regexes:

http://en.wikipedia.org/wiki/ReDoS#Evil_regexes

When I write SA rules, I never use the * or + operators.  I always
use something like {0,40} or {1,40} just to be on the safe side.

(That still does not eliminate the possibility of exponential behaviour
from bad regexes, but it does offer some protection against bad behaviour
from unfortunate strings to be matched.)
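A small Python illustration of the kind of pattern that article describes (inputs kept deliberately short on purpose; with ~30 'a's the first pattern would run for minutes):

```python
import re

# Classic evil pattern: nested unbounded quantifiers with overlapping
# sub-patterns. On a non-matching input of n 'a's it explores ~2**n paths.
evil = re.compile(r"(a+)+$")
assert evil.match("a" * 10 + "!") is None   # ~1000 paths: still instant

# Bounded per the advice above: it does not remove backtracking, but it
# caps how much text the whole construct can consume.
bounded = re.compile(r"(a{1,40}){1,40}$")
assert bounded.match("a" * 10 + "!") is None
assert bounded.match("a" * 10)
```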

Regards,

David.


Re: Large (usually legitimate) HTML mails choking SA

2011-05-27 Thread Kris Deugau

Karsten Bräckelmann wrote:

On Thu, 2011-05-26 at 15:02 -0400, Kris Deugau wrote:

Every so often we get a message or two stuck in our inbound mail queue
because it took too long for SA to process during mail delivery.



However, we've just had a couple of *legitimate* messages get stuck for
essentially the same reason - a whole lot of pathologically bad HTML.


Rings a bell. Such reports usually turned out to be caused by custom
rules. Any custom rawbody rules, in particular ones matching HTML tags,


Yes, a few.


or otherwise prone to trigger RE backtracking? (That is, may consume
large sub-strings, before a following sub-pattern.)


Mmmm.  I don't *think* so, but testing the message on a stock SA 3.3.1 
took "only" a minute (on slow hardware) vs. 13 minutes (on my much faster desktop).


I have a couple of instances of [a-z]+ and similar;  is that effectively 
as troublesome as .+ or .*?


...  Hm.  I also notice I have more custom local rules than there are 
stock rules.  I *really* need to get some testing infrastructure in 
place to trim that list down.  O_o


-kgd


Re: "day old bread" DNSBL

2011-05-27 Thread Ken A
Yes, URIBL_RHS_DOB is somewhat useful. It's not _very_ reliable on its 
own, though, so I use it in meta rules that add points for combinations 
with other signals common to URI-type spam.
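A hedged sketch of what such a meta rule can look like (the rule name LOCAL_DOB_COMBO and the score are illustrative local choices, not anything from this thread; URIBL_BLACK and RAZOR2_CHECK are stock rules used here only as example partners):

```
# Fires only when a day-old-bread hit coincides with another URI/content signal.
meta     LOCAL_DOB_COMBO   (URIBL_RHS_DOB && (URIBL_BLACK || RAZOR2_CHECK))
describe LOCAL_DOB_COMBO   Freshly registered domain plus another URI/content hit
score    LOCAL_DOB_COMBO   1.5
```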


It seems to hit many of the same things as fresh.spameatingmonkey.net.

ymmv.

Ken



On 5/27/2011 3:17 AM, Andreas Schulze wrote:

Hi all,

yesterday I learned about "day old bread", a list of domains registered in the 
last five days.
I found information from 2007:
http://mail-archives.apache.org/mod_mbox/spamassassin-users/200704.mbox/<4615e4b7.5010...@inetmsg.com>

Does anybody have current experience with it?

Thanks




"day old bread" DNSBL

2011-05-27 Thread Andreas Schulze
Hi all,

yesterday I learned about "day old bread", a list of domains registered in the 
last five days.
I found information from 2007:
http://mail-archives.apache.org/mod_mbox/spamassassin-users/200704.mbox/<4615e4b7.5010...@inetmsg.com>

Does anybody have current experience with it?

Thanks


-- 
Best regards

Andreas Schulze