On 10 Jun 2016, at 6:17, Merijn van den Kroonenberg wrote:

[...]
From the manual:

This is a best-effort advisory setting, processing will not be abruptly
aborted at an arbitrary point in processing when the time limit is
exceeded, but only on reaching one of locations in the program flow
equipped with a time test. Currently equipped with the test are the main
checking loop, asynchronous DNS lookups, plugins which are calling
external programs. Rule evaluation is guarded by starting a timer (alarm)
on each set of compiled rules.

The last line is the critical one. The alarm isn't even on individual rules but on whole precompiled sets of rules. I already knew the limit was soft, but that particular detail had never stuck with me in the many times I've read that man page. Thanks for quoting it.
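As best I can tell, the structure is conceptually something like the sketch below (my own illustration, not the project's actual code; the sub and variable names are invented). The timer wraps a whole compiled batch of rules, and since Perl's default deferred ("safe") signal delivery only runs the handler between operations, one pathological match inside the batch runs to completion before the die() can do anything:

    # Conceptual sketch only, not SpamAssassin's real code.
    eval {
        local $SIG{ALRM} = sub { die "__alarm__\n" };
        alarm $seconds_left;            # one timer per compiled rule set
        run_compiled_rule_set($pms);    # many =~ matches inside this call
        alarm 0;
    };
    alarm 0;
    if ($@ && $@ eq "__alarm__\n") {
        # record TIME_LIMIT_EXCEEDED and move on; nothing here can
        # interrupt a single match that is already in flight
    }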

What does this mean? Can a single operation still take more than this time_limit? But I guess the timer on the rules means the rules at least cannot take more than time_limit, right?

Nope: time_limit is in seconds, and I had it set to 270. The default is 300, which is half the canonical SMTP EOD timeout but matches the canonical "server timeout" (how long a server should wait for the next client command; see RFC 5321), and I'm sure there's some confusion over that distinction. Since under peak loads the plumbing of that particular system can add whole seconds and some clients can be a bit impatient, I gave it what I thought was very generous additional headroom, although it really shouldn't have needed it.

The actual normal spamd scantime distribution there is roughly 2/3 under 1 second, 90% under 2 seconds, 99% under 10 seconds, and 99.9% under a minute. There's a long, thin tail out to ~2 minutes, but until last week I can't recall anything ever actually hitting the timeout, and the first bad rule was hardly fresh: I think it's more than 8 years old, and the others I found had caused no trouble in their multiple (3?) years of existence.
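For anyone following along, the knob itself is just one line in local.cf (270 is my value; 300 is the shipped default):

    # local.cf -- overall best-effort advisory deadline for one scan, in seconds
    time_limit 270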

I figured out which rule was the proximal cause by running the message through the spamassassin script with rule debugging on, so I could narrow down the bad rule based on what matched right before TIME_LIMIT_EXCEEDED.

I didn't dtrace the process to nail it down, but my hypothesis is that ultimately Perl is calling its low-level internal equivalent of regexec(), which, like POSIX regexec(), has no timeout facility: it runs until it matches or exhausts the possible starting points in the input. At the Perl level I've never encountered any way to limit how long '=~' can take, but I'm no Larry Wall, so maybe there's some arcane way to do that. It seems clear that if Perl does have a way to break out of an operator routine that is taking too much clock time, it has not been used in SpamAssassin. I'd bet on there NOT being such a feature and, further, that a process stuck inside that call might not even die immediately on a SIGKILL.
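For the record, the narrowing-down step was an invocation along these lines (filenames are placeholders; debug output goes to stderr, and the 'rules' debug area logs each hit as it happens, so whatever fires just before TIME_LIMIT_EXCEEDED points at the neighborhood of the culprit):

    spamassassin -D rules < bad-message.eml > /dev/null 2> sa-debug.log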

What most annoys me about this is that the potential for blowing up systems with REs is the first thing one learns about them in a formal setting (rather than just from the parts of man pages above the BUGS section). Back when I first got that warning, the emphasis was on the ability of REs to compile to disastrous size, but back then that meant a few megabytes, and who cares about that today? However, I also got the warning, decades ago, that '.*' could make an RE take a very long time if you didn't take care to limit your input size and write the RE to rule out most starting points fast; but again, absolute sizes matter, and until last week I had not envisioned "optimizing" HTML by removing all formally unnecessary whitespace, including line breaks. That is obviously somewhat rare, but it's apparently A Thing HTML Parsers Like, and this was a big hunk of HTML, so I guess optimizing parsing was important...
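To make the shape of the problem concrete, here is a made-up pair of rules (not the actual offenders). Against ordinary body text they behave about the same; against one enormous line of whitespace-stripped HTML, the unbounded gaps in the first give the backtracking engine a huge number of ways to fail slowly, while the bounded gaps in the second make it give up almost immediately:

    body  HYPOTHETICAL_SLOPPY  /foo.*bar.*baz/i
    body  HYPOTHETICAL_TIGHT   /foo.{0,40}bar.{0,40}baz/i

Bounding the gaps (and writing the start of the pattern to reject most positions quickly) is the usual tightening: the worst case then grows roughly linearly with the size of the line instead of with some power of it.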

It will be interesting to watch the scantime stats this week and see whether my tightening up of the sloppy rules has an impact. I expect it will, since I now have a concrete theory to explain that long tail out to 2 minutes, which until now I'd ignored as pure noise.
