On 10 Jun 2016, at 6:17, Merijn van den Kroonenberg wrote:
> [...]
>
> From the manual:
>
> This is a best-effort advisory setting, processing will not be abruptly
> aborted at an arbitrary point in processing when the time limit is
> exceeded, but only on reaching one of locations in the program flow
> equipped with a time test. Currently equipped with the test are the main
> checking loop, asynchronous DNS lookups, plugins which are calling
> external programs. Rule evaluation is guarded by starting a timer (alarm)
> on each set of compiled rules.
The last line is critical. The alarm isn't even on individual rules but
rather on precompiled sets of rules. I already knew it was soft but that
particular detail had not stuck in any of the many times I've read that
man page. Thanks for quoting it.
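To make the granularity concrete, here's a minimal hand-rolled sketch of
that kind of guard (not SpamAssassin's actual code; run_rule_set, the
$rule hash layout, and the deadline argument are all invented for
illustration). One alarm covers the whole set, so an individual rule that
runs long is only caught after its own match finally returns:

    use strict;
    use warnings;

    # Illustrative only: one timer guards a whole set of precompiled
    # rules, not each individual rule.
    sub run_rule_set {
        my ($rules, $body, $deadline_secs) = @_;   # hypothetical arguments
        my @hits;
        eval {
            local $SIG{ALRM} = sub { die "__timeout__\n" };
            alarm($deadline_secs);
            for my $rule (@$rules) {
                # Each $rule->{re} is a precompiled qr// pattern. If one
                # of them backtracks for minutes, the ALRM handler can't
                # run until that single match returns.
                push @hits, $rule->{name} if $body =~ $rule->{re};
            }
            alarm(0);
        };
        alarm(0);   # make sure the timer is off even if the eval died
        my $timed_out = ($@ && $@ eq "__timeout__\n") ? 1 : 0;
        return (\@hits, $timed_out);
    }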
> What does this mean, can still a single operation take more than this
> time_limit? But I guess the timer on the rules means the rules at least
> cannot take more than time_limit, right?
Nope: time_limit is in seconds and I had it set to 270. The default is
300, which is half the canonical SMTP EOD timeout but matches the
canonical "server timeout" (how long a server should wait for a client
command; see RFC 5321), and I'm sure there's misunderstanding of that
distinction out there. Since under peak loads the plumbing of that
particular system can add whole seconds and some clients can be a bit
impatient, I gave it what I thought was very generous additional headroom,
although it really didn't need it. The actual normal spamd scan-time
distribution there is roughly 2/3 under 1 second, 90% under 2 seconds,
99% under 10 seconds, and 99.9% under a minute. There's a long thin tail
out to ~2 minutes, but until last week I'd never had anything actually hit
the timeout that I can recall, and the first bad rule was hardly fresh: I
think it is >8 years old, and the others I found had not caused trouble in
their multiple (3?) years of existence.
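For reference, that knob is just a number of seconds in the site-wide
configuration (270 is my setting, 300 the shipped default), something
along these lines in local.cf:

    # local.cf -- advisory per-message scan deadline, in seconds
    time_limit 270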
I figured out which rule was the proximal cause by running the message
through the spamassassin script with rule debugging turned on, so I could
narrow down the bad rule based on what matched right before
TIME_LIMIT_EXCEEDED.
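Roughly like this (the file names are placeholders); the debug output goes
to stderr, so keeping it in its own file makes it easy to grep for the
last few rule hits before the timeout:

    # run the standalone script on the offending message with the
    # 'rules' debug channel enabled
    spamassassin -D rules < slow-message.eml > /dev/null 2> rules-debug.log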
I didn't dtrace the process to nail it down, but my hypothesis is that
Perl is ultimately calling its low-level internal equivalent of regexec()
which, like POSIX regexec(), has no timeout facility: it runs until it
matches or exhausts its supply of starting points. At the Perl level I've
never encountered any way to limit how long '=~' takes to operate, but I'm
no Larry Wall, so maybe there's some arcane way to do that. It seems clear
that if Perl does have a way to break out of an operator routine that is
taking too much clock time, it has not been used in SpamAssassin. I'd bet
on there NOT being such a feature and, further, on the process doing the
match maybe not even dying immediately from a SIGKILL while it's inside
that call.
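That hypothesis is easy to poke at from plain Perl. A minimal sketch (the
pattern and input are contrived, and how slow the match actually is
depends on your Perl version and its regex optimizer): under the default
deferred "safe" signal handling, the ALRM handler doesn't get to run until
the match op returns, however long that takes.

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $fired_at;
    $SIG{ALRM} = sub { $fired_at = time() };

    # Contrived backtracking bait: many ways to carve up the a's, and the
    # trailing '!' guarantees the overall match fails.
    my $line    = ('a' x 28) . '!';
    my $started = time();
    alarm(1);                              # ask for a signal after 1 second
    my $matched = $line =~ /^(?:a|aa)+$/;  # a single slow '=~'
    alarm(0);

    printf "match returned after %.1fs; ALRM handler ran %s\n",
        time() - $started,
        defined $fired_at ? sprintf('after %.1fs', $fired_at - $started)
                          : 'never (match finished first)';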
What most annoys me about this is that the potential for blowing up
systems with REs is the first thing one learns about them in a formal
setting (rather than just by reading man pages above the BUGS section).
Back when I first got that warning, the emphasis was on the ability of REs
to compile to disastrous size, but back then that meant a few megabytes,
and who cares about that today? However, I also got the warning, decades
ago, that '.*' could make an RE take a long time if you didn't take care
to limit your input size and write the RE to rule out most starting points
fast. But again, absolute sizes matter, and until last week I'd not
envisioned anyone "optimizing" HTML by removing all formally unnecessary
whitespace, including line breaks. This is obviously somewhat rare, but
it's apparently A Thing HTML Parsers Like, and this was a big hunk of
HTML, so I guess optimizing parsing was important...
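To illustrate the shape of the tightening (these patterns are invented for
illustration, not any of the actual rules involved): against a single
newline-free "line" of minified HTML, every unbounded wildcard is free to
roam the whole document and can retry from a huge number of split points
when the overall match fails, while bounded gaps both reject most starting
points quickly and cap how far the engine can wander from each one.

    # Invented examples, not real SpamAssassin rules.
    # Sloppy: each .* may span the entire minified document.
    my $sloppy  = qr/<a\b.*href.*click\s+here/is;

    # Tighter: the gaps are bounded and exclude characters that would
    # carry the match past the end of the tag or past the start of the
    # next one, so a failed starting point is abandoned after a bounded
    # amount of work.
    my $tighter = qr/<a\b[^>]{0,256}\bhref\b[^<]{0,512}click\s+here/is;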
It will be interesting to watch this week's scan-time stats and see
whether my tightening up of sloppy rules has an impact. I expect it will,
since I now have a concrete theory to explain that long tail out to 2
minutes, which until now I'd ignored as pure noise.