On 10/20/2018 7:24 AM, Peer Heinlein wrote:
we're monitoring the number of active smtpd processes to make sure that
we do not reach the max-proc limit from master.cf.
If a client disconnects very early, the smtpd is still "unused" and
remains in server memory, waiting for the next connection.
If a server was flooded with a short peak of new connections, a server
could have $process_limit instances remaining ready-to-run in memory.
In those situations we're seeing false positives in our monitoring.
The number I found most useful to indicate something was going wrong is
the number of messages in the queue. For the servers I manage, normally
that number would be single digit, maybe get to two digits on occasion.
When something gets broken, the number of messages in the queue tends to
balloon. There are two primary causes I've seen for a large queue: 1)
A particularly massive email storm, either spam or internally generated
messages. 2) Delivery problems. There are lots of things that can cause
delivery problems. The most common problem I ran into was one of the
webservers deciding that it needed to send thousands of messages.
Waiting for those to clear out on their own so normal mail can make it
through could take DAYS.
I would typically get notified about a problem with email after an hour
or two where no messages were getting through, which is why I eventually
added a monitor for the queue size, so I could know about the problem
BEFORE it was noticed by high-profile people at the company. With that,
I could fix the problem quickly and find the right developer to chew out
for sending thousands of messages.
For a particularly busy server, you probably would want to set the queue
size alarm threshold at a fairly large number (at least 1000), but for
one that's not very busy, more than about 100 is probably enough of a
reason to investigate and see if there's a problem. Calculating the
total number of messages in the queue would be as simple as counting the
files in some of the directories in /var/spool/postfix. You could
potentially run the 'mailq' command and parse its output, but I have
seen that take a REALLY long time to finish, so counting files in the
spool directories is probably better.
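
As a rough illustration of the file-counting approach, here is a minimal
Python sketch. It is not what I actually ran. The queue directory names
(maildrop, incoming, active, deferred, hold) are the usual Postfix ones,
but the function names and the default threshold are made up for the
example, and it assumes the script runs as a user that can read the
spool (normally root). It walks subdirectories because some queues are
split into hashed subdirectories.

```python
import os

# Assumption: the standard Postfix queue directory names under the spool root.
QUEUE_DIRS = ("maildrop", "incoming", "active", "deferred", "hold")


def count_queue_files(spool_root="/var/spool/postfix", queues=QUEUE_DIRS):
    """Return a dict mapping each queue name to the number of files in it."""
    counts = {}
    for queue in queues:
        total = 0
        path = os.path.join(spool_root, queue)
        # os.walk descends into hashed subdirectories and yields nothing
        # for a directory that is missing or unreadable.
        for _dirpath, _dirnames, filenames in os.walk(path):
            total += len(filenames)
        counts[queue] = total
    return counts


def queue_alarm(counts, threshold=100):
    """Hypothetical check: True when the total file count exceeds threshold."""
    return sum(counts.values()) > threshold


if __name__ == "__main__":
    counts = count_queue_files()
    print(counts, "ALARM" if queue_alarm(counts) else "ok")
```

For a busy server you would pass a larger threshold, for example
queue_alarm(counts, threshold=1000).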
Thanks,
Shawn