On 10/20/2018 7:24 AM, Peer Heinlein wrote:
we're monitoring the number of active smtpd processes to make sure that
we do not reach the max-proc limit from master.cf.

If a client disconnects very early, the smtpd is still "unused" and
remains in server memory, waiting for the next connection.

If a server was flooded with a short peak of new connections, a server
could have $process_limit instances remaining ready-to-run in memory.

In those situations we're seeing false positives in our monitoring.

The number I found most useful as an indicator that something was going wrong is the number of messages in the queue.  For the servers I manage, that number is normally in the single digits and occasionally reaches double digits.

When something breaks, the number of messages in the queue tends to balloon.  There are two primary causes I've seen for a large queue:  1) A particularly massive email storm, either spam or internally generated messages.  2) Delivery problems.  There are lots of things that can cause delivery problems.  The most common problem I ran into was the first kind: one of the webservers deciding that it needed to send thousands of messages.  Waiting for those to clear out on their own so normal mail can make it through could take DAYS.

I would typically get notified about a problem with email after an hour or two where no messages were getting through, which is why I eventually added a monitor for the queue size, so I could know about the problem BEFORE it was noticed by high-profile people at the company.  With that, I could fix the problem quickly and find the right developer to chew out for sending thousands of messages.

For a particularly busy server, you probably would want to set the queue size alarm threshold at a fairly large number (at least 1000), but for one that's not very busy, more than about 100 is probably enough of a reason to investigate and see if there's a problem.  Calculating the number of messages in the queue is as simple as counting the files in some of the directories under /var/spool/postfix.  You could potentially run the 'mailq' command and parse its output, but I have seen that take a REALLY long time to finish, so counting files in the spool directories is probably better.
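
As a rough sketch of the file-counting approach, something like the following Python could work.  The function names and the list of queue directories are my own choices for illustration, not anything Postfix provides; the default threshold of 100 is just the low-traffic figure mentioned above.  It assumes one file per queued message and walks subdirectories because the deferred queue is normally split into hashed subdirectories.  It needs to run as a user that can read the spool (usually root); unreadable or missing directories are silently counted as zero.

```python
import os

# Queue directories under the Postfix spool (an assumption; adjust as needed).
QUEUE_DIRS = ("maildrop", "incoming", "active", "deferred", "hold")


def count_queue_files(spool_root="/var/spool/postfix", queue_dirs=QUEUE_DIRS):
    """Return {queue_name: file_count}, counting files recursively."""
    counts = {}
    for name in queue_dirs:
        total = 0
        # os.walk yields nothing for a missing or unreadable directory,
        # so such queues simply report 0.
        for _dirpath, _dirnames, filenames in os.walk(os.path.join(spool_root, name)):
            total += len(filenames)
        counts[name] = total
    return counts


def queue_alarm(counts, threshold=100):
    """True when the total number of queued files exceeds the threshold."""
    return sum(counts.values()) > threshold


if __name__ == "__main__":
    counts = count_queue_files()
    total = sum(counts.values())
    status = "ALARM" if queue_alarm(counts) else "ok"
    print(f"{status}: {total} queued messages {counts}")
```

Run from cron or a monitoring agent, the per-queue breakdown also hints at the cause: a large deferred count points at delivery problems, while a large incoming or active count looks more like a message storm.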

Thanks,
Shawn
