It happened again :( Not in connection with backup, but in another situation with high load.

Output of ps:
http://div.org/postfix_debug/postfix.processes.txt

Stack traces:
http://div.org/postfix_debug/stack_trace.28848 - qmgr
http://div.org/postfix_debug/stack_trace.7175 - smtp

Core files:
http://div.org/postfix_debug/core.28848
http://div.org/postfix_debug/core.7175

The bit of log with the last qmgr and smtp lines before the hang
(no hits for grep -i "watchdog"):
http://div.org/postfix_debug/maillog.12.02.09

> I am guessing a "ready" indication arrived for the private/smtp socket,
> but accept() blocked indefinitely. This would then be a kernel issue.

Does this look like that?

Thanks

Gaute

> On Mon, Feb 02, 2009 at 05:26:10PM +0100, Gaute Amundsen wrote:
> > On Monday 02 February 2009 15:43:19 Victor Duchovni wrote:
> > > On Mon, Feb 02, 2009 at 01:50:30PM +0100, Gaute Amundsen wrote:
> > > > Jan 25 05:59:19 hotell01 postfix/smtp[595]: fatal: watchdog timeout
> > > > Jan 25 05:59:20 hotell01 postfix/master[734]: warning: process
> > > > /usr/libexec/postfix/smtp pid 595 exit status 1
> > > > Jan 25 05:59:20 hotell01 postfix/master[734]: warning:
> > > > /usr/libexec/postfix/smtp: bad command startup -- throttling
> > >
> > > This happens when the smtp(8) process has been stuck waiting for
> > > something to happen for 5 hours. What was happening around 00:59:xx on
> > > the same day?
> >
> > Apparently nothing in particular:
> >
> > http://pastebin.ca/1325397
>
> Jan 25 00:56:53 hotell01 postfix/qmgr[738]: B75CA147967:
> from=<aaaa...@...>, size=29074, nrcpt=1 (queue active)
>
> The delivery agent scheduled to handle this message locked up for 5
> hours and gave up. It got stuck before reporting "busy" to the master
> daemon, so no other smtp(8) processes were allocated.
>
> > our Munin http://munin.projects.linpro.no/
> > has lost the fine details that far back but there is a regular high peak
> > on iostat just before 01:00 every night. Backup related I guess.
> >
> > both today and Jan 25 was a monday, so I had a look at cron.weekly which
> > runs
>
> Perhaps your system runs out of resources during backup, and perhaps when
> this happens the system behaves in ways it should not.
>
> I am guessing a "ready" indication arrived for the private/smtp socket,
> but accept() blocked indefinitely. This would then be a kernel issue.
>
> If this happens again, you need to catch the stuck smtp(8) *before* the
> watchdog timer expires, and get a core file via "gcore". Then report a
> stack trace of the process.
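
For anyone who wants to catch a stuck smtp(8) well inside the 5-hour watchdog window, here is a rough sketch of how the "gcore, then stack trace" advice could be scripted. It is only an illustration under several assumptions that are not from this thread: that ps supports the `etimes` (elapsed seconds) field, that one hour is a reasonable age threshold, and that cores should go under /tmp/core. The function names are made up for this example. Age alone does not prove a process is stuck, so treat the result as a list of candidates. The script only builds the command lines; it does not run them.

```python
# Assumption: an smtp(8) client older than this many seconds is worth a look.
# The value is a guess, chosen to be well below the 5-hour watchdog timeout.
STUCK_AFTER_SECONDS = 3600


def stuck_smtp_pids(ps_output, min_age=STUCK_AFTER_SECONDS):
    """Return PIDs of smtp(8) processes that are at least min_age seconds old.

    ps_output is the text printed by:  ps -eo pid=,etimes=,args=
    (the etimes field is an assumption; older ps versions may lack it).
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, etimes, args = fields
        if not (pid.isdigit() and etimes.isdigit()):
            continue
        # The daemon may show up as "smtp -t unix -u" or with a full path.
        command = args.split()[0]
        if command.rsplit("/", 1)[-1] == "smtp" and int(etimes) >= min_age:
            pids.append(int(pid))
    return pids


def debug_commands(pid, binary="/usr/libexec/postfix/smtp", prefix="/tmp/core"):
    """Build (but do not run) the gcore and gdb command lines for one PID."""
    core = f"{prefix}.{pid}"  # gcore -o PREFIX writes the core to PREFIX.PID
    return [
        ["gcore", "-o", prefix, str(pid)],
        ["gdb", "-batch", "-ex", "bt", binary, core],
    ]
```

The commands come back as argument lists, so they can be printed for review or handed to `subprocess.run` as root once the candidates have been checked by hand.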