It happened again :(
Not in connection with backup, but in another situation with high load.

Output of ps
http://div.org/postfix_debug/postfix.processes.txt  

http://div.org/postfix_debug/stack_trace.28848  - qmgr
http://div.org/postfix_debug/stack_trace.7175 - smtp

http://div.org/postfix_debug/core.28848  
http://div.org/postfix_debug/core.7175   

The part of the log with the last qmgr and smtp lines before the hang
(grep -i "watchdog" finds nothing in it):
http://div.org/postfix_debug/maillog.12.02.09
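
In case anyone wants to repeat the log check, here is a rough Python sketch of it: find the last qmgr and smtp line and look for "watchdog" case-insensitively. The regex assumes the syslog layout seen in the excerpts quoted below; the function names are made up for illustration.

```python
import re

# Assumed layout, e.g.:
# "Jan 25 05:59:19 hotell01 postfix/smtp[595]: fatal: watchdog timeout"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ "
    r"postfix/(?P<daemon>[\w-]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def last_lines(lines, daemons=("qmgr", "smtp")):
    """Return the last log line seen for each of the given Postfix daemons."""
    last = {}
    for line in lines:
        line = line.rstrip("\n")
        m = LINE_RE.match(line)
        if m and m.group("daemon") in daemons:
            last[m.group("daemon")] = line  # later lines overwrite earlier ones
    return last


def watchdog_hits(lines):
    """Rough equivalent of grep -i "watchdog" over the same lines."""
    return [line.rstrip("\n") for line in lines if "watchdog" in line.lower()]
```

With a local copy of the log (the file name here is just taken from the URL above), last_lines(open("maillog.12.02.09")) gives one line per daemon, and an empty list from watchdog_hits() matches the "no hits" result.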

> I am guessing a "ready" indication arrived for the private/smtp socket,
> but accept() blocked indefinitely. This would then be a kernel issue.

Do the stack traces above look like that kind of blocked accept()?

Thanks
Gaute


> On Mon, Feb 02, 2009 at 05:26:10PM +0100, Gaute Amundsen wrote:
> > On Monday 02 February 2009 15:43:19 Victor Duchovni wrote:
> > > On Mon, Feb 02, 2009 at 01:50:30PM +0100, Gaute Amundsen wrote:
> > > > Jan 25 05:59:19 hotell01 postfix/smtp[595]: fatal: watchdog timeout
> > > > Jan 25 05:59:20 hotell01 postfix/master[734]: warning: process
> > > > /usr/libexec/postfix/smtp pid 595 exit status 1
> > > > Jan 25 05:59:20 hotell01 postfix/master[734]: warning:
> > > > /usr/libexec/postfix/smtp: bad command startup -- throttling
> > >
> > > This happens when the smtp(8) process has been stuck waiting for
> > > something to happen for 5 hours. What was happening around 00:59:xx on
> > > the same day?
> >
> > Apparently nothing in particular:
> >
> > http://pastebin.ca/1325397
>
> Jan 25 00:56:53 hotell01 postfix/qmgr[738]: B75CA147967:
> from=<aaaa...@...>, size=29074, nrcpt=1 (queue active)
>
> The delivery agent scheduled to handle this message locked up for 5
> hours and gave up. It got stuck before reporting "busy" to the master
> daemon, so no other smtp(8) processes were allocated.
>
> > Our Munin (http://munin.projects.linpro.no/) has lost the fine details
> > that far back, but there is a regular high peak in iostat just before
> > 01:00 every night. Backup related, I guess.
> >
> > Both today and Jan 25 were a Monday, so I had a look at cron.weekly, which
> > runs
>
> Perhaps your system runs out of resources during backup, and perhaps when
> this happens the system behaves in ways it should not.
>
> I am guessing a "ready" indication arrived for the private/smtp socket,
> but accept() blocked indefinitely. This would then be a kernel issue.
>
> If this happens again, you need to catch the stuck smtp(8) *before* the
> watchdog timer expires, and get a core file via "gcore". Then report a
> stack trace of the process.
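
For anyone else following the thread, the capture step described above could be scripted roughly like this. It is only a sketch: it assumes gdb and its gcore script are on the PATH, takes the smtp binary path from the master warnings quoted earlier, and the helper names are hypothetical.

```python
import subprocess


def capture_commands(pid, binary="/usr/libexec/postfix/smtp", prefix="core"):
    """Build the gcore and gdb command lines for one stuck process."""
    # gcore -o PREFIX PID writes the core to PREFIX.PID
    core_file = f"{prefix}.{pid}"
    gcore_cmd = ["gcore", "-o", prefix, str(pid)]
    # gdb in batch mode: print a backtrace from the core, then exit
    gdb_cmd = ["gdb", "-batch", "-ex", "bt", binary, core_file]
    return gcore_cmd, gdb_cmd


def capture(pid, **kwargs):
    """Dump a core of the running process and return its stack trace as text."""
    gcore_cmd, gdb_cmd = capture_commands(pid, **kwargs)
    subprocess.run(gcore_cmd, check=True)
    result = subprocess.run(gdb_cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

This has to be run as a user allowed to attach to the process, and before the watchdog timer kills it; for qmgr the binary argument would need to point at the qmgr executable instead.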

