Re: [HACKERS] SIGUSR1 pingpong between master and autovacuum launcher causes crash

Zdenek Kotala Mon, 24 Aug 2009 04:46:31 -0700

Tom Lane wrote on Sat, 22 Aug 2009 at 09:56 -0400:
> Zdenek Kotala <[email protected]> writes:
> > There are most important records from yesterdays issues. 
> > Messages:
> > ---------
> > Aug 20 11:14:54 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap 
> > space to grow stack for pid 507 (postgres)
> 
> Hmm, that seems to confirm the idea that something had run the machine
> out of memory/swap space, which would explain the repeated ENOMEM fork
> failures.  But we're still no closer to understanding how come the
> delay in the avlauncher didn't do what it was supposed to.


I found the hungry process that was eating up all the memory, and
fortunately it is not postgres :-).

I also ran the following dtrace script:

dtrace -n 'syscall::kill:entry / execname=="postgres"/ { printf("%i  %s, %i->%i : %i", timestamp, execname, pid, arg0, arg1); }'

and it showed the following (slightly modified) output:

<snip>
CPU       Timestamp[ns]         diff[ms]        caller          callee  sig
0       2750745000052090        899,96          28604   ->      28608   16
3       2750745100280460        100,23          28608   ->      28604   16
1       2750746000144690        899,86          28604   ->      28608   16
3       2750746100380940        100,24          28608   ->      28604   16
2       2750747000135380        899,75          28604   ->      28608   16
3       2750747100171650        100,04          28608   ->      28604   16
0       2750748000101050        899,93          28604   ->      28608   16
3       2750748100331900        100,23          28608   ->      28604   16
1       2750749000148550        899,82          28604   ->      28608   16
3       2750749100386640        100,24          28608   ->      28604   16
2       2750750000095040        899,71          28604   ->      28608   16
3       2750750100127780        100,03          28608   ->      28604   16

We can see that the AV launcher really does wait 100 ms, but that is
not enough when the system is under stress.
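
For reference, the diff[ms] column is the gap between consecutive kill()
calls, derived from the nanosecond timestamps. A minimal sketch of that
conversion follows; the function name diff_ms is made up for
illustration and is not part of the dtrace script.

```c
#include <assert.h>
#include <stdint.h>

/* Gap between two DTrace timestamps (nanoseconds), in milliseconds. */
static double
diff_ms(uint64_t earlier_ns, uint64_t later_ns)
{
    /* Timestamps are expected in chronological order. */
    assert(later_ns >= earlier_ns);
    return (double) (later_ns - earlier_ns) / 1e6;
}
```

Applied to the first two rows, diff_ms(2750745000052090ULL,
2750745100280460ULL) gives about 100.23, which matches the table.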

I tested Alvaro's patch and it works, because it no longer leads to
stack consumption, but it exposes another bug in the
StartAutovacuumWorker() code. When fork fails, the bn structure is
freed, but ReleasePostmasterChildSlot() should be called as well. See
the error:

2009-08-24 11:50:20.360 CEST 3468 FATAL:  no free slots in PMChildFlags array

and the comment in the source code:

/* Out of slots ... should never happen, else postmaster.c messed up */
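
The slot leak can be illustrated with a small stand-alone model. This is
only a sketch under my own assumptions, not the postmaster code: the
fixed-size array, MAX_SLOTS, the Worker struct, and the functions
assign_slot(), release_slot(), slots_in_use(), reset_slots() and
start_worker() are all hypothetical stand-ins for the PMChildFlags
bookkeeping, and the fork_ok flag simply simulates whether fork()
succeeded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_SLOTS 4     /* hypothetical size, chosen small for the demo */

static bool slot_used[MAX_SLOTS];

typedef struct Worker
{
    int child_slot;     /* 1-based slot number held by this worker */
} Worker;

/* Mark every slot as free again. */
static void
reset_slots(void)
{
    for (int i = 0; i < MAX_SLOTS; i++)
        slot_used[i] = false;
}

/* Return a free slot number (1-based), or 0 if none is left. */
static int
assign_slot(void)
{
    for (int i = 0; i < MAX_SLOTS; i++)
    {
        if (!slot_used[i])
        {
            slot_used[i] = true;
            return i + 1;
        }
    }
    return 0;
}

static void
release_slot(int slot)
{
    assert(slot >= 1 && slot <= MAX_SLOTS);
    slot_used[slot - 1] = false;
}

static int
slots_in_use(void)
{
    int n = 0;

    for (int i = 0; i < MAX_SLOTS; i++)
        n += slot_used[i] ? 1 : 0;
    return n;
}

/*
 * Try to start one worker.  fork_ok simulates the result of fork();
 * release_on_failure selects the fixed behaviour (true) or the leaky
 * one (false).  Returns the worker on success (caller frees it),
 * NULL on failure.
 */
static Worker *
start_worker(bool fork_ok, bool release_on_failure)
{
    Worker *bn = malloc(sizeof(Worker));

    if (bn == NULL)
        return NULL;

    bn->child_slot = assign_slot();
    if (bn->child_slot == 0)
    {
        free(bn);       /* out of slots */
        return NULL;
    }

    if (fork_ok)
        return bn;

    /* Simulated fork failure: free bn, and give the slot back if fixed. */
    if (release_on_failure)
        release_slot(bn->child_slot);
    free(bn);
    return NULL;
}
```

In this model, calling start_worker(false, false) MAX_SLOTS times leaves
every slot marked as used, so a later start_worker(true, false) fails
even though fork would succeed. With release_on_failure set to true,
repeated failures leave slots_in_use() at 0.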

I think that Alvaro's patch is good and that it fixes the crash. I also
think that the AV launcher could wait a little longer than 100 ms. When
the system cannot fork, I don't see any reason to hurry to repeat the
fork operation. Maybe 1 s is a good compromise.

        Zdenek

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
