Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures

Tom Lane Tue, 09 Feb 2016 11:14:03 -0800

I wrote:
> Noah Misch <n...@leadboat.com> writes:
>> On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
>>> We've seen variants
>>> on this theme on half a dozen machines just in the past week --- and it
>>> seems to mostly happen in 9.5 and HEAD, which is fishy.


>> It has been affecting only the four AIX animals, which do share hardware.
>> (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)

> Certainly your AIX critters have shown this a bunch, but here's another
> current example:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23

While we've not seen an actual test failure since I installed the tracing
code, we do have reports from the four AIX critters, and they are pretty
instructive.  See for example
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-09%2016%3A27%3A05&stg=stopdb-C-2
which is

Snapshot: 2016-02-09 16:27:05

waiting for server to shut down at 2016-02-09 18:04:09.159 
UTC................................................................................
 done
server stopped
=========== db log file ==========
2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:2] LOG:  received fast shutdown 
request at 2016-02-09 18:04:09.159 UTC
2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:3] LOG:  aborting any active 
transactions
2016-02-09 18:04:09.160 UTC [56ba2944.14501d8:2] LOG:  autovacuum launcher 
shutting down at 2016-02-09 18:04:09.160 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:40] LOG:  shutting down at 
2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:41] LOG:  CheckPointGuts starting 
at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:42] LOG:  CheckPointCLOG() done 
at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:43] LOG:  CheckPointCommitTs() 
done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:44] LOG:  CheckPointSUBTRANS() 
done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:45] LOG:  CheckPointMultiXact() 
done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:46] LOG:  CheckPointPredicate() 
done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:47] LOG:  CheckPointRelationMap() 
done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:48] LOG:  
CheckPointReplicationSlots() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:49] LOG:  CheckPointSnapBuild() 
done at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:50] LOG:  
CheckPointLogicalRewriteHeap() done at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:51] LOG:  BufferSync(5) beginning 
to write 632 buffers at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:52] LOG:  BufferSync(5) done, 
wrote 632/632 buffers at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:53] LOG:  CheckPointBuffers() 
done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:54] LOG:  
CheckPointReplicationOrigin() done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:55] LOG:  CheckPointGuts done at 
2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.967 UTC [56ba2944.363000e:56] LOG:  checkpoint WAL record 
flushed at 2016-02-09 18:04:09.967 UTC
2016-02-09 18:04:09.967 UTC [56ba2944.363000e:57] LOG:  pg_control updated at 
2016-02-09 18:04:09.967 UTC
2016-02-09 18:04:29.375 UTC [56ba2944.363000e:58] LOG:  in mdpostckpt, 3072 
unlinks remain to do at 2016-02-09 18:04:29.375 UTC
2016-02-09 18:04:48.207 UTC [56ba2944.363000e:59] LOG:  in mdpostckpt, 2048 
unlinks remain to do at 2016-02-09 18:04:48.207 UTC
2016-02-09 18:05:07.381 UTC [56ba2944.363000e:60] LOG:  in mdpostckpt, 1024 
unlinks remain to do at 2016-02-09 18:05:07.381 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:61] LOG:  smgrpostckpt() done at 
2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:62] LOG:  RemoveOldXlogFiles() 
done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:63] LOG:  TruncateSUBTRANS() done 
at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:64] LOG:  shutdown checkpoint 
complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:65] LOG:  ShutdownCLOG() complete 
at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:66] LOG:  ShutdownCommitTs() 
complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:67] LOG:  ShutdownSUBTRANS() 
complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:68] LOG:  database system is shut 
down at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:26.165 UTC [56ba2944.ee01bc:4] LOG:  lock files all released 
at 2016-02-09 18:05:26.165 UTC

This says that the bulk of the shutdown time is being spent in mdpostckpt,
where we are finally unlinking files belonging to relations that were
dropped since the last checkpoint.  That took 75 seconds here, while the
parts before and after it took less than a second each.  There were
somewhere between 3000 and 4000 files to be unlinked, so it's not
astonishing that it took some time, but our other critters seem to get
through this a lot quicker.  (All four AIX critters reported pretty
similar results, BTW.)

So my diagnosis at this point is

(1) Slow file system, specifically slow unlink, is the core of the
problem.  (I wonder if the AIX critters are using an NFS filesystem?)

(2) The apparent increase in failure rate in more recent branches is
probably real, but the likely cause is simply more test cases = more
stuff to be cleaned up at the end = mdpostckpt takes longer.

I'm not sure whether there's anything to be gained by leaving the tracing
code in there till we see actual buildfarm fails.  There might be another
slowdown mechanism somewhere, but I rather doubt it.  Thoughts?

In any case, we should proceed with fixing things so that buildfarm owners
can specify a higher shutdown timeout for especially slow critters.

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures

Reply via email to