Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures

Noah Misch Thu, 11 Feb 2016 18:16:28 -0800

On Wed, Feb 10, 2016 at 10:55:10AM -0500, Tom Lane wrote:
> Interestingly, we seem to have managed to greatly reduce the "other"
> time (which I presume is basically mdpostchkpt unlinking) since 9.2.
> The worst case observed in HEAD is about 100s:
> 
> regression=# select ts,write,sync,total-write-sync as other, total from 
> chkpts order by total-write-sync desc limit 10;
>              ts             |  write  |  sync   |  other  |  total  
> ----------------------------+---------+---------+---------+---------
>  2016-02-09 08:27:04.365-08 |  14.479 | 298.043 | 100.586 | 413.108
>  2016-02-09 06:15:22.477-08 |  75.099 |  26.590 |  90.452 | 192.141
>  2016-02-09 16:31:34.661-08 | 254.462 | 186.246 |  51.541 | 492.249
>  2016-02-09 15:25:01.586-08 |   0.015 |   2.985 |  47.822 |  50.822
>  2016-02-09 11:40:13.565-08 | 287.327 | 314.218 |  44.059 | 645.604
>  2016-02-09 08:56:51.339-08 | 135.464 |  31.242 |  39.131 | 205.837
>  2016-02-09 20:23:52.797-08 |   1.309 |  12.155 |  36.350 |  49.814
>  2016-02-09 00:51:41.157-08 | 134.502 | 103.482 |  34.110 | 272.094
>  2016-02-09 19:28:33.154-08 |   0.575 |  20.434 |  33.803 |  54.812
>  2016-02-09 03:47:02.57-08  |   0.066 |  37.475 |  33.119 |  70.660
> (10 rows)
> 
> but 9.2's worst cases greatly exceed that:
> 
> regression=# select ts,write,sync,total-write-sync as other, total from 
> chkpts92 order by total-write-sync desc limit 10;
>              ts             | write  |  sync   |  other  |  total  
> ----------------------------+--------+---------+---------+---------
>  2016-02-09 03:27:12.904-08 | 48.761 | 124.453 | 299.877 | 473.091
>  2016-02-09 16:31:25.657-08 |  3.860 | 184.042 | 248.488 | 436.390
>  2016-02-08 23:33:36.16-08  | 66.904 | 397.653 | 229.020 | 693.577
>  2016-02-09 11:39:57.061-08 |  2.471 | 303.924 | 201.923 | 508.318
>  2016-02-09 07:37:44.48-08  |  3.317 |   1.494 | 158.159 | 162.970
>  2016-02-09 04:32:49.668-08 | 11.328 | 292.310 | 148.088 | 451.726
>  2016-02-09 12:37:43.52-08  | 17.620 | 333.983 | 141.172 | 492.775
>  2016-02-09 08:27:04.407-08 | 12.019 | 300.510 | 102.855 | 415.384
>  2016-02-09 18:17:32.193-08 | 90.370 |  25.369 |  63.626 | 179.365
>  2016-02-09 11:02:14.977-08 |  2.713 |   2.306 |  38.581 |  43.600
> (10 rows)
> 
> Not real sure where that improvement came from.  We could hardly
> be unlinking fewer files overall, and the bottleneck seems to be
> in the kernel, so what changed?  The *average*, as opposed to worst
> case, checkpoint time has definitely grown since 9.2:


Good question; I don't know.  I would have expected 9.2's xlog-driven
checkpoints to create more stability than HEAD's time-driven checkpoints.


On the AIX animals, I have now set PGCTLTIMEOUT=900 and removed the "-t 120"
arguments.


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures

Reply via email to