> > The slave (I don't have control over the master) is using two NFS file
> > systems, one for the WALs and another for the data, on NetApp controllers:
> >
> > dbnasg401-12a:/vol/dodpupdbtst02 on /ORA/dbs02/PUPDBTST type nfs
> > (rw,remount,hard,nointr,rsize=32768,wsize=32768,tcp,actimeo=0,timeo=600)
> >
> > dbnasg403-12a:/vol/dodpupdbtst03 on /ORA/dbs03/PUPDBTST type nfs
> > (rw,remount,hard,nointr,rsize=32768,wsize=32768,tcp,actimeo=0,timeo=600)
>
> You should use noatime to avoid unnecessary IO.
>
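For what it's worth, making the suggested option survive a reboot means carrying it into /etc/fstab as well; a minimal sketch, assuming the same exports and mount points as above:

```
# /etc/fstab (sketch only; same exports and paths as above, with noatime added)
dbnasg401-12a:/vol/dodpupdbtst02 /ORA/dbs02/PUPDBTST nfs rw,noatime,hard,nointr,rsize=65536,wsize=65536,tcp,timeo=600 0 0
dbnasg403-12a:/vol/dodpupdbtst03 /ORA/dbs03/PUPDBTST nfs rw,noatime,hard,nointr,rsize=65536,wsize=65536,tcp,timeo=600 0 0
```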
Just to mention that changing the mount points from:

dbnasg403-12a:/vol/dodpupdbtst03 on /ORA/dbs03/PUPDBTST type nfs
(rw,actimeo=0,hard,nointr,rsize=65536,wsize=65536,tcp,timeo=600)
dbnasg401-12a:/vol/dodpupdbtst02 on /ORA/dbs02/PUPDBTST type nfs
(rw,actimeo=0,hard,nointr,rsize=65536,wsize=65536,tcp,timeo=600)

to:

dbnasg403-12a:/vol/dodpupdbtst03 on /ORA/dbs03/PUPDBTST type nfs
(rw,noatime,hard,nointr,rsize=65536,wsize=65536,tcp,timeo=600)
dbnasg401-12a:/vol/dodpupdbtst02 on /ORA/dbs02/PUPDBTST type nfs
(rw,noatime,hard,nointr,rsize=65536,wsize=65536,tcp,timeo=600)

did have a big impact. The profile of the recovery process, in terms of system calls, changed quite a lot:
From:
[postgres@itrac1202 tmp]$ strace -p 9596 -c
Process 9596 attached - interrupt to quit
Process 9596 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------
 78.73    0.217824           0   4568553     81376 read
 17.87    0.049453           0   5153200           lseek
  2.89    0.007989          12       669       669 poll
  0.33    0.000912           2       536           open
  0.07    0.000206           0       994       994 stat
  0.05    0.000151           0       995       787 rt_sigreturn
  0.05    0.000133           0       673           write
  0.00    0.000000           0        36           close
  0.00    0.000000           0        52           kill
------ ----------- ----------- --------- --------- ------------
100.00    0.276668               9756303     83826 total
To:
[postgres@itrac1202 tmp]$ strace -p 9596 -c
Process 9596 attached - interrupt to quit
Process 9596 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------
 78.73    0.217824           0   4568553     81376 read
 17.87    0.049453           0   5153200           lseek
  2.89    0.007989          12       669       669 poll
  0.33    0.000912           2       536           open
  0.07    0.000206           0       994       994 stat
  0.05    0.000151           0       995       787 rt_sigreturn
  0.05    0.000133           0       673           write
  0.00    0.000000           0        36           close
  0.00    0.000000           0        52           kill
------ ----------- ----------- --------- --------- ------------
100.00    0.276668               9756303     83826 total
We also increased shared_buffers from 12GB to 24GB.
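For reference, on 9.2 that change is a one-line edit of postgresql.conf followed by a restart (there is no ALTER SYSTEM before 9.4). A sketch; the helper name and the $PGDATA layout are assumptions, not from this thread:

```shell
# Sketch: bump shared_buffers from 12GB to 24GB in a given data directory.
# Assumes GNU sed and a standard postgresql.conf layout.
bump_shared_buffers() {
  conf="$1/postgresql.conf"
  # Replace the setting whether it is currently commented out or not.
  sed -i 's/^#\{0,1\}shared_buffers[[:space:]]*=.*/shared_buffers = 24GB/' "$conf"
}
# Usage (a restart is required afterwards on 9.2):
#   bump_shared_buffers "$PGDATA"
#   /usr/local/pgsql/postgresql-9.2.9/bin/pg_ctl -D "$PGDATA" restart -m fast
```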
The lag has decreased most of the time:
*/10 * * * * /usr/local/pgsql/postgresql-9.2.9/bin/psql -U postgres -h
/var/lib/pgsql/ -p 6600 -d puppetdb -c "SELECT now(), now() -
pg_last_xact_replay_timestamp() AS time_lag" | perl -ne 'if
(/\|\s+(\d{2}):(\d{2}):(\d{2})\.\d+/) {$hour=$1;$min=$2;$sec=$3; print $_;}' >>
/tmp/lag929morememmount.log
...
2014-12-14 14:10:01.688947+01 | 00:00:00.096524
2014-12-14 14:20:01.798223+01 | 00:00:00.024083
2014-12-14 14:30:01.884448+01 | 00:00:00.420791
2014-12-14 14:40:01.960623+01 | 00:00:00.168318
2014-12-14 14:50:01.191487+01 | 00:00:00.163832
2014-12-14 15:00:02.146436+01 | 00:00:00.026934
2014-12-14 15:10:01.277963+01 | 00:00:00.332185
2014-12-14 15:20:01.353979+01 | 00:00:00.020616
2014-12-14 15:30:01.417092+01 | 00:00:00.584768
2014-12-14 15:40:01.575347+01 | 00:00:00.151685
2014-12-14 15:50:01.205507+01 | 00:00:00.102073
2014-12-14 16:00:01.321511+01 | 00:00:00.590677
2014-12-14 16:10:01.570474+01 | 00:00:00.182683
2014-12-14 16:20:01.640095+01 | 00:00:00.420185
2014-12-14 16:30:01.767033+01 | 00:00:00.015989
2014-12-14 16:40:01.849532+01 | 00:00:00.106296
2014-12-14 16:50:01.920876+01 | 00:00:00.258851
2014-12-14 17:00:02.000278+01 | 00:00:00.119841
2014-12-14 17:10:01.894227+01 | 00:00:00.091599
2014-12-14 17:20:01.61729+01 | 00:00:00.367367
2014-12-14 17:30:01.683326+01 | 00:00:00.103884
2014-12-14 17:40:01.755904+01 | 00:00:00.051262
2014-12-14 17:50:01.833825+01 | 00:00:00.06901
2014-12-14 18:00:01.901236+01 | 00:00:00.17467
2014-12-14 18:10:01.186283+01 | 00:00:00.214941
2014-12-14 18:20:01.145413+01 | 00:00:00.03517
2014-12-14 18:30:01.241746+01 | 00:00:00.207842
2014-12-14 18:40:01.299413+01 | 00:00:00.147878
2014-12-14 18:50:01.368541+01 | 00:00:00.393893
2014-12-14 19:00:01.430736+01 | 00:00:00.031226
2014-12-14 19:10:01.672117+01 | 00:05:03.512832
2014-12-14 19:20:01.9195+01 | 00:06:39.08761
2014-12-14 19:30:02.184486+01 | 00:00:00.307668
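A log like the one above can be scanned for spikes such as the 19:10 entry with a short filter; a sketch, where the function name and the 60-second threshold are assumptions:

```shell
# Sketch: flag entries in the lag log where replay lag exceeds 60 seconds.
# Input format matches the cron output above: "<timestamp> | HH:MM:SS.frac".
lag_alerts() {
  awk -F'|' '{
    split($2, t, /[:.]/)               # " 00:05:03.512832" -> hours, mins, secs
    lag = t[1] * 3600 + t[2] * 60 + t[3]
    if (lag > 60) print "LAG ALERT:", $0
  }' "$@"
}
# Usage: lag_alerts /tmp/lag929morememmount.log
```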