[GENERAL] how to decrease the promotion time when performing a multiple failovers.....

Shay Cohavi Thu, 31 Dec 2015 00:30:08 -0800

*Hi,*
*I have postgresql 9.3 setup with 2 nodes (active/standby with streaming
replication & continuos archiving).*
*I have created 2 failover & failback script in order to perform a
switchover between the DB servers:*
*1. failover - create a trigger file in order to promote the new primary.*
*2. failback - perform a base backup as mentions in :*
*   a. start backup on the primary.*
*   b. stop the failed node .*
       didn't delete the DB directory on the failed node
*   c. performing rsync between the nodes.*
*   d.stopping the backup on the primary.*
*   e.performing rsync on the pg_xlog.*
*   f. creating a recovery.conf*


*    standby_mode = 'on'*
*    primary_conninfo = 'host=10.50.1.153 port=5432 user=usr password=pass'*
*    restore_command = 'scp 10.50.1.153:/home/postgres/archive/%f %p'*
*    trigger_file = '/home/postgres/databases/fabrix/trigger'*
*    archive_cleanup_command = 'ssh 10.50.1.153
/home/postgres/pg_utils/archive_cleanup.sh %r'*

*   g. starting the failed node as secondary.*

*the switchover method:*
*1. stop the primary node.*
*2. promote the secondary node (failover.sh).*
*3. perform failback on the failed node.*
*4. start the failed node.*

*this method works great! *


*but if I perform multiple switchovers (>20), each time the new primary
gets promoted (trigger file) - it takes longer because it searches the
timelines on the archive. *
*is there any way to prevent the multiple 'scp' archive commands which
makes the promotion longer! *

*for example:*

*[2015-12-12 20:35:10.769 IST] LOG:  trigger file found:
/home/postgres/databases/fabrix/trigger*
*[2015-12-12 20:35:10.769 IST] FATAL:  terminating walreceiver process due
to administrator command*
*scp: /home/postgres/archive/0000009400000002000000DC: No such file or
directory*
*[2015-12-12 20:35:10.893 IST] LOG:  record with zero length at 2/DC000168*
*[2015-12-12 20:35:10.893 IST] LOG:  redo done at 2/DC000100*
*scp: /home/postgres/archive/0000009400000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009300000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009200000002000000DC: No such file or
directory*
*.*
*.*
*.*

*scp: /home/postgres/archive/0000009100000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/0000009000000002000000DC: No such file or
directory*
*scp: /home/postgres/archive/00000095.history: No such file or directory*
*[2015-12-12 20:35:11.801 IST] LOG:  selected new timeline ID: 149*
*[2015-12-12 20:35:11.931 IST] LOG:  restored log file "00000094.history"
from archive*
*[2015-12-12 20:35:12.173 IST] LOG:  archive recovery complete*
*[2015-12-12 20:35:12.181 IST] LOG:  database system is ready to accept
connections*
*[2015-12-12 20:35:12.181 IST] LOG:  autovacuum launcher started*

*this could take for a least 1 min.....or more.*

*is there any way to skip the timeline searching in order to decrease the
promotion?*


*Thanks,*
*ShayC*

[GENERAL] how to decrease the promotion time when performing a multiple failovers.....

Reply via email to