Hi

>>> 1. It doesn't follow standard Master/Slave logic of Pacemaker and that
>>> can create a lot of confusion. On start RA decides on what role shall
>>> be started where based not on Pacemaker configuration but on existence
>>> of recovery.conf file which is out of Pacemaker scope.
>>
>> I think so, but streaming replication doesn't have an interface
>> to check its data is up-to-date or not such as DRBD.
>> I kicked around an idea using pg_controldata command to check
>> checkpoint location, but it is not perfect.
>> Therefore I adopt this method involuntarily.
>>
>> Do you have a good idea?
> What happens if all nodes have recovery.conf file?

If recovery.conf exists at the start op, the RA decides to become a hot
standby, but if no master exists on any other node, the start op fails.
This check depends on the environment variable
"$OCF_RESKEY_CRM_meta_notify_master_uname".
So if all nodes have recovery.conf, the start op fails on every node.
On the other hand, if recovery.conf does not exist at the start op, the RA
decides to become the primary. But if a master already exists on another
node, the start op fails.

> Can we make it a
> must attribute for M/S configuration and then just promote master node
> by creating trigger file?

No. If recovery.conf doesn't exist, the RA starts PostgreSQL using the
pg_ctl command at the promote op. The trigger_file is only created when
PostgreSQL is already running as a hot standby.
Please see this figure.
https://docs.google.com/present/edit?id=0AVpX8lY1m3zZZGdxNHNjcWhfMGM5Ym52Njhm&hl=ja
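
To spell the behaviour out, the start and promote logic is roughly the
following. This is only a simplified shell sketch for illustration, not the
code from the patch itself: PGDATA, PGCTL and TRIGGER_FILE stand in for the
RA's real parameters, and the real RA runs pg_ctl as the configured database
user with its configured options.

---------------------------------------------------------------------------
# Simplified sketch of the role decision - NOT the actual RA code.
# PGDATA, PGCTL and TRIGGER_FILE are placeholders for the real parameters.
PGDATA=/var/lib/pgsql/9.0/data
PGCTL=/usr/pgsql-9.0/bin/pg_ctl
TRIGGER_FILE=/var/lib/pgsql/PGSQL.5432.trigger

pgsql_start() {
    if [ -f "$PGDATA/recovery.conf" ]; then
        # recovery.conf exists: become a hot standby, but only if a master
        # is already running on another node.
        if [ -z "$OCF_RESKEY_CRM_meta_notify_master_uname" ]; then
            return 1    # no master anywhere -> start fails
        fi
        su - postgres -c "$PGCTL -D $PGDATA start" || return 1
    else
        # No recovery.conf: this node is to become the primary.
        # PostgreSQL itself is started later, at the promote op.
        if [ -n "$OCF_RESKEY_CRM_meta_notify_master_uname" ]; then
            return 1    # a master already exists elsewhere -> start fails
        fi
    fi
    return 0
}

pgsql_promote() {
    if [ -f "$PGDATA/recovery.conf" ]; then
        # PostgreSQL is running as a hot standby: promote it by creating
        # the trigger_file named in recovery.conf.
        touch "$TRIGGER_FILE"
        chown postgres:postgres "$TRIGGER_FILE"
    else
        # recovery.conf is absent, so PostgreSQL was not started yet:
        # start it with pg_ctl and it comes up as the primary.
        su - postgres -c "$PGCTL -D $PGDATA start"
    fi
}
---------------------------------------------------------------------------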
> Obviously recovery.conf should point to a
> floating IP address that tied up to master role.

>>> Moving master
>>> role from one node to another cannot be done simply with Pacemaker
>>> command. I mean master will be moved, but slave won't be created.
>>> Though Pacemaker will be forced to believe that it still exists and
>>> administrator will have to do several manual steps to make it real.
>>> It's too tricky in my opinion.
>>
>> Yes, but streaming replication can't transit from primary to
>> hot standby, in addition it doesn't have a function to resync
>> automatically if local data is too old.
>> So I give up to creating slave automatically at this moment.

> Completely stopping a node on demote operation and forcing Pacemaker
> to believe that it's still up basically incorrect and cannot be
> accepted. This approach can create a situation when administrator
> moves master role from one node to another and then moves it back and
> after that cluster reports that everything is up while actually it's
> all down. Now take administrator out of the picture and add pingd
> resource, simple network glitch puts cluster into unrecoverable state.

To avoid this situation, my patch doesn't suppress the monitor op at
demote, so if demote is called and the resource transitions from master to
slave, the slave then transitions to stopped with a monitor error.
I agree it's not elegant that a monitor error occurs as part of a demote,
but it's better than the situation you describe.

> Ideally demote operation should stop a master node and then restart it
> in hot-standby mode. It's up to administrator to make sure that no
> node with outdated data gets promoted to the master role. One should
> follow standard procedures: cluster software shouldn't be configured
> for autostart at the boot time, administrator has to make sure that
> data was refreshed if the node was down for some prolonged time.

Hmm... Do you mean that the RA should put recovery.conf in place
automatically at the demote op and then start a hot standby?
Please give me some time to think it over.
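
If I understand correctly, the demote op would then look roughly like this.
This is only a sketch of the idea, not something implemented; it assumes a
prepared recovery.conf template kept outside the data directory, and the
template path below is made up for the example.

---------------------------------------------------------------------------
# Sketch of the "restart as hot standby on demote" idea - not implemented.
# The template path is only for illustration.
PGDATA=/var/lib/pgsql/9.0/data
PGCTL=/usr/pgsql-9.0/bin/pg_ctl
TEMPLATE=/var/lib/pgsql/recovery.conf.template

pgsql_demote() {
    # Stop the primary first.
    su - postgres -c "$PGCTL -D $PGDATA -m fast stop" || return 1

    # Put recovery.conf in place so that the next start is a hot standby.
    cp "$TEMPLATE" "$PGDATA/recovery.conf"
    chown postgres:postgres "$PGDATA/recovery.conf"

    # Restart in hot-standby mode. Whether the local data is recent enough
    # to follow the new primary still has to be guaranteed by the
    # administrator, as you say.
    su - postgres -c "$PGCTL -D $PGDATA start"
}
---------------------------------------------------------------------------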
>>
>> I think what is important is that primary is running somewhere.
>>
>> I discussed them with a PostgreSQL developer.
>> I hope these functions will be implemented in the future.
>>
>>> 2. What is the reason of such heavy use of ATTR_UPDATER? What process
>>> is the consumer of that pgsql_status attribute? Calling attrd_updater
>>> so often makes RA slow.
>>
>> It's just auxiliary function to check complicate states
>> so I don't care to delete this.

> Why not to use "ocf_log info" and keep all logging in one place? Using
> that additional attribute would make sense if we had a GUI that was
> able to refresh such attributes on the fly, otherwise I don't see too
> much value in it. Tailing ha-log or messages log file seems simpler
> than running cibadmin -Ql and grepping for that attribute.

Of course I use "ocf_log info" too, but my customers say that Pacemaker's
log is confusing and they don't want to read it.
I just want them to be able to check the status with "crm_mon -A".
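
For reference, the attribute is just a transient node attribute; the RA
updates it roughly as below (simplified, the exact invocation in the patch
may differ slightly), and the administrator only has to run crm_mon to read
it instead of grepping logs or the CIB.

---------------------------------------------------------------------------
# Inside the RA (simplified sketch):
attrd_updater -n pgsql-status -v "PRI"    # on the primary
attrd_updater -n pgsql-status -v "HS"     # on a hot standby

# What the administrator runs, instead of reading ha-log:
crm_mon -1 -fA
---------------------------------------------------------------------------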
2011/2/11, Serge Dubrouski <serge...@gmail.com>:
> On Thu, Feb 10, 2011 at 1:28 AM, Takatoshi MATSUO <matsuo....@gmail.com>
> wrote:
>> Hello Serge
>>
>>> Hello -
>>
>>> I'm not sure that I really like this approach.
>>
>>> 1. It doesn't follow standard Master/Slave logic of Pacemaker and that
>>> can create a lot of confusion. On start RA decides on what role shall
>>> be started where based not on Pacemaker configuration but on existence
>>> of recovery.conf file which is out of Pacemaker scope.
>>
>> I think so, but streaming replication doesn't have an interface
>> to check its data is up-to-date or not such as DRBD.
>> I kicked around an idea using pg_controldata command to check
>> checkpoint location, but it is not perfect.
>> Therefore I adopt this method involuntarily.
>>
>> Do you have a good idea?
>
> What happens if all nodes have recovery.conf file? Can we make it a
> must attribute for M/S configuration and then just promote master node
> by creating trigger file? Obviously recovery.conf should point to a
> floating IP address that tied up to master role.
>
>>> Moving master
>>> role from one node to another cannot be done simply with Pacemaker
>>> command. I mean master will be moved, but slave won't be created.
>>> Though Pacemaker will be forced to believe that it still exists and
>>> administrator will have to do several manual steps to make it real.
>>> It's too tricky in my opinion.
>>
>> Yes, but streaming replication can't transit from primary to
>> hot standby, in addition it doesn't have a function to resync
>> automatically if local data is too old.
>> So I give up to creating slave automatically at this moment.
>
> Completely stopping a node on demote operation and forcing Pacemaker
> to believe that it's still up basically incorrect and cannot be
> accepted. This approach can create a situation when administrator
> moves master role from one node to another and then moves it back and
> after that cluster reports that everything is up while actually it's
> all down. Now take administrator out of the picture and add pingd
> resource, simple network glitch puts cluster into unrecoverable state.
>
> Ideally demote operation should stop a master node and then restart it
> in hot-standby mode. It's up to administrator to make sure that no
> node with outdated data gets promoted to the master role. One should
> follow standard procedures: cluster software shouldn't be configured
> for autostart at the boot time, administrator has to make sure that
> data was refreshed if the node was down for some prolonged time.
>
>>
>> I think what is important is that primary is running somewhere.
>>
>> I discussed them with a PostgreSQL developer.
>> I hope these functions will be implemented in the future.
>>
>>> 2. What is the reason of such heavy use of ATTR_UPDATER? What process
>>> is the consumer of that pgsql_status attribute? Calling attrd_updater
>>> so often makes RA slow.
>>
>> It's just auxiliary function to check complicate states
>> so I don't care to delete this.
>
> Why not to use "ocf_log info" and keep all logging in one place? Using
> that additional attribute would make sense if we had a GUI that was
> able to refresh such attributes on the fly, otherwise I don't see too
> much value in it. Tailing ha-log or messages log file seems simpler
> than running cibadmin -Ql and grepping for that attribute.
>
>>
>> But I don't feel so slow in my circumstance.
>> Doesn't attrd_updater work in synchronization?
>>
>>> I need some time to build PostgreSQL 9.0 test cluster and try some more
>>> tests.
>>
>> Regards,
>> Takatoshi MATSUO
>>
>> 2011/2/10, Serge Dubrouski <serge...@gmail.com>:
>>> Hello -
>>>
>>> I'm not sure that I really like this approach.
>>>
>>> 1. It doesn't follow standard Master/Slave logic of Pacemaker and that
>>> can create a lot of confusion. On start RA decides on what role shall
>>> be started where based not on Pacemaker configuration but on existence
>>> of recovery.conf file which is out of Pacemaker scope. Moving master
>>> role from one node to another cannot be done simply with Pacemaker
>>> command. I mean master will be moved, but slave won't be created.
>>> Though Pacemaker will be forced to believe that it still exists and
>>> administrator will have to do several manual steps to make it real.
>>> It's too tricky in my opinion.
>>>
>>> 2. What is the reason of such heavy use of ATTR_UPDATER? What process
>>> is the consumer of that pgsql_status attribute? Calling attrd_updater
>>> so often makes RA slow.
>>>
>>> I need some time to build PostgreSQL 9.0 test cluster and try some more
>>> tests.
>>>
>>> On Tue, Feb 8, 2011 at 6:03 AM, Takatoshi MATSUO <matsuo....@gmail.com>
>>> wrote:
>>>> Hi
>>>>
>>>> Thank you for comments.
>>>>
>>>>> First of all thanks for doing this. It's definitely a neat option to
>>>>> have. Unfortunately I haven't yet worked with PostgreSQL 9.0. Could
>>>>> you please elaborate little bit more on how it works? What happens if
>>>>> master dies? Who is responsible for creating/deleting recovery.conf
>>>>> file?
>>>>
>>>> This patch does not setup recovery.conf for hot standby automatically,
>>>> so you need setup it manually before starting hot standby.
>>>> If recovery.conf exists, resource agent will become slave.
>>>> Die of master causes fail-over correctly.
>>>>
>>>>> Also please be more careful with quoting variables in your if
>>>>> statements.
>>>>
>>>> I checked and quoted.
>>>>
>>>> I elaborate.
>>>>
>>>> **************** Condition *********************************
>>>>
>>>> 1. Streaming replication works correctly.
>>>> 2. Streaming replication works correctly if you swap
>>>>    primary server for hot standby server.
>>>>
>>>> FYA I wrote my configuration at the end of the post.
>>>> See "Example of configuration".
>>>>
>>>> **************** Setup and start ***************************
>>>>
>>>> 1. You decide which server do you want to be master.
>>>>    In this instance I assume that you choose node1.
>>>>
>>>> 2. node1# rm /var/lib/pgsql/9.0/data/recovery.conf
>>>>
>>>> 3. node1# /etc/init.d/heartbeat start
>>>>    Wait for "ONLINE node1".
>>>>
>>>> 4. node1# crm configure load update pgsql.crm
>>>>
>>>> 5. Wait for master started as below
>>>>    node1# crm_mon -1 -fA
>>>> ------------------------------------------------------------
>>>> ============
>>>> Last updated: Tue Feb 8 18:51:00 2011
>>>> Stack: Heartbeat
>>>> Current DC: node1 (0c140f90-7de3-438f-b1b5-3b9722bbde21) - partition with quorum
>>>> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
>>>> 2 Nodes configured, unknown expected votes
>>>> 3 Resources configured.
>>>> ============
>>>>
>>>> Online: [ node1 ]
>>>>
>>>> vip   (ocf::heartbeat:IPaddr2):   Started node1
>>>> Master/Slave Set: msPostgresql
>>>>     Masters: [ node1 ]
>>>>     Stopped: [ postgresql:1 ]
>>>> Clone Set: clnPingd
>>>>     Started: [ node1 ]
>>>>     Stopped: [ pingd:1 ]
>>>>
>>>> Node Attributes:
>>>> * Node node1:
>>>>     + master-postgresql:0   : 1000
>>>>     + pgsql-status          : PRI
>>>>     + ping_set              : 100
>>>>
>>>> Migration summary:
>>>> * Node node1:
>>>> ------------------------------------------------------------
>>>>
>>>> If start operation fails, please check PostgreSQL state using "crm_mon -A".
>>>> "pgsql-status" line shows something.
>>>>
>>>> 6. Backup node1's data and restore to node2.
>>>>    In my case I use rsync command as below.
>>>>
>>>>    node2# psql -h 192.168.2.107 -U postgres -c "SELECT pg_start_backup('label')"
>>>>    node2# rsync -avr --delete --exclude=postmaster.pid 192.168.2.107:/var/lib/pgsql/9.0/data/ /var/lib/pgsql/9.0/data/
>>>>    node2# psql -h 192.168.2.107 -U postgres -c "SELECT pg_stop_backup()"
>>>>
>>>> 7. Put recovery.conf for node2.
>>>>    node2# cp recovery.conf /var/lib/pgsql/9.0/data/
>>>>    node2# chown postgres:postgres /var/lib/pgsql/9.0/data/recovery.conf
>>>>
>>>> 8. node2# /etc/init.d/heartbeat start
>>>>
>>>> 9. Wait for slave started as below.
>>>>    node2# crm_mon -1 -fA
>>>> ------------------------------------------------------------
>>>> ============
>>>> Last updated: Tue Feb 8 19:41:00 2011
>>>> Stack: Heartbeat
>>>> Current DC: node1 (0c140f90-7de3-438f-b1b5-3b9722bbde21) - partition with quorum
>>>> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
>>>> 2 Nodes configured, unknown expected votes
>>>> 3 Resources configured.
>>>> ============
>>>>
>>>> Online: [ node1 node2 ]
>>>>
>>>> vip   (ocf::heartbeat:IPaddr2):   Started node1
>>>> Master/Slave Set: msPostgresql
>>>>     Masters: [ node1 ]
>>>>     Slaves: [ node2 ]
>>>> Clone Set: clnPingd
>>>>     Started: [ node1 node2 ]
>>>>
>>>> Node Attributes:
>>>> * Node node1:
>>>>     + master-postgresql:0   : 1000
>>>>     + pgsql-status          : PRI
>>>>     + ping_set              : 100
>>>> * Node node2:
>>>>     + master-postgresql:1   : 100
>>>>     + pgsql-status          : HS
>>>>     + ping_set              : 100
>>>>
>>>> Migration summary:
>>>> * Node node1:
>>>> * Node node2:
>>>> ------------------------------------------------------------
>>>>
>>>> If master doesn't exist in other node or recovery.conf doesn't exist
>>>> in node2, start operation fails.
>>>>
>>>> ********************** Other Operation ***************************************
>>>>
>>>> 1. If node1's master dies, node2 becomes a master.
>>>>    node1# killall -9 postgres
>>>>    node1# crm_mon -1 -fA
>>>> ------------------------------------------------------------
>>>> ============
>>>> Last updated: Tue Feb 8 20:12:34 2011
>>>> Stack: Heartbeat
>>>> Current DC: node1 (0c140f90-7de3-438f-b1b5-3b9722bbde21) - partition with quorum
>>>> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
>>>> 2 Nodes configured, unknown expected votes
>>>> 3 Resources configured.
>>>> ============
>>>>
>>>> Online: [ node1 node2 ]
>>>>
>>>> vip   (ocf::heartbeat:IPaddr2):   Started node2
>>>> Master/Slave Set: msPostgresql
>>>>     Masters: [ node2 ]
>>>>     Stopped: [ postgresql:0 ]
>>>> Clone Set: clnPingd
>>>>     Started: [ node1 node2 ]
>>>>
>>>> Node Attributes:
>>>> * Node node1:
>>>>     + master-postgresql:0   : -INFINITY
>>>>     + pgsql-status          : STOP:with errors
>>>>     + ping_set              : 100
>>>> * Node node2:
>>>>     + master-postgresql:1   : 1000
>>>>     + pgsql-status          : PRI
>>>>     + ping_set              : 100
>>>>
>>>> Migration summary:
>>>> * Node node1:
>>>>    postgresql:0: migration-threshold=1 fail-count=1
>>>> * Node node2:
>>>>
>>>> Failed actions:
>>>>     postgresql:0_monitor_9000 (node=node1, call=14, rc=7, status=complete): not running
>>>> ------------------------------------------------------------
>>>>
>>>> 2. To recover node1, backup node2's data and restore to node1.
>>>>
>>>>    node1# psql -h 192.168.2.207 -U postgres -c "SELECT pg_start_backup('label')"
>>>>    node1# rsync -avr --delete --exclude=postmaster.pid 192.168.2.207:/var/lib/pgsql/9.0/data/ /var/lib/pgsql/9.0/data/
>>>>    node1# psql -h 192.168.2.207 -U postgres -c "SELECT pg_stop_backup()"
>>>>
>>>> 3. Put recovery.conf for node1.
>>>>    node1# cp recovery.conf /var/lib/pgsql/9.0/data/
>>>>    node1# chown postgres:postgres /var/lib/pgsql/9.0/data/recovery.conf
>>>>
>>>> 4. Remove flag file of repressing start.
>>>>    node1# rm /var/lib/pgsql/PGSQL.5432.repress_start
>>>>
>>>> 5. node1# crm resource cleanup msPostgresql
>>>>
>>>> 6. Node1 becomes slave as below.
>>>>    node1# crm_mon -1 -fA
>>>> ------------------------------------------------------------
>>>> ============
>>>> Last updated: Tue Feb 8 20:32:04 2011
>>>> Stack: Heartbeat
>>>> Current DC: node1 (0c140f90-7de3-438f-b1b5-3b9722bbde21) - partition with quorum
>>>> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
>>>> 2 Nodes configured, unknown expected votes
>>>> 3 Resources configured.
>>>> ============
>>>>
>>>> Online: [ node1 node2 ]
>>>>
>>>> vip   (ocf::heartbeat:IPaddr2):   Started node2
>>>> Master/Slave Set: msPostgresql
>>>>     Masters: [ node2 ]
>>>>     Slaves: [ node1 ]
>>>> Clone Set: clnPingd
>>>>     Started: [ node1 node2 ]
>>>>
>>>> Node Attributes:
>>>> * Node node1:
>>>>     + master-postgresql:0   : 100
>>>>     + pgsql-status          : HS
>>>>     + ping_set              : 100
>>>> * Node node2:
>>>>     + master-postgresql:1   : 1000
>>>>     + pgsql-status          : PRI
>>>>     + ping_set              : 100
>>>>
>>>> Migration summary:
>>>> * Node node1:
>>>> * Node node2:
>>>> ------------------------------------------------------------
>>>>
>>>> ********************** Example of configuration ****************************
>>>>
>>>> * postgresql.conf (common configuration for node1 and node2)
>>>>   - Please mkdir /var/lib/pgsql/9.0/data/pg_archive preliminarily.
>>>>   - To cause fail-over correctly, set listen_addresses = '*'
>>>>     if you use virtual IP together.
>>>> ---------------------------------------------------------------------------
>>>> listen_addresses = '*'
>>>> port = 5432
>>>> wal_level = hot_standby
>>>> max_wal_senders = 3
>>>> wal_keep_segments = 16
>>>> archive_mode = on
>>>> archive_command = 'cp %p /var/lib/pgsql/9.0/data/pg_archive/%f'
>>>> ---------------------------------------------------------------------------
>>>>
>>>> * pg_hba.conf (common configuration for node1 and node2)
>>>>   - 192.168.2.0/24 is LAN for replication.
>>>>   - Node1 has 192.168.2.107 and node2 has 192.168.2.207.
>>>> ---------------------------------------------------------------------------
>>>> local   all           all                            trust
>>>> host    all           all       127.0.0.1/32         trust
>>>> host    all           all       ::1/128              trust
>>>> host    all           all       192.168.2.0/24       trust
>>>> host    replication   all       192.168.2.0/24       trust
>>>> ---------------------------------------------------------------------------
>>>>
>>>> * recovery.conf for node1
>>>> ---------------------------------------------------------------------------
>>>> standby_mode = 'on'
>>>> primary_conninfo = 'host=192.168.2.207 port=5432 user=postgres'
>>>> restore_command = 'cp /var/lib/pgsql/9.0/data/pg_archive/%f %p'
>>>> trigger_file = '/var/lib/pgsql/PGSQL.5432.trigger'
>>>> ---------------------------------------------------------------------------
>>>>
>>>> * recovery.conf for node2
>>>> ---------------------------------------------------------------------------
>>>> standby_mode = 'on'
>>>> primary_conninfo = 'host=192.168.2.107 port=5432 user=postgres'
>>>> restore_command = 'cp /var/lib/pgsql/9.0/data/pg_archive/%f %p'
>>>> trigger_file = '/var/lib/pgsql/PGSQL.5432.trigger'
>>>> ---------------------------------------------------------------------------
>>>>
>>>> * pgsql.crm for crm command
>>>> ---------------------------------------------------------------------------
>>>> property \
>>>>     no-quorum-policy="ignore" \
>>>>     stonith-enabled="false" \
>>>>     startup-fencing="false"
>>>> rsc_defaults \
>>>>     resource-stickiness="INFINITY" \
>>>>     migration-threshold="1"
>>>> ms msPostgresql postgresql \
>>>>     meta \
>>>>         master-max="1" \
>>>>         master-node-max="1" \
>>>>         clone-max="2" \
>>>>         clone-node-max="1" \
>>>>         notify="true"
>>>> clone clnPingd pingd \
>>>>     meta \
>>>>         clone-max="2" \
>>>>         clone-node-max="1"
>>>> primitive vip ocf:heartbeat:IPaddr2 \
>>>>     params \
>>>>         ip="172.20.24.127" \
>>>>         nic="eth0" \
>>>>         cidr_netmask="16" \
>>>>     op start timeout="60s" interval="0s" on-fail="restart" \
>>>>     op monitor timeout="60s" interval="10s" on-fail="restart" \
>>>>     op stop timeout="60s" interval="0s" on-fail="block"
>>>> primitive postgresql ocf:heartbeat:pgsql \
>>>>     params \
>>>>         pgctl="/usr/pgsql-9.0/bin/pg_ctl" \
>>>>         psql="/usr/pgsql-9.0/bin/psql" \
>>>>         pgdata="/var/lib/pgsql/9.0/data/" \
>>>>         start_opt="-p 5432" \
>>>>         rep_mode="async" \
>>>>     op start timeout="3600s" interval="0s" on-fail="restart" \
>>>>     op monitor timeout="60s" interval="10s" on-fail="restart" \
>>>>     op monitor timeout="60s" interval="9s" on-fail="restart" role="Master" \
>>>>     op promote timeout="3600s" interval="0s" on-fail="restart" \
>>>>     op demote timeout="60s" interval="0s" on-fail="block" \
>>>>     op stop timeout="60s" interval="0s" on-fail="block"
>>>> primitive pingd ocf:pacemaker:pingd \
>>>>     params \
>>>>         name="ping_set" \
>>>>         host_list="172.20.24.103" \
>>>>         multiplier="100" \
>>>>         dampen="0" \
>>>>     op start timeout="60s" interval="0s" on-fail="restart" \
>>>>     op monitor timeout="60s" interval="10s" on-fail="restart" \
>>>>     op stop timeout="60s" interval="0s" on-fail="ignore"
>>>>
>>>> location rsc_location-1 vip \
>>>>     rule -inf: not_defined ping_set or ping_set lt 100
>>>> location rsc_location-2 msPostgresql \
>>>>     rule -inf: not_defined ping_set or ping_set lt 100 \
>>>>     rule -inf: defined fail-count-vip
>>>>
>>>> colocation rsc_colocation-1 inf: vip clnPingd
>>>> colocation rsc_colocation-2 inf: msPostgresql clnPingd
>>>> colocation rsc_colocation-3 inf: msPostgresql:Master vip
>>>>
>>>> order rsc_order-1 0: clnPingd msPostgresql
>>>> order rsc_order-2 0: msPostgresql vip
>>>> order rsc_order-3 0: vip msPostgresql:promote
>>>> ---------------------------------------------------------------------------
>>>>
>>>> Regards,
>>>> Takatoshi MATSUO
>>>
>>> --
>>> Serge Dubrouski.
>
> --
> Serge Dubrouski.

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/