Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hello Serge

2011/12/13 Serge Dubrouski serge...@gmail.com:
> On Mon, Dec 12, 2011 at 5:32 AM, Takatoshi MATSUO matsuo@gmail.com wrote:
>> Hello
>>
>> 2011/12/12 Serge Dubrouski serge...@gmail.com:
>>> On Thu, Dec 8, 2011 at 10:34 PM, Takatoshi MATSUO matsuo@gmail.com wrote:
>>>> Hi Attila
>>>>
>>>> 2011/12/8 Attila Megyeri amegy...@minerva-soft.com:
>>>>> Hi Takatoshi,
>>>>>
>>>>> One strange thing I noticed that could probably be improved.
>>>>> When there is data inconsistency, I have the following node properties:
>>>>>
>>>>> * Node psql2:
>>>>>     + default_ping_set : 100
>>>>>     + master-postgresql:1 : -INFINITY
>>>>>     + pgsql-data-status : DISCONNECT
>>>>>     + pgsql-status : HS:alone
>>>>> * Node psql1:
>>>>>     + default_ping_set : 100
>>>>>     + master-postgresql:0 : 1000
>>>>>     + master-postgresql:1 : -INFINITY
>>>>>     + pgsql-data-status : LATEST
>>>>>     + pgsql-master-baseline : 58:4B20
>>>>>     + pgsql-status : PRI
>>>>>
>>>>> This is fine and understandable, but I can see it only if I run crm_mon -A.
>>>>> My problem is that CRM shows the following:
>>>>>
>>>>> Master/Slave Set: db-ms-psql [postgresql]
>>>>>     Masters: [ psql1 ]
>>>>>     Slaves: [ psql2 ]
>>>>>
>>>>> So if I monitor the system from crm_mon, HAWK or other tools, I have no indication at all that the slave is running in an inconsistent mode.
>>>>>
>>>>> I would expect the RA to stop the psql2 node in such cases, because:
>>>>> - It is running but has out-of-date data, therefore no one will use it (the slave IP points to the master as well, which is good)
>>>>> - In the CRM status everything looks perfect, even though it is NOT perfect and admin intervention is required.
>>>>>
>>>>> Shouldn't the disconnected PSQL server be stopped instead?
>>>>
>>>> Hmm... It is not better to stop the PGSQL server. The RA cannot know whether PGSQL is disconnected because of data inconsistency, a network outage, a startup in progress, and so on.
>>>
>>> Why does it matter? If the state is degraded and inconsistent and there is no way to fix it from inside the RA, the RA should probably stop it.
>>
>> In this case, the HS's data may be consistent, but the primary doesn't have enough WALs or the HS doesn't have enough WAL archives to enter replication mode.
>> Unfortunately, this RA doesn't calculate the number of WALs.
>
> Honestly, I don't know how to handle this better. Pacemaker doesn't have a concept of a degraded node state.

In this case the RA cannot know whether it is degraded or not, for the reason above. Of course, the RA stops PostgreSQL when it is obviously degraded.

>>> Let's say that there is pgpool running in front of the cluster; keeping an inconsistent node up would lead to SQL queries being routed to it and possibly returning wrong results.
>>
>> It doesn't happen in my sample configuration. vip-slave is up on the master when the slave is not HS:sync.
>
> So you have a VIP for each slave node?

Yes. If you don't need read-only access, there is no problem with removing vip-slave.

>>>> How about using a dummy RA such as vip-slave?
>>>>
>>>> ---
>>>> primitive runningSlaveOK ocf:heartbeat:Dummy
>>>> .(snip)
>>>> location rsc_location-dummy runningSlaveOK \
>>>>     rule 200: pgsql-status eq HS:sync
>>>> ---
>>>
>>> That probably fixes the visibility issue. What about notifications on the DISCONNECT state? How would an administrator know that the cluster is inconsistent? Maybe the better option in this case would be colocating a MailTo resource with HS:alone?
>>
>> Yes, it's a good idea if you want to receive notifications.

Regards,
Takatoshi MATSUO

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
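
The visibility problem discussed in this message (the degraded standby shows up only in the node attributes printed by crm_mon -A) could also be checked from a script. The following is only a sketch, not part of the RA: it assumes the attribute listing looks like the one quoted in the thread, and the function name degraded_nodes is made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper, not part of the pgsql RA.
# Reads "crm_mon -A"-style node attributes on stdin and prints
# "<node> <pgsql-status>" for every node that is neither PRI nor HS:sync.
degraded_nodes() {
    awk '
        # Remember which node the following attributes belong to, e.g. "* Node psql2:"
        $1 == "*" && $2 == "Node" { node = $3; sub(/:$/, "", node); next }
        # Attribute lines are assumed to look like "+ pgsql-status : HS:alone"
        $2 == "pgsql-status" {
            if ($NF != "PRI" && $NF != "HS:sync") print node, $NF
        }
    '
}

# Example with the attributes from the thread (sample input, no cluster needed):
degraded_nodes <<'EOF'
* Node psql2:
    + pgsql-data-status : DISCONNECT
    + pgsql-status : HS:alone
* Node psql1:
    + pgsql-data-status : LATEST
    + pgsql-status : PRI
EOF
```

With the sample input this prints "psql2 HS:alone". On a live cluster the input would presumably come from something like `crm_mon -A -1 | degraded_nodes`; the attribute layout may differ between Pacemaker versions, so the field positions are an assumption.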
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hello

2011/12/12 Serge Dubrouski serge...@gmail.com:
> On Thu, Dec 8, 2011 at 10:34 PM, Takatoshi MATSUO matsuo@gmail.com wrote:
>> Hi Attila
>>
>> 2011/12/8 Attila Megyeri amegy...@minerva-soft.com:
>>> Hi Takatoshi,
>>>
>>> One strange thing I noticed that could probably be improved.
>>> When there is data inconsistency, I have the following node properties:
>>>
>>> * Node psql2:
>>>     + default_ping_set : 100
>>>     + master-postgresql:1 : -INFINITY
>>>     + pgsql-data-status : DISCONNECT
>>>     + pgsql-status : HS:alone
>>> * Node psql1:
>>>     + default_ping_set : 100
>>>     + master-postgresql:0 : 1000
>>>     + master-postgresql:1 : -INFINITY
>>>     + pgsql-data-status : LATEST
>>>     + pgsql-master-baseline : 58:4B20
>>>     + pgsql-status : PRI
>>>
>>> This is fine and understandable, but I can see it only if I run crm_mon -A.
>>> My problem is that CRM shows the following:
>>>
>>> Master/Slave Set: db-ms-psql [postgresql]
>>>     Masters: [ psql1 ]
>>>     Slaves: [ psql2 ]
>>>
>>> So if I monitor the system from crm_mon, HAWK or other tools, I have no indication at all that the slave is running in an inconsistent mode.
>>>
>>> I would expect the RA to stop the psql2 node in such cases, because:
>>> - It is running but has out-of-date data, therefore no one will use it (the slave IP points to the master as well, which is good)
>>> - In the CRM status everything looks perfect, even though it is NOT perfect and admin intervention is required.
>>>
>>> Shouldn't the disconnected PSQL server be stopped instead?
>>
>> Hmm... It is not better to stop the PGSQL server. The RA cannot know whether PGSQL is disconnected because of data inconsistency, a network outage, a startup in progress, and so on.
>
> Why does it matter? If the state is degraded and inconsistent and there is no way to fix it from inside the RA, the RA should probably stop it.

In this case, the HS's data may be consistent, but the primary doesn't have enough WALs or the HS doesn't have enough WAL archives to enter replication mode. Unfortunately, this RA doesn't calculate the number of WALs.
> Let's say that there is pgpool running in front of the cluster; keeping an inconsistent node up would lead to SQL queries being routed to it and possibly returning wrong results.

It doesn't happen in my sample configuration. vip-slave is up on the master when the slave is not HS:sync.

>> How about using a dummy RA such as vip-slave?
>>
>> ---
>> primitive runningSlaveOK ocf:heartbeat:Dummy
>> .(snip)
>> location rsc_location-dummy runningSlaveOK \
>>     rule 200: pgsql-status eq HS:sync
>> ---
>
> That probably fixes the visibility issue. What about notifications on the DISCONNECT state? How would an administrator know that the cluster is inconsistent? Maybe the better option in this case would be colocating a MailTo resource with HS:alone?

Yes, it's a good idea if you want to receive notifications.

Regards,
Takatoshi MATSUO
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
On Mon, Dec 12, 2011 at 5:32 AM, Takatoshi MATSUO matsuo@gmail.com wrote:
> Hello
>
> 2011/12/12 Serge Dubrouski serge...@gmail.com:
>> On Thu, Dec 8, 2011 at 10:34 PM, Takatoshi MATSUO matsuo@gmail.com wrote:
>>> Hi Attila
>>>
>>> 2011/12/8 Attila Megyeri amegy...@minerva-soft.com:
>>>> Hi Takatoshi,
>>>>
>>>> One strange thing I noticed that could probably be improved.
>>>> When there is data inconsistency, I have the following node properties:
>>>>
>>>> * Node psql2:
>>>>     + default_ping_set : 100
>>>>     + master-postgresql:1 : -INFINITY
>>>>     + pgsql-data-status : DISCONNECT
>>>>     + pgsql-status : HS:alone
>>>> * Node psql1:
>>>>     + default_ping_set : 100
>>>>     + master-postgresql:0 : 1000
>>>>     + master-postgresql:1 : -INFINITY
>>>>     + pgsql-data-status : LATEST
>>>>     + pgsql-master-baseline : 58:4B20
>>>>     + pgsql-status : PRI
>>>>
>>>> This is fine and understandable, but I can see it only if I run crm_mon -A.
>>>> My problem is that CRM shows the following:
>>>>
>>>> Master/Slave Set: db-ms-psql [postgresql]
>>>>     Masters: [ psql1 ]
>>>>     Slaves: [ psql2 ]
>>>>
>>>> So if I monitor the system from crm_mon, HAWK or other tools, I have no indication at all that the slave is running in an inconsistent mode.
>>>>
>>>> I would expect the RA to stop the psql2 node in such cases, because:
>>>> - It is running but has out-of-date data, therefore no one will use it (the slave IP points to the master as well, which is good)
>>>> - In the CRM status everything looks perfect, even though it is NOT perfect and admin intervention is required.
>>>>
>>>> Shouldn't the disconnected PSQL server be stopped instead?
>>>
>>> Hmm... It is not better to stop the PGSQL server. The RA cannot know whether PGSQL is disconnected because of data inconsistency, a network outage, a startup in progress, and so on.
>>
>> Why does it matter? If the state is degraded and inconsistent and there is no way to fix it from inside the RA, the RA should probably stop it.
>
> In this case, the HS's data may be consistent, but the primary doesn't have enough WALs or the HS doesn't have enough WAL archives to enter replication mode. Unfortunately, this RA doesn't calculate the number of WALs.

Honestly, I don't know how to handle this better.
Pacemaker doesn't have a concept of a degraded node state.

>> Let's say that there is pgpool running in front of the cluster; keeping an inconsistent node up would lead to SQL queries being routed to it and possibly returning wrong results.
>
> It doesn't happen in my sample configuration. vip-slave is up on the master when the slave is not HS:sync.

So you have a VIP for each slave node?

>>> How about using a dummy RA such as vip-slave?
>>>
>>> ---
>>> primitive runningSlaveOK ocf:heartbeat:Dummy
>>> .(snip)
>>> location rsc_location-dummy runningSlaveOK \
>>>     rule 200: pgsql-status eq HS:sync
>>> ---
>>
>> That probably fixes the visibility issue. What about notifications on the DISCONNECT state? How would an administrator know that the cluster is inconsistent? Maybe the better option in this case would be colocating a MailTo resource with HS:alone?
>
> Yes, it's a good idea if you want to receive notifications.
>
> Regards,
> Takatoshi MATSUO

--
Serge Dubrouski.
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
On Thu, Dec 8, 2011 at 10:34 PM, Takatoshi MATSUO matsuo@gmail.com wrote:
> Hi Attila
>
> 2011/12/8 Attila Megyeri amegy...@minerva-soft.com:
>> Hi Takatoshi,
>>
>> One strange thing I noticed that could probably be improved.
>> When there is data inconsistency, I have the following node properties:
>>
>> * Node psql2:
>>     + default_ping_set : 100
>>     + master-postgresql:1 : -INFINITY
>>     + pgsql-data-status : DISCONNECT
>>     + pgsql-status : HS:alone
>> * Node psql1:
>>     + default_ping_set : 100
>>     + master-postgresql:0 : 1000
>>     + master-postgresql:1 : -INFINITY
>>     + pgsql-data-status : LATEST
>>     + pgsql-master-baseline : 58:4B20
>>     + pgsql-status : PRI
>>
>> This is fine and understandable, but I can see it only if I run crm_mon -A.
>> My problem is that CRM shows the following:
>>
>> Master/Slave Set: db-ms-psql [postgresql]
>>     Masters: [ psql1 ]
>>     Slaves: [ psql2 ]
>>
>> So if I monitor the system from crm_mon, HAWK or other tools, I have no indication at all that the slave is running in an inconsistent mode.
>>
>> I would expect the RA to stop the psql2 node in such cases, because:
>> - It is running but has out-of-date data, therefore no one will use it (the slave IP points to the master as well, which is good)
>> - In the CRM status everything looks perfect, even though it is NOT perfect and admin intervention is required.
>>
>> Shouldn't the disconnected PSQL server be stopped instead?
>
> Hmm... It is not better to stop the PGSQL server. The RA cannot know whether PGSQL is disconnected because of data inconsistency, a network outage, a startup in progress, and so on.

Why does it matter? If the state is degraded and inconsistent and there is no way to fix it from inside the RA, the RA should probably stop it.

Let's say that there is pgpool running in front of the cluster; keeping an inconsistent node up would lead to SQL queries being routed to it and possibly returning wrong results.

> How about using a dummy RA such as vip-slave?
> ---
> primitive runningSlaveOK ocf:heartbeat:Dummy
> .(snip)
> location rsc_location-dummy runningSlaveOK \
>     rule 200: pgsql-status eq HS:sync
> ---

That probably fixes the visibility issue. What about notifications on the DISCONNECT state? How would an administrator know that the cluster is inconsistent? Maybe the better option in this case would be colocating a MailTo resource with HS:alone?

> Regards,
> Takatoshi MATSUO

--
Serge Dubrouski.
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Takatoshi,

One strange thing I noticed that could probably be improved.
When there is data inconsistency, I have the following node properties:

* Node psql2:
    + default_ping_set : 100
    + master-postgresql:1 : -INFINITY
    + pgsql-data-status : DISCONNECT
    + pgsql-status : HS:alone
* Node psql1:
    + default_ping_set : 100
    + master-postgresql:0 : 1000
    + master-postgresql:1 : -INFINITY
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 58:4B20
    + pgsql-status : PRI

This is fine and understandable, but I can see it only if I run crm_mon -A.
My problem is that CRM shows the following:

Master/Slave Set: db-ms-psql [postgresql]
    Masters: [ psql1 ]
    Slaves: [ psql2 ]

So if I monitor the system from crm_mon, HAWK or other tools, I have no indication at all that the slave is running in an inconsistent mode.

I would expect the RA to stop the psql2 node in such cases, because:
- It is running but has out-of-date data, therefore no one will use it (the slave IP points to the master as well, which is good)
- In the CRM status everything looks perfect, even though it is NOT perfect and admin intervention is required.

Shouldn't the disconnected PSQL server be stopped instead?

Regards,
Attila

-----Original Message-----
From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
Sent: 2011. november 28. 11:10
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

2011/11/28 Attila Megyeri amegy...@minerva-soft.com:
> Hi Takatoshi,
>
> I understand your point and I agree that the correct behavior is not to start replication when data inconsistency exists. The only thing I do not really understand is how it could have happened:
>
> 1) nodes were in sync (psql1=PRI, psql2=STREAMING|SYNC)
> 2) I shut down node psql1 (by placing it into standby)
> 3) At this moment psql1's baseline became higher by 20? What could cause this? Probably the demote operation itself? There were no clients connected, and there was definitely no write operation to the db (unless it came from the system side).

Yes, PostgreSQL executes a CHECKPOINT when it is shut down normally on demote.

> On the other hand, thank you very much for your contribution. The RA works very well and I really appreciate your work and help!

Not at all. Don't mention it.

Regards,
Takatoshi MATSUO

> Bests,
> Attila
>
> -----Original Message-----
> From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
> Sent: 2011. november 28. 2:10
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed
>
> Hi Attila
>
> The primary cannot always send all WALs to the hot standby, even when it is shut down normally. These logs confirm it:
>
> Nov 27 16:03:27 psql1 pgsql[12204]: INFO: My Timeline ID and Checkpoint : 14:2320
> Nov 27 16:03:27 psql1 pgsql[12204]: INFO: psql2 master baseline : 14:2300
>
> psql1's location was 2320 when it was demoted, whereas psql2's location was 2300 when it was promoted. It means that psql1's data was newer than psql2's at that time. The gap is 20.
>
> As you said, you can start psql1's PostgreSQL manually, but PostgreSQL cannot detect this situation. If you start a hot standby on psql1, data is replicated from 2320 onward. That is an inconsistency.
>
> Thanks,
> Takatoshi MATSUO
>
> 2011/11/28 Attila Megyeri amegy...@minerva-soft.com:
>> Hi Takatoshi,
>>
>> I don't think it is an inconsistency problem; to me it looks like an RA bug. I think so because postgres starts properly outside pacemaker.
>>
>> When pacemaker starts node psql1 I see only:
>>
>> postgresql:0_start_0 (node=psql1, call=9, rc=1, status=complete): unknown error
>>
>> and the postgres log is empty, so I suppose that it does not even try to start it.
>>
>> What I tested was:
>> - I had a stable cluster, where psql1 was the master and psql2 was the slave
>> - I put psql1 into standby mode (node psql1 standby) to test failover
>> - After a while psql2 became the PRI, which is very good
>> - When I put psql1 back online, postgres wouldn't start anymore from pacemaker (unknown error).
>>
>> I tried to start postgres manually from the shell and it worked fine; even the monitor was able to see that it went into SYNC (obviously the master/slave group was showing an improper state, as psql was started outside pacemaker).
>>
>> I don't think data inconsistency is the case, partly because there are no clients connected, partly because psql starts properly outside pacemaker.
>>
>> Here is what is relevant from the log:
>>
>> Nov 27 16:02:50 psql1 pgsql[11021]: DEBUG: PostgreSQL is running as a primary.
>> Nov 27 16:02:51 psql1 pgsql[11021]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
>> Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Attila

2011/12/8 Attila Megyeri amegy...@minerva-soft.com:
> Hi Takatoshi,
>
> One strange thing I noticed that could probably be improved.
> When there is data inconsistency, I have the following node properties:
>
> * Node psql2:
>     + default_ping_set : 100
>     + master-postgresql:1 : -INFINITY
>     + pgsql-data-status : DISCONNECT
>     + pgsql-status : HS:alone
> * Node psql1:
>     + default_ping_set : 100
>     + master-postgresql:0 : 1000
>     + master-postgresql:1 : -INFINITY
>     + pgsql-data-status : LATEST
>     + pgsql-master-baseline : 58:4B20
>     + pgsql-status : PRI
>
> This is fine and understandable, but I can see it only if I run crm_mon -A.
> My problem is that CRM shows the following:
>
> Master/Slave Set: db-ms-psql [postgresql]
>     Masters: [ psql1 ]
>     Slaves: [ psql2 ]
>
> So if I monitor the system from crm_mon, HAWK or other tools, I have no indication at all that the slave is running in an inconsistent mode.
>
> I would expect the RA to stop the psql2 node in such cases, because:
> - It is running but has out-of-date data, therefore no one will use it (the slave IP points to the master as well, which is good)
> - In the CRM status everything looks perfect, even though it is NOT perfect and admin intervention is required.
>
> Shouldn't the disconnected PSQL server be stopped instead?

Hmm... It is not better to stop the PGSQL server. The RA cannot know whether PGSQL is disconnected because of data inconsistency, a network outage, a startup in progress, and so on.

How about using a dummy RA such as vip-slave?

---
primitive runningSlaveOK ocf:heartbeat:Dummy
.(snip)
location rsc_location-dummy runningSlaveOK \
    rule 200: pgsql-status eq HS:sync
---

Regards,
Takatoshi MATSUO
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Takatoshi,

I understand your point and I agree that the correct behavior is not to start replication when data inconsistency exists. The only thing I do not really understand is how it could have happened:

1) nodes were in sync (psql1=PRI, psql2=STREAMING|SYNC)
2) I shut down node psql1 (by placing it into standby)
3) At this moment psql1's baseline became higher by 20? What could cause this? Probably the demote operation itself? There were no clients connected, and there was definitely no write operation to the db (unless it came from the system side).

On the other hand, thank you very much for your contribution. The RA works very well and I really appreciate your work and help!

Bests,
Attila

-----Original Message-----
From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
Sent: 2011. november 28. 2:10
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

The primary cannot always send all WALs to the hot standby, even when it is shut down normally. These logs confirm it:

Nov 27 16:03:27 psql1 pgsql[12204]: INFO: My Timeline ID and Checkpoint : 14:2320
Nov 27 16:03:27 psql1 pgsql[12204]: INFO: psql2 master baseline : 14:2300

psql1's location was 2320 when it was demoted, whereas psql2's location was 2300 when it was promoted. It means that psql1's data was newer than psql2's at that time. The gap is 20.

As you said, you can start psql1's PostgreSQL manually, but PostgreSQL cannot detect this situation. If you start a hot standby on psql1, data is replicated from 2320 onward. That is an inconsistency.

Thanks,
Takatoshi MATSUO

2011/11/28 Attila Megyeri amegy...@minerva-soft.com:
> Hi Takatoshi,
>
> I don't think it is an inconsistency problem; to me it looks like an RA bug. I think so because postgres starts properly outside pacemaker.
When pacemaker starts node psql1 I see only: postgresql:0_start_0 (node=psql1, call=9, rc=1, status=complete): unknown error and the postgres log is empty - so I suppose that it does not even try to start it. What I tested was: - I had a stable cluster, where psql1 was the master, psql2 was the slave - I put psql1 into standby mode. (node psql1 standby) to test failover - After a while psql2 became the PRI, which is very good - When I put psql1 back online, postgres wouldn't start anymore from pacemaker (unknown error). I tried to start postgres manually from the shell it worked fine, even the monitor was able to see that it became in SYNC (obviously the master/slave group was showing improper state as psql was started outside pacemaker. I don't think data inconsistency is the case, partially because there are no clients connected, partially because psql starts properly outside pacemaker. Here is what is relevant from the log: Nov 27 16:02:50 psql1 pgsql[11021]: DEBUG: PostgreSQL is running as a primary. Nov 27 16:02:51 psql1 pgsql[11021]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: PostgreSQL is running as a primary. Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: PostgreSQL is running as a primary. Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: PostgreSQL is running as a primary. Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: PostgreSQL is running as a primary. Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC Nov 27 16:03:00 psql1 pgsql[11556]: DEBUG: notify: pre for demote Nov 27 16:03:00 psql1 pgsql[11590]: INFO: Stopping PostgreSQL on demote. Nov 27 16:03:02 psql1 pgsql[11590]: INFO: waiting for server to shut down. 
done server stopped Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Removing /var/lib/pgsql/PGSQL.lock. Nov 27 16:03:02 psql1 pgsql[11590]: INFO: PostgreSQL is down Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Changing pgsql-status on psql1 : PRI-STOP. Nov 27 16:03:02 psql1 pgsql[11590]: DEBUG: Created recovery.conf. host=10.12.1.28, user=postgres Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Setup all nodes as an async. Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: notify: post for demote Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: post-demote called. Demote uname is psql1 Nov 27 16:03:02 psql1 pgsql[11732]: INFO: My Timeline ID and Checkpoint : 14:2320 Nov 27 16:03:02 psql1 pgsql[11732]: WARNING: Can't get psql2 master baseline. Waiting... Nov 27 16:03:03 psql1 pgsql[11732]: INFO: psql2 master baseline : 14:2300 Nov 27 16:03:03 psql1 pgsql[11732]: ERROR: My data is inconsistent. Nov 27 16:03:03 psql1 pgsql[11867]: DEBUG: notify: pre for stop Nov 27 16:03:03 psql1 pgsql[11969]: INFO: PostgreSQL
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Attila

2011/11/28 Attila Megyeri amegy...@minerva-soft.com:
> Hi Takatoshi,
>
> I understand your point and I agree that the correct behavior is not to start replication when data inconsistency exists. The only thing I do not really understand is how it could have happened:
>
> 1) nodes were in sync (psql1=PRI, psql2=STREAMING|SYNC)
> 2) I shut down node psql1 (by placing it into standby)
> 3) At this moment psql1's baseline became higher by 20? What could cause this? Probably the demote operation itself? There were no clients connected, and there was definitely no write operation to the db (unless it came from the system side).

Yes, PostgreSQL executes a CHECKPOINT when it is shut down normally on demote.

> On the other hand, thank you very much for your contribution. The RA works very well and I really appreciate your work and help!

Not at all. Don't mention it.

Regards,
Takatoshi MATSUO

> Bests,
> Attila
>
> -----Original Message-----
> From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
> Sent: 2011. november 28. 2:10
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed
>
> Hi Attila
>
> The primary cannot always send all WALs to the hot standby, even when it is shut down normally. These logs confirm it:
>
> Nov 27 16:03:27 psql1 pgsql[12204]: INFO: My Timeline ID and Checkpoint : 14:2320
> Nov 27 16:03:27 psql1 pgsql[12204]: INFO: psql2 master baseline : 14:2300
>
> psql1's location was 2320 when it was demoted, whereas psql2's location was 2300 when it was promoted. It means that psql1's data was newer than psql2's at that time. The gap is 20.
>
> As you said, you can start psql1's PostgreSQL manually, but PostgreSQL cannot detect this situation. If you start a hot standby on psql1, data is replicated from 2320 onward. That is an inconsistency.
>
> Thanks,
> Takatoshi MATSUO
>
> 2011/11/28 Attila Megyeri amegy...@minerva-soft.com:
>> Hi Takatoshi,
>>
>> I don't think it is an inconsistency problem; to me it looks like an RA bug. I think so because postgres starts properly outside pacemaker.
When pacemaker starts node psql1 I see only: postgresql:0_start_0 (node=psql1, call=9, rc=1, status=complete): unknown error and the postgres log is empty - so I suppose that it does not even try to start it. What I tested was: - I had a stable cluster, where psql1 was the master, psql2 was the slave - I put psql1 into standby mode. (node psql1 standby) to test failover - After a while psql2 became the PRI, which is very good - When I put psql1 back online, postgres wouldn't start anymore from pacemaker (unknown error). I tried to start postgres manually from the shell it worked fine, even the monitor was able to see that it became in SYNC (obviously the master/slave group was showing improper state as psql was started outside pacemaker. I don't think data inconsistency is the case, partially because there are no clients connected, partially because psql starts properly outside pacemaker. Here is what is relevant from the log: Nov 27 16:02:50 psql1 pgsql[11021]: DEBUG: PostgreSQL is running as a primary. Nov 27 16:02:51 psql1 pgsql[11021]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: PostgreSQL is running as a primary. Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: PostgreSQL is running as a primary. Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: PostgreSQL is running as a primary. Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: PostgreSQL is running as a primary. Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC Nov 27 16:03:00 psql1 pgsql[11556]: DEBUG: notify: pre for demote Nov 27 16:03:00 psql1 pgsql[11590]: INFO: Stopping PostgreSQL on demote. Nov 27 16:03:02 psql1 pgsql[11590]: INFO: waiting for server to shut down. 
done server stopped Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Removing /var/lib/pgsql/PGSQL.lock. Nov 27 16:03:02 psql1 pgsql[11590]: INFO: PostgreSQL is down Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Changing pgsql-status on psql1 : PRI-STOP. Nov 27 16:03:02 psql1 pgsql[11590]: DEBUG: Created recovery.conf. host=10.12.1.28, user=postgres Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Setup all nodes as an async. Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: notify: post for demote Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: post-demote called. Demote uname is psql1 Nov 27 16:03:02 psql1 pgsql[11732]: INFO: My Timeline ID and Checkpoint : 14:2320 Nov 27 16:03:02 psql1 pgsql[11732]: WARNING: Can't get psql2 master baseline. Waiting... Nov 27 16:03:03 psql1 pgsql[11732]: INFO
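
The baseline comparison described in this message (psql1 at 14:2320 when demoted, psql2 at 14:2300 when promoted) can be sketched in a few lines of shell. This is an illustration under stated assumptions, not the RA's actual code: the value format is taken to be "<timeline>:<location>" with a hexadecimal location (suggested by the 58:4B20 value seen elsewhere in the thread), and the function name baseline_gap is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper, not taken from the RA.
# Usage: baseline_gap MINE THEIRS, where each argument is "<timeline>:<location>"
# and <location> is assumed to be hexadecimal (as in 58:4B20).
# Prints how far MINE is ahead of THEIRS as a decimal number;
# prints "timeline-mismatch" and returns 1 if the timeline IDs differ.
baseline_gap() {
    my_tl=${1%%:*};    my_loc=${1#*:}
    their_tl=${2%%:*}; their_loc=${2#*:}
    if [ "$my_tl" != "$their_tl" ]; then
        echo "timeline-mismatch"
        return 1
    fi
    echo $(( 0x$my_loc - 0x$their_loc ))
}

# Values from the log in this thread: psql1 (demoted) vs psql2 (promoted).
gap=$(baseline_gap 14:2320 14:2300)
if [ "$gap" -gt 0 ]; then
    echo "old primary is ahead by $gap (0x$(printf '%x' "$gap")): treat data as inconsistent"
fi
```

If the locations really are hexadecimal, the difference is 0x20, which matches the "gap is 20" mentioned in the message. A positive gap means the demoted node holds WAL that the new primary never received, which is the situation in which the RA refuses to start it.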
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Takatoshi,

You were right, changing the shell to bash resolved the problem. The cluster now started in sync mode. Thank you very much.

I will be testing it in the next couple of days. I did just a very quick test: it seems that the psql master failed over to psql2 properly, but when I tried to move it back to psql1 there were some problems starting psql on node 1. Does it work fine for you in both directions?

Thank you very much.
Have a nice weekend,
Attila

-----Original Message-----
From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
Sent: 2011. november 27. 6:12
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

2011/11/27 Attila Megyeri amegy...@minerva-soft.com:
> Hi Takatoshi,
>
> Thank you for coming back to me so quickly.
> In /var/lib/pgsql there are the following files:
>
> PSQL1:
> root@psql1:/var/lib/pgsql# ls -la
> total 16
> drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:04 .
> drwxr-xr-x 35 root     root     4096 Nov 25 22:21 ..
> -rw-r--r--  1 postgres postgres    1 Nov 26 00:17 rep_mode.conf
> -rw-r--r--  1 root     root       49 Nov 26 18:04 xlog_note.0
> root@psql1:/var/lib/pgsql# cat xlog_note.0
> -e psql1 1900 psql2 1900
> root@psql1:/var/lib/pgsql#
>
> PSQL2:
> root@psql2:/var/lib/pgsql# ls -la
> total 16
> drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:05 .
> drwxr-xr-x 33 root     root     4096 Nov 26 00:10 ..
> -rw-r--r--  1 postgres postgres    1 Nov 26 00:24 rep_mode.conf
> -rw-r--r--  1 root     root       49 Nov 26 18:05 xlog_note.0
> root@psql2:/var/lib/pgsql# cat xlog_note.0
> -e psql1 1900 psql2 1900
> root@psql2:/var/lib/pgsql#

It seems that dash's builtin echo command is being used, because echo with the -e option does not work. Perhaps my RA also depends on bash. Can you use bash instead of dash?

> BTW, postgres is installed under /var/lib/postgresql, but I noticed that some parts of the RA refer to the /var/lib/pgsql directory, so I created that directory and I keep some of the files there.

It's no problem.
If you want to change this path, please specify it using the tmpdir parameter.

Regards,
Takatoshi MATSUO

> Thanks,
> Attila
>
> -----Original Message-----
> From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
> Sent: 2011. november 26. 18:27
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed
>
> Hi Attila
>
> 1. Are there /var/lib/pgsql/xlog_note.0, xlog_note.1, xlog_note.2 files? These files are created while checking the xlog location on monitor.
> 2. Do these files include lines like the ones below?
>
> ---
> pgsql1 1900
> pgsql2 1900
> ---
>
> Regards.
> Takatoshi MATSUO
>
> 2011-11-26 22:44 Attila Megyeri amegy...@minerva-soft.com:
>> Hi Yoshiharu, Takatoshi,
>>
>> Spent another day, without success. :(
>>
>> I started from scratch, and synchronous replication works nicely when the nodes are started outside pacemaker. My PostgreSQL version is 9.1.1.
>>
>> When I start from pacemaker, after a while it gets into the following state:
>>
>> Online: [ psql1 psql2 ]
>>
>> Master/Slave Set: msPostgresql [postgresql]
>>     Slaves: [ psql1 psql2 ]
>> Clone Set: clnPingCheck [pingCheck]
>>     Started: [ psql1 psql2 ]
>>
>> Node Attributes:
>> * Node psql1:
>>     + default_ping_set : 100
>>     + master-postgresql:0 : -INFINITY
>>     + pgsql-status : HS:alone
>>     + pgsql-xlog-loc : 1900
>> * Node psql2:
>>     + default_ping_set : 100
>>     + master-postgresql:1 : -INFINITY
>>     + pgsql-status : HS:alone
>>     + pgsql-xlog-loc : 1900
>>
>> The psql status queries return the following:
>>
>> PSQL1
>> postgres@psql1:/root$ psql -c "select application_name,upper(state),upper(sync_state) from pg_stat_replication"
>>  application_name | upper | upper
>> ------------------+-------+-------
>> (0 rows)
>>
>> postgres@psql1:/root$ psql -Atc "select pg_last_xlog_replay_location(),pg_last_xlog_receive_location()"
>> 0/1920|0/1900
>>
>> PSQL2
>> postgres@psql2:~$ psql -c "select application_name,upper(state),upper(sync_state) from pg_stat_replication"
>>  application_name | upper | upper
>> ------------------+-------+-------
>> (0 rows)
>>
>> postgres@psql2:~$ psql -Atc "select pg_last_xlog_replay_location(),pg_last_xlog_receive_location()"
>> 0/1900|0/1900
>>
>> Neither server can connect (obviously) to the master, as the vip_repl is not brought up.
>>
>> Could you help me understand WHAT is the action/state/event that should promote one of the nodes? I see that pacemaker monitors the servers every X seconds, but nothing else happens.
>>
>> In the log (limited to pgsql) the following sequence is repeated forever:
>>
>> Nov 26 13:36:19 psql1 pgsql[19829]: INFO: Master is not exist.
>> Nov 26 13:36:19 psql1 pgsql[19829]: DEBUG: Checking right
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Attila

2011/11/27 Attila Megyeri amegy...@minerva-soft.com:
> Hi Takatoshi,
>
> You were right, changing the shell to bash resolved the problem.
> The cluster now started in sync mode - thank you very much.

You're very welcome.

> I will be testing it in the next couple of days. I did just a very quick
> test - it seems that the psql master failed over to psql2 properly, but when
> I tried to move it back to psql1 there were some problems starting psql on
> node 1.

If the master (psql1) fails, its data may be inconsistent. A PostgreSQL developer says that this is a feature. Therefore my RA prevents it from starting automatically if the data is inconsistent.

Please back up psql2's data and restore it to psql1, and remove the /var/lib/pgsql/PGSQL.lock file before clearing the failcount.

I use rsync to back up and restore in the following way.

-----
# psql -h 192.168.2.114 -U postgres -c "SELECT pg_start_backup('label')"
# rsync -avr --delete --exclude=postmaster.pid 192.168.2.114:/var/lib/pgsql/9.1/data/ /var/lib/pgsql/9.1/data/
# psql -h 192.168.2.114 -U postgres -c "SELECT pg_stop_backup()"
-----

BTW, I fixed some bugs 2 days ago. Please use the newest version.

Thanks,
Takatoshi MATSUO

> Does it work fine for you in both directions?
>
> Thank you very much.
> Have a nice weekend,
> Attila
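The backup-and-restore procedure from this message (pg_start_backup, rsync, pg_stop_backup, then removing the lock file) can be sketched as a small script to run on the failed former master. This is only an illustration, not part of the RA: the address and paths are the ones quoted in the message, while the `run` helper, the `DRY_RUN` switch and the function name `resync_from_master` are hypothetical. With `DRY_RUN=1` (the default here) the commands are only printed.

```bash
#!/bin/sh
# Sketch: re-seed a failed former master from the current master.
# MASTER, PGDATA and LOCKFILE use the values quoted in this thread; adjust them.
MASTER="${MASTER:-192.168.2.114}"
PGDATA="${PGDATA:-/var/lib/pgsql/9.1/data}"
LOCKFILE="${LOCKFILE:-/var/lib/pgsql/PGSQL.lock}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical safety switch: 1 = print only

# run: print the command, and execute it only when DRY_RUN is not 1.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

resync_from_master() {
    run psql -h "$MASTER" -U postgres -c "SELECT pg_start_backup('label')" || return 1
    run rsync -avr --delete --exclude=postmaster.pid "$MASTER:$PGDATA/" "$PGDATA/"
    rc=$?
    # Always leave backup mode on the master, even if rsync failed.
    run psql -h "$MASTER" -U postgres -c "SELECT pg_stop_backup()" || return 1
    [ "$rc" -eq 0 ] || return "$rc"
    # Remove the lock file only after a successful copy.
    run rm -f "$LOCKFILE"
}
```

Calling `resync_from_master` with `DRY_RUN=1` lists the four commands in order; setting `DRY_RUN=0` would execute them. The failcount still has to be cleared afterwards, as described in the message.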
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Takatoshi,

I don't think it is an inconsistency problem - to me it looks like an RA bug. I think so because postgres starts properly outside pacemaker.

When pacemaker starts node psql1 I see only:

postgresql:0_start_0 (node=psql1, call=9, rc=1, status=complete): unknown error

and the postgres log is empty - so I suppose that it does not even try to start it.

What I tested was:
- I had a stable cluster, where psql1 was the master and psql2 was the slave
- I put psql1 into standby mode (node psql1 standby) to test failover
- After a while psql2 became the PRI, which is very good
- When I put psql1 back online, postgres wouldn't start anymore from pacemaker (unknown error).

When I tried to start postgres manually from the shell it worked fine; even the monitor was able to see that it was in SYNC (obviously the master/slave group was showing an improper state, as psql was started outside pacemaker).

I don't think data inconsistency is the case, partly because there are no clients connected, partly because psql starts properly outside pacemaker.

Here is what is relevant from the log:

Nov 27 16:02:50 psql1 pgsql[11021]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:02:51 psql1 pgsql[11021]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
Nov 27 16:03:00 psql1 pgsql[11556]: DEBUG: notify: pre for demote
Nov 27 16:03:00 psql1 pgsql[11590]: INFO: Stopping PostgreSQL on demote.
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: waiting for server to shut down. done server stopped
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Removing /var/lib/pgsql/PGSQL.lock.
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: PostgreSQL is down
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Changing pgsql-status on psql1 : PRI-STOP.
Nov 27 16:03:02 psql1 pgsql[11590]: DEBUG: Created recovery.conf. host=10.12.1.28, user=postgres
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Setup all nodes as an async.
Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: notify: post for demote
Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: post-demote called. Demote uname is psql1
Nov 27 16:03:02 psql1 pgsql[11732]: INFO: My Timeline ID and Checkpoint : 14:2320
Nov 27 16:03:02 psql1 pgsql[11732]: WARNING: Can't get psql2 master baseline. Waiting...
Nov 27 16:03:03 psql1 pgsql[11732]: INFO: psql2 master baseline : 14:2300
Nov 27 16:03:03 psql1 pgsql[11732]: ERROR: My data is inconsistent.
Nov 27 16:03:03 psql1 pgsql[11867]: DEBUG: notify: pre for stop
Nov 27 16:03:03 psql1 pgsql[11969]: INFO: PostgreSQL is already stopped.
Nov 27 16:03:12 psql1 pgsql[12053]: INFO: Don't check /var/lib/postgresql/9.1/main during probe
Nov 27 16:03:12 psql1 pgsql[12053]: INFO: PostgreSQL is down
Nov 27 16:03:27 psql1 pgsql[12204]: INFO: Changing pgsql-status on psql1 : -STOP.
Nov 27 16:03:27 psql1 pgsql[12204]: DEBUG: Created recovery.conf. host=10.12.1.28, user=postgres
Nov 27 16:03:27 psql1 pgsql[12204]: INFO: Setup all nodes as an async.
Nov 27 16:03:27 psql1 pgsql[12204]: INFO: My Timeline ID and Checkpoint : 14:2320
Nov 27 16:03:27 psql1 pgsql[12204]: INFO: psql2 master baseline : 14:2300
Nov 27 16:03:27 psql1 pgsql[12204]: ERROR: My data is inconsistent.
Nov 27 16:03:27 psql1 pgsql[12339]: DEBUG: notify: post for start
Nov 27 16:03:27 psql1 pgsql[12373]: DEBUG: notify: pre for stop
Nov 27 16:03:27 psql1 pgsql[12407]: INFO: PostgreSQL is already stopped.

Thanks,
Attila
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Yoshiharu, Takatoshi,

Spent another day, without success. :(

I started from scratch, and synchronous replication works nicely when the nodes are started outside pacemaker. My PostgreSQL version is 9.1.1.

When I start from pacemaker, after a while it gets into the following state:

Online: [ psql1 psql2 ]

Master/Slave Set: msPostgresql [postgresql]
    Slaves: [ psql1 psql2 ]
Clone Set: clnPingCheck [pingCheck]
    Started: [ psql1 psql2 ]

Node Attributes:
* Node psql1:
    + default_ping_set : 100
    + master-postgresql:0 : -INFINITY
    + pgsql-status : HS:alone
    + pgsql-xlog-loc : 1900
* Node psql2:
    + default_ping_set : 100
    + master-postgresql:1 : -INFINITY
    + pgsql-status : HS:alone
    + pgsql-xlog-loc : 1900

The psql status queries return the following:

PSQL1
=====
postgres@psql1:/root$ psql -c "select application_name,upper(state),upper(sync_state) from pg_stat_replication"
 application_name | upper | upper
------------------+-------+-------
(0 rows)

postgres@psql1:/root$ psql -Atc "select pg_last_xlog_replay_location(),pg_last_xlog_receive_location()"
0/1920|0/1900

PSQL2
=====
postgres@psql2:~$ psql -c "select application_name,upper(state),upper(sync_state) from pg_stat_replication"
 application_name | upper | upper
------------------+-------+-------
(0 rows)

postgres@psql2:~$ psql -Atc "select pg_last_xlog_replay_location(),pg_last_xlog_receive_location()"
0/1900|0/1900

Neither server can connect (obviously) to the master, as the vip_repl is not brought up.

Could you help me understand WHAT is the action/state/event that should promote one of the nodes? I see that pacemaker monitors the servers every X seconds, but nothing else happens.

In the log (limited to pgsql) the following sequence is repeated forever:

Nov 26 13:36:19 psql1 pgsql[19829]: INFO: Master is not exist.
Nov 26 13:36:19 psql1 pgsql[19829]: DEBUG: Checking right of master.
Nov 26 13:36:19 psql1 pgsql[19829]: INFO: My data status=.
Nov 26 13:36:19 psql1 pgsql[19829]: INFO: psql1 xlog location : 1900
Nov 26 13:36:19 psql1 pgsql[19829]: INFO: psql2 xlog location : 1900
Nov 26 13:36:26 psql1 pgsql[19993]: DEBUG: PostgreSQL is running as a hot standby.
Nov 26 13:36:26 psql1 pgsql[19993]: INFO: Master is not exist.
Nov 26 13:36:26 psql1 pgsql[19993]: DEBUG: Checking right of master.
Nov 26 13:36:26 psql1 pgsql[19993]: INFO: My data status=.
Nov 26 13:36:26 psql1 pgsql[19993]: INFO: psql1 xlog location : 1900
Nov 26 13:36:26 psql1 pgsql[19993]: INFO: psql2 xlog location : 1900
Nov 26 13:36:33 psql1 pgsql[20176]: DEBUG: PostgreSQL is running as a hot standby.
Nov 26 13:36:33 psql1 pgsql[20176]: INFO: Master is not exist.
Nov 26 13:36:33 psql1 pgsql[20176]: DEBUG: Checking right of master.
Nov 26 13:36:33 psql1 pgsql[20176]: INFO: My data status=.
Nov 26 13:36:33 psql1 pgsql[20176]: INFO: psql1 xlog location : 1900
Nov 26 13:36:33 psql1 pgsql[20176]: INFO: psql2 xlog location : 1900
Nov 26 13:36:41 psql1 pgsql[20343]: DEBUG: PostgreSQL is running as a hot standby.

Any help is appreciated!

Regards,
Attila

-----Original Message-----
From: Yoshiharu Mori [mailto:y-m...@sraoss.co.jp]
Sent: 2011-11-25 14:17
To: The Pacemaker cluster resource manager
Cc: Attila Megyeri
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

> A quick snippet from the corosync.log
>
> Nov 23 05:43:05 psql1 pgsql[2845]: DEBUG: Checking right of master.
> Nov 23 05:43:05 psql1 pgsql[2845]: INFO: My data status=.
> Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql1 xlog location : 0D00
> Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql2 xlog location : 0800
>
> As you see, the my data status returns an empty string.

My log is the same, but it works.

Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Master is not exist.
Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Checking right of master.
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: My data status=.
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm01 xlog location : 0520
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm02 xlog location : 0500

In my log, the following message is output and the master is started after the xlog location has been checked (3 times).

Nov 18 19:29:39 osspc24-1 pgsql[18720]: INFO: I have a master right.

Please show us more of corosync.log.
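The messages above revolve around comparing xlog locations between nodes. PostgreSQL 9.1 prints a location as two hexadecimal numbers separated by a slash (log id and byte offset), so a plain string comparison is not reliable. The following is a minimal sketch of a numeric comparison, written only to illustrate the idea; it is not the RA's own code, and the function name `xlog_cmp` is hypothetical.

```bash
#!/bin/sh
# xlog_cmp A B: compare two xlog locations of the form "HEX/HEX"
# (log id / byte offset, as printed by pg_last_xlog_replay_location()).
# Prints "newer", "older" or "same" for A relative to B.
xlog_cmp() {
    a_hi=$(( 0x${1%%/*} )); a_lo=$(( 0x${1##*/} ))
    b_hi=$(( 0x${2%%/*} )); b_lo=$(( 0x${2##*/} ))
    if   [ "$a_hi" -gt "$b_hi" ]; then echo newer
    elif [ "$a_hi" -lt "$b_hi" ]; then echo older
    elif [ "$a_lo" -gt "$b_lo" ]; then echo newer
    elif [ "$a_lo" -lt "$b_lo" ]; then echo older
    else echo same
    fi
}

# Example with the "replay|receive" pair reported by psql1 in this thread.
pair='0/1920|0/1900'
xlog_cmp "${pair%%|*}" "${pair##*|}"   # prints: newer
```

The log id is compared first and the offset only breaks ties, which is why both halves are converted separately.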
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Attila

1. Are there /var/lib/pgsql/xlog_note.0, xlog_note.1 and xlog_note.2 files? These files are created while the xlog location is checked on monitor.

2. Do these files include lines like the ones below?

-----
pgsql1 1900
pgsql2 1900
-----

Regards,
Takatoshi MATSUO
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Takatoshi,

Thank you for coming back to me so quickly.

In /var/lib/pgsql there are the following files:

PSQL1:
======
root@psql1:/var/lib/pgsql# ls -la
total 16
drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:04 .
drwxr-xr-x 35 root     root     4096 Nov 25 22:21 ..
-rw-r--r--  1 postgres postgres    1 Nov 26 00:17 rep_mode.conf
-rw-r--r--  1 root     root       49 Nov 26 18:04 xlog_note.0
root@psql1:/var/lib/pgsql# cat xlog_note.0
-e psql1 1900
psql2 1900
root@psql1:/var/lib/pgsql#

PSQL2:
======
root@psql2:/var/lib/pgsql# ls -la
total 16
drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:05 .
drwxr-xr-x 33 root     root     4096 Nov 26 00:10 ..
-rw-r--r--  1 postgres postgres    1 Nov 26 00:24 rep_mode.conf
-rw-r--r--  1 root     root       49 Nov 26 18:05 xlog_note.0
root@psql2:/var/lib/pgsql# cat xlog_note.0
-e psql1 1900
psql2 1900
root@psql2:/var/lib/pgsql#

BTW, postgres is installed under /var/lib/postgresql, but I noticed that some parts of the RA refer to the /var/lib/pgsql directory, so I created that directory and I keep some of the files there.

Thanks,
Attila
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Attila

2011/11/27 Attila Megyeri amegy...@minerva-soft.com:
> Hi Takatoshi,
>
> Thank you for coming back to me so quickly.
>
> In /var/lib/pgsql there are the following files:
>
> PSQL1:
> ======
> root@psql1:/var/lib/pgsql# ls -la
> total 16
> drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:04 .
> drwxr-xr-x 35 root     root     4096 Nov 25 22:21 ..
> -rw-r--r--  1 postgres postgres    1 Nov 26 00:17 rep_mode.conf
> -rw-r--r--  1 root     root       49 Nov 26 18:04 xlog_note.0
> root@psql1:/var/lib/pgsql# cat xlog_note.0
> -e psql1 1900
> psql2 1900
> root@psql1:/var/lib/pgsql#
>
> PSQL2:
> ======
> root@psql2:/var/lib/pgsql# ls -la
> total 16
> drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:05 .
> drwxr-xr-x 33 root     root     4096 Nov 26 00:10 ..
> -rw-r--r--  1 postgres postgres    1 Nov 26 00:24 rep_mode.conf
> -rw-r--r--  1 root     root       49 Nov 26 18:05 xlog_note.0
> root@psql2:/var/lib/pgsql# cat xlog_note.0
> -e psql1 1900
> psql2 1900
> root@psql2:/var/lib/pgsql#

It seems that dash's builtin echo command is being used, because echo with the -e option does not work as expected. Perhaps my RA depends on bash. Can you use bash instead of dash?

> BTW, postgres is installed under /var/lib/postgresql, but I noticed that
> some parts of the RA refer to the /var/lib/pgsql directory, so I created
> that directory and I keep some of the files there.

That's no problem. If you want to change this path, please specify it using the tmpdir parameter.

Regards,
Takatoshi MATSUO

> Thanks,
> Attila
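The stray "-e" at the start of xlog_note.0 is what a shell prints when its builtin echo does not treat -e as an option, which is the case for dash. A portable way to avoid the difference is to use printf, which behaves the same in bash and dash. The sketch below only illustrates that point; the function name `write_note` and its arguments are hypothetical and are not taken from the RA.

```bash
#!/bin/sh
# write_note FILE NODE LOC [NODE LOC ...]
# Writes one "node location" line per pair into FILE.
# printf is used instead of "echo -e" because its handling of "\n"
# does not depend on which shell /bin/sh points to.
write_note() {
    note_file=$1
    shift
    : > "$note_file"                 # truncate or create the file
    while [ "$#" -ge 2 ]; do
        printf '%s %s\n' "$1" "$2" >> "$note_file"
        shift 2
    done
}
```

For example, `write_note /tmp/xlog_note.test psql1 1900 psql2 1900` produces two clean lines without a leading "-e", regardless of whether the script runs under bash or dash.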
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Attila

2011/11/24 Attila Megyeri amegy...@minerva-soft.com:
> Hi Takatoshi, All,
>
> Thanks for your reply. I see that you have invested significant effort in the development of the RA. I spent the last day trying to set up the RA, but without much success.
>
> My infrastructure is very similar to yours, except for the fact that currently I am testing with a single network adapter.
>
> Replication works nicely when I start the databases manually, not using corosync.
>
> When I try to start using corosync, I see that the ping resources start normally, but the msPostgresql starts on both nodes in slave mode, and I see HS:alone

Seeing HS:alone is normal. The RA then compares the xlog locations and promotes the PostgreSQL instance that has the newest data.

> In the Wiki you state that if I start on a single node only, PSQL should start in Master mode (PRI), but this is not the case.

If the data is old, the node can't become master. To become master, a node needs pgsql-data-status=LATEST or STREAMING|SYNC. Please check it using crm_mon -A. Also, becoming master from the stopped state takes a few minutes, because the RA compares xlog locations on monitor.

> The recovery.conf file is created immediately, and from the logs I see no attempt at all to promote the node. In the postgres logs I see that node1, which is supposed to be a master, tries to connect to the vip-rep IP address, which is NOT brought up, because it depends on the Master role... Do you have any idea?

Please check the HA log. My RA writes "My data is out-of-date. status=" to the log if the data is old.

Regards,
Takatoshi MATSUO

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
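The rule described in this reply can be sketched roughly as follows. This is only an illustration of the logic as explained in the message, not the RA's actual code: may_become_master and its arguments are hypothetical, and the handling of an unset pgsql-data-status (fall back to comparing xlog locations) is an assumption drawn from the statement that the RA compares xlog locations and promotes the instance with the newest data.

```python
PROMOTABLE_STATUSES = {"LATEST", "STREAMING|SYNC"}


def may_become_master(data_status, my_xlog, peer_xlogs):
    """Decide whether this node is a promotion candidate.

    data_status -- value of pgsql-data-status, or "" if the attribute is unset
    my_xlog     -- this node's xlog location as an integer
    peer_xlogs  -- xlog locations of the other nodes, as integers
    """
    if data_status:
        # A recorded status must be one of the promotable values.
        return data_status in PROMOTABLE_STATUSES
    # Assumption: with no recorded status, the node with the newest data wins.
    return all(my_xlog >= peer for peer in peer_xlogs)


print(may_become_master("LATEST", 0x1900, [0x1900]))      # True
print(may_become_master("DISCONNECT", 0x1920, [0x1900]))  # False
print(may_become_master("", 0x0D00, [0x0800]))            # True
print(may_become_master("", 0x0800, [0x0D00]))            # False
```

Note that in this sketch two nodes with identical xlog locations are both candidates; how the real RA breaks such a tie is not covered here.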
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Takatoshi,

I have restored the PSQL to run without corosync, so I cannot send you the crm_mon output now. What I can tell for sure:

- The RA never promoted any of the nodes, no matter what the status was. It also did not promote the node when it was the only one.
- I believe the issue is in the comparison of the xlogs. How could I troubleshoot that? I see from the logs that crm NEVER tried to invoke pgsql with promote.
- I previously tried the crm_mon -A option, but there was never a pgsql-data-status attribute. The other attributes were there, including HS:alone.
- In the corosync log the only relevant RA message I see is "Master is not exist." I never saw a message like "My data is out-of-date".

Thank you!
Attila
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
A quick snippet from the corosync.log:

    Nov 23 05:43:05 psql1 pgsql[2845]: DEBUG: Checking right of master.
    Nov 23 05:43:05 psql1 pgsql[2845]: INFO: My data status=.
    Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql1 xlog location : 0D00
    Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql2 xlog location : 0800

As you can see, "My data status" returns an empty string.
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Attila

> A quick snippet from the corosync.log
>
>     Nov 23 05:43:05 psql1 pgsql[2845]: DEBUG: Checking right of master.
>     Nov 23 05:43:05 psql1 pgsql[2845]: INFO: My data status=.
>     Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql1 xlog location : 0D00
>     Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql2 xlog location : 0800
>
> As you see, the my data status returns an empty string.

My log is the same, but it works.

    Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Master is not exist.
    Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Checking right of master.
    Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: My data status=.
    Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm01 xlog location : 0520
    Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm02 xlog location : 0500

In my log, the following line is output, and it starts, after the xlog location has been checked (3 times).

    Nov 18 19:29:39 osspc24-1 pgsql[18720]: INFO: I have a master right.

Please show us more of corosync.log.

--
Yoshiharu Mori y-m...@sraoss.co.jp
SRA OSS, Inc Japan
http://www.sraoss.co.jp
TEL: 03-5979-2701 FAX: 03-5979-2702
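Going by the description above, the master right is only granted once the xlog comparison has come out in the node's favour on several monitor runs (three in this log). A rough Python sketch of that kind of debounce follows. The class name, the reset-on-failure behaviour and the assumption that the checks must be consecutive are all made up for illustration and are not taken from the RA source.

```python
class MasterRightCounter:
    """Grant the master right after N consecutive favourable checks."""

    def __init__(self, required_checks=3):
        self.required_checks = required_checks
        self.consecutive = 0

    def record_check(self, has_newest_data):
        """Record one monitor result; return True once the right is granted."""
        if has_newest_data:
            self.consecutive += 1
        else:
            # Assumption: any unfavourable check restarts the count.
            self.consecutive = 0
        return self.consecutive >= self.required_checks


counter = MasterRightCounter(required_checks=3)
results = [counter.record_check(True) for _ in range(3)]
print(results)  # [False, False, True]
```

With a monitor interval of a few seconds, a scheme like this would explain why promotion from a stopped state is not immediate.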
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Yoshiharu,

-----Original Message-----
From: Yoshiharu Mori [mailto:y-m...@sraoss.co.jp]
Sent: 2011. november 25. 14:17
To: The Pacemaker cluster resource manager
Cc: Attila Megyeri
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

> In my log, the following logs are outputted and started after checking xlog location(3 times).
>
>     Nov 18 19:29:39 osspc24-1 pgsql[18720]: INFO: I have a master right.
>
> Please show us more corosync.log.

===

I can leave it running forever, but it will never show "I have a master right".

To be honest, I have no idea what should promote the node to master. What is it that the RA checks, and what could be wrong? I just cannot find where the problem is.

Right now I am running corosync on node 1 only, as I expect that this way it will have the most recent xlog and start as a master. But it never starts.

Here is the output of crm_mon -A:

    Last updated: Fri Nov 25 13:52:58 2011
    Stack: openais
    Current DC: psql1 - partition WITHOUT quorum
    Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
    2 Nodes configured, 2 expected votes
    4 Resources configured.

    Online: [ psql1 ]
    OFFLINE: [ psql2 ]

    Master/Slave Set: msPostgresql [postgresql]
        Slaves: [ psql1 ]
        Stopped: [ postgresql:1 ]
    Clone Set: clnPingCheck [pingCheck]
        Started: [ psql1 ]
        Stopped: [ pingCheck:1 ]

    Node Attributes:
    * Node psql1:
        + default_ping_set      : 100
        + master-postgresql:0   : -INFINITY
        + pgsql-status          : HS:alone
        + pgsql-xlog-loc        : 1200

I sent the log directly in private so as not to overload the list. I did a "resource stop msPostgresql" and a "resource start msPostgresql" around 13:52. You will see some extra debug messages starting with ATT - I added them to the RA to help my troubleshooting.

Thank you for your help,
Attila
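Since the open question in this message is whether a pgsql-data-status attribute ever shows up, the Node Attributes section of the crm_mon -A output can be checked mechanically. The Python sketch below is a hypothetical parser that assumes only the "* Node name:" and "+ attribute : value" layout seen in the output above; it is not a Pacemaker tool.

```python
import re

NODE_RE = re.compile(r"^\*\s+Node\s+(\S+?):\s*$")
ATTR_RE = re.compile(r"^\+\s+(\S+)\s*:\s+(.*)$")


def parse_node_attributes(text):
    """Return {node: {attribute: value}} from the Node Attributes section."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        node = NODE_RE.match(line)
        if node:
            current = nodes.setdefault(node.group(1), {})
            continue
        attr = ATTR_RE.match(line)
        if attr and current is not None:
            current[attr.group(1)] = attr.group(2).strip()
    return nodes


sample = """\
Node Attributes:
* Node psql1:
    + default_ping_set : 100
    + master-postgresql:0 : -INFINITY
    + pgsql-status : HS:alone
    + pgsql-xlog-loc: 1200
"""

attrs = parse_node_attributes(sample)
print("pgsql-data-status" in attrs["psql1"])  # False
print(attrs["psql1"]["master-postgresql:0"])  # -INFINITY
```

Attribute names such as master-postgresql:0 contain a colon themselves, which is why the pattern splits on the last colon that is followed by whitespace rather than on the first colon.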
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
=INFINITY \ migration-threshold=1

Regards,
Attila
[Pacemaker] Postgresql streaming replication failover - RA needed
Hi All,

We have a two-node PostgreSQL 9.1 system configured using streaming replication (active/active with a read-only slave). We want to automate the failover process, and I couldn't really find a resource agent that could do the job. All HA solutions for PostgreSQL I have seen are based on a DRBD active/passive approach, which we would prefer not to use.

As a first stage I would be satisfied with failover only, meaning that the more complex failback would not be required. Of course, if failback could be implemented as well, that would be the right solution for us.

Does anyone have experience with the above setup? Any feedback is appreciated!

Regards,
Attila
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Attila,

On 2011-11-16 10:27, Attila Megyeri wrote:
> Hi All,
>
> We have a two-node postgresql 9.1 system configured using streaming replication (active/active with a read-only slave). We want to automate the failover process and I couldn't really find a resource agent that could do the job.

That is correct; the pgsql resource agent (unlike its mysql counterpart) does not support streaming replication. We've had a contributor submit a patch at one point, but it was somewhat ill-conceived and thus did not make it into the upstream repo. The relevant thread is here:

http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018195.html

Would you feel comfortable modifying the pgsql resource agent to support replication? If so, we could revisit this issue and potentially add streaming replication support to pgsql.

Cheers,
Florian

--
Need help with High Availability?
http://www.hastexo.com/now
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi Florian,

-----Original Message-----
From: Florian Haas [mailto:flor...@hastexo.com]
Sent: 2011. november 16. 11:49
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

> That is correct; the pgsql resource agent (unlike its mysql counterpart) does not support streaming replication. We've had a contributor submit a patch at one point, but it was somewhat ill-conceived and thus did not make it into the upstream repo. The relevant thread is here:
>
> http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018195.html
>
> Would you feel comfortable modifying the pgsql resource agent to support replication? If so, we could revisit this issue and potentially add streaming replication support to pgsql.

Well, I'm not sure I would be able to make that change. Failover is relatively easy to do, but I really have no idea how to do the failback part.

I will definitely have to sort this out somehow; I am just unsure whether I will try to use the repmgr mentioned in the video, or Pacemaker with some level of customization...

Is the resource agent that you mentioned available somewhere?

Thanks.
Attila
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
On Wed, Nov 16, 2011 at 12:55 PM, Attila Megyeri amegy...@minerva-soft.com wrote:
>> Would you feel comfortable modifying the pgsql resource agent to support replication? If so, we could revisit this issue and potentially add streaming replication support to pgsql.
>
> Well I'm not sure I would be able to do that change. Failover is relatively easy to do but I really have no idea how to do the failback part.

And that's exactly the reason why I haven't implemented it yet. With the way replication is currently done in PostgreSQL, there is no easy way to switch between roles, or at least I don't know of such a way. Implementing just the failover functionality, by creating a trigger file on a slave server when the master fails, does not amount to a full master-slave implementation, in my opinion.

> I will definitively have to sort this out somehow, I am just unsure whether I will try to use the repmgr mentioned in the video, or pacemaker with some level of customization...
>
> Is the resource agent that you mentioned available somewhere?
>
> Thanks.
> Attila

--
Serge Dubrouski.
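The trigger-file approach mentioned in this message relies on the trigger_file setting in the standby's recovery.conf: once the named file exists, the standby ends recovery and starts accepting writes. Below is a minimal Python sketch of that failover step only. The function name and the file path are invented for the example, and, as the message points out, this does nothing for failback.

```python
import tempfile
from pathlib import Path


def request_promotion(trigger_file):
    """Create the file named by trigger_file in recovery.conf.

    Returns True if this call created the file, False if it already
    existed (promotion had already been requested).
    """
    path = Path(trigger_file)
    if path.exists():
        return False
    path.touch()
    return True


# Demo in a scratch directory; the file name is hypothetical and must
# match whatever trigger_file is set to on the real standby.
with tempfile.TemporaryDirectory() as tmp:
    trigger = Path(tmp) / "postgresql.trigger"
    print(request_promotion(trigger))  # True
    print(request_promotion(trigger))  # False
```

After such a promotion the old master cannot simply rejoin; it has to be rebuilt as a standby of the new master, which is the hard part discussed in this thread.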
Re: [Pacemaker] Postgresql streaming replication failover - RA needed
Hi All

I have created an RA for PostgreSQL 9.1 Streaming Replication, based on pgsql.

RA
  https://github.com/t-matsuo/resource-agents/blob/pgsql91/heartbeat/pgsql
Documents
  https://github.com/t-matsuo/resource-agents/wiki

It is almost totally changed from the previous patch:
http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018193.html

It creates recovery.conf and promotes PostgreSQL automatically. Additionally, it can switch between synchronous and asynchronous replication automatically.

Please try them and comment.

Regards,
Takatoshi MATSUO
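For orientation, a standby's recovery.conf in PostgreSQL 9.1 is a short key/value file. The Python sketch below renders a minimal one. standby_mode, primary_conninfo and recovery_target_timeline are real 9.1 settings, but the host address, the user name and the particular choice of settings are assumptions for illustration and not necessarily what the RA writes.

```python
def render_recovery_conf(master_host, user, application_name, port=5432):
    """Return the text of a minimal recovery.conf for a streaming standby."""
    conninfo = (
        f"host={master_host} port={port} "
        f"user={user} application_name={application_name}"
    )
    lines = [
        "standby_mode = 'on'",
        f"primary_conninfo = '{conninfo}'",
        "recovery_target_timeline = 'latest'",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values: a replication address, a replication user, and the
# standby's node name as application_name.
print(render_recovery_conf("192.0.2.10", "replicator", "psql2"), end="")
```

The application_name in primary_conninfo is what the master shows in pg_stat_replication and what synchronous_standby_names is matched against, so it matters when switching between synchronous and asynchronous replication.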