Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-12-13 Thread Takatoshi MATSUO
Hello Serge

2011/12/13 Serge Dubrouski serge...@gmail.com:


 On Mon, Dec 12, 2011 at 5:32 AM, Takatoshi MATSUO matsuo@gmail.com
 wrote:

 Hello

 2011/12/12 Serge Dubrouski serge...@gmail.com:
 
 
  On Thu, Dec 8, 2011 at 10:34 PM, Takatoshi MATSUO matsuo@gmail.com
  wrote:
 
  Hi Attila
 
  2011/12/8 Attila Megyeri amegy...@minerva-soft.com:
   Hi Takatoshi,
  
   One strange thing I noticed and could probably be improved.
   When there is data inconsistency, I have the following node
   properties:
  
   * Node psql2:
  + default_ping_set  : 100
  + master-postgresql:1   : -INFINITY
  + pgsql-data-status : DISCONNECT
  + pgsql-status  : HS:alone
   * Node psql1:
  + default_ping_set  : 100
  + master-postgresql:0   : 1000
  + master-postgresql:1   : -INFINITY
  + pgsql-data-status : LATEST
  + pgsql-master-baseline : 58:4B20
  + pgsql-status  : PRI
  
   This is fine, and understandable - but I can see this only if I do a
   crm_mon -A.
  
   My problem is, that CRM shows the following:
  
   Master/Slave Set: db-ms-psql [postgresql]
   Masters: [ psql1 ]
   Slaves: [ psql2 ]
  
   So if I monitor the system from crm_mon, HAWK or other tools, I have no
   indication at all that the slave is running in an inconsistent mode.

   I would expect the RA to stop the psql2 node in such cases, because:
   - It is running, but its data is not up to date, so no one will use
   it (the slave IP points to the master as well, which is good)
   - In the CRM status everything looks perfect, even though it is NOT
   perfect and admin intervention is required.
  
  
   Shouldn't the disconnected PSQL server be stopped instead?
 
  hmm..
  It's not better to stop PGSQL server.
  RA cannot know whether PGSQL is disconnected because of
  data-inconsistent or network-down or
  starting-up and so on.
 
 
  Why does it matter? If the state is degraded and inconsistent and there
  is
  no way to fix it from inside of the RA, RA should probably stop it.

 In this case, the HS's data may be consistent, but the primary doesn't have
 enough WALs, or the HS doesn't have enough WAL archives, to enter
 replication mode.
 Unfortunately, this RA doesn't calculate the number of WALs.


 Honestly I don't know how to better handle this. Pacemaker doesn't have a
 concept of degraded node state.

In this case the RA cannot know whether it is degraded or not
for the above reason.

Of course, the RA stops PostgreSQL when it is obviously degraded.




  Let's say pgpool is running in front of the cluster: keeping an
  inconsistent node up would lead to SQL queries being routed to it and
  possibly returning wrong results.
 

 It doesn't happen in my sample configuration.
 vip-slave runs on the master whenever the slave is not HS:sync.


 So you have a VIP for each slave node?


Yes.
If you don't need read-only access,
you can simply remove vip-slave.



 
 
 
  How about using dummy RA such as vip-slave?
  ---
  primitive runningSlaveOK ocf:heartbeat:Dummy
  .(snip)
 
  location rsc_location-dummy runningSlaveOK \
  rule  200: pgsql-status eq HS:sync
  ---

 
  That probably fixes the visibility issue. What about notifications on the
  DISCONNECT state? How would an administrator know that the cluster is
  inconsistent? Maybe the better option in this case would be colocating a
  MailTo resource with HS:alone?

 Yes, it's a good idea if you want to receive notifications.



Regards,
Takatoshi MATSUO

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-12-12 Thread Takatoshi MATSUO
Hello

2011/12/12 Serge Dubrouski serge...@gmail.com:


 On Thu, Dec 8, 2011 at 10:34 PM, Takatoshi MATSUO matsuo@gmail.com
 wrote:

 Hi Attila

 2011/12/8 Attila Megyeri amegy...@minerva-soft.com:
  Hi Takatoshi,
 
  One strange thing I noticed and could probably be improved.
  When there is data inconsistency, I have the following node properties:
 
  * Node psql2:
 + default_ping_set  : 100
 + master-postgresql:1   : -INFINITY
 + pgsql-data-status : DISCONNECT
 + pgsql-status  : HS:alone
  * Node psql1:
 + default_ping_set  : 100
 + master-postgresql:0   : 1000
 + master-postgresql:1   : -INFINITY
 + pgsql-data-status : LATEST
 + pgsql-master-baseline : 58:4B20
 + pgsql-status  : PRI
 
  This is fine, and understandable - but I can see this only if I do a
  crm_mon -A.
 
  My problem is, that CRM shows the following:
 
  Master/Slave Set: db-ms-psql [postgresql]
  Masters: [ psql1 ]
  Slaves: [ psql2 ]
 
  So if I monitor the system from crm_mon, HAWK or other tools, I have no
  indication at all that the slave is running in an inconsistent mode.

  I would expect the RA to stop the psql2 node in such cases, because:
  - It is running, but its data is not up to date, so no one will use
  it (the slave IP points to the master as well, which is good)
  - In the CRM status everything looks perfect, even though it is NOT perfect
  and admin intervention is required.
 
 
  Shouldn't the disconnected PSQL server be stopped instead?

 hmm..
 It's not better to stop PGSQL server.
 RA cannot know whether PGSQL is disconnected because of
 data-inconsistent or network-down or
 starting-up and so on.


 Why does it matter? If the state is degraded and inconsistent and there is
 no way to fix it from inside of the RA, RA should probably stop it.

In this case, the HS's data may be consistent, but the primary doesn't have
enough WALs, or the HS doesn't have enough WAL archives, to enter
replication mode.
Unfortunately, this RA doesn't calculate the number of WALs.


 Let's say pgpool is running in front of the cluster: keeping an
 inconsistent node up would lead to SQL queries being routed to it and
 possibly returning wrong results.


It doesn't happen in my sample configuration.
vip-slave runs on the master whenever the slave is not HS:sync.




 How about using dummy RA such as vip-slave?
 ---
 primitive runningSlaveOK ocf:heartbeat:Dummy
 .(snip)

 location rsc_location-dummy runningSlaveOK \
 rule  200: pgsql-status eq HS:sync
 ---


 That probably fixes the visibility issue. What about notifications on the
 DISCONNECT state? How would an administrator know that the cluster is
 inconsistent? Maybe the better option in this case would be colocating a
 MailTo resource with HS:alone?

Yes, it's a good idea if you want to receive notifications.


Regards,
Takatoshi MATSUO



Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-12-12 Thread Serge Dubrouski
On Mon, Dec 12, 2011 at 5:32 AM, Takatoshi MATSUO matsuo@gmail.com wrote:

 Hello

 2011/12/12 Serge Dubrouski serge...@gmail.com:
 
 
  On Thu, Dec 8, 2011 at 10:34 PM, Takatoshi MATSUO matsuo@gmail.com
  wrote:
 
  Hi Attila
 
  2011/12/8 Attila Megyeri amegy...@minerva-soft.com:
   Hi Takatoshi,
  
   One strange thing I noticed and could probably be improved.
   When there is data inconsistency, I have the following node
 properties:
  
   * Node psql2:
  + default_ping_set  : 100
  + master-postgresql:1   : -INFINITY
  + pgsql-data-status : DISCONNECT
  + pgsql-status  : HS:alone
   * Node psql1:
  + default_ping_set  : 100
  + master-postgresql:0   : 1000
  + master-postgresql:1   : -INFINITY
  + pgsql-data-status : LATEST
  + pgsql-master-baseline : 58:4B20
  + pgsql-status  : PRI
  
   This is fine, and understandable - but I can see this only if I do a
   crm_mon -A.
  
   My problem is, that CRM shows the following:
  
   Master/Slave Set: db-ms-psql [postgresql]
   Masters: [ psql1 ]
   Slaves: [ psql2 ]
  
   So if I monitor the system from crm_mon, HAWK or other tools, I have no
   indication at all that the slave is running in an inconsistent mode.

   I would expect the RA to stop the psql2 node in such cases, because:
   - It is running, but its data is not up to date, so no one will use
   it (the slave IP points to the master as well, which is good)
   - In the CRM status everything looks perfect, even though it is NOT perfect
   and admin intervention is required.
  
  
   Shouldn't the disconnected PSQL server be stopped instead?
 
  hmm..
  It's not better to stop PGSQL server.
  RA cannot know whether PGSQL is disconnected because of
  data-inconsistent or network-down or
  starting-up and so on.
 
 
  Why does it matter? If the state is degraded and inconsistent and there
 is
  no way to fix it from inside of the RA, RA should probably stop it.

 In this case, the HS's data may be consistent, but the primary doesn't have
 enough WALs, or the HS doesn't have enough WAL archives, to enter
 replication mode.
 Unfortunately, this RA doesn't calculate the number of WALs.


Honestly I don't know how to better handle this. Pacemaker doesn't have a
concept of degraded node state.



  Let's say pgpool is running in front of the cluster: keeping an
  inconsistent node up would lead to SQL queries being routed to it and
  possibly returning wrong results.
 

 It doesn't happen in my sample configuration.
 vip-slave runs on the master whenever the slave is not HS:sync.


So you have a VIP for each slave node?



 
 
 
  How about using dummy RA such as vip-slave?
  ---
  primitive runningSlaveOK ocf:heartbeat:Dummy
  .(snip)
 
  location rsc_location-dummy runningSlaveOK \
  rule  200: pgsql-status eq HS:sync
  ---

 
  That probably fixes the visibility issue. What about notifications on the
  DISCONNECT state? How would an administrator know that the cluster is
  inconsistent? Maybe the better option in this case would be colocating a
  MailTo resource with HS:alone?

 Yes, it's a good idea if you want to receive notifications.


 Regards,
 Takatoshi MATSUO





-- 
Serge Dubrouski.


Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-12-11 Thread Serge Dubrouski
On Thu, Dec 8, 2011 at 10:34 PM, Takatoshi MATSUO matsuo@gmail.com wrote:

 Hi Attila

 2011/12/8 Attila Megyeri amegy...@minerva-soft.com:
  Hi Takatoshi,
 
  One strange thing I noticed and could probably be improved.
  When there is data inconsistency, I have the following node properties:
 
  * Node psql2:
 + default_ping_set  : 100
 + master-postgresql:1   : -INFINITY
 + pgsql-data-status : DISCONNECT
 + pgsql-status  : HS:alone
  * Node psql1:
 + default_ping_set  : 100
 + master-postgresql:0   : 1000
 + master-postgresql:1   : -INFINITY
 + pgsql-data-status : LATEST
 + pgsql-master-baseline : 58:4B20
 + pgsql-status  : PRI
 
  This is fine, and understandable - but I can see this only if I do a
 crm_mon -A.
 
  My problem is, that CRM shows the following:
 
  Master/Slave Set: db-ms-psql [postgresql]
  Masters: [ psql1 ]
  Slaves: [ psql2 ]
 
  So if I monitor the system from crm_mon, HAWK or other tools, I have no
  indication at all that the slave is running in an inconsistent mode.

  I would expect the RA to stop the psql2 node in such cases, because:
  - It is running, but its data is not up to date, so no one will use
  it (the slave IP points to the master as well, which is good)
  - In the CRM status everything looks perfect, even though it is NOT perfect
  and admin intervention is required.
 
 
  Shouldn't the disconnected PSQL server be stopped instead?

 hmm..
 It's not better to stop PGSQL server.
 RA cannot know whether PGSQL is disconnected because of
 data-inconsistent or network-down or
 starting-up and so on.


Why does it matter? If the state is degraded and inconsistent and there is
no way to fix it from inside the RA, the RA should probably stop it. Let's
say pgpool is running in front of the cluster: keeping an inconsistent node
up would lead to SQL queries being routed to it and possibly returning wrong
results.




 How about using dummy RA such as vip-slave?
 ---
 primitive runningSlaveOK ocf:heartbeat:Dummy
 .(snip)

 location rsc_location-dummy runningSlaveOK \
 rule  200: pgsql-status eq HS:sync
 ---


That probably fixes the visibility issue. What about notifications on the
DISCONNECT state? How would an administrator know that the cluster is
inconsistent? Maybe the better option in this case would be colocating a
MailTo resource with HS:alone?




 Regards,
 Takatoshi MATSUO





-- 
Serge Dubrouski.


Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-12-08 Thread Attila Megyeri
Hi Takatoshi,

One strange thing I noticed and could probably be improved.
When there is data inconsistency, I have the following node properties:

* Node psql2:
    + default_ping_set      : 100
    + master-postgresql:1   : -INFINITY
    + pgsql-data-status     : DISCONNECT
    + pgsql-status          : HS:alone
* Node psql1:
    + default_ping_set      : 100
    + master-postgresql:0   : 1000
    + master-postgresql:1   : -INFINITY
    + pgsql-data-status     : LATEST
    + pgsql-master-baseline : 58:4B20
    + pgsql-status          : PRI

This is fine and understandable, but I can see it only if I run crm_mon -A.

My problem is that CRM shows the following:

Master/Slave Set: db-ms-psql [postgresql]
 Masters: [ psql1 ]
 Slaves: [ psql2 ]

So if I monitor the system from crm_mon, HAWK or other tools, I have no
indication at all that the slave is running in an inconsistent mode.
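
Until the cluster status shows this directly, a monitoring script could check the attribute listing itself. The following is only a sketch, not part of the RA: the function name find_degraded_nodes is made up for illustration, it assumes the attribute layout printed by crm_mon -A as shown above, and it assumes that any pgsql-status other than PRI or HS:sync should be reported.

```shell
#!/bin/sh
# Hypothetical helper: reads "crm_mon -A"-style node attributes on stdin and
# prints "<node> <status>" for every node whose pgsql-status is neither PRI
# nor HS:sync (assumption: only those two values count as healthy).
find_degraded_nodes() {
    awk '
        # "* Node psql2:" starts a new node section; remember the node name.
        /^[[:space:]]*\* Node / { node = $3; sub(/:$/, "", node) }
        # "+ pgsql-status : HS:alone" carries the status in the last field.
        $2 == "pgsql-status" {
            status = $NF
            if (status != "PRI" && status != "HS:sync")
                print node, status
        }
    '
}
```

It could be fed from a one-shot run such as crm_mon -A -1 | find_degraded_nodes. For the attributes listed above it would print psql2 HS:alone.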

I would expect the RA to stop the psql2 node in such cases, because:
- It is running, but its data is not up to date, so no one will use it (the
slave IP points to the master as well, which is good)
- In the CRM status everything looks perfect, even though it is NOT perfect and
admin intervention is required.


Shouldn't the disconnected PSQL server be stopped instead?

Regards,
Attila




-Original Message-
From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
Sent: 2011. november 28. 11:10
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

2011/11/28 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 I understand your point and I agree that the correct behavior is not to start
 replication when data inconsistency exists.
 The only thing I do not really understand is how it could have happened:

 1) nodes were in sync (psql1=PRI, psql2=STREAMING|SYNC)
 2) I shut down node psql1 (by placing it into standby)
 3) At this moment psql1's baseline became higher by 20? What could cause
 this? Probably the demote operation itself? There were no clients connected,
 and there was definitely no write operation to the db (unless it came from
 the system side).

Yes, PostgreSQL executes a CHECKPOINT when it is shut down normally on demote.

 On the other hand - thank you very much for your contribution, the RA works 
 very well and I really appreciate your work and help!

Not at all. Don't mention it.

Regards,
Takatoshi MATSUO


 Bests,

 Attil

 -Original Message-
 From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
 Sent: 2011. november 28. 2:10
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover -
 RA needed

 Hi Attila

 Primary cannot always send all WALs to the HotStandby, even when the primary
 is shut down normally.
 These logs confirm it.

 Nov 27 16:03:27 psql1 pgsql[12204]: INFO: My Timeline ID and Checkpoint : 14:2320
 Nov 27 16:03:27 psql1 pgsql[12204]: INFO: psql2 master baseline : 14:2300

 psql1's location was 2320 when it was demoted.
 OTOH psql2's location was 2300 when it was promoted.

 It means that psql1's data was newer than psql2's at that time.
 The gap is 20.

 As you said, you can start psql1's PostgreSQL manually, but PostgreSQL cannot
 detect this situation.
 If you start a HotStandby on psql1, data is replicated only from 2320 onward.
 That is an inconsistency.

 Thanks,
 Takatoshi MATSUO


 2011/11/28 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 I don't think it is inconsistency problem - for me it looks like some RA bug.
 I think so, because postgres starts properly outside pacemaker.

 When pacemaker starts node psql1 I see only:

 postgresql:0_start_0 (node=psql1, call=9, rc=1, status=complete):
 unknown error

 and the postgres log is empty - so I suppose that it does not even try to 
 start it.

 What I tested was:
 - I had a stable cluster, where psql1 was the master, psql2 was the
 slave
 - I put psql1 into standby mode. (node psql1 standby) to test
 failover
 - After a while psql2 became the PRI, which is very good
 - When I put psql1 back online, postgres wouldn't start anymore from 
 pacemaker (unknown error).


 I tried to start postgres manually from the shell and it worked fine; even the
 monitor was able to see that it got into SYNC (obviously the master/slave
 group was showing an improper state, as psql was started outside pacemaker).

 I don't think data inconsistency is the case, partially because there are no 
 clients connected, partially because psql starts properly outside pacemaker.

 Here is what is relevant from the log:

 Nov 27 16:02:50 psql1 pgsql[11021]: DEBUG: PostgreSQL is running as a primary.
 Nov 27 16:02:51 psql1 pgsql[11021]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
 Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-12-08 Thread Takatoshi MATSUO
Hi Attila

2011/12/8 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 One strange thing I noticed and could probably be improved.
 When there is data inconsistency, I have the following node properties:

 * Node psql2:
+ default_ping_set  : 100
+ master-postgresql:1   : -INFINITY
+ pgsql-data-status : DISCONNECT
+ pgsql-status  : HS:alone
 * Node psql1:
+ default_ping_set  : 100
+ master-postgresql:0   : 1000
+ master-postgresql:1   : -INFINITY
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 58:4B20
+ pgsql-status  : PRI

 This is fine, and understandable - but I can see this only if I do a crm_mon 
 -A.

 My problem is, that CRM shows the following:

 Master/Slave Set: db-ms-psql [postgresql]
 Masters: [ psql1 ]
 Slaves: [ psql2 ]

 So if I monitor the system from crm_mon, HAWK or other tools, I have no
 indication at all that the slave is running in an inconsistent mode.

 I would expect the RA to stop the psql2 node in such cases, because:
 - It is running, but its data is not up to date, so no one will use it
 (the slave IP points to the master as well, which is good)
 - In the CRM status everything looks perfect, even though it is NOT perfect and
 admin intervention is required.


 Shouldn't the disconnected PSQL server be stopped instead?

Hmm.
I don't think it is better to stop the PostgreSQL server.
The RA cannot know whether PostgreSQL is disconnected because of
data inconsistency, a network outage,
a startup still in progress, and so on.


How about using a dummy RA, in the same way as vip-slave?
---
primitive runningSlaveOK ocf:heartbeat:Dummy
.(snip)

location rsc_location-dummy runningSlaveOK \
 rule  200: pgsql-status eq HS:sync
---


Regards,
Takatoshi MATSUO



Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-28 Thread Attila Megyeri
Hi Takatoshi,

I understand your point and I agree that the correct behavior is not to start
replication when data inconsistency exists.
The only thing I do not really understand is how it could have happened:

1) nodes were in sync (psql1=PRI, psql2=STREAMING|SYNC)
2) I shut down node psql1 (by placing it into standby)
3) At this moment psql1's baseline became higher by 20? What could cause this?
Probably the demote operation itself? There were no clients connected, and
there was definitely no write operation to the db (unless it came from the
system side).

On the other hand - thank you very much for your contribution, the RA works 
very well and I really appreciate your work and help!

Bests,

Attila

-Original Message-
From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
Sent: 2011. november 28. 2:10
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

Primary cannot always send all WALs to the HotStandby, even when the primary is
shut down normally.
These logs confirm it.

 Nov 27 16:03:27 psql1 pgsql[12204]: INFO: My Timeline ID and Checkpoint : 14:2320
 Nov 27 16:03:27 psql1 pgsql[12204]: INFO: psql2 master baseline : 14:2300

psql1's location was 2320 when it was demoted.
OTOH psql2's location was 2300 when it was promoted.

It means that psql1's data was newer than psql2's at that time.
The gap is 20.

As you said, you can start psql1's PostgreSQL manually, but PostgreSQL cannot
detect this situation.
If you start a HotStandby on psql1, data is replicated only from 2320 onward.
That is an inconsistency.
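
The check described above can be sketched in shell. This is not the RA's actual code: compare_baseline is a hypothetical helper, it assumes each value has the form timeline:location with a hexadecimal location, and it ignores the timeline ID for simplicity.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the pgsql RA.
# $1 = this node's "timeline:location" at demote time
# $2 = the new master's baseline "timeline:location"
# Assumption: the location part is hexadecimal; the timeline is not compared.
compare_baseline() {
    my_loc=${1#*:}        # strip everything up to and including the colon
    master_loc=${2#*:}
    gap=$(( 0x$my_loc - 0x$master_loc ))
    if [ "$gap" -gt 0 ]; then
        # Local data is newer than the point at which the peer was promoted.
        printf 'inconsistent: local data is ahead by 0x%X\n' "$gap"
        return 1
    fi
    printf 'ok\n'
}
```

With the values from the log, compare_baseline 14:2320 14:2300 reports that the local data is ahead by 0x20 and returns a non-zero status, which corresponds to the "My data is inconsistent" case.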

Thanks,
Takatoshi MATSUO


2011/11/28 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 I don't think it is inconsistency problem - for me it looks like some RA bug.
 I think so, because postgres starts properly outside pacemaker.

 When pacemaker starts node psql1 I see only:

 postgresql:0_start_0 (node=psql1, call=9, rc=1, status=complete):
 unknown error

 and the postgres log is empty - so I suppose that it does not even try to 
 start it.

 What I tested was:
 - I had a stable cluster, where psql1 was the master, psql2 was the
 slave
 - I put psql1 into standby mode. (node psql1 standby) to test
 failover
 - After a while psql2 became the PRI, which is very good
 - When I put psql1 back online, postgres wouldn't start anymore from 
 pacemaker (unknown error).


 I tried to start postgres manually from the shell and it worked fine; even the
 monitor was able to see that it got into SYNC (obviously the master/slave
 group was showing an improper state, as psql was started outside pacemaker).

 I don't think data inconsistency is the case, partially because there are no 
 clients connected, partially because psql starts properly outside pacemaker.

 Here is what is relevant from the log:

 Nov 27 16:02:50 psql1 pgsql[11021]: DEBUG: PostgreSQL is running as a primary.
 Nov 27 16:02:51 psql1 pgsql[11021]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
 Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: PostgreSQL is running as a primary.
 Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
 Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: PostgreSQL is running as a primary.
 Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
 Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: PostgreSQL is running as a primary.
 Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
 Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: PostgreSQL is running as a primary.
 Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
 Nov 27 16:03:00 psql1 pgsql[11556]: DEBUG: notify: pre for demote
 Nov 27 16:03:00 psql1 pgsql[11590]: INFO: Stopping PostgreSQL on demote.
 Nov 27 16:03:02 psql1 pgsql[11590]: INFO: waiting for server to shut down. done server stopped
 Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Removing /var/lib/pgsql/PGSQL.lock.
 Nov 27 16:03:02 psql1 pgsql[11590]: INFO: PostgreSQL is down
 Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Changing pgsql-status on psql1 : PRI-STOP.
 Nov 27 16:03:02 psql1 pgsql[11590]: DEBUG: Created recovery.conf. host=10.12.1.28, user=postgres
 Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Setup all nodes as an async.
 Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: notify: post for demote
 Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: post-demote called. Demote uname is psql1
 Nov 27 16:03:02 psql1 pgsql[11732]: INFO: My Timeline ID and Checkpoint : 14:2320
 Nov 27 16:03:02 psql1 pgsql[11732]: WARNING: Can't get psql2 master baseline. Waiting...
 Nov 27 16:03:03 psql1 pgsql[11732]: INFO: psql2 master baseline : 14:2300
 Nov 27 16:03:03 psql1 pgsql[11732]: ERROR: My data is inconsistent.
 Nov 27 16:03:03 psql1 pgsql[11867]: DEBUG: notify: pre for stop
 Nov 27 16:03:03 psql1 pgsql[11969]: INFO: PostgreSQL

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-28 Thread Takatoshi MATSUO
Hi Attila

2011/11/28 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 I understand your point and I agree that the correct behavior is not to start
 replication when data inconsistency exists.
 The only thing I do not really understand is how it could have happened:

 1) nodes were in sync (psql1=PRI, psql2=STREAMING|SYNC)
 2) I shut down node psql1 (by placing it into standby)
 3) At this moment psql1's baseline became higher by 20? What could cause
 this? Probably the demote operation itself? There were no clients connected,
 and there was definitely no write operation to the db (unless it came from
 the system side).

Yes, PostgreSQL executes a CHECKPOINT when it is shut down normally on demote.

 On the other hand - thank you very much for your contribution, the RA works 
 very well and I really appreciate your work and help!

Not at all. Don't mention it.

Regards,
Takatoshi MATSUO


 Bests,

 Attil

 -Original Message-
 From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
 Sent: 2011. november 28. 2:10
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

 Hi Attila

 Primary cannot always send all WALs to the HotStandby, even when the primary
 is shut down normally.
 These logs confirm it.

 Nov 27 16:03:27 psql1 pgsql[12204]: INFO: My Timeline ID and Checkpoint : 14:2320
 Nov 27 16:03:27 psql1 pgsql[12204]: INFO: psql2 master baseline : 14:2300

 psql1's location was 2320 when it was demoted.
 OTOH psql2's location was 2300 when it was promoted.

 It means that psql1's data was newer than psql2's at that time.
 The gap is 20.

 As you said, you can start psql1's PostgreSQL manually, but PostgreSQL cannot
 detect this situation.
 If you start a HotStandby on psql1, data is replicated only from 2320 onward.
 That is an inconsistency.

 Thanks,
 Takatoshi MATSUO


 2011/11/28 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 I don't think it is inconsistency problem - for me it looks like some RA bug.
 I think so, because postgres starts properly outside pacemaker.

 When pacemaker starts node psql1 I see only:

 postgresql:0_start_0 (node=psql1, call=9, rc=1, status=complete):
 unknown error

 and the postgres log is empty - so I suppose that it does not even try to 
 start it.

 What I tested was:
 - I had a stable cluster, where psql1 was the master, psql2 was the
 slave
 - I put psql1 into standby mode. (node psql1 standby) to test
 failover
 - After a while psql2 became the PRI, which is very good
 - When I put psql1 back online, postgres wouldn't start anymore from 
 pacemaker (unknown error).


 I tried to start postgres manually from the shell and it worked fine; even the
 monitor was able to see that it got into SYNC (obviously the master/slave
 group was showing an improper state, as psql was started outside pacemaker).

 I don't think data inconsistency is the case, partially because there are no 
 clients connected, partially because psql starts properly outside pacemaker.

 Here is what is relevant from the log:

 Nov 27 16:02:50 psql1 pgsql[11021]: DEBUG: PostgreSQL is running as a primary.
 Nov 27 16:02:51 psql1 pgsql[11021]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
 Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: PostgreSQL is running as a primary.
 Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
 Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: PostgreSQL is running as a primary.
 Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
 Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: PostgreSQL is running as a primary.
 Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
 Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: PostgreSQL is running as a primary.
 Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
 Nov 27 16:03:00 psql1 pgsql[11556]: DEBUG: notify: pre for demote
 Nov 27 16:03:00 psql1 pgsql[11590]: INFO: Stopping PostgreSQL on demote.
 Nov 27 16:03:02 psql1 pgsql[11590]: INFO: waiting for server to shut down. done server stopped
 Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Removing /var/lib/pgsql/PGSQL.lock.
 Nov 27 16:03:02 psql1 pgsql[11590]: INFO: PostgreSQL is down
 Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Changing pgsql-status on psql1 : PRI-STOP.
 Nov 27 16:03:02 psql1 pgsql[11590]: DEBUG: Created recovery.conf. host=10.12.1.28, user=postgres
 Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Setup all nodes as an async.
 Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: notify: post for demote
 Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: post-demote called. Demote uname is psql1
 Nov 27 16:03:02 psql1 pgsql[11732]: INFO: My Timeline ID and Checkpoint : 14:2320
 Nov 27 16:03:02 psql1 pgsql[11732]: WARNING: Can't get psql2 master baseline. Waiting...
 Nov 27 16:03:03 psql1 pgsql[11732]: INFO

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-27 Thread Attila Megyeri
Hi Takatoshi,

You were right, changing the shell to bash resolved the problem.
The cluster now started in sync mode - thank you very much.
I will be testing it in the next couple of days. I did just a very quick test:
it seems that the psql master failed over to psql2 properly, but when I tried to
move it back to psql1 there were some problems starting psql on node 1.

Does it work fine for you in both directions?

Thank you very much.

Have a nice weekend,

Attila



-Original Message-
From: Takatoshi MATSUO [mailto:matsuo@gmail.com] 
Sent: 2011. november 27. 6:12
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

2011/11/27 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 Thank you for coming back to me so quickly.

 In the /var/lib/pgsql there are the following files:

 PSQL1:
 =
 root@psql1:/var/lib/pgsql# ls -la
 total 16
 drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:04 .
 drwxr-xr-x 35 root     root     4096 Nov 25 22:21 ..
 -rw-r--r--  1 postgres postgres    1 Nov 26 00:17 rep_mode.conf
 -rw-r--r--  1 root     root       49 Nov 26 18:04 xlog_note.0

 root@psql1:/var/lib/pgsql# cat xlog_note.0
 -e psql1 1900
 psql2 1900
 root@psql1:/var/lib/pgsql#

 PSQL2:
 ===
 root@psql2:/var/lib/pgsql# ls -la
 total 16
 drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:05 .
 drwxr-xr-x 33 root     root     4096 Nov 26 00:10 ..
 -rw-r--r--  1 postgres postgres    1 Nov 26 00:24 rep_mode.conf
 -rw-r--r--  1 root     root       49 Nov 26 18:05 xlog_note.0

 root@psql2:/var/lib/pgsql# cat xlog_note.0
 -e psql1 1900
 psql2 1900
 root@psql2:/var/lib/pgsql#

It seems that dash's bultin echo command is used because echo with -e option 
dose not function.

Perhaps my RA also depends on bash.
Can you use a bash instead of a dash?

 BTW, postgres is installed under /var/lib/postgresql , but I noticed that 
 some parts of the RA are referring to the  /var/lib/pgsql directory, so I 
 created that directory and i keep some of the files there.

It's no ploblem.
If you want to change this path, please specify it using tmpdir parameter.

Regards,
Takatoshi MATSUO


 Thanks,
 Attila



 -Original Message-
 From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
 Sent: 2011. november 26. 18:27
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover - 
 RA needed

 Hi Attila

 1. Are there /var/lib/pgsql/xlog_note.0 , xlog_note.1, xlog_note.2  files?
   These files are created while checking a xlog location on monitor.

 2. Do these files include lines as below?
 -
 pgsql1  1900
 pgsql2  1900
 -

 Regards.
 Takatoshi MATSUO


 2011年11月26日22:44 Attila Megyeri amegy...@minerva-soft.com:
 Hi Yoshiharu, Takatoshi,

 Spent another day, without success. :(

 I started from scratch and synchronous replications works nicely when nodes 
 are started outside pacemaker.
 My PostgreSQL version is 9.1.1.

 When I start from pacemaker, after a while it gets into the following state:

 Online: [ psql1 psql2 ]

  Master/Slave Set: msPostgresql [postgresql]
     Slaves: [ psql1 psql2 ]
  Clone Set: clnPingCheck [pingCheck]
     Started: [ psql1 psql2 ]

 Node Attributes:
 * Node psql1:
    + default_ping_set                  : 100
    + master-postgresql:0               : -INFINITY
    + pgsql-status                      : HS:alone
    + pgsql-xlog-loc                    : 1900
 * Node psql2:
    + default_ping_set                  : 100
    + master-postgresql:1               : -INFINITY
    + pgsql-status                      : HS:alone
    + pgsql-xlog-loc                    : 1900


 The psql status queries return the following:

 PSQL1
 ==
 postgres@psql1:/root$ psql  -c select 
 application_name,upper(state),upper(sync_state) from pg_stat_replication
 application_name | upper | upper
 --+---+---
 (0 rows)

 postgres@psql1:/root$ psql  -Atc select 
 pg_last_xlog_replay_location(),pg_last_xlog_receive_location()
 0/1920|0/1900

 PSQL2
 ==
 postgres@psql2:~$  psql  -c select 
 application_name,upper(state),upper(sync_state) from pg_stat_replication
  application_name | upper | upper
 --+---+---
 (0 rows)

 postgres@psql2:~$ psql  -Atc select 
 pg_last_xlog_replay_location(),pg_last_xlog_receive_location()
 0/1900|0/1900


 Neither server can connect (obviously) to the master, as the vip_repl Is not 
 brought up.


 Could you help me understand WHAT is the action/state/event that sould 
 promote one of the nodes? I see that pacemaker monitors the servers every X 
 seconds, but nothing else happens.

 In the log (limited to pgsql) the following sequence is repeated 
 forewer

 Nov 26 13:36:19 psql1 pgsql[19829]: INFO: Master is not exist.
 Nov 26 13:36:19 psql1 pgsql[19829]: DEBUG: Checking right

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-27 Thread Takatoshi MATSUO
Hi Attila

2011/11/27 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 You were right, changing the shell to bash resolved the problem.
 The cluster now started in sync mode - thank you very much.

You're very welcome.

 I will be testing it over the next couple of days. I did just a very quick test:
 the psql master seems to have failed over to psql2 properly, but when I tried to
 move it back to psql1 there were some problems starting psql on node 1.

If the master (psql1) has failed, its data may be inconsistent.
A PostgreSQL developer says that this is a feature.
Therefore my RA prevents it from starting automatically if the data is inconsistent.
Please back up psql2's data, restore it to psql1, and remove the
/var/lib/pgsql/PGSQL.lock file before clearing the failcount.

I use rsync to back up and restore in the following way.
-
# psql -h 192.168.2.114 -U postgres -c "SELECT pg_start_backup('label')"
# rsync -avr --delete --exclude=postmaster.pid \
    192.168.2.114:/var/lib/pgsql/9.1/data/ /var/lib/pgsql/9.1/data/
# psql -h 192.168.2.114 -U postgres -c "SELECT pg_stop_backup()"
-
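
The order of these steps matters, so it may help to see them as one sequence. The following is only a sketch: it prints the commands instead of running them, and it assumes PostgreSQL is stopped on the node being restored. The address, data directory, and lock file path are the ones from this message. The name print_resync_plan is made up for illustration, and the final failcount step is left as a comment because the exact command depends on the cluster shell in use.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the resync steps in order.
# $1 = address of the current primary
# $2 = PostgreSQL data directory (same path on both nodes, an assumption)
# $3 = lock file left behind by the RA
print_resync_plan() {
    primary=$1
    datadir=$2
    lockfile=$3
    cat <<EOF
psql -h $primary -U postgres -c "SELECT pg_start_backup('label')"
rsync -avr --delete --exclude=postmaster.pid $primary:$datadir/ $datadir/
psql -h $primary -U postgres -c "SELECT pg_stop_backup()"
rm -f $lockfile
# then clear the failcount for the resource on this node
EOF
}

print_resync_plan 192.168.2.114 /var/lib/pgsql/9.1/data /var/lib/pgsql/PGSQL.lock
```

Printing the plan first makes it easy to review the paths before anything is copied with --delete.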


BTW I fixed some bugs 2 days ago.
Please use the newest version.

Thanks,
Takatoshi MATSUO




Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-27 Thread Attila Megyeri
Hi Takatoshi,

I don't think it is an inconsistency problem; to me it looks like an RA bug.
I think so because postgres starts properly outside pacemaker.

When pacemaker starts node psql1 I see only:

postgresql:0_start_0 (node=psql1, call=9, rc=1, status=complete): unknown error

and the postgres log is empty, so I suppose that it does not even try to start
it.

What I tested was:
- I had a stable cluster, where psql1 was the master and psql2 was the slave
- I put psql1 into standby mode (node psql1 standby) to test failover
- After a while psql2 became the PRI, which is very good
- When I put psql1 back online, postgres wouldn't start anymore from pacemaker
(unknown error).


When I started postgres manually from the shell it worked fine, and even the
monitor was able to see that it was in SYNC (obviously the master/slave
group was showing an improper state, as psql was started outside pacemaker).

I don't think data inconsistency is the cause, partly because there are no
clients connected, and partly because psql starts properly outside pacemaker.

Here is what is relevant from the log:

Nov 27 16:02:50 psql1 pgsql[11021]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:02:51 psql1 pgsql[11021]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: node=psql2, state=STREAMING, sync_state=SYNC
Nov 27 16:03:00 psql1 pgsql[11556]: DEBUG: notify: pre for demote
Nov 27 16:03:00 psql1 pgsql[11590]: INFO: Stopping PostgreSQL on demote.
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: waiting for server to shut down. done server stopped
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Removing /var/lib/pgsql/PGSQL.lock.
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: PostgreSQL is down
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Changing pgsql-status on psql1 : PRI-STOP.
Nov 27 16:03:02 psql1 pgsql[11590]: DEBUG: Created recovery.conf. host=10.12.1.28, user=postgres
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Setup all nodes as an async.
Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: notify: post for demote
Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: post-demote called. Demote uname is psql1
Nov 27 16:03:02 psql1 pgsql[11732]: INFO: My Timeline ID and Checkpoint : 14:2320
Nov 27 16:03:02 psql1 pgsql[11732]: WARNING: Can't get psql2 master baseline. Waiting...
Nov 27 16:03:03 psql1 pgsql[11732]: INFO: psql2 master baseline : 14:2300
Nov 27 16:03:03 psql1 pgsql[11732]: ERROR: My data is inconsistent.
Nov 27 16:03:03 psql1 pgsql[11867]: DEBUG: notify: pre for stop
Nov 27 16:03:03 psql1 pgsql[11969]: INFO: PostgreSQL is already stopped.
Nov 27 16:03:12 psql1 pgsql[12053]: INFO: Don't check /var/lib/postgresql/9.1/main during probe
Nov 27 16:03:12 psql1 pgsql[12053]: INFO: PostgreSQL is down
Nov 27 16:03:27 psql1 pgsql[12204]: INFO: Changing pgsql-status on psql1 : -STOP.
Nov 27 16:03:27 psql1 pgsql[12204]: DEBUG: Created recovery.conf. host=10.12.1.28, user=postgres
Nov 27 16:03:27 psql1 pgsql[12204]: INFO: Setup all nodes as an async.
Nov 27 16:03:27 psql1 pgsql[12204]: INFO: My Timeline ID and Checkpoint : 14:2320
Nov 27 16:03:27 psql1 pgsql[12204]: INFO: psql2 master baseline : 14:2300
Nov 27 16:03:27 psql1 pgsql[12204]: ERROR: My data is inconsistent.
Nov 27 16:03:27 psql1 pgsql[12339]: DEBUG: notify: post for start
Nov 27 16:03:27 psql1 pgsql[12373]: DEBUG: notify: pre for stop
Nov 27 16:03:27 psql1 pgsql[12407]: INFO: PostgreSQL is already stopped.
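
The log records the RA's own stated reason for not starting PostgreSQL, and a small filter makes it easier to spot. The sketch below prints the two baseline values next to each "My data is inconsistent" error. It assumes the message wording shown in this log and one log entry per line; show_inconsistency is a hypothetical name, not something provided by the RA.

```shell
#!/bin/sh
# Hypothetical filter: reads RA log lines on stdin and, for every
# "My data is inconsistent" error, prints the local and peer baselines
# that were logged just before it.
show_inconsistency() {
    awk '
        /My Timeline ID and Checkpoint :/ { mine = $NF }
        /master baseline :/               { peer = $NF }
        /ERROR: My data is inconsistent/  { print "local=" mine " peer=" peer }
    '
}

show_inconsistency <<'EOF'
Nov 27 16:03:02 psql1 pgsql[11732]: INFO: My Timeline ID and Checkpoint : 14:2320
Nov 27 16:03:02 psql1 pgsql[11732]: WARNING: Can't get psql2 master baseline. Waiting...
Nov 27 16:03:03 psql1 pgsql[11732]: INFO: psql2 master baseline : 14:2300
Nov 27 16:03:03 psql1 pgsql[11732]: ERROR: My data is inconsistent.
EOF
```

On a real system the input would come from the cluster log instead of the inline sample.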


Thanks,

Attila



Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-27 Thread Takatoshi MATSUO
-
 From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
 Sent: 2011. november 27. 11:07
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

 Hi Attila

 2011/11/27 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 You were right, changing the shell to bash resolved the problem.
 The cluster now started in sync mode - thank you very much.

 You're very welcome.

 I will be testing it over the next couple of days. I did just a very quick
 test: the psql master seems to have failed over to psql2 properly, but when I
 tried to move it back to psql1 there were some problems starting psql on node 1.

 If the master (psql1) has failed, its data may be inconsistent.
 A PostgreSQL developer says that this is a feature.
 Therefore my RA prevents it from starting automatically if the data is
 inconsistent.
 Please back up psql2's data, restore it to psql1, and remove the
 /var/lib/pgsql/PGSQL.lock file before clearing the failcount.

 I use rsync to back up and restore in the following way.
 -
 # psql -h 192.168.2.114 -U postgres -c "SELECT pg_start_backup('label')"
 # rsync -avr --delete --exclude=postmaster.pid \
     192.168.2.114:/var/lib/pgsql/9.1/data/ /var/lib/pgsql/9.1/data/
 # psql -h 192.168.2.114 -U postgres -c "SELECT pg_stop_backup()"
 -


 BTW I fixed some bugs 2 days ago.
 Please use the newest version.

 Thanks,
 Takatoshi MATSUO




Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-26 Thread Attila Megyeri
Hi Yoshiharu, Takatoshi,

Spent another day, without success. :(

I started from scratch, and synchronous replication works nicely when the nodes
are started outside pacemaker.
My PostgreSQL version is 9.1.1.

When I start from pacemaker, after a while it gets into the following state:

Online: [ psql1 psql2 ]

 Master/Slave Set: msPostgresql [postgresql]
     Slaves: [ psql1 psql2 ]
 Clone Set: clnPingCheck [pingCheck]
     Started: [ psql1 psql2 ]

Node Attributes:
* Node psql1:
    + default_ping_set                  : 100
    + master-postgresql:0               : -INFINITY
    + pgsql-status                      : HS:alone
    + pgsql-xlog-loc                    : 1900
* Node psql2:
    + default_ping_set                  : 100
    + master-postgresql:1               : -INFINITY
    + pgsql-status                      : HS:alone
    + pgsql-xlog-loc                    : 1900


The psql status queries return the following:

PSQL1
==
postgres@psql1:/root$ psql -c "select application_name,upper(state),upper(sync_state) from pg_stat_replication"
 application_name | upper | upper
------------------+-------+-------
(0 rows)

postgres@psql1:/root$ psql -Atc "select pg_last_xlog_replay_location(),pg_last_xlog_receive_location()"
0/1920|0/1900

PSQL2
==
postgres@psql2:~$ psql -c "select application_name,upper(state),upper(sync_state) from pg_stat_replication"
 application_name | upper | upper
------------------+-------+-------
(0 rows)

postgres@psql2:~$ psql -Atc "select pg_last_xlog_replay_location(),pg_last_xlog_receive_location()"
0/1900|0/1900


Neither server can connect to the master (obviously), as the vip_repl is not
brought up.


Could you help me understand WHAT action/state/event should promote
one of the nodes? I see that pacemaker monitors the servers every X seconds,
but nothing else happens.

In the log (limited to pgsql) the following sequence is repeated forever:

Nov 26 13:36:19 psql1 pgsql[19829]: INFO: Master is not exist.
Nov 26 13:36:19 psql1 pgsql[19829]: DEBUG: Checking right of master.
Nov 26 13:36:19 psql1 pgsql[19829]: INFO: My data status=.
Nov 26 13:36:19 psql1 pgsql[19829]: INFO: psql1 xlog location : 1900
Nov 26 13:36:19 psql1 pgsql[19829]: INFO: psql2 xlog location : 1900
Nov 26 13:36:26 psql1 pgsql[19993]: DEBUG: PostgreSQL is running as a hot standby.
Nov 26 13:36:26 psql1 pgsql[19993]: INFO: Master is not exist.
Nov 26 13:36:26 psql1 pgsql[19993]: DEBUG: Checking right of master.
Nov 26 13:36:26 psql1 pgsql[19993]: INFO: My data status=.
Nov 26 13:36:26 psql1 pgsql[19993]: INFO: psql1 xlog location : 1900
Nov 26 13:36:26 psql1 pgsql[19993]: INFO: psql2 xlog location : 1900
Nov 26 13:36:33 psql1 pgsql[20176]: DEBUG: PostgreSQL is running as a hot standby.
Nov 26 13:36:33 psql1 pgsql[20176]: INFO: Master is not exist.
Nov 26 13:36:33 psql1 pgsql[20176]: DEBUG: Checking right of master.
Nov 26 13:36:33 psql1 pgsql[20176]: INFO: My data status=.
Nov 26 13:36:33 psql1 pgsql[20176]: INFO: psql1 xlog location : 1900
Nov 26 13:36:33 psql1 pgsql[20176]: INFO: psql2 xlog location : 1900
Nov 26 13:36:41 psql1 pgsql[20343]: DEBUG: PostgreSQL is running as a hot standby.


Any help is appreciated!

Regards,
Attila




-Original Message-
From: Yoshiharu Mori [mailto:y-m...@sraoss.co.jp] 
Sent: 2011. november 25. 14:17
To: The Pacemaker cluster resource manager
Cc: Attila Megyeri
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

 A quick snippet from the corosync.log
 
 Nov 23 05:43:05 psql1 pgsql[2845]: DEBUG: Checking right of master.
 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: My data status=.
 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql1 xlog location : 0D00
 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql2 xlog location : 0800
 
 As you can see, "My data status" is an empty string.

My log is the same, but it works.

Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Master is not exist.
Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Checking right of master.
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: My data status=.
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm01 xlog location : 0520
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm02 xlog location : 0500

In my log, the following line is logged and the master starts after the xlog
location has been checked 3 times.

Nov 18 19:29:39 osspc24-1 pgsql[18720]: INFO: I have a master right.

Please show us more of corosync.log.


 
 
 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: 2011. november 25. 9:28
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover - 
 RA needed
 
 Hi Takatoshi,
 
 I have restored the PSQL to run without corosync so I cannot send you the 
 crm_mon output now.
 
 What I can tell for sure:
 - RA never

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-26 Thread Takatoshi MATSUO
Hi Attila

1. Are there files named /var/lib/pgsql/xlog_note.0, xlog_note.1, and xlog_note.2?
   These files are created while the xlog location is checked during monitor.

2. Do these files contain lines like the ones below?
-
pgsql1  1900
pgsql2  1900
-
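
Both questions can be answered in one pass by checking that every line of each note file has exactly two fields. This is a sketch under that assumption (a node name followed by a location); check_xlog_note is a hypothetical helper, not part of the RA.

```shell
#!/bin/sh
# Hypothetical check: succeed only if every line of the file has
# exactly two whitespace-separated fields (node name and location).
check_xlog_note() {
    awk 'NF != 2 { bad = 1; print FILENAME ": unexpected line: " $0 }
         END { exit bad }' "$1"
}

for f in /var/lib/pgsql/xlog_note.0 /var/lib/pgsql/xlog_note.1 /var/lib/pgsql/xlog_note.2; do
    if [ -f "$f" ]; then
        check_xlog_note "$f" || echo "$f looks malformed"
    else
        echo "$f is missing"
    fi
done
```

A line with a stray leading token would have three fields and be reported as unexpected.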

Regards.
Takatoshi MATSUO



Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-26 Thread Attila Megyeri
Hi Takatoshi,

Thank you for coming back to me so quickly.

In the /var/lib/pgsql there are the following files:

PSQL1:
=
root@psql1:/var/lib/pgsql# ls -la
total 16
drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:04 .
drwxr-xr-x 35 root     root     4096 Nov 25 22:21 ..
-rw-r--r--  1 postgres postgres    1 Nov 26 00:17 rep_mode.conf
-rw-r--r--  1 root     root       49 Nov 26 18:04 xlog_note.0

root@psql1:/var/lib/pgsql# cat xlog_note.0
-e psql1 1900
psql2 1900
root@psql1:/var/lib/pgsql#

PSQL2:
===
root@psql2:/var/lib/pgsql# ls -la
total 16
drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:05 .
drwxr-xr-x 33 root     root     4096 Nov 26 00:10 ..
-rw-r--r--  1 postgres postgres    1 Nov 26 00:24 rep_mode.conf
-rw-r--r--  1 root     root       49 Nov 26 18:05 xlog_note.0
root@psql2:/var/lib/pgsql# cat xlog_note.0
-e psql1 1900
psql2 1900
root@psql2:/var/lib/pgsql#

BTW, postgres is installed under /var/lib/postgresql, but I noticed that some
parts of the RA refer to the /var/lib/pgsql directory, so I created
that directory and I keep some of the files there.


Thanks,
Attila




Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-26 Thread Takatoshi MATSUO
Hi Attila

2011/11/27 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 Thank you for coming back to me so quickly.

 In the /var/lib/pgsql there are the following files:

 PSQL1:
 =
 root@psql1:/var/lib/pgsql# ls -la
 total 16
 drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:04 .
 drwxr-xr-x 35 root     root     4096 Nov 25 22:21 ..
 -rw-r--r--  1 postgres postgres    1 Nov 26 00:17 rep_mode.conf
 -rw-r--r--  1 root     root       49 Nov 26 18:04 xlog_note.0

 root@psql1:/var/lib/pgsql# cat xlog_note.0
 -e psql1 1900
 psql2 1900
 root@psql1:/var/lib/pgsql#

 PSQL2:
 ===
 root@psql2:/var/lib/pgsql# ls -la
 total 16
 drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:05 .
 drwxr-xr-x 33 root     root     4096 Nov 26 00:10 ..
 -rw-r--r--  1 postgres postgres    1 Nov 26 00:24 rep_mode.conf
 -rw-r--r--  1 root     root       49 Nov 26 18:05 xlog_note.0
 root@psql2:/var/lib/pgsql# cat xlog_note.0
 -e psql1 1900
 psql2 1900
 root@psql2:/var/lib/pgsql#

It seems that dash's builtin echo command is being used,
because echo with the -e option does not work there
(the "-e" ends up in the file as literal text).

Perhaps my RA depends on bash.
Can you use bash instead of dash?
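
The stray "-e" at the start of xlog_note.0 is what a POSIX-style echo such as
dash's builtin produces: it has no -e option and prints the word literally.
A minimal sketch of a portable alternative follows, assuming the note file
holds one "node location" pair per line. The function name write_xlog_note is
made up for illustration and is not taken from the RA.

```shell
#!/bin/sh
# Hypothetical helper, not part of the pgsql RA: print one "node location"
# line per pair of arguments. printf behaves the same under dash and bash,
# whereas "echo -e" prints a literal "-e" under dash.
write_xlog_note() {
    while [ "$#" -ge 2 ]; do
        printf '%s %s\n' "$1" "$2"
        shift 2
    done
}

# Example with the values from the listing above.
write_xlog_note psql1 1900 psql2 1900
```

Running the RA under bash, as suggested above, avoids the problem without
touching the script.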

 BTW, postgres is installed under /var/lib/postgresql, but I noticed that
 some parts of the RA refer to the /var/lib/pgsql directory, so I
 created that directory and I keep some of the files there.

That's no problem.
If you want to change this path, please specify it using the tmpdir parameter.

Regards,
Takatoshi MATSUO


 Thanks,
 Attila



 -Original Message-
 From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
 Sent: 2011. november 26. 18:27
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

 Hi Attila

 1. Are there /var/lib/pgsql/xlog_note.0, xlog_note.1, and xlog_note.2 files?
   These files are created when the xlog location is checked on monitor.

 2. Do these files include lines as below?
 -
 pgsql1  1900
 pgsql2  1900
 -

 Regards.
 Takatoshi MATSUO


 2011/11/26 22:44 Attila Megyeri amegy...@minerva-soft.com:
 Hi Yoshiharu, Takatoshi,

 Spent another day, without success. :(

 I started from scratch and synchronous replication works nicely when the nodes
 are started outside pacemaker.
 My PostgreSQL version is 9.1.1.

 When I start from pacemaker, after a while it gets into the following state:

 Online: [ psql1 psql2 ]

  Master/Slave Set: msPostgresql [postgresql]
     Slaves: [ psql1 psql2 ]
  Clone Set: clnPingCheck [pingCheck]
     Started: [ psql1 psql2 ]

 Node Attributes:
 * Node psql1:
    + default_ping_set                  : 100
    + master-postgresql:0               : -INFINITY
    + pgsql-status                      : HS:alone
    + pgsql-xlog-loc                    : 1900
 * Node psql2:
    + default_ping_set                  : 100
    + master-postgresql:1               : -INFINITY
    + pgsql-status                      : HS:alone
    + pgsql-xlog-loc                    : 1900



Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-25 Thread Takatoshi MATSUO
Hi Attila

2011/11/24 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi, All,

 Thanks for your reply.
 I see that you have invested significant effort in the development of the RA. 
 I spent the last day trying to set up the RA, but without much success.

 My infrastructure is very similar to yours, except for the fact that 
 currently I am testing with a single network adapter.

 Replication works nicely when I start the databases manually, not using 
 corosync.

 When I try to start using corosync, I see that the ping resources start
 normally, but the msPostgresql starts on both nodes in slave mode, and I see
 HS:alone

Seeing HS:alone is normal.
The RA then compares the xlog locations and promotes the PostgreSQL instance
that has the newest data.
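
How such a comparison might look, as a rough sketch only: the RA's real logic
is not shown in this thread. The sketch assumes locations in the
"logid/offset" hexadecimal form printed by pg_last_xlog_replay_location().
The function names and the tie rule (the first node wins) are invented for
illustration.

```shell
#!/bin/sh
# Illustrative only; not the RA's code.
# Convert an xlog location such as "0/1920" (two hex numbers) into a single
# integer so that two locations can be compared numerically.
xlog_to_int() {
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Print the name of the node whose location is further ahead.
# Arguments: node1 loc1 node2 loc2. On a tie, node1 is printed (arbitrary).
newest_node() {
    if [ "$(xlog_to_int "$2")" -ge "$(xlog_to_int "$4")" ]; then
        echo "$1"
    else
        echo "$3"
    fi
}

newest_node psql1 0/1920 psql2 0/1900
```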

 In the Wiki you state that if I start on a single node only, PSQL should
 start in Master mode (PRI), but this is not the case.

If the data is old, the node can't become master.
To become master, a node needs pgsql-data-status=LATEST or STREAMING|SYNC.
Please check it using crm_mon -A.
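
One way to script that check, sketched under assumptions: the helper name is
made up, and it only parses text in the layout that crm_mon -A prints in this
thread, so it may need adjusting for other versions.

```shell
#!/bin/sh
# Hypothetical helper: read "crm_mon -A" style output on stdin and print
# "<node> <pgsql-data-status>" for every node that has the attribute.
# Nodes without the attribute produce no output line.
show_data_status() {
    awk '
        /\* Node /          { node = $3; sub(/:$/, "", node) }
        /pgsql-data-status/ { print node, $NF }
    '
}

# Typical use on a cluster node (-1 makes crm_mon print once and exit):
#   crm_mon -A -1 | show_data_status
```

If nothing is printed, no node has the attribute set.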

Also, going from stopped to master takes a few minutes, because the RA
compares the xlog locations during monitor operations.


 The recovery.conf file is created immediately, and from the logs I see no 
 attempt at all to promote the node.
 In the postgres logs I see that node1, which is supposed to be a master, 
 tries to connect to the vip-rep IP address, which is NOT brought up, because 
 it depends on the Master role...

 Do you have any idea?

Please check the HA log.
My RA writes "My data is out-of-date. status=" to the log if the
data is old.

Regards,
Takatoshi MATSUO

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-25 Thread Attila Megyeri
Hi Takatoshi,

I have restored the PSQL to run without corosync so I cannot send you the 
crm_mon output now.

What I can tell for sure:
- The RA never promoted any of the nodes, no matter what the status was. It also
did not promote the node when it was the only one.
- I believe the issue is in the comparison of the xlogs. How could I
troubleshoot that? I see from the logs that crm NEVER tried to invoke pgsql
with promote.
- I previously tried the crm_mon -A option, but there was never a
pgsql-data-status attribute. The other attributes were there, including
HS:alone.
- In the corosync log the only relevant RA message I see is "Master is not
exist." I never saw a message like "My data is out-of-date".

Thank you!

Attila




Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-25 Thread Attila Megyeri
A quick snippet from the corosync.log

Nov 23 05:43:05 psql1 pgsql[2845]: DEBUG: Checking right of master.
Nov 23 05:43:05 psql1 pgsql[2845]: INFO: My data status=.
Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql1 xlog location : 0D00
Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql2 xlog location : 0800

As you see, the "My data status" value is an empty string.




Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-25 Thread Yoshiharu Mori
Hi Attila

 A quick snippet from the corosync.log
 
 Nov 23 05:43:05 psql1 pgsql[2845]: DEBUG: Checking right of master.
 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: My data status=.
 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql1 xlog location : 0D00
 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql2 xlog location : 0800
 
 As you see, the "My data status" value is an empty string.

My log is the same, but it works.

Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Master is not exist.
Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Checking right of master.
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: My data status=.
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm01 xlog location : 0520
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm02 xlog location : 0500

In my log, the following line is output and the master is started after the
xlog location has been checked (3 times).

Nov 18 19:29:39 osspc24-1 pgsql[18720]: INFO: I have a master right.
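
A rule of the kind "win several checks in a row before claiming the master
right" can be sketched as below. Whether the RA works exactly this way is not
stated in this thread; the function name, the REQUIRED variable, and the way
results are passed in are all assumptions made for illustration.

```shell
#!/bin/sh
# Illustrative only. Decide whether this node may claim the master right:
# it must have been the newest node in each of the last REQUIRED checks.
# Arguments: own node name, then the winner of each recorded check,
# oldest first.
REQUIRED=3

has_master_right() {
    self=$1
    shift
    # Not enough checks recorded yet.
    [ "$#" -ge "$REQUIRED" ] || return 1
    # Drop older results so that only the last REQUIRED remain.
    while [ "$#" -gt "$REQUIRED" ]; do
        shift
    done
    for winner in "$@"; do
        [ "$winner" = "$self" ] || return 1
    done
    return 0
}

if has_master_right pm01 pm01 pm01 pm01; then
    echo "I have a master right."
fi
```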

Please show us more of your corosync.log.


 
 


-- 
Yoshiharu Mori y-m...@sraoss.co.jp
SRA OSS, Inc Japan http://www.sraoss.co.jp
TEL: 03-5979-2701 
FAX: 03-5979-2702

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-25 Thread Attila Megyeri
Hi Yoshiharu,


-Original Message-
From: Yoshiharu Mori [mailto:y-m...@sraoss.co.jp] 
Sent: 2011. november 25. 14:17
To: The Pacemaker cluster resource manager
Cc: Attila Megyeri
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

 A quick snippet from the corosync.log
 
 Nov 23 05:43:05 psql1 pgsql[2845]: DEBUG: Checking right of master.
 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: My data status=.
 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql1 xlog location : 0D00
 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql2 xlog location : 0800
 
 As you see, the "My data status" value is an empty string.

My log is the same, but it works.

Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Master is not exist.
Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Checking right of master.
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: My data status=.
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm01 xlog location : 0520
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm02 xlog location : 0500

In my log, the following line is output and the master is started after the
xlog location has been checked (3 times).

Nov 18 19:29:39 osspc24-1 pgsql[18720]: INFO: I have a master right.

Please show us more corosync.log.


===
I can leave it running forever, but it never shows "I have a master right."
To be honest, I have no idea what should promote the node to master.
What is it that the RA checks, and what could be wrong? I just cannot find
where the problem is.

Right now I am running corosync on node 1 only, as I expect that this way it
will have the most recent xlog and start as a master.
But it never gets promoted.

Here is the output for crm_mon -A :


Last updated: Fri Nov 25 13:52:58 2011
Stack: openais
Current DC: psql1 - partition WITHOUT quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
4 Resources configured.


Online: [ psql1 ]
OFFLINE: [ psql2 ]

 Master/Slave Set: msPostgresql [postgresql]
     Slaves: [ psql1 ]
     Stopped: [ postgresql:1 ]
 Clone Set: clnPingCheck [pingCheck]
     Started: [ psql1 ]
     Stopped: [ pingCheck:1 ]

Node Attributes:
* Node psql1:
    + default_ping_set                  : 100
    + master-postgresql:0               : -INFINITY
    + pgsql-status                      : HS:alone
    + pgsql-xlog-loc                    : 1200



I sent the log to you privately so as not to overload the list. I did a
"resource stop msPostgresql" and a "resource start msPostgresql" around 13:52.
You will see some extra debug messages starting with ATT; I added them to
the RA to help my troubleshooting.

Thank you for your help,
Attila





 
 

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-23 Thread Attila Megyeri
=INFINITY \
migration-threshold=1



Regards,
Attila



-Original Message-
From: Takatoshi MATSUO [mailto:matsuo@gmail.com] 
Sent: 2011. november 17. 8:04
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi  All

I created an RA for PostgreSQL 9.1 Streaming Replication, based on pgsql.

RA
  https://github.com/t-matsuo/resource-agents/blob/pgsql91/heartbeat/pgsql
Documents
  https://github.com/t-matsuo/resource-agents/wiki

It is almost completely changed from the previous patch:
http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018193.html

It creates recovery.conf and promotes PostgreSQL automatically.
Additionally, it can switch between synchronous and asynchronous replication
automatically.
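
For readers who have not seen one, a standby's recovery.conf on PostgreSQL 9.1
looks roughly like the output of the sketch below. The parameter names are
standard PostgreSQL 9.1 settings, but the function name, the address
192.0.2.10, the user name repl, and the exact set of parameters the RA writes
are assumptions.

```shell
#!/bin/sh
# Hypothetical generator; the real RA may write different settings.
# Arguments: master (replication VIP) address, replication user,
#            application_name to report to the master.
print_recovery_conf() {
    cat <<EOF
standby_mode = 'on'
primary_conninfo = 'host=$1 port=5432 user=$2 application_name=$3'
recovery_target_timeline = 'latest'
EOF
}

# Example values only; 192.0.2.10 is a documentation address.
print_recovery_conf 192.0.2.10 repl psql2
```

On 9.1 a standby started with such a file can later be promoted with
"pg_ctl promote" or with a trigger file.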

Please try them and send your comments, if you like.

Regards,
Takatoshi MATSUO

2011/11/17 Serge Dubrouski serge...@gmail.com:


 On Wed, Nov 16, 2011 at 12:55 PM, Attila Megyeri 
 amegy...@minerva-soft.com
 wrote:

 Hi Florian,

 -Original Message-
 From: Florian Haas [mailto:flor...@hastexo.com]
 Sent: 2011. november 16. 11:49
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover - 
 RA needed

 Hi Attila,

 On 2011-11-16 10:27, Attila Megyeri wrote:
  Hi All,
 
 
 
  We have a two-node postgresql 9.1 system configured using streaming
  replication (active/active with a read-only slave).
 
  We want to automate the failover process and I couldn't really find 
  a resource agent that could do the job.

 That is correct; the pgsql resource agent (unlike its mysql 
 counterpart) does not support streaming replication. We've had a 
 contributor submit a patch at one point, but it was somewhat 
 ill-conceived and thus did not make it into the upstream repo. The relevant 
 thread is here:

 http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018195.html

 Would you feel comfortable modifying the pgsql resource agent to 
 support replication? If so, we could revisit this issue and 
 potentially add streaming replication support to pgsql.


 Well I'm not sure I would be able to do that change. Failover is 
 relatively easy to do but I really have no idea how to do the failback part.

 And that's exactly the reason why I haven't implemented it yet. With
 the way replication is currently done in PostgreSQL there is no easy
 way to switch between roles, or at least I don't know of one.
 Implementing just the fail-over functionality, by creating a trigger file
 on a slave server when the master fails, doesn't make
 a full master-slave implementation in my opinion.


 I will definitely have to sort this out somehow; I am just unsure
 whether I will try to use the repmgr mentioned in the video, or
 pacemaker with some level of customization...

 Is the resource agent that you mentioned available somewhere?

 Thanks.
 Attila






 --
 Serge Dubrouski.






[Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-16 Thread Attila Megyeri
Hi All,

We have a two-node postgresql 9.1 system configured using streaming replication
(active/active with a read-only slave).
We want to automate the failover process and I couldn't really find a resource
agent that could do the job.

All HA solutions for postgresql I have seen are based on a DRBD active/passive
approach, which we would prefer to avoid.

As a first stage I would be satisfied with failover only, meaning
that the more complex failback would not be required.
Of course, if failback could be implemented as well, that would be the right
solution for us.

Does anyone have experience with the above setup? Any feedback is appreciated!

Regards,
Attila


Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-16 Thread Florian Haas
Hi Attila,

On 2011-11-16 10:27, Attila Megyeri wrote:
 Hi All,
 
  
 
 We have a two-node postgresql 9.1 system configured using streaming
 replication (active/active with a read-only slave).
 
 We want to automate the failover process and I couldn’t really find a
 resource agent that could do the job.

That is correct; the pgsql resource agent (unlike its mysql counterpart)
does not support streaming replication. We've had a contributor submit a
patch at one point, but it was somewhat ill-conceived and thus did not
make it into the upstream repo. The relevant thread is here:

http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018195.html

Would you feel comfortable modifying the pgsql resource agent to support
replication? If so, we could revisit this issue and potentially add
streaming replication support to pgsql.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-16 Thread Attila Megyeri
Hi Florian,

-Original Message-
From: Florian Haas [mailto:flor...@hastexo.com] 
Sent: 2011. november 16. 11:49
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila,

On 2011-11-16 10:27, Attila Megyeri wrote:
 Hi All,
 
  
 
 We have a two-node postgresql 9.1 system configured using streaming
 replication (active/active with a read-only slave).
 
 We want to automate the failover process and I couldn't really find a 
 resource agent that could do the job.

That is correct; the pgsql resource agent (unlike its mysql counterpart) does 
not support streaming replication. We've had a contributor submit a patch at 
one point, but it was somewhat ill-conceived and thus did not make it into the 
upstream repo. The relevant thread is here:

http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018195.html

Would you feel comfortable modifying the pgsql resource agent to support 
replication? If so, we could revisit this issue and potentially add streaming 
replication support to pgsql.


Well, I'm not sure I would be able to make that change. Failover is relatively 
easy to do, but I really have no idea how to do the failback part. I will 
definitely have to sort this out somehow; I am just unsure whether to use the 
repmgr mentioned in the video or Pacemaker with some level of 
customization...

Is the resource agent that you mentioned available somewhere?

Thanks.
Attila





Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-16 Thread Serge Dubrouski
On Wed, Nov 16, 2011 at 12:55 PM, Attila Megyeri
amegy...@minerva-soft.com wrote:

 Hi Florian,

 -Original Message-
 From: Florian Haas [mailto:flor...@hastexo.com]
 Sent: 2011. november 16. 11:49
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA
 needed

 Hi Attila,

 On 2011-11-16 10:27, Attila Megyeri wrote:
  Hi All,
 
 
 
  We have a two-node postgresql 9.1 system configured using streaming
  replication (active/active with a read-only slave).
 
  We want to automate the failover process and I couldn't really find a
  resource agent that could do the job.

 That is correct; the pgsql resource agent (unlike its mysql counterpart)
 does not support streaming replication. We've had a contributor submit a
 patch at one point, but it was somewhat ill-conceived and thus did not make
 it into the upstream repo. The relevant thread is here:

 http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018195.html

 Would you feel comfortable modifying the pgsql resource agent to support
 replication? If so, we could revisit this issue and potentially add
 streaming replication support to pgsql.


 Well I'm not sure I would be able to do that change. Failover is
 relatively easy to do but I really have no idea how to do the failback part.


And that's exactly the reason why I haven't implemented it yet. With the
way replication currently works in PostgreSQL, there is no easy way to
switch between roles, or at least I don't know of one. Implementing only
failover, by creating a trigger file on the slave server when the master
fails, doesn't amount to a full master-slave implementation, in my
opinion.
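
For readers unfamiliar with the mechanism, the trigger-file failover
described above can be sketched roughly as follows. This is a minimal
sketch, assuming PostgreSQL 9.1's recovery.conf settings (standby_mode,
primary_conninfo, trigger_file); the function names, the port, and any
host names or paths passed in are hypothetical and are not taken from any
existing resource agent.

```shell
#!/bin/sh
# Sketch only: helper names and all paths/hosts are hypothetical.

# Write a PostgreSQL 9.1-style recovery.conf so the server starts as a
# streaming-replication standby of $2 and watches for trigger file $3.
write_recovery_conf() {
    pgdata=$1
    master_host=$2
    trigger=$3
    {
        printf "standby_mode = 'on'\n"
        printf "primary_conninfo = 'host=%s port=5432'\n" "$master_host"
        printf "trigger_file = '%s'\n" "$trigger"
    } > "$pgdata/recovery.conf"
}

# Failover: once the trigger file exists, the standby ends recovery and
# starts accepting writes. Nothing here turns the old master back into a
# standby; that failback step is the part this sketch does not cover.
promote_standby() {
    touch "$1"
}
```

Note that the sketch covers only the one-way promotion; it does nothing
about re-synchronizing the failed master afterwards.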


 I will definitely have to sort this out somehow, I am just unsure
 whether I will try to use the repmgr mentioned in the video, or pacemaker
 with some level of customization...

 Is the resource agent that you mentioned available somewhere?

 Thanks.
 Attila







-- 
Serge Dubrouski.


Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-16 Thread Takatoshi MATSUO
Hi All

I created an RA for PostgreSQL 9.1 streaming replication, based on pgsql.

RA
  https://github.com/t-matsuo/resource-agents/blob/pgsql91/heartbeat/pgsql
Documents
  https://github.com/t-matsuo/resource-agents/wiki

It is almost completely changed from the previous patch:
http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018193.html

It creates recovery.conf and promotes PostgreSQL automatically.
Additionally, it can switch between synchronous and asynchronous
replication automatically.
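
As an illustration of what such switching involves in PostgreSQL 9.1: the
synchronous_standby_names setting controls it. A non-empty value makes
commits wait for the named standby, an empty value makes replication
asynchronous, and the change takes effect on a configuration reload. The
sketch below assumes postgresql.conf includes a small separate file that
a script may overwrite; the function name, file layout, and standby name
are hypothetical, and this is not necessarily how the RA implements it.

```shell
#!/bin/sh
# Sketch only: set_replication_mode and the file layout are hypothetical.

# Usage: set_replication_mode <conf_file> sync <standby_name>
#        set_replication_mode <conf_file> async
# Returns non-zero on an unknown mode or a missing standby name.
set_replication_mode() {
    conf=$1
    mode=$2
    standby=$3
    case "$mode" in
        sync)
            [ -n "$standby" ] || return 1
            printf "synchronous_standby_names = '%s'\n" "$standby" > "$conf"
            ;;
        async)
            printf "synchronous_standby_names = ''\n" > "$conf"
            ;;
        *)
            return 1
            ;;
    esac
    # The new value takes effect after a reload, e.g. pg_ctl -D "$PGDATA" reload
}
```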

Please try them and send comments.

Regards,
Takatoshi MATSUO
