Re: [Pacemaker] failed actions are not removed

2014-04-04 Thread Attila Megyeri
Hi Lars,


 -Original Message-
 From: Lars Marowsky-Bree [mailto:l...@suse.com]
 Sent: Tuesday, April 01, 2014 2:59 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] failed actions are not removed
 
 On 2014-04-01T14:41:11, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
  Hi Andrew, all,
 
  We use Pacemaker 1.1.10 with corosync 2.2.3, and we notice that failed
 actions are not reset after the cluster recheck interval has elapsed.
  Is this a known issue, or shall I provide some more details?
 
 What have you set failure-timeout set to?
 
 Are they just still being shown, or are they having an impact on your resource
 placement still too?
 
 If you can provide a CIB for this scenario it's easier to answer.
 

The failure-timeout is set to 2m and cluster-recheck-interval=2m - I included 
this in the previous email as well.
I cannot actually recall whether this has an impact on placement or is only an 
inconvenience in crm_mon. I will reproduce a failure and check this out.
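
When I do, I plan to watch the failcounts directly rather than just crm_mon - a 
minimal check, assuming the stock Pacemaker CLI tools:

    crm_mon -1 -f                             # one-shot status incl. fail counts
    crm_failcount -G -r jboss_imssrv2         # query one resource's fail count
    crm_resource --cleanup -r jboss_imssrv2   # manual cleanup, for comparison

If the queried failcount drops to 0 after failure-timeout plus one recheck 
interval but crm_mon still lists the failed action, the problem is only 
cosmetic.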

What exactly do you need from the CIB? I would not like to post the entire CIB, 
as it contains confidential data and I would have to scrub it manually, but if 
you tell me which parts are relevant I will do that.
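
For instance, I could extract just selected sections - assuming cibadmin's 
section/scope option, something like:

    cibadmin -Q -o crm_config     # cluster properties
    cibadmin -Q -o constraints    # location/colocation/ordering constraints
    cibadmin -Q -o resources      # resource definitions (would still need scrubbing)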

Thank you for your quick response!



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] failed actions are not removed

2014-04-01 Thread Attila Megyeri
Hi Andrew, all,

We use Pacemaker 1.1.10 with corosync 2.2.3, and we notice that failed actions 
are not reset after the cluster recheck interval has elapsed.
Is this a known issue, or shall I provide some more details?

It worked properly in previous setups; we have no idea what the issue could be 
here.


Some background:

In properties:

cluster-recheck-interval=2m \

In the relevant resources:

primitive jboss_imssrv2 ocf:heartbeat:jboss \
    params shutdown_timeout=10 user=jboss \
    op start interval=0 timeout=60s on-fail=restart \
    op monitor interval=10s timeout=90s on-fail=restart \
    op stop interval=0 timeout=120s on-fail=block \
    meta migration-threshold=5 failure-timeout=2m target-role=Started


Thanks!

Attila


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-18 Thread Attila Megyeri
Hello,

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 18, 2014 2:43 AM
 To: Attila Megyeri
 Cc: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 13 Mar 2014, at 11:44 pm, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
  Hello,
 
  -Original Message-
  From: Jan Friesse [mailto:jfrie...@redhat.com]
  Sent: Thursday, March 13, 2014 10:03 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  ...
 
 
  Also can you please try to set debug: on in corosync.conf and
  paste full corosync.log then?
 
  I set debug to on, and did a few restarts but could not reproduce
  the issue
  yet - will post the logs as soon as I manage to reproduce.
 
 
  Perfect.
 
  Another option you can try to set is netmtu (1200 is usually safe).
 
  Finally I was able to reproduce the issue.
  I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
  (not
  when node was up again).
 
  The corosync log with debug on is available at:
  http://pastebin.com/kTpDqqtm
 
 
  To be honest, I had to wait much longer for this reproduction as
  before,
  even though there was no change in the corosync configuration - just
  potentially some system updates. But anyway, the issue is
  unfortunately still there.
  Previously, when this issue came, cpu was at 100% on all nodes -
  this time
  only on ctmgr, which was the DC...
 
  I hope you can find some useful details in the log.
 
 
  Attila,
  what seems to be interesting is
 
  Configuration ERRORs found during PE processing.  Please run crm_verify
 -L
  to identify issues.
 
  I'm unsure how much is this problem but I'm really not pacemaker expert.
 
  Perhaps Andrew could comment on that. Any idea?
 
 Did you run the command?  What did it say?

Yes, all was fine. This is why I found it strange.





Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-18 Thread Attila Megyeri
Hi Andrew,


 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 18, 2014 11:40 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 18 Mar 2014, at 6:03 pm, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
  Hello,
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 18, 2014 2:43 AM
  To: Attila Megyeri
  Cc: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 13 Mar 2014, at 11:44 pm, Attila Megyeri amegyeri@minerva-
 soft.com
  wrote:
 
  Hello,
 
  -Original Message-
  From: Jan Friesse [mailto:jfrie...@redhat.com]
  Sent: Thursday, March 13, 2014 10:03 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  ...
 
  Attila,
  what seems to be interesting is
 
  Configuration ERRORs found during PE processing.  Please run
 crm_verify
  -L
  to identify issues.
 
  I'm unsure how much is this problem but I'm really not pacemaker
 expert.
 
  Perhaps Andrew could comment on that. Any idea?
 
  Did you run the command?  What did it say?
 
  Yes, all was fine. This is why I found it strange.
 
 If you still have /var/lib/pacemaker/pengine/pe-error-7.bz2 from ctdb2, then
 I should be able to figure out what it was complaining about.
 (You can also run: crm_verify --xml-file /var/lib/pacemaker/pengine/pe-
 error-7.bz2 -V )

The file is still there; the crm_verify check succeeds (exit code 0) with no 
output. The file is full of confidential data, but if you think you can find 
something useful in it I can send it to you in a direct mail.
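
In case it helps, I can also re-run it with more verbosity - repeated -V flags 
raise the log level, as with the other Pacemaker tools:

    crm_verify --xml-file /var/lib/pacemaker/pengine/pe-error-7.bz2 -VVV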

thanks!







Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-17 Thread Attila Megyeri
Hi David, Jan,

For the time being, corosync 2.3.3 with libqb 0.17.0 looks stable, both built 
from source.
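
We confirmed on each node that the new builds are the ones actually in use, 
e.g. with:

    corosync -v                  # version of the running corosync build
    ldconfig -p | grep libqb     # which libqb the runtime linker resolves
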
Thank you very much for the guidance!

Attila

 -Original Message-
 From: David Vossel [mailto:dvos...@redhat.com]
 Sent: Thursday, March 13, 2014 9:22 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 
 
 
 - Original Message -
  From: Jan Friesse jfrie...@redhat.com
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Thursday, March 13, 2014 4:03:28 AM
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  ...
 
  
   Also can you please try to set debug: on in corosync.conf and
   paste full corosync.log then?
  
   I set debug to on, and did a few restarts but could not reproduce
   the issue
   yet - will post the logs as soon as I manage to reproduce.
  
  
   Perfect.
  
   Another option you can try to set is netmtu (1200 is usually safe).
  
   Finally I was able to reproduce the issue.
   I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
   (not when node was up again).
  
   The corosync log with debug on is available at:
   http://pastebin.com/kTpDqqtm
  
  
   To be honest, I had to wait much longer for this reproduction as
   before, even though there was no change in the corosync
   configuration - just potentially some system updates. But anyway,
   the issue is unfortunately still there.
   Previously, when this issue came, cpu was at 100% on all nodes -
   this time only on ctmgr, which was the DC...
  
   I hope you can find some useful details in the log.
  
 
  Attila,
  what seems to be interesting is
 
  Configuration ERRORs found during PE processing.  Please run
  crm_verify -L to identify issues.
 
  I'm unsure how much is this problem but I'm really not pacemaker expert.
 
  Anyway, I have theory what may happening and it looks like related
  with IPC (and probably not related to network). But to make sure we
  will not try fixing already fixed bug, can you please build:
  - New libqb (0.17.0). There are plenty of fixes in IPC
  - Corosync 2.3.3 (already plenty IPC fixes)
 
 yes, there was a libqb/corosync interoperation problem that showed these
 same symptoms last year. Updating to the latest corosync and libqb will likely
 resolve this.
 
  - And maybe also newer pacemaker
 
  I know you were not very happy using hand-compiled sources, but please
  give them at least a try.
 
  Thanks,
Honza
 
   Thanks,
   Attila
  
  
  
  
   Regards,
 Honza
  
  
   There are also a few things that might or might not be related:
  
   1) Whenever I want to edit the configuration with crm configure
   edit,
 
  ...
 


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-14 Thread Attila Megyeri
Hello David,


 -Original Message-
 From: David Vossel [mailto:dvos...@redhat.com]
 Sent: Thursday, March 13, 2014 9:22 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 
 
 
 - Original Message -
  From: Jan Friesse jfrie...@redhat.com
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Thursday, March 13, 2014 4:03:28 AM
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  ...
 
  
   Also can you please try to set debug: on in corosync.conf and
   paste full corosync.log then?
  
   I set debug to on, and did a few restarts but could not reproduce
   the issue
   yet - will post the logs as soon as I manage to reproduce.
  
  
   Perfect.
  
   Another option you can try to set is netmtu (1200 is usually safe).
  
   Finally I was able to reproduce the issue.
   I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
   (not when node was up again).
  
   The corosync log with debug on is available at:
   http://pastebin.com/kTpDqqtm
  
  
   To be honest, I had to wait much longer for this reproduction as
   before, even though there was no change in the corosync
   configuration - just potentially some system updates. But anyway,
   the issue is unfortunately still there.
   Previously, when this issue came, cpu was at 100% on all nodes -
   this time only on ctmgr, which was the DC...
  
   I hope you can find some useful details in the log.
  
 
  Attila,
  what seems to be interesting is
 
  Configuration ERRORs found during PE processing.  Please run
  crm_verify -L to identify issues.
 
  I'm unsure how much is this problem but I'm really not pacemaker expert.
 
  Anyway, I have theory what may happening and it looks like related
  with IPC (and probably not related to network). But to make sure we
  will not try fixing already fixed bug, can you please build:
  - New libqb (0.17.0). There are plenty of fixes in IPC
  - Corosync 2.3.3 (already plenty IPC fixes)
 
 yes, there was a libqb/corosync interoperation problem that showed these
 same symptoms last year. Updating to the latest corosync and libqb will likely
 resolve this.

I have upgraded all nodes to these versions and we are testing. So far, no issues.
Thank you very much for your help.

Regards,
Attila





 
  - And maybe also newer pacemaker
 
  I know you were not very happy using hand-compiled sources, but please
  give them at least a try.
 
  Thanks,
Honza
 
   Thanks,
   Attila
  
  
  
  
   Regards,
 Honza
  
  
   There are also a few things that might or might not be related:
  
   1) Whenever I want to edit the configuration with crm configure
   edit,
 
  ...
 


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread Attila Megyeri
Hello,

 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Thursday, March 13, 2014 10:03 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 ...
 
 
  Also can you please try to set debug: on in corosync.conf and paste
  full corosync.log then?
 
  I set debug to on, and did a few restarts but could not reproduce
  the issue
  yet - will post the logs as soon as I manage to reproduce.
 
 
  Perfect.
 
  Another option you can try to set is netmtu (1200 is usually safe).
 
  Finally I was able to reproduce the issue.
  I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not
 when node was up again).
 
  The corosync log with debug on is available at:
  http://pastebin.com/kTpDqqtm
 
 
  To be honest, I had to wait much longer for this reproduction as before,
 even though there was no change in the corosync configuration - just
 potentially some system updates. But anyway, the issue is unfortunately still
 there.
  Previously, when this issue came, cpu was at 100% on all nodes - this time
 only on ctmgr, which was the DC...
 
  I hope you can find some useful details in the log.
 
 
 Attila,
 what seems to be interesting is
 
 Configuration ERRORs found during PE processing.  Please run crm_verify -L
 to identify issues.
 
 I'm unsure how much is this problem but I'm really not pacemaker expert.

Perhaps Andrew could comment on that. Any idea?


 
 Anyway, I have theory what may happening and it looks like related with IPC
 (and probably not related to network). But to make sure we will not try fixing
 already fixed bug, can you please build:
 - New libqb (0.17.0). There are plenty of fixes in IPC
 - Corosync 2.3.3 (already plenty IPC fixes)
 - And maybe also newer pacemaker
 

I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from the 
Ubuntu package.
I am currently building libqb 0.17.0 and will update you on the results.
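
Roughly the usual autotools sequence - a sketch; the exact tarball name and 
install prefix may differ here:

    tar xf libqb-0.17.0.tar.gz
    cd libqb-0.17.0
    ./configure --prefix=/usr
    make
    sudo make install
    sudo ldconfig   # refresh the linker cache so corosync picks up the new library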

In the meantime we had another freeze, which did not seem to be related to any 
restarts but brought all corosync processes to 100%.
Please check out the corosync.log; perhaps it has a different cause: 
http://pastebin.com/WMwzv0Rr 


Meanwhile I will install the new libqb and send logs if we have further 
issues.

Thank you very much for your help!

Regards,
Attila



 I know you were not very happy using hand-compiled sources, but please
 give them at least a try.
 
 Thanks,
   Honza
 
  Thanks,
  Attila
 
 
 
 
  Regards,
Honza
 
 
  There are also a few things that might or might not be related:
 
  1) Whenever I want to edit the configuration with crm configure
  edit,
 
 ...
 


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread Attila Megyeri

 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Thursday, March 13, 2014 1:45 PM
 To: The Pacemaker cluster resource manager; Andrew Beekhof
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 Hello,
 
  -Original Message-
  From: Jan Friesse [mailto:jfrie...@redhat.com]
  Sent: Thursday, March 13, 2014 10:03 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  ...
 
  
   Also can you please try to set debug: on in corosync.conf and
   paste full corosync.log then?
  
   I set debug to on, and did a few restarts but could not reproduce
   the issue
   yet - will post the logs as soon as I manage to reproduce.
  
  
   Perfect.
  
   Another option you can try to set is netmtu (1200 is usually safe).
  
   Finally I was able to reproduce the issue.
   I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
   (not
  when node was up again).
  
   The corosync log with debug on is available at:
   http://pastebin.com/kTpDqqtm
  
  
   To be honest, I had to wait much longer for this reproduction as
   before,
  even though there was no change in the corosync configuration - just
  potentially some system updates. But anyway, the issue is
  unfortunately still there.
   Previously, when this issue came, cpu was at 100% on all nodes -
   this time
  only on ctmgr, which was the DC...
  
   I hope you can find some useful details in the log.
  
 
  Attila,
  what seems to be interesting is
 
  Configuration ERRORs found during PE processing.  Please run crm_verify -
 L
  to identify issues.
 
  I'm unsure how much is this problem but I'm really not pacemaker expert.
 
 Perhaps Andrew could comment on that. Any idea?
 
 
 
  Anyway, I have theory what may happening and it looks like related
  with IPC (and probably not related to network). But to make sure we
  will not try fixing already fixed bug, can you please build:
  - New libqb (0.17.0). There are plenty of fixes in IPC
  - Corosync 2.3.3 (already plenty IPC fixes)
  - And maybe also newer pacemaker
 
 
 I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from
 Ubuntu package.
 I am currently building libqb 0.17.0, will update you on the results.
 
 In the meantime we had another freeze, which did not seem to be related to
 any restarts, but brought all corosync processes to 100%.
 Please check out the corosync.log, perhaps it is a different cause:
 http://pastebin.com/WMwzv0Rr
 
 
 In the meantime I will install the new libqb and send logs if we have further
 issues.
 
 Thank you very much for your help!
 
 Regards,
 Attila
 

One more question:

If I install libqb 0.17.0 from source, do I need to rebuild corosync as well, 
or will it be fine even though it was built against libqb 0.16.0?
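
My assumption is that no rebuild is needed as long as the library soname is 
unchanged - it should still be libqb.so.0 for both 0.16 and 0.17 - and what 
corosync actually loads can be verified with:

    ldd $(which corosync) | grep libqb
    # expected output, roughly:  libqb.so.0 => /usr/lib/libqb.so.0 (0x...)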

BTW, in the meantime I installed the new libqb on 3 of the 7 hosts, so I can 
see if it makes a difference. If I see crashes on the outdated ones, but not on 
the new ones, we are fine. :)

Thanks,

Attila







 
 
  I know you were not very happy using hand-compiled sources, but please
  give them at least a try.
 
  Thanks,
Honza
 
   Thanks,
   Attila
  
  
  
  
   Regards,
 Honza
  
  
   There are also a few things that might or might not be related:
  
   1) Whenever I want to edit the configuration with crm configure
   edit,
 
  ...
 


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread Attila Megyeri
Hi Honza,

What I also found in the log related to the freeze at 12:22:26:


Corosync main process was not scheduled for  ... Can it be the general 
cause of the issue?



Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
[10.9.1.5]:58597->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
[10.9.1.5]:47943->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
[10.9.1.5]:47943->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
[10.9.1.5]:59647->[10.9.1.3]:161


Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process was not 
scheduled for 6327.5918 ms (threshold is 4000.0000 ms). Consider token timeout 
increase.


Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] The token was lost in the 
OPERATIONAL state.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] A processor failed, forming 
new configuration.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering GATHER state from 
2(The token was lost in the OPERATIONAL state.).
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Creating commit token because 
I am the rep.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Saving state aru 6a8c high seq 
received 6a8c
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Storing new sequence id for 
ring 7dc
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering COMMIT state.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] got commit token
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering RECOVERY state.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [0] member 10.9.1.3:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [1] member 10.9.1.41:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [2] member 10.9.1.42:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [3] member 10.9.1.71:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [4] member 10.9.1.72:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [5] member 10.9.2.11:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [6] member 10.9.2.12:
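
Based on the Consider token timeout increase hint, I am thinking of raising 
the token timeout along these lines - values are illustrative, nothing we have 
tested yet:

    totem {
            ...
            # tolerate longer scheduling stalls (e.g. the 6.3 s pause above)
            # before declaring token loss; the corosync 2.x default is 1000 ms
            token: 10000
            token_retransmits_before_loss_const: 10
    }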




Regards,
Attila

 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Thursday, March 13, 2014 2:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  -Original Message-
  From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
  Sent: Thursday, March 13, 2014 1:45 PM
  To: The Pacemaker cluster resource manager; Andrew Beekhof
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  Hello,
 
   -Original Message-
   From: Jan Friesse [mailto:jfrie...@redhat.com]
   Sent: Thursday, March 13, 2014 10:03 AM
   To: The Pacemaker cluster resource manager
   Subject: Re: [Pacemaker] Pacemaker/corosync freeze
  
   ...
  
   
Also can you please try to set debug: on in corosync.conf and
paste full corosync.log then?
   
I set debug to on, and did a few restarts but could not
reproduce the issue
yet - will post the logs as soon as I manage to reproduce.
   
   
Perfect.
   
Another option you can try to set is netmtu (1200 is usually safe).
   
Finally I was able to reproduce the issue.
I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
(not
   when node was up again).
   
The corosync log with debug on is available at:
http://pastebin.com/kTpDqqtm
   
   
To be honest, I had to wait much longer for this reproduction as
before,
   even though there was no change in the corosync configuration - just
   potentially some system updates. But anyway, the issue is
   unfortunately still there.
Previously, when this issue came, cpu was at 100% on all nodes -
this time
   only on ctmgr, which was the DC...
   
I hope you can find some useful details in the log.
   
  
   Attila,
   what seems to be interesting is
  
   Configuration ERRORs found during PE processing.  Please run
   crm_verify -
  L
   to identify issues.
  
   I'm unsure how much is this problem but I'm really not pacemaker
 expert.
 
  Perhaps Andrew could comment on that. Any idea?
 
 
  
   Anyway, I have theory what may happening and it looks like related
   with IPC (and probably not related to network). But to make sure we
   will not try fixing already fixed bug, can you please build:
   - New libqb (0.17.0). There are plenty of fixes in IPC
   - Corosync 2.3.3 (already plenty IPC fixes)
   - And maybe also newer pacemaker
  
 
  I already use Corosync 2.3.3, built from source, and libqb-dev 0.16
  from Ubuntu package.
  I am currently building libqb 0.17.0, will update you on the results.
 
  In the meantime we had another freeze, which did not seem to be
  related to any restarts, but brought all coroync processes to 100%.
  Please check out the corosync.log, perhaps it is a different cause:
  http://pastebin.com/WMwzv0Rr
 
 
  In the meantime I will install the new libqb and send logs if we have
  further issues.
 
  Thank you very much

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 10:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 12:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com
  wrote:
 
  Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Hello,
 
  We have a strange issue with Corosync/Pacemaker.
  From time to time, something unexpected happens and suddenly the
  crm_mon output remains static.
  When I check the cpu usage, I see that one of the cores uses 100%
  cpu, but
  cannot actually match it to either the corosync or one of the
  pacemaker processes.
 
  In such a case, this high CPU usage is happening on all 7 nodes.
  I have to manually go to each node, stop pacemaker, restart
  corosync, then
  start pacemeker. Stoping pacemaker and corosync does not work in
  most of the cases, usually a kill -9 is needed.
 
  Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
  Using udpu as transport, two rings on Gigabit ETH, rro_mode passive.
 
  Logs are usually flooded with CPG related messages, such as:
 
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
 
  OR
 
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0
 CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0
 CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0
 CPG
  messages  (1 remaining, last=10933): Try again (
 
  That is usually a symptom of corosync getting into a horribly
  confused
  state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
 
 
  As I wrote I use Ubuntu trusty, the exact package versions are:
 
  corosync 2.3.0-1ubuntu5
  pacemaker 1.1.10+git20130802-1ubuntu2
 
  Ah sorry, I seem to have missed that part.
 
 
  There are no updates available. The only option is to install from
  sources,
  but that would be very difficult to maintain and I'm not sure I would
  get rid of this issue.
 
  What do you recommend?
 
  The same thing as Lars, or switch to a distro that stays current with
  upstream (git shows 5 newer releases for that branch since it was
  released 3 years ago).
  If you do build from source, its probably best to go with v1.4.6
 
  Hm, I am a bit confused here. We are using 2.3.0,
 
 I swapped the 2 for a 1 somehow. A bit distracted, sorry.

I upgraded all nodes to 2.3.3 and at first it seemed a bit better, but it is 
still the same issue - after some time the CPU goes to 100% and the corosync 
log is flooded with messages like:

Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 
CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 0 
CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 
CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 0 
CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 
CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:57 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 0 
CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 
CPG messages  (48 remaining, last=3671): Try again (6)


Shall I try to downgrade to 1.4.6? What is the difference in that build? Or 
where should I start troubleshooting?
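
One thing I can capture the next time it spins - assuming gdb is available on 
the nodes - is a backtrace of the busy corosync process:

    gdb -p $(pidof corosync) -batch -ex 'thread apply all bt'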

Thank you in advance.






 
  which was released approx. a year ago (you mention 3

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri
Hello Jan,

Thank you very much for your help so far.

 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Wednesday, March 12, 2014 9:51 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 Attila Megyeri napsal(a):
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 10:27 PM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 12 Mar 2014, at 1:54 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 12:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:54 pm, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Hello,
 
  We have a strange issue with Corosync/Pacemaker.
  From time to time, something unexpected happens and suddenly
 the
  crm_mon output remains static.
  When I check the cpu usage, I see that one of the cores uses
  100% cpu, but
  cannot actually match it to either the corosync or one of the
  pacemaker processes.
 
  In such a case, this high CPU usage is happening on all 7 nodes.
  I have to manually go to each node, stop pacemaker, restart
  corosync, then
  start pacemeker. Stoping pacemaker and corosync does not work in
  most of the cases, usually a kill -9 is needed.
 
  Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
  Using udpu as transport, two rings on Gigabit ETH, rro_mode
 passive.
 
  Logs are usually flooded with CPG related messages, such as:
 
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: 
Sent
 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: 
Sent
 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: 
Sent
 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: 
Sent
 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
 
  OR
 
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
Sent 0
  CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
Sent 0
  CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
Sent 0
  CPG
  messages  (1 remaining, last=10933): Try again (
 
  That is usually a symptom of corosync getting into a horribly
  confused
  state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
 
 
  As I wrote I use Ubuntu trusty, the exact package versions are:
 
  corosync 2.3.0-1ubuntu5
  pacemaker 1.1.10+git20130802-1ubuntu2
 
  Ah sorry, I seem to have missed that part.
 
 
  There are no updates available. The only option is to install from
  sources,
  but that would be very difficult to maintain and I'm not sure I
  would get rid of this issue.
 
  What do you recommend?
 
  The same thing as Lars, or switch to a distro that stays current
  with upstream (git shows 5 newer releases for that branch since it
  was released 3 years ago).
  If you do build from source, its probably best to go with v1.4.6
 
  Hm, I am a bit confused here. We are using 2.3.0,
 
  I swapped the 2 for a 1 somehow. A bit distracted, sorry.
 
  I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still 
  the
 same issue - after some time CPU gets to 100%, and the corosync log is
 flooded with messages like:
 
  Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:
  Sent 0 CPG
 messages  (48 remaining, last=3671): Try again (6)
  Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:
  Sent 0 CPG
 messages  (51 remaining, last=3995): Try again (6)
  Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:
  Sent 0 CPG
 messages  (48 remaining, last=3671): Try again (6)
  Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush:
  Sent 0 CPG
 messages  (51 remaining, last=3995): Try again (6)
  Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:
  Sent 0 CPG
 messages  (48 remaining, last=3671): Try again (6)
  Mar 12 07:36:57 [4798] ctdb2   crmd: info: crm_cs_flush:
  Sent 0 CPG
 messages  (51 remaining, last=3995

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri
 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Wednesday, March 12, 2014 2:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 Attila Megyeri napsal(a):
  Hello Jan,
 
  Thank you very much for your help so far.
 
  -Original Message-
  From: Jan Friesse [mailto:jfrie...@redhat.com]
  Sent: Wednesday, March 12, 2014 9:51 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  Attila Megyeri napsal(a):
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 10:27 PM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 12 Mar 2014, at 1:54 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 12:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:54 pm, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Hello,
 
  We have a strange issue with Corosync/Pacemaker.
  From time to time, something unexpected happens and
 suddenly
  the
  crm_mon output remains static.
  When I check the cpu usage, I see that one of the cores uses
  100% cpu, but
  cannot actually match it to either the corosync or one of the
  pacemaker processes.
 
  In such a case, this high CPU usage is happening on all 7 nodes.
  I have to manually go to each node, stop pacemaker, restart
  corosync, then
  start pacemeker. Stoping pacemaker and corosync does not work
  in most of the cases, usually a kill -9 is needed.
 
  Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
  Using udpu as transport, two rings on Gigabit ETH, rro_mode
  passive.
 
  Logs are usually flooded with CPG related messages, such as:
 
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:
 Sent
  0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:
 Sent
  0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:
 Sent
  0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:
 Sent
  0
  CPG
  messages  (1 remaining, last=8): Try again (6)
 
  OR
 
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0
  CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0
  CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0
  CPG
  messages  (1 remaining, last=10933): Try again (
 
  That is usually a symptom of corosync getting into a horribly
  confused
  state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
 
 
  As I wrote I use Ubuntu trusty, the exact package versions are:
 
  corosync 2.3.0-1ubuntu5
  pacemaker 1.1.10+git20130802-1ubuntu2
 
  Ah sorry, I seem to have missed that part.
 
 
  There are no updates available. The only option is to install
  from sources,
  but that would be very difficult to maintain and I'm not sure I
  would get rid of this issue.
 
  What do you recommend?
 
  The same thing as Lars, or switch to a distro that stays current
  with upstream (git shows 5 newer releases for that branch since
  it was released 3 years ago).
  If you do build from source, its probably best to go with v1.4.6
 
  Hm, I am a bit confused here. We are using 2.3.0,
 
  I swapped the 2 for a 1 somehow. A bit distracted, sorry.
 
  I upgraded all nodes to 2.3.3 and first it seemed a bit better, but
  still the
  same issue - after some time CPU gets to 100%, and the corosync log
  is flooded with messages like:
 
  Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:
  Sent 0 CPG
  messages  (48 remaining, last=3671): Try again (6)
  Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:
  Sent 0
 CPG
  messages  (51 remaining, last=3995): Try again (6)
  Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:
  Sent 0 CPG
  messages  (48 remaining, last=3671): Try again (6)
  Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush:
  Sent 0
 CPG
  messages  (51 remaining, last=3995): Try again (6)
  Mar 12 07:36:57 [4793] ctdb2cib: info

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri


 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Wednesday, March 12, 2014 4:31 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 Attila Megyeri napsal(a):
  -Original Message-
  From: Jan Friesse [mailto:jfrie...@redhat.com]
  Sent: Wednesday, March 12, 2014 2:27 PM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  Attila Megyeri napsal(a):
  Hello Jan,
 
  Thank you very much for your help so far.
 
  -Original Message-
  From: Jan Friesse [mailto:jfrie...@redhat.com]
  Sent: Wednesday, March 12, 2014 9:51 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  Attila Megyeri napsal(a):
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 10:27 PM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 12 Mar 2014, at 1:54 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 12:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:54 pm, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Hello,
 
  We have a strange issue with Corosync/Pacemaker.
  From time to time, something unexpected happens and
  suddenly
  the
  crm_mon output remains static.
  When I check the cpu usage, I see that one of the cores uses
  100% cpu, but
  cannot actually match it to either the corosync or one of the
  pacemaker processes.
 
  In such a case, this high CPU usage is happening on all 7 nodes.
  I have to manually go to each node, stop pacemaker, restart
  corosync, then
  start pacemeker. Stoping pacemaker and corosync does not
 work
  in most of the cases, usually a kill -9 is needed.
 
  Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
  Using udpu as transport, two rings on Gigabit ETH, rro_mode
  passive.
 
  Logs are usually flooded with CPG related messages, such as:
 
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:
  Sent
  0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:
  Sent
  0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:
  Sent
  0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:
  Sent
  0
  CPG
  messages  (1 remaining, last=8): Try again (6)
 
  OR
 
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0
  CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0
  CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0
  CPG
  messages  (1 remaining, last=10933): Try again (
 
  That is usually a symptom of corosync getting into a horribly
  confused
  state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
 
 
  As I wrote I use Ubuntu trusty, the exact package versions are:
 
  corosync 2.3.0-1ubuntu5
  pacemaker 1.1.10+git20130802-1ubuntu2
 
  Ah sorry, I seem to have missed that part.
 
 
  There are no updates available. The only option is to install
  from sources,
  but that would be very difficult to maintain and I'm not sure I
  would get rid of this issue.
 
  What do you recommend?
 
  The same thing as Lars, or switch to a distro that stays
  current with upstream (git shows 5 newer releases for that
  branch since it was released 3 years ago).
  If you do build from source, its probably best to go with
  v1.4.6
 
  Hm, I am a bit confused here. We are using 2.3.0,
 
  I swapped the 2 for a 1 somehow. A bit distracted, sorry.
 
  I upgraded all nodes to 2.3.3 and first it seemed a bit better,
  but still the
  same issue - after some time CPU gets to 100%, and the corosync log
  is flooded with messages like:
 
  Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:
  Sent 0
 CPG
  messages  (48 remaining, last=3671): Try again (6)
  Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:
  Sent 0
  CPG
  messages  (51 remaining, last=3995): Try again (6)
  Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-11 Thread Attila Megyeri

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 12:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
  Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com
  wrote:
 
  Hello,
 
  We have a strange issue with Corosync/Pacemaker.
  From time to time, something unexpected happens and suddenly the
  crm_mon output remains static.
  When I check the cpu usage, I see that one of the cores uses 100%
  cpu, but
  cannot actually match it to either the corosync or one of the
  pacemaker processes.
 
  In such a case, this high CPU usage is happening on all 7 nodes.
  I have to manually go to each node, stop pacemaker, restart
  corosync, then
  start pacemeker. Stoping pacemaker and corosync does not work in most
  of the cases, usually a kill -9 is needed.
 
  Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
  Using udpu as transport, two rings on Gigabit ETH, rro_mode passive.
 
  Logs are usually flooded with CPG related messages, such as:
 
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0
 CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0
 CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0
 CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0
 CPG
  messages  (1 remaining, last=8): Try again (6)
 
  OR
 
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0 CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0 CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0 CPG
  messages  (1 remaining, last=10933): Try again (
 
  That is usually a symptom of corosync getting into a horribly confused
 state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
 
 
  As I wrote I use Ubuntu trusty, the exact package versions are:
 
  corosync 2.3.0-1ubuntu5
  pacemaker 1.1.10+git20130802-1ubuntu2
 
 Ah sorry, I seem to have missed that part.
 
 
  There are no updates available. The only option is to install from sources,
 but that would be very difficult to maintain and I'm not sure I would get rid 
 of
 this issue.
 
  What do you recommend?
 
 The same thing as Lars, or switch to a distro that stays current with upstream
 (git shows 5 newer releases for that branch since it was released 3 years
 ago).
 If you do build from source, its probably best to go with v1.4.6

Hm, I am a bit confused here. We are using 2.3.0, which was released approx. a 
year ago (you mention 3 years), and you recommend 1.4.6, which is a rather old 
version.
Could you please clarify a bit? :)
Lars recommends the 2.3.3 git tree.

I might end up trying both, but I just want to make sure I am not 
misunderstanding something badly.

Thank you!








 
 
 
 
  HTOP show something like this (sorted by TIME+ descending):
 
 
 
   1  [100.0%] Tasks: 59, 4
  thr; 2 running
   2  [| 0.7%] Load average: 
  1.00 0.99 1.02
   Mem[ 165/994MB] Uptime: 1
  day, 10:22:03
   Swp[   0/509MB]
 
   PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
   921 root   20   0  188M 49220 33856 R  0.0  4.8  3h33:58
 /usr/sbin/corosync
  1277 snmp   20   0 45708  4248  1472 S  0.0  0.4  1:33.07 
  /usr/sbin/snmpd
 -
  Lsd -Lf /dev/null -u snmp -g snm
  1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71
  /usr/lib/pacemaker/cib
  1312 root   20   0  104M  7484  3780 S  0.0  0.7  0:38.06
  /usr/lib/pacemaker/stonithd
  1611 root   -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 
  /usr/sbin/watchdog
  1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62
  /usr/lib/pacemaker/crmd
  1313 root   20   0 81784  3800  2876 S  0.0  0.4  0:18.64
  /usr/lib/pacemaker/lrmd
  1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01
  /usr/lib/pacemaker/attrd
  1309 root   20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
  1250 root   20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: 
  read
 process
  1315 hacluster  20   0 73892  2652

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-07 Thread Attila Megyeri
One more thing to add: I did an apt-get upgrade on one of the nodes and then 
restarted it. That put all the other nodes into this state again...

 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Friday, March 07, 2014 7:54 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com
  wrote:
 
   Hello,
  
   We have a strange issue with Corosync/Pacemaker.
   From time to time, something unexpected happens and suddenly the
  crm_mon output remains static.
   When I check the cpu usage, I see that one of the cores uses 100%
   cpu, but
  cannot actually match it to either the corosync or one of the
  pacemaker processes.
  
   In such a case, this high CPU usage is happening on all 7 nodes.
   I have to manually go to each node, stop pacemaker, restart
   corosync, then
  start pacemeker. Stoping pacemaker and corosync does not work in most
  of the cases, usually a kill -9 is needed.
  
   Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
  
   Using udpu as transport, two rings on Gigabit ETH, rro_mode passive.
  
   Logs are usually flooded with CPG related messages, such as:
  
   Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
   Sent 0
 CPG
  messages  (1 remaining, last=8): Try again (6)
   Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
   Sent 0
 CPG
  messages  (1 remaining, last=8): Try again (6)
   Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
   Sent 0
 CPG
  messages  (1 remaining, last=8): Try again (6)
   Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
   Sent 0
 CPG
  messages  (1 remaining, last=8): Try again (6)
  
   OR
  
   Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
   Sent 0 CPG
  messages  (1 remaining, last=10933): Try again (
   Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
   Sent 0 CPG
  messages  (1 remaining, last=10933): Try again (
   Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
   Sent 0 CPG
  messages  (1 remaining, last=10933): Try again (
 
  That is usually a symptom of corosync getting into a horribly confused 
  state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
  
 
 As I wrote I use Ubuntu trusty, the exact package versions are:
 
 corosync 2.3.0-1ubuntu5
 pacemaker 1.1.10+git20130802-1ubuntu2
 
 There are no updates available. The only option is to install from sources, 
 but
 that would be very difficult to maintain and I'm not sure I would get rid of 
 this
 issue.
 
 What do you recommend?
 
 
  
   HTOP show something like this (sorted by TIME+ descending):
  
  
  
 1  [100.0%] Tasks: 59, 4
  thr; 2 running
 2  [| 0.7%] Load average: 
   1.00 0.99 1.02
 Mem[ 165/994MB] Uptime: 1
  day, 10:22:03
 Swp[   0/509MB]
  
 PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 921 root   20   0  188M 49220 33856 R  0.0  4.8  3h33:58
 /usr/sbin/corosync
   1277 snmp   20   0 45708  4248  1472 S  0.0  0.4  1:33.07 
   /usr/sbin/snmpd -
  Lsd -Lf /dev/null -u snmp -g snm
   1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71
  /usr/lib/pacemaker/cib
   1312 root   20   0  104M  7484  3780 S  0.0  0.7  0:38.06
  /usr/lib/pacemaker/stonithd
   1611 root   -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 
   /usr/sbin/watchdog
   1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62
  /usr/lib/pacemaker/crmd
   1313 root   20   0 81784  3800  2876 S  0.0  0.4  0:18.64
  /usr/lib/pacemaker/lrmd
   1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01
  /usr/lib/pacemaker/attrd
   1309 root   20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
   1250 root   20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: 
   read
 process
   1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25
  /usr/lib/pacemaker/pengine
   1252 root   20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: 
   write
 process
   1835 ntp20   0 27216  1980  1408 S  0.0  0.2  0:11.80 
   /usr/sbin/ntpd -p
  /var/run/ntpd.pid -g -u 105:112
 899 root   20   0 19168   700   488 S  0.0  0.1  0:09.75 
   /usr/sbin/irqbalance
   1642 root   20   0 30696  1556   912 S  0.0  0.2  0:06.49 
   /usr/bin/monit -c
  /etc/monit/monitrc
   4374 kamailio   20   0

[Pacemaker] Pacemaker/corosync freeze

2014-03-06 Thread Attila Megyeri
Hello,

We have a strange issue with Corosync/Pacemaker.
From time to time, something unexpected happens and the crm_mon output 
suddenly freezes.
When I check the CPU usage, I see that one of the cores is at 100%, but I 
cannot actually match it to corosync or to any of the pacemaker processes.

When this happens, the high CPU usage occurs on all 7 nodes.
I have to manually go to each node, stop pacemaker, restart corosync, then 
start pacemaker. Stopping pacemaker and corosync does not work in most cases; 
usually a kill -9 is needed.

Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.

Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
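
For reference, the relevant totem settings look essentially like this (a 
sketch; addresses are illustrative, based on the subnets visible in the logs - 
with udpu the nodes talk unicast UDP on the totem ports, typically 5404/5405, 
which the firewall must allow between all nodes):

    totem {
            version: 2
            transport: udpu
            rrp_mode: passive
            interface {
                    ringnumber: 0
                    bindnetaddr: 10.9.1.0
            }
            interface {
                    ringnumber: 1
                    bindnetaddr: 10.9.2.0
            }
    }
    nodelist {
            node {
                    ring0_addr: 10.9.1.3
                    ring1_addr: 10.9.2.3    # second-ring address is hypothetical
            }
            # ... one node { } entry per cluster member
    }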

Logs are usually flooded with CPG related messages, such as:

Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 
CPG messages  (1 remaining, last=8): Try again (6)
Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 
CPG messages  (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 
CPG messages  (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 
CPG messages  (1 remaining, last=8): Try again (6)

OR

Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent 0 
CPG messages  (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent 0 
CPG messages  (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent 0 
CPG messages  (1 remaining, last=10933): Try again (


HTOP shows something like this (sorted by TIME+ descending):



  1  [100.0%] Tasks: 59, 4 thr; 2 running
  2  [| 0.7%] Load average: 1.00 0.99 1.02
  Mem[ 165/994MB] Uptime: 1 day, 10:22:03
  Swp[   0/509MB]

  PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  921 root   20   0  188M 49220 33856 R  0.0  4.8  3h33:58 
/usr/sbin/corosync
1277 snmp   20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd 
-Lsd -Lf /dev/null -u snmp -g snm
1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71 
/usr/lib/pacemaker/cib
1312 root   20   0  104M  7484  3780 S  0.0  0.7  0:38.06 
/usr/lib/pacemaker/stonithd
1611 root   -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62 
/usr/lib/pacemaker/crmd
1313 root   20   0 81784  3800  2876 S  0.0  0.4  0:18.64 
/usr/lib/pacemaker/lrmd
1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01 
/usr/lib/pacemaker/attrd
1309 root   20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
1250 root   20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read 
process
1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25 
/usr/lib/pacemaker/pengine
1252 root   20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write 
process
1835 ntp20   0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd -p 
/var/run/ntpd.pid -g -u 105:112
  899 root   20   0 19168   700   488 S  0.0  0.1  0:09.75 
/usr/sbin/irqbalance
1642 root   20   0 30696  1556   912 S  0.0  0.2  0:06.49 /usr/bin/monit -c 
/etc/monit/monitrc
4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7  0:02.77 
/usr/local/sbin/kamailio -f /etc/kamailio/kamaili
3079 root0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop -a 
-w /var/log/atop/atop_20140306 6
  445 syslog 20   0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
4373 kamailio   20   0  291M  7492  2396 S  0.0  0.7  0:01.03 
/usr/local/sbin/kamailio -f /etc/kamailio/kamaili
1 root   20   0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
  453 syslog 20   0  249M  6276   976 S  0.0  0.6  0:00.63 rsyslogd
  451 syslog 20   0  249M  6276   976 S  0.0  0.6  0:00.53 rsyslogd
4379 kamailio   20   0  291M  6224  1132 S  0.0  0.6  0:00.38 
/usr/local/sbin/kamailio -f /etc/kamailio/kamaili
4380 kamailio   20   0  291M  8516  3084 S  0.0  0.8  0:00.38 
/usr/local/sbin/kamailio -f /etc/kamailio/kamaili
4381 kamailio   20   0  291M  8252  2828 S  0.0  0.8  0:00.37 
/usr/local/sbin/kamailio -f /etc/kamailio/kamaili
23315 root   20   0 24872  2476  1412 R  0.7  0.2  0:00.37 htop
4367 kamailio   20   0  291M 1  4864 S  0.0  1.0  0:00.36 
/usr/local/sbin/kamailio -f /etc/kamailio/kamaili


My questions:

-   Is this a corosync or a pacemaker issue?

-   What are the CPG messages? Is it possible that we have a firewall issue?
    (A quick way to test this is sketched below.)
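
This is how I plan to test the firewall theory - a rough sketch; the port and 
interface are assumptions (5405 is the corosync default, check the mcastport 
values in your corosync.conf):

corosync-cfgtool -s             # ring status on each node
corosync-cpgtool                # list the CPG groups corosync knows about
iptables -L -n -v | grep 5405   # anything filtering the totem port?
tcpdump -ni eth0 udp port 5405  # is the udpu traffic actually flowing?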


Any hints would be great!

Thanks,
Attila
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: 

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-06 Thread Attila Megyeri
Thanks for the quick response!

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Friday, March 07, 2014 3:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
  Hello,
 
  We have a strange issue with Corosync/Pacemaker.
  From time to time, something unexpected happens and suddenly the
 crm_mon output remains static.
  When I check the cpu usage, I see that one of the cores uses 100% cpu, but
 cannot actually match it to either the corosync or one of the pacemaker
 processes.
 
  In such a case, this high CPU usage is happening on all 7 nodes.
  I have to manually go to each node, stop pacemaker, restart corosync, then
 start pacemaker. Stopping pacemaker and corosync does not work in most of
 the cases; usually a kill -9 is needed.
 
  Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
  Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
 
  Logs are usually flooded with CPG related messages, such as:
 
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0 CPG
 messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0 CPG
 messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0 CPG
 messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0 CPG
 messages  (1 remaining, last=8): Try again (6)
 
  OR
 
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0 CPG
 messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0 CPG
 messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0 CPG
 messages  (1 remaining, last=10933): Try again (
 
 That is usually a symptom of corosync getting into a horribly confused state.
 Version? Distro? Have you checked for an update?
 Odd that the user of all that CPU isn't showing up though.
 
 

As I wrote, I use Ubuntu trusty; the exact package versions are:

corosync 2.3.0-1ubuntu5
pacemaker 1.1.10+git20130802-1ubuntu2

There are no updates available. The only option is to install from source, but 
that would be very difficult to maintain, and I'm not sure it would get rid of 
this issue.

What do you recommend?


 
  HTOP shows something like this (sorted by TIME+ descending):
 
 
 
    1  [100.0%] Tasks: 59, 4 thr; 2 running
    2  [| 0.7%] Load average: 1.00 0.99 1.02
    Mem[ 165/994MB] Uptime: 1 day, 10:22:03
    Swp[   0/509MB]
 
PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
921 root   20   0  188M 49220 33856 R  0.0  4.8  3h33:58 
  /usr/sbin/corosync
  1277 snmp   20   0 45708  4248  1472 S  0.0  0.4  1:33.07 
  /usr/sbin/snmpd -
 Lsd -Lf /dev/null -u snmp -g snm
  1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71
 /usr/lib/pacemaker/cib
  1312 root   20   0  104M  7484  3780 S  0.0  0.7  0:38.06
 /usr/lib/pacemaker/stonithd
  1611 root   -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 
  /usr/sbin/watchdog
  1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62
 /usr/lib/pacemaker/crmd
  1313 root   20   0 81784  3800  2876 S  0.0  0.4  0:18.64
 /usr/lib/pacemaker/lrmd
  1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01
 /usr/lib/pacemaker/attrd
  1309 root   20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
  1250 root   20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read 
  process
  1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25
 /usr/lib/pacemaker/pengine
  1252 root   20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: 
  write process
  1835 ntp20   0 27216  1980  1408 S  0.0  0.2  0:11.80 
  /usr/sbin/ntpd -p
 /var/run/ntpd.pid -g -u 105:112
899 root   20   0 19168   700   488 S  0.0  0.1  0:09.75 
  /usr/sbin/irqbalance
  1642 root   20   0 30696  1556   912 S  0.0  0.2  0:06.49 
  /usr/bin/monit -c
 /etc/monit/monitrc
  4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7  0:02.77
 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
  3079 root0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop 
  -a -w
 /var/log/atop/atop_20140306 6
445 syslog 20   0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
  4373 kamailio   20   0  291M  7492  2396 S  0.0  0.7  0:01.03
 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
  1 root   20   0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
453 syslog 20   0  249M

Re: [Pacemaker] Mysql multiple slaves, slaves restarting occasionally without a reason

2013-09-12 Thread Attila Megyeri
No idea on this one?

Sent: Tuesday, September 10, 2013 8:07 AM
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] Mysql multiple slaves, slaves restarting occasionally 
without a reason

Hi,

We have a Mysql cluster which works fine when I have a single master (A) and 
slave (B). Failover is almost immediate and I am happy with this approach.
When we configured two additional slaves, strange things started to happen. From 
time to time I notice that all slave mysql instances are restarted, and I 
cannot figure out why.

I tried to find out what is happening, and this is how far I got:

There is a repeating sequence in the DC, which looks like this when everything 
is fine:

Sep 10 01:45:42 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Sep 10 01:45:42 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71358 
(ref=pe_calc-dc-1378777542-165977) derived from 
/var/lib/pengine/pe-input-3179.bz2
Sep 10 01:45:42 oamgr crmd: [3385]: notice: run_graph:  Transition 71358 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pengine/pe-input-3179.bz2): Complete
Sep 10 01:45:42 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_TRANSITION_ENGINE - S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 10 01:47:42 oamgr crmd: [3385]: info: crm_timer_popped: PEngine Recheck 
Timer (I_PE_CALC) just popped (12ms)
Sep 10 01:47:42 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED 
origin=crm_timer_popped ]
Sep 10 01:47:42 oamgr crmd: [3385]: info: do_state_transition: Progressed to 
state S_POLICY_ENGINE after C_TIMER_POPPED


But

It looks somewhat different when I see the restarts:


Sep 10 01:51:42 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Sep 10 01:51:42 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71361 
(ref=pe_calc-dc-1378777902-165980) derived from 
/var/lib/pengine/pe-input-3179.bz2
Sep 10 01:51:42 oamgr crmd: [3385]: notice: run_graph:  Transition 71361 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pengine/pe-input-3179.bz2): Complete
Sep 10 01:51:42 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_TRANSITION_ENGINE - S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 10 01:52:45 oamgr crmd: [3385]: info: abort_transition_graph: 
te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, 
id=status-oadb2-master-db-mysql.1, name=master-db-mysql:1, value=0, magic=NA, 
cib=0.4829.3480) : Transient attribute: update
Sep 10 01:52:45 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Sep 10 01:52:45 oamgr crmd: [3385]: info: abort_transition_graph: 
te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, 
id=status-oadb2-readable, name=readable, value=0, magic=NA, cib=0.4829.3481) : 
Transient attribute: update
.

There is a transition abort, and shortly after it, the slaves are restarted:



Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: Movedb-mysql:1   
(Slave oadb2 - huoadb1)
Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: Movedb-mysql:2   
(Slave huoadb1 - oadb2)
Sep 10 01:52:45 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Sep 10 01:52:45 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71362 
(ref=pe_calc-dc-1378777965-165981) derived from 
/var/lib/pengine/pe-input-3180.bz2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 
148: notify db-mysql:0_pre_notify_stop_0 on oadb1
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 
150: notify db-mysql:1_pre_notify_stop_0 on oadb2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 
151: notify db-mysql:2_pre_notify_stop_0 on huoadb1
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 
152: notify db-mysql:3_pre_notify_stop_0 on huoadb2
Sep 10 01:52:45 oamgr pengine: [3384]: notice: process_pe_message: Transition 
71362: PEngine Input stored in: /var/lib/pengine/pe-input-3180.bz2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 39: 
stop db-mysql:1_stop_0 on oadb2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 43: 
stop db-mysql:2_stop_0 on huoadb1


It appears that oadb2 and huoadb1 are swapped with each other (in terms of 
db-mysql:1 and db-mysql:2)? Does that make any sense?

It happens 

[Pacemaker] Mysql multiple slaves, slaves restarting occasionally without a reason

2013-09-10 Thread Attila Megyeri
Hi,

We have a Mysql cluster which works fine when I have a single master (A) and 
slave (B). Failover is almost immediate and I am happy with this approach.
When we configured two additional slaves, strange things started to happen. From 
time to time I notice that all slave mysql instances are restarted, and I 
cannot figure out why.

I tried to find out what is happening, and this is how far I got:

There is a repeating sequence in the DC, which looks like this when everything 
is fine:

Sep 10 01:45:42 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Sep 10 01:45:42 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71358 
(ref=pe_calc-dc-1378777542-165977) derived from 
/var/lib/pengine/pe-input-3179.bz2
Sep 10 01:45:42 oamgr crmd: [3385]: notice: run_graph:  Transition 71358 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pengine/pe-input-3179.bz2): Complete
Sep 10 01:45:42 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_TRANSITION_ENGINE - S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 10 01:47:42 oamgr crmd: [3385]: info: crm_timer_popped: PEngine Recheck 
Timer (I_PE_CALC) just popped (12ms)
Sep 10 01:47:42 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED 
origin=crm_timer_popped ]
Sep 10 01:47:42 oamgr crmd: [3385]: info: do_state_transition: Progressed to 
state S_POLICY_ENGINE after C_TIMER_POPPED


But

It looks somewhat different when I see the restarts:


Sep 10 01:51:42 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Sep 10 01:51:42 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71361 
(ref=pe_calc-dc-1378777902-165980) derived from 
/var/lib/pengine/pe-input-3179.bz2
Sep 10 01:51:42 oamgr crmd: [3385]: notice: run_graph:  Transition 71361 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pengine/pe-input-3179.bz2): Complete
Sep 10 01:51:42 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_TRANSITION_ENGINE - S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 10 01:52:45 oamgr crmd: [3385]: info: abort_transition_graph: 
te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, 
id=status-oadb2-master-db-mysql.1, name=master-db-mysql:1, value=0, magic=NA, 
cib=0.4829.3480) : Transient attribute: update
Sep 10 01:52:45 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Sep 10 01:52:45 oamgr crmd: [3385]: info: abort_transition_graph: 
te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, 
id=status-oadb2-readable, name=readable, value=0, magic=NA, cib=0.4829.3481) : 
Transient attribute: update
.

There is a transition abort, and shortly after it, the slaves are restarted:



Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: Movedb-mysql:1   
(Slave oadb2 - huoadb1)
Sep 10 01:52:45 oamgr pengine: [3384]: notice: LogActions: Movedb-mysql:2   
(Slave huoadb1 - oadb2)
Sep 10 01:52:45 oamgr crmd: [3385]: notice: do_state_transition: State 
transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Sep 10 01:52:45 oamgr crmd: [3385]: info: do_te_invoke: Processing graph 71362 
(ref=pe_calc-dc-1378777965-165981) derived from 
/var/lib/pengine/pe-input-3180.bz2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 
148: notify db-mysql:0_pre_notify_stop_0 on oadb1
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 
150: notify db-mysql:1_pre_notify_stop_0 on oadb2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 
151: notify db-mysql:2_pre_notify_stop_0 on huoadb1
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 
152: notify db-mysql:3_pre_notify_stop_0 on huoadb2
Sep 10 01:52:45 oamgr pengine: [3384]: notice: process_pe_message: Transition 
71362: PEngine Input stored in: /var/lib/pengine/pe-input-3180.bz2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 39: 
stop db-mysql:1_stop_0 on oadb2
Sep 10 01:52:45 oamgr crmd: [3385]: info: te_rsc_command: Initiating action 43: 
stop db-mysql:2_stop_0 on huoadb1


It appears that oadb2 and huoadb1 are swapped with each other (in terms of 
db-mysql:1 and db-mysql:2)? Does that make any sense?

It happens only when I have all 4 mysql nodes online. (oadb1, oadb2, huoadb1, 
huoadb2). When I moved oadb2 to standby for a day, I did not see restarts.

Could someone help me troubleshoot this?
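
One thing I will try in the meantime is to replay the pe-input file named in 
the log, to see why the PE wanted to move the clone instances - a sketch, 
assuming the file is still present on the DC:

crm_simulate -S -s -x /var/lib/pengine/pe-input-3180.bz2

The -s option should print the allocation scores, which will hopefully show 
what changed between the two transitions.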


Mysql 

Re: [Pacemaker] Clone resource as a dependency

2012-12-20 Thread Attila Megyeri
Is this so difficult or so trivial that no one responded? :)

I would appreciate a reference to some documentation as well.

Thank you,
Attila

From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
Sent: Wednesday, December 19, 2012 10:05 AM
To: The Pacemaker cluster resource manager
Subject: [Pacemaker] Clone resource as a dependency

Hi,

How can I configure a resource (e.g. an apache) to depend on the start of a 
clone resource (e.g. a filesystem resource) for the given node?
I know how to arrange a primitive into a group, but in this particular case 
the primitive must run on the passive node as well (performing some async 
offline operations), while apache may run only if the clone is started on the 
node where apache is about to start.

I tried by defining the clone resource and then by adding a mandatory order 
where apache depends on the filesystem resource, but apache keeps on running 
even if the filesystem runs only on a different node (stopped on the apache 
node).
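
(What I tried was literally just a mandatory order, something along the lines 
of this sketch, with the names shortened:

order fs-before-apache inf: fs-clone apache

so perhaps an order alone is not enough and a colocation is needed as well.)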

BTW, the filesystem is glusterfs.

Thank you in advance!


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Clone resource as a dependency

2012-12-20 Thread Attila Megyeri
Thanks Jake,

I did not try the colocation constraint, as the clone was running on all 
nodes, but I will give it a try – not sure whether this would work with a clone.
I am using pacemaker 1.1.6 on a Debian system; the critical RAs are from latest 
github. The cluster is asymmetric.

The config itself is quite big, so I won't paste all of it here, but the basic 
requirement is very simple:


-  Primitive “fs” (filesystem)

-  Clone of “fs” with clone-max=4. It shall run on 4 of the 7 nodes.

-  primitive apache, which is allowed to run on 2 of 7 nodes, but in 
one instance only

property $id=cib-bootstrap-options \
        dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
        cluster-infrastructure=openais \
        expected-quorum-votes=7 \
        stonith-enabled=false \
        no-quorum-policy=stop \
        start-failure-is-fatal=false \
        stonith-action=reboot \
        symmetric-cluster=false \
        last-lrm-refresh=1355960642



The goal is to make sure that apache runs only if a FS clone is running on that 
node as well. At the same time, the FS clone must run on all 4 nodes.
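
Based on your suggestion, this is what I am about to try - a colocation plus 
an order against the clone (an untested sketch, with the resource names 
shortened):

colocation apache-with-fs inf: apache fs-clone
order fs-before-apache inf: fs-clone apache

If I read the documentation correctly, the colocation should restrict apache 
to nodes where a clone instance is active, and the order should make apache 
wait for the clone to start there.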

Thanks,
Attila



From: Jake Smith [mailto:jsm...@argotec.com]
Sent: Thursday, December 20, 2012 8:37 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Clone resource as a dependency

A collocation constraint as well as the order so it must run on the same node 
as a running clone might do it.  Not quite sure with the clone though.

Doc reference would require some more info such as what version of pacemaker, 
etc.

Including configuration helps get answers quicker.

HTH
Jake


From: Attila Megyeri amegy...@minerva-soft.com
To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
Sent: Thursday, December 20, 2012 1:23:07 PM
Subject: Re: [Pacemaker] Clone resource as a dependency
Is this so difficult or so trivial that no one responded? ☺

I would appreciate a reference to some documentation as well.

Thank you,
Attila

From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
Sent: Wednesday, December 19, 2012 10:05 AM
To: The Pacemaker cluster resource manager
Subject: [Pacemaker] Clone resource as a dependency

Hi,

How can I configure a resource (e.g. an apache) to depend on the start of a 
clone resource (e.g. a filesystem resource) for the given node?
I know how to arrange a primitive into a group, but in this particular case 
the primitive must run on the passive node as well (performing some async 
offline operations), while apache may run only if the clone is started on the 
node where apache is about to start.

I tried by defining the clone resource and then by adding a mandatory order 
where apache depends on the filesystem resource, but apache keeps on running 
even if the filesystem runs only on a different node (stopped on the apache 
node).

BTW, the filesystem is glusterfs.

Thank you in advance!



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Failed actions: ...not installed

2012-01-24 Thread Attila Megyeri
Hi Andreas,

I would give it a try from the shell.
Set all the required environment variables to match your cib, e.g.:

export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_binary=/usr/bin/mysqld_safe
export OCF_RESKEY_config=/etc/mysql/my.cnf
export OCF_RESKEY_datadir=/var/lib/mysql
export OCF_RESKEY_user=mysql
export OCF_RESKEY_pid=/var/run/mysqld/mysqld.pid
export OCF_RESKEY_socket=/var/run/mysqld/mysqld.sock
….

and then execute the RA from its directory (e.g. ./mysql monitor or 
./mysql start).
You may see why it fails, or add some logging to the RA itself.
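
If you have ocf-tester (from the resource-agents package) available, it can 
drive the whole lifecycle for you as well - the parameters below are only 
examples, adjust them to your cib:

ocf-tester -n p_mysql \
    -o binary=/usr/bin/mysqld_safe \
    -o config=/etc/my.cnf \
    /usr/lib/ocf/resource.d/heartbeat/mysql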

Cheers,
Attila

From: Stallmann, Andreas [mailto:astallm...@conet.de]
Sent: 2012. január 24. 9:46
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] Failed actions: ...not installed

Hi there,

What does the following error mean:

Failed actions:
p_mysql:1_start_0 (node=int-ipfuie-mgmt02, call=15, rc=5, status=complete): 
not installed
p_mysql:0_start_0 (node=int-ipfuie-mgmt01, call=15, rc=5, status=complete): 
not installed

The resource script “mysql” IS installed; it’s where all the other scripts are:

/usr/lib/ocf/resource.d/heartbeat # ls -lh mysql
-rwxr-xr-x 1 root root 43K Jan 23 16:36 mysql

There’s nothing (at least nothing obvious) in /var/log/messages that reveals 
any further information.

Here’s my configuration (for the primitive and for the M/S setup):

primitive p_mysql ocf:heartbeat:mysql \
params config=/etc/my.cnf pid=/var/run/mysql/mysql.pid 
socket=/var/run/mysql/mysql.sock replication_user=repl 
replication_passwd=blafasel max_slave_lag=15 evict_outdated_slaves=false 
binary=/usr/bin/mysqld_safe test_user=root test_passwd=blafasel \
op start interval=0 timeout=120s \
op stop interval=0 timeout=300s \
op monitor interval=5s role=Master timeout=30s 
OCF_CHECK_LEVEL=1 \
op monitor interval=2s role=Slave timeout=30s OCF_CHECK_LEVEL=1
primitive pingy_res ocf:pacemaker:ping \
params dampen=5s multiplier=1000 host_list=10.30.0.41 10.30.0.42 
10.30.0.1 \
op monitor interval=15s timeout=20s \
op start interval=0 timeout=60s
ms ms_MySQL p_mysql \
meta master-max=1 master-node-max=1 clone-max=2 notify=true 
target-role=Started

Any ideas on how to get this running?

Thanks,

Andreas





CONET Solutions GmbH, Theodor-Heuss-Allee 19, 53773 Hennef.
Registergericht/Registration Court: Amtsgericht Siegburg (HRB Nr. 9136)
Geschäftsführer/Managing Director: Anke Höfer

 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] strange error in crm status

2012-01-18 Thread Attila Megyeri
Hi Andreas,


Thanks for the direction.
Indeed a strange node appeared in my CIB, in XML representation.



node cvmgr \
attributes standby=off
node psql1 \
attributes pgsql-data-status=LATEST standby=off
node psql2 \
attributes pgsql-data-status=STREAMING|SYNC standby=off
xml <node id="r=&quot;web1&quot; election-id=&quot;230&quot;/&gt;" type="normal" uname="r=&quot;web1&quot; election-id=&quot;230&quot;/&gt;"/>
node red1 \
attributes standby=off
node red2 \
attributes standby=off
node web1 \
attributes standby=off
node web2 \
attributes standby=off


Can I delete it safely? The real web1 node is there below it, and no resource 
seems to be using this one...
I wonder how this node got there...

Thanks for your help


Cheers,

Attila




-Original Message-
From: Andreas Kurz [mailto:andr...@hastexo.com] 
Sent: 2012. január 18. 10:05
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] strange error in crm status

Hello,

On 01/17/2012 11:02 PM, Attila Megyeri wrote:
 Hi Guys,
 
  
 
  
 
 In the crm_mon a strange line appeared and cannot get rid of it, tried 
 everything (restarting corosync on all nodes, crm_resource refresh, 
 etc) but no remedy.
 
  
 
 The line is:
 
  
 
 OFFLINE: [ r=web1 election-id=230/ ]

Can you share your cib? Don't know how that entry found its way into your cib, 
but you should see it in your node section ... already tried a crm node delete 
...?

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

 
  
 
  
 
 The rest looks like this:
 
  
 
 
 
 Last updated: Tue Jan 17 23:01:05 2012
 
 Last change: Mon Jan 16 10:37:17 2012
 
 Stack: openais
 
 Current DC: red1 - partition with quorum
 
 Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
 
 8 Nodes configured, 7 expected votes
 
 19 Resources configured.
 
 
 
  
 
 Online: [ red1 red2 web1 web2 psql2 psql1 cvmgr ]
 
 OFFLINE: [ r=web1 election-id=230/ ]
 
  
 
 Clone Set: cl_red5-server [red5-server]
 
  Started: [ red1 red2 ]
 
 Resource Group: webserver
 
  web_ip_int (ocf::heartbeat:IPaddr2):   Started web1
 
  web_ip_src (ocf::heartbeat:IPsrcaddr): Started web1
 
  web_memcached  (lsb:memcached):Started web1
 
  website(ocf::heartbeat:apache):Started web1
 
  web_ip_fo  (ocf::hetzner:hetzner-fo-ip):   Started web1
 
 red_ip_int  (ocf::heartbeat:IPaddr2):   Started red2
 
 db-ip-slave (ocf::heartbeat:IPaddr2):   Started psql2
 
 Resource Group: db-master-group
 
  db-ip-mast (ocf::heartbeat:IPaddr2):   Started psql1
 
  db-ip-rep  (ocf::heartbeat:IPaddr2):   Started psql1
 
 Master/Slave Set: db-ms-psql [postgresql]
 
  Masters: [ psql1 ]
 
  Slaves: [ psql2 ]
 
 Clone Set: db-cl-pingcheck [pingCheck]
 
  Started: [ psql1 psql2 ]
 
 Clone Set: web_pingclone [web_db_ping]
 
  Started: [ web1 web2 ]
 
 Clone Set: red_pingclone [red_web_ping]
 
  Started: [ red1 red2 ]
 
  
 
 I see nothing suspicious in the logs.
 
 Everything seems to be working fine.
 
  
 
 How could I get rid of this error?
 
  
 
 Thanks,
 
  
 
 Attila
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org Getting started: 
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] strange error in crm status

2012-01-18 Thread Attila Megyeri
Hi Andreas,

-Original Message-
From: Andreas Kurz [mailto:andr...@hastexo.com] 
Sent: 2012. január 18. 11:11
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] strange error in crm status

Hello Attila,

On 01/18/2012 10:17 AM, Attila Megyeri wrote:
 Hi Andreas,
 
 
 Thanks for the direction.
 Indeed a strange node appeared in my CIB, in XML representation.
 
 
 
 node cvmgr \
 attributes standby=off
 node psql1 \
 attributes pgsql-data-status=LATEST standby=off
 node psql2 \
 attributes pgsql-data-status=STREAMING|SYNC standby=off
 xml <node id="r=&quot;web1&quot; election-id=&quot;230&quot;/&gt;" type="normal"
 uname="r=&quot;web1&quot; election-id=&quot;230&quot;/&gt;"/>
 node red1 \
 attributes standby=off
 node red2 \
 attributes standby=off
 node web1 \
 attributes standby=off
 node web2 \
 attributes standby=off
 
 
 Can I delete it safely? The real web is there below it and no resource seems 
 to be using this one...
 I wonder how this node got there...

I would dump the cib with cibadmin, remove the erroneous node from the node 
section, run a ptest or crm_simulate on the modified cib and then ... if all is 
fine (no unwanted resource movements or other events) ...
replace the old cib with cibadmin -R ...

Or try to remove it from within the crm shell's edit mode ... really strange, looks 
like a snippet from the status section 
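
Something along these lines - just a sketch, check the simulation output 
before replacing anything:

cibadmin -Q > /tmp/cib.xml
# remove the bogus <node .../> entry from the node section
crm_simulate -S -x /tmp/cib.xml
cibadmin -R -x /tmp/cib.xml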

 
 Thanks for your help

You are welcome!

Regards,
Andreas

--


I deleted the node from the crm shell edit config, and it looks OK now. No idea 
how it got there...

Thanks!

Regards,
Attila


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] strange error in crm status

2012-01-17 Thread Attila Megyeri
Hi Guys,


In the crm_mon a strange line appeared and cannot get rid of it, tried 
everything (restarting corosync on all nodes, crm_resource refresh, etc) but no 
remedy.

The line is:

OFFLINE: [ r=web1 election-id=230/ ]


The rest looks like this:


Last updated: Tue Jan 17 23:01:05 2012
Last change: Mon Jan 16 10:37:17 2012
Stack: openais
Current DC: red1 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
8 Nodes configured, 7 expected votes
19 Resources configured.


Online: [ red1 red2 web1 web2 psql2 psql1 cvmgr ]
OFFLINE: [ r=web1 election-id=230/ ]

Clone Set: cl_red5-server [red5-server]
 Started: [ red1 red2 ]
Resource Group: webserver
 web_ip_int (ocf::heartbeat:IPaddr2):   Started web1
 web_ip_src (ocf::heartbeat:IPsrcaddr): Started web1
 web_memcached  (lsb:memcached):Started web1
 website(ocf::heartbeat:apache):Started web1
 web_ip_fo  (ocf::hetzner:hetzner-fo-ip):   Started web1
red_ip_int  (ocf::heartbeat:IPaddr2):   Started red2
db-ip-slave (ocf::heartbeat:IPaddr2):   Started psql2
Resource Group: db-master-group
 db-ip-mast (ocf::heartbeat:IPaddr2):   Started psql1
 db-ip-rep  (ocf::heartbeat:IPaddr2):   Started psql1
Master/Slave Set: db-ms-psql [postgresql]
 Masters: [ psql1 ]
 Slaves: [ psql2 ]
Clone Set: db-cl-pingcheck [pingCheck]
 Started: [ psql1 psql2 ]
Clone Set: web_pingclone [web_db_ping]
 Started: [ web1 web2 ]
Clone Set: red_pingclone [red_web_ping]
 Started: [ red1 red2 ]

I see nothing suspicious in the logs.
Everything seems to be working fine.

How could I get rid of this error?

Thanks,

Attila
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] syslog full of redundant link messages

2012-01-09 Thread Attila Megyeri
Hi,

I might be getting something wrong, but 

bindnetaddr: 10.100.1.255

does not mean it will listen on this address, but that it will listen on every 
interface where this mask matches.
This is just to make the config file simpler and common for all nodes in the 
same subnet.

Or am I getting something terribly wrong?

Thanks

Attila

-Original Message-
From: Dan Frincu [mailto:df.clus...@gmail.com] 
Sent: 2012. január 9. 11:39
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] syslog full of redundant link messages

Hi,

On Sun, Jan 8, 2012 at 1:59 AM, Attila Megyeri amegy...@minerva-soft.com 
wrote:
 Hi All,



 My syslogs are full of messages like this:



 Jan  7 23:55:47 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:48 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:48 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:48 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:49 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:49 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:49 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:50 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:50 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:50 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:51 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:51 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:51 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:52 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:52 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active

 Jan  7 23:55:52 oa2 corosync[362]:   [TOTEM ] received message 
 requesting test of ring now active





 What could be the reason for this?





 Pacemaker 1.1.6, Corosync 1.4.2





 The relevant part of the config:



 Eth0 is on the 10.100.1.X subnet, eth1 is on 192.168.100.X









 totem {

     version: 2

     secauth: off

     threads: 0

     rrp_mode: passive

     interface {

     ringnumber: 0

     bindnetaddr: 10.100.1.255

     mcastaddr: 226.100.40.1

     mcastport: 4000

     }

     interface {

     ringnumber: 1

     bindnetaddr: 192.168.100.255

     mcastaddr: 226.101.40.1

     mcastport: 4000

     }


Are the subnets /24 or higher (/23, /22, etc.)? Because from what I can see, you're 
using what would be the broadcast address on a /24 subnet, and that may cause issues.
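
For a /24 I would expect the network address there instead - a sketch of what 
I mean, only the bindnetaddr changes:

interface {
        ringnumber: 0
        bindnetaddr: 10.100.1.0
        mcastaddr: 226.100.40.1
        mcastport: 4000
}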





 }





 Thanks,



 Attila




 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org Getting started: 
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




--
Dan Frincu
CCNA, RHCE

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] syslog full of redundant link messages

2012-01-07 Thread Attila Megyeri
Hi All,

My syslogs are full of messages like this:

Jan  7 23:55:47 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:48 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:48 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:48 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:49 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:49 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:49 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:50 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:50 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:50 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:51 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:51 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:51 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:52 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:52 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active
Jan  7 23:55:52 oa2 corosync[362]:   [TOTEM ] received message requesting test 
of ring now active


What could be the reason for this?


Pacemaker 1.1.6, Corosync 1.4.2


The relevant part of the config:

Eth0 is on the 10.100.1.X subnet, eth1 is on 192.168.100.X




totem {
version: 2
secauth: off
threads: 0
rrp_mode: passive
interface {
ringnumber: 0
bindnetaddr: 10.100.1.255
mcastaddr: 226.100.40.1
mcastport: 4000
}
interface {
ringnumber: 1
bindnetaddr: 192.168.100.255
mcastaddr: 226.101.40.1
mcastport: 4000
}


}


Thanks,

Attila

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [SOLVED] RE: Slave does not start after failover: Mysql circular replication and master-slave resources

2011-12-20 Thread Attila Megyeri
Hi Andreas,


-Original Message-
From: Andreas Kurz [mailto:andr...@hastexo.com] 
Sent: 2011. december 19. 15:19
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] [SOLVED] RE: Slave does not start after failover: 
Mysql circular replication and master-slave resources

On 12/17/2011 10:51 AM, Attila Megyeri wrote:
 Hi all,
 
 For anyone interested.
 I finally made the mysql replication work. For some strange reason there were 
 no [mysql] log entries at all, neither in corosync.log nor in the syslog. 
 After a couple of corosync restarts (?!) [mysql] RA debug/error entries 
 started to show up.
 The issue was that the slave could not apply the binary logs due to some 
 duplicate errors. I am not sure how this could happen, but the solution was 
 to ignore the duplicate errors on the slaves, by adding the following line to 
 my.cnf:
 
 slave-skip-errors = 1062

although you use different auto-increment-offset values?


Yes... I am actually quite surprised how this can happen. The slave has applied 
the binlog already, but for some reason it wants to execute it again.

 
 I hope this helps to some of you guys as well.
 
 P.S. Did anyone else notice missing mysql debug/info/error entries in 
 corosync log as well?

There is no RA output/log in any of your syslogs? ... in absence of a connected 
tty and no configured logd, logger should feed all logs to syslog ... what is 
your distribution, any fancy syslog configuration?

My system is running on Debian squeeze, pacemaker 1.1.5 squeeze backport. The 
syslog configuration is standard, no extras. I have noticed this strange 
behavior (RA not logging anything) many times - not only for the mysql resource 
but also for postgres. E.g. I added a log_ocf call at the entry point of the RA, 
just to log when the script is executed and what parameters were passed - but I 
did not see any monitor invocations either.
Now it works fine, but this is not an absolutely stable setup.

One other very disturbing issue is that sometimes corosync and some of the 
heartbeat processes get stuck at 100% CPU, and only a restart or kill -9 helps. :(

Cheers,

Attila

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Cheers,
 Attila
 
 
 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: 2011. december 16. 12:39
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Slave does not start after failover: Mysql 
 circular replication and master-slave resources
 
 Hi Andreas,
 
 The slave lag cannot be high, as the slave was restarted within 1-2 mins and 
 there are no active users on the system yet.
 I did not find anything at all in the logs.
 
 I will double-check whether the RA is the latest.
 
 Thanks,
 
 Attila
 
 
 -Original Message-
 From: Andreas Kurz [mailto:andr...@hastexo.com]
 Sent: 2011. december 16. 1:50
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Slave does not start after failover: Mysql 
 circular replication and master-slave resources
 
 Hello Attila,
 
 ... see below ...
 
 On 12/15/2011 02:42 PM, Attila Megyeri wrote:
 Hi All,

  

 Some time ago I exchanged a couple of posts with you here regarding 
 Mysql active-active HA.

 The best solution I found so  far was the Mysql multi-master 
 replication, also referred to as circular replication.

  

 Basically I set up two nodes, both were capable of the master role, 
 and the changes were immediately propagated to the other node.

  

 But still I wanted to have a M/S approach, to have a RW master and a 
 RO slave - mainly because I prefer to have a single master VIP where 
 my apps can connect to.

  

 (In the first approach I configured a two node clone, and the master 
 IP was always bound to one of the nodes)

  

 I applied the following configuration:

  

 node db1 \

 attributes IP=10.100.1.31 \

 attributes standby=off
 db2-log-file-db-mysql=mysql-bin.21 db2-log-pos-db-mysql=40730

 node db2 \

 attributes IP=10.100.1.32 \

 attributes standby=off

 primitive db-ip-master ocf:heartbeat:IPaddr2 \

 params lvs_support=true ip=10.100.1.30 cidr_netmask=8
 broadcast=10.255.255.255 \

 op monitor interval=20s timeout=20s \

 meta target-role=Started

 primitive db-mysql ocf:heartbeat:mysql \

 params binary=/usr/bin/mysqld_safe config=/etc/mysql/my.cnf
 datadir=/var/lib/mysql user=mysql pid=/var/run/mysqld/mysqld.pid
 socket=/var/run/mysqld/mysqld.sock test_passwd=X

 test_table=replicatest.connectioncheck test_user=slave_user
 replication_user=slave_user replication_passwd=X
 additional_parameters=--skip-slave-start \

 op start interval=0 timeout=120s \

 op stop interval=0 timeout=120s \

 op monitor interval=30 timeout=30s OCF_CHECK_LEVEL=1 \

 op promote interval=0 timeout=120 \

 op demote interval=0 timeout=120

 ms db-ms-mysql db-mysql \

 meta notify=true master-max=1 clone

[Pacemaker] [SOLVED] RE: Slave does not start after failover: Mysql circular replication and master-slave resources

2011-12-17 Thread Attila Megyeri
Hi all,

For anyone interested.
I finally made the mysql replication work. For some strange reason there were 
no [mysql] log entries at all, neither in corosync.log nor in the syslog. After 
a couple of corosync restarts (?!) [mysql] RA debug/error entries started to 
show up.
The issue was that the slave could not apply the binary logs due to some 
duplicate errors. I am not sure how this could happen, but the solution was to 
ignore the duplicate errors on the slaves, by adding the following line to 
my.cnf:

slave-skip-errors = 1062
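
A side note: error 1062 is the duplicate-key error, so the line above skips 
all such errors globally. For a one-off inconsistency it may be safer to skip 
just the offending event by hand - plain MySQL, nothing RA-specific, so treat 
this as a sketch:

mysql -e "STOP SLAVE; SET GLOBAL sql_slave_skip_counter = 1; START SLAVE;"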

I hope this helps to some of you guys as well.

P.S. Did anyone else notice missing mysql debug/info/error entries in corosync 
log as well?

Cheers,
Attila


-Original Message-
From: Attila Megyeri [mailto:amegy...@minerva-soft.com] 
Sent: 2011. december 16. 12:39
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Slave does not start after failover: Mysql circular 
replication and master-slave resources

Hi Andreas,

The slave lag cannot be high, as the slave was restarted within 1-2 mins and 
there are no active users on the system yet.
I did not find anything at all in the logs.

I will double-check whether the RA is the latest.

Thanks,

Attila


-Original Message-
From: Andreas Kurz [mailto:andr...@hastexo.com]
Sent: 2011. december 16. 1:50
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Slave does not start after failover: Mysql circular 
replication and master-slave resources

Hello Attila,

... see below ...

On 12/15/2011 02:42 PM, Attila Megyeri wrote:
 Hi All,
 
  
 
 Some time ago I exchanged a couple of posts with you here regarding 
 Mysql active-active HA.
 
 The best solution I found so  far was the Mysql multi-master 
 replication, also referred to as circular replication.
 
  
 
 Basically I set up two nodes, both were capable of the master role, 
 and the changes were immediately propagated to the other node.
 
  
 
 But still I wanted to have a M/S approach, to have a RW master and a 
 RO slave - mainly because I prefer to have a single master VIP where 
 my apps can connect to.
 
  
 
 (In the first approach I configured a two node clone, and the master 
 IP was always bound to one of the nodes)
 
  
 
 I applied the following configuration:
 
  
 
 node db1 \
 
 attributes IP=10.100.1.31 \
 
 attributes standby=off
 db2-log-file-db-mysql=mysql-bin.21 db2-log-pos-db-mysql=40730
 
 node db2 \
 
 attributes IP=10.100.1.32 \
 
 attributes standby=off
 
 primitive db-ip-master ocf:heartbeat:IPaddr2 \
 
 params lvs_support=true ip=10.100.1.30 cidr_netmask=8
 broadcast=10.255.255.255 \
 
 op monitor interval=20s timeout=20s \
 
 meta target-role=Started
 
 primitive db-mysql ocf:heartbeat:mysql \
 
 params binary=/usr/bin/mysqld_safe config=/etc/mysql/my.cnf
 datadir=/var/lib/mysql user=mysql pid=/var/run/mysqld/mysqld.pid
 socket=/var/run/mysqld/mysqld.sock test_passwd=X
 
 test_table=replicatest.connectioncheck test_user=slave_user
 replication_user=slave_user replication_passwd=X
 additional_parameters=--skip-slave-start \
 
 op start interval=0 timeout=120s \
 
 op stop interval=0 timeout=120s \
 
 op monitor interval=30 timeout=30s OCF_CHECK_LEVEL=1 \
 
 op promote interval=0 timeout=120 \
 
 op demote interval=0 timeout=120
 
 ms db-ms-mysql db-mysql \
 
 meta notify=true master-max=1 clone-max=2
 target-role=Started
 
 colocation db-ip-with-master inf: db-ip-master db-ms-mysql:Master
 
 property $id=cib-bootstrap-options \
 
 dc-version=1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
 
 cluster-infrastructure=openais \
 
 expected-quorum-votes=2 \
 
 stonith-enabled=false \
 
 no-quorum-policy=ignore
 
 rsc_defaults $id=rsc-options \
 
 resource-stickiness=0
 
  
 
  
 
 The setup works in the basic conditions:
 
 * After the first startup, nodes start up as slaves, and
 shortly after, one of them is promoted to master.
 
 * Updates to the master are replicated properly to the slave.
 
 * Slave accepts updates, which is Wrong, but I can live with
 this - I will allow connect to the Master VIP only.
 
 * If I stop the slave for some time, and re-start it, it will
 catch up with the master shortly and get into sync.
 
  
 
 I have, however a serious issue:
 
 * If I stop the current master, the slave is promoted, accepts
 RW queries, the Master IP is bound to it - ALL fine.
 
 * BUT - when I want to bring the other node online, it simply
 shows: Stopped (not installed)
 
  
 
 Online: [ db1 db2 ]
 
  
 
 db-ip-master(ocf::heartbeat:IPaddr2):   Started db1
 
 Master/Slave Set: db-ms-mysql [db-mysql]
 
  Masters: [ db1 ]
 
  Stopped: [ db-mysql:1 ]
 
  
 
 Node Attributes:
 
 * Node db1:
 
 + IP: 10.100.1.31
 
 + db2-log-file-db

Re: [Pacemaker] Slave does not start after failover: Mysql circular replication and master-slave resources

2011-12-16 Thread Attila Megyeri
Hi Andreas,

The slave lag cannot be high, as the slave was restarted within 1-2 mins and 
there are no active users on the system yet.
I did not find anything at all in the logs.

I will double-check whether the RA is the latest.

Thanks,

Attila


-Original Message-
From: Andreas Kurz [mailto:andr...@hastexo.com] 
Sent: 2011. december 16. 1:50
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Slave does not start after failover: Mysql circular 
replication and master-slave resources

Hello Attila,

... see below ...

On 12/15/2011 02:42 PM, Attila Megyeri wrote:
 Hi All,
 
  
 
 Some time ago I exchanged a couple of posts with you here regarding 
 Mysql active-active HA.
 
 The best solution I found so  far was the Mysql multi-master 
 replication, also referred to as circular replication.
 
  
 
 Basically I set up two nodes, both were capable of the master role, 
 and the changes were immediately propagated to the other node.
 
  
 
 But still I wanted to have a M/S approach, to have a RW master and a 
 RO slave - mainly because I prefer to have a single master VIP where 
 my apps can connect to.
 
  
 
 (In the first approach I configured a two node clone, and the master 
 IP was always bound to one of the nodes)
 
  
 
 I applied the following configuration:
 
  
 
 node db1 \
 
 attributes IP=10.100.1.31 \
 
 attributes standby=off
 db2-log-file-db-mysql=mysql-bin.21 db2-log-pos-db-mysql=40730
 
 node db2 \
 
 attributes IP=10.100.1.32 \
 
 attributes standby=off
 
 primitive db-ip-master ocf:heartbeat:IPaddr2 \
 
 params lvs_support=true ip=10.100.1.30 cidr_netmask=8
 broadcast=10.255.255.255 \
 
 op monitor interval=20s timeout=20s \
 
 meta target-role=Started
 
 primitive db-mysql ocf:heartbeat:mysql \
 
 params binary=/usr/bin/mysqld_safe config=/etc/mysql/my.cnf
 datadir=/var/lib/mysql user=mysql pid=/var/run/mysqld/mysqld.pid
 socket=/var/run/mysqld/mysqld.sock test_passwd=X
 
 test_table=replicatest.connectioncheck test_user=slave_user
 replication_user=slave_user replication_passwd=X
 additional_parameters=--skip-slave-start \
 
 op start interval=0 timeout=120s \
 
 op stop interval=0 timeout=120s \
 
 op monitor interval=30 timeout=30s OCF_CHECK_LEVEL=1 \
 
 op promote interval=0 timeout=120 \
 
 op demote interval=0 timeout=120
 
 ms db-ms-mysql db-mysql \
 
 meta notify=true master-max=1 clone-max=2
 target-role=Started
 
 colocation db-ip-with-master inf: db-ip-master db-ms-mysql:Master
 
 property $id=cib-bootstrap-options \
 
 dc-version=1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
 
 cluster-infrastructure=openais \
 
 expected-quorum-votes=2 \
 
 stonith-enabled=false \
 
 no-quorum-policy=ignore
 
 rsc_defaults $id=rsc-options \
 
 resource-stickiness=0
 
  
 
  
 
 The setup works in the basic conditions:
 
 * After the first startup, nodes start up as slaves, and
 shortly after, one of them is promoted to master.
 
 * Updates to the master are replicated properly to the slave.
 
 * Slave accepts updates, which is Wrong, but I can live with
 this - I will allow connect to the Master VIP only.
 
 * If I stop the slave for some time, and re-start it, it will
 catch up with the master shortly and get into sync.
 
  
 
 I have, however a serious issue:
 
 * If I stop the current master, the slave is promoted, accepts
 RW queries, the Master IP is bound to it - ALL fine.
 
 * BUT - when I want to bring the other node online, it simply
 shows: Stopped (not installed)
 
  
 
 Online: [ db1 db2 ]
 
  
 
 db-ip-master(ocf::heartbeat:IPaddr2):   Started db1
 
 Master/Slave Set: db-ms-mysql [db-mysql]
 
  Masters: [ db1 ]
 
  Stopped: [ db-mysql:1 ]
 
  
 
 Node Attributes:
 
 * Node db1:
 
 + IP: 10.100.1.31
 
 + db2-log-file-db-mysql : mysql-bin.21
 
 + db2-log-pos-db-mysql  : 40730
 
 + master-db-mysql:0 : 3601
 
 * Node db2:
 
 + IP: 10.100.1.32
 
  
 
 Failed actions:
 
 db-mysql:0_monitor_3 (node=db2, call=58, rc=5, status=complete):
 not installed
 

Looking at the RA (latest from git) I'd say the problem is somewhere in the 
check_slave() function. Either the check for replication errors or for a too 
high slave lag ... though on both errors you should see the log. entries.

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now


  
 
  
 
 I checked the logs, and could not find a reason why the slave at db2 is
 not started.
 
 Any IDEA Anyone ?
 
  
 
  
 
 Thanks,
 
 Attila
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http

[Pacemaker] Slave does not start after failover: Mysql circular replication and master-slave resources

2011-12-15 Thread Attila Megyeri
Hi All,

Some time ago I exchanged a couple of posts with you here regarding Mysql 
active-active HA.
The best solution I found so far was the Mysql multi-master replication, also 
referred to as circular replication.

Basically I set up two nodes, both were capable of the master role, and the 
changes were immediately propagated to the other node.

But still I wanted to have a M/S approach, to have a RW master and a RO slave - 
mainly because I prefer to have a single master VIP that my apps can connect 
to.

(In the first approach I configured a two node clone, and the master IP was 
always bound to one of the nodes)

I applied the following configuration:

node db1 \
attributes IP=10.100.1.31 \
attributes standby=off db2-log-file-db-mysql=mysql-bin.21 
db2-log-pos-db-mysql=40730
node db2 \
attributes IP=10.100.1.32 \
attributes standby=off
primitive db-ip-master ocf:heartbeat:IPaddr2 \
params lvs_support=true ip=10.100.1.30 cidr_netmask=8 
broadcast=10.255.255.255 \
op monitor interval=20s timeout=20s \
meta target-role=Started
primitive db-mysql ocf:heartbeat:mysql \
params binary=/usr/bin/mysqld_safe config=/etc/mysql/my.cnf 
datadir=/var/lib/mysql user=mysql pid=/var/run/mysqld/mysqld.pid 
socket=/var/run/mysqld/mysqld.sock test_passwd=X
test_table=replicatest.connectioncheck test_user=slave_user 
replication_user=slave_user replication_passwd=X 
additional_parameters=--skip-slave-start \
op start interval=0 timeout=120s \
op stop interval=0 timeout=120s \
op monitor interval=30 timeout=30s OCF_CHECK_LEVEL=1 \
op promote interval=0 timeout=120 \
op demote interval=0 timeout=120
ms db-ms-mysql db-mysql \
meta notify=true master-max=1 clone-max=2 target-role=Started
colocation db-ip-with-master inf: db-ip-master db-ms-mysql:Master
property $id=cib-bootstrap-options \
dc-version=1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
cluster-infrastructure=openais \
expected-quorum-votes=2 \
stonith-enabled=false \
no-quorum-policy=ignore
rsc_defaults $id=rsc-options \
resource-stickiness=0


The setup works in the basic conditions:

* After the first startup, nodes start up as slaves, and shortly 
after, one of them is promoted to master.

* Updates to the master are replicated properly to the slave.

* Slave accepts updates, which is wrong, but I can live with this - I 
will allow connections to the Master VIP only.

* If I stop the slave for some time, and re-start it, it will catch up 
with the master shortly and get into sync.

I have, however a serious issue:

* If I stop the current master, the slave is promoted, accepts RW 
queries, the Master IP is bound to it - ALL fine.

* BUT - when I want to bring the other node online, it simply shows: 
Stopped (not installed)

Online: [ db1 db2 ]

db-ip-master(ocf::heartbeat:IPaddr2):   Started db1
Master/Slave Set: db-ms-mysql [db-mysql]
 Masters: [ db1 ]
 Stopped: [ db-mysql:1 ]

Node Attributes:
* Node db1:
+ IP: 10.100.1.31
+ db2-log-file-db-mysql : mysql-bin.21
+ db2-log-pos-db-mysql  : 40730
+ master-db-mysql:0 : 3601
* Node db2:
+ IP: 10.100.1.32

Failed actions:
db-mysql:0_monitor_3 (node=db2, call=58, rc=5, status=complete): not 
installed


I checked the logs, and could not find a reason why the slave at db2 is not 
started.
Any idea, anyone?
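
(One generic first step that may be worth trying - a sketch, not a diagnosis; 
it only helps if the failure is stale state rather than a persistent RA error 
- is to clear the failed action so the cluster re-probes db2:

crm resource cleanup db-ms-mysql db2
crm_mon -A -1
)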


Thanks,
Attila
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-12-08 Thread Attila Megyeri
Hi Takatoshi,

One strange thing I noticed, which could probably be improved.
When there is data inconsistency, I have the following node properties:

* Node psql2:
+ default_ping_set  : 100
+ master-postgresql:1   : -INFINITY
+ pgsql-data-status : DISCONNECT
+ pgsql-status  : HS:alone
* Node psql1:
+ default_ping_set  : 100
+ master-postgresql:0   : 1000
+ master-postgresql:1   : -INFINITY
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 58:4B20
+ pgsql-status  : PRI

This is fine, and understandable - but I can see this only if I do a crm_mon -A.

My problem is that CRM shows the following:

Master/Slave Set: db-ms-psql [postgresql]
 Masters: [ psql1 ]
 Slaves: [ psql2 ]

So if I monitor the system from crm_mon, HAWK or other tools - I have no 
indication at all that the slave is running in an inconsistent mode.

I would expect the RA to stop the psql2 node in such cases, because:
- It is running, but has non-up-to-date data, therefore no one will use it (the 
slave IP points to the master as well, which is good)
- In CRM status everything looks perfect, even though it is NOT perfect and 
admin intervention is required.


Shouldn't the disconnected PSQL server be stopped instead?
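
(One untested idea, sketched in the crm syntax used elsewhere in this thread 
and reusing the pgsql-data-status attribute the RA already maintains, would be 
to keep the resource off nodes whose data has diverged:

location ban-disconnected-slave db-ms-psql \
        rule -inf: pgsql-data-status eq DISCONNECT

Whether this interferes with the RA's own attribute handling would need 
testing.)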

Regards,
Attila




-Original Message-
From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
Sent: 2011. november 28. 11:10
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

2011/11/28 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 I understand your point and I agree that the correct behavior is not to start 
 replication when data inconsistency exists.
 The only thing I do not really understand is how it could have happened:

 1) nodes were in sync (psql1=PRI, psql2=STREAMING|SYNC)
 2) I shut down node psql1 (by placing it into standby)
 3) At this moment psql1's baseline became higher by 20? What could cause 
 this? Probably the demote operation itself? There were no clients connected - 
 and there was definitely no write operation to the db (except perhaps from 
 the system side).

Yes, PostgreSQL executes a CHECKPOINT when it is shut down normally on demote.
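
(This can be checked by hand - a sketch using the pg_controldata binary and 
data directory configured elsewhere in this thread:

/usr/lib/postgresql/9.1/bin/pg_controldata /var/lib/postgresql/9.1/main \
    | grep -E "TimeLineID|checkpoint location"

The "Latest checkpoint location" moves forward on a clean shutdown, which 
would explain the gap seen above.)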

 On the other hand - thank you very much for your contribution, the RA works 
 very well and I really appreciate your work and help!

Not at all. Don't mention it.

Regards,
Takatoshi MATSUO


 Bests,

 Attila

 -Original Message-
 From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
 Sent: 2011. november 28. 2:10
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover -
 RA needed

 Hi Attila

 The primary cannot send all WALs to the HotStandby even when it is shut down 
 normally.
 These logs confirm it.

 Nov 27 16:03:27 psql1 pgsql[12204]: INFO: My Timeline ID and
 Checkpoint : 14:2320 Nov 27 16:03:27 psql1 pgsql[12204]:
 INFO: psql2 master baseline : 14:2300

 psql1's location was  2320 when it was demoted.
 OTOH psql2's location was 2300  when it was promoted.

 It means that psql1's data was newer than psql2's one at that time.
 The gap is 20.

 As you said, you can start psql1's PostgreSQL manually, but PostgreSQL cannot 
 detect this situation.
 If you start HotStandby on psql1, data will be replicated from after 2320.
 That is an inconsistency.

 Thanks,
 Takatoshi MATSUO


 2011/11/28 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 I don't think it is an inconsistency problem - to me it looks like an RA bug.
 I think so, because postgres starts properly outside pacemaker.

 When pacemaker starts node psql1 I see only:

 postgresql:0_start_0 (node=psql1, call=9, rc=1, status=complete):
 unknown error

 and the postgres log is empty - so I suppose that it does not even try to 
 start it.

 What I tested was:
 - I had a stable cluster, where psql1 was the master, psql2 was the
 slave
 - I put psql1 into standby mode. (node psql1 standby) to test
 failover
 - After a while psql2 became the PRI, which is very good
 - When I put psql1 back online, postgres wouldn't start anymore from 
 pacemaker (unknown error).


 I tried to start postgres manually from the shell and it worked fine; even the 
 monitor was able to see that it became SYNC (obviously the master/slave 
 group was showing an improper state, as psql was started outside pacemaker).

 I don't think data inconsistency is the case, partially because there are no 
 clients connected, partially because psql starts properly outside pacemaker.

 Here is what is relevant from the log:

 Nov 27 16:02:50 psql1 pgsql[11021]: DEBUG: PostgreSQL is running as a 
 primary.
 Nov 27 16:02:51 psql1 pgsql[11021]: DEBUG: node=psql2,
 state=STREAMING, sync_state=SYNC Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-28 Thread Attila Megyeri
Hi Takatoshi,

I understand your point and I agree that the correct behavior is not to start 
replication when data inconsistency exists.
The only thing I do not really understand is how it could have happened:

1) nodes were in sync (psql1=PRI, psql2=STREAMING|SYNC)
2) I shut down node psql1 (by placing it into standby)
3) At this moment psql1's baseline became higher by 20? What could cause this? 
Probably the demote operation itself? There were no clients connected - and 
there was definitely no write operation to the db (except perhaps from the 
system side).

On the other hand - thank you very much for your contribution, the RA works 
very well and I really appreciate your work and help!

Bests,

Attila

-Original Message-
From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
Sent: 2011. november 28. 2:10
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

The primary cannot send all WALs to the HotStandby even when it is shut down 
normally.
These logs confirm it.

 Nov 27 16:03:27 psql1 pgsql[12204]: INFO: My Timeline ID and
 Checkpoint : 14:2320 Nov 27 16:03:27 psql1 pgsql[12204]:
 INFO: psql2 master baseline : 14:2300

psql1's location was  2320 when it was demoted.
OTOH psql2's location was 2300  when it was promoted.

It means that psql1's data was newer than psql2's one at that time.
The gap is 20.

As you said, you can start psql1's PostgreSQL manually, but PostgreSQL cannot 
detect this situation.
If you start HotStandby on psql1, data will be replicated from after 2320.
That is an inconsistency.

Thanks,
Takatoshi MATSUO


2011/11/28 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 I don't think it is an inconsistency problem - to me it looks like an RA bug.
 I think so, because postgres starts properly outside pacemaker.

 When pacemaker starts node psql1 I see only:

 postgresql:0_start_0 (node=psql1, call=9, rc=1, status=complete):
 unknown error

 and the postgres log is empty - so I suppose that it does not even try to 
 start it.

 What I tested was:
 - I had a stable cluster, where psql1 was the master, psql2 was the
 slave
 - I put psql1 into standby mode. (node psql1 standby) to test
 failover
 - After a while psql2 became the PRI, which is very good
 - When I put psql1 back online, postgres wouldn't start anymore from 
 pacemaker (unknown error).


 I tried to start postgres manually from the shell and it worked fine; even the 
 monitor was able to see that it became SYNC (obviously the master/slave 
 group was showing an improper state, as psql was started outside pacemaker).

 I don't think data inconsistency is the case, partially because there are no 
 clients connected, partially because psql starts properly outside pacemaker.

 Here is what is relevant from the log:

 Nov 27 16:02:50 psql1 pgsql[11021]: DEBUG: PostgreSQL is running as a primary.
 Nov 27 16:02:51 psql1 pgsql[11021]: DEBUG: node=psql2,
 state=STREAMING, sync_state=SYNC Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: 
 PostgreSQL is running as a primary.
 Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: node=psql2,
 state=STREAMING, sync_state=SYNC Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: 
 PostgreSQL is running as a primary.
 Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: node=psql2,
 state=STREAMING, sync_state=SYNC Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: 
 PostgreSQL is running as a primary.
 Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: node=psql2,
 state=STREAMING, sync_state=SYNC Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: 
 PostgreSQL is running as a primary.
 Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: node=psql2,
 state=STREAMING, sync_state=SYNC Nov 27 16:03:00 psql1 pgsql[11556]:
 DEBUG: notify: pre for demote Nov 27 16:03:00 psql1 pgsql[11590]: INFO: 
 Stopping PostgreSQL on demote.
 Nov 27 16:03:02 psql1 pgsql[11590]: INFO: waiting for server to shut
 down. done server stopped Nov 27 16:03:02 psql1 pgsql[11590]: INFO: 
 Removing /var/lib/pgsql/PGSQL.lock.
 Nov 27 16:03:02 psql1 pgsql[11590]: INFO: PostgreSQL is down Nov 27
 16:03:02 psql1 pgsql[11590]: INFO: Changing pgsql-status on psql1 : PRI-STOP.
 Nov 27 16:03:02 psql1 pgsql[11590]: DEBUG: Created recovery.conf.
 host=10.12.1.28, user=postgres Nov 27 16:03:02 psql1 pgsql[11590]: INFO: 
 Setup all nodes as an async.
 Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: notify: post for demote Nov
 27 16:03:02 psql1 pgsql[11732]: DEBUG: post-demote called. Demote
 uname is psql1 Nov 27 16:03:02 psql1 pgsql[11732]: INFO: My Timeline
 ID and Checkpoint : 14:2320 Nov 27 16:03:02 psql1 pgsql[11732]: 
 WARNING: Can't get psql2 master baseline. Waiting...
 Nov 27 16:03:03 psql1 pgsql[11732]: INFO: psql2 master baseline :
 14:2300 Nov 27 16:03:03 psql1 pgsql[11732]: ERROR: My data is 
 inconsistent.
 Nov 27 16:03:03 psql1 pgsql[11867]: DEBUG: notify: pre for stop Nov 27
 16:03:03 psql1 pgsql[11969]: INFO: PostgreSQL

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-27 Thread Attila Megyeri
Hi Takatoshi,

You were right, changing the shell to bash resolved the problem.
The cluster now started in sync mode - thank you very much.
I will be testing it in the next couple of days. I did just a very quick test - 
it seems that psql master failed over to psql2 properly, but when I tried to 
move it back to psql1 there were some problems starting psql on node 1.

Does it work fine for you in  both directions?

Thank you very much.

Have a nice weekend,

Attila



-Original Message-
From: Takatoshi MATSUO [mailto:matsuo@gmail.com] 
Sent: 2011. november 27. 6:12
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

2011/11/27 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 Thank you for coming back to me so quickly.

 In the /var/lib/pgsql there are the following files:

 PSQL1:
 =
 root@psql1:/var/lib/pgsql# ls -la
 total 16
 drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:04 .
 drwxr-xr-x 35 root root 4096 Nov 25 22:21 ..
 -rw-r--r--  1 postgres postgres1 Nov 26 00:17 rep_mode.conf
 -rw-r--r--  1 root root   49 Nov 26 18:04 xlog_note.0

 root@psql1:/var/lib/pgsql# cat xlog_note.0
 -e psql1 1900
 psql2 1900
 root@psql1:/var/lib/pgsql#

 PSQL2:
 ===
 root@psql2:/var/lib/pgsql# ls -la
 total 16
 drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:05 .
 drwxr-xr-x 33 root root 4096 Nov 26 00:10 ..
 -rw-r--r--  1 postgres postgres1 Nov 26 00:24 rep_mode.conf
 -rw-r--r--  1 root root   49 Nov 26 18:05 xlog_note.0
 root@psql2:/var/lib/pgsql# cat xlog_note.0
 -e psql1 1900
 psql2 1900
 root@psql2:/var/lib/pgsql#

It seems that dash's builtin echo command is used, because echo with the -e 
option does not function.

Perhaps my RA also depends on bash.
Can you use bash instead of dash?
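
(On Debian, /bin/sh typically points to dash; assuming a stock setup, checking 
and switching looks like this:

ls -l /bin/sh                # shows where the symlink points
dpkg-reconfigure dash        # answering "No" makes /bin/sh point to bash
)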

 BTW, postgres is installed under /var/lib/postgresql, but I noticed that 
 some parts of the RA are referring to the /var/lib/pgsql directory, so I 
 created that directory and I keep some of the files there.

It's no problem.
If you want to change this path, please specify it using the tmpdir parameter.
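
A sketch of how that could look on the primitive used in this thread (the 
directory value is an assumption - any directory writable by postgres works):

primitive postgresql ocf:heartbeat:pgsql \
        params ... tmpdir=/var/lib/postgresql/tmp ...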

Regards,
Takatoshi MATSUO


 Thanks,
 Attila



 -Original Message-
 From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
 Sent: 2011. november 26. 18:27
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover - 
 RA needed

 Hi Attila

 1. Are there /var/lib/pgsql/xlog_note.0, xlog_note.1, xlog_note.2 files?
    These files are created while checking the xlog location on monitor.

 2. Do these files include lines as below?
 -
 pgsql1  1900
 pgsql2  1900
 -

 Regards.
 Takatoshi MATSUO


2011/11/26 22:44 Attila Megyeri amegy...@minerva-soft.com:
 Hi Yoshiharu, Takatoshi,

 Spent another day, without success. :(

 I started from scratch and synchronous replication works nicely when nodes 
 are started outside pacemaker.
 My PostgreSQL version is 9.1.1.

 When I start from pacemaker, after a while it gets into the following state:

 Online: [ psql1 psql2 ]

  Master/Slave Set: msPostgresql [postgresql]
     Slaves: [ psql1 psql2 ]
  Clone Set: clnPingCheck [pingCheck]
     Started: [ psql1 psql2 ]

 Node Attributes:
 * Node psql1:
    + default_ping_set                  : 100
    + master-postgresql:0               : -INFINITY
    + pgsql-status                      : HS:alone
    + pgsql-xlog-loc                    : 1900
 * Node psql2:
    + default_ping_set                  : 100
    + master-postgresql:1               : -INFINITY
    + pgsql-status                      : HS:alone
    + pgsql-xlog-loc                    : 1900


 The psql status queries return the following:

 PSQL1
 ==
 postgres@psql1:/root$ psql  -c select 
 application_name,upper(state),upper(sync_state) from pg_stat_replication
 application_name | upper | upper
 --+---+---
 (0 rows)

 postgres@psql1:/root$ psql  -Atc select 
 pg_last_xlog_replay_location(),pg_last_xlog_receive_location()
 0/1920|0/1900

 PSQL2
 ==
 postgres@psql2:~$  psql  -c select 
 application_name,upper(state),upper(sync_state) from pg_stat_replication
  application_name | upper | upper
 --+---+---
 (0 rows)

 postgres@psql2:~$ psql  -Atc select 
 pg_last_xlog_replay_location(),pg_last_xlog_receive_location()
 0/1900|0/1900


 Neither server can connect (obviously) to the master, as the vip_repl is not 
 brought up.


 Could you help me understand WHAT is the action/state/event that should 
 promote one of the nodes? I see that pacemaker monitors the servers every X 
 seconds, but nothing else happens.

 In the log (limited to pgsql) the following sequence is repeated 
 forever:

 Nov 26 13:36:19 psql1 pgsql[19829]: INFO: Master is not exist.
 Nov 26 13:36:19 psql1 pgsql[19829]: DEBUG: Checking right

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-27 Thread Attila Megyeri
Hi Takatoshi,

I don't think it is an inconsistency problem - to me it looks like an RA bug.
I think so, because postgres starts properly outside pacemaker.

When pacemaker starts node psql1 I see only:

postgresql:0_start_0 (node=psql1, call=9, rc=1, status=complete): unknown error

and the postgres log is empty - so I suppose that it does not even try to start 
it.

What I tested was:
- I had a stable cluster, where psql1 was the master, psql2 was the slave
- I put psql1 into standby mode. (node psql1 standby) to test failover
- After a while psql2 became the PRI, which is very good
- When I put psql1 back online, postgres wouldn't start anymore from pacemaker 
(unknown error).


I tried to start postgres manually from the shell and it worked fine; even the 
monitor was able to see that it became SYNC (obviously the master/slave 
group was showing an improper state, as psql was started outside pacemaker).

I don't think data inconsistency is the case, partially because there are no 
clients connected, partially because psql starts properly outside pacemaker.

Here is what is relevant from the log:

Nov 27 16:02:50 psql1 pgsql[11021]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:02:51 psql1 pgsql[11021]: DEBUG: node=psql2, state=STREAMING, 
sync_state=SYNC
Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:02:53 psql1 pgsql[11142]: DEBUG: node=psql2, state=STREAMING, 
sync_state=SYNC
Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:02:55 psql1 pgsql[11272]: DEBUG: node=psql2, state=STREAMING, 
sync_state=SYNC
Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:02:57 psql1 pgsql[11368]: DEBUG: node=psql2, state=STREAMING, 
sync_state=SYNC
Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: PostgreSQL is running as a primary.
Nov 27 16:03:00 psql1 pgsql[11463]: DEBUG: node=psql2, state=STREAMING, 
sync_state=SYNC
Nov 27 16:03:00 psql1 pgsql[11556]: DEBUG: notify: pre for demote
Nov 27 16:03:00 psql1 pgsql[11590]: INFO: Stopping PostgreSQL on demote.
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: waiting for server to shut down. 
done server stopped
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Removing /var/lib/pgsql/PGSQL.lock.
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: PostgreSQL is down
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Changing pgsql-status on psql1 : 
PRI-STOP.
Nov 27 16:03:02 psql1 pgsql[11590]: DEBUG: Created recovery.conf. 
host=10.12.1.28, user=postgres
Nov 27 16:03:02 psql1 pgsql[11590]: INFO: Setup all nodes as an async.
Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: notify: post for demote
Nov 27 16:03:02 psql1 pgsql[11732]: DEBUG: post-demote called. Demote uname is 
psql1
Nov 27 16:03:02 psql1 pgsql[11732]: INFO: My Timeline ID and Checkpoint : 
14:2320
Nov 27 16:03:02 psql1 pgsql[11732]: WARNING: Can't get psql2 master baseline. 
Waiting...
Nov 27 16:03:03 psql1 pgsql[11732]: INFO: psql2 master baseline : 
14:2300
Nov 27 16:03:03 psql1 pgsql[11732]: ERROR: My data is inconsistent.
Nov 27 16:03:03 psql1 pgsql[11867]: DEBUG: notify: pre for stop
Nov 27 16:03:03 psql1 pgsql[11969]: INFO: PostgreSQL is already stopped.
Nov 27 16:03:12 psql1 pgsql[12053]: INFO: Don't check 
/var/lib/postgresql/9.1/main during probe
Nov 27 16:03:12 psql1 pgsql[12053]: INFO: PostgreSQL is down
Nov 27 16:03:27 psql1 pgsql[12204]: INFO: Changing pgsql-status on psql1 : 
-STOP.
Nov 27 16:03:27 psql1 pgsql[12204]: DEBUG: Created recovery.conf. 
host=10.12.1.28, user=postgres
Nov 27 16:03:27 psql1 pgsql[12204]: INFO: Setup all nodes as an async.
Nov 27 16:03:27 psql1 pgsql[12204]: INFO: My Timeline ID and Checkpoint : 
14:2320
Nov 27 16:03:27 psql1 pgsql[12204]: INFO: psql2 master baseline : 
14:2300
Nov 27 16:03:27 psql1 pgsql[12204]: ERROR: My data is inconsistent.
Nov 27 16:03:27 psql1 pgsql[12339]: DEBUG: notify: post for start
Nov 27 16:03:27 psql1 pgsql[12373]: DEBUG: notify: pre for stop
Nov 27 16:03:27 psql1 pgsql[12407]: INFO: PostgreSQL is already stopped.


Thanks,

Attila


-Original Message-
From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
Sent: 2011. november 27. 11:07
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

2011/11/27 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi,

 You were right, changing the shell to bash resolved the problem.
 The cluster now started in sync mode - thank you very much.

You're very welcome.

 I will be testing it in the next couple of days. I did just a very quick 
 test - it seems that psql master failed over to psql2 properly, but when I 
 tried to move it back to psql1 there were some problems starting psql on 
 node 1.

If the master (psql1) fails, its data may be inconsistent.
A PostgreSQL developer says that it's a feature.
Therefore my RA prevents it from starting automatically if the data is 
inconsistent.
Please

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-26 Thread Attila Megyeri
Hi Yoshiharu, Takatoshi,

Spent another day, without success. :(

I started from scratch and synchronous replication works nicely when nodes are 
started outside pacemaker.
My PostgreSQL version is 9.1.1.

When I start from pacemaker, after a while it gets into the following state:

Online: [ psql1 psql2 ]

 Master/Slave Set: msPostgresql [postgresql]
 Slaves: [ psql1 psql2 ]
 Clone Set: clnPingCheck [pingCheck]
 Started: [ psql1 psql2 ]

Node Attributes:
* Node psql1:
+ default_ping_set  : 100
+ master-postgresql:0   : -INFINITY
+ pgsql-status  : HS:alone
+ pgsql-xlog-loc: 1900
* Node psql2:
+ default_ping_set  : 100
+ master-postgresql:1   : -INFINITY
+ pgsql-status  : HS:alone
+ pgsql-xlog-loc: 1900


The psql status queries return the following:

PSQL1
==
postgres@psql1:/root$ psql  -c select 
application_name,upper(state),upper(sync_state) from pg_stat_replication
application_name | upper | upper
--+---+---
(0 rows)

postgres@psql1:/root$ psql  -Atc select 
pg_last_xlog_replay_location(),pg_last_xlog_receive_location()
0/1920|0/1900

PSQL2
==
postgres@psql2:~$  psql  -c select 
application_name,upper(state),upper(sync_state) from pg_stat_replication
 application_name | upper | upper
--+---+---
(0 rows)

postgres@psql2:~$ psql  -Atc select 
pg_last_xlog_replay_location(),pg_last_xlog_receive_location()
0/1900|0/1900


Neither server can connect (obviously) to the master, as the vip_repl is not 
brought up.


Could you help me understand WHAT is the action/state/event that should promote 
one of the nodes? I see that pacemaker monitors the servers every X seconds, 
but nothing else happens.
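
(For reference, the mechanism is roughly this - a sketch of the mechanism, not 
the RA's actual code: on each monitor the RA assigns the node a master score 
via crm_master, and pacemaker promotes the node with the highest positive 
score; the -INFINITY values shown above for master-postgresql:0 and :1 block 
promotion entirely.

crm_master -l reboot -v 1000          # set by the RA: node is promotable
crm_master -l reboot -v -INFINITY     # set by the RA: node must not be promoted
)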

In the log (limited to pgsql) the following sequence is repeated forever:

Nov 26 13:36:19 psql1 pgsql[19829]: INFO: Master is not exist.
Nov 26 13:36:19 psql1 pgsql[19829]: DEBUG: Checking right of master.
Nov 26 13:36:19 psql1 pgsql[19829]: INFO: My data status=.
Nov 26 13:36:19 psql1 pgsql[19829]: INFO: psql1 xlog location : 1900
Nov 26 13:36:19 psql1 pgsql[19829]: INFO: psql2 xlog location : 1900
Nov 26 13:36:26 psql1 pgsql[19993]: DEBUG: PostgreSQL is running as a hot 
standby.
Nov 26 13:36:26 psql1 pgsql[19993]: INFO: Master is not exist.
Nov 26 13:36:26 psql1 pgsql[19993]: DEBUG: Checking right of master.
Nov 26 13:36:26 psql1 pgsql[19993]: INFO: My data status=.
Nov 26 13:36:26 psql1 pgsql[19993]: INFO: psql1 xlog location : 1900
Nov 26 13:36:26 psql1 pgsql[19993]: INFO: psql2 xlog location : 1900
Nov 26 13:36:33 psql1 pgsql[20176]: DEBUG: PostgreSQL is running as a hot 
standby.
Nov 26 13:36:33 psql1 pgsql[20176]: INFO: Master is not exist.
Nov 26 13:36:33 psql1 pgsql[20176]: DEBUG: Checking right of master.
Nov 26 13:36:33 psql1 pgsql[20176]: INFO: My data status=.
Nov 26 13:36:33 psql1 pgsql[20176]: INFO: psql1 xlog location : 1900
Nov 26 13:36:33 psql1 pgsql[20176]: INFO: psql2 xlog location : 1900
Nov 26 13:36:41 psql1 pgsql[20343]: DEBUG: PostgreSQL is running as a hot 
standby.


Any help is appreciated!

Regards,
Attila




-Original Message-
From: Yoshiharu Mori [mailto:y-m...@sraoss.co.jp] 
Sent: 2011. november 25. 14:17
To: The Pacemaker cluster resource manager
Cc: Attila Megyeri
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

 A quick snippet from the corosync.log
 
 Nov 23 05:43:05 psql1 pgsql[2845]: DEBUG: Checking right of master.
 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: My data status=.
 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql1 xlog location : 
 0D00 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql2 xlog 
 location : 0800
 
 As you can see, the "My data status" log line returns an empty string.

My log is the same, but it works.

Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Master is not exist.
Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Checking right of master.
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: My data status=.
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm01 xlog location : 
0520 Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm02 xlog 
location : 0500

In my log, the following lines are output, and it started after checking the 
xlog location (3 times).

Nov 18 19:29:39 osspc24-1 pgsql[18720]: INFO: I have a master right.

Please show us more of the corosync.log.


 
 
 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: 2011. november 25. 9:28
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover - 
 RA needed
 
 Hi Takatoshi,
 
 I have restored the PSQL to run without corosync so I cannot send you the 
 crm_mon output now.
 
 What I can tell for sure:
 - RA never

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-26 Thread Attila Megyeri
Hi Takatoshi,

Thank you for coming back to me so quickly.

In the /var/lib/pgsql there are the following files:

PSQL1:
=
root@psql1:/var/lib/pgsql# ls -la
total 16
drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:04 .
drwxr-xr-x 35 root root 4096 Nov 25 22:21 ..
-rw-r--r--  1 postgres postgres1 Nov 26 00:17 rep_mode.conf
-rw-r--r--  1 root root   49 Nov 26 18:04 xlog_note.0

root@psql1:/var/lib/pgsql# cat xlog_note.0
-e psql1 1900
psql2 1900
root@psql1:/var/lib/pgsql#

PSQL2:
===
root@psql2:/var/lib/pgsql# ls -la
total 16
drwxr-xr-x  2 postgres postgres 4096 Nov 26 18:05 .
drwxr-xr-x 33 root root 4096 Nov 26 00:10 ..
-rw-r--r--  1 postgres postgres1 Nov 26 00:24 rep_mode.conf
-rw-r--r--  1 root root   49 Nov 26 18:05 xlog_note.0
root@psql2:/var/lib/pgsql# cat xlog_note.0
-e psql1 1900
psql2 1900
root@psql2:/var/lib/pgsql#

BTW, postgres is installed under /var/lib/postgresql, but I noticed that some 
parts of the RA are referring to the /var/lib/pgsql directory, so I created 
that directory and I keep some of the files there.


Thanks,
Attila



-Original Message-
From: Takatoshi MATSUO [mailto:matsuo@gmail.com] 
Sent: 2011. november 26. 18:27
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

1. Are there /var/lib/pgsql/xlog_note.0, xlog_note.1, xlog_note.2 files?
   These files are created while checking the xlog location on monitor.

2. Do these files include lines as below?
-
pgsql1  1900
pgsql2  1900
-

Regards.
Takatoshi MATSUO


2011/11/26 22:44 Attila Megyeri amegy...@minerva-soft.com:
 Hi Yoshiharu, Takatoshi,

 Spent another day, without success. :(

 I started from scratch and synchronous replication works nicely when nodes 
 are started outside pacemaker.
 My PostgreSQL version is 9.1.1.

 When I start from pacemaker, after a while it gets into the following state:

 Online: [ psql1 psql2 ]

  Master/Slave Set: msPostgresql [postgresql]
     Slaves: [ psql1 psql2 ]
  Clone Set: clnPingCheck [pingCheck]
     Started: [ psql1 psql2 ]

 Node Attributes:
 * Node psql1:
    + default_ping_set                  : 100
    + master-postgresql:0               : -INFINITY
    + pgsql-status                      : HS:alone
    + pgsql-xlog-loc                    : 1900
 * Node psql2:
    + default_ping_set                  : 100
    + master-postgresql:1               : -INFINITY
    + pgsql-status                      : HS:alone
    + pgsql-xlog-loc                    : 1900


 The psql status queries return the following:

 PSQL1
 ==
 postgres@psql1:/root$ psql  -c select 
 application_name,upper(state),upper(sync_state) from pg_stat_replication
 application_name | upper | upper
 --+---+---
 (0 rows)

 postgres@psql1:/root$ psql  -Atc select 
 pg_last_xlog_replay_location(),pg_last_xlog_receive_location()
 0/1920|0/1900

 PSQL2
 ==
 postgres@psql2:~$  psql  -c select 
 application_name,upper(state),upper(sync_state) from pg_stat_replication
  application_name | upper | upper
 --+---+---
 (0 rows)

 postgres@psql2:~$ psql  -Atc select 
 pg_last_xlog_replay_location(),pg_last_xlog_receive_location()
 0/1900|0/1900


 Neither server can connect (obviously) to the master, as the vip_repl is not 
 brought up.


 Could you help me understand WHAT is the action/state/event that should 
 promote one of the nodes? I see that pacemaker monitors the servers every X 
 seconds, but nothing else happens.

 In the log (limited to pgsql) the following sequence is repeated 
 forever:

 Nov 26 13:36:19 psql1 pgsql[19829]: INFO: Master is not exist.
 Nov 26 13:36:19 psql1 pgsql[19829]: DEBUG: Checking right of master.
 Nov 26 13:36:19 psql1 pgsql[19829]: INFO: My data status=.
 Nov 26 13:36:19 psql1 pgsql[19829]: INFO: psql1 xlog location : 
 1900 Nov 26 13:36:19 psql1 pgsql[19829]: INFO: psql2 xlog 
 location : 1900 Nov 26 13:36:26 psql1 pgsql[19993]: DEBUG: 
 PostgreSQL is running as a hot standby.
 Nov 26 13:36:26 psql1 pgsql[19993]: INFO: Master is not exist.
 Nov 26 13:36:26 psql1 pgsql[19993]: DEBUG: Checking right of master.
 Nov 26 13:36:26 psql1 pgsql[19993]: INFO: My data status=.
 Nov 26 13:36:26 psql1 pgsql[19993]: INFO: psql1 xlog location : 
 1900 Nov 26 13:36:26 psql1 pgsql[19993]: INFO: psql2 xlog 
 location : 1900 Nov 26 13:36:33 psql1 pgsql[20176]: DEBUG: 
 PostgreSQL is running as a hot standby.
 Nov 26 13:36:33 psql1 pgsql[20176]: INFO: Master is not exist.
 Nov 26 13:36:33 psql1 pgsql[20176]: DEBUG: Checking right of master.
 Nov 26 13:36:33 psql1 pgsql[20176]: INFO: My data status=.
 Nov 26 13:36:33 psql1 pgsql[20176]: INFO: psql1 xlog location : 
 1900 Nov 26 13:36:33 psql1 pgsql[20176]: INFO: psql2 xlog 
 location

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-25 Thread Attila Megyeri
Hi Takatoshi,

I have restored the PSQL to run without corosync so I cannot send you the 
crm_mon output now.

What I can tell for sure:
- The RA never promoted any of the nodes, no matter what the status was. It 
also did not promote the node when it was the only one.
- I believe the issue is in the comparison of the xlogs. How could I 
troubleshoot that? I see from the logs that crm NEVER tried to invoke pgsql 
with promote.
- I previously tried the crm_mon -A option, but there was never a 
pgsql-data-status attribute. The other attributes were there, including the 
HS:alone.
- In the corosync log the only relevant RA message I see is "Master is not 
exist." I never saw a message like "My data is out-of-date".

Thank you!

Attila


-Original Message-
From: Takatoshi MATSUO [mailto:matsuo@gmail.com] 
Sent: 2011. november 25. 8:56
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

2011/11/24 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi, All,

 Thanks for your reply.
 I see that you have invested significant effort in the development of the RA. 
 I spent the last day trying to set up the RA, but without much success.

 My infrastructure is very similar to yours, except for the fact that 
 currently I am testing with a single network adapter.

 Replication works nicely when I start the databases manually, not using 
 corosync.

 When I try to start using corosync, I see that the ping resources start 
 normally, but the msPostgresql starts on both nodes in slave mode, and I see 
 HS:alone

Seeing HS:alone is normal.
The RA compares xlog locations and promotes the PostgreSQL that has the newer 
data.

 In the Wiki you state that if I start on a single node only, PSQL should 
 start in Master mode (PRI), but this is not the case.

If the data is old, the node can't be master.
To become master, pgsql-data-status must be LATEST or STREAMING|SYNC.
Please check it using crm_mon -A.




Becoming master from stopped takes a few minutes, because the RA compares 
xlog locations on monitor.


 The recovery.conf file is created immediately, and from the logs I see no 
 attempt at all to promote the node.
 In the postgres logs I see that node1, which is supposed to be a master, 
 tries to connect to the vip-rep IP address, which is NOT brought up, because 
 it depends on the Master role...

 Do you have any idea?

Please check the HA log.
My RA outputs "My data is out-of-date. status=" to the log if the data is 
old.

Regards,
Takatoshi MATSUO

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-25 Thread Attila Megyeri
A quick snippet from the corosync.log

Nov 23 05:43:05 psql1 pgsql[2845]: DEBUG: Checking right of master.
Nov 23 05:43:05 psql1 pgsql[2845]: INFO: My data status=.
Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql1 xlog location : 0D00
Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql2 xlog location : 0800

As you can see, the "My data status" log line returns an empty string.


-Original Message-
From: Attila Megyeri [mailto:amegy...@minerva-soft.com] 
Sent: 2011. november 25. 9:28
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Takatoshi,

I have restored the PSQL to run without corosync so I cannot send you the 
crm_mon output now.

What I can tell for sure:
- The RA never promoted any of the nodes, no matter what the status was. It 
also did not promote the node when it was the only one.
- I believe the issue is in the comparison of the xlogs. How could I 
troubleshoot that? I see from the logs that crm NEVER tried to invoke pgsql 
with promote.
- I previously tried the crm_mon -A option, but there was never a 
pgsql-data-status attribute. The other attributes were there, including the 
HS:alone.
- In the corosync log the only relevant RA message I see is "Master is not 
exist." I never saw a message like "My data is out-of-date".

Thank you!

Attila


-Original Message-
From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
Sent: 2011. november 25. 8:56
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

2011/11/24 Attila Megyeri amegy...@minerva-soft.com:
 Hi Takatoshi, All,

 Thanks for your reply.
 I see that you have invested significant effort in the development of the RA. 
 I spent the last day trying to set up the RA, but without much success.

 My infrastructure is very similar to yours, except for the fact that 
 currently I am testing with a single network adapter.

 Replication works nicely when I start the databases manually, not using 
 corosync.

 When I try to start using corosync, I see that the ping resources start 
 normally, but the msPostgresql starts on both nodes in slave mode, and I see 
 HS:alone

Seeing HS:alone is normal.
The RA compares xlog locations and promotes the PostgreSQL that has the newer 
data.

  In the Wiki you state that if I start on a single node only, PSQL should 
  start in Master mode (PRI), but this is not the case.

If the data is old, the node can't be master.
To become master, pgsql-data-status must be LATEST or STREAMING|SYNC.
Please check it using crm_mon -A.




Becoming master from stopped takes a few minutes, because the RA compares 
xlog locations on monitor.


 The recovery.conf file is created immediately, and from the logs I see no 
 attempt at all to promote the node.
 In the postgres logs I see that node1, which is supposed to be a master, 
 tries to connect to the vip-rep IP address, which is NOT brought up, because 
 it depends on the Master role...

 Do you have any idea?

Please check the HA log.
My RA outputs "My data is out-of-date. status=" to the log if the data is 
old.

Regards,
Takatoshi MATSUO

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-25 Thread Attila Megyeri
Hi Yoshiharu,


-Original Message-
From: Yoshiharu Mori [mailto:y-m...@sraoss.co.jp] 
Sent: 2011. november 25. 14:17
To: The Pacemaker cluster resource manager
Cc: Attila Megyeri
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila

 A quick snippet from the corosync.log
 
 Nov 23 05:43:05 psql1 pgsql[2845]: DEBUG: Checking right of master.
 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: My data status=.
 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql1 xlog location : 
 0D00 Nov 23 05:43:05 psql1 pgsql[2845]: INFO: psql2 xlog 
 location : 0800
 
 As you can see, the "My data status" log line returns an empty string.

My log is the same, but it works.

Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Master is not exist.
Nov 18 19:28:26 osspc24-1 pgsql[17350]: INFO: Checking right of master.
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: My data status=.
Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm01 xlog location : 
0520 Nov 18 19:28:19 osspc24-1 pgsql[17138]: INFO: pm02 xlog 
location : 0500

In my log, the following lines are output, and it started after checking the 
xlog location (3 times).

Nov 18 19:29:39 osspc24-1 pgsql[18720]: INFO: I have a master right.

Please show us more of the corosync.log.


===
I can leave it running forever, but it will never show "I have a master right."
To be honest, I have no idea what should promote the node to master.
What is it that the RA checks, and what could be wrong? I just cannot find 
where the problem is.

Right now I am running corosync on node 1 only, as I expect that this way it 
will have the most recent  xlog and start as a master.
But it never starts.

Here is the output for crm_mon -A :


Last updated: Fri Nov 25 13:52:58 2011
Stack: openais
Current DC: psql1 - partition WITHOUT quorum
Version: 1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
4 Resources configured.


Online: [ psql1 ]
OFFLINE: [ psql2 ]

 Master/Slave Set: msPostgresql [postgresql]
 Slaves: [ psql1 ]
 Stopped: [ postgresql:1 ]
 Clone Set: clnPingCheck [pingCheck]
 Started: [ psql1 ]
 Stopped: [ pingCheck:1 ]

Node Attributes:
* Node psql1:
+ default_ping_set  : 100
+ master-postgresql:0   : -INFINITY
+ pgsql-status  : HS:alone
+ pgsql-xlog-loc: 1200



I sent the log directly in private not to overload the list. I did a resource 
stop msPostgresql and resource start msPostgresql around 13:52.
You will see some extra debug messages starting with ATT - I added them to 
the RA to help my troubleshooting.

Thank you for your help,
Attila





 
 
 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: 2011. november 25. 9:28
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover - 
 RA needed
 
 Hi Takatoshi,
 
 I have restored the PSQL to run without corosync so I cannot send you the 
 crm_mon output now.
 
 What I can tell for sure:
 - The RA never promoted any of the nodes, no matter what the status was. It 
 also did not promote the node when it was the only one.
 - I believe the issue is in the comparison of the xlogs. How could I 
 troubleshoot that? I see from the logs that crm NEVER tried to invoke pgsql 
 with promote.
 - I previously tried the crm_mon -A option, but there was never a 
 pgsql-data-status attribute. The other attributes were there, including 
 the HS:alone.
 - In the corosync log the only relevant RA message I see is "Master is not 
 exist." I never saw a message like "My data is out-of-date".
 
 Thank you!
 
 Attila
 
 
 -Original Message-
 From: Takatoshi MATSUO [mailto:matsuo@gmail.com]
 Sent: 2011. november 25. 8:56
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover - 
 RA needed
 
 Hi Attila
 
 2011/11/24 Attila Megyeri amegy...@minerva-soft.com:
  Hi Takatoshi, All,
 
  Thanks for your reply.
  I see that you have invested significant effort in the development of the 
  RA. I spent the last day trying to set up the RA, but without much success.
 
  My infrastructure is very similar to yours, except for the fact that 
  currently I am testing with a single network adapter.
 
  Replication works nicely when I start the databases manually, not using 
  corosync.
 
  When I try to start using corosync, I see that the ping resources start 
  normally, but the msPostgresql starts on both nodes in slave mode, and I 
  see HS:alone
 
 Seeing HS:alone is normal.
 The RA compares xlog locations and promotes the PostgreSQL that has the 
 newer data.
 
  In the Wiki you state that if I start on a single node only, PSQL should 
  start in Master mode (PRI), but this is not the case.
 
 If the data is old, the node can't be master.
 To become master, pgsql-data-status must be LATEST or STREAMING|SYNC.
 Please check it using

[Pacemaker] faq / howto needed for cib troubleshooting

2011-11-24 Thread Attila Megyeri
Hi Gents,

I see from time to time that you are asking for cibadmin -Ql type outputs to 
help people troubleshoot their problems.

Currently I have an issue promoting a MS resource (the PSQL issue in the 
previous mail) - I would like to start troubleshooting the problem, but did 
not find any how-tos or documentation on this topic.
Could you provide me with any details on how to troubleshoot CIB states?
My current issue is that I have a MS resource that is started in slave/slave 
mode, and the promote is never even called by the cib. I'd like to start the 
research but have no idea how to do it.

I have read the pacemaker doc, as well as the Clusters from Scratch doc, but 
there are no troubleshooting hints.
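
(A minimal starting point, assuming the stock pacemaker CLI tools of this era, 
would be:

cibadmin -Q > /tmp/cib.xml   # dump the live CIB for offline inspection
ptest -sL                    # show allocation and master scores from the live CIB
crm_mon -A -1                # one-shot status including node attributes

The scores printed by ptest usually reveal why a promote is or is not being 
scheduled.)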

Thank you in advance,

Attila

-Original Message-
From: Attila Megyeri [mailto:amegy...@minerva-soft.com] 
Sent: 2011. november 23. 16:53
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Takatoshi, All,

Thanks for your reply.
I see that you have invested significant effort in the development of the RA. I 
spent the last day trying to set up the RA, but without much success.

My infrastructure is very similar to yours, except for the fact that currently 
I am testing with a single network adapter.

Replication works nicely when I start the databases manually, not using 
corosync.

When I try to start using corosync, I see that the ping resources start 
normally, but the msPostgresql starts on both nodes in slave mode, and I see 
HS:alone

In the Wiki you state that if I start on a single node only, PSQL should start 
in Master mode (PRI), but this is not the case.

The recovery.conf file is created immediately, and from the logs I see no 
attempt at all to promote the node.
In the postgres logs I see that node1, which is supposed to be a master, tries 
to connect to the vip-rep IP address, which is NOT brought up, because it 
depends on the Master role...

Do you have any idea?


My environment:
Debian Squeeze, with backported pacemaker (version 1.1.5) - the official 
pacemaker in Debian is rather old and buggy.
Postgres 9.1, streaming replication, sync mode
Node1: psql1, 10.12.1.21
Node2: psql2, 10.12.1.22

Crm config:

node psql1 \
attributes standby=off
node psql2 \
attributes standby=off
primitive pingCheck ocf:pacemaker:ping \
params name=default_ping_set host_list=10.12.1.1 multiplier=100 \
op start interval=0s timeout=60s on-fail=restart \
op monitor interval=10s timeout=60s on-fail=restart \
op stop interval=0s timeout=60s on-fail=ignore
primitive postgresql ocf:heartbeat:pgsql \
params pgctl=/usr/lib/postgresql/9.1/bin/pg_ctl psql=/usr/bin/psql 
pgdata=/var/lib/postgresql/9.1/main 
config=/etc/postgresql/9.1/main/postgresql.conf 
pgctldata=/usr/lib/postgresql/9.1/bin/pg_controldata rep_mode=sync 
node_list=psql1 psql2 restore_command=cp 
/var/lib/postgresql/9.1/main/pg_archive/%f %p master_ip=10.12.1.28 \
op start interval=0s timeout=60s on-fail=restart \
op monitor interval=7s timeout=60s on-fail=restart \
op monitor interval=2s role=Master timeout=60s on-fail=restart \
op promote interval=0s timeout=60s on-fail=restart \
op demote interval=0s timeout=60s on-fail=block \
op stop interval=0s timeout=60s on-fail=block \
op notify interval=0s timeout=60s
primitive vip-master ocf:heartbeat:IPaddr2 \
params ip=10.12.1.20 nic=eth0 cidr_netmask=24 \
op start interval=0s timeout=60s on-fail=restart \
op monitor interval=10s timeout=60s on-fail=restart \
op stop interval=0s timeout=60s on-fail=block \
meta target-role=Started
primitive vip-rep ocf:heartbeat:IPaddr2 \
params ip=10.12.1.28 nic=eth0 cidr_netmask=24 \
op start interval=0s timeout=60s on-fail=restart \
op monitor interval=10s timeout=60s on-fail=restart \
op stop interval=0s timeout=60s on-fail=block \
meta target-role=Started
primitive vip-slave ocf:heartbeat:IPaddr2 \
params ip=10.12.1.27 nic=eth0 cidr_netmask=24 \
meta resource-stickiness=1 \
op start interval=0s timeout=60s on-fail=restart \
op monitor interval=10s timeout=60s on-fail=restart \
op stop interval=0s timeout=60s on-fail=block
group master-group vip-master vip-rep
ms msPostgresql postgresql \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true target-role=Master
clone clnPingCheck pingCheck
location rsc_location-1 vip-slave \
rule $id=rsc_location-1-rule 200: pgsql-status eq HS:sync \
rule $id=rsc_location-1-rule-0 100: pgsql-status eq PRI \
rule $id=rsc_location-1-rule-1 -inf: not_defined pgsql-status \
rule $id=rsc_location-1-rule-2 -inf: pgsql-status ne HS:sync and 
pgsql-status ne PRI
location rsc_location-2 msPostgresql \
rule $id=rsc_location-2-rule $role=master 200: #uname eq psql1

Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-23 Thread Attila Megyeri
=INFINITY \
migration-threshold=1



Regards,
Attila



-Original Message-
From: Takatoshi MATSUO [mailto:matsuo@gmail.com] 
Sent: 2011. november 17. 8:04
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi  All

I created an RA for PostgreSQL 9.1 Streaming Replication, based on pgsql.

RA
  https://github.com/t-matsuo/resource-agents/blob/pgsql91/heartbeat/pgsql
Documents
  https://github.com/t-matsuo/resource-agents/wiki

It is almost totally changed from the previous patch:
http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018193.html
It creates recovery.conf and promotes PostgreSQL automatically.
Additionally, it can switch between synchronous and asynchronous replication 
automatically.

Please try it and comment.

Regards,
Takatoshi MATSUO

2011/11/17 Serge Dubrouski serge...@gmail.com:


 On Wed, Nov 16, 2011 at 12:55 PM, Attila Megyeri 
 amegy...@minerva-soft.com
 wrote:

 Hi Florian,

 -Original Message-
 From: Florian Haas [mailto:flor...@hastexo.com]
 Sent: 2011. november 16. 11:49
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Postgresql streaming replication failover - 
 RA needed

 Hi Attila,

 On 2011-11-16 10:27, Attila Megyeri wrote:
  Hi All,
 
 
 
 We have a two-node postgresql 9.1 system configured using streaming 
 replication (active/active with a read-only slave).
 
  We want to automate the failover process and I couldn't really find 
  a resource agent that could do the job.

 That is correct; the pgsql resource agent (unlike its mysql 
 counterpart) does not support streaming replication. We've had a 
 contributor submit a patch at one point, but it was somewhat 
 ill-conceived and thus did not make it into the upstream repo. The relevant 
 thread is here:

 http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018195
 .html

 Would you feel comfortable modifying the pgsql resource agent to 
 support replication? If so, we could revisit this issue and 
 potentially add streaming replication support to pgsql.


 Well I'm not sure I would be able to do that change. Failover is 
 relatively easy to do but I really have no idea how to do the failback part.

 And that's exactly the reason why I haven't implemented it yet. With 
 the way replication is currently done in PostgreSQL there is no easy 
 way to switch between roles, or at least I don't know of such a way. 
 Implementing just fail-over functionality by creating a trigger file 
 on a slave server in the case of a failure on the master side doesn't 
 create a full master-slave implementation, in my opinion.


 I will definitely have to sort this out somehow; I am just unsure 
 whether I will try to use the repmgr mentioned in the video, or 
 pacemaker with some level of customization...

 Is the resource agent that you mentioned available somewhere?

 Thanks.
 Attila



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org Getting started: 
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs:
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacem
 aker



 --
 Serge Dubrouski.

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org Getting started: 
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs:
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacema
 ker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-23 Thread Attila Megyeri
Hi Brett,

I would be very interested to see your solution. So far I had no luck with 
Takatoshi's RA :(

Regards,
Attila

-Original Message-
From: Buckingham, Brett [mailto:bbucking...@broadviewnet.com] 
Sent: 2011. november 17. 18:40
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

 Well I'm not sure I would be able to do that change. Failover is
relatively easy to do but I really have no idea how to do the failback part.

And that's exactly the reason why I haven't implemented it yet. With 
the way replication is currently done in PostgreSQL there is no easy way to 
switch between roles, or at least I don't know of such a way.
Implementing just fail-over functionality by creating a trigger file on a slave 
server in the case of a failure on the master side doesn't create a full 
master-slave implementation, in my opinion.

We have created just such a multi-state RA, which incorporates a design to 
manage failover, failback, and fallback (regular backups).  Please give us a 
few days - a member of my team is removing any product-specifics from it, and 
we'll post it shortly.

Brett


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Colocating resources on the same physical server when resources are run inside virtual servers

2011-11-17 Thread Attila Megyeri
Hi,


-Original Message-
From: Andrew Beekhof [mailto:and...@beekhof.net] 
Sent: 2011. november 18. 0:28
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Colocating resources on the same physical server when 
resources are run inside virtual servers

On Fri, Nov 18, 2011 at 12:46 AM, Andreas Kurz andr...@hastexo.com wrote:
 Hello,

 On 11/16/2011 09:04 PM, Attila Megyeri wrote:
 Hi Team,



 Resources A and B are running within virtual servers. There are 
 two physical servers ph1 and ph2 with two virtualized nodes on each.



 What would be the easiest way to have a specific resource (e.g. 
 resource
 A) move to another node (From node 1 to node 2) in case when a 
 different resource (e.g. B)

 moves from node 3 to node 4?



 Resources A and B are independent, but nodes 1 and 3 are virtual 
 servers running on physical host ph1 whereas nodes 2 and 4 are 
 virtual servers on physical host ph2 and

 the goal is to have resources A and B run on the same physical
 (host) server.

 Is this one cluster or are these two independent clusters?


Well, my first idea was to make them independent, but I am ready to merge them 
as well :)



 For one cluster there would be the (hidden, undocumented) feature of using 
 a node-attribute in colocation constraints

Did I not document it, or just not explain how useful it could be in these 
cases? :-)


So I guess I found something not very trivial :)

  unfortunately the crm shell is
 not aware of this feature so you would have to manipulate the cib 
 directly ... might be worth trying.
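
 (For the record, a sketch of what that could look like - the resource names 
 rscA/rscB and the attribute name physhost are illustrative assumptions:

 # tag each virtual node with the physical host it runs on
 crm_attribute -t nodes -N node1 -n physhost -v ph1 -l forever
 crm_attribute -t nodes -N node3 -n physhost -v ph1 -l forever
 crm_attribute -t nodes -N node2 -n physhost -v ph2 -l forever
 crm_attribute -t nodes -N node4 -n physhost -v ph2 -l forever

 # then load a colocation that matches on the attribute rather than the node
 # name, e.g. via cibadmin -C -o constraints -x coloc.xml, with coloc.xml:
 <rsc_colocation id="col-same-physhost" rsc="rscA" with-rsc="rscB"
                 score="INFINITY" node-attribute="physhost"/>
 )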

 Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now




 Thank you in advance,



 Bests



 Attila









 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org Getting started: 
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacem
 aker




 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org Getting started: 
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: 
 http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacema
 ker



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-16 Thread Attila Megyeri
Hi All,

We have a two-node postgresql 9.1 system configured using streaming replication 
(active/active with a read-only slave).
We want to automate the failover process and I couldn't really find a resource 
agent that could do the job.

All HA solutions for postgresql I have seen are based on a DRBD active/passive 
approach, which we would prefer not to use.

At the first stage I would be satisfied with the failover only - meaning 
that the more complex failback would not be required.
Of course if the failback could be implemented as well, that would be the right 
solution for us.

Does anyone have experience with the above setup? Any feedback is appreciated!

Regards,
Attila
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Postgresql streaming replication failover - RA needed

2011-11-16 Thread Attila Megyeri
Hi Florian,

-Original Message-
From: Florian Haas [mailto:flor...@hastexo.com] 
Sent: 2011. november 16. 11:49
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Postgresql streaming replication failover - RA needed

Hi Attila,

On 2011-11-16 10:27, Attila Megyeri wrote:
 Hi All,
 
  
 
 We have a two-node postgresql 9.1 system configured using streaming 
 replication (active/active with a read-only slave).
 
 We want to automate the failover process and I couldn't really find a 
 resource agent that could do the job.

That is correct; the pgsql resource agent (unlike its mysql counterpart) does 
not support streaming replication. We've had a contributor submit a patch at 
one point, but it was somewhat ill-conceived and thus did not make it into the 
upstream repo. The relevant thread is here:

http://lists.linux-ha.org/pipermail/linux-ha-dev/2011-February/018195.html

Would you feel comfortable modifying the pgsql resource agent to support 
replication? If so, we could revisit this issue and potentially add streaming 
replication support to pgsql.


Well, I'm not sure I would be able to make that change. Failover is relatively 
easy to do, but I really have no idea how to do the failback part. I will 
definitely have to sort this out somehow; I am just unsure whether I will try 
to use the repmgr mentioned in the video, or pacemaker with some level of 
customization...
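
For what it's worth, the failover half on 9.1 really is just promoting the
standby. A minimal sketch, with made-up paths and connection settings - the
standby's recovery.conf would contain something like:

    standby_mode = 'on'
    primary_conninfo = 'host=db1 user=replicator'
    trigger_file = '/var/lib/postgresql/9.1/main/failover.trigger'

and a promote action would then either create the trigger file or use pg_ctl
(the promote subcommand is new in 9.1):

    touch /var/lib/postgresql/9.1/main/failover.trigger
    # or
    pg_ctl promote -D /var/lib/postgresql/9.1/main

Failback is the hard part: on 9.1 the old master cannot simply rejoin, it has
to be rebuilt from a fresh base backup of the new master before it can follow
as a standby again.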

Is the resource agent that you mentioned available somewhere?

Thanks.
Attila



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Colocating resources on the same physical server when resources are run inside virtual servers

2011-11-16 Thread Attila Megyeri
Hi Team,

Resources A and B are running within virtual servers. There are two 
physical servers ph1 and ph2 with two virtualized nodes on each.

What would be the easiest way to have a specific resource (e.g. resource A) 
move to another node (from node 1 to node 2) when a different resource (e.g. B)
moves from node 3 to node 4?

Resources A and B are independent, but nodes 1 and 3 are virtual servers 
running on physical host ph1 whereas nodes 2 and 4 are virtual servers on 
physical host ph2 and
the goal is to have resources A and B run on the same physical (host) 
server.

Thank you in advance,

Bests

Attila



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Multinode cluster question

2011-11-13 Thread Attila Megyeri

-Original Message-
From: Andreas Kurz [mailto:andr...@hastexo.com] 
Sent: 2011. november 10. 11:03
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Multinode cluster question

On 11/09/2011 12:04 AM, Attila Megyeri wrote:
 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: 2011. november 8. 16:13
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Multinode cluster question
 
 -Original Message-
 From: Andreas Kurz [mailto:andr...@hastexo.com]
 Sent: 2011. november 8. 15:27
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Multinode cluster question
 
 On 11/08/2011 03:14 PM, Attila Megyeri wrote:

 -Original Message-
 From: Andreas Kurz [mailto:andr...@hastexo.com]
 Sent: 2011. november 8. 15:03
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Multinode cluster question

 On 11/08/2011 02:50 PM, Attila Megyeri wrote:
 -Original Message-
 From: Andreas Kurz [mailto:andr...@hastexo.com]
 Sent: 2011. november 8. 14:42
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Multinode cluster question

 On 11/08/2011 12:02 PM, Attila Megyeri wrote:
 Hi All,

  

 I need some help/guidance, on how to make sure that certain 
 resources (running in virtual cluster nodes) are run on the same physical 
 server.

  

 The setup:

  

 I have a cluster made of two physical nodes, that I am willing to 
 use for HA purposes (no LB for the time being).

 I have a failover IP from the provider, that is controlled using a 
 resource agent from one pair of the virtual machines (web1 and 
 web2), and the IP is assigned always to one of the physical servers.

 On the physical server I use iptables pre/postrouting to direct the 
 traffic to the appropriate virtual node. The routing points to the 
 web VIP, and red5 VIP.

  

 On the physical servers I have 3-3 virtual servers, that host the 
 specific roles of the solution, e.g. db1 db2, web1 web2, red5_1 red5_2.

 The virtual servers use the default gateway of their own physical 
 server to talk to the outside world.

  

 My first idea was to create 3 independent two-node clusters. Db 
 cluster, web cluster, red5 cluster.

 The db cluster is a M/S psql, with a virtual IP.

 The web cluster is an apache2 cluster, cloned on two virtual 
 servers, with a failover IP RA (if node1 on phy1 fails, failover Ip 
 is redirected to phy2 and vice versa).

 Red5 is a red5 cluster running on two instances, with a virtual IP 
 (internal).

  

 This is where it gets interesting - because of the default gateway.

 The db cluster is accessed from the intranet only - no worries here.

  

 Red5 is different - but it needs further explanation.

 Let's assume that all roles (db master, web, red5) are running on 
 physical server 1.

 Web1 fails for some reason. Web2 role will become active, and the 
 external failover IP will point from now on to physical node2.  The 
 iptables script still points to the same VIP address, but it now 
 runs on a different node. No issues here, as Web2 gets its traffic 
 properly, as it knows that it is running on node2 now.

  

 The issue is with Red5.

 Red5 runs on node1, and uses default gw on node1. [it does not know 
 that the external failover IP no longer points to node1].

 When a request is received on the failover IP (now ph node2), 
 iptables redirects it to red5's VIP. Red5, running on node1 gets 
 this request, but does not know that it shall be routed through node2!

 As such, the replies will be routed through ph node1 - as it is 
 the default gw. This is definitely not the right approach.

  

 The actual question is:

 -  Should I treat all nodes inside the same cluster (db1, db2,
 web1, web2, red1, red2) - and this way I could possibly detect that 
 failover IP has changed and I should do something with red5?

 -  Do something could mean for me one of the following:

 o   If web VIP is running on physical node 2 (on node web2), then
 move red VIP to physical node2 (to node red2)

 o   Alternatively, only change the default gateway for red1, to use
 node2 as the default gateway?

 Why not do all traffic between the VMs via an internal net only ...
 e.g. a directly connected NIC between hosts, bridged, with all VMs connected 
 to it?



 I'm not sure I got you right.
 All VMs are connected and bridged. All VMs can use any of the physical 
 hosts as their default gateway. And there is no issue with internal traffic.
 The question is what to do when a resource runs on node1 (default GW is 
 host1), but the external failover IP points to host2, and host2 routes the 
 packets to VIP of the resource. The outgoing packet will leave the internal 
 network on host1, and it entered on host2.

 If they are all bridged and connected why would you have a routing 
 problem? They are all in the same net, no routing involved ... only the arp 
 caches need to be updated after an IP failover ... this is handled e.g

[Pacemaker] Multinode cluster question

2011-11-08 Thread Attila Megyeri
Hi All,

I need some help/guidance, on how to make sure that certain resources (running 
in virtual cluster nodes) are run on the same physical server.

The setup:

I have a cluster made of two physical nodes, that I am willing to use for HA 
purposes (no LB for the time being).
I have a failover IP from the provider, that is controlled using a resource 
agent from one pair of the virtual machines (web1 and web2), and the IP is 
assigned always to one of the physical servers.
On the physical server I use iptables pre/postrouting to direct the traffic to 
the appropriate virtual node. The routing points to the web VIP, and red5 VIP.
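
Roughly, the pre-routing part looks like this - a sketch with made-up
addresses and ports (failover IP 203.0.113.10, web VIP 192.168.122.10, red5
VIP 192.168.122.20):

    iptables -t nat -A PREROUTING -d 203.0.113.10 -p tcp --dport 80 \
      -j DNAT --to-destination 192.168.122.10
    iptables -t nat -A PREROUTING -d 203.0.113.10 -p tcp --dport 1935 \
      -j DNAT --to-destination 192.168.122.20

Note that conntrack can only rewrite the replies back to the failover IP if
they leave through the same host that DNATed them - which is exactly the
default-gateway problem described below.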

On the physical servers I have 3-3 virtual servers, that host the specific 
roles of the solution, e.g. db1 db2, web1 web2, red5_1 red5_2.
The virtual servers use the default gateway of their own physical server to 
talk to the outside world.

My first idea was to create 3 independent two-node clusters. Db cluster, web 
cluster, red5 cluster.
The db cluster is a M/S psql, with a virtual IP.
The web cluster is an apache2 cluster, cloned on two virtual servers, with a 
failover IP RA (if node1 on phy1 fails, failover Ip is redirected to phy2 and 
vice versa).
Red5 is a red5 cluster running on two instances, with a virtual IP (internal).

This is where it gets interesting - because of the default gateway.
The db cluster is accessed from the intranet only - no worries here.

Red5 is different - but it needs further explanation.
Let's assume that all roles (db master, web, red5) are running on physical 
server 1.
Web1 fails for some reason. Web2 role will become active, and the external 
failover IP will point from now on to physical node2.  The iptables script 
still points to the same VIP address, but it now runs on a different node. No 
issues here, as Web2 gets its traffic properly, as it knows that it is running 
on node2 now.

The issue is with Red5.
Red5 runs on node1, and uses default gw on node1. [it does not know that the 
external failover IP no longer points to node1].
When a request is received on the failover IP (now ph node2), iptables 
redirects it to red5's VIP. Red5, running on node1 gets this request, but does 
not know that it shall be routed through node2!
As such, the replies will be routed through ph node1 - as it is the default 
gw. This is definitely not the right approach.

The actual question is:

-  Should I treat all nodes inside the same cluster (db1, db2, web1, 
web2, red1, red2) - and this way I could possibly detect that failover IP has 
changed and I should do something with red5?

-  Do something could mean for me one of the following:

o   If web VIP is running on physical node 2 (on node web2), then move 
red VIP to physical node2 (to node red2)

o   Alternatively, only change the default gateway for red1, to use node2 as 
the default gateway?

I hope my question is clear, and that the setup mentioned here is quite common.
I am asking the experts, what is the recommended approach in this case.


Thank you in advance,

Attila




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Multinode cluster question

2011-11-08 Thread Attila Megyeri
-Original Message-
From: Andreas Kurz [mailto:andr...@hastexo.com] 
Sent: 2011. november 8. 14:42
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Multinode cluster question

On 11/08/2011 12:02 PM, Attila Megyeri wrote:
 Hi All,
 
  
 
 I need some help/guidance, on how to make sure that certain resources 
 (running in virtual cluster nodes) are run on the same physical server.
 
  
 
 The setup:
 
  
 
 I have a cluster made of two physical nodes, that I am willing to use 
 for HA purposes (no LB for the time being).
 
 I have a failover IP from the provider, that is controlled using a 
 resource agent from one pair of the virtual machines (web1 and web2), 
 and the IP is assigned always to one of the physical servers.
 
 On the physical server I use iptables pre/postrouting to direct the 
 traffic to the appropriate virtual node. The routing points to the web 
 VIP, and red5 VIP.
 
  
 
 On the physical servers I have 3-3 virtual servers, that host the 
 specific roles of the solution, e.g. db1 db2, web1 web2, red5_1 red5_2.
 
 The virtual servers use the default gateway of their own physical 
 server to talk to the outside world.
 
  
 
 My first idea was to create 3 independent two-node clusters. Db 
 cluster, web cluster, red5 cluster.
 
 The db cluster is a M/S psql, with a virtual IP.
 
 The web cluster is an apache2 cluster, cloned on two virtual servers, 
 with a failover IP RA (if node1 on phy1 fails, failover Ip is 
 redirected to phy2 and vice versa).
 
 Red5 is a red5 cluster running on two instances, with a virtual IP 
 (internal).
 
  
 
 This is where it gets interesting - because of the default gateway.
 
 The db cluster is accessed from the intranet only - no worries here.
 
  
 
 Red5 is different - but it needs further explanation.
 
 Let's assume that all roles (db master, web, red5) are running on 
 physical server 1.
 
 Web1 fails for some reason. Web2 role will become active, and the 
 external failover IP will point from now on to physical node2.  The 
 iptables script still points to the same VIP address, but it now runs 
 on a different node. No issues here, as Web2 gets its traffic 
 properly, as it knows that it is running on node2 now.
 
  
 
 The issue is with Red5.
 
 Red5 runs on node1, and uses default gw on node1. [it does not know 
 that the external failover IP no longer points to node1].
 
 When a request is received on the failover IP (now ph node2), iptables 
 redirects it to red5's VIP. Red5, running on node1 gets this request, 
 but does not know that it shall be routed through node2!
 
 As such, the replies will be routed through ph node1 - as it is the 
 default gw. This is definitely not the right approach.
 
  
 
 The actual question is:
 
 -  Should I treat all nodes inside the same cluster (db1, db2,
 web1, web2, red1, red2) - and this way I could possibly detect that 
 failover IP has changed and I should do something with red5?
 
 -  Do something could mean for me one of the following:
 
 o   If web VIP is running on physical node 2 (on node web2), then
 move red VIP to physical node2 (to node red2)
 
 o   Alternatively, only change the default gateway for red1, to use
 node2 as the default gateway?

Why not do all traffic between the VMs via an internal net only ...
e.g. a directly connected NIC between hosts, bridged, with all VMs connected to it?



I'm not sure I got you right.
All VMs are connected and bridged. All VMs can use any of the physical hosts as 
their default gateway. And there is no issue with internal traffic.
The question is what to do when a resource runs on node1 (default GW is 
host1), but the external failover IP points to host2, and host2 routes the 
packets to the VIP of the resource. The outgoing packet will leave the internal 
network through host1, even though it entered on host2.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Multinode cluster question

2011-11-08 Thread Attila Megyeri

-Original Message-
From: Andreas Kurz [mailto:andr...@hastexo.com] 
Sent: 2011. november 8. 15:03
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Multinode cluster question

On 11/08/2011 02:50 PM, Attila Megyeri wrote:
 -Original Message-
 From: Andreas Kurz [mailto:andr...@hastexo.com]
 Sent: 2011. november 8. 14:42
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Multinode cluster question
 
 On 11/08/2011 12:02 PM, Attila Megyeri wrote:
 Hi All,

  

 I need some help/guidance, on how to make sure that certain resources 
 (running in virtual cluster nodes) are run on the same physical server.

  

 The setup:

  

 I have a cluster made of two physical nodes, that I am willing to use 
 for HA purposes (no LB for the time being).

 I have a failover IP from the provider, that is controlled using a 
 resource agent from one pair of the virtual machines (web1 and web2), 
 and the IP is assigned always to one of the physical servers.

 On the physical server I use iptables pre/postrouting to direct the 
 traffic to the appropriate virtual node. The routing points to the web 
 VIP, and red5 VIP.

  

 On the physical servers I have 3-3 virtual servers, that host the 
 specific roles of the solution, e.g. db1 db2, web1 web2, red5_1 red5_2.

 The virtual servers use the default gateway of their own physical 
 server to talk to the outside world.

  

 My first idea was to create 3 independent two-node clusters. Db 
 cluster, web cluster, red5 cluster.

 The db cluster is a M/S psql, with a virtual IP.

 The web cluster is an apache2 cluster, cloned on two virtual servers, 
 with a failover IP RA (if node1 on phy1 fails, failover Ip is 
 redirected to phy2 and vice versa).

 Red5 is a red5 cluster running on two instances, with a virtual IP 
 (internal).

  

 This is where it gets interesting - because of the default gateway.

 The db cluster is accessed from the intranet only - no worries here.

  

 Red5 is different - but it needs further explanation.

 Let's assume that all roles (db master, web, red5) are running on 
 physical server 1.

 Web1 fails for some reason. Web2 role will become active, and the 
 external failover IP will point from now on to physical node2.  The 
 iptables script still points to the same VIP address, but it now runs 
 on a different node. No issues here, as Web2 gets its traffic 
 properly, as it knows that it is running on node2 now.

  

 The issue is with Red5.

 Red5 runs on node1, and uses default gw on node1. [it does not know 
 that the external failover IP no longer points to node1].

 When a request is received on the failover IP (now ph node2), 
 iptables redirects it to red5's VIP. Red5, running on node1 gets this 
 request, but does not know that it shall be routed through node2!

 As such, the replies will be routed through ph node1 - as it is the 
 default gw. This is definitely not the right approach.

  

 The actual question is:

 -  Should I treat all nodes inside the same cluster (db1, db2,
 web1, web2, red1, red2) - and this way I could possibly detect that 
 failover IP has changed and I should do something with red5?

 -  Do something could mean for me one of the following:

 o   If web VIP is running on physical node 2 (on node web2), then
 move red VIP to physical node2 (to node red2)

 o   Alternatively, only change the default gateway for red1, to use
 node2 as the default gateway?
 
 Why not do all traffic between the VMs via an internal net only ...
 e.g. a directly connected NIC between hosts, bridged, with all VMs connected to 
 it?
 
 
 
 I'm not sure I got you right.
 All VMs are connected and bridged. All VMs can use any of the physical hosts 
 as their default gateway. And there is no issue with internal traffic.
 The question is what to do when a resource runs on node1 (default GW is 
 host1), but the external failover IP points to host2, and host2 routes the 
 packets to VIP of the resource. The outgoing packet will leave the internal 
 network on host1, and it entered on host2.

If they are all bridged and connected, why would you have a routing problem? 
They are all in the same net, no routing involved ... only the ARP caches need 
to be updated after an IP failover ... this is handled e.g. in the IPaddr2 RA 
of Pacemaker.
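
(For reference, that gratuitous ARP can also be sent by hand when checking a
failover - a one-liner using arping from iputils, with a made-up interface and
address:

    arping -U -c 3 -I eth0 10.100.1.30

where -U sends unsolicited ARP replies so the neighbours update their caches.)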


The problem is that red1 is not failed over when web1 fails, i.e. it still runs 
on node1.
Or were you thinking about a VIP for the default gateway as well?


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Multinode cluster question

2011-11-08 Thread Attila Megyeri
-Original Message-
From: Andreas Kurz [mailto:andr...@hastexo.com] 
Sent: 2011. november 8. 15:27
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Multinode cluster question

On 11/08/2011 03:14 PM, Attila Megyeri wrote:
 
 -Original Message-
 From: Andreas Kurz [mailto:andr...@hastexo.com]
 Sent: 2011. november 8. 15:03
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Multinode cluster question
 
 On 11/08/2011 02:50 PM, Attila Megyeri wrote:
 -Original Message-
 From: Andreas Kurz [mailto:andr...@hastexo.com]
 Sent: 2011. november 8. 14:42
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Multinode cluster question

 On 11/08/2011 12:02 PM, Attila Megyeri wrote:
 Hi All,

  

 I need some help/guidance, on how to make sure that certain 
 resources (running in virtual cluster nodes) are run on the same physical 
 server.

  

 The setup:

  

 I have a cluster made of two physical nodes, that I am willing to 
 use for HA purposes (no LB for the time being).

 I have a failover IP from the provider, that is controlled using a 
 resource agent from one pair of the virtual machines (web1 and web2), 
 and the IP is assigned always to one of the physical servers.

 On the physical server I use iptables pre/postrouting to direct the 
 traffic to the appropriate virtual node. The routing points to the 
 web VIP, and red5 VIP.

  

 On the physical servers I have 3-3 virtual servers, that host the 
 specific roles of the solution, e.g. db1 db2, web1 web2, red5_1 red5_2.

 The virtual servers use the default gateway of their own physical 
 server to talk to the outside world.

  

 My first idea was to create 3 independent two-node clusters. Db 
 cluster, web cluster, red5 cluster.

 The db cluster is a M/S psql, with a virtual IP.

 The web cluster is an apache2 cluster, cloned on two virtual 
 servers, with a failover IP RA (if node1 on phy1 fails, failover Ip 
 is redirected to phy2 and vice versa).

 Red5 is a red5 cluster running on two instances, with a virtual IP 
 (internal).

  

 This is where it gets interesting - because of the default gateway.

 The db cluster is accessed from the intranet only - no worries here.

  

 Red5 is different - but it needs further explanation.

 Let's assume that all roles (db master, web, red5) are running on 
 physical server 1.

 Web1 fails for some reason. Web2 role will become active, and the 
 external failover IP will point from now on to physical node2.  The 
 iptables script still points to the same VIP address, but it now 
 runs on a different node. No issues here, as Web2 gets its traffic 
 properly, as it knows that it is running on node2 now.

  

 The issue is with Red5.

 Red5 runs on node1, and uses default gw on node1. [it does not know 
 that the external failover IP no longer points to node1].

 When a request is received on the failover IP (now ph node2), 
 iptables redirects it to red5's VIP. Red5, running on node1 gets 
 this request, but does not know that it shall be routed through node2!

 As such, the replies will be routed through ph node1 - as it is the 
 default gw. This is definitely not the right approach.

  

 The actual question is:

 -  Should I treat all nodes inside the same cluster (db1, db2,
 web1, web2, red1, red2) - and this way I could possibly detect that 
 failover IP has changed and I should do something with red5?

 -  Do something could mean for me one of the following:

 o   If web VIP is running on physical node 2 (on node web2), then
 move red VIP to physical node2 (to node red2)

 o   Alternatively, only change the default gateway for red1, to use
 node2 as the default gateway?

 Why not do all traffic between the VMs via an internal net only ...
 e.g. a directly connected NIC between hosts, bridged, with all VMs connected to 
 it?



 I'm not sure I got you right.
 All VMs are connected and bridged. All VMs can use any of the physical hosts 
 as their default gateway. And there is no issue with internal traffic.
 The question is what to do when a resource runs on node1 (default GW is 
 host1), but the external failover IP points to host2, and host2 routes the 
 packets to VIP of the resource. The outgoing packet will leave the internal 
 network on host1, and it entered on host2.
 
 If they are all bridged and connected why would you have a routing problem? 
 They are all in the same net, no routing involved ... only the arp caches 
 need to be updated after an IP failover ... this is handled e.g. in the 
 IPaddr2 RA of Pacemaker.
 
 
 The problem is that red1 is not failed over when web1 fails, i.e. it still 
 runs on node1.
 Or were you thinking about a VIP for the default gateway as well?

Maybe I misinterpreted your description, so please correct me if my assumptions 
are wrong ... but if all VMs are in the same subnet, no routing is involved, so 
you don't need to worry about the default gateway ... it is not used when 
there is a direct

Re: [Pacemaker] Multinode cluster question

2011-11-08 Thread Attila Megyeri
-Original Message-
From: Attila Megyeri [mailto:amegy...@minerva-soft.com] 
Sent: 2011. november 8. 16:13
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Multinode cluster question

-Original Message-
From: Andreas Kurz [mailto:andr...@hastexo.com]
Sent: 2011. november 8. 15:27
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Multinode cluster question

On 11/08/2011 03:14 PM, Attila Megyeri wrote:
 
 -Original Message-
 From: Andreas Kurz [mailto:andr...@hastexo.com]
 Sent: 2011. november 8. 15:03
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Multinode cluster question
 
 On 11/08/2011 02:50 PM, Attila Megyeri wrote:
 -Original Message-
 From: Andreas Kurz [mailto:andr...@hastexo.com]
 Sent: 2011. november 8. 14:42
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] Multinode cluster question

 On 11/08/2011 12:02 PM, Attila Megyeri wrote:
 Hi All,

  

 I need some help/guidance, on how to make sure that certain 
 resources (running in virtual cluster nodes) are run on the same physical 
 server.

  

 The setup:

  

 I have a cluster made of two physical nodes, that I am willing to 
 use for HA purposes (no LB for the time being).

 I have a failover IP from the provider, that is controlled using a 
 resource agent from one pair of the virtual machines (web1 and web2), 
 and the IP is assigned always to one of the physical servers.

 On the physical server I use iptables pre/postrouting to direct the 
 traffic to the appropriate virtual node. The routing points to the 
 web VIP, and red5 VIP.

  

 On the physical servers I have 3-3 virtual servers, that host the 
 specific roles of the solution, e.g. db1 db2, web1 web2, red5_1 red5_2.

 The virtual servers use the default gateway of their own physical 
 server to talk to the outside world.

  

 My first idea was to create 3 independent two-node clusters. Db 
 cluster, web cluster, red5 cluster.

 The db cluster is a M/S psql, with a virtual IP.

 The web cluster is an apache2 cluster, cloned on two virtual 
 servers, with a failover IP RA (if node1 on phy1 fails, failover Ip 
 is redirected to phy2 and vice versa).

 Red5 is a red5 cluster running on two instances, with a virtual IP 
 (internal).

  

 This is where it gets interesting - because of the default gateway.

 The db cluster is accessed from the intranet only - no worries here.

  

 Red5 is different - but it needs further explanation.

 Let's assume that all roles (db master, web, red5) are running on 
 physical server 1.

 Web1 fails for some reason. Web2 role will become active, and the 
 external failover IP will point from now on to physical node2.  The 
 iptables script still points to the same VIP address, but it now 
 runs on a different node. No issues here, as Web2 gets its traffic 
 properly, as it knows that it is running on node2 now.

  

 The issue is with Red5.

 Red5 runs on node1, and uses default gw on node1. [it does not know 
 that the external failover IP no longer points to node1].

 When a request is received on the failover IP (now ph node2), 
 iptables redirects it to red5's VIP. Red5, running on node1 gets 
 this request, but does not know that it shall be routed through node2!

 As such, the replies will be routed through ph node1 - as it is the 
 default gw. This is definitely not the right approach.

  

 The actual question is:

 -  Should I treat all nodes inside the same cluster (db1, db2,
 web1, web2, red1, red2) - and this way I could possibly detect that 
 failover IP has changed and I should do something with red5?

 -  Do something could mean for me one of the following:

 o   If web VIP is running on physical node 2 (on node web2), then
 move red VIP to physical node2 (to node red2)

 o   Alternatively, only change the default gateway for red1, to use
 node2 as the default gateway?

 Why not do all traffic between the VMs via an internal net only ...
 e.g. a directly connected NIC between hosts, bridged, with all VMs connected to 
 it?



 I'm not sure I got you right.
 All VMs are connected and bridged. All VMs can use any of the physical hosts 
 as their default gateway. And there is no issue with internal traffic.
 The question is what to do when a resource runs on node1 (default GW is 
 host1), but the external failover IP points to host2, and host2 routes the 
 packets to VIP of the resource. The outgoing packet will leave the internal 
 network on host1, and it entered on host2.
 
 If they are all bridged and connected why would you have a routing problem? 
 They are all in the same net, no routing involved ... only the arp caches 
 need to be updated after an IP failover ... this is handled e.g. in the 
 IPaddr2 RA of Pacemaker.
 
 
 The problem is that red1 is not failed over when web1 fails, i.e. it still 
 runs on node1.
 Or were you thinking about a VIP for the default gateway as well?

Maybe I misinterpreted your description, so please

Re: [Pacemaker] Trouble swinging a Shared IP (VIP) upon active mysql instance shutdown

2011-11-06 Thread Attila Megyeri

 PATRICKZOBLISEIN@... writes:

 
 Hi All,
 I currently have a 2-node Corosync+Pacemaker cluster configured with a 
Shared IP (VIP).  When the active node fails, the VIP swings to the passive 
node and becomes active - this works very well so far.  
 
 Also - on each node is a MySQL instance - a Master/Master replication 
configuration.
 
 My issue at the moment, is getting Pacemaker configured properly in order to 
swing the shared ip (VIP) when the active MySQL instance is down.
 
 Whenever I shutdown the mysql service on EITHER node - Pacemaker restarts 
that instance - instead of restarting, I'd really like to see the VIP swing to 
the other node.
 
 Below is my crm configuration - I understand that the example_mysql1 
primitive is configured to start the mysql instance - but is there a way to 
say instead of op start - op shared_ip_one?
 
 I've only started playing with Pacemaker and Corosync and am still digesting 
concepts - so I'm hoping what I'm trying to accomplish is 'simple' and I'm not 
able to see the forest for the trees.
 
 
 Many thanks in advance,
 Patrick
 
 
 `crm configure edit`
 
 
 node mycluster1.oz.com
 node mycluster2.oz.com
 primitive example_mysql1 ocf:heartbeat:mysql \
         params binary=mysqld_safe datadir=/var/lib/mysql/data 
pid=/var/lib/mysql/data/mysql.pid socket=/var/lib/mysql/data/mysql.sock 
log=/var/lib/mysql/data/mysql_cluster.log \
         op start interval=0 timeout=120 \
         op stop interval=0 timeout=120 \
         op monitor interval=10 timeout=30 depth=0
 primitive shared_ip_one ocf:heartbeat:IPaddr \
         params ip=10.4.1.100 cidr_netmask=255.255.255.0 nic=eth0 \
         op monitor interval=10s
 ms example_mysql1_master example_mysql1 \
         meta target-role=Slave
 location cli-prefer-example_mysql1 example_mysql1_master \
         rule $id=cli-prefer-rule-example_mysql1 inf: #uname eq 
mycluster2.oz.com
 location cli-prefer-shared_ip_one shared_ip_one \
         rule $id=cli-prefer-rule-shared_ip_one inf: #uname eq 
mycluster1.oz.com
 property $id=cib-bootstrap-options \
         dc-version=1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87 \
         cluster-infrastructure=openais \
         expected-quorum-votes=2 \
         stonith-enabled=false
 
 

Hi Patrick,

I am having the exact same issue with Pacemaker + Mysql, namely that mysql is 
restarted instead of the VIP being moved to the passive node.

I tried all possible combinations of migration-threshold, with no help.
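
(For the record, what I understand to be the standard combination is a
colocation of the VIP with mysql plus migration-threshold=1, so that a single
monitor failure bans mysql from the node and takes the VIP with it - a sketch
with hypothetical resource names:

    primitive my_mysql ocf:heartbeat:mysql \
            params config=/etc/mysql/my.cnf \
            op monitor interval=10 timeout=30 \
            meta migration-threshold=1 failure-timeout=60s
    primitive my_vip ocf:heartbeat:IPaddr2 \
            params ip=10.4.1.100 cidr_netmask=24 \
            op monitor interval=10s
    colocation vip_with_mysql inf: my_vip my_mysql
    order mysql_before_vip inf: my_mysql my_vip

With migration-threshold=1, the first failure moves mysql to the other node
instead of restarting it in place, and the colocation pulls the VIP along;
failure-timeout lets the failed node become eligible again later.)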

Were you able to resolve the issue with Andrew's assistance?

Thanks.





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Circular replication help needed - how to make sure VIP runs on same node with a healthy mysql

2011-11-06 Thread Attila Megyeri
Hi Florian,

First of all, thanks for getting back to me. You will find my answers inline.

-Original Message-
From: Florian Haas [mailto:flor...@hastexo.com] 
Sent: 2011. november 6. 15:34
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Circular replication help needed - how to make sure 
VIP runs on same node with a healthy mysql

On 2011-11-05 18:20, Attila Megyeri wrote:
 Hello,
 
  
 
 I am having a hard time configuring a relatively simple mysql environment.
 
  
 
 What I'd like to achieve is:
 
 * One master and a slave, with replication
 
 * Relatively quick failover if the master node, or Mysql fail.
 
  
 
 I tried the Mysql cluster approach, but it seemed too slow and had too 
 many limitations (foreign keys, triggers, views, etc.).
 
  
 
 I decided to go with the Mysql replication.

You want a simple 2-node MySQL cluster with failover? Why not go with DRBD 
then, as everyone else would in that situation?


My reasons:
- I am using virtualization, and DRBD seemed too complex compared to mysql 
replication.
- I had some experience with M/S mysql setups (one was actually available) and I 
thought that with pacemaker and the RAs I could implement automatic failover 
easily.
- I tried mysql ndbcluster as well, but was not happy with the results - so 
came back to mysql replication.
Then I read the article on clusterlabs and liked the idea of having two 
circular masters, and I thought this would be great - but I would need only one 
active master, where I would assign a VIP, and failover would finally be easy, 
as there is no need to track binlog positions, etc. So basically I wanted to 
use M/M with a VIP colocated with an ACTIVE mysql instance. But with a clone I 
cannot do this, so I chose an ms resource...




 Tried to use the mysql RA from clusterlabs 3.9.2 - but no luck, 
 replication simply did not work out.

Sorry to say this, but as a co-author of that agent I'll say that that's 
exactly the kind of feedback we strongly dislike, as it doesn't help us at all 
improving the agent, or its documentation. So,

- What were you trying to achieve?
- What was your configuration?
- What went wrong?
- What were you unable to fix?

Sorry if my post was too generic. I spent days trying to set it up and I 
failed, even though I don't give up on things easily.

My system is a virtualized Debian, with debian's mysql and pacemaker. I first 
tried the pacemaker from the stable branch (1.0.9), then the one from the 
backports (1.1.5).
As the RAs in 1.0.9 were very outdated, I installed them from the 3.9.2 tar.gz.

My intention was to convert my nicely working M/M circular replication into 
M/S, just to make sure that writes always go to the same node.

After applying the ms resource, I expected to see an active master node and a 
slave node receiving binlogs from the master; in case of master failure the 
slave would become the master, the VIP would be assigned to the newly promoted 
master node, and the applications would not notice any difference. 
Unfortunately this never happened.

I had many issues, some of them I was able to resolve, but then I simply gave 
up. Some of the issues I had:
- monitoring was not working for mysql. No idea why. VIP was being checked 
every X seconds, but mysql was not. Then somehow this started to work.
- Corosync froze many times, only kill -9 helped.
- So far these issues were not RA related, I know. But then - when I finally 
had my mysql master and slave up and running, the slave was not configured (by 
the RA) to receive the binlogs. (I checked the mysql log, and there was simply 
no CHANGE MASTER ... command - see the sketch below for what I would have 
expected.) I saw some STOP/START SLAVE commands and some read-only on/off 
commands, but the slave never received anything from the master.
- The node attributes in the crm configure showed invalid binlog entries.
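
For reference, what a working setup would issue on the slave is something like
this - a sketch with every value made up:

    CHANGE MASTER TO
        MASTER_HOST='db1',
        MASTER_USER='slave_user',
        MASTER_PASSWORD='secret',
        MASTER_LOG_FILE='mysql-bin.000042',
        MASTER_LOG_POS=107;
    START SLAVE;

after which SHOW SLAVE STATUS should report Slave_IO_Running and
Slave_SQL_Running as Yes.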

I just did not have any more time to spend on it - and you probably know how 
difficult it is to troubleshoot such issues - so I finally gave up, installed 
M/M circular replication and asked here for help :)

I have deleted my previous config so I cannot copy it here, but I asked some 
folks on this list who had similar problems, and the answers I got so far 
suggest that they weren't able to resolve their problems either.





For the DRBD based approach (which I would highly recommend), do consider 
taking a look at 
http://www.hastexo.com/content/mysql-high-availability-sprint-launch-pacemaker.
We'll be happy to provide you with the virtual images used in this tutorial, so 
you can set things up yourself in a cleanroom testing environment.

Cheers,
Florian


Cheers,

Attila
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org

[Pacemaker] Circular replication help needed - how to make sure VIP runs on same node with a healthy mysql

2011-11-05 Thread Attila Megyeri
Hello,

I am having a hard time configuring a relatively simple mysql environment.

What I'd like to achieve is:

* One master and a slave, with replication

* Relatively quick failover if the master node or Mysql fails.

I tried the Mysql cluster approach, but it seemed too slow and had too many 
limitations (foreign keys, triggers, views, etc.).

I decided to go with the Mysql replication.
I tried to use the mysql RA from clusterlabs 3.9.2 - but no luck, replication 
simply did not work out.

Then I tried the circular master/master replication approach described in

http://www.clusterlabs.org/wiki/Load_Balanced_MySQL_Replicated_Cluster

This gave the best results so far, BUT.

With this approach, the VIP floats between the two nodes, and if mysql fails 
on the node where the VIP is assigned, I am left without mysql.

I read through all the archives and I did not find anyone else with this 
problem - is that possible? For me this seemed to be a very basic setup.

My question is: if I follow the circular master/master replication approach, 
how can I make sure that the VIP is active on a node where I actually have a 
healthy mysql running? Unfortunately I cannot reference a particular member of a 
clone.
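
Or would a colocation with the clone as a whole be enough? Something like this
sketch against the config below - I was not able to confirm it:

    colocation vip_with_mysql inf: vip cl_mysql
    order mysql_before_vip inf: cl_mysql vip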

My config is:

node db1 \
attributes IP=10.100.1.31 standby=off
node db2 \
attributes IP=10.100.1.32 standby=off
primitive mysql ocf:heartbeat:mysql \
params binary=/usr/bin/mysqld_safe config=/etc/mysql/my.cnf 
datadir=/var/lib/mysql user=mysql pid=/var/run/mysqld/mysqld.pid 
socket=/var/run/mysqld/mysqld.sock test_passwd=pass 
test_table=replicatest.connectioncheck test_user=slave_user \
op start interval=0 timeout=120s \
op stop interval=0 timeout=120s \
op monitor interval=10 timeout=30s \
meta migration-threshold=10
primitive vip ocf:heartbeat:IPaddr2 \
params lvs_support=true ip=10.100.1.30 cidr_netmask=8 
broadcast=10.255.255.255 \
op monitor interval=20s timeout=20s \
meta migration-threshold=10
clone cl_mysql mysql \
meta clone-max=2
property $id=cib-bootstrap-options \
dc-version=1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b \
cluster-infrastructure=openais \
expected-quorum-votes=2 \
stonith-enabled=false



Thank you in advance!

Attila
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker