Re: [ClusterLabs] Mysql upgrade in DRBD setup
hi Ken, My problem with the scenario you described is the following: On the central side, if I use M-S replication, the master binlog information will be different on the master and the slave. Therefore, if a failover occurs, remote sites will have difficulties with the "change master" operation (the binlog file and position differ on the two hosts). This was the reason for choosing DRBD for the central master. Yes, Galera could be an option, but that would require some redesign, and the experience is also missing... What about the following for the DRBD upgrade:
- I would upgrade the active node normally, causing a small downtime (cluster in maintenance mode).
- Then, when the master is up and running again, I would mount a local dummy mysql dir on the slave (content does not matter) and perform the upgrade of the secondary node. (Program files would be upgraded, as would the dummy database I don't care about.)
- Then, finally, I would attempt a failover to the secondary, just to test that all is fine.
Besides the small downtime, I don't see any significant risks in this approach, do you? Thanks, Attila -Original Message- From: Ken Gaillot [mailto:kgail...@redhat.com] Sent: Friday, October 13, 2017 9:03 PM To: Cluster Labs - All topics related to open-source clustering welcomed Subject: Re: [ClusterLabs] Mysql upgrade in DRBD setup On Fri, 2017-10-13 at 17:35 +0200, Attila Megyeri wrote: > Hi Ken, Kristián, > > > Thanks - I am familiar with the native replication, and we use that as > well. > But in this scenario I have to use DRBD. (There is a DRBD Mysql > cluster that is a central site, which is replicated to many sites > using native replication, and all sites have DRBD clusters as well - > In this setup I have to use DRBD for high availability). > > > Anyway - I thought there is a better approach for the DRBD-replicated > Mysql than what I outlined. 
> What I am concerned about, is what will happen if I upgrade the active > node (let's say I'm okay with the downtime) - when I fail over to the > other node, where the program files and the data files are on > different versions...And when I start upgrading that. > > Any experience anyone? > > @Kristián: my experience shows that If I try to update mysql without a > mounted data fs - it will fail terribly... So the only option is to > upgrade the mounted, and active instance - but the issue is the > version difference (prog vs. data) Exactly -- which is why I'd still go with native replication for this, too. It just adds a step in the upgrade process I outlined earlier: repoint all the other sites' mysql instances to the second central server after it is upgraded (before or after it is made master, doesn't matter). I'm assuming only the master is allowed to write. Another alternative would be to use galera for multi-master (at least for the two servers at the central site). Also, it's still possible to use DRBD beneath a native replication setup, but you'd have to replicate both the master and slave data (using only one at a time on any given server). This makes more sense if the mysql servers are running inside VMs or containers that can migrate between the physical machines. > > Thanks! > > > > -Original Message- > From: Ken Gaillot [mailto:kgail...@redhat.com] > Sent: Thursday, October 12, 2017 9:22 PM > To: Cluster Labs - All topics related to open-source clustering > welcomed > Subject: Re: [ClusterLabs] Mysql upgrade in DRBD setup > > On Thu, 2017-10-12 at 18:51 +0200, Attila Megyeri wrote: > > Hi all, > > > > What is the recommended mysql server upgrade methodology in case of > > an active/passive DRBD storage? > > (Ubuntu is the platform) > > If you want to minimize downtime in a MySQL upgrade, your best bet is > to use MySQL native replication rather than replicate the storage. > > 1. starting point: node1 = master, node2 = slave 2. 
stop mysql on > node2, upgrade, start mysql again, ensure OK 3. switch master to > node2 and slave to node1, ensure OK 4. stop mysql on node1, upgrade, > start mysql again, ensure OK > > You might have a small window where the database is read-only while > you switch masters (you can keep it to a few seconds if you arrange > things well), but other than that, you won't have any downtime, even > if some part of the upgrade gives you trouble. > > > > > 1) On the passive node the mysql data directory is not mounted, > > so the backup fails (some postinstall jobs will attempt to perform > > manipulations on certain files in the data directory). > > 2) If the upgrade is done on the active node, it will restart > > the service (with the service restart, not in a crm managed fashion...)
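The maintenance-mode variant proposed above can be sketched as a command sequence. This is only an illustration of the idea, not a tested procedure: it assumes Ubuntu/apt, the crm shell, and that /var/lib/mysql is the DRBD-backed mount point; the dummy directory name and the names in angle brackets are placeholders.

```sh
# 1. Freeze cluster management and upgrade the active node (small downtime):
crm configure property maintenance-mode=true
apt-get install --only-upgrade mysql-server
mysql_upgrade                        # against the real, mounted datadir

# 2. On the passive node, give the package postinst a throwaway datadir so
#    program files can be upgraded without the DRBD device (content discarded):
mkdir -p /var/lib/mysql-dummy
mount --bind /var/lib/mysql-dummy /var/lib/mysql
apt-get install --only-upgrade mysql-server
umount /var/lib/mysql

# 3. Resume cluster management and test a controlled failover:
crm configure property maintenance-mode=false
crm resource move <mysql-group> <secondary-node>
```

One caveat worth checking beforehand: some MySQL packages refuse to configure against an empty datadir, so the dummy directory may need to be initialized first (e.g. with mysql_install_db).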
Re: [ClusterLabs] Mysql upgrade in DRBD setup
Hi Ken, Kristián, Thanks - I am familiar with the native replication, and we use that as well. But in this scenario I have to use DRBD. (There is a DRBD Mysql cluster that is a central site, which is replicated to many sites using native replication, and all sites have DRBD clusters as well - In this setup I have to use DRBD for high availability). Anyway - I thought there is a better approach for the DRBD-replicated Mysql than what I outlined. What I am concerned about, is what will happen if I upgrade the active node (let's say I'm okay with the downtime) - when I fail over to the other node, where the program files and the data files are on different versions...And when I start upgrading that. Any experience anyone? @Kristián: my experience shows that If I try to update mysql without a mounted data fs - it will fail terribly... So the only option is to upgrade the mounted, and active instance - but the issue is the version difference (prog vs. data) Thanks! -Original Message- From: Ken Gaillot [mailto:kgail...@redhat.com] Sent: Thursday, October 12, 2017 9:22 PM To: Cluster Labs - All topics related to open-source clustering welcomed Subject: Re: [ClusterLabs] Mysql upgrade in DRBD setup On Thu, 2017-10-12 at 18:51 +0200, Attila Megyeri wrote: > Hi all, > > What is the recommended mysql server upgrade methodology in case of an > active/passive DRBD storage? > (Ubuntu is the platform) If you want to minimize downtime in a MySQL upgrade, your best bet is to use MySQL native replication rather than replicate the storage. 1. starting point: node1 = master, node2 = slave 2. stop mysql on node2, upgrade, start mysql again, ensure OK 3. switch master to node2 and slave to node1, ensure OK 4. 
stop mysql on node1, upgrade, start mysql again, ensure OK You might have a small window where the database is read-only while you switch masters (you can keep it to a few seconds if you arrange things well), but other than that, you won't have any downtime, even if some part of the upgrade gives you trouble. > > 1) On the passive node the mysql data directory is not mounted, > so the backup fails (some postinstall jobs will attempt to perform > manipulations on certain files in the data directory). > 2) If the upgrade is done on the active node, it will restart the > service (with the service restart, not in a crm managed fashion…), > which is not a very good option (downtime in a HA solution). Not to > mention, that it will update some files in the mysql data directory, > which can cause strange issues if the A/P pair is changed – since on > the other node the program code will still be the old one, while the > data dir is already upgraded. > > Any hints are welcome! > > Thanks, > Attila > > ___ > Users mailing list: Users@clusterlabs.org > http://lists.clusterlabs.org/mailman/listinfo/users > > Project Home: http://www.clusterlabs.org Getting started: > http://www.clusterlabs.org/doc/Cluster_from_Scratch. > pdf > Bugs: http://bugs.clusterlabs.org -- Ken Gaillot
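The "switch master" step described above, like the repointing of remote sites, comes down to reading SHOW MASTER STATUS on the new master and issuing CHANGE MASTER TO on each slave. A small illustration of the statement involved (the host name and binlog coordinates here are invented for the example):

```python
def change_master_sql(host, log_file, log_pos, user="repl"):
    """Build the CHANGE MASTER TO statement used to repoint a slave.

    The coordinates must be taken from SHOW MASTER STATUS on the *new*
    master, which is exactly why per-host binlog positions complicate
    failover unless the binlogs are kept identical (e.g. via DRBD) or
    GTID-based replication is used instead.
    """
    return (
        "CHANGE MASTER TO "
        f"MASTER_HOST='{host}', MASTER_USER='{user}', "
        f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos};"
    )

# Hypothetical coordinates from the newly promoted central node:
print(change_master_sql("central2.example.com", "mysql-bin.000042", 107))
```
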
Re: [ClusterLabs] Mysql upgrade in DRBD setup
Hi, This does not really answer my question. Placing the cluster into maintenance mode just avoids monitoring and restarting, but what about the things I am asking below? (data dir related questions) thanks From: Kristián Feldsam [mailto:ad...@feldhost.cz] Sent: Thursday, October 12, 2017 7:51 PM To: Attila Megyeri ; users@clusterlabs.org Subject: Re: [ClusterLabs] Mysql upgrade in DRBD setup hello, you should put cluster to maintenance mode Sent from my MI 5 On Attila Megyeri <amegy...@minerva-soft.com>, Oct 12, 2017 6:55 PM wrote: Hi all, What is the recommended mysql server upgrade methodology in case of an active/passive DRBD storage? (Ubuntu is the platform) 1) On the passive node the mysql data directory is not mounted, so the backup fails (some postinstall jobs will attempt to perform manipulations on certain files in the data directory). 2) If the upgrade is done on the active node, it will restart the service (with the service restart, not in a crm managed fashion…), which is not a very good option (downtime in a HA solution). Not to mention, that it will update some files in the mysql data directory, which can cause strange issues if the A/P pair is changed – since on the other node the program code will still be the old one, while the data dir is already upgraded. Any hints are welcome! Thanks, Attila
[ClusterLabs] Mysql upgrade in DRBD setup
Hi all, What is the recommended mysql server upgrade methodology in case of an active/passive DRBD storage? (Ubuntu is the platform) 1) On the passive node the mysql data directory is not mounted, so the backup fails (some postinstall jobs will attempt to perform manipulations on certain files in the data directory). 2) If the upgrade is done on the active node, it will restart the service (with the service restart, not in a crm managed fashion...), which is not a very good option (downtime in a HA solution). Not to mention, that it will update some files in the mysql data directory, which can cause strange issues if the A/P pair is changed - since on the other node the program code will still be the old one, while the data dir is already upgraded. Any hints are welcome! Thanks, Attila
Re: [ClusterLabs] clearing failed actions
One more thing to add. Two almost identical clusters, with the identical asterisk primitive, produce a different crm_verify output: one cluster returns no warnings, whereas the other one complains. On the problematic one:

crm_verify --live-check -VV
warning: get_failcount_full: Setting asterisk.failure_timeout=120 in asterisk-stop-0 conflicts with on-fail=block: ignoring timeout
Warnings found during check: config may not be valid

The relevant primitive is in both clusters:

primitive asterisk ocf:heartbeat:asterisk \
    op monitor interval="10s" timeout="45s" on-fail="restart" \
    op start interval="0" timeout="60s" on-fail="standby" \
    op stop interval="0" timeout="60s" on-fail="block" \
    meta migration-threshold="3" failure-timeout="2m"

Why is the same configuration valid in one, but not in the other cluster? Shall I simply omit the "op stop" line? thanks :) Attila > -Original Message- > From: Attila Megyeri [mailto:amegy...@minerva-soft.com] > Sent: Monday, June 19, 2017 9:47 PM > To: Cluster Labs - All topics related to open-source clustering welcomed > ; kgail...@redhat.com > Subject: Re: [ClusterLabs] clearing failed actions > > I did another experiment, even simpler. > > Created one node, one resource, using pacemaker 1.1.14 on ubuntu. > > Configured failcount to 1, migration threshold to 2, failure timeout to 1 > minute. 
> > crm_mon: > > Last updated: Mon Jun 19 19:43:41 2017 Last change: Mon Jun 19 > 19:37:09 2017 by root via cibadmin on test > Stack: corosync > Current DC: test (version 1.1.14-70404b0) - partition with quorum > 1 node and 1 resource configured > > Online: [ test ] > > db-ip-master(ocf::heartbeat:IPaddr2): Started test > > Node Attributes: > * Node test: > > Migration Summary: > * Node test: >db-ip-master: migration-threshold=2 fail-count=1 > > crm verify: > > crm_verify --live-check - > info: validate_with_relaxng:Creating RNG parser context > info: determine_online_status: Node test is online > info: get_failcount_full: db-ip-master has failed 1 times on test > info: get_failcount_full: db-ip-master has failed 1 times on test > info: get_failcount_full: db-ip-master has failed 1 times on test > info: get_failcount_full: db-ip-master has failed 1 times on test > info: native_print: db-ip-master(ocf::heartbeat:IPaddr2): > Started test > info: get_failcount_full: db-ip-master has failed 1 times on test > info: common_apply_stickiness: db-ip-master can fail 1 more times on > test before being forced off > info: LogActions: Leave db-ip-master(Started test) > > > crm configure is: > > node 168362242: test \ > attributes standby=off > primitive db-ip-master IPaddr2 \ > params lvs_support=true ip=10.9.1.10 cidr_netmask=24 > broadcast=10.9.1.255 \ > op start interval=0 timeout=20s on-fail=restart \ > op monitor interval=20s timeout=20s \ > op stop interval=0 timeout=20s on-fail=block \ > meta migration-threshold=2 failure-timeout=1m target-role=Started > location loc1 db-ip-master 0: test > property cib-bootstrap-options: \ > have-watchdog=false \ > dc-version=1.1.14-70404b0 \ > cluster-infrastructure=corosync \ > stonith-enabled=false \ > cluster-recheck-interval=30s \ > symmetric-cluster=false > > > > > Corosync log: > > > Jun 19 19:45:07 [331] test crmd: notice: do_state_transition: State > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC > cause=C_TIMER_POPPED 
origin=crm_timer_popped ] > Jun 19 19:45:07 [330] testpengine: info: process_pe_message:Input > has > not changed since last time, not saving to disk > Jun 19 19:45:07 [330] testpengine: info: determine_online_status: > Node test is online > Jun 19 19:45:07 [330] testpengine: info: get_failcount_full: > db-ip-master > has failed 1 times on test > Jun 19 19:45:07 [330] testpengine: info: get_failcount_full: > db-ip-master > has failed 1 times on test > Jun 19 19:45:07 [330] testpengine: info: get_failcount_full: > db-ip-master > has failed 1 times on test > Jun 19 19:45:07 [330] testpengine: info: get_failcount_full: > db-ip-master > has failed 1 times on test > Jun 19 19:45:07 [330] testpengine: info: native_print: db-ip-master > (ocf::heartbeat:IPaddr2): Started test > Jun 19 19:45:07 [330] testpengine: info: get_failcount_full: > db-ip-ma
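Regarding the crm_verify warning quoted above: since pacemaker reports that failure-timeout is ignored when the stop operation has on-fail=block, one way to make the configuration self-consistent is to pick one of the two mechanisms. Below is a sketch of the variant that lets failures expire, i.e. the same primitive with the on-fail=block dropped from the stop op; whether that recovery policy is acceptable depends on the setup, and note that a failed stop defaults to fencing when STONITH is enabled:

```
primitive asterisk ocf:heartbeat:asterisk \
    op monitor interval="10s" timeout="45s" on-fail="restart" \
    op start interval="0" timeout="60s" on-fail="standby" \
    op stop interval="0" timeout="60s" \
    meta migration-threshold="3" failure-timeout="2m"
```

The opposite choice - keeping on-fail="block" and removing failure-timeout - is equally consistent, at the cost of manual cleanups after each failure.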
Re: [ClusterLabs] clearing failed actions
I did another experiment, even simpler. Created one node, one resource, using pacemaker 1.1.14 on ubuntu. Configured failcount to 1, migration threshold to 2, failure timeout to 1 minute. crm_mon: Last updated: Mon Jun 19 19:43:41 2017 Last change: Mon Jun 19 19:37:09 2017 by root via cibadmin on test Stack: corosync Current DC: test (version 1.1.14-70404b0) - partition with quorum 1 node and 1 resource configured Online: [ test ] db-ip-master(ocf::heartbeat:IPaddr2): Started test Node Attributes: * Node test: Migration Summary: * Node test: db-ip-master: migration-threshold=2 fail-count=1 crm verify: crm_verify --live-check - info: validate_with_relaxng:Creating RNG parser context info: determine_online_status: Node test is online info: get_failcount_full: db-ip-master has failed 1 times on test info: get_failcount_full: db-ip-master has failed 1 times on test info: get_failcount_full: db-ip-master has failed 1 times on test info: get_failcount_full: db-ip-master has failed 1 times on test info: native_print: db-ip-master(ocf::heartbeat:IPaddr2): Started test info: get_failcount_full: db-ip-master has failed 1 times on test info: common_apply_stickiness: db-ip-master can fail 1 more times on test before being forced off info: LogActions: Leave db-ip-master(Started test) crm configure is: node 168362242: test \ attributes standby=off primitive db-ip-master IPaddr2 \ params lvs_support=true ip=10.9.1.10 cidr_netmask=24 broadcast=10.9.1.255 \ op start interval=0 timeout=20s on-fail=restart \ op monitor interval=20s timeout=20s \ op stop interval=0 timeout=20s on-fail=block \ meta migration-threshold=2 failure-timeout=1m target-role=Started location loc1 db-ip-master 0: test property cib-bootstrap-options: \ have-watchdog=false \ dc-version=1.1.14-70404b0 \ cluster-infrastructure=corosync \ stonith-enabled=false \ cluster-recheck-interval=30s \ symmetric-cluster=false Corosync log: Jun 19 19:45:07 [331] test crmd: notice: do_state_transition: State transition S_IDLE 
-> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ] Jun 19 19:45:07 [330] testpengine: info: process_pe_message:Input has not changed since last time, not saving to disk Jun 19 19:45:07 [330] testpengine: info: determine_online_status: Node test is online Jun 19 19:45:07 [330] testpengine: info: get_failcount_full: db-ip-master has failed 1 times on test Jun 19 19:45:07 [330] testpengine: info: get_failcount_full: db-ip-master has failed 1 times on test Jun 19 19:45:07 [330] testpengine: info: get_failcount_full: db-ip-master has failed 1 times on test Jun 19 19:45:07 [330] testpengine: info: get_failcount_full: db-ip-master has failed 1 times on test Jun 19 19:45:07 [330] testpengine: info: native_print: db-ip-master (ocf::heartbeat:IPaddr2): Started test Jun 19 19:45:07 [330] testpengine: info: get_failcount_full: db-ip-master has failed 1 times on test Jun 19 19:45:07 [330] testpengine: info: common_apply_stickiness: db-ip-master can fail 1 more times on test before being forced off Jun 19 19:45:07 [330] testpengine: info: LogActions:Leave db-ip-master(Started test) Jun 19 19:45:07 [330] testpengine: notice: process_pe_message: Calculated Transition 34: /var/lib/pacemaker/pengine/pe-input-6.bz2 Jun 19 19:45:07 [331] test crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] Jun 19 19:45:07 [331] test crmd: notice: run_graph: Transition 34 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Complete Jun 19 19:45:07 [331] test crmd: info: do_log:FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE Jun 19 19:45:07 [331] test crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] I hope someone can help me figure this out :) Thanks! 
> -Original Message----- > From: Attila Megyeri [mailto:amegy...@minerva-soft.com] > Sent: Monday, June 19, 2017 7:45 PM > To: kgail...@redhat.com; Cluster Labs - All topics related to open-source > clustering welcomed > Subject: Re: [ClusterLabs] clearing failed actions > > Hi Ken, > > /sorry for the long text/ > > I have created a relatively simple setup to localize the issue. > Three nodes, no fencing, just a master/slave mysql with two virtual IPs. > Just as a reminder, my primary issue is that on clu
Re: [ClusterLabs] clearing failed actions
] ctmgr crmd:debug: do_state_transition: Starting PEngine Recheck Timer Jun 19 17:37:06 [18998] ctmgr crmd:debug: crm_timer_start:Started PEngine Recheck Timer (I_PE_CALC:3ms), src=277 As you can see from the logs, pacemaker does not even try to re-monitor the resource that had a failure, or at least I'm not seeing it. Cluster recheck interval is set to 30 seconds for troubleshooting reasons. If I execute a crm resource cleanup db-ip-master, the failure is removed. Now, am I getting something terribly wrong here? Or is this simply a bug in 1.1.10? Thanks, Attila > -Original Message- > From: Ken Gaillot [mailto:kgail...@redhat.com] > Sent: Wednesday, June 7, 2017 10:14 PM > To: Attila Megyeri ; Cluster Labs - All topics > related to open-source clustering welcomed > Subject: Re: [ClusterLabs] clearing failed actions > > On 06/01/2017 02:44 PM, Attila Megyeri wrote: > > Ken, > > > > I noticed something strange, this might be the issue. > > > > In some cases, even the manual cleanup does not work. > > > > I have a failed action of resource "A" on node "a". DC is node "b". > > > > e.g. > > Failed actions: > > jboss_imssrv1_monitor_1 (node=ctims1, call=108, rc=1, > status=complete, last-rc-change=Thu Jun 1 14:13:36 2017 > > > > > > When I attempt to do a "crm resource cleanup A" from node "b", nothing > happens. Basically the lrmd on "a" is not notified that it should monitor the > resource. > > > > > > When I execute a "crm resource cleanup A" command on node "a" (where > the operation failed), the failed action is cleared properly. > > > > Why could this be happening? > > Which component should be responsible for this? pengine, crmd, lrmd? > > The crm shell will send commands to attrd (to clear fail counts) and > crmd (to clear the resource history), which in turn will record changes > in the cib. > > I'm not sure how crm shell implements it, but crm_resource sends > individual messages to each node when cleaning up a resource without > specifying a particular node. 
You could check the pacemaker log on each > node to see whether attrd and crmd are receiving those commands, and > what they do in response. > > > >> -Original Message- > >> From: Attila Megyeri [mailto:amegy...@minerva-soft.com] > >> Sent: Thursday, June 1, 2017 6:57 PM > >> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source > >> clustering welcomed > >> Subject: Re: [ClusterLabs] clearing failed actions > >> > >> thanks Ken, > >> > >> > >> > >> > >> > >>> -Original Message- > >>> From: Ken Gaillot [mailto:kgail...@redhat.com] > >>> Sent: Thursday, June 1, 2017 12:04 AM > >>> To: users@clusterlabs.org > >>> Subject: Re: [ClusterLabs] clearing failed actions > >>> > >>> On 05/31/2017 12:17 PM, Ken Gaillot wrote: > >>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote: > >>>>> Hi Ken, > >>>>> > >>>>> > >>>>>> -Original Message- > >>>>>> From: Ken Gaillot [mailto:kgail...@redhat.com] > >>>>>> Sent: Tuesday, May 30, 2017 4:32 PM > >>>>>> To: users@clusterlabs.org > >>>>>> Subject: Re: [ClusterLabs] clearing failed actions > >>>>>> > >>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote: > >>>>>>> Hi, > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> Shouldn't the > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> cluster-recheck-interval="2m" > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> property instruct pacemaker to recheck the cluster every 2 minutes > >>> and > >>>>>>> clean the failcounts? > >>>>>> > >>>>>> It instructs pacemaker to recalculate whether any actions need to be > >>>>>> taken (including expiring any failcounts appropriately). > >>>>>> > >>>>>>> At the primitive level I also have a > >>>>>>> > >>>>>>> > >>>>>>> > >>>>&g
Re: [ClusterLabs] clearing failed actions
Ken, I noticed something strange, this might be the issue. In some cases, even the manual cleanup does not work. I have a failed action of resource "A" on node "a". DC is node "b". e.g. Failed actions: jboss_imssrv1_monitor_1 (node=ctims1, call=108, rc=1, status=complete, last-rc-change=Thu Jun 1 14:13:36 2017 When I attempt to do a "crm resource cleanup A" from node "b", nothing happens. Basically the lrmd on "a" is not notified that it should monitor the resource. When I execute a "crm resource cleanup A" command on node "a" (where the operation failed) , the failed action is cleared properly. Why could this be happening? Which component should be responsible for this? pengine, crmd, lrmd? > -Original Message- > From: Attila Megyeri [mailto:amegy...@minerva-soft.com] > Sent: Thursday, June 1, 2017 6:57 PM > To: kgail...@redhat.com; Cluster Labs - All topics related to open-source > clustering welcomed > Subject: Re: [ClusterLabs] clearing failed actions > > thanks Ken, > > > > > > > -Original Message- > > From: Ken Gaillot [mailto:kgail...@redhat.com] > > Sent: Thursday, June 1, 2017 12:04 AM > > To: users@clusterlabs.org > > Subject: Re: [ClusterLabs] clearing failed actions > > > > On 05/31/2017 12:17 PM, Ken Gaillot wrote: > > > On 05/30/2017 02:50 PM, Attila Megyeri wrote: > > >> Hi Ken, > > >> > > >> > > >>> -Original Message- > > >>> From: Ken Gaillot [mailto:kgail...@redhat.com] > > >>> Sent: Tuesday, May 30, 2017 4:32 PM > > >>> To: users@clusterlabs.org > > >>> Subject: Re: [ClusterLabs] clearing failed actions > > >>> > > >>> On 05/30/2017 09:13 AM, Attila Megyeri wrote: > > >>>> Hi, > > >>>> > > >>>> > > >>>> > > >>>> Shouldn't the > > >>>> > > >>>> > > >>>> > > >>>> cluster-recheck-interval="2m" > > >>>> > > >>>> > > >>>> > > >>>> property instruct pacemaker to recheck the cluster every 2 minutes > > and > > >>>> clean the failcounts? 
> > >>> > > >>> It instructs pacemaker to recalculate whether any actions need to be > > >>> taken (including expiring any failcounts appropriately). > > >>> > > >>>> At the primitive level I also have a > > >>>> > > >>>> > > >>>> > > >>>> migration-threshold="30" failure-timeout="2m" > > >>>> > > >>>> > > >>>> > > >>>> but whenever I have a failure, it remains there forever. > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> What could be causing this? > > >>>> > > >>>> > > >>>> > > >>>> thanks, > > >>>> > > >>>> Attila > > >>> Is it a single old failure, or a recurring failure? The failure timeout > > >>> works in a somewhat nonintuitive way. Old failures are not individually > > >>> expired. Instead, all failures of a resource are simultaneously cleared > > >>> if all of them are older than the failure-timeout. So if something keeps > > >>> failing repeatedly (more frequently than the failure-timeout), none of > > >>> the failures will be cleared. > > >>> > > >>> If it's not a repeating failure, something odd is going on. > > >> > > >> It is not a repeating failure. Let's say that a resource fails for > > >> whatever > > action, It will remain in the failed actions (crm_mon -Af) until I issue a > > "crm > > resource cleanup ". Even after days or weeks, even > though > > I see in the logs that cluster is rechecked every 120 seconds. > > >> > > >> How could I troubleshoot this issue? > > >> > > >> thanks! > > > > > > > > > Ah, I see what you're saying. That's expected behavior. > > > > > > The failure-timeout applies to the failure *count* (which is used for > > > checking against migration-threshold), not the failure *history* (which > > > is used
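The symptom described earlier in this thread (cleanup only working when issued on the failed node itself) can be narrowed down by targeting the node explicitly, since crm_resource accepts a node argument for cleanup. These commands are a sketch using the resource and node names from the example above ("A" on node "a"):

```sh
# Low-level tool, can be run from any node in the cluster:
crm_resource --resource A --cleanup --node a

# crm shell equivalent (argument syntax varies between crmsh versions):
crm resource cleanup A a
```

Comparing the pacemaker logs on node "a" after each variant should show whether the cleanup request reaches attrd/crmd there at all.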
Re: [ClusterLabs] clearing failed actions
thanks Ken, > -Original Message- > From: Ken Gaillot [mailto:kgail...@redhat.com] > Sent: Thursday, June 1, 2017 12:04 AM > To: users@clusterlabs.org > Subject: Re: [ClusterLabs] clearing failed actions > > On 05/31/2017 12:17 PM, Ken Gaillot wrote: > > On 05/30/2017 02:50 PM, Attila Megyeri wrote: > >> Hi Ken, > >> > >> > >>> -Original Message- > >>> From: Ken Gaillot [mailto:kgail...@redhat.com] > >>> Sent: Tuesday, May 30, 2017 4:32 PM > >>> To: users@clusterlabs.org > >>> Subject: Re: [ClusterLabs] clearing failed actions > >>> > >>> On 05/30/2017 09:13 AM, Attila Megyeri wrote: > >>>> Hi, > >>>> > >>>> > >>>> > >>>> Shouldn't the > >>>> > >>>> > >>>> > >>>> cluster-recheck-interval="2m" > >>>> > >>>> > >>>> > >>>> property instruct pacemaker to recheck the cluster every 2 minutes > and > >>>> clean the failcounts? > >>> > >>> It instructs pacemaker to recalculate whether any actions need to be > >>> taken (including expiring any failcounts appropriately). > >>> > >>>> At the primitive level I also have a > >>>> > >>>> > >>>> > >>>> migration-threshold="30" failure-timeout="2m" > >>>> > >>>> > >>>> > >>>> but whenever I have a failure, it remains there forever. > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> What could be causing this? > >>>> > >>>> > >>>> > >>>> thanks, > >>>> > >>>> Attila > >>> Is it a single old failure, or a recurring failure? The failure timeout > >>> works in a somewhat nonintuitive way. Old failures are not individually > >>> expired. Instead, all failures of a resource are simultaneously cleared > >>> if all of them are older than the failure-timeout. So if something keeps > >>> failing repeatedly (more frequently than the failure-timeout), none of > >>> the failures will be cleared. > >>> > >>> If it's not a repeating failure, something odd is going on. > >> > >> It is not a repeating failure. 
Let's say that a resource fails for whatever > action, It will remain in the failed actions (crm_mon -Af) until I issue a > "crm > resource cleanup ". Even after days or weeks, even though > I see in the logs that cluster is rechecked every 120 seconds. > >> > >> How could I troubleshoot this issue? > >> > >> thanks! > > > > > > Ah, I see what you're saying. That's expected behavior. > > > > The failure-timeout applies to the failure *count* (which is used for > > checking against migration-threshold), not the failure *history* (which > > is used for the status display). > > > > The idea is to have it no longer affect the cluster behavior, but still > > allow an administrator to know that it happened. That's why a manual > > cleanup is required to clear the history. > > Hmm, I'm wrong there ... failure-timeout does expire the failure history > used for status display. > > It works with the current versions. It's possible 1.1.10 had issues with > that. > Well if nothing helps I will try to upgrade to a more recent version.. > Check the status to see which node is DC, and look at the pacemaker log > there after the failure occurred. There should be a message about the > failcount expiring. You can also look at the live CIB and search for > last_failure to see what is used for the display. [AM] In the pacemaker log I see at every recheck interval the following lines: Jun 01 16:54:08 [8700] ctabsws2pengine: warning: unpack_rsc_op: Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1) If I check the CIB for the failure I see: Really have no clue why this isn't cleared... 
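Ken's expiry rule from the quoted message - all failures of a resource clear together only when every one of them is older than failure-timeout - can be captured in a toy model. This is only an illustration of the described semantics, not pacemaker's actual code:

```python
def failcount_after_recheck(failure_times, now, failure_timeout):
    """Toy model of the described rule: the fail count drops to zero only
    if *every* recorded failure is older than failure_timeout; otherwise
    none of the failures expire."""
    if all(now - t > failure_timeout for t in failure_times):
        return 0
    return len(failure_times)

# failure-timeout=120s, two failures at t=0 and t=50, checked at t=200:
print(failcount_after_recheck([0, 50], 200, 120))   # both old enough -> 0
# a repeating failure (latest at t=150) keeps the whole history alive:
print(failcount_after_recheck([0, 150], 200, 120))  # -> 2
```

This also explains the repeated-failure case: as long as the resource keeps failing more often than the timeout, the count never expires.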
Re: [ClusterLabs] Antw: Re: Antw: clearing failed actions
Thanks. We have several clusters working fine for several years, without STONITH, so we did not really bother to implement it. As for the failed actions - I cannot recall since when these are not cleared, but they aren't. When I check the pengine log on the DC, at every recheck interval I see lines like: ...pengine: warning: unpack_rsc_op:Processing failed op start for jboss_admin1 on ctadmin1: unknown error (1) or ...pengine: warning: unpack_rsc_op:Processing failed op monitor for jboss_abssrv2 on ctabs2: unknown error (1) And these are the failed actions visible in the crm_mon -f as well: Failed actions: jboss_admin1_start_0 (node=ctadmin1, call=120, rc=1, status=Timed Out, last-rc-change=Thu Jun 1 14:17:31 2017 , queued=40001ms, exec=0ms ): unknown error jboss_abssrv2_monitor_1 (node=ctabs2, call=106, rc=1, status=complete, last-rc-change=Thu Jun 1 14:13:36 2017 , queued=0ms, exec=0ms ): unknown error If I do a resource cleanup, the errors are gone. At the same time I see no actions on the mentioned nodes - this log is from the DC... On the mentioned nodes a regular monitoring operation is performed, and results in 0 - no error. What am I missing here? > -Original Message- > From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de] > Sent: Thursday, June 1, 2017 8:34 AM > To: users@clusterlabs.org > Subject: [ClusterLabs] Antw: Re: Antw: clearing failed actions > > >>> Digimer schrieb am 01.06.2017 um 00:03 in Nachricht > <50aad2be-185b-0348-6a93-987034c9c...@alteeve.ca>: > [...] > > I don't know, but according to Ken's last email, what you're seeing is > > expected. I replied because of the misunderstanding of the roles > > quorum and fencing play. Running a cluster without fencing is dangerous. > > I'd recommend this: Enable a working STONITH. Then if you see your cluster > never uses STONITH, and everything works fine, and you feel you don't > need it, then you can get rid of it. 
> But don't try the other way 'round: Omit STONITH, expecting the cluster
> would work flawlessly, then (not) add STONITH.
>
> Regards,
> Ulrich
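For readers hitting the same symptom: the failcounts and failed-action entries discussed above can be inspected and cleared by hand. A sketch using the pacemaker 1.1-era command-line tools this thread is about (the resource and node names are taken from the crm_mon output quoted earlier; run on a cluster node):

```shell
# One-shot cluster status including per-resource fail counts
# and the "Failed actions" section shown in the thread
crm_mon -1 -f

# Clear the failure history of one resource on one node; this is
# roughly what "crm resource cleanup" does under the hood
crm_resource --cleanup --resource jboss_admin1 --node ctadmin1
```

Note that cleanup only erases the recorded history; if the underlying failure keeps recurring, the entries reappear at the next monitor interval.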
Re: [ClusterLabs] Antw: clearing failed actions
> -Original Message- > From: Digimer [mailto:li...@alteeve.ca] > Sent: Wednesday, May 31, 2017 2:20 PM > To: Cluster Labs - All topics related to open-source clustering welcomed > ; Attila Megyeri > Subject: Re: [ClusterLabs] Antw: clearing failed actions > > On 31/05/17 07:52 PM, Attila Megyeri wrote: > > Hi, > > > > > > > >> -Original Message- > >> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de] > >> Sent: Wednesday, May 31, 2017 8:52 AM > >> To: users@clusterlabs.org > >> Subject: [ClusterLabs] Antw: clearing failed actions > >> > >>>>> Attila Megyeri schrieb am 30.05.2017 > um > >> 16:13 in > >> Nachricht > >> > >> soft.local>: > >>> Hi, > >>> > >>> Shouldn't the > >>> > >>> cluster-recheck-interval="2m" > >>> > >>> property instruct pacemaker to recheck the cluster every 2 minutes and > >> clean > >>> the failcounts? > >>> > >>> At the primitive level I also have a > >>> > >>> migration-threshold="30" failure-timeout="2m" > >>> > >>> but whenever I have a failure, it remains there forever. > >> > >> What type of failure do you have, and what is the status after that? Do > you > >> have fencing enabled? > >> > > > > Typically a failed start, or a failed monitor. > > Fencing is disabled as we have multiple nodes / quorum. > > Stonith and quorum solve different problems. Stonith is required, quorum > is optional. > > https://www.alteeve.com/w/The_2-Node_Myth > > > Pacemaker is 1.1.10. > > I see your point, but does it relate to the failcount issue? By turning stonith off, the fail counters will not be removed even if the service recovers immediately after restart? > > > > > >>> > >>> > >>> What could be causing this? 
> >>> thanks,
> >>> Attila
>
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
Re: [ClusterLabs] Antw: clearing failed actions
Hi, > -Original Message- > From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de] > Sent: Wednesday, May 31, 2017 8:52 AM > To: users@clusterlabs.org > Subject: [ClusterLabs] Antw: clearing failed actions > > >>> Attila Megyeri schrieb am 30.05.2017 um > 16:13 in > Nachricht > soft.local>: > > Hi, > > > > Shouldn't the > > > > cluster-recheck-interval="2m" > > > > property instruct pacemaker to recheck the cluster every 2 minutes and > clean > > the failcounts? > > > > At the primitive level I also have a > > > > migration-threshold="30" failure-timeout="2m" > > > > but whenever I have a failure, it remains there forever. > > What type of failure do you have, and what is the status after that? Do you > have fencing enabled? > Typically a failed start, or a failed monitor. Fencing is disabled as we have multiple nodes / quorum. Pacemaker is 1.1.10. > > > > > > What could be causing this? > > > > thanks, > > Attila
Re: [ClusterLabs] clearing failed actions
Hi Ken, > -Original Message- > From: Ken Gaillot [mailto:kgail...@redhat.com] > Sent: Tuesday, May 30, 2017 4:32 PM > To: users@clusterlabs.org > Subject: Re: [ClusterLabs] clearing failed actions > > On 05/30/2017 09:13 AM, Attila Megyeri wrote: > > Hi, > > > > > > > > Shouldn't the > > > > > > > > cluster-recheck-interval="2m" > > > > > > > > property instruct pacemaker to recheck the cluster every 2 minutes and > > clean the failcounts? > > It instructs pacemaker to recalculate whether any actions need to be > taken (including expiring any failcounts appropriately). > > > At the primitive level I also have a > > > > > > > > migration-threshold="30" failure-timeout="2m" > > > > > > > > but whenever I have a failure, it remains there forever. > > > > > > > > > > > > What could be causing this? > > > > > > > > thanks, > > > > Attila > Is it a single old failure, or a recurring failure? The failure timeout > works in a somewhat nonintuitive way. Old failures are not individually > expired. Instead, all failures of a resource are simultaneously cleared > if all of them are older than the failure-timeout. So if something keeps > failing repeatedly (more frequently than the failure-timeout), none of > the failures will be cleared. > > If it's not a repeating failure, something odd is going on. It is not a repeating failure. Let's say that a resource fails for whatever action, It will remain in the failed actions (crm_mon -Af) until I issue a "crm resource cleanup ". Even after days or weeks, even though I see in the logs that cluster is rechecked every 120 seconds. How could I troubleshoot this issue? thanks! 
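Ken's description of failure-timeout can be illustrated with a tiny stand-alone sketch. This is plain shell mimicking the rule he states, not pacemaker's actual implementation: expiry is all-or-nothing per resource, so a single failure newer than the timeout keeps the whole failure history alive.

```shell
failure_timeout=120   # failure-timeout="2m", in seconds

# Succeeds only if EVERY recorded failure age (seconds) exceeds the
# failure-timeout - only then is the whole failure history cleared.
all_expired() {
    for age in "$@"; do
        [ "$age" -ge "$failure_timeout" ] || return 1
    done
    return 0
}

all_expired 300 400 500 && echo "old failures: cleared"
all_expired 300 400 60  || echo "recurring failure: kept"
```

So a resource that keeps failing every minute with failure-timeout="2m" never has anything expired, which matches Ken's "somewhat nonintuitive" description.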
[ClusterLabs] clearing failed actions
Hi, Shouldn't the cluster-recheck-interval="2m" property instruct pacemaker to recheck the cluster every 2 minutes and clean the failcounts? At the primitive level I also have a migration-threshold="30" failure-timeout="2m" but whenever I have a failure, it remains there forever. What could be causing this? thanks, Attila
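For reference, the two settings in this message live in different places: cluster-recheck-interval is a cluster-wide property, while migration-threshold and failure-timeout are per-resource meta attributes. A crm shell sketch (the Dummy resource is illustrative, not from this thread's configuration):

```shell
# Cluster-wide: how often the policy engine re-evaluates the cluster
# even when no event has occurred
crm configure property cluster-recheck-interval="2m"

# Per resource: ban the resource from a node after 30 failures there,
# and let the failure history expire after 2 minutes
crm configure primitive dummy ocf:heartbeat:Dummy \
    meta migration-threshold="30" failure-timeout="2m"
```

The recheck interval only bounds how often expiry is evaluated; whether anything actually expires is governed by failure-timeout and the ages of the recorded failures.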
Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
Hi Klaus, Thank you for your response. I tried many things, but no luck. We have many pacemaker clusters with 99% identical configurations and package versions, and only this one causes issues. (BTW we use unicast for corosync, but this is the same for our other clusters as well.) I checked all connection settings between the nodes (to confirm there are no firewall issues), increased the number of cores on each node, but still - as long as a monitor operation is pending for a resource, no other operation is executed. E.g. while resource A is being monitored with a 90-second timeout, until this check times out I cannot do a cleanup or start/stop on any other resource. Two more interesting things: - cluster recheck is set to 2 minutes, and even though the resources are running properly, the fail counters are not reduced and crm_mon lists the resources in the failed actions section - forever, or until I manually do a resource cleanup. - If I execute a crm resource cleanup RES_name from another node, sometimes it simply does not clean up the failed state. If I execute this from the node where the resource IS actually running, the resource is removed from the failed actions. What do you recommend, how could I start troubleshooting these issues? As I said, this setup works fine in several other systems, but here I am really, really stuck. thanks! Attila > -Original Message- > From: Klaus Wenninger [mailto:kwenn...@redhat.com] > Sent: Wednesday, May 10, 2017 2:04 PM > To: users@clusterlabs.org > Subject: Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond > > On 05/09/2017 10:34 PM, Attila Megyeri wrote: > > > > Actually I found some more details: > > > > > > > > there are two resources: A and B > > > > > > > > resource B depends on resource A (when the RA monitors B, it will fail > > if A is not running properly) > > > > > > > > If I stop resource A, the next monitor operation of "B" will fail. > > Interestingly, this check happens immediately after A is stopped. 
> > > > > > > > B is configured to restart if monitor fails. Start timeout is rather > > long, 180 seconds. So pacemaker tries to restart B, and waits. > > > > > > > > If I want to start "A", nothing happens until the start operation of > > "B" fails - typically several minutes. > > > > > > > > > > > > Is this the right behavior? > > > > It appears that pacemaker is blocked until resource B is being > > started, and I cannot really start its dependency... > > > > Shouldn't it be possible to start a resource while another resource is > > also starting? > > > > As long as resources don't depend on each other parallel starting should > work/happen. > > The number of parallel actions executed is derived from the number of > cores and > when load is detected some kind of throttling kicks in (in fact reduction of > the operations executed in parallel with the aim to reduce the load induced > by pacemaker). When throttling kicks in you should get log messages (there > is in fact a parallel discussion going on ...). > No idea if throttling might be a reason here but maybe worth considering > at least. > > Another reason why certain things happen with quite some delay I've > observed > is that obviously some situations are just resolved when the > cluster-recheck-interval > triggers a pengine run in addition to those triggered by changes. > You might easily verify this by changing the cluster-recheck-interval. 
> > Regards, > Klaus > > > > > > > > > > > Thanks, > > > > Attila > > > > > > > > > > > > *From:*Attila Megyeri [mailto:amegy...@minerva-soft.com] > > *Sent:* Tuesday, May 9, 2017 9:53 PM > > *To:* users@clusterlabs.org; kgail...@redhat.com > > *Subject:* [ClusterLabs] Pacemaker occasionally takes minutes to respond > > > > > > > > Hi Ken, all, > > > > > > > > > > > > We ran into an issue very similar to the one described in > > https://bugzilla.redhat.com/show_bug.cgi?id=1430112 / [Intel 7.4 Bug] > > Pacemaker occasionally takes minutes to respond > > > > > > > > But in our case we are not using fencing/stonith at all. > > > > > > > > Many times when I want to start/stop/cleanup a resource, it takes tens > > of seconds (or even minutes) till the command gets executed. The logs > > show nothing in that period, the redundant rings show no fault. > > > > > > > > Could this be the same issue? &g
Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
Actually I found some more details: there are two resources: A and B. Resource B depends on resource A (when the RA monitors B, it will fail if A is not running properly). If I stop resource A, the next monitor operation of "B" will fail. Interestingly, this check happens immediately after A is stopped. B is configured to restart if monitor fails. Start timeout is rather long, 180 seconds. So pacemaker tries to restart B, and waits. If I want to start "A", nothing happens until the start operation of "B" fails - typically several minutes. Is this the right behavior? It appears that pacemaker is blocked until resource B is being started, and I cannot really start its dependency... Shouldn't it be possible to start a resource while another resource is also starting? Thanks, Attila From: Attila Megyeri [mailto:amegy...@minerva-soft.com] Sent: Tuesday, May 9, 2017 9:53 PM To: users@clusterlabs.org; kgail...@redhat.com Subject: [ClusterLabs] Pacemaker occasionally takes minutes to respond Hi Ken, all, We ran into an issue very similar to the one described in https://bugzilla.redhat.com/show_bug.cgi?id=1430112 / [Intel 7.4 Bug] Pacemaker occasionally takes minutes to respond But in our case we are not using fencing/stonith at all. Many times when I want to start/stop/cleanup a resource, it takes tens of seconds (or even minutes) till the command gets executed. The logs show nothing in that period, the redundant rings show no fault. Could this be the same issue? Any hints on how to troubleshoot this? It is pacemaker 1.1.10, corosync 2.3.3 Cheers, Attila
[ClusterLabs] Pacemaker occasionally takes minutes to respond
Hi Ken, all, We ran into an issue very similar to the one described in https://bugzilla.redhat.com/show_bug.cgi?id=1430112 / [Intel 7.4 Bug] Pacemaker occasionally takes minutes to respond But in our case we are not using fencing/stonith at all. Many times when I want to start/stop/cleanup a resource, it takes tens of seconds (or even minutes) till the command gets executed. The logs show nothing in that period, the redundant rings show no fault. Could this be the same issue? Any hints on how to troubleshoot this? It is pacemaker 1.1.10, corosync 2.3.3 Cheers, Attila
Re: [ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave
Hi Ken, > -Original Message- > From: Ken Gaillot [mailto:kgail...@redhat.com] > Sent: Thursday, August 25, 2016 6:03 PM > To: Attila Megyeri ; Cluster Labs - All topics > related to open-source clustering welcomed > Subject: Re: [ClusterLabs] Mysql slave did not start replication after > failure, > and read-only IP also remained active on the much outdated slave > > On 08/22/2016 03:56 PM, Attila Megyeri wrote: > > Hi Ken, > > > > Thanks a lot for your feedback, my answers are inline. > > > > > > > >> -Original Message- > >> From: Ken Gaillot [mailto:kgail...@redhat.com] > >> Sent: Monday, August 22, 2016 4:12 PM > >> To: users@clusterlabs.org > >> Subject: Re: [ClusterLabs] Mysql slave did not start replication after > failure, > >> and read-only IP also remained active on the much outdated slave > >> > >> On 08/22/2016 07:24 AM, Attila Megyeri wrote: > >>> Hi Andrei, > >>> > >>> I waited several hours, and nothing happened. > >> > >> And actually, we can see from the configuration you provided that > >> cluster-recheck-interval is 2 minutes. > >> > >> I don't see anything about stonith; is it enabled and tested? This looks > >> like a situation where stonith would come into play. I know that power > >> fencing can be rough on a MySQL database, but perhaps intelligent > >> switches with network fencing would be appropriate. > > > > Yes, there is no stonith in place because we found it too agressive for this > purpose. And to be honest I'm not sure if that would have worked here. > > The problem is, in a situation like this where corosync communication is > broken, both nodes will think they are the only surviving node, and > bring up all resources. Network fencing would isolate one of the nodes > so at least it can't cause any serious trouble. > Quorum is enabled, and this is a three-node cluster, so this should not be an issue here. 
The real problem was that the mysql instance was still running, it was neither master nor slave, and the VIP was still assigned, and as such, clients were able to connect to this outdated instance. As the next step I will upgrade the resource agents, as I saw there were already some fixes for issues where the slave status is wrong. Anyway, thanks for your help! > >> The "Corosync main process was not scheduled" message is the start of > >> the trouble. It means the system was overloaded and corosync didn't get > >> any CPU time, so it couldn't maintain cluster communication. > >> > > > > True, this was the cause of the issue, but we still couldn't find a > > solution to > get rid of the original problem. > > Nevertheless I think that the issue is that the RA did not properly detect > the state of mysql. > > > > > >> Probably the most useful thing would be to upgrade to a recent version > >> of corosync+pacemaker+resource-agents. Recent corosync versions run > with > >> realtime priority, which makes this much less likely. > >> > >> Other than that, figure out what the load issue was, and try to prevent > >> it from recurring. > >> > > > > Whereas the original problem might have been caused by corosync CPU > issue, I am sure that once the load was gone the proper mysql state should > have been detected. > > The RA, responsible for this is almost the latest version, and I did not see > any changes related to this functionality. > > > >> I'm not familiar enough with the RA to comment on its behavior. If you > >> think it's suspect, check the logs during the incident for messages from > >> the RA. > >> > > > > So did I, but there are very few details logged while this was happening, so > I am pretty much stuck :( > > > > I thought that someone might have a clue what is wrong in the RA - that > causes this fake state detection. > > (Un)fortunately I cannot reproduce this situation for now. > > Perhaps with the split-brain, the slave tried to come up as master? 
> > > > > Who could help me in troubleshooting this? > > > > Thanks, > > Attila > > > > > > > > > >>> I assume that the RA does not treat this case properly. Mysql was > running, > >> but the "show slave status" command returned something that the RA > was > >> not prepared to parse, and instead of reporting a non-readable attribute, > it > >> returned some generic error, that did not stop th
Re: [ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave
Hi Ken, Thanks a lot for your feedback, my answers are inline. > -Original Message- > From: Ken Gaillot [mailto:kgail...@redhat.com] > Sent: Monday, August 22, 2016 4:12 PM > To: users@clusterlabs.org > Subject: Re: [ClusterLabs] Mysql slave did not start replication after > failure, > and read-only IP also remained active on the much outdated slave > > On 08/22/2016 07:24 AM, Attila Megyeri wrote: > > Hi Andrei, > > > > I waited several hours, and nothing happened. > > And actually, we can see from the configuration you provided that > cluster-recheck-interval is 2 minutes. > > I don't see anything about stonith; is it enabled and tested? This looks > like a situation where stonith would come into play. I know that power > fencing can be rough on a MySQL database, but perhaps intelligent > switches with network fencing would be appropriate. Yes, there is no stonith in place because we found it too aggressive for this purpose. And to be honest I'm not sure if that would have worked here. > > The "Corosync main process was not scheduled" message is the start of > the trouble. It means the system was overloaded and corosync didn't get > any CPU time, so it couldn't maintain cluster communication. > True, this was the cause of the issue, but we still couldn't find a solution to get rid of the original problem. Nevertheless I think that the issue is that the RA did not properly detect the state of mysql. 
The RA, responsible for this is almost the latest version, and I did not see any changes related to this functionality. > I'm not familiar enough with the RA to comment on its behavior. If you > think it's suspect, check the logs during the incident for messages from > the RA. > So did I, but there are very few details logged while this was happening, so I am pretty much stuck :( I thought that someone might have a clue what is wrong in the RA - that causes this fake state detection. (Un)fortunately I cannot reproduce this situation for now. Who could help me in troubleshooting this? Thanks, Attila > > I assume that the RA does not treat this case properly. Mysql was running, > but the "show slave status" command returned something that the RA was > not prepared to parse, and instead of reporting a non-readable attribute, it > returned some generic error, that did not stop the server. > > > > Rgds, > > Attila > > > > > > -Original Message- > > From: Andrei Borzenkov [mailto:arvidj...@gmail.com] > > Sent: Monday, August 22, 2016 11:42 AM > > To: Cluster Labs - All topics related to open-source clustering welcomed > > > Subject: Re: [ClusterLabs] Mysql slave did not start replication after > > failure, > and read-only IP also remained active on the much outdated slave > > > > On Mon, Aug 22, 2016 at 12:18 PM, Attila Megyeri > > wrote: > >> Dear community, > >> > >> > >> > >> A few days ago we had an issue in our Mysql M/S replication cluster. > >> > >> We have a one R/W Master, and a one RO Slave setup. RO VIP is > supposed to be > >> running on the slave if it is not too much behind the master, and if any > >> error occurs, RO VIP is moved to the master. > >> > >> > >> > >> Something happened with the slave Mysql (some disk issue, still > >> investigating), but the problem is, that the slave VIP remained on the > slave > >> device, even though the slave process was not running, and the server > was > >> much outdated. 
> >> > >> > >> > >> During the issue the following log entries appeared (just an extract as it > >> would be too long): > >> > >> > >> > >> > >> > >> Aug 20 02:04:07 ctdb1 corosync[1056]: [MAIN ] Corosync main process > was > >> not scheduled for 14088.5488 ms (threshold is 4000. ms). Consider > token > >> timeout increase. > >> > >> Aug 20 02:04:07 ctdb1 corosync[1056]: [TOTEM ] A processor failed, > forming > >> new configuration. > >> > >> Aug 20 02:04:34 ctdb1 corosync[
Re: [ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave
Hi Andrei, I waited several hours, and nothing happened. I assume that the RA does not treat this case properly. Mysql was running, but the "show slave status" command returned something that the RA was not prepared to parse, and instead of reporting a non-readable attribute, it returned some generic error, that did not stop the server. Rgds, Attila -Original Message- From: Andrei Borzenkov [mailto:arvidj...@gmail.com] Sent: Monday, August 22, 2016 11:42 AM To: Cluster Labs - All topics related to open-source clustering welcomed Subject: Re: [ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave On Mon, Aug 22, 2016 at 12:18 PM, Attila Megyeri wrote: > Dear community, > > > > A few days ago we had an issue in our Mysql M/S replication cluster. > > We have a one R/W Master, and a one RO Slave setup. RO VIP is supposed to be > running on the slave if it is not too much behind the master, and if any > error occurs, RO VIP is moved to the master. > > > > Something happened with the slave Mysql (some disk issue, still > investigating), but the problem is, that the slave VIP remained on the slave > device, even though the slave process was not running, and the server was > much outdated. > > > > During the issue the following log entries appeared (just an extract as it > would be too long): > > > > > > Aug 20 02:04:07 ctdb1 corosync[1056]: [MAIN ] Corosync main process was > not scheduled for 14088.5488 ms (threshold is 4000. ms). Consider token > timeout increase. > > Aug 20 02:04:07 ctdb1 corosync[1056]: [TOTEM ] A processor failed, forming > new configuration. > > Aug 20 02:04:34 ctdb1 corosync[1056]: [MAIN ] Corosync main process was > not scheduled for 27065.2559 ms (threshold is 4000. ms). Consider token > timeout increase. > > Aug 20 02:04:34 ctdb1 corosync[1056]: [TOTEM ] A new membership (xxx:6720) > was formed. 
Members left: 168362243 168362281 168362282 168362301 168362302 > 168362311 168362312 1 > > Aug 20 02:04:34 ctdb1 corosync[1056]: [TOTEM ] A new membership (xxx:6724) > was formed. Members > > .. > > Aug 20 02:13:28 ctdb1 corosync[1056]: [MAIN ] Completed service > synchronization, ready to provide service. > > .. > > Aug 20 02:13:29 ctdb1 attrd[1584]: notice: attrd_trigger_update: Sending > flush op to all hosts for: readable (1) > > … > > Aug 20 02:13:32 ctdb1 mysql(db-mysql)[10492]: INFO: post-demote notification > for ctdb1 > > Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-master)[10490]: INFO: IP status = ok, > IP_CIP= > > Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation > db-ip-master_stop_0 (call=371, rc=0, cib-update=179, confirmed=true) ok > > Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Adding inet address > xxx/24 with broadcast address to device eth0 > > Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Bringing device > eth0 up > > Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: > /usr/lib/heartbeat/send_arp -i 200 -r 5 -p > /usr/var/run/resource-agents/send_arp-xxx eth0 xxx auto not_used not_used > > Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation > db-ip-slave_start_0 (call=377, rc=0, cib-update=180, confirmed=true) ok > > Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation > db-ip-slave_monitor_2 (call=380, rc=0, cib-update=181, confirmed=false) > ok > > Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation > db-mysql_notify_0 (call=374, rc=0, cib-update=0, confirmed=true) ok > > Aug 20 02:13:32 ctdb1 attrd[1584]: notice: attrd_trigger_update: Sending > flush op to all hosts for: master-db-mysql (1) > > Aug 20 02:13:32 ctdb1 attrd[1584]: notice: attrd_perform_update: Sent > update 1622: master-db-mysql=1 > > Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation > db-mysql_demote_0 (call=384, rc=0, cib-update=182, 
confirmed=true) ok > > Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11160]: INFO: Ignoring post-demote > notification for my own demotion. > > Aug 20 02:13:33 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation > db-mysql_notify_0 (call=387, rc=0, cib-update=0, confirmed=true) ok > > Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11185]: ERROR: check_slave invoked on > an instance that is not a replication slave. > > Aug 20 02:13:33 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation > db-mysql_monitor_7000 (call=390, rc=0, cib-update=183, confirmed=false) ok > > Aug 20 02:13:33 ctdb1 ntpd[1560]: Listen normally on 16 eth0 . UDP 123 > > Aug 20 02:13:
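The repeating "check_slave invoked on an instance that is not a replication slave" error in the logs above comes from the RA inspecting SHOW SLAVE STATUS. A rough, illustrative parser of that output shows why an empty result is a distinct state from a broken slave; the field names are genuine MySQL output, but the parsing below is a sketch, not the actual resource agent code:

```shell
# Classify a "SHOW SLAVE STATUS\G" result read from stdin.
# An empty result set means the instance is not configured as a slave at
# all (the state the RA complained about), as opposed to a configured
# slave whose replication threads have stopped.
parse_slave_status() {
    awk '/Slave_IO_Running:/  { io  = $2 }
         /Slave_SQL_Running:/ { sql = $2 }
         END {
             if (io == "")                         print "not-a-slave"
             else if (io == "Yes" && sql == "Yes") print "healthy"
             else                                  print "broken"
         }'
}

# On a live server (hypothetical invocation):
#   mysql -e 'SHOW SLAVE STATUS\G' | parse_slave_status
```

A monitor that maps "not-a-slave" to a generic OK (rc=0), as the log's db-mysql_monitor entries suggest happened here, would explain why the cluster never moved the read-only VIP.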
[ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave
Dear community, A few days ago we had an issue in our Mysql M/S replication cluster. We have a one R/W Master, and a one RO Slave setup. RO VIP is supposed to be running on the slave if it is not too much behind the master, and if any error occurs, RO VIP is moved to the master. Something happened with the slave Mysql (some disk issue, still investigating), but the problem is, that the slave VIP remained on the slave device, even though the slave process was not running, and the server was much outdated. During the issue the following log entries appeared (just an extract as it would be too long): Aug 20 02:04:07 ctdb1 corosync[1056]: [MAIN ] Corosync main process was not scheduled for 14088.5488 ms (threshold is 4000. ms). Consider token timeout increase. Aug 20 02:04:07 ctdb1 corosync[1056]: [TOTEM ] A processor failed, forming new configuration. Aug 20 02:04:34 ctdb1 corosync[1056]: [MAIN ] Corosync main process was not scheduled for 27065.2559 ms (threshold is 4000. ms). Consider token timeout increase. Aug 20 02:04:34 ctdb1 corosync[1056]: [TOTEM ] A new membership (xxx:6720) was formed. Members left: 168362243 168362281 168362282 168362301 168362302 168362311 168362312 1 Aug 20 02:04:34 ctdb1 corosync[1056]: [TOTEM ] A new membership (xxx:6724) was formed. Members .. Aug 20 02:13:28 ctdb1 corosync[1056]: [MAIN ] Completed service synchronization, ready to provide service. .. Aug 20 02:13:29 ctdb1 attrd[1584]: notice: attrd_trigger_update: Sending flush op to all hosts for: readable (1) ... 
Aug 20 02:13:32 ctdb1 mysql(db-mysql)[10492]: INFO: post-demote notification for ctdb1 Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-master)[10490]: INFO: IP status = ok, IP_CIP= Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-ip-master_stop_0 (call=371, rc=0, cib-update=179, confirmed=true) ok Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Adding inet address xxx/24 with broadcast address to device eth0 Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Bringing device eth0 up Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p /usr/var/run/resource-agents/send_arp-xxx eth0 xxx auto not_used not_used Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-ip-slave_start_0 (call=377, rc=0, cib-update=180, confirmed=true) ok Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-ip-slave_monitor_2 (call=380, rc=0, cib-update=181, confirmed=false) ok Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-mysql_notify_0 (call=374, rc=0, cib-update=0, confirmed=true) ok Aug 20 02:13:32 ctdb1 attrd[1584]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-db-mysql (1) Aug 20 02:13:32 ctdb1 attrd[1584]: notice: attrd_perform_update: Sent update 1622: master-db-mysql=1 Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-mysql_demote_0 (call=384, rc=0, cib-update=182, confirmed=true) ok Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11160]: INFO: Ignoring post-demote notification for my own demotion. Aug 20 02:13:33 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-mysql_notify_0 (call=387, rc=0, cib-update=0, confirmed=true) ok Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11185]: ERROR: check_slave invoked on an instance that is not a replication slave. 
Aug 20 02:13:33 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-mysql_monitor_7000 (call=390, rc=0, cib-update=183, confirmed=false) ok
Aug 20 02:13:33 ctdb1 ntpd[1560]: Listen normally on 16 eth0 . UDP 123
Aug 20 02:13:33 ctdb1 ntpd[1560]: Deleting interface #12 eth0, xxx#123, interface stats: received=0, sent=0, dropped=0, active_time=2637334 secs
Aug 20 02:13:33 ctdb1 ntpd[1560]: peers refreshed
Aug 20 02:13:33 ctdb1 ntpd[1560]: new interface(s) found: waking up resolver
Aug 20 02:13:40 ctdb1 mysql(db-mysql)[11224]: ERROR: check_slave invoked on an instance that is not a replication slave.
Aug 20 02:13:47 ctdb1 mysql(db-mysql)[11263]: ERROR: check_slave invoked on an instance that is not a replication slave.

From this point on, the last two lines repeat every 7 seconds (the mysql monitoring interval).

The expected behavior was that the slave (RO) VIP should have been moved to the master, as the secondary db was outdated. Unfortunately I cannot recall what crm_mon was showing while the issue was present, but I am sure that the RA did not handle the situation properly. Placing the slave node into standby and then back online resolved the issue immediately (the slave started to sync, and in a few minutes it caught up with the master).

Here is the relevant part of the configuration:

primitive db-ip-master ocf:heartbeat:IPaddr2 \
    params lvs_support="true" ip="XXX" cidr_netmask="24" broadcast="XXX" \
    op start interval="0" timeout="20s" on-fail="re
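For what it is worth, the intended placement policy described above (RO VIP on the slave only while the slave threads run and the lag is acceptable, otherwise fall back to the master) can be sketched as a small shell helper. This is only an illustration of the decision logic, not the mysql RA's actual implementation; the function name and arguments are made up.

```shell
# ro_vip_target SLAVE_RUNNING LAG MAX_LAG
# Decide where the read-only VIP should live: on the slave only while the
# slave threads are running and replication lag stays under the threshold;
# in every other case fall back to the master.
# (Hypothetical helper for illustration; not part of the mysql RA.)
ro_vip_target() {
  slave_running=$1   # "yes" if the slave SQL/IO threads are running
  lag=$2             # Seconds_Behind_Master from SHOW SLAVE STATUS
  max_lag=$3         # maximum tolerated lag in seconds
  if [ "$slave_running" = "yes" ] && [ "$lag" -le "$max_lag" ]; then
    echo slave
  else
    echo master
  fi
}
```

With such a rule, a stopped slave process (as in the incident above) would immediately send the RO VIP back to the master instead of leaving it on the outdated node.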
Re: [ClusterLabs] Mysql M/S, binlogs - how to delete them safely without failing over first?
Hi Brett,

Thanks for the quick response. I am using the RA from https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/mysql and have no experience with the Percona agent so far. The clusterlabs RA also does the "change master to" on promote. I did not check with the most recent versions, but it definitely did not work properly before.

What is the expected behaviour? When there is a failover, should the new slave know what the last master position of the old slave (that was still synced) was? How do you keep/maintain the binlog files in your environments?

Thanks,
Attila

From: Brett Moser [mailto:brett.mo...@gmail.com]
Sent: Tuesday, August 18, 2015 11:35 PM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Mysql M/S, binlogs - how to delete them safely without failing over first?

Hi Attila,

It sounds like on failover the new slave is not having its master replication file & position updated. What resource agent are you using to control the M/S mysql resource? Have you investigated the Percona agent? It performs the CHANGE MASTER TO commands for you, and I have found it to be a good RA for my M/S MySQL purposes. https://github.com/percona/percona-pacemaker-agents

regards,
-Brett Moser

On Tue, Aug 18, 2015 at 2:19 PM, Attila Megyeri <amegy...@minerva-soft.com> wrote:

Hi List,

We are using M/S replication in a couple of clusters, and there is an issue that has been causing headaches for me for quite some time. My problem comes from the fact that binlog files grow very quickly on both the master and slave nodes.

Let's assume that node 1 is the master - it logs all operations to binlog. Node 2 is the slave; it replicates everything properly. (It is strange, however, that node2 must also generate and keep binlog files while it is a slave, but let's assume that this is by design.) There are ways to configure mysql to keep the binlog files only for some time, e.g. 10 days, but I had an issue with this.

To explain the issue, please consider the following case: Let's say that both node1 and node2 are up-to-date, and we did a failover test on day 0. DB1 is the master, DB2 is the slave. DB1 has master position "A", DB2 has master position "B". After 20 days, lots of binlog files exist on both servers; I would like to get rid of them, as the slave is up-to-date. I decide to delete all binlog files older than 1 day by issuing "purge binary logs...". I try to fail over so that DB2 becomes the master, but DB1 tries to connect to DB2 for replication and wants to replicate starting from a position that it "remembers" from the time when DB2 was the master, and starts to look for binlog files that are 20 days old. Now the issue is that those old binlog files have been deleted since, and replication stops with an error (cannot find binlogs).

Am I doing something wrong here, or is something configured badly? Or am I correct in assuming that in order to purge the binlog files from both servers I need to make a failover first?

Thank you!

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
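One way to make "purge binary logs" safer is to compare the candidate files against the oldest binlog the slave still needs (Relay_Master_Log_File from SHOW SLAVE STATUS on the slave). A rough sketch of that check, assuming the usual lexicographically sortable mysql-bin.NNNNNN naming; safe_to_purge is a made-up helper name, not an existing tool:

```shell
# safe_to_purge NEEDED FILE...
# Print only the binlog files that sort strictly before NEEDED (the
# Relay_Master_Log_File the slave is still reading), i.e. the ones that
# could be purged without breaking the running replication stream.
safe_to_purge() {
  needed=$1; shift
  for f in "$@"; do
    # binlog names sort lexicographically, so a file is older than the
    # needed one exactly when it sorts first and is not equal to it
    first=$(printf '%s\n%s\n' "$f" "$needed" | sort | head -n 1)
    if [ "$f" = "$first" ] && [ "$f" != "$needed" ]; then
      printf '%s\n' "$f"
    fi
  done
}
```

The corresponding real statement on the master would then be PURGE BINARY LOGS TO 'mysql-bin.NNNNNN', which removes everything strictly before the named file. As the thread discusses, this alone does not help with the stale position the old master "remembers" about its peer; that still has to be refreshed (e.g. via CHANGE MASTER TO) on failover.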
[ClusterLabs] Mysql M/S, binlogs - how to delete them safely without failing over first?
Hi List,

We are using M/S replication in a couple of clusters, and there is an issue that has been causing headaches for me for quite some time. My problem comes from the fact that binlog files grow very quickly on both the master and slave nodes.

Let's assume that node 1 is the master - it logs all operations to binlog. Node 2 is the slave; it replicates everything properly. (It is strange, however, that node2 must also generate and keep binlog files while it is a slave, but let's assume that this is by design.) There are ways to configure mysql to keep the binlog files only for some time, e.g. 10 days, but I had an issue with this.

To explain the issue, please consider the following case: Let's say that both node1 and node2 are up-to-date, and we did a failover test on day 0. DB1 is the master, DB2 is the slave. DB1 has master position "A", DB2 has master position "B". After 20 days, lots of binlog files exist on both servers; I would like to get rid of them, as the slave is up-to-date. I decide to delete all binlog files older than 1 day by issuing "purge binary logs...". I try to fail over so that DB2 becomes the master, but DB1 tries to connect to DB2 for replication and wants to replicate starting from a position that it "remembers" from the time when DB2 was the master, and starts to look for binlog files that are 20 days old. Now the issue is that those old binlog files have been deleted since, and replication stops with an error (cannot find binlogs).

Am I doing something wrong here, or is something configured badly? Or am I correct in assuming that in order to purge the binlog files from both servers I need to make a failover first?

Thank you!

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Memory leak in crm_mon ?
Hi Andrew,

I managed to isolate / reproduce the issue. You might want to take a look, as it might be present in 1.1.12 as well.

I monitor my cluster from putty, mainly this way:
- I have a putty (Windows client) session that connects via SSH to the box and authenticates using a public key as a non-root user.
- It immediately sends a "sudo crm_mon -Af" command, so with a single click I have a nice view of what the cluster is doing.

Whenever I close this putty window (terminate the app), the crm_mon process goes to 100% CPU usage, starts to leak, in a few hours consumes all memory and then destroys the whole cluster. This does not happen if I leave crm_mon with Ctrl-C. I can reproduce this 100% with crm_mon 1.1.10, with the mainstream ubuntu trusty packages. This might be related to how sudo executes crm_mon, and what it signals to crm_mon when it gets terminated.

Now I know what I need to pay attention to in order to avoid this problem, but you might want to check whether this issue is still present.

Thanks,
Attila

-----Original Message-----
From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
Sent: Friday, August 14, 2015 12:40 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Memory leak in crm_mon ?

-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Tuesday, August 11, 2015 2:49 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Memory leak in crm_mon ?

> On 10 Aug 2015, at 5:33 pm, Attila Megyeri wrote:
>
> > Hi!
> >
> > We are building a new cluster on top of pacemaker/corosync and several times during the past days we noticed that "crm_mon -Af" used up all the memory+swap and caused high CPU usage. Killing the process solves the issue.
> > We are using the binary package versions available in the latest ubuntu trusty, namely:
> >
> > crmsh 1.2.5+hg1034-1ubuntu4
> > pacemaker 1.1.10+git20130802-1ubuntu2.3
> > pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.3
> > corosync 2.3.3-1ubuntu1
> >
> > Kernel is 3.13.0-46-generic
> >
> > Looking back at some "atop" data, the CPU went to 100% many times during the last couple of days, at various times, more often exactly around midnight (strange).
> >
> > 08.05 14:00
> > 08.06 21:41
> > 08.07 00:00
> > 08.07 00:00
> > 08.08 00:00
> > 08.09 06:27
> >
> > Checked the corosync log and syslog, but did not find any correlation between the entries in the logs around the specific times.
> > For most of the time, the node running crm_mon was the DC as well - not running any resources (e.g. a pairless node for quorum).
> >
> > We have another running system, where everything works perfectly, whereas it is almost the same:
> >
> > crmsh 1.2.5+hg1034-1ubuntu4
> > pacemaker 1.1.10+git20130802-1ubuntu2.1
> > pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.1
> > corosync 2.3.3-1ubuntu1
> >
> > Kernel is 3.13.0-8-generic
> >
> > Is this perhaps a known issue?

> Possibly, that version is over 2 years old.

> > Any hints?

> Getting something a little more recent would be the best place to start

Thanks Andrew,

I tried to upgrade to 1.1.12 using the packages available at https://launchpad.net/~syseleven-platform .

In the first attempt I upgraded a single node to see how it works out, but I ended up with errors like

Could not establish cib_rw connection: Connection refused (111)

I have disabled the firewall, no changes. The node appears to be running but does not see any of the other nodes. On the other nodes I see this node as an UNCLEAN one. (I assume corosync is fine, but pacemaker is not.) I use udpu for the transport. Am I doing something wrong?
I tried to look for some howtos on the upgrade, but the only thing I found was the rather outdated http://clusterlabs.org/wiki/Upgrade

Could you please direct me to some howto/guide on how to perform the upgrade? Or am I facing some compatibility issue, so I should extract the whole cib, upgrade all nodes and reconfigure the cluster from scratch? (The cluster is meant to go live in 2 days... :) )

Thanks a lot in advance

> > Thanks!
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://
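Until the underlying signal handling is confirmed fixed, one defensive workaround for the scenario described above (sudo leaving crm_mon behind when the terminal goes away) is to start it through a small wrapper that forwards the hangup to the child. This is only a sketch under the assumption that the leak is triggered by the orphaned process; run_monitored is a made-up name:

```shell
# run_monitored CMD [ARGS...]
# Run CMD in the background and forward HUP/INT/TERM to it, so that closing
# the terminal kills the child (e.g. "run_monitored sudo crm_mon -Af")
# instead of leaving it orphaned and spinning at 100% CPU.
run_monitored() {
  "$@" &
  pid=$!
  trap 'kill "$pid" 2>/dev/null' HUP INT TERM
  wait "$pid"   # returns the child's exit status
}
```

Whether this fully avoids the leak with the trusty 1.1.10 packages is untested; it simply removes the orphaned-process trigger described above.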
Re: [ClusterLabs] Memory leak in crm_mon ?
-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Tuesday, August 11, 2015 2:49 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Memory leak in crm_mon ?

> On 10 Aug 2015, at 5:33 pm, Attila Megyeri wrote:
>
> > Hi!
> >
> > We are building a new cluster on top of pacemaker/corosync and several times during the past days we noticed that "crm_mon -Af" used up all the memory+swap and caused high CPU usage. Killing the process solves the issue.
> >
> > We are using the binary package versions available in the latest ubuntu trusty, namely:
> >
> > crmsh 1.2.5+hg1034-1ubuntu4
> > pacemaker 1.1.10+git20130802-1ubuntu2.3
> > pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.3
> > corosync 2.3.3-1ubuntu1
> >
> > Kernel is 3.13.0-46-generic
> >
> > Looking back at some "atop" data, the CPU went to 100% many times during the last couple of days, at various times, more often exactly around midnight (strange).
> >
> > 08.05 14:00
> > 08.06 21:41
> > 08.07 00:00
> > 08.07 00:00
> > 08.08 00:00
> > 08.09 06:27
> >
> > Checked the corosync log and syslog, but did not find any correlation between the entries in the logs around the specific times.
> > For most of the time, the node running crm_mon was the DC as well - not running any resources (e.g. a pairless node for quorum).
> >
> > We have another running system, where everything works perfectly, whereas it is almost the same:
> >
> > crmsh 1.2.5+hg1034-1ubuntu4
> > pacemaker 1.1.10+git20130802-1ubuntu2.1
> > pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.1
> > corosync 2.3.3-1ubuntu1
> >
> > Kernel is 3.13.0-8-generic
> >
> > Is this perhaps a known issue?

> Possibly, that version is over 2 years old.

> > Any hints?

> Getting something a little more recent would be the best place to start

Thanks Andrew,

I tried to upgrade to 1.1.12 using the packages available at https://launchpad.net/~syseleven-platform .

In the first attempt I upgraded a single node to see how it works out, but I ended up with errors like

Could not establish cib_rw connection: Connection refused (111)

I have disabled the firewall, no changes. The node appears to be running but does not see any of the other nodes. On the other nodes I see this node as an UNCLEAN one. (I assume corosync is fine, but pacemaker is not.) I use udpu for the transport. Am I doing something wrong?

I tried to look for some howtos on the upgrade, but the only thing I found was the rather outdated http://clusterlabs.org/wiki/Upgrade

Could you please direct me to some howto/guide on how to perform the upgrade? Or am I facing some compatibility issue, so I should extract the whole cib, upgrade all nodes and reconfigure the cluster from scratch? (The cluster is meant to go live in 2 days... :) )

Thanks a lot in advance

> > Thanks!
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Memory leak in crm_mon ?
Hi!

We are building a new cluster on top of pacemaker/corosync, and several times during the past days we noticed that "crm_mon -Af" used up all the memory+swap and caused high CPU usage. Killing the process solves the issue.

We are using the binary package versions available in the latest ubuntu trusty, namely:

crmsh 1.2.5+hg1034-1ubuntu4
pacemaker 1.1.10+git20130802-1ubuntu2.3
pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.3
corosync 2.3.3-1ubuntu1

Kernel is 3.13.0-46-generic

Looking back at some "atop" data, the CPU went to 100% many times during the last couple of days, at various times, more often exactly around midnight (strange):

08.05 14:00
08.06 21:41
08.07 00:00
08.07 00:00
08.08 00:00
08.09 06:27

I checked the corosync log and syslog, but did not find any correlation between the entries in the logs around the specific times. For most of the time, the node running crm_mon was the DC as well - not running any resources (e.g. a pairless node for quorum).

We have another running system, where everything works perfectly, whereas it is almost the same:

crmsh 1.2.5+hg1034-1ubuntu4
pacemaker 1.1.10+git20130802-1ubuntu2.1
pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.1
corosync 2.3.3-1ubuntu1

Kernel is 3.13.0-8-generic

Is this perhaps a known issue? Any hints?

Thanks!

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org