Re: [ClusterLabs] Mysql upgrade in DRBD setup

2017-10-15 Thread Attila Megyeri
Hi Ken,

My problem with the scenario you described is the following:

On the central side, if I use M-S replication, the master binlog information 
will be different on the master and the slave. Therefore, if a failover occurs, 
the remote sites will have difficulties with the "change master" operation (binlog 
file and position differ on the two hosts). This was the reason for choosing 
DRBD for the central master.

Yes, Galera could be an option, but that would require some redesign, and we 
also lack experience with it...

What about the following for the DRBD upgrade:

- I would upgrade the active node normally, causing a small downtime (cluster 
in maintenance mode).
- Then, when the master is up and running again, I would mount a local dummy 
mysql dir on the slave (content does not matter) and perform the upgrade of 
the secondary node. (The program files would be upgraded, along with a dummy 
database I don't care about.)
- Then, finally, I would attempt a failover to the secondary, just to test that 
all is fine.

Besides the small downtime, I don't see any significant risks in this approach, 
do you?
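
For reference, the command sequence I have in mind is roughly the following - 
only a sketch, assuming crmsh, the Ubuntu mysql-server package and the default 
/var/lib/mysql datadir; the dummy path is a placeholder:

# freeze the cluster so the package scripts' restarts don't trigger recovery
crm configure property maintenance-mode=true

# on the active node: upgrade the packages against the real, mounted datadir
# (this is where the small downtime happens)
apt-get update && apt-get install mysql-server

# on the passive node: give the package scripts a dummy datadir to work on,
# upgrade, then unmount the dummy again
mount --bind /srv/dummy-mysql /var/lib/mysql
apt-get install mysql-server
umount /var/lib/mysql

# unfreeze, verify, then test a failover to the secondary
crm configure property maintenance-mode=false
crm_mon -Af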

Thanks,
Attila


-Original Message-
From: Ken Gaillot [mailto:kgail...@redhat.com] 
Sent: Friday, October 13, 2017 9:03 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Mysql upgrade in DRBD setup

On Fri, 2017-10-13 at 17:35 +0200, Attila Megyeri wrote:
> Hi Ken, Kristián,
> 
> 
> Thanks - I am familiar with the native replication, and we use that as 
> well.
> But in this scenario I have to use DRBD. (There is a DRBD Mysql 
> cluster at a central site, which is replicated to many sites 
> using native replication, and all sites have DRBD clusters as well - 
> in this setup I have to use DRBD for high availability).
> 
> 
> Anyway - I thought there might be a better approach for the DRBD-replicated 
> Mysql than what I outlined.
> What I am concerned about is what will happen if I upgrade the active 
> node (let's say I'm okay with the downtime) and then fail over to the 
> other node, where the program files and the data files are on 
> different versions... and what happens when I start upgrading that node.
> 
> Any experience anyone?
> 
> @Kristián: my experience shows that if I try to update mysql without a 
> mounted data fs - it will fail terribly... So the only option is to 
> upgrade the mounted and active instance - but the issue is the 
> version difference (prog vs. data).

Exactly -- which is why I'd still go with native replication for this, too. It 
just adds a step in the upgrade process I outlined earlier:
repoint all the other sites' mysql instances to the second central server after 
it is upgraded (before or after it is made master, doesn't matter). I'm 
assuming only the master is allowed to write.
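
On each remote site that repoint is essentially a stop-slave / change-master / 
start-slave against the new central server -- roughly like this (host name, 
credentials and coordinates are placeholders; with GTID-based replication, 
MASTER_AUTO_POSITION=1 replaces the file/position part):

mysql -e "STOP SLAVE;
          CHANGE MASTER TO
            MASTER_HOST='central2.example.com',
            MASTER_USER='repl',
            MASTER_PASSWORD='...',
            MASTER_LOG_FILE='mysql-bin.000123',
            MASTER_LOG_POS=4;
          START SLAVE;"
mysql -e "SHOW SLAVE STATUS\G"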

Another alternative would be to use galera for multi-master (at least for the 
two servers at the central site).

Also, it's still possible to use DRBD beneath a native replication setup, but 
you'd have to replicate both the master and slave data (using only one at a 
time on any given server). This makes more sense if the mysql servers are 
running inside VMs or containers that can migrate between the physical machines.

> 
> Thanks!
> 
> 
> 
> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Thursday, October 12, 2017 9:22 PM
> To: Cluster Labs - All topics related to open-source clustering 
> welcomed 
> Subject: Re: [ClusterLabs] Mysql upgrade in DRBD setup
> 
> On Thu, 2017-10-12 at 18:51 +0200, Attila Megyeri wrote:
> > Hi all,
> >  
> > What is the recommended mysql server upgrade methodology in case of 
> > an active/passive DRBD storage?
> > (Ubuntu is the platform)
> 
> If you want to minimize downtime in a MySQL upgrade, your best bet is 
> to use MySQL native replication rather than replicate the storage.
> 
> > 1. starting point: node1 = master, node2 = slave
> > 2. stop mysql on node2, upgrade, start mysql again, ensure OK
> > 3. switch master to node2 and slave to node1, ensure OK
> > 4. stop mysql on node1, upgrade, start mysql again, ensure OK
> 
> You might have a small window where the database is read-only while 
> you switch masters (you can keep it to a few seconds if you arrange 
> things well), but other than that, you won't have any downtime, even 
> if some part of the upgrade gives you trouble.
> 
> >  
> > 1)  On the passive node the mysql data directory is not mounted, 
> > so the backup fails (some postinstall jobs will attempt to perform 
> > manipulations on certain files in the data directory).
> > 2)  If the upgrade is done on the active node, it will restart 
> > the service (with the service r

Re: [ClusterLabs] Mysql upgrade in DRBD setup

2017-10-13 Thread Attila Megyeri
Hi Ken, Kristián,


Thanks - I am familiar with the native replication, and we use that as well.
But in this scenario I have to use DRBD. (There is a DRBD Mysql cluster at 
a central site, which is replicated to many sites using native replication, and 
all sites have DRBD clusters as well - in this setup I have to use DRBD for 
high availability).


Anyway - I thought there might be a better approach for the DRBD-replicated Mysql 
than what I outlined.
What I am concerned about is what will happen if I upgrade the active node 
(let's say I'm okay with the downtime) and then fail over to the other node, 
where the program files and the data files are on different versions... and what 
happens when I start upgrading that node.

Any experience anyone?

@Kristián: my experience shows that if I try to update mysql without a mounted 
data fs - it will fail terribly... So the only option is to upgrade the 
mounted and active instance - but the issue is the version difference (prog 
vs. data).

Thanks!



-Original Message-
From: Ken Gaillot [mailto:kgail...@redhat.com] 
Sent: Thursday, October 12, 2017 9:22 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Mysql upgrade in DRBD setup

On Thu, 2017-10-12 at 18:51 +0200, Attila Megyeri wrote:
> Hi all,
>  
> What is the recommended mysql server upgrade methodology in case of an 
> active/passive DRBD storage?
> (Ubuntu is the platform)

If you want to minimize downtime in a MySQL upgrade, your best bet is to use 
MySQL native replication rather than replicate the storage.

1. starting point: node1 = master, node2 = slave
2. stop mysql on node2, upgrade, start mysql again, ensure OK
3. switch master to node2 and slave to node1, ensure OK
4. stop mysql on node1, upgrade, start mysql again, ensure OK

You might have a small window where the database is read-only while you switch 
masters (you can keep it to a few seconds if you arrange things well), but 
other than that, you won't have any downtime, even if some part of the upgrade 
gives you trouble.
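
A minimal sketch of the switchover in step 3, assuming classic file/position 
replication and a cluster-managed writer VIP (all names below are placeholders):

# on node1 (current master): stop accepting writes
mysql -e "SET GLOBAL read_only=ON;"

# wait until node2 has caught up
mysql -h node2 -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master   # want 0

# promote node2 and note its coordinates
mysql -h node2 -e "STOP SLAVE; SET GLOBAL read_only=OFF; SHOW MASTER STATUS;"

# point node1 at node2 using those coordinates, then move the writer VIP
mysql -e "CHANGE MASTER TO MASTER_HOST='node2',
          MASTER_LOG_FILE='...', MASTER_LOG_POS=...;
          START SLAVE;"
crm resource move writer_vip node2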

>  
> 1)  On the passive node the mysql data directory is not mounted, 
> so the backup fails (some postinstall jobs will attempt to perform 
> manipulations on certain files in the data directory).
> 2)  If the upgrade is done on the active node, it will restart the 
> service (with the service restart, not in a crm managed fassion…), 
> which is not a very good option (downtime in a HA solution). Not to 
> mention, that it will update some files in the mysql data directory, 
> which can cause strange issues if the A/P pair is changed – since on 
> the other node the program code will still be the old one, while the 
> data dir is already upgraded.
>  
> Any hints are welcome!
>  
> Thanks,
> Attila
>  
> ___
> Users mailing list: Users@clusterlabs.org 
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
--
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org 
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Mysql upgrade in DRBD setup

2017-10-12 Thread Attila Megyeri
Hi,

This does not really answer my question. Placing the cluster into maintenance 
mode just avoids monitoring and restarting, but what about the things I am 
asking below? (data dir related questions)

thanks


From: Kristián Feldsam [mailto:ad...@feldhost.cz]
Sent: Thursday, October 12, 2017 7:51 PM
To: Attila Megyeri ; users@clusterlabs.org
Subject: Re:[ClusterLabs] Mysql upgrade in DRBD setup

hello, you should put the cluster into maintenance mode



Sent from my MI 5
On Oct 12, 2017 6:55 PM, Attila Megyeri <amegy...@minerva-soft.com> wrote:
Hi all,

What is the recommended mysql server upgrade methodology in case of an 
active/passive DRBD storage?
(Ubuntu is the platform)


1)  On the passive node the mysql data directory is not mounted, so the 
backup fails (some postinstall jobs will attempt to perform manipulations on 
certain files in the data directory).

2)  If the upgrade is done on the active node, it will restart the service 
(with the service restart, not in a crm-managed fashion…), which is not a very 
good option (downtime in a HA solution). Not to mention that it will update 
some files in the mysql data directory, which can cause strange issues if the 
A/P pair is changed – since on the other node the program code will still be 
the old one, while the data dir is already upgraded.

Any hints are welcome!

Thanks,
Attila

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Mysql upgrade in DRBD setup

2017-10-12 Thread Attila Megyeri
Hi all,

What is the recommended mysql server upgrade methodology in case of an 
active/passive DRBD storage?
(Ubuntu is the platform)


1)  On the passive node the mysql data directory is not mounted, so the 
backup fails (some postinstall jobs will attempt to perform manipulations on 
certain files in the data directory).

2)  If the upgrade is done on the active node, it will restart the service 
(with the service restart, not in a crm-managed fashion...), which is not a 
very good option (downtime in a HA solution). Not to mention that it will 
update some files in the mysql data directory, which can cause strange issues 
if the A/P pair is changed - since on the other node the program code will 
still be the old one, while the data dir is already upgraded.

Any hints are welcome!

Thanks,
Attila

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] clearing failed actions

2017-06-19 Thread Attila Megyeri
One more thing to add.
Two almost identical clusters with the identical asterisk primitive produce 
different crm_verify output: one cluster returns no warnings, whereas 
the other one complains.

On the problematic one:

crm_verify --live-check -VV
warning: get_failcount_full:   Setting asterisk.failure_timeout=120 in 
asterisk-stop-0 conflicts with on-fail=block: ignoring timeout
Warnings found during check: config may not be valid


The relevant primitive is in both clusters:

primitive asterisk ocf:heartbeat:asterisk \
op monitor interval="10s" timeout="45s" on-fail="restart" \
op start interval="0" timeout="60s" on-fail="standby" \
op stop interval="0" timeout="60s" on-fail="block" \
meta migration-threshold="3" failure-timeout="2m"

Why is the same configuration valid in one, but not in the other cluster?
Shall I simply omit the "op stop" line?
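
I.e. going from the definition above to something like this (just to 
illustrate what I mean, not tested):

primitive asterisk ocf:heartbeat:asterisk \
op monitor interval="10s" timeout="45s" on-fail="restart" \
op start interval="0" timeout="60s" on-fail="standby" \
meta migration-threshold="3" failure-timeout="2m"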

thanks :)
Attila


> -Original Message-
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Monday, June 19, 2017 9:47 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; kgail...@redhat.com
> Subject: Re: [ClusterLabs] clearing failed actions
>
> I did another experiment, even simpler.
>
> Created one node, one resource, using pacemaker 1.1.14 on Ubuntu.
>
> Configured failcount to 1, migration threshold to 2, failure timeout to 1
> minute.
>
> crm_mon:
>
> Last updated: Mon Jun 19 19:43:41 2017  Last change: Mon Jun 19
> 19:37:09 2017 by root via cibadmin on test
> Stack: corosync
> Current DC: test (version 1.1.14-70404b0) - partition with quorum
> 1 node and 1 resource configured
>
> Online: [ test ]
>
> db-ip-master(ocf::heartbeat:IPaddr2):   Started test
>
> Node Attributes:
> * Node test:
>
> Migration Summary:
> * Node test:
>db-ip-master: migration-threshold=2 fail-count=1
>
> crm verify:
>
> crm_verify --live-check -
> info: validate_with_relaxng:Creating RNG parser context
> info: determine_online_status:  Node test is online
> info: get_failcount_full:   db-ip-master has failed 1 times on test
> info: get_failcount_full:   db-ip-master has failed 1 times on test
> info: get_failcount_full:   db-ip-master has failed 1 times on test
> info: get_failcount_full:   db-ip-master has failed 1 times on test
> info: native_print: db-ip-master(ocf::heartbeat:IPaddr2):   
> Started test
> info: get_failcount_full:   db-ip-master has failed 1 times on test
> info: common_apply_stickiness:  db-ip-master can fail 1 more times on
> test before being forced off
> info: LogActions:   Leave   db-ip-master(Started test)
>
>
> crm configure is:
>
> node 168362242: test \
> attributes standby=off
> primitive db-ip-master IPaddr2 \
> params lvs_support=true ip=10.9.1.10 cidr_netmask=24
> broadcast=10.9.1.255 \
> op start interval=0 timeout=20s on-fail=restart \
> op monitor interval=20s timeout=20s \
> op stop interval=0 timeout=20s on-fail=block \
> meta migration-threshold=2 failure-timeout=1m target-role=Started
> location loc1 db-ip-master 0: test
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.14-70404b0 \
> cluster-infrastructure=corosync \
> stonith-enabled=false \
> cluster-recheck-interval=30s \
> symmetric-cluster=false
>
>
>
>
> Corosync log:
>
>
> Jun 19 19:45:07 [331] test   crmd:   notice: do_state_transition:   State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jun 19 19:45:07 [330] testpengine: info: process_pe_message:Input 
> has
> not changed since last time, not saving to disk
> Jun 19 19:45:07 [330] testpengine: info: determine_online_status:
> Node test is online
> Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
> db-ip-master
> has failed 1 times on test
> Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
> db-ip-master
> has failed 1 times on test
> Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
> db-ip-master
> has failed 1 times on test
> Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
> db-ip-master
> has failed 1 times on test
> Jun 19 19:45:07 [330] testpengine: info: native_print:  db-ip-master
> (ocf::heartbeat:IPaddr2):   Started test
> Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
> db-ip-ma

Re: [ClusterLabs] clearing failed actions

2017-06-19 Thread Attila Megyeri
I did another experiment, even simpler.

Created one node, one resource, using pacemaker 1.1.14 on Ubuntu.

Configured failcount to 1, migration threshold to 2, failure timeout to 1 
minute.

crm_mon:

Last updated: Mon Jun 19 19:43:41 2017  Last change: Mon Jun 19 
19:37:09 2017 by root via cibadmin on test
Stack: corosync
Current DC: test (version 1.1.14-70404b0) - partition with quorum
1 node and 1 resource configured

Online: [ test ]

db-ip-master(ocf::heartbeat:IPaddr2):   Started test

Node Attributes:
* Node test:

Migration Summary:
* Node test:
   db-ip-master: migration-threshold=2 fail-count=1

crm verify:

crm_verify --live-check -
info: validate_with_relaxng:Creating RNG parser context
info: determine_online_status:  Node test is online
info: get_failcount_full:   db-ip-master has failed 1 times on test
info: get_failcount_full:   db-ip-master has failed 1 times on test
info: get_failcount_full:   db-ip-master has failed 1 times on test
info: get_failcount_full:   db-ip-master has failed 1 times on test
info: native_print: db-ip-master(ocf::heartbeat:IPaddr2):   Started 
test
info: get_failcount_full:   db-ip-master has failed 1 times on test
info: common_apply_stickiness:  db-ip-master can fail 1 more times on 
test before being forced off
info: LogActions:   Leave   db-ip-master(Started test)


crm configure is:

node 168362242: test \
attributes standby=off
primitive db-ip-master IPaddr2 \
params lvs_support=true ip=10.9.1.10 cidr_netmask=24 
broadcast=10.9.1.255 \
op start interval=0 timeout=20s on-fail=restart \
op monitor interval=20s timeout=20s \
op stop interval=0 timeout=20s on-fail=block \
meta migration-threshold=2 failure-timeout=1m target-role=Started
location loc1 db-ip-master 0: test
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
stonith-enabled=false \
cluster-recheck-interval=30s \
symmetric-cluster=false




Corosync log:


Jun 19 19:45:07 [331] test   crmd:   notice: do_state_transition:   State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED 
origin=crm_timer_popped ]
Jun 19 19:45:07 [330] testpengine: info: process_pe_message:Input 
has not changed since last time, not saving to disk
Jun 19 19:45:07 [330] testpengine: info: determine_online_status:   
Node test is online
Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] testpengine: info: native_print:  db-ip-master
(ocf::heartbeat:IPaddr2):   Started test
Jun 19 19:45:07 [330] testpengine: info: get_failcount_full:
db-ip-master has failed 1 times on test
Jun 19 19:45:07 [330] testpengine: info: common_apply_stickiness:   
db-ip-master can fail 1 more times on test before being forced off
Jun 19 19:45:07 [330] testpengine: info: LogActions:Leave   
db-ip-master(Started test)
Jun 19 19:45:07 [330] testpengine:   notice: process_pe_message:
Calculated Transition 34: /var/lib/pacemaker/pengine/pe-input-6.bz2
Jun 19 19:45:07 [331] test   crmd: info: do_state_transition:   State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Jun 19 19:45:07 [331] test   crmd:   notice: run_graph: Transition 34 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Complete
Jun 19 19:45:07 [331] test   crmd: info: do_log:FSA: Input 
I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Jun 19 19:45:07 [331] test   crmd:   notice: do_state_transition:   State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]


I hope someone can help me figure this out :)

Thanks!



> -Original Message-----
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Monday, June 19, 2017 7:45 PM
> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
> clustering welcomed 
> Subject: Re: [ClusterLabs] clearing failed actions
>
> Hi Ken,
>
> /sorry for the long text/
>
> I have created a relatively simple setup to localize the issue.
> Three nodes, no fencing, just a master/slave mysql with two virual IPs.
> Just as a reminden, my primary issue is, that on clu

Re: [ClusterLabs] clearing failed actions

2017-06-19 Thread Attila Megyeri
] ctmgr   crmd:debug: do_state_transition:
Starting PEngine Recheck Timer
Jun 19 17:37:06 [18998] ctmgr   crmd:debug: crm_timer_start:Started 
PEngine Recheck Timer (I_PE_CALC:3ms), src=277



As you can see from the logs, pacemaker does not even try to re-monitor the 
resource that had a failure, or at least I'm not seeing it.
Cluster recheck interval is set to 30 seconds for troubleshooting reasons.

If I execute a

crm resource cleanup db-ip-master

the failure is removed.

Am I getting something terribly wrong here?
Or is this simply a bug in 1.1.10?


Thanks,
Attila




> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Wednesday, June 7, 2017 10:14 PM
> To: Attila Megyeri ; Cluster Labs - All topics
> related to open-source clustering welcomed 
> Subject: Re: [ClusterLabs] clearing failed actions
>
> On 06/01/2017 02:44 PM, Attila Megyeri wrote:
> > Ken,
> >
> > I noticed something strange, this might be the issue.
> >
> > In some cases, even the manual cleanup does not work.
> >
> > I have a failed action of resource "A" on node "a". DC is node "b".
> >
> > e.g.
> > Failed actions:
> > jboss_imssrv1_monitor_1 (node=ctims1, call=108, rc=1,
> status=complete, last-rc-change=Thu Jun  1 14:13:36 2017
> >
> >
> > When I attempt to do a "crm resource cleanup A" from node "b", nothing
> happens. Basically the lrmd on "a" is not notified that it should monitor the
> resource.
> >
> >
> > When I execute a "crm resource cleanup A" command on node "a" (where
> the operation failed) , the failed action is cleared properly.
> >
> > Why could this be happening?
> > Which component should be responsible for this? pengine, crmd, lrmd?
>
> The crm shell will send commands to attrd (to clear fail counts) and
> crmd (to clear the resource history), which in turn will record changes
> in the cib.
>
> I'm not sure how crm shell implements it, but crm_resource sends
> individual messages to each node when cleaning up a resource without
> specifying a particular node. You could check the pacemaker log on each
> node to see whether attrd and crmd are receiving those commands, and
> what they do in response.
>
>
> >> -Original Message-
> >> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> >> Sent: Thursday, June 1, 2017 6:57 PM
> >> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
> >> clustering welcomed 
> >> Subject: Re: [ClusterLabs] clearing failed actions
> >>
> >> thanks Ken,
> >>
> >>
> >>
> >>
> >>
> >>> -Original Message-
> >>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >>> Sent: Thursday, June 1, 2017 12:04 AM
> >>> To: users@clusterlabs.org
> >>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>
> >>> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> >>>> On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> >>>>> Hi Ken,
> >>>>>
> >>>>>
> >>>>>> -Original Message-
> >>>>>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >>>>>> Sent: Tuesday, May 30, 2017 4:32 PM
> >>>>>> To: users@clusterlabs.org
> >>>>>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>>>>
> >>>>>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Shouldn't the
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> cluster-recheck-interval="2m"
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> property instruct pacemaker to recheck the cluster every 2 minutes
> >>> and
> >>>>>>> clean the failcounts?
> >>>>>>
> >>>>>> It instructs pacemaker to recalculate whether any actions need to be
> >>>>>> taken (including expiring any failcounts appropriately).
> >>>>>>
> >>>>>>> At the primitive level I also have a
> >>>>>>>
> >>>>>>>
> >>>>>>>

Re: [ClusterLabs] clearing failed actions

2017-06-01 Thread Attila Megyeri
Ken,

I noticed something strange, this might be the issue.

In some cases, even the manual cleanup does not work.

I have a failed action of resource "A" on node "a". DC is node "b".

e.g.
Failed actions:
jboss_imssrv1_monitor_1 (node=ctims1, call=108, rc=1, status=complete, 
last-rc-change=Thu Jun  1 14:13:36 2017


When I attempt to do a "crm resource cleanup A" from node "b", nothing happens. 
Basically the lrmd on "a" is not notified that it should monitor the resource.


When I execute a "crm resource cleanup A" command on node "a" (where the 
operation failed) , the failed action is cleared properly.

Why could this be happening?
Which component should be responsible for this? pengine, crmd, lrmd?




> -Original Message-
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Thursday, June 1, 2017 6:57 PM
> To: kgail...@redhat.com; Cluster Labs - All topics related to open-source
> clustering welcomed 
> Subject: Re: [ClusterLabs] clearing failed actions
> 
> thanks Ken,
> 
> 
> 
> 
> 
> > -Original Message-
> > From: Ken Gaillot [mailto:kgail...@redhat.com]
> > Sent: Thursday, June 1, 2017 12:04 AM
> > To: users@clusterlabs.org
> > Subject: Re: [ClusterLabs] clearing failed actions
> >
> > On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> > > On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> > >> Hi Ken,
> > >>
> > >>
> > >>> -Original Message-
> > >>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> > >>> Sent: Tuesday, May 30, 2017 4:32 PM
> > >>> To: users@clusterlabs.org
> > >>> Subject: Re: [ClusterLabs] clearing failed actions
> > >>>
> > >>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> > >>>> Hi,
> > >>>>
> > >>>>
> > >>>>
> > >>>> Shouldn't the
> > >>>>
> > >>>>
> > >>>>
> > >>>> cluster-recheck-interval="2m"
> > >>>>
> > >>>>
> > >>>>
> > >>>> property instruct pacemaker to recheck the cluster every 2 minutes
> > and
> > >>>> clean the failcounts?
> > >>>
> > >>> It instructs pacemaker to recalculate whether any actions need to be
> > >>> taken (including expiring any failcounts appropriately).
> > >>>
> > >>>> At the primitive level I also have a
> > >>>>
> > >>>>
> > >>>>
> > >>>> migration-threshold="30" failure-timeout="2m"
> > >>>>
> > >>>>
> > >>>>
> > >>>> but whenever I have a failure, it remains there forever.
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> What could be causing this?
> > >>>>
> > >>>>
> > >>>>
> > >>>> thanks,
> > >>>>
> > >>>> Attila
> > >>> Is it a single old failure, or a recurring failure? The failure timeout
> > >>> works in a somewhat nonintuitive way. Old failures are not individually
> > >>> expired. Instead, all failures of a resource are simultaneously cleared
> > >>> if all of them are older than the failure-timeout. So if something keeps
> > >>> failing repeatedly (more frequently than the failure-timeout), none of
> > >>> the failures will be cleared.
> > >>>
> > >>> If it's not a repeating failure, something odd is going on.
> > >>
> > >> It is not a repeating failure. Let's say that a resource fails for 
> > >> whatever
> > action, It will remain in the failed actions (crm_mon -Af) until I issue a 
> > "crm
> > resource cleanup ". Even after days or weeks, even
> though
> > I see in the logs that cluster is rechecked every 120 seconds.
> > >>
> > >> How could I troubleshoot this issue?
> > >>
> > >> thanks!
> > >
> > >
> > > Ah, I see what you're saying. That's expected behavior.
> > >
> > > The failure-timeout applies to the failure *count* (which is used for
> > > checking against migration-threshold), not the failure *history* (which
> > > is used 

Re: [ClusterLabs] clearing failed actions

2017-06-01 Thread Attila Megyeri
thanks Ken,





> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Thursday, June 1, 2017 12:04 AM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] clearing failed actions
> 
> On 05/31/2017 12:17 PM, Ken Gaillot wrote:
> > On 05/30/2017 02:50 PM, Attila Megyeri wrote:
> >> Hi Ken,
> >>
> >>
> >>> -Original Message-
> >>> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >>> Sent: Tuesday, May 30, 2017 4:32 PM
> >>> To: users@clusterlabs.org
> >>> Subject: Re: [ClusterLabs] clearing failed actions
> >>>
> >>> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> >>>> Hi,
> >>>>
> >>>>
> >>>>
> >>>> Shouldn't the
> >>>>
> >>>>
> >>>>
> >>>> cluster-recheck-interval="2m"
> >>>>
> >>>>
> >>>>
> >>>> property instruct pacemaker to recheck the cluster every 2 minutes
> and
> >>>> clean the failcounts?
> >>>
> >>> It instructs pacemaker to recalculate whether any actions need to be
> >>> taken (including expiring any failcounts appropriately).
> >>>
> >>>> At the primitive level I also have a
> >>>>
> >>>>
> >>>>
> >>>> migration-threshold="30" failure-timeout="2m"
> >>>>
> >>>>
> >>>>
> >>>> but whenever I have a failure, it remains there forever.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> What could be causing this?
> >>>>
> >>>>
> >>>>
> >>>> thanks,
> >>>>
> >>>> Attila
> >>> Is it a single old failure, or a recurring failure? The failure timeout
> >>> works in a somewhat nonintuitive way. Old failures are not individually
> >>> expired. Instead, all failures of a resource are simultaneously cleared
> >>> if all of them are older than the failure-timeout. So if something keeps
> >>> failing repeatedly (more frequently than the failure-timeout), none of
> >>> the failures will be cleared.
> >>>
> >>> If it's not a repeating failure, something odd is going on.
> >>
> >> It is not a repeating failure. Let's say that a resource fails for whatever
> action, It will remain in the failed actions (crm_mon -Af) until I issue a 
> "crm
> resource cleanup ". Even after days or weeks, even though
> I see in the logs that cluster is rechecked every 120 seconds.
> >>
> >> How could I troubleshoot this issue?
> >>
> >> thanks!
> >
> >
> > Ah, I see what you're saying. That's expected behavior.
> >
> > The failure-timeout applies to the failure *count* (which is used for
> > checking against migration-threshold), not the failure *history* (which
> > is used for the status display).
> >
> > The idea is to have it no longer affect the cluster behavior, but still
> > allow an administrator to know that it happened. That's why a manual
> > cleanup is required to clear the history.
> 
> Hmm, I'm wrong there ... failure-timeout does expire the failure history
> used for status display.
> 
> It works with the current versions. It's possible 1.1.10 had issues with
> that.
> 

Well if nothing helps I will try to upgrade to a more recent version..



> Check the status to see which node is DC, and look at the pacemaker log
> there after the failure occurred. There should be a message about the
> failcount expiring. You can also look at the live CIB and search for
> last_failure to see what is used for the display.
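> For example, roughly:
> 
>     cibadmin -Q | grep last_failure
> 
> (the *_last_failure_0 lrm history entries are what feed that display)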
[AM] 

In the pacemaker log I see at every recheck interval the following lines:

Jun 01 16:54:08 [8700] ctabsws2pengine:  warning: unpack_rsc_op:
Processing failed op start for jboss_admin2 on ctadmin2: unknown error (1)

If I check the  CIB for the failure I see:





Really have no clue why this isn't cleared...



> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: clearing failed actions

2017-06-01 Thread Attila Megyeri
Thanks.

We have several clusters that have been working fine for several years without 
STONITH, so we did not really bother to implement it.

As for the failed actions - I cannot recall when they stopped being cleared, 
but they aren't.

When I check the pengine log on the DC, at every recheck interval I see lines 
like:

...pengine:  warning: unpack_rsc_op:Processing failed op start for 
jboss_admin1 on ctadmin1: unknown error (1)

or 
...pengine:  warning: unpack_rsc_op:Processing failed op monitor for 
jboss_abssrv2 on ctabs2: unknown error (1)


And these are the failed actions  visible in the crm_mon -f as well:


Failed actions:
jboss_admin1_start_0 (node=ctadmin1, call=120, rc=1, status=Timed Out, 
last-rc-change=Thu Jun  1 14:17:31 2017
, queued=40001ms, exec=0ms
): unknown error
jboss_abssrv2_monitor_1 (node=ctabs2, call=106, rc=1, status=complete, 
last-rc-change=Thu Jun  1 14:13:36 2017
, queued=0ms, exec=0ms
): unknown error


If I do a resource cleanup, the errors are gone.

At the same time I see no actions on the mentioned nodes - this log is from the 
DC...

On the mentioned nodes a regular monitoring operation is performed, and results 
in 0 - no error.

What am I missing here?



> -Original Message-
> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
> Sent: Thursday, June 1, 2017 8:34 AM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: Re: Antw: clearing failed actions
> 
> >>> Digimer  wrote on 01.06.2017 at 00:03 in message
> <50aad2be-185b-0348-6a93-987034c9c...@alteeve.ca>:
> [...]
> > I don't know, but according to Ken's last email, what you're seeing is
> > expected. I replied because of the misunderstanding of the roles that
> > quorum and fencing play. Running a cluster without fencing is dangerous.
> 
> I'd recommend this: Enable a working STONITH. Then if you see your cluster
> never uses STONITH, and everything works fine, and you feel you don't
> need it, then you can get rid of it.
> But don't try the other way 'round: Omit STONITH, expecting the cluster
> would work flawlessly, then (not) add STONITH.
> 
> Regards,
> Ulrich
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: clearing failed actions

2017-05-31 Thread Attila Megyeri


> -Original Message-
> From: Digimer [mailto:li...@alteeve.ca]
> Sent: Wednesday, May 31, 2017 2:20 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> ; Attila Megyeri 
> Subject: Re: [ClusterLabs] Antw: clearing failed actions
> 
> On 31/05/17 07:52 PM, Attila Megyeri wrote:
> > Hi,
> >
> >
> >
> >> -Original Message-
> >> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
> >> Sent: Wednesday, May 31, 2017 8:52 AM
> >> To: users@clusterlabs.org
> >> Subject: [ClusterLabs] Antw: clearing failed actions
> >>
> >>>>> Attila Megyeri wrote on 30.05.2017 at 16:13 in message
> >>
>  >> soft.local>:
> >>> Hi,
> >>>
> >>> Shouldn't the
> >>>
> >>> cluster-recheck-interval="2m"
> >>>
> >>> property instruct pacemaker to recheck the cluster every 2 minutes and
> >> clean
> >>> the failcounts?
> >>>
> >>> At the primitive level I also have a
> >>>
> >>> migration-threshold="30" failure-timeout="2m"
> >>>
> >>> but whenever I have a failure, it remains there forever.
> >>
> >> What type of failure do you have, and what is the status after that? Do
> you
> >> have fencing enabled?
> >>
> >
> > Typically a failed start, or a failed monitor.
> > Fencing is disabled as we have  multiple nodes / quorum.
> 
> Stonith and quorum solve different problems. Stonith is required, quorum
> is optional.
> 
> https://www.alteeve.com/w/The_2-Node_Myth
> 
> > Pacemaker is 1.1.10.
> >

I see your point, but how does it relate to the failcount issue? Does turning 
stonith off mean that the fail counters will not be removed even if the service 
recovers immediately after a restart?




> >
> >
> >>>
> >>>
> >>> What could be causing this?
> >>>
> >>> thanks,
> >>> Attila
> >>
> >>
> >>
> >>
> >>
> >> ___
> >> Users mailing list: Users@clusterlabs.org
> >> http://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >> Project Home: http://www.clusterlabs.org
> >> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >> Bugs: http://bugs.clusterlabs.org
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
> 
> 
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: clearing failed actions

2017-05-31 Thread Attila Megyeri
Hi,



> -Original Message-
> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
> Sent: Wednesday, May 31, 2017 8:52 AM
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: clearing failed actions
> 
> >>> Attila Megyeri wrote on 30.05.2017 at 16:13 in message
>  soft.local>:
> > Hi,
> >
> > Shouldn't the
> >
> > cluster-recheck-interval="2m"
> >
> > property instruct pacemaker to recheck the cluster every 2 minutes and
> clean
> > the failcounts?
> >
> > At the primitive level I also have a
> >
> > migration-threshold="30" failure-timeout="2m"
> >
> > but whenever I have a failure, it remains there forever.
> 
> What type of failure do you have, and what is the status after that? Do you
> have fencing enabled?
> 

Typically a failed start, or a failed monitor.
Fencing is disabled as we have  multiple nodes / quorum.

Pacemaker is 1.1.10.



> >
> >
> > What could be causing this?
> >
> > thanks,
> > Attila
> 
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] clearing failed actions

2017-05-30 Thread Attila Megyeri
Hi Ken,


> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Tuesday, May 30, 2017 4:32 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] clearing failed actions
> 
> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> > Hi,
> >
> >
> >
> > Shouldn't the
> >
> >
> >
> > cluster-recheck-interval="2m"
> >
> >
> >
> > property instruct pacemaker to recheck the cluster every 2 minutes and
> > clean the failcounts?
> 
> It instructs pacemaker to recalculate whether any actions need to be
> taken (including expiring any failcounts appropriately).
> 
> > At the primitive level I also have a
> >
> >
> >
> > migration-threshold="30" failure-timeout="2m"
> >
> >
> >
> > but whenever I have a failure, it remains there forever.
> >
> >
> >
> >
> >
> > What could be causing this?
> >
> >
> >
> > thanks,
> >
> > Attila
> Is it a single old failure, or a recurring failure? The failure timeout
> works in a somewhat nonintuitive way. Old failures are not individually
> expired. Instead, all failures of a resource are simultaneously cleared
> if all of them are older than the failure-timeout. So if something keeps
> failing repeatedly (more frequently than the failure-timeout), none of
> the failures will be cleared.
> 
> If it's not a repeating failure, something odd is going on.

It is not a repeating failure. Let's say that a resource fails for whatever 
action, It will remain in the failed actions (crm_mon -Af) until I issue a "crm 
resource cleanup ". Even after days or weeks, even though I see 
in the logs that cluster is rechecked every 120 seconds.

How could I troubleshoot this issue?

thanks!


> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] clearing failed actions

2017-05-30 Thread Attila Megyeri
Hi,

Shouldn't the

cluster-recheck-interval="2m"

property instruct pacemaker to recheck the cluster every 2 minutes and clean 
the failcounts?

At the primitive level I also have a

migration-threshold="30" failure-timeout="2m"

but whenever I have a failure, it remains there forever.
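
(For context, the first one is a cluster property and the latter two are 
resource meta attributes, i.e. in crmsh terms roughly - resource name and 
agent are placeholders:

property cib-bootstrap-options: cluster-recheck-interval=2m
primitive my_resource ocf:heartbeat:Dummy \
meta migration-threshold=30 failure-timeout=2m
)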


What could be causing this?

thanks,
Attila
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond

2017-05-24 Thread Attila Megyeri
Hi Klaus,

Thank you for your response.
I tried many things, but no luck.

We have many pacemaker clusters with 99% identical configurations, package 
versions, and only this one causes issues. (BTW we use unicast for corosync, 
but this is the same for our other clusters as well.)
I checked all connection settings between the nodes (to confirm there are no 
firewall issues), increased the number of cores on each node, but still - as 
long as a monitor operation is pending for a resource, no other operation is 
executed.

E.g. while resource A is being monitored with a 90-second timeout, I cannot do a 
cleanup or a start/stop on any other resource until that check times out.

Two more interesting things:
- Cluster recheck is set to 2 minutes, and even though the resources are 
running properly, the fail counters are not reduced and crm_mon lists the 
resources in the failed actions section forever, or until I manually do a 
resource cleanup.
- If I execute a crm resource cleanup RES_name from another node, sometimes it 
simply does not clean up the failed state. If I execute it from the node 
where the resource IS actually running, the resource is removed from the failed 
actions.


What do you recommend - how could I start troubleshooting these issues? As I 
said, this setup works fine in several other systems, but here I am 
really, really stuck.


thanks!

Attila





> -Original Message-
> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
> Sent: Wednesday, May 10, 2017 2:04 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
> 
> On 05/09/2017 10:34 PM, Attila Megyeri wrote:
> >
> > Actually I found some more details:
> >
> >
> >
> > there are two resources: A and B
> >
> >
> >
> > resource B depends on resource A (when the RA monitors B, it will fail
> > if A is not running properly)
> >
> >
> >
> > If I stop resource A, the next monitor operation of "B" will fail.
> > Interestingly, this check happens immediately after A is stopped.
> >
> >
> >
> > B is configured to restart if monitor fails. Start timeout is rather
> > long, 180 seconds. So pacemaker tries to restart B, and waits.
> >
> >
> >
> > If I want to start "A", nothing happens until the start operation of
> > "B" fails - typically several minutes.
> >
> >
> >
> >
> >
> > Is this the right behavior?
> >
> > It appears that pacemaker is blocked until resource B is being
> > started, and I cannot really start its dependency...
> >
> > Shouldn't it be possible to start a resource while another resource is
> > also starting?
> >
> 
> As long as resources don't depend on each other parallel starting should
> work/happen.
> 
> The number of parallel actions executed is derived from the number of
> cores and
> when load is detected some kind of throttling kicks in (in fact reduction of
> the operations executed in parallel with the aim to reduce the load induced
> by pacemaker). When throttling kicks in you should get log messages (there
> is in fact a parallel discussion going on ...).
> No idea if throttling might be a reason here but maybe worth considering
> at least.
> 
> Another reason why certain things happen with quite some delay I've
> observed
> is that obviously some situations are just resolved when the
> cluster-recheck-interval
> triggers a pengine run in addition to those triggered by changes.
> You might easily verify this by changing the cluster-recheck-interval.
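> For example (crmsh syntax; pick whatever interval you want to test with):
> 
>     crm configure property cluster-recheck-interval=1m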
> 
> Regards,
> Klaus
> 
> >
> >
> >
> >
> > Thanks,
> >
> > Attila
> >
> >
> >
> >
> >
> > *From:*Attila Megyeri [mailto:amegy...@minerva-soft.com]
> > *Sent:* Tuesday, May 9, 2017 9:53 PM
> > *To:* users@clusterlabs.org; kgail...@redhat.com
> > *Subject:* [ClusterLabs] Pacemaker occasionally takes minutes to respond
> >
> >
> >
> > Hi Ken, all,
> >
> >
> >
> >
> >
> > We ran into an issue very similar to the one described in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1430112 /  [Intel 7.4 Bug]
> > Pacemaker occasionally takes minutes to respond
> >
> >
> >
> > But  in our case we are not using fencing/stonith at all.
> >
> >
> >
> > Many times when I want to start/stop/cleanup a resource, it takes tens
> > of seconds (or even minutes) till the command gets executed. The logs
> > show nothing in that period, the redundant rings show no fault.
> >
> >
> >
> > Could this be the same issue?

Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond

2017-05-09 Thread Attila Megyeri
Actually I found some more details:

there are two resources: A and B

resource B depends on resource A (when the RA monitors B, it will fail if A is 
not running properly)

If I stop resource A, the next monitor operation of "B" will fail. 
Interestingly, this check happens immediately after A is stopped.

B is configured to restart if monitor fails. Start timeout is rather long, 180 
seconds. So pacemaker tries to restart B, and waits.

If I want to start "A", nothing happens until the start operation of "B" fails 
- typically several minutes.


Is this the right behavior?
It appears that pacemaker is blocked until resource B is being started, and I 
cannot really start its dependency...
Shouldn't it be possible to start a resource while another resource is also 
starting?


Thanks,
Attila


From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
Sent: Tuesday, May 9, 2017 9:53 PM
To: users@clusterlabs.org; kgail...@redhat.com
Subject: [ClusterLabs] Pacemaker occasionally takes minutes to respond

Hi Ken, all,


We ran into an issue very similar to the one described in 
https://bugzilla.redhat.com/show_bug.cgi?id=1430112 /  [Intel 7.4 Bug] 
Pacemaker occasionally takes minutes to respond

But  in our case we are not using fencing/stonith at all.

Many times when I want to start/stop/cleanup a resource, it takes tens of 
seconds (or even minutes) until the command gets executed. The logs show nothing 
in that period, and the redundant rings show no fault.

Could this be the same issue?

Any hints on how to troubleshoot this?
It is  pacemaker 1.1.10, corosync 2.3.3


Cheers,
Attila



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker occasionally takes minutes to respond

2017-05-09 Thread Attila Megyeri
Hi Ken, all,


We ran into an issue very similar to the one described in 
https://bugzilla.redhat.com/show_bug.cgi?id=1430112 /  [Intel 7.4 Bug] 
Pacemaker occasionally takes minutes to respond

But  in our case we are not using fencing/stonith at all.

Many times when I want to start/stop/cleanup a resource, it takes tens of 
seconds (or even minutes) until the command gets executed. The logs show nothing 
in that period, and the redundant rings show no fault.

Could this be the same issue?

Any hints on how to troubleshoot this?
It is  pacemaker 1.1.10, corosync 2.3.3


Cheers,
Attila



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave

2016-08-30 Thread Attila Megyeri

Hi Ken,



> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Thursday, August 25, 2016 6:03 PM
> To: Attila Megyeri ; Cluster Labs - All topics
> related to open-source clustering welcomed 
> Subject: Re: [ClusterLabs] Mysql slave did not start replication after 
> failure,
> and read-only IP also remained active on the much outdated slave
>
> On 08/22/2016 03:56 PM, Attila Megyeri wrote:
> > Hi Ken,
> >
> > Thanks a lot for your feedback, my answers are inline.
> >
> >
> >
> >> -Original Message-
> >> From: Ken Gaillot [mailto:kgail...@redhat.com]
> >> Sent: Monday, August 22, 2016 4:12 PM
> >> To: users@clusterlabs.org
> >> Subject: Re: [ClusterLabs] Mysql slave did not start replication after
> failure,
> >> and read-only IP also remained active on the much outdated slave
> >>
> >> On 08/22/2016 07:24 AM, Attila Megyeri wrote:
> >>> Hi Andrei,
> >>>
> >>> I waited several hours, and nothing happened.
> >>
> >> And actually, we can see from the configuration you provided that
> >> cluster-recheck-interval is 2 minutes.
> >>
> >> I don't see anything about stonith; is it enabled and tested? This looks
> >> like a situation where stonith would come into play. I know that power
> >> fencing can be rough on a MySQL database, but perhaps intelligent
> >> switches with network fencing would be appropriate.
> >
> > Yes, there is no stonith in place because we found it too aggressive for this
> purpose. And to be honest I'm not sure if that would have worked here.
>
> The problem is, in a situation like this where corosync communication is
> broken, both nodes will think they are the only surviving node, and
> bring up all resources. Network fencing would isolate one of the nodes
> so at least it can't cause any serious trouble.
>

Quorum is enabled, and this is a three-node cluster, so this should not be an 
issue here.
The real problem was that the mysql instance was still running, it was neither 
master nor slave, and the VIP was still assigned, and as such, clients were 
able to connect to this outdated instance.
As the next step I will upgrade the resource agents, as I saw there are 
already some fixes for issues where the slave status is detected incorrectly.
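
(For context, the reader VIP placement is driven by the RA's "readable" node 
attribute, i.e. a location rule roughly like the one below - assuming the 
stock ocf:heartbeat:mysql behaviour; resource and constraint names are 
placeholders:

location loc_reader_vip reader_vip \
rule -inf: readable eq 0
)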


Anyway, thanks for your help!





> >> The "Corosync main process was not scheduled" message is the start of
> >> the trouble. It means the system was overloaded and corosync didn't get
> >> any CPU time, so it couldn't maintain cluster communication.
> >>
> >
> > True, this was the cause of the issue, but we still couldn't find a 
> > solution to
> get rid of the original problem.
> > Nevertheless I think that the issue is that the RA did not properly detect
> the state of mysql.
> >
> >
> >> Probably the most useful thing would be to upgrade to a recent version
> >> of corosync+pacemaker+resource-agents. Recent corosync versions run
> with
> >> realtime priority, which makes this much less likely.
> >>
> >> Other than that, figure out what the load issue was, and try to prevent
> >> it from recurring.
> >>
> >
> > Whereas the original problem might have been caused by corosync CPU
> issue, I am sure that once the load was gone the proper mysql state should
> have been detected.
> > The RA, responsible for this is almost the latest version, and I did not see
> any changes related to this functionality.
> >
> >> I'm not familiar enough with the RA to comment on its behavior. If you
> >> think it's suspect, check the logs during the incident for messages from
> >> the RA.
> >>
> >
> > So did I, but there are very few details logged while this was happening, so
> I am pretty much stuck :(
> >
> > I thought that someone might have a clue what is wrong in the RA  - that
> causes this fake state detection.
> > (Un)fortunately I cannot reproduce this situation for now.
>
> Perhaps with the split-brain, the slave tried to come up as master?
>
> >
> > Who could help me in troubleshooting this?
> >
> > Thanks,
> > Attila
> >
> >
> >
> >
> >>> I assume that the RA does not treat this case properly. Mysql was
> running,
> >> but the "show slave status" command returned something that the RA
> was
> >> not prepared to parse, and instead of reporting a non-readable attribute,
> it
> >> returned some generic error, that did not stop th

Re: [ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave

2016-08-22 Thread Attila Megyeri
Hi Ken,

Thanks a lot for your feedback, my answers are inline.



> -Original Message-
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Monday, August 22, 2016 4:12 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Mysql slave did not start replication after 
> failure,
> and read-only IP also remained active on the much outdated slave
> 
> On 08/22/2016 07:24 AM, Attila Megyeri wrote:
> > Hi Andrei,
> >
> > I waited several hours, and nothing happened.
> 
> And actually, we can see from the configuration you provided that
> cluster-recheck-interval is 2 minutes.
> 
> I don't see anything about stonith; is it enabled and tested? This looks
> like a situation where stonith would come into play. I know that power
> fencing can be rough on a MySQL database, but perhaps intelligent
> switches with network fencing would be appropriate.

Yes, there is no stonith in place because we found it too aggressive for this 
purpose. And to be honest I'm not sure if that would have worked here.

> 
> The "Corosync main process was not scheduled" message is the start of
> the trouble. It means the system was overloaded and corosync didn't get
> any CPU time, so it couldn't maintain cluster communication.
> 
 
True, this was the cause of the issue, but we still couldn't find a solution to 
get rid of the original problem.
Nevertheless I think that the issue is that the RA did not properly detect the 
state of mysql.


> Probably the most useful thing would be to upgrade to a recent version
> of corosync+pacemaker+resource-agents. Recent corosync versions run with
> realtime priority, which makes this much less likely.
> 
> Other than that, figure out what the load issue was, and try to prevent
> it from recurring.
> 

Whereas the original problem might have been caused by a corosync CPU issue, I am 
sure that once the load was gone the proper mysql state should have been 
detected.
The RA responsible for this is almost the latest version, and I did not see 
any changes related to this functionality.

> I'm not familiar enough with the RA to comment on its behavior. If you
> think it's suspect, check the logs during the incident for messages from
> the RA.
> 

So did I, but there are very few details logged while this was happening, so I 
am pretty much stuck :(

I thought that someone might have a clue about what is wrong in the RA that 
causes this false state detection.
(Un)fortunately I cannot reproduce this situation for now.

 
Who could help me in troubleshooting this?

Thanks,
Attila




> > I assume that the RA does not treat this case properly. Mysql was running,
> but the "show slave status" command returned something that the RA was
> not prepared to parse, and instead of reporting a non-readable attribute, it
> returned some generic error, that did not stop the server.
> >
> > Rgds,
> > Attila
> >
> >
> > -Original Message-
> > From: Andrei Borzenkov [mailto:arvidj...@gmail.com]
> > Sent: Monday, August 22, 2016 11:42 AM
> > To: Cluster Labs - All topics related to open-source clustering welcomed
> 
> > Subject: Re: [ClusterLabs] Mysql slave did not start replication after 
> > failure,
> and read-only IP also remained active on the much outdated slave
> >
> > On Mon, Aug 22, 2016 at 12:18 PM, Attila Megyeri
> >  wrote:
> >> Dear community,
> >>
> >>
> >>
> >> A few days ago we had an issue in our Mysql M/S replication cluster.
> >>
> >> We have a one R/W Master, and a one RO Slave setup. RO VIP is
> supposed to be
> >> running on the slave if it is not too much behind the master, and if any
> >> error occurs, RO VIP is moved to the master.
> >>
> >>
> >>
> >> Something happened with the slave Mysql (some disk issue, still
> >> investigating), but the problem is, that the slave VIP remained on the
> slave
> >> device, even though the slave process was not running, and the server
> was
> >> much outdated.
> >>
> >>
> >>
> >> During the issue the following log entries appeared (just an extract as it
> >> would be too long):
> >>
> >>
> >>
> >>
> >>
> >> Aug 20 02:04:07 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process
> was
> >> not scheduled for 14088.5488 ms (threshold is 4000. ms). Consider
> token
> >> timeout increase.
> >>
> >> Aug 20 02:04:07 ctdb1 corosync[1056]:   [TOTEM ] A processor failed,
> forming
> >> new configuration.
> >>
> >> Aug 20 02:04:34 ctdb1 corosync[

Re: [ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave

2016-08-22 Thread Attila Megyeri
Hi Andrei,

I waited several hours, and nothing happened. 

I assume that the RA does not handle this case properly. Mysql was running, but 
the "show slave status" command returned something that the RA was not prepared 
to parse, and instead of reporting a non-readable attribute, it returned some 
generic error that did not stop the server.

Rgds,
Attila


-Original Message-
From: Andrei Borzenkov [mailto:arvidj...@gmail.com] 
Sent: Monday, August 22, 2016 11:42 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Mysql slave did not start replication after failure, 
and read-only IP also remained active on the much outdated slave

On Mon, Aug 22, 2016 at 12:18 PM, Attila Megyeri
 wrote:
> Dear community,
>
>
>
> A few days ago we had an issue in our Mysql M/S replication cluster.
>
> We have a one R/W Master, and a one RO Slave setup. RO VIP is supposed to be
> running on the slave if it is not too much behind the master, and if any
> error occurs, RO VIP is moved to the master.
>
>
>
> Something happened with the slave Mysql (some disk issue, still
> investigating), but the problem is, that the slave VIP remained on the slave
> device, even though the slave process was not running, and the server was
> much outdated.
>
>
>
> During the issue the following log entries appeared (just an extract as it
> would be too long):
>
>
>
>
>
> Aug 20 02:04:07 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was
> not scheduled for 14088.5488 ms (threshold is 4000. ms). Consider token
> timeout increase.
>
> Aug 20 02:04:07 ctdb1 corosync[1056]:   [TOTEM ] A processor failed, forming
> new configuration.
>
> Aug 20 02:04:34 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was
> not scheduled for 27065.2559 ms (threshold is 4000. ms). Consider token
> timeout increase.
>
> Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6720)
> was formed. Members left: 168362243 168362281 168362282 168362301 168362302
> 168362311 168362312 1
>
> Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6724)
> was formed. Members
>
> ..
>
> Aug 20 02:13:28 ctdb1 corosync[1056]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
>
> ..
>
> Aug 20 02:13:29 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: readable (1)
>
> …
>
> Aug 20 02:13:32 ctdb1 mysql(db-mysql)[10492]: INFO: post-demote notification
> for ctdb1
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-master)[10490]: INFO: IP status = ok,
> IP_CIP=
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-ip-master_stop_0 (call=371, rc=0, cib-update=179, confirmed=true) ok
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Adding inet address
> xxx/24 with broadcast address  to device eth0
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Bringing device
> eth0 up
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO:
> /usr/lib/heartbeat/send_arp -i 200 -r 5 -p
> /usr/var/run/resource-agents/send_arp-xxx eth0 xxx auto not_used not_used
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-ip-slave_start_0 (call=377, rc=0, cib-update=180, confirmed=true) ok
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-ip-slave_monitor_2 (call=380, rc=0, cib-update=181, confirmed=false)
> ok
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_notify_0 (call=374, rc=0, cib-update=0, confirmed=true) ok
>
> Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: master-db-mysql (1)
>
> Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_perform_update: Sent
> update 1622: master-db-mysql=1
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_demote_0 (call=384, rc=0, cib-update=182, confirmed=true) ok
>
> Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11160]: INFO: Ignoring post-demote
> notification for my own demotion.
>
> Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_notify_0 (call=387, rc=0, cib-update=0, confirmed=true) ok
>
> Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11185]: ERROR: check_slave invoked on
> an instance that is not a replication slave.
>
> Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_monitor_7000 (call=390, rc=0, cib-update=183, confirmed=false) ok
>
> Aug 20 02:13:33 ctdb1 ntpd[1560]: Listen normally on 16 eth0 . UDP 123
>
> Aug 20 02:13:

[ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave

2016-08-22 Thread Attila Megyeri
Dear community,

A few days ago we had an issue in our Mysql M/S replication cluster.
We have a setup with one R/W master and one RO slave. The RO VIP is supposed to 
run on the slave as long as it is not too far behind the master; if any error 
occurs, the RO VIP is moved to the master.
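
(For context, the intent is implemented with something along these lines - a 
simplified, illustrative crm snippet, not our exact configuration; "readable" 
is the node attribute the mysql RA maintains:)

  # keep the RO VIP away from any node whose mysql is reported as not readable
  location db-ip-slave-needs-readable db-ip-slave \
          rule -inf: readable eq 0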

Something happened with the slave mysql (some disk issue, still under 
investigation), but the problem is that the slave VIP remained on the slave 
node, even though the slave replication was not running and the server was far 
out of date.

During the issue the following log entries appeared (just an extract as it 
would be too long):


Aug 20 02:04:07 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was not 
scheduled for 14088.5488 ms (threshold is 4000. ms). Consider token timeout 
increase.
Aug 20 02:04:07 ctdb1 corosync[1056]:   [TOTEM ] A processor failed, forming 
new configuration.
Aug 20 02:04:34 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was not 
scheduled for 27065.2559 ms (threshold is 4000. ms). Consider token timeout 
increase.
Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6720) 
was formed. Members left: 168362243 168362281 168362282 168362301 168362302 
168362311 168362312 1
Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6724) 
was formed. Members
..
Aug 20 02:13:28 ctdb1 corosync[1056]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
..
Aug 20 02:13:29 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: readable (1)
...
Aug 20 02:13:32 ctdb1 mysql(db-mysql)[10492]: INFO: post-demote notification 
for ctdb1
Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-master)[10490]: INFO: IP status = ok, 
IP_CIP=
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-ip-master_stop_0 (call=371, rc=0, cib-update=179, confirmed=true) ok
Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Adding inet address 
xxx/24 with broadcast address  to device eth0
Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Bringing device eth0 up
Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: 
/usr/lib/heartbeat/send_arp -i 200 -r 5 -p 
/usr/var/run/resource-agents/send_arp-xxx eth0 xxx auto not_used not_used
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-ip-slave_start_0 (call=377, rc=0, cib-update=180, confirmed=true) ok
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-ip-slave_monitor_2 (call=380, rc=0, cib-update=181, confirmed=false) ok
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-mysql_notify_0 (call=374, rc=0, cib-update=0, confirmed=true) ok
Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: master-db-mysql (1)
Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_perform_update: Sent update 
1622: master-db-mysql=1
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-mysql_demote_0 (call=384, rc=0, cib-update=182, confirmed=true) ok
Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11160]: INFO: Ignoring post-demote 
notification for my own demotion.
Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-mysql_notify_0 (call=387, rc=0, cib-update=0, confirmed=true) ok
Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11185]: ERROR: check_slave invoked on an 
instance that is not a replication slave.
Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-mysql_monitor_7000 (call=390, rc=0, cib-update=183, confirmed=false) ok
Aug 20 02:13:33 ctdb1 ntpd[1560]: Listen normally on 16 eth0 . UDP 123
Aug 20 02:13:33 ctdb1 ntpd[1560]: Deleting interface #12 eth0, xxx#123, 
interface stats: received=0, sent=0, dropped=0, active_time=2637334 secs
Aug 20 02:13:33 ctdb1 ntpd[1560]: peers refreshed
Aug 20 02:13:33 ctdb1 ntpd[1560]: new interface(s) found: waking up resolver
Aug 20 02:13:40 ctdb1 mysql(db-mysql)[11224]: ERROR: check_slave invoked on an 
instance that is not a replication slave.
Aug 20 02:13:47 ctdb1 mysql(db-mysql)[11263]: ERROR: check_slave invoked on an 
instance that is not a replication slave.

And from this point on, the last two lines repeated every 7 seconds (the mysql 
monitoring interval).


The expected behavior was that the slave (RO) VIP should have been moved to the 
master, as the secondary db was outdated.
Unfortunately I cannot recall what crm_mon was showing when the issue was 
present, but I am sure that the RA did not handle the situation properly.
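
(What I will check the next time this happens is what the RA actually reported 
to the cluster - crm_mon can show the attributes it maintains, e.g. the 
"readable" attribute seen in the log above; <slave-node> is a placeholder:)

  # one-shot view including node attributes (readable, master-db-mysql, ...)
  crm_mon -A1
  # or query a single transient attribute directly
  crm_attribute -N <slave-node> -n readable -G -l reboot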

Placing the slave node into standby and then back online resolved the issue 
immediately (the slave started to sync, and within a few minutes it caught up 
with the master).
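
(Concretely, the manual recovery was just:)

  crm node standby <slave-node>    # resources stop on the outdated slave
  # ... wait until everything has stopped there, then
  crm node online <slave-node>     # the slave rejoins and replication catches up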


Here is the relevant part of the configuration:


primitive db-ip-master ocf:heartbeat:IPaddr2 \
params lvs_support="true" ip="XXX" cidr_netmask="24" 
broadcast="XXX" \
op start interval="0" timeout="20s" on-fail="re

Re: [ClusterLabs] Mysql M/S, binlogs - how to delete them safely without failing over first?

2015-08-18 Thread Attila Megyeri
Hi Brett,

Thanks for the quick response.

I am using the RAs from 
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/mysql

I have no experience with the Percona agent so far. The ClusterLabs RA also 
issues CHANGE MASTER TO on promote. I have not checked the most recent 
versions, but it definitely did not work properly before.
What is the expected behaviour? When there is a failover, is the new slave 
supposed to know the last master position of the old slave (which was still in 
sync)?
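
(Just to be sure we mean the same thing - by "change master" I mean the new 
slave being pointed at the new master's current coordinates, roughly like this; 
host, user, password, file and position below are placeholders:)

  # on the new master: note the current binlog coordinates
  mysql -e 'SHOW MASTER STATUS\G'
  # on the new slave: repoint replication using the values printed above
  mysql -e "STOP SLAVE;
            CHANGE MASTER TO
              MASTER_HOST='<new-master>',
              MASTER_USER='repl', MASTER_PASSWORD='<secret>',
              MASTER_LOG_FILE='mysql-bin.000123',
              MASTER_LOG_POS=4;
            START SLAVE;"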


How do you keep/maintain the binlog files in your environments?

Thanks,
Attila


From: Brett Moser [mailto:brett.mo...@gmail.com]
Sent: Tuesday, August 18, 2015 11:35 PM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Mysql M/S, binlogs - how to delete them safely 
without failing over first?

Hi Attila,

It sounds like on failover the new slave is not having its master replication 
file & position updated.

What Resource Agent are you using to control the M/S mysql resource?

Have you investigated the Percona agent? It performs the CHANGE MASTER TO 
commands for you, and I have found it to be a good RA for my M/S MySQL purposes.

https://github.com/percona/percona-pacemaker-agents

regards,
-Brett Moser


On Tue, Aug 18, 2015 at 2:19 PM, Attila Megyeri 
<amegy...@minerva-soft.com> wrote:
Hi List,

We are using M/S replication in a couple of clusters, and there is an issue 
that has been causing headaches for me for quite some time.

My problem comes from the fact that binlog files grow very quickly on both the 
Master and Slave nodes.

Let’s assume that node 1 is the master – it logs all operations to its binlog.
Node 2 is the slave, and it replicates everything properly. (It is strange, 
however, that node2 must also generate and keep binlog files while it is a 
slave, but let’s assume that this is by design.)

There are ways to configure mysql to keep the binlog files only for some time, 
e.g. 10 days, but I had an issue with this:

To explain the issue, please consider the following case:
Let’s say that both node1 and node2 are up-to-date, and we did a failover test 
on day 0.
DB1 is the master, DB2 is the slave.
DB1 has master position “A”, DB2 has master position “B”.

After 20 days, lots of binlog files exist on both servers, and I would like to 
get rid of them, as the slave is up-to-date.

I decide to delete all binlog files older than 1 day by issuing “purge binary 
logs…”.

I try to fail over so that DB2 becomes the master, but DB1 tries to connect to 
DB2 for replication and wants to start from a position that it “remembers” from 
the time when DB2 was last the master, so it looks for binlog files that are 20 
days old. The issue is that those old binlog files have since been deleted, and 
replication stops with an error (cannot find binlogs).

Am I doing something wrong here, or is something configured badly?
OR am I assuming correctly that in order to purge the binlog files from both 
servers I need to make a failover first?


Thank you!

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Mysql M/S, binlogs - how to delete them safely without failing over first?

2015-08-18 Thread Attila Megyeri
Hi List,

We are using M/S replication in a couple of clusters, and there is an issue 
that has been causing headaches for me for quite some time.

My problem comes from the fact that binlog files grow very quickly on both the 
Master and Slave nodes.

Let's assume that node 1 is the master - it logs all operations to its binlog.
Node 2 is the slave, and it replicates everything properly. (It is strange, 
however, that node2 must also generate and keep binlog files while it is a 
slave, but let's assume that this is by design.)

There are ways to configure mysql to keep the binlog files only for some time, 
e.g. 10 days, but I had an issue with this:

To explain the issue, please consider the following case:
Let's say that both node1 and node2 are up-to-date, and we did a failover test 
on day 0.
DB1 is the master, DB2 is the slave.
DB1 has master position "A", DB2 has master position "B".

After 20 days, lots of binlog files exist on both servers, and I would like to 
get rid of them, as the slave is up-to-date.

I decide to delete all binlog files older than 1 day by issuing "purge binary 
logs...".

I try to fail over so that DB2 becomes the master, but DB1 tries to connect to 
DB2 for replication and wants to start from a position that it "remembers" from 
the time when DB2 was last the master, so it looks for binlog files that are 20 
days old. The issue is that those old binlog files have since been deleted, and 
replication stops with an error (cannot find binlogs).
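
(To be explicit, what I would like to be able to do safely is roughly the 
following - the file name is just an example:)

  # on the current slave: which master binlog is it still processing?
  mysql -e 'SHOW SLAVE STATUS\G' | grep Relay_Master_Log_File
  # on the master: purge only up to (not including) that file
  mysql -e "PURGE BINARY LOGS TO 'mysql-bin.000120';"
  # (time-based retention, e.g. expire_logs_days=10 in my.cnf, is the
  #  "keep for some time" option mentioned above, but it does not help with the
  #  old coordinates the other node remembers from when it was a slave)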

Am I doing something wrong here, or is something configured badly?
OR am I assuming correctly that in order to purge the binlog files from both 
servers I need to make a failover first?


Thank you!
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Memory leak in crm_mon ?

2015-08-16 Thread Attila Megyeri
Hi Andrew,

I managed to isolate / reproduce the issue. You might want to take a look, as 
it might be present in 1.1.12 as well.

I monitor my cluster from putty, mainly this way:
- I have a putty (Windows client) session, that connects via SSH to the box, 
authenticates using public key as a non-root user.
- It immediately sends a "sudo crm_mon -Af" command, so with a single click I 
have a nice view of what the cluster is doing.

Whenever I close this putty window (terminate the app), the crm_mon process 
goes to 100% CPU usage, starts to leak, consumes all memory within a few hours 
and then takes down the whole cluster.
This does not happen if I quit crm_mon with Ctrl-C.

I can reproduce this 100% of the time with crm_mon 1.1.10, using the mainstream 
ubuntu trusty packages.
This might be related to how sudo executes crm_mon, and to what crm_mon 
receives as a signal when the session is terminated.
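
(If someone wants to try to reproduce it, the essence from a second machine is 
roughly the following; <node> is a placeholder:)

  # start the monitor the same way the putty session does
  ssh -t <node> 'sudo crm_mon -Af'
  # then kill the ssh client hard (closing the putty window has the same
  # effect), e.g. from another local shell:
  pkill -9 -f 'ssh -t <node>'
  # on the node, crm_mon keeps running, pegs a CPU and starts growing:
  top -b -n1 | grep crm_mon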

Now I know what I need to pay attention to in order to avoid this problem, but 
you might want to check whether this issue is still present.


Thanks,
Attila 






-Original Message-----
From: Attila Megyeri [mailto:amegy...@minerva-soft.com] 
Sent: Friday, August 14, 2015 12:40 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Memory leak in crm_mon ?



-Original Message-
From: Andrew Beekhof [mailto:and...@beekhof.net] 
Sent: Tuesday, August 11, 2015 2:49 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Memory leak in crm_mon ?


> On 10 Aug 2015, at 5:33 pm, Attila Megyeri  wrote:
> 
> Hi!
>  
> We are building a new cluster on top of pacemaker/corosync and several times 
> during the past days we noticed that „crm_mon -Af” used up all the 
> memory+swap and caused high CPU usage. Killing the process solves the issue.
>  
> We are using the binary package versions available in the latest ubuntu 
> trusty, namely:
>  
> crmsh                  1.2.5+hg1034-1ubuntu4
> pacemaker              1.1.10+git20130802-1ubuntu2.3
> pacemaker-cli-utils    1.1.10+git20130802-1ubuntu2.3
> corosync               2.3.3-1ubuntu1
>  
> Kernel is 3.13.0-46-generic
>  
> Looking back some „atop” data, the CPU went to 100% many times during the 
> last couple of days, at various times, more often around midnight exaclty 
> (strange).
>  
> 08.05 14:00
> 08.06 21:41
> 08.07 00:00
> 08.07 00:00
> 08.08 00:00
> 08.09 06:27
>  
> Checked the corosync log and syslog, but did not find any correlation between 
> the entries int he logs around the specific times.
> For most of the time, the node running the crm_mon was the DC as well – not 
> running any resources (e.g. a pairless node for quorum).
>  
>  
> We have another running system, where everything works perfecly, whereas it 
> is almost the same:
>  
> crmsh                  1.2.5+hg1034-1ubuntu4
> pacemaker              1.1.10+git20130802-1ubuntu2.1
> pacemaker-cli-utils    1.1.10+git20130802-1ubuntu2.1
> corosync               2.3.3-1ubuntu1
>  
> Kernel is 3.13.0-8-generic
>  
>  
> Is this perhaps a known issue?

Possibly, that version is over 2 years old.

> Any hints?

Getting something a little more recent would be the best place to start

Thanks Andrew,

I tried to upgrade to 1.1.12 using the packages available at 
https://launchpad.net/~syseleven-platform . In the first attempt I upgraded a 
single node, to see how it works out, but I ended up with errors like

Could not establish cib_rw connection: Connection refused (111)

I have disabled the firewall, with no change. The node appears to be running 
but does not see any of the other nodes, and on the other nodes this node shows 
up as UNCLEAN. (I assume corosync is fine, but pacemaker is not.)
I use udpu for the transport.

Am I doing something wrong? I tried to look for some howtos on upgrade, but the 
only thing I found was the rather outdated   http://clusterlabs.org/wiki/Upgrade

Could you please direct me to some howto/guide on how to perform the upgrade?

Or am I facing some compatibility issue, meaning I should extract the whole 
CIB, upgrade all nodes and reconfigure the cluster from scratch? (The cluster 
is meant to go live in 2 days... :) )

Thanks a lot in advance




>  
> Thanks!
> ___
> Users mailing list: Users@clusterlabs.org 
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://

Re: [ClusterLabs] Memory leak in crm_mon ?

2015-08-13 Thread Attila Megyeri


-Original Message-
From: Andrew Beekhof [mailto:and...@beekhof.net] 
Sent: Tuesday, August 11, 2015 2:49 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Memory leak in crm_mon ?


> On 10 Aug 2015, at 5:33 pm, Attila Megyeri  wrote:
> 
> Hi!
>  
> We are building a new cluster on top of pacemaker/corosync and several times 
> during the past days we noticed that „crm_mon -Af” used up all the 
> memory+swap and caused high CPU usage. Killing the process solves the issue.
>  
> We are using the binary package versions available in the latest ubuntu 
> trusty, namely:
>  
> crmsh                  1.2.5+hg1034-1ubuntu4
> pacemaker              1.1.10+git20130802-1ubuntu2.3
> pacemaker-cli-utils    1.1.10+git20130802-1ubuntu2.3
> corosync               2.3.3-1ubuntu1
>  
> Kernel is 3.13.0-46-generic
>  
> Looking back some „atop” data, the CPU went to 100% many times during the 
> last couple of days, at various times, more often around midnight exaclty 
> (strange).
>  
> 08.05 14:00
> 08.06 21:41
> 08.07 00:00
> 08.07 00:00
> 08.08 00:00
> 08.09 06:27
>  
> Checked the corosync log and syslog, but did not find any correlation between 
> the entries int he logs around the specific times.
> For most of the time, the node running the crm_mon was the DC as well – not 
> running any resources (e.g. a pairless node for quorum).
>  
>  
> We have another running system, where everything works perfecly, whereas it 
> is almost the same:
>  
> crmsh                  1.2.5+hg1034-1ubuntu4
> pacemaker              1.1.10+git20130802-1ubuntu2.1
> pacemaker-cli-utils    1.1.10+git20130802-1ubuntu2.1
> corosync               2.3.3-1ubuntu1
>  
> Kernel is 3.13.0-8-generic
>  
>  
> Is this perhaps a known issue?

Possibly, that version is over 2 years old.

> Any hints?

Getting something a little more recent would be the best place to start

Thanks Andrew,

I tried to upgrade to 1.1.12 using the packages available at 
https://launchpad.net/~syseleven-platform . In the first attempt I upgraded a 
single node, to see how it works out, but I ended up with errors like

Could not establish cib_rw connection: Connection refused (111)

I have disabled the firewall, with no change. The node appears to be running 
but does not see any of the other nodes, and on the other nodes this node shows 
up as UNCLEAN. (I assume corosync is fine, but pacemaker is not.)
I use udpu for the transport.
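
(For completeness, "corosync is fine, but pacemaker not" is based on roughly 
these checks on the upgraded node:)

  # corosync layer: membership and quorum
  corosync-quorumtool -s
  corosync-cmapctl | grep members
  # pacemaker layer: are the daemons up, and can the CLI reach the CIB?
  ps axf | grep -E 'pacemakerd|cib|crmd'
  crm_mon -1    # a CIB-reading command; the kind of call that gives the error above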

Am I doing something wrong? I tried to look for some howtos on upgrade, but the 
only thing I found was the rather outdated   http://clusterlabs.org/wiki/Upgrade

Could you please direct me to some howto/guide on how to perform the upgrade?

Or am I facing some compatibility issue, meaning I should extract the whole 
CIB, upgrade all nodes and reconfigure the cluster from scratch? (The cluster 
is meant to go live in 2 days... :) )

Thanks a lot in advance




>  
> Thanks!
> ___
> Users mailing list: Users@clusterlabs.org 
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org 
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org Getting started: 
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Memory leak in crm_mon ?

2015-08-10 Thread Attila Megyeri
Hi!

We are building a new cluster on top of pacemaker/corosync and several times 
during the past days we noticed that "crm_mon -Af" used up all the memory+swap 
and caused high CPU usage. Killing the process solves the issue.

We are using the binary package versions available in the latest ubuntu trusty, 
namely:

crmsh                  1.2.5+hg1034-1ubuntu4
pacemaker              1.1.10+git20130802-1ubuntu2.3
pacemaker-cli-utils    1.1.10+git20130802-1ubuntu2.3
corosync               2.3.3-1ubuntu1

Kernel is 3.13.0-46-generic

Looking back at some "atop" data, the CPU went to 100% many times during the 
last couple of days, at various times, most often exactly at midnight (strange).

08.05 14:00
08.06 21:41
08.07 00:00
08.07 00:00
08.08 00:00
08.09 06:27

I checked the corosync log and syslog, but did not find any correlation in the 
log entries around the specific times.
For most of the time, the node running crm_mon was also the DC - not running 
any resources (i.e. a node kept only for quorum).


We have another running system where everything works perfectly, even though it 
is almost the same:

crmsh                  1.2.5+hg1034-1ubuntu4
pacemaker              1.1.10+git20130802-1ubuntu2.1
pacemaker-cli-utils    1.1.10+git20130802-1ubuntu2.1
corosync               2.3.3-1ubuntu1

Kernel is 3.13.0-8-generic


Is this perhaps a known issue? Any hints?

Thanks!
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org