Re: [Linux-HA] Master Became Slave - Cluster unstable $$$
On 04/08/2014 12:18 AM, Ammar Sheikh Saleh wrote:
> yes i have the command ... its CentOS

Then please review the man page of crm_master and try to adjust the scores so that the master starts where you want it and the slave where you want it. Before you follow my general steps you could also ask again on the list about using crm_master from the command line on CentOS - I am not really sure it is exactly the same there.

1. Check the current promotion scores using the pengine:
       ptest -Ls | grep promo
   You should get a list of scores per master/slave resource and node.

2. Check the currently set crm_master score using crm_master:
       crm_master -q -G -N node -l reboot -r resource-INSIDE-masterslave

3. Adjust the master/promotion scores (this is the trickiest part):
       crm_master -v NEW_MASTER_VALUE -l reboot -r resource-INSIDE-masterslave

If you do not have constraints added by earlier bad operations, that might help the cluster to promote the preferred site. But my procedure comes without any warranty or further support, sorry.

Maloja01

On Mon, Apr 7, 2014 at 4:16 PM, Maloja01 maloj...@arcor.de wrote:
> On 04/07/2014 03:00 PM, Ammar Sheikh Saleh wrote:
>> thanks for your help ... can you guide me to the correct commands: I don't understand what rsc is in this command:
>>
>>     crm(live)node# attribute
>>     usage: attribute <node> set <rsc> <value>
>>            attribute <node> delete <rsc>
>>            attribute <node> show <rsc>
>>
>> how can I give a node a master attribute with a high score using the above?
>
> At SLES (SUSE) there is a command crm_master - do you have such a command?
>
>> cheers! Ammar
>>
>> On Mon, Apr 7, 2014 at 3:49 PM, Maloja01 maloj...@arcor.de wrote:
>>> On 04/07/2014 01:23 PM, Ammar Sheikh Saleh wrote:
>>>> thanks a million times for answering ...
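Filled in with the names from this thread, the three steps could look like the sketch below. The node and resource names are taken from the crm_mon output further down (SuperData is the primitive inside the SuperDataClone master/slave set); the score value 100 is only an example and must be checked against the output of step 1. This requires a live cluster, so treat it as a sketch, not a tested recipe:

```shell
# 1. List the current promotion scores per master/slave resource and node
ptest -Ls | grep promo

# 2. Read the crm_master score for the DRBD primitive (the resource
#    INSIDE the master/slave set, here SuperData) on the preferred node
crm_master -q -G -N lws1h1.mydomain.com -l reboot -r SuperData

# 3. Raise the promotion score on the node that should become master
#    (100 is an example value; it must exceed the other node's score)
crm_master -v 100 -N lws1h1.mydomain.com -l reboot -r SuperData
```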
I included all my software versions / OS details in the thread (http://linux-ha.996297.n3.nabble.com/Master-Became-Slave-Cluster-unstable-td15583.html) but here they are again. These are the SW versions:

corosync-2.3.0-1.el6.x86_64
drbd84-utils-8.4.2-1.el6.elrepo.x86_64
pacemaker-1.1.8-1.el6.x86_64
OS: CentOS 6.4 x64

Ah sorry, then I can't tell how stonith is working, as RH has a completely different setup beneath pacemaker. I do not know how they implement fencing. Sorry - but however, best regards, F.Maloja

I need to correct something: the setup is 2 nodes for HA and a third one for quorum only ... also the config changed a little bit (attached). Looking at the config right now, I see a suspicious line:

    rule role=Master score=-INFINITY id=drbd-fence-by-handler-r0-rule-SuperDataClone expression attribute=#uname operation=ne value=lws1h1.npario.com id=drbd-fence-by-handler-r0-expr-SuperDataClone

It might be the reason why services are not starting on the first node (not 100% sure). I have a feeling your answer is the right one, but I don't know the correct commands to do it:

1. I need to put Node1 back as Master (currently it is a slave).
2. Remove any constraints or preferred locations, and also remove any special attributes on Node1 that are making it a slave.

What do you think? How can I do these? What are the commands?

cheers! Ammar

On Mon, Apr 7, 2014 at 2:09 PM, Maloja01 maloj...@arcor.de wrote:
> hi ammar, first we need to check:
> a) which OS (which Linux dist are you using)?
> b) which cluster version/packages do you have installed?
> c) what does your cluster config look like?
>
> As my tip with crm_master/removing client-prefer rules might only help in some combinations of a-b-c, I need that info. Second, I need to say that free help means I will not give any warranty whatsoever that your cluster gets better afterward. That could only be done by cost-intensive consulting.
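The suspicious rule quoted above is the location constraint that DRBD's crm-fence-peer.sh handler leaves behind; normally the after-resync handler removes it again. If it is stale, removing it by hand could look like this sketch (the constraint id must be taken from the actual `crm configure show` output; the id below is inferred from the rule id in the thread and may differ):

```shell
# Find the fence constraint's exact id in the live config
crm configure show | grep drbd-fence

# Remove the stale constraint left behind by the DRBD fence handler
# (id inferred from the rule shown above - verify before deleting)
crm configure delete drbd-fence-by-handler-r0-SuperDataClone

# Clear any failed-operation history that may also block placement
crm_resource --cleanup --resource SuperDataClone
```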
regards f.maloja

On 04/07/2014 12:10 PM, aali...@gmail.com wrote:
> hi Maloja01, could you please help me with these steps / tell me what the commands look like:
>
> - You should use crm_master to score the master placement
> - You should remove all client-prefer location rules which you added by your experiments using the GUI
>
> I don't want to make this worse ... production is down here, and I am desperately in need of any help.
>
> Thank you, Ammar

_ Sent from http://linux-ha.996297.n3.nabble.com
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
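For the second bullet, the location rules that GUI or `crm resource migrate` experiments leave behind usually have ids starting with `cli-prefer-` or `cli-standby-`. Listing and deleting them could look like this (the id in the delete command is an example, not one confirmed from this cluster's config):

```shell
# List all location constraints; look for cli-prefer-* / cli-standby-* ids
crm configure show | grep -B1 location

# Delete an unwanted preference rule by its id (example id shown)
crm configure delete cli-prefer-SuperMetaService
```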
Re: [Linux-HA] Master Became Slave - Cluster unstable $$$
FIRST you need to set up fencing (STONITH) - I do not see any stonith resource in your cluster, and that WILL be a problem for your cluster. Also, you cannot simply migrate a Master/Slave resource: you should use crm_master to score the master placement. And you should remove all client-prefer location rules which you added by your experiments using the GUI; they might hurt the cluster in the future... AND, as already written in this thread, you must tell the cluster to ignore quorum if you really only have a two-node cluster.

FMaloja

On 04/07/2014 02:31 AM, aalishe wrote:
> tell me what information is still needed to be sure / certain of solving the problem? I can provide it all. thanks for your time
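The two cluster properties named above, plus a stonith resource, could be configured roughly as follows. The `external/ipmi` agent and all of its parameters (addresses, credentials) are placeholders - substitute whatever fence hardware the nodes actually have:

```shell
# Two-node cluster: keep running when quorum is lost
crm configure property no-quorum-policy=ignore

# A stonith device per node (agent and params are placeholders)
crm configure primitive fence-lws1h1 stonith:external/ipmi \
    params hostname=lws1h1.mydomain.com ipaddr=10.100.0.1 userid=admin passwd=secret \
    op monitor interval=60s

# A node should never run its own fence device
crm configure location fence-lws1h1-not-on-self fence-lws1h1 -inf: lws1h1.mydomain.com

# Only enable stonith after the devices are defined and tested
crm configure property stonith-enabled=true
```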
[Linux-HA] Master Became Slave - Cluster unstable $$$
Hi all, I am new to corosync/pacemaker and I have a 2-node production cluster with (corosync+pacemaker+drbd):

Node1 = lws1h1.mydomain.com
Node2 = lws1h2.mydomain.com

They are in an online/online failover setup. Services run only where DRBD resides ... the other node stays online to take over if Node1 fails.

These are the SW versions:

corosync-2.3.0-1.el6.x86_64
drbd84-utils-8.4.2-1.el6.elrepo.x86_64
pacemaker-1.1.8-1.el6.x86_64
OS: CentOS 6.4 x64

The cluster is configured with Quorum (not sure what that is). A few days ago I placed one of the nodes in maintenance mode after services were going bad due to a problem. I don't remember the details of how I moved/migrated the resources, but I usually use the LCMC GUI tool. I also restarted corosync / pacemaker in random ways :$

After that, Node1 became slave and Node2 became master! Services are now sticking on Node2 and I can't migrate them to Node1 even by force (tried command-line tools and the LCMC tool).

more details/outputs:

*### Start ###*
[aalishe@lws1h1 ~]$ sudo crm_mon -Afro
Last updated: Sun Apr 6 15:25:52 2014
Last change: Sun Apr 6 14:16:15 2014 via crm_resource on lws1h2.mydomain.com
Stack: corosync
Current DC: lws1h2.mydomain.com (2) - partition with quorum
Version: 1.1.8-1.el6-394e906
2 Nodes configured, unknown expected votes
10 Resources configured.

Online: [ lws1h1.mydomain.com lws1h2.mydomain.com ]

Full list of resources:

 Resource Group: SuperMetaService
     SuperFloatIP (ocf::heartbeat:IPaddr2): Started lws1h2.mydomain.com
     SuperFs1 (ocf::heartbeat:Filesystem): Started lws1h2.mydomain.com
     SuperFs2 (ocf::heartbeat:Filesystem): Started lws1h2.mydomain.com
     SuperFs3 (ocf::heartbeat:Filesystem): Started lws1h2.mydomain.com
     SuperFs4 (ocf::heartbeat:Filesystem): Started lws1h2.mydomain.com
 Master/Slave Set: SuperDataClone [SuperData]
     Masters: [ lws1h2.mydomain.com ]
     Slaves: [ lws1h1.mydomain.com ]
 SuperMetaSQL (ocf::mydomain:pgsql): Started lws1h2.mydomain.com
 SuperGTS (ocf::mydomain:mmon): Started lws1h2.mydomain.com
 SuperCQP (ocf::mydomain:mmon): Started lws1h2.mydomain.com

Node Attributes:
* Node lws1h1.mydomain.com:
* Node lws1h2.mydomain.com:
    + master-SuperData : 1

Operations:
* Node lws1h2.mydomain.com:
   SuperFs1: migration-threshold=100
    + (1241) start: rc=0 (ok)
   SuperMetaSQL: migration-threshold=100
    + (1254) start: rc=0 (ok)
    + (1257) monitor: interval=3ms rc=0 (ok)
   SuperFloatIP: migration-threshold=100
    + (1236) start: rc=0 (ok)
    + (1239) monitor: interval=3ms rc=0 (ok)
   SuperData:0: migration-threshold=100
    + (957) probe: rc=0 (ok)
    + (1230) promot
*### End ###*

CRM Configuration

*### Start ###*
[aalishe@lws1h1 ~]$ sudo crm configure show
node $id=1 lws1h1.mydomain.com \
        attributes standby=off
node $id=2 lws1h2.mydomain.com \
        attributes standby=off
primitive SuperCQP ocf:mydomain:mmon \
        params mmond=/opt/mydomain/platform/bin/ cfgfile=/opt/mydomain/platform/etc/mmon_mydomain_cqp.xml pidfile=/opt/mydomain/platform/var/run/mmon_mydomain_cqp.pid user=mydomainsvc db=bigdata dbport=5434 \
        operations $id=SuperCQP-operations \
        op start interval=0 timeout=120 \
        op stop interval=0 timeout=120 \
        op monitor interval=120 timeout=120 start-delay=0 \
        meta target-role=started is-managed=true
primitive SuperData ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=60s \
        meta target-role=started
primitive SuperFloatIP ocf:heartbeat:IPaddr2 \
        params ip=10.100.0.225 cidr_netmask=24 \
        op monitor interval=30s \
        meta target-role=started
primitive SuperFs1 ocf:heartbeat:Filesystem \
        params device=/dev/drbd1 directory=/mnt/drbd1 fstype=ext4 \
        meta target-role=started
primitive SuperFs2 ocf:heartbeat:Filesystem \
        params device=/dev/drbd2 directory=/mnt/drbd2 fstype=ext4 \
        meta target-role=started
primitive SuperFs3 ocf:heartbeat:Filesystem \
        params device=/dev/drbd3 directory=/mnt/drbd3 fstype=ext4
primitive SuperFs4 ocf:heartbeat:Filesystem \
        params device=/dev/drbd4 directory=/mnt/drbd4 fstype=ext4 \
        meta target-role=started
primitive SuperGTS ocf:mydomain:mmon \
        params mmond=/opt/mydomain/platform/bin/ cfgfile=/opt/mydomain/platform/etc/mmon_mydomain_gts.xml pidfile=/opt/mydomain/platform/var/run/mmon_mydomain_gts.pid user=mydomainsvc db=bigdata \
        operations $id=SuperGTS-operations \
        op start interval=0 timeout=120 \
        op stop interval=0 timeout=120 \
        op monitor interval=120 timeout=120 start-delay=0 \
        meta target-role=started is-managed=true
primitive SuperMetaSQL
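The configuration dump above is cut off before the master/slave and constraint definitions, but a DRBD-backed setup like this normally ends with definitions along the following lines. This is a generic crmsh sketch based on the resource names visible in the crm_mon output, not the poster's actual config:

```
ms SuperDataClone SuperData \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
colocation SuperMetaService-with-drbd-master inf: SuperMetaService SuperDataClone:Master
order SuperMetaService-after-drbd-promote inf: SuperDataClone:promote SuperMetaService:start
```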
Re: [Linux-HA] Master Became Slave - Cluster unstable $$$
I can't speak to your specific problem, but I can say for certain that you need to disable quorum [1] and enable stonith (also called fencing) [2]. Once stonith is configured (and tested) in pacemaker, be sure to set up fencing in DRBD using the 'crm-fence-peer.sh' fence handler [3].

digimer

1. https://alteeve.ca/w/Quorum
2. https://alteeve.ca/w/AN!Cluster_Tutorial_2#Concept.3B_Fencing
3. https://alteeve.ca/w/AN!Cluster_Tutorial_2#Configuring_DRBD_Global_and_Common_Options (replace 'rhcs_fence' with 'crm-fence-peer.sh')

On 06/04/14 08:02 PM, aalishe wrote:
> Hi all, I am new to corosync/pacemaker and I have a 2 node production cluster with (corosync+pacemaker+drbd) ...
> [...]
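The fence handler digimer points to is wired into the DRBD resource definition. With DRBD 8.4 it looks roughly like this, using the resource name r0 from the thread (handler paths are the usual package locations; verify them on the installed system):

```
resource r0 {
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

With `fencing resource-only`, DRBD calls the fence-peer handler on replication-link loss; the handler places a -INFINITY location constraint in the CIB (like the drbd-fence-by-handler rule seen earlier in this thread), and the after-resync-target handler removes it once the peer is back in sync.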
Re: [Linux-HA] Master Became Slave - Cluster unstable $$$
tell me what information is still needed to be sure / certain of solving the problem? I can provide it all. thanks for your time

--
View this message in context: http://linux-ha.996297.n3.nabble.com/Master-Became-Slave-Cluster-unstable-tp15583p15585.html
Sent from the Linux-HA mailing list archive at Nabble.com.