Hi all,

I am new to corosync/pacemaker, and I have a 2-node "production" cluster
running corosync + pacemaker + DRBD.

Node1 = lws1h1.mydomain.com
Node2 = lws1h2.mydomain.com

They are in an online/online failover setup: services run only on the node
where DRBD is Primary, while the other node stays online to take over if
Node1 fails.

These are the software versions:
corosync-2.3.0-1.el6.x86_64
drbd84-utils-8.4.2-1.el6.elrepo.x86_64
pacemaker-1.1.8-1.el6.x86_64
OS:  CentOS 6.4 x64bit

The cluster is configured with quorum (I am not sure exactly what that means).
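
From what I have read, quorum is corosync's vote counting that decides
whether a partition of the cluster may run resources. If I understand
correctly, in corosync 2.x it is configured in /etc/corosync/corosync.conf
with something like the snippet below (my guess; I have not checked my own
file yet, and it may relate to the "unknown expected votes" line in the
crm_mon output further down):

quorum {
        provider: corosync_votequorum
        two_node: 1
}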

A few days ago I placed one of the nodes in maintenance mode after services
started going bad due to a problem. I don't remember the details of how I
moved/migrated the resources, but I usually use the LCMC GUI tool. I also
restarted corosync/pacemaker a few times, in a rather random order  :$

After that, Node1 became slave and Node2 became master!

The services are now stuck on Node2, and I cannot migrate them back to
Node1 even by force (I tried both the command-line tools and LCMC, roughly
as sketched below).
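
The command-line attempts were along these lines (via the crm shell; the
exact crm_resource invocations are in the outputs further down):

$ sudo crm resource migrate SuperMetaService lws1h1.mydomain.com
$ sudo crm resource unmigrate SuperMetaService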


More details/outputs:


*####################### Start ###############################*
[aalishe@lws1h1 ~]$ sudo crm_mon -Afro
Last updated: Sun Apr  6 15:25:52 2014
Last change: Sun Apr  6 14:16:15 2014 via crm_resource on
lws1h2.mydomain.com
Stack: corosync
Current DC: lws1h2.mydomain.com (2) - partition with quorum
Version: 1.1.8-1.el6-394e906
2 Nodes configured, unknown expected votes
10 Resources configured.


Online: [ lws1h1.mydomain.com lws1h2.mydomain.com ]

Full list of resources:

 Resource Group: SuperMetaService
     SuperFloatIP       (ocf::heartbeat:IPaddr2):       Started lws1h2.mydomain.com
     SuperFs1   (ocf::heartbeat:Filesystem):    Started lws1h2.mydomain.com
     SuperFs2   (ocf::heartbeat:Filesystem):    Started lws1h2.mydomain.com
     SuperFs3   (ocf::heartbeat:Filesystem):    Started lws1h2.mydomain.com
     SuperFs4   (ocf::heartbeat:Filesystem):    Started lws1h2.mydomain.com
 Master/Slave Set: SuperDataClone [SuperData]
     Masters: [ lws1h2.mydomain.com ]
     Slaves: [ lws1h1.mydomain.com ]
SuperMetaSQL    (ocf::mydomain:pgsql):    Started lws1h2.mydomain.com
SuperGTS        (ocf::mydomain:mmon):     Started lws1h2.mydomain.com
SuperCQP        (ocf::mydomain:mmon):     Started lws1h2.mydomain.com

Node Attributes:
* Node lws1h1.mydomain.com:
* Node lws1h2.mydomain.com:
    + master-SuperData                  : 10000

Operations:
* Node lws1h2.mydomain.com: 
   SuperFs1: migration-threshold=1000000
    + (1241) start: rc=0 (ok)
   SuperMetaSQL: migration-threshold=1000000
    + (1254) start: rc=0 (ok)
    + (1257) monitor: interval=30000ms rc=0 (ok)
   SuperFloatIP: migration-threshold=1000000
    + (1236) start: rc=0 (ok)
    + (1239) monitor: interval=30000ms rc=0 (ok)
   SuperData:0: migration-threshold=1000000
    + (957) probe: rc=0 (ok)
    + (1230) promote

*########################### End ###########################*
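
In case it is relevant, this is how I check the DRBD side directly (both
commands should work with drbd 8.4, as far as I know):

$ sudo cat /proc/drbd
$ sudo drbdadm role r0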



CRM Configuration 
*####################### Start ###############################*
[aalishe@lws1h1 ~]$ sudo crm configure show
node $id="1" lws1h1.mydomain.com \
        attributes standby="off"
node $id="2" lws1h2.mydomain.com \
        attributes standby="off"
primitive SuperCQP ocf:mydomain:mmon \
        params mmond="/opt/mydomain/platform/bin/" cfgfile="/opt/mydomain/platform/etc/mmon_mydomain_cqp.xml" pidfile="/opt/mydomain/platform/var/run/mmon_mydomain_cqp.pid" user="mydomainsvc" db="bigdata" dbport="5434" \
        operations $id="SuperCQP-operations" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="120" \
        op monitor interval="120" timeout="120" start-delay="0" \
        meta target-role="started" is-managed="true"
primitive SuperData ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="60s" \
        meta target-role="started"
primitive SuperFloatIP ocf:heartbeat:IPaddr2 \
        params ip="10.100.0.225" cidr_netmask="24" \
        op monitor interval="30s" \
        meta target-role="started"
primitive SuperFs1 ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/mnt/drbd1" fstype="ext4" \
        meta target-role="started"
primitive SuperFs2 ocf:heartbeat:Filesystem \
        params device="/dev/drbd2" directory="/mnt/drbd2" fstype="ext4" \
        meta target-role="started"
primitive SuperFs3 ocf:heartbeat:Filesystem \
        params device="/dev/drbd3" directory="/mnt/drbd3" fstype="ext4"
primitive SuperFs4 ocf:heartbeat:Filesystem \
        params device="/dev/drbd4" directory="/mnt/drbd4" fstype="ext4" \
        meta target-role="started"
primitive SuperGTS ocf:mydomain:mmon \
        params mmond="/opt/mydomain/platform/bin/" cfgfile="/opt/mydomain/platform/etc/mmon_mydomain_gts.xml" pidfile="/opt/mydomain/platform/var/run/mmon_mydomain_gts.pid" user="mydomainsvc" db="bigdata" \
        operations $id="SuperGTS-operations" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="120" \
        op monitor interval="120" timeout="120" start-delay="0" \
        meta target-role="started" is-managed="true"
primitive SuperMetaSQL ocf:mydomain:pgsql \
        op monitor interval="30" timeout="30" depth="0" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="120" \
        params pgdata="/mnt/drbd1/pgsql/data" pgdb="bigdata" \
        meta target-role="Started" is-managed="true"
group SuperMetaService SuperFloatIP SuperFs1 SuperFs2 SuperFs3 SuperFs4 \
        meta target-role="Started"
ms SuperDataClone SuperData \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
colocation CqpOnGts inf: SuperCQP SuperGTS
colocation GtsOnMeta inf: SuperGTS SuperMetaSQL
colocation MetaSQLonData inf: SuperMetaSQL SuperDataClone:Master
colocation ServiceOnDrbd inf: SuperMetaService SuperDataClone:Master
order CqpAfterGts inf: SuperGTS:start SuperCQP
order GtsAfterMeta inf: SuperMetaSQL:start SuperGTS
order MetaAfterService inf: SuperMetaService:start SuperMetaSQL
order ServiceAfterDrbd inf: SuperDataClone:promote SuperMetaService:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.8-1.el6-394e906" \
        cluster-infrastructure="corosync" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1396817730" \
        maintenance-mode="false"
*######################## End ###########################*


Messages I saw the first time I migrated (or maybe un-migrated) the services:

*############################Start ######################*
  /usr/sbin/crm_resource -r SuperMetaService --migrate
WARNING: Creating rsc_location constraint 'cli-standby-SuperMetaService'
with a score of -INFINITY for resource SuperMetaService on
lws1h2.mydomain.com.
        This will prevent SuperMetaService from running on lws1h2.mydomain.com
until the constraint is removed using the 'crm_resource -U' command or
manually with cibadmin
        This will be the case even if lws1h2.mydomain.com is the last node in the cluster
        This message can be disabled with -Q

[aali...@lws1h2.mydomain.com:~#]  /usr/sbin/cibadmin --obj_type constraints
-C -X '<rsc_location id="cli-standby-SuperDataClone"
rsc="SuperDataClone"><rule id="cli-standby-SuperDataClone-rule"
score="-INFINITY" role="Master"><expression attribute="#uname"
id="cli-standby-SuperDataClone-expression" operation="eq"
value="lws1h2.mydomain.com"/></rule></rsc_location>'

[aali...@lws1h2.mydomain.com:~#]  /usr/sbin/crm_resource -r SuperMetaSQL
--migrate
Resource SuperMetaSQL not moved: not-active and no preferred location
specified.
Error performing operation: Invalid argument
*###############################End#############################*
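
A few things I am unsure about from the session above, where I would
appreciate a sanity check:

1) The --migrate warning says the cli-standby constraint can be removed
with 'crm_resource -U'. Is clearing the leftover cli-* constraints the
right first step? My guess:

$ sudo crm configure show | grep cli-
$ sudo /usr/sbin/crm_resource -r SuperMetaService -U

2) The rsc_location constraint I created by hand with cibadmin bans the
Master role of SuperDataClone on lws1h2. Should I delete it again, e.g.
like this (matching by id)?

$ sudo /usr/sbin/cibadmin --delete -X '<rsc_location id="cli-standby-SuperDataClone"/>'

3) Regarding the "not-active" error: since SuperMetaSQL is colocated with
the Master of SuperDataClone, do I need to move the clone's master role (or
the whole SuperMetaService group) rather than the primitive itself?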


Errors from the system logs:

*############################# Start ###############################*
[aalishe@lws1h1 ~]$ sudo tail -n 300 /var/log/messages  | grep -E
"error|warning"
Apr  6 14:16:14 lws1h1 cibmon[28024]:    error: crm_element_value: Couldn't
find src in NULL
Apr  6 14:16:14 lws1h1 cibmon[28024]:    error: crm_element_value: Couldn't
find src in NULL
Apr  6 14:16:15 lws1h1 cibmon[28024]:    error: crm_element_value: Couldn't
find src in NULL
Apr  6 14:16:15 lws1h1 cibmon[28024]:    error: crm_element_value: Couldn't
find src in NULL
Apr  6 14:16:15 lws1h1 kernel: crm_simulate[29878]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fffd9cff2b0 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr  6 14:16:15 lws1h1 kernel: crm_simulate[29886]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fff7f89d140 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr  6 14:16:15 lws1h1 kernel: crm_simulate[29893]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fffef934290 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr  6 14:16:15 lws1h1 kernel: crm_simulate[29900]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fff7d4e0d90 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr  6 14:16:51 lws1h1 cibmon[28024]:    error: crm_element_value: Couldn't
find src in NULL
Apr  6 14:16:51 lws1h1 kernel: crm_simulate[31902]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fff36787850 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr  6 14:16:51 lws1h1 kernel: crm_simulate[31909]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fff45ed4830 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr  6 14:16:52 lws1h1 cibmon[28024]:    error: crm_element_value: Couldn't
find src in NULL
Apr  6 14:16:52 lws1h1 kernel: crm_simulate[31917]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fffe68127a0 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr  6 14:16:52 lws1h1 kernel: crm_simulate[31924]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fff80548b20 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr  6 14:16:55 lws1h1 cibmon[28024]:    error: crm_element_value: Couldn't
find src in NULL
Apr  6 14:16:56 lws1h1 kernel: crm_simulate[31958]: segfault at 1d4c0 ip
0000003a2284812c sp 00007ffff103d430 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr  6 14:16:56 lws1h1 kernel: crm_simulate[31965]: segfault at 1d4c0 ip
0000003a2284812c sp 00007fffcb795e30 error 4 in
libc-2.12.so[3a22800000+18b000]
Apr  6 14:16:57 lws1h1 cibmon[28024]:    error: crm_element_value: Couldn't
find src in NULL
Apr  6 16:12:37 lws1h1 crm_simulate[29796]:    error: crm_xml_err: XML
Error: I/O warning : failed to load external entity
"/tmp/lcmc-test-387474fb-b2c3-4c7c-96ed-f207635579e2.xml"
Apr  6 16:12:37 lws1h1 crm_simulate[29796]:    error: filename2xml: Parsing
failed (domain=8, level=1, code=1549): failed to load external entity
"/tmp/lcmc-test-387474fb-b2c3-4c7c-96ed-f207635579e2.xml"
Apr  6 16:12:37 lws1h1 crm_simulate[29796]:    error: filename2xml: Couldn't
parse /tmp/lcmc-test-387474fb-b2c3-4c7c-96ed-f207635579e2.xml
Apr  6 16:12:37 lws1h1 crm_simulate[29796]:    error: crm_abort:
xpath_search: Triggered assert at xml.c:2742 : xml_top != NULL
Apr  6 16:12:37 lws1h1 crm_simulate[29796]:    error: crm_element_value:
Couldn't find validate-with in NULL
Apr  6 16:12:37 lws1h1 crm_simulate[29796]:    error: crm_abort:
update_validation: Triggered assert at xml.c:2586 : *xml_blob != NULL
Apr  6 16:12:37 lws1h1 crm_simulate[29796]:    error: crm_element_value:
Couldn't find validate-with in NULL
Apr  6 16:21:24 lws1h1 attrd[32458]:  warning: attrd_cib_callback: Update
fail-count-SuperCQP=(null) failed: Transport endpoint is not connected
Apr  6 16:21:24 lws1h1 attrd[32458]:  warning: attrd_cib_callback: Update
last-failure-SuperCQP=(null) failed: Transport endpoint is not connected
Apr  6 16:26:11 lws1h1 attrd[32458]:  warning: attrd_cib_callback: Update
fail-count-SuperCQP=(null) failed: Transport endpoint is not connected
Apr  6 16:26:11 lws1h1 crmd[15595]:  warning: decode_transition_key: Bad
UUID (crm-resource-2879) in sscanf result (3) for 0:0:crm-resource-2879
Apr  6 16:26:11 lws1h1 crmd[15595]:    error: send_msg_via_ipc: Unknown
Sub-system (2879_crm_resource)... discarding message.
Apr  6 16:38:59 lws1h1 attrd[32458]:  warning: attrd_cib_callback: Update
fail-count-SuperGTS=(null) failed: Transport endpoint is not connected
Apr  6 16:38:59 lws1h1 attrd[32458]:  warning: attrd_cib_callback: Update
last-failure-SuperGTS=(null) failed: Transport endpoint is not connected
*#####################################End###########################*



Thanks in advance for your help.

** Please note that I will reward whoever first helps me solve this
issue. **


