On 07/18/2016 05:51 PM, Martin Schlegel wrote:
> Hello all,
>
> I cannot wrap my brain around what's going on here ... any help would
> prevent me from fencing my brain =:-D
>
> Problem:
>
> When completely network-isolating a node, e.g. pg1, sometimes a different
> node gets fenced instead, e.g. pg3. In that case I see a syslog message
> like this one, indicating the wrong stonith device was used:
>
>   stonith-ng[4650]: notice: Operation 'poweroff' [6216] (call 2 from
>   crmd.4654) for host 'pg1' with device 'p_ston_pg3' returned: 0 (OK)
>
> I had assumed that only the stonith resource p_ston_pg1 had hostname=pg1
> and was therefore the only resource eligible to fence pg1!
>
> Why would it use p_ston_pg3 then?
>
> Configuration summary - more details and logs below:
>
> * 3x nodes: pg1, pg2 and pg3
> * 3x stonith resources: p_ston_pg1, p_ston_pg2 and p_ston_pg3 - one for
>   each node
> * symmetric-cluster=false (!) - please see the location constraints
>   l_pgs_resources and l_ston_pg1, l_ston_pg2 & l_ston_pg3 further below
> * We rely on /etc/hosts to resolve pg1, pg2 and pg3 for corosync - the
>   actual hostnames are completely different
> * We rely on the option "hostname" of stonith:external/ipmi to specify
>   the name of the host to be managed by the defined STONITH device
>
> The stonith registration looks wrong to me (?). I expected a single
> stonith device to be registered per host - yet per the crm_mon output
> below, only one p_ston_pgX resource gets started per host (!):
>
> root@test123:~# for node in pg{1..3} ; do ssh $node stonith_admin -L ; done
> Warning: Permanently added 'pg1,10.148.128.28' (ECDSA) to the list of known hosts.
> 2 devices found
>  p_ston_pg3
>  p_ston_pg2
> Warning: Permanently added 'pg2,10.148.128.7' (ECDSA) to the list of known hosts.
> 2 devices found
>  p_ston_pg3
>  p_ston_pg1
> Warning: Permanently added 'pg3,10.148.128.37' (ECDSA) to the list of known hosts.
> 2 devices found
>  p_ston_pg1
>  p_ston_pg2
>
> ... and for host pg1 (the same holds for pg2 and pg3), 2 devices are
> found that can fence pg1 - I would expect only one device to show up:
>
> root@test123:~# for node in pg{1..3} ; do ssh $node stonith_admin -l pg1 ; done
> Warning: Permanently added 'pg1,10.148.128.28' (ECDSA) to the list of known hosts.
> 2 devices found
>  p_ston_pg3
>  p_ston_pg2
> Warning: Permanently added 'pg2,10.148.128.7' (ECDSA) to the list of known hosts.
> 2 devices found
>  p_ston_pg1
>  p_ston_pg3
> Warning: Permanently added 'pg3,10.148.128.37' (ECDSA) to the list of known hosts.
> 2 devices found
>  p_ston_pg1
>  p_ston_pg2
>
> crm_mon output:
>
> root@test123:~# crm_mon -1
> Last updated: Mon Jul 18 22:45:00 2016    Last change: Mon Jul 18 20:52:14 2016 by root via cibadmin on pg2
> Stack: corosync
> Current DC: pg1 (version 1.1.14-70404b0) - partition with quorum
> 3 nodes and 25 resources configured
>
> Online: [ pg1 pg2 pg3 ]
>
>  p_ston_pg1 (stonith:external/ipmi): Started pg2
>  p_ston_pg2 (stonith:external/ipmi): Started pg3
>  p_ston_pg3 (stonith:external/ipmi): Started pg1
>
> Configuration:
>
> [...]
> primitive p_ston_pg1 stonith:external/ipmi \
>     params hostname=pg1 ipaddr=10.148.128.35 userid=root \
>       passwd="/var/vcap/data/packages/pacemaker/ra-tmp/stonith/PG1-ipmipass" \
>       passwd_method=file interface=lan priv=OPERATOR
>
> primitive p_ston_pg2 stonith:external/ipmi \
>     params hostname=pg2 ipaddr=10.148.128.19 userid=root \
>       passwd="/var/vcap/data/packages/pacemaker/ra-tmp/stonith/PG2-ipmipass" \
>       passwd_method=file interface=lan priv=OPERATOR
>
> primitive p_ston_pg3 stonith:external/ipmi \
>     params hostname=pg3 ipaddr=10.148.128.59 userid=root \
>       passwd="/var/vcap/data/packages/pacemaker/ra-tmp/stonith/PG3-ipmipass" \
>       passwd_method=file interface=lan priv=OPERATOR
>
> location l_pgs_resources { otherstuff p_ston_pg1 p_ston_pg2 p_ston_pg3 } \
>     resource-discovery=exclusive \
>     rule #uname eq pg1 \
>     rule #uname eq pg2 \
>     rule #uname eq pg3
>
> location l_ston_pg1 p_ston_pg1 -inf: pg1
> location l_ston_pg2 p_ston_pg2 -inf: pg2
> location l_ston_pg3 p_ston_pg3 -inf: pg3
These constraints prevent each device from running on its intended target,
but they don't limit which nodes each device can fence. For that, each
device needs a pcmk_host_list or pcmk_host_map entry, for example:

  primitive p_ston_pg1 ... pcmk_host_map=pg1:pg1.ipmi.example.com

Use pcmk_host_list if the fence device needs the node name exactly as the
cluster knows it, and pcmk_host_map if you need to translate a cluster node
name into a name or address the device understands. (A fuller sketch of the
corrected primitives follows at the end of this message.)

> [...]
>
> property cib-bootstrap-options: \
>     have-watchdog=false \
>     dc-version=1.1.14-70404b0 \
>     cluster-infrastructure=corosync \
>     symmetric-cluster=false \
>     stonith-enabled=true \
>     no-quorum-policy=stop \
>     start-failure-is-fatal=false \
>     stonith-action=poweroff \
>     node-health-strategy=migrate-on-red \
>     last-lrm-refresh=1468855127
>
> rsc_defaults rsc-options: \
>     resource-stickiness=INFINITY \
>     migration-threshold=2
>
> pg2's /var/log/syslog:
>
> [...]
> Jul 18 19:20:53 localhost crmd[4654]: notice: Executing poweroff fencing operation (52) on pg1 (timeout=60000)
> Jul 18 19:20:53 localhost stonith-ng[4650]: notice: Client crmd.4654.909c34cb wants to fence (poweroff) 'pg1' with device '(any)'
> Jul 18 19:20:53 localhost stonith-ng[4650]: notice: Initiating remote operation poweroff for pg1: 4bc5bf9f-b180-49ad-b142-7f14f988687a (0)
> Jul 18 19:20:53 localhost crmd[4654]: notice: Initiating action 8: start p_ston_pg2_start_0 on pg3
> Jul 18 19:20:53 localhost crmd[4654]: notice: Initiating action 10: start p_ston_pg3_start_0 on pg2 (local)
> Jul 18 19:20:55 localhost crmd[4654]: notice: Operation p_ston_pg3_start_0: ok (node=pg2, call=56, rc=0, cib-update=56, confirmed=true)
> Jul 18 19:20:58 localhost stonith-ng[4650]: notice: Operation 'poweroff' [6216] (call 2 from crmd.4654) for host 'pg1' with device 'p_ston_pg3' returned: 0 (OK)
> Jul 18 19:20:58 localhost stonith-ng[4650]: notice: Operation poweroff of pg1 by pg2 for crmd.4654@pg2.4bc5bf9f: OK
> Jul 18 19:20:58 localhost crmd[4654]: notice: Stonith operation 2/52:0:0:577f46f1-b431-4b4d-9ed8-8a0918d791ce: OK (0)
> Jul 18 19:20:58 localhost crmd[4654]: notice: Peer pg1 was terminated (poweroff) by pg2 for pg2: OK (ref=4bc5bf9f-b180-49ad-b142-7f14f988687a) by client crmd.4654
> [...]
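As promised above, here is a minimal, untested sketch of one corrected
primitive. It only adds pcmk_host_list to the parameters you already
posted; everything else is unchanged:

  primitive p_ston_pg1 stonith:external/ipmi \
      params hostname=pg1 ipaddr=10.148.128.35 userid=root \
        passwd="/var/vcap/data/packages/pacemaker/ra-tmp/stonith/PG1-ipmipass" \
        passwd_method=file interface=lan priv=OPERATOR \
        pcmk_host_list=pg1

Do the same with pcmk_host_list=pg2 for p_ston_pg2 and pcmk_host_list=pg3
for p_ston_pg3. If a device needed a different name than the cluster node
name, you would use something like pcmk_host_map="pg1:<name-the-device-knows>"
instead (the right-hand side there is only a placeholder). Once that is in
place,

  stonith_admin -l pg1

run on any node should report exactly one device, p_ston_pg1, and fencing
of pg1 can no longer be attempted through p_ston_pg3.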