Hi, as 1.1.17 received a lot of care in pcmk_remote, I decided to try it again in a rather big setup (smaller than the previous one, so I'm not hit by IPC disconnects here).
From the first runs there are still some severe issues when cluster nodes are fenced. The following results were obtained by killing the DC node (md12k-3-srv), which was hosting remote resources for nodes es7700-3-srv, es12kxe-3-srv and es12kxe-7-srv. After the fence, the new DC (md12k-1-srv) moved those resources the following way:

=====
Sep 20 08:53:28 md12k-1-srv pengine[2525]: notice: Move es12kxe-3-srv (Started md12k-3-srv -> md12k-4-srv)
Sep 20 08:53:28 md12k-1-srv pengine[2525]: notice: Move es12kxe-7-srv (Started md12k-3-srv -> md12k-1-srv)
Sep 20 08:53:28 md12k-1-srv pengine[2525]: notice: Move es7700-3-srv (Started md12k-3-srv -> md12k-2-srv)
=====

* The first issue I found (and I expect it to be the cause of some of the other issues) is that pacemaker_remote does not drop the old crmd's connection after a new crmd connects. Because the IPC proxy connections are kept in a hash table, there is a 50% chance that remoted tries to reach the old crmd, e.g. to proxy checks of node attributes when resources are reprobed. That leads to timeouts of those resources' probes, with the consequent reaction from the cluster. A solution here could be to drop the old IPC proxy connection as soon as a new one is established.

* I _suspect_ the issue above could also explain the following lines in the cluster node logs. I didn't check, but I suspect that when remoted decides to disconnect the old connection after the fenced node comes back up and TCP connections are reset, it disconnects the new one too. At least, this issue happens at the same time the fenced node rejoins the cluster. These logs are for the case where no resources operate on node attributes (I removed those resources from the CIB and set a stickiness for all the others).

=====
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Timed out (10000 ms) while waiting for remote data
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Unable to receive expected reply, disconnecting.
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Remote lrmd server disconnected while waiting for reply with id 9823.
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Unexpected disconnect on remote-node es12kxe-7-srv
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Result of monitor operation for es12kxe-7-srv on md12k-1-srv: Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Couldn't perform lrmd_rsc_info operation (timeout=0): -107: Success (0)
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: LRMD disconnected
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Could not add resource ost0033-es03a to LRM es12kxe-7-srv
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Invalid resource definition for ost0033-es03a
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.13" subt="request" reference="lrm_invoke-tengine-1505897731-3872" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="md12k-1-srv" acl_target="hacluster" crm_user="hacluster">
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input <crm_xml>
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input <rsc_op id="221" operation="monitor" operation_key="ost0033-es03a_monitor_0" on_node="es12kxe-7-srv" on_node_uuid="es12kxe-7-srv" router_node="md12k-1-srv" transition-key="221:23:7:7df0b5ef-5bfa-4d21-82e3-4dc964d6cffd">
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input <primitive id="ost0033-es03a" class="ocf" provider="ddn" type="lustre-server"/>
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input <attributes CRM_meta_on_node="es12kxe-7-srv" CRM_meta_on_node_uuid="es12kxe-7-srv" CRM_meta_op_target_rc="7" CRM_meta_record_pending="false" CRM_meta_timeout="300000" crm_feature_set="3.0.13" device="/dev/ddn/es03a_ost0033" directory="/lustre/es03a/ost0033" lustre_resource_type="ost" manage_directory="1"/>
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input </rsc_op>
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input </crm_xml>
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input </create_request_adv>
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Resource ost0033-es03a no longer exists in the lrmd
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 221 (ost0033-es03a_monitor_0) on es12kxe-7-srv failed (target: 7 vs. rc: 6): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: notice: Transition aborted by lrm_rsc_op.ost0033-es03a_last_failure_0: Event failed
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 221 (ost0033-es03a_monitor_0) on es12kxe-7-srv failed (target: 7 vs. rc: 6): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Result of probe operation for ost0033-es03a on es12kxe-7-srv: Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: notice: Transition aborted by operation es12kxe-3-srv_monitor_30000 'create' on md12k-4-srv: Old event
...
Sep 20 08:56:41 md12k-1-srv attrd[2524]: notice: Node md12k-3-srv state is now member
Sep 20 08:56:41 md12k-1-srv cib[2511]: notice: Node md12k-3-srv state is now member
Sep 20 08:56:41 md12k-1-srv pacemakerd[2398]: notice: Node md12k-3-srv state is now member
Sep 20 08:56:41 md12k-1-srv crmd[11375]: notice: Node md12k-3-srv state is now member
Sep 20 08:56:41 md12k-1-srv stonith-ng[2522]: notice: Node md12k-3-srv state is now member
Sep 20 08:56:41 md12k-1-srv crmd[11375]: warning: No reason to expect node 2 to be down
Sep 20 08:56:41 md12k-1-srv crmd[11375]: notice: Stonith/shutdown of md12k-3-srv not matched
=====

I cannot check whether that is true until the first issue is fixed.

* There are repeated probe results with rc 6 (PCMK_OCF_NOT_CONFIGURED) and 189 (PCMK_OCF_CONNECTION_DIED) when probes run on remote nodes after their previous *home* node rejoins. There are also some with rc 1 (PCMK_OCF_UNKNOWN_ERROR):

=====
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 221 (ost0033-es03a_monitor_0) on es12kxe-7-srv failed (target: 7 vs. rc: 6): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 221 (ost0033-es03a_monitor_0) on es12kxe-7-srv failed (target: 7 vs. rc: 6): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 194 (ost0033-es02a_monitor_0) on es12kxe-3-srv failed (target: 7 vs. rc: 6): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 194 (ost0033-es02a_monitor_0) on es12kxe-3-srv failed (target: 7 vs. rc: 6): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 195 (ost0033-es03a_monitor_0) on es12kxe-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 195 (ost0033-es03a_monitor_0) on es12kxe-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 196 (ost0033-es04a_monitor_0) on es12kxe-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 196 (ost0033-es04a_monitor_0) on es12kxe-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 246 (ost0033-es04a_monitor_0) on es7700-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 246 (ost0033-es04a_monitor_0) on es7700-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 247 (ost0034-es01a_monitor_0) on es7700-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 247 (ost0034-es01a_monitor_0) on es7700-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 245 (ost0033-es03a_monitor_0) on es7700-3-srv failed (target: 7 vs. rc: 1): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 245 (ost0033-es03a_monitor_0) on es7700-3-srv failed (target: 7 vs. rc: 1): Error
=====

The rc=6 ones look absolutely incorrect. The rc=1 one comes from a probe op that finished with rc=7 (NOT_RUNNING). The rc=189 ones seem to be expected.
Hope this info has some value.

Best regards,
Vladislav
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org