Hi, as 1.1.17 received a lot of care in pcmk_remote, I decided to try it again in a rather big setup (smaller than the previous one, so I'm not hit by IPC disconnects here).
From the first runs there are still some severe issues when cluster nodes are fenced. The following results were obtained by killing the DC node (md12k-3-srv), which was hosting remote resources for nodes es7700-3-srv, es12kxe-3-srv and es12kxe-7-srv. After the fence, the new DC (md12k-1-srv) moved those resources the following way:

=====
Sep 20 08:53:28 md12k-1-srv pengine[2525]: notice: Move es12kxe-3-srv (Started md12k-3-srv -> md12k-4-srv)
Sep 20 08:53:28 md12k-1-srv pengine[2525]: notice: Move es12kxe-7-srv (Started md12k-3-srv -> md12k-1-srv)
Sep 20 08:53:28 md12k-1-srv pengine[2525]: notice: Move es7700-3-srv (Started md12k-3-srv -> md12k-2-srv)
=====

* The first issue I found (and I expect it to be the cause of some of the other issues) is that pacemaker_remote does not drop the old crmd's connection after a new crmd connects. Because the IPC proxy connections are kept in a hash table, there is a 50% chance that remoted tries to reach the old crmd, e.g. to proxy checks of node attributes when resources are reprobed. That leads to timeouts of those resources' probes, with the consequent reaction from the cluster. A solution here could be to drop the old IPC proxy connection as soon as a new one is established.

* I _suspect_ the issue above could also explain the following lines in the cluster node logs. I didn't check, but I suspect that when remoted decides to disconnect the old connection after the fenced node comes back up and TCP connections are reset, it disconnects the new one too. At least, this issue happens at the same time the fenced node rejoins the cluster. These logs are for the case where no resources operate on node attributes (I removed those resources from the CIB and set a stickiness for all the others).

=====
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Timed out (10000 ms) while waiting for remote data
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Unable to receive expected reply, disconnecting.
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Remote lrmd server disconnected while waiting for reply with id 9823.
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Unexpected disconnect on remote-node es12kxe-7-srv
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Result of monitor operation for es12kxe-7-srv on md12k-1-srv: Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Couldn't perform lrmd_rsc_info operation (timeout=0): -107: Success (0)
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: LRMD disconnected
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Could not add resource ost0033-es03a to LRM es12kxe-7-srv
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Invalid resource definition for ost0033-es03a
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.13" subt="request" reference="lrm_invoke-tengine-1505897731-3872" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="md12k-1-srv" acl_target="hacluster" crm_user="hacluster">
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input <crm_xml>
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input <rsc_op id="221" operation="monitor" operation_key="ost0033-es03a_monitor_0" on_node="es12kxe-7-srv" on_node_uuid="es12kxe-7-srv" router_node="md12k-1-srv" transition-key="221:23:7:7df0b5ef-5bfa-4d21-82e3-4dc964d6cffd">
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input <primitive id="ost0033-es03a" class="ocf" provider="ddn" type="lustre-server"/>
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input <attributes CRM_meta_on_node="es12kxe-7-srv" CRM_meta_on_node_uuid="es12kxe-7-srv" CRM_meta_op_target_rc="7" CRM_meta_record_pending="false" CRM_meta_timeout="300000" crm_feature_set="3.0.13" device="/dev/ddn/es03a_ost0033" directory="/lustre/es03a/ost0033" lustre_resource_type="ost" manage_directory="1"/>
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input </rsc_op>
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input </crm_xml>
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: bad input </create_request_adv>
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Resource ost0033-es03a no longer exists in the lrmd
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 221 (ost0033-es03a_monitor_0) on es12kxe-7-srv failed (target: 7 vs. rc: 6): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: notice: Transition aborted by lrm_rsc_op.ost0033-es03a_last_failure_0: Event failed
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 221 (ost0033-es03a_monitor_0) on es12kxe-7-srv failed (target: 7 vs. rc: 6): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: error: Result of probe operation for ost0033-es03a on es12kxe-7-srv: Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: notice: Transition aborted by operation es12kxe-3-srv_monitor_30000 'create' on md12k-4-srv: Old event
...
Sep 20 08:56:41 md12k-1-srv attrd[2524]: notice: Node md12k-3-srv state is now member
Sep 20 08:56:41 md12k-1-srv cib[2511]: notice: Node md12k-3-srv state is now member
Sep 20 08:56:41 md12k-1-srv pacemakerd[2398]: notice: Node md12k-3-srv state is now member
Sep 20 08:56:41 md12k-1-srv crmd[11375]: notice: Node md12k-3-srv state is now member
Sep 20 08:56:41 md12k-1-srv stonith-ng[2522]: notice: Node md12k-3-srv state is now member
Sep 20 08:56:41 md12k-1-srv crmd[11375]: warning: No reason to expect node 2 to be down
Sep 20 08:56:41 md12k-1-srv crmd[11375]: notice: Stonith/shutdown of md12k-3-srv not matched
=====

I cannot check whether that is true until the first issue is fixed.

* There are repeated probe results with rc 6 (PCMK_OCF_NOT_CONFIGURED) and 189 (PCMK_OCF_CONNECTION_DIED) when probes run on remote nodes after their previous *home* node rejoins. There are also some with rc 1 (PCMK_OCF_UNKNOWN_ERROR):

=====
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 221 (ost0033-es03a_monitor_0) on es12kxe-7-srv failed (target: 7 vs. rc: 6): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 221 (ost0033-es03a_monitor_0) on es12kxe-7-srv failed (target: 7 vs. rc: 6): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 194 (ost0033-es02a_monitor_0) on es12kxe-3-srv failed (target: 7 vs. rc: 6): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 194 (ost0033-es02a_monitor_0) on es12kxe-3-srv failed (target: 7 vs. rc: 6): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 195 (ost0033-es03a_monitor_0) on es12kxe-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 195 (ost0033-es03a_monitor_0) on es12kxe-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 196 (ost0033-es04a_monitor_0) on es12kxe-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 196 (ost0033-es04a_monitor_0) on es12kxe-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 246 (ost0033-es04a_monitor_0) on es7700-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 246 (ost0033-es04a_monitor_0) on es7700-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 247 (ost0034-es01a_monitor_0) on es7700-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 247 (ost0034-es01a_monitor_0) on es7700-3-srv failed (target: 7 vs. rc: 189): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 245 (ost0033-es03a_monitor_0) on es7700-3-srv failed (target: 7 vs. rc: 1): Error
Sep 20 08:55:41 md12k-1-srv crmd[11375]: warning: Action 245 (ost0033-es03a_monitor_0) on es7700-3-srv failed (target: 7 vs. rc: 1): Error
=====

The rc=6 ones look absolutely incorrect. The rc=1 one comes from a probe op that finished with rc=7 (NOT_RUNNING). The rc=189 ones seem to be expected.
Hope this info has some value.

Best regards,
Vladislav
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org