Re: [ClusterLabs] CIB: op-status=4 ?
Thanks, your explanation is very helpful considering that it happens rarely and only on the first boot after the VMs are created.

On Mon, May 22, 2017 at 9:34 PM, Ken Gaillot wrote:
> On 05/19/2017 02:03 PM, Radoslaw Garbacz wrote:
> > Hi,
> >
> > I have some more information regarding this issue (pacemaker debug logs).
> >
> > Firstly, I have not mentioned some probably important facts:
> > 1) this happens rarely
> > 2) it happens only on the first boot
> > 3) turning on debug in corosync/pacemaker significantly reduced the
> >    frequency of this happening, i.e. without debug every ~7th cluster
> >    creation, with debug every ~66th cluster creation.
> >
> > This is a 3-node cluster on Azure Cloud, and it does not seem like the
> > resource agent is reporting an error, because all nodes log proper
> > "not running" results.
> >
> > The resource in question is named "dbx_head_head".
> >
> > node1)
> > May 19 13:15:41 [6872] olegdbx39-vm-0 stonith-ng: debug: xml_patch_version_check: Can apply patch 2.5.32 to 2.5.31
> > head.ocf.sh(dbx_head_head)[7717]: 2017/05/19_13:15:42 DEBUG: head_monitor: return 7
> > May 19 13:15:42 [6873] olegdbx39-vm-0 lrmd: debug: operation_finished: dbx_head_head_monitor_0:7717 - exited with rc=7
> > May 19 13:15:42 [6873] olegdbx39-vm-0 lrmd: debug: operation_finished: dbx_head_head_monitor_0:7717:stderr [ -- empty -- ]
> > May 19 13:15:42 [6873] olegdbx39-vm-0 lrmd: debug: operation_finished: dbx_head_head_monitor_0:7717:stdout [ -- empty -- ]
> > May 19 13:15:42 [6873] olegdbx39-vm-0 lrmd: debug: log_finished: finished - rsc:dbx_head_head action:monitor call_id:14 pid:7717 exit-code:7 exec-time:932ms queue-time:0ms
> >
> > node2)
> > May 19 13:15:41 [6266] olegdbx39-vm02 stonith-ng: debug: xml_patch_version_check: Can apply patch 2.5.31 to 2.5.30
> > head.ocf.sh(dbx_head_head)[6485]: 2017/05/19_13:15:41 DEBUG: head_monitor: return 7
> > May 19 13:15:41 [6267] olegdbx39-vm02 lrmd: debug: operation_finished: dbx_head_head_monitor_0:6485 - exited with rc=7
> > May 19 13:15:41 [6267] olegdbx39-vm02 lrmd: debug: operation_finished: dbx_head_head_monitor_0:6485:stderr [ -- empty -- ]
> > May 19 13:15:41 [6267] olegdbx39-vm02 lrmd: debug: operation_finished: dbx_head_head_monitor_0:6485:stdout [ -- empty -- ]
> > May 19 13:15:41 [6267] olegdbx39-vm02 lrmd: debug: log_finished: finished - rsc:dbx_head_head action:monitor call_id:14 pid:6485 exit-code:7 exec-time:790ms queue-time:0ms
> > May 19 13:15:41 [6266] olegdbx39-vm02 stonith-ng: debug: xml_patch_version_check: Can apply patch 2.5.32 to 2.5.31
> > May 19 13:15:41 [6266] olegdbx39-vm02 stonith-ng: debug: xml_patch_version_check: Can apply patch 2.5.33 to 2.5.32
> >
> > node3)
> > == the logs here are different - there is no probing, just a stop attempt (with the proper exit code) ==
> >
> > == reporting a non-existing resource ==
> >
> > May 19 13:15:29 [6293] olegdbx39-vm03 lrmd: debug: process_lrmd_message: Processed lrmd_rsc_info operation from d2c8a871-410a-4006-be52-ee684c0a5f38: rc=0, reply=0, notify=0
> > May 19 13:15:29 [6293] olegdbx39-vm03 lrmd: debug: process_lrmd_message: Processed lrmd_rsc_exec operation from d2c8a871-410a-4006-be52-ee684c0a5f38: rc=10, reply=1, notify=0
> > May 19 13:15:29 [6293] olegdbx39-vm03 lrmd: debug: log_execute: executing - rsc:dbx_first_datas action:monitor call_id:10
> > May 19 13:15:29 [6293] olegdbx39-vm03 lrmd: info: process_lrmd_get_rsc_info: Resource 'dbx_head_head' not found (2 active resources)

FYI, this is normal. It just means the lrmd hasn't been asked to do
anything with this resource before, so it's not found in the lrmd's memory.

> > May 19 13:15:29 [6293] olegdbx39-vm03 lrmd: debug: process_lrmd_message: Processed lrmd_rsc_info operation from d2c8a871-410a-4006-be52-ee684c0a5f38: rc=0, reply=0, notify=0
> > May 19 13:15:29 [6293] olegdbx39-vm03 lrmd: info: process_lrmd_rsc_register: Added 'dbx_head_head' to the rsc list (3 active resources)
> > May 19 13:15:40 [6293] olegdbx39-vm03 lrmd: debug: process_lrmd_message:
Re: [ClusterLabs] CIB: op-status=4 ?
Processing failed op monitor for dbx_mounts_nodes:0 on olegdbx39-vm03: unknown (189)
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_nfs_mounts_datas on olegdbx39-vm03 to dbx_nfs_mounts_datas:0
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: determine_op_status: dbx_nfs_mounts_datas_monitor_0 on olegdbx39-vm03 returned 'unknown' (189) instead of the expected value: 'not running' (7)
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: warning: unpack_rsc_op_failure: Processing failed op monitor for dbx_nfs_mounts_datas:0 on olegdbx39-vm03: unknown (189)
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_ready_primary on olegdbx39-vm03 to dbx_ready_primary:0
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_first_datas on olegdbx39-vm-0 to dbx_first_datas:1
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_swap_nodes on olegdbx39-vm-0 to dbx_swap_nodes:0
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_mounts_nodes on olegdbx39-vm-0 to dbx_mounts_nodes:1
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_bind_mounts_nodes on olegdbx39-vm-0 to dbx_bind_mounts_nodes:1
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_nfs_mounts_datas on olegdbx39-vm-0 to dbx_nfs_mounts_datas:0
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_nfs_nodes on olegdbx39-vm-0 to dbx_nfs_nodes:0
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_ready_primary on olegdbx39-vm-0 to dbx_ready_primary:0
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_first_datas on olegdbx39-vm02 to dbx_first_datas:1
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_swap_nodes on olegdbx39-vm02 to dbx_swap_nodes:0
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_nfs_mounts_datas on olegdbx39-vm02 to dbx_nfs_mounts_datas:0
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_mounts_nodes on olegdbx39-vm02 to dbx_mounts_nodes:1
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_bind_mounts_nodes on olegdbx39-vm02 to dbx_bind_mounts_nodes:1
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_nfs_nodes on olegdbx39-vm02 to dbx_nfs_nodes:0
May 19 13:15:42 [8114] olegdbx39-vm-0 crm_mon: debug: find_anonymous_clone: Internally renamed dbx_ready_primary on olegdbx39-vm02 to dbx_ready_primary:0
[...]

Thanks in advance,

On Thu, May 18, 2017 at 4:37 PM, Ken Gaillot wrote:
> On 05/17/2017 06:10 PM, Radoslaw Garbacz wrote:
> > Hi,
> >
> > I have a question regarding the 'op-status' attribute getting value 4.
> >
> > In my case I have a strange behavior, where resources get those "monitor"
> > operation entries in the CIB with op-status=4, and they do not seem to
> > be called (exec-time=0).
> >
> > What does 'op-status' = 4 mean?
>
> The action had an error status.
>
> > I would appreciate some elaboration regarding this, since this is
> > interpreted by pacemaker as an error, which causes logs:
> > crm_mon: error: unpack_rsc_op: Preventing dbx_head_head from
> > re-starting anywhere: operation monitor failed 'not configured' (6)
>
> The rc-code="6" is the more interesting number; it's the result returned
> by the resource agent. As you can see above, it means "not configured".
> What that means exactly is up to the resource agent's interpretation.
>
> > and I am pretty sure the resource agent was not called (no logs,
> > exec-time=0)
>
> Normally this could only come from the resource agent.
>
> However there are two cases where pacemaker generates this error itself:
> if the resource definition in the CIB is invalid; and if your version of
> pacemaker was compiled with support for reading sensitive parameter
> values from a file, but that file could not be read.
>
> It doesn't sound like your case is either one of those though, since
> they would prevent the resource from even starting. Most likely it's
> coming from the resource agent. I'd look at the resource agent source
> code and see where
Re: [ClusterLabs] CIB: op-status=4 ?
Thanks,

On Thu, May 18, 2017 at 4:37 PM, Ken Gaillot wrote:
> On 05/17/2017 06:10 PM, Radoslaw Garbacz wrote:
> > Hi,
> >
> > I have a question regarding the 'op-status' attribute getting value 4.
> >
> > In my case I have a strange behavior, where resources get those "monitor"
> > operation entries in the CIB with op-status=4, and they do not seem to
> > be called (exec-time=0).
> >
> > What does 'op-status' = 4 mean?
>
> The action had an error status.
>
> > I would appreciate some elaboration regarding this, since this is
> > interpreted by pacemaker as an error, which causes logs:
> > crm_mon: error: unpack_rsc_op: Preventing dbx_head_head from
> > re-starting anywhere: operation monitor failed 'not configured' (6)
>
> The rc-code="6" is the more interesting number; it's the result returned
> by the resource agent. As you can see above, it means "not configured".
> What that means exactly is up to the resource agent's interpretation.
>
> > and I am pretty sure the resource agent was not called (no logs,
> > exec-time=0)
>
> Normally this could only come from the resource agent.
>
> However there are two cases where pacemaker generates this error itself:
> if the resource definition in the CIB is invalid; and if your version of
> pacemaker was compiled with support for reading sensitive parameter
> values from a file, but that file could not be read.
>
> It doesn't sound like your case is either one of those though, since
> they would prevent the resource from even starting. Most likely it's
> coming from the resource agent. I'd look at the resource agent source
> code and see where it can return OCF_ERR_CONFIGURED.
>
> > There are two aspects of this:
> >
> > 1) harmless (pacemaker seems to not bother about it), which I guess
> >    indicates cancelled monitoring operations: op-status=4, rc-code=189
>
> This error means the connection between the crmd and lrmd daemons was
> lost -- most commonly, that shows up for operations that were pending at
> shutdown.
>
> > * Example:
> > <lrm_rsc_op ... operation_key="dbx_first_datas_monitor_0" operation="monitor"
> >   crm-debug-origin="do_update_resource" crm_feature_set="3.0.12"
> >   transition-key="38:0:7:c8b63d9d-9c70-4f99-aa1b-e993de6e4739"
> >   transition-magic="4:189;38:0:7:c8b63d9d-9c70-4f99-aa1b-e993de6e4739"
> >   on_node="olegdbx61-vm01" call-id="10" rc-code="189" op-status="4"
> >   interval="0" last-run="1495057378" last-rc-change="1495057378"
> >   exec-time="0" queue-time="0" op-digest="f6bd1386a336e8e6ee25ecb651a9efb6"/>
> >
> > 2) error-level one (op-status=4, rc-code=6), which generates logs:
> > crm_mon: error: unpack_rsc_op: Preventing dbx_head_head from
> > re-starting anywhere: operation monitor failed 'not configured' (6)
> >
> > * Example:
> > <lrm_rsc_op ... operation_key="dbx_head_head_monitor_0" operation="monitor"
> >   crm-debug-origin="do_update_resource" crm_feature_set="3.0.12"
> >   transition-key="39:0:7:c8b63d9d-9c70-4f99-aa1b-e993de6e4739"
> >   transition-magic="4:6;39:0:7:c8b63d9d-9c70-4f99-aa1b-e993de6e4739"
> >   on_node="olegdbx61-vm01" call-id="9" rc-code="6" op-status="4"
> >   interval="0" last-run="1495057389" last-rc-change="1495057389"
> >   exec-time="0" queue-time="0" op-digest="60cdc9db1c5b77e8dba698d3d0c8cda8"/>
> >
> > Could it be some hardware (VM hypervisor) issue?
> >
> > Thanks in advance,
> >
> > --
> > Best Regards,
> >
> > Radoslaw Garbacz
> > XtremeData Incorporated

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
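To make the two exit codes in this exchange concrete, here is a minimal, hypothetical sketch of an OCF-style monitor function. The parameter names (`OCF_RESKEY_config`, `OCF_RESKEY_pidfile`) and the checks are invented for illustration; only the exit-code convention matches OCF. Real agents additionally source the OCF shell functions shipped with resource-agents.

```shell
#!/bin/sh
# Hypothetical OCF-style monitor sketch; parameter names are invented.
OCF_SUCCESS=0
OCF_ERR_CONFIGURED=6   # 'not configured': treated as fatal, the resource
                       # will not be retried anywhere (as in the log above)
OCF_NOT_RUNNING=7      # clean 'not running': the expected probe result

head_monitor() {
    # A missing required parameter is the usual agent-side source of rc 6.
    if [ -z "$OCF_RESKEY_config" ]; then
        return $OCF_ERR_CONFIGURED
    fi
    # No pidfile: the resource is cleanly stopped.
    if [ ! -f "$OCF_RESKEY_pidfile" ]; then
        return $OCF_NOT_RUNNING
    fi
    return $OCF_SUCCESS
}
```

Grepping the real agent for every path that returns `$OCF_ERR_CONFIGURED`, as Ken suggests, is the quickest way to confirm whether the agent or pacemaker itself produced the rc-code=6.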
[ClusterLabs] CIB: op-status=4 ?
Hi,

I have a question regarding the 'op-status' attribute getting value 4.

In my case I have a strange behavior, where resources get those "monitor" operation entries in the CIB with op-status=4, and they do not seem to be called (exec-time=0).

What does 'op-status' = 4 mean?

I would appreciate some elaboration regarding this, since it is interpreted by pacemaker as an error, which causes logs:
crm_mon: error: unpack_rsc_op: Preventing dbx_head_head from re-starting anywhere: operation monitor failed 'not configured' (6)
and I am pretty sure the resource agent was not called (no logs, exec-time=0).

There are two aspects of this:

1) harmless (pacemaker seems to not bother about it), which I guess indicates cancelled monitoring operations: op-status=4, rc-code=189

* Example:

2) error-level one (op-status=4, rc-code=6), which generates logs:
crm_mon: error: unpack_rsc_op: Preventing dbx_head_head from re-starting anywhere: operation monitor failed 'not configured' (6)

* Example:

Could it be some hardware (VM hypervisor) issue?

Thanks in advance,

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
[ClusterLabs] pacemaker daemon shutdown time with lost remote node
Hi,

I have a question regarding the pacemaker daemon shutdown procedure/configuration.

In my case, when a remote node is lost, pacemaker needs exactly 10 minutes to shut down, during which nothing is logged.

So my questions:
1. What is pacemaker doing during this time?
2. How can I make it shorter?

Changed Pacemaker configuration:
- cluster-delay
- dc-deadtime

Pacemaker logs:
Apr 28 17:38:08 [17689] ip-10-41-177-183 pacemakerd: notice: crm_signal_dispatch: Caught 'Terminated' signal | 15 (invoking handler)
Apr 28 17:38:08 [17689] ip-10-41-177-183 pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker
Apr 28 17:38:08 [17689] ip-10-41-177-183 pacemakerd: notice: stop_child: Stopping crmd | sent signal 15 to process 17698
Apr 28 17:48:07 [17695] ip-10-41-177-183 lrmd: info: cancel_recurring_action: Cancelling ocf operation monitor_head_monitor_191000
Apr 28 17:48:07 [17695] ip-10-41-177-183 lrmd: info: log_execute: executing - rsc:monitor_head action:stop call_id:130
[...]
Apr 28 17:48:07 [17689] ip-10-41-177-183 pacemakerd: info: main: Exiting pacemakerd
Apr 28 17:48:07 [17689] ip-10-41-177-183 pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2

Pacemaker built from GitHub: 1.16

Help greatly appreciated.

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
[ClusterLabs] nodes ID assignment issue
Hi,

I have a question regarding building the CIB nodes scope, and specifically the assignment of node IDs. It seems like the pre-existing scope is not honored, and nodes can get replaced based on check-in order.

I pre-create the nodes scope because it is faster than setting parameters for all the nodes later (when the number of nodes is large).

From the listings below, one can see that the node with ID=1 had its uname replaced with another node's, but not its options. This situation causes problems when resource assignment is based on rules involving node options.

Is there a way to prevent this rearrangement of 'uname'? If not, is there a way to make the options follow 'uname'? Or maybe the problem is somewhere else - corosync configuration perhaps? Is the corosync 'nodeid' enforced to be also the CIB node 'id'?

Thanks in advance,

Below is the CIB committed before the nodes check in:

And the automatic changes after the nodes check in:

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
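On the corosync side, node IDs can be pinned explicitly in corosync.conf; with corosync 2.x, pacemaker takes its cluster node ID from this value, so pinning it should keep the CIB node 'id'/'uname' pairing stable across check-in order. A minimal nodelist sketch (the addresses and names here are hypothetical):

```
nodelist {
    node {
        ring0_addr: 10.0.0.11   # hypothetical address
        name: node1
        nodeid: 1               # explicit corosync node id
    }
    node {
        ring0_addr: 10.0.0.12
        name: node2
        nodeid: 2
    }
}
```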
Re: [ClusterLabs] cloned resources ordering and remote nodes problem
Thank you, however in my case this parameter does not change the described behavior.

A more detailed example:

order: res_A-clone -> res_B-clone -> res_C

When "res_C" is not on the node which had a "res_A" instance fail, it will not be restarted; only "res_A" and "res_B" will have all their instances restarted.

I implemented a workaround by modifying "res_C": I made it cloned as well, and now it is restarted.

My Pacemaker: 1.1.16-1.el6
System: CentOS 6

Regards,
[ClusterLabs] cloned resources ordering and remote nodes problem
Hi,

I have a question regarding resource order settings.

I have cloned resources "res_1-clone" and "res_2-clone", and a defined order: first "res_1-clone" then "res_2-clone".

When I have a monitoring failure on a remote node with "res_1" (an instance of "res_1-clone"), which causes all dependent resources to be restarted, only the instances on this remote node are restarted, not the ones on other nodes.

Is this intentional behavior, and if so, is there a way to make all instances of the cloned resource restart in such a case?

I can provide more details regarding the CIB configuration if needed.

Pacemaker 1.1.16-1.el6
OS: Linux CentOS 6

Thanks in advance,

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
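Whether an ordering between two clones is evaluated per node or clone-wide is governed by the clone meta-attribute `interleave`: with `interleave="false"` (the default) the dependent clone's instances wait for all instances of the first clone, while `interleave="true"` restricts the dependency to instances on the same node. A minimal sketch of the relevant CIB fragment for the example names above (the primitive definition and any other attributes are omitted):

```xml
<clone id="res_1-clone">
  <meta_attributes id="res_1-clone-meta_attributes">
    <!-- interleave=false: orderings involving this clone are evaluated
         across all instances, not per node -->
    <nvpair id="res_1-clone-meta_attributes-interleave"
            name="interleave" value="false"/>
  </meta_attributes>
</clone>
```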
Re: [ClusterLabs] cloned resource not deployed on all matching nodes
Thanks,

On Tue, Mar 28, 2017 at 2:37 PM, Ken Gaillot wrote:
> On 03/28/2017 01:26 PM, Radoslaw Garbacz wrote:
> > Hi,
> >
> > I have a situation where a cloned resource is deployed on only some of
> > the nodes, even though this resource is similar to others, which are
> > deployed according to their location rules properly.
> >
> > Please take a look at the configuration below and let me know if there
> > is anything to do to make the resource "dbx_nfs_mounts_datas" (which is
> > a primitive of "dbx_nfs_mounts_datas-clone") deploy on all 4 nodes
> > matching its location rules.
>
> Look in your logs for "pengine:" messages. They will list the decisions
> made about where to start resources, then have a message about
> "Calculated transition ... saving inputs in ..." with a file name.
>
> You can run crm_simulate on that file to see why the decisions were
> made. The output is somewhat difficult to follow, but "crm_simulate -Ssx
> $FILENAME" will show every score that went into the decision.
>
> > Thanks in advance,
> >
> > * Configuration:
> > ** Nodes:
> > [...]
> >
> > ** Resource in question:
> > <primitive ... type="dbx_mounts.ocf.sh" class="ocf" provider="dbxcl">
> >   <instance_attributes id="dbx_nfs_mounts_datas-instance_attributes">
> >     ...
> >   </instance_attributes>
> >   <meta_attributes ...>
> >     <nvpair ... id="dbx_nfs_mounts_datas-meta_attributes-target-role"/>
> >     <nvpair ... id="dbx_nfs_mounts_datas-meta_attributes-clone-max"/>
> >   </meta_attributes>
> > </primitive>
> >
> > ** Resource location:
> > <rsc_location ... rsc="dbx_nfs_mounts_datas">
> >   <rule ... id="on_nodes_dbx_nfs_mounts_datas-INFINITY" boolean-op="and">
> >     <expression ... id="on_nodes_dbx_nfs_mounts_datas-INFINITY-0-expr" value="Active"/>
> >     <expression ... id="on_nodes_dbx_nfs_mounts_datas-INFINITY-1-expr" value="AD"/>
> >   </rule>
> >   <rule ... id="on_nodes_dbx_nfs_mounts_datas--INFINITY" boolean-op="or">
> >     <expression ... id="on_nodes_dbx_nfs_mounts_datas--INFINITY-0-expr" value="Active"/>
> >     <expression ... id="on_nodes_dbx_nfs_mounts_datas--INFINITY-1-expr" value="AD"/>
> >   </rule>
> > </rsc_location>
> >
> > ** Status on properly deployed node:
> > <lrm_resource ... type="dbx_mounts.ocf.sh" class="ocf" provider="dbxcl">
> >   <lrm_rsc_op ... operation_key="dbx_nfs_mounts_datas_start_0" operation="start"
> >     crm-debug-origin="do_update_resource" crm_feature_set="3.0.12"
> >     transition-key="156:0:0:d817e2a2-50fb-4462-bd6b-118d1d7b8ecd"
> >     transition-magic="0:0;156:0:0:d817e2a2-50fb-4462-bd6b-118d1d7b8ecd"
> >     on_node="ip-10-180-227-53" call-id="85" rc-code="0" op-status="0"
> >     interval="0" last-run="1490720995" last-rc-change="1490720995"
> >     exec-time="733" queue-time="0"
> >     op-digest="e95785e3e2d043b0bda24c5bd4655317" op-force-restart=""
> >     op-restart-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
> >   <lrm_rsc_op ... operation_key="dbx_nfs_mounts_datas_monitor_137000" operation="monitor"
> >     crm-debug-origin="do_update_resource" crm_feature_set="3.0.12"
> >     transition-key="157:0:0:d817e2a2-50fb-4462-bd6b-118d1d7b8ecd"
> >     transition-magic="0:0;157:0:0:d817e2a2-50fb-4462-bd6b-118d1d7b8ecd"
> >     on_node="ip-10-
[ClusterLabs] cloned resource not deployed on all matching nodes
Hi,

I have a situation where a cloned resource is deployed on only some of the nodes, even though this resource is similar to others, which are deployed according to their location rules properly.

Please take a look at the configuration below and let me know if there is anything to do to make the resource "dbx_nfs_mounts_datas" (which is a primitive of "dbx_nfs_mounts_datas-clone") deploy on all 4 nodes matching its location rules.

Thanks in advance,

* Configuration:
** Nodes:

** Resource in question:
...

** Resource location:

** Status on properly deployed node:

** Status on not properly deployed node:

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
Re: [ClusterLabs] Antw: Re: CIB configuration: role with many expressions - error 203
Thanks, just found that out as well.

On Wed, Mar 22, 2017 at 9:39 AM, Ken Gaillot wrote:
> On 03/22/2017 09:26 AM, Radoslaw Garbacz wrote:
> > I have tried it also as 'boolean_op'; sorry, I did not mention this in
> > the original post (just as a remark, the pacemaker documentation has
> > both forms).
>
> *smacks forehead*
>
> Yep, the documentation needs to be fixed. You were right the first time,
> it's "boolean-op" with a dash.
>
> Looking at your example again, I think the problem is that you're using
> the same ID for both expressions. The ID must be unique.
>
> > To make it work I had to remove the additional "<expression>" and
> > leave only one.
> >
> > To summarize:
> > - having no "boolean..." attribute and a single "expression" - works
> > - having "boolean-op" and a single "expression" - works
> > - having "boolean_op" and a single "expression" - does not work
> > - having either "boolean-op" or "boolean_op", or no such attribute at
> >   all, with more than one "expression" - does not work
> >
> > I have found the reason: the expression IDs within the rule were the
> > same; once I made them unique, it works.
> >
> > Thanks,
> >
> > On Wed, Mar 22, 2017 at 2:06 AM, Ulrich Windl
> > <ulrich.wi...@rz.uni-regensburg.de> wrote:
> > > >>> Ken Gaillot <kgail...@redhat.com> wrote on 22.03.2017 at 00:18
> > > in message <94b7e5fd-cb65-4775-71df-ca8983629...@redhat.com>:
> > > > On 03/21/2017 11:20 AM, Radoslaw Garbacz wrote:
> > > >> Hi,
> > > >>
> > > >> I have a problem when creating rules with many expressions:
> > > >>
> > > >> <rule ... boolean-op="and">
> > > >>   <expression ... type="string" id="on_nodes_dbx_first_head-expr" value="Active"/>
> > > >>   <expression ... type="string" id="on_nodes_dbx_first_head-expr" value="AH"/>
> > > >> </rule>
> > > >>
> > > >> Result:
> > > >> Call cib_replace failed (-203): Update does not conform to the
> > > >> configured schema
> > > >>
> > > >> Everything works when I remove the "boolean-op" attribute and
> > > >> leave only one expression.
> > > >> What do I do wrong when creating rules?
> > > >
> > > > boolean_op
> > > >
> > > > Underbar not dash :-)
> > >
> > > Good spotting, but I think a more useful error message would be
> > > desired ;-)
> > >
> > > >> Pacemaker 1.1.16-1.el6
> > > >> Written by Andrew Beekhof
> > > >>
> > > >> Thanks in advance for any help,
> > > >>
> > > >> --
> > > >> Best Regards,
> > > >>
> > > >> Radoslaw Garbacz
> > > >> XtremeData Incorporated

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
Re: [ClusterLabs] Antw: Re: CIB configuration: role with many expressions - error 203
I have tried it also as 'boolean_op'; sorry, I did not mention this in the original post (just as a remark, the pacemaker documentation has both forms).

To make it work I had to remove the additional "<expression>" and leave only one.

To summarize:
- having no "boolean..." attribute and a single "expression" - works
- having "boolean-op" and a single "expression" - works
- having "boolean_op" and a single "expression" - does not work
- having either "boolean-op" or "boolean_op", or no such attribute at all, with more than one "expression" - does not work

I have found the reason: the expression IDs within the rule were the same; once I made them unique, it works.

Thanks,

On Wed, Mar 22, 2017 at 2:06 AM, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
> >>> Ken Gaillot wrote on 22.03.2017 at 00:18 in message
> <94b7e5fd-cb65-4775-71df-ca8983629...@redhat.com>:
> > On 03/21/2017 11:20 AM, Radoslaw Garbacz wrote:
> >> Hi,
> >>
> >> I have a problem when creating rules with many expressions:
> >>
> >> <rule ... boolean-op="and">
> >>   <expression ... type="string" id="on_nodes_dbx_first_head-expr" value="Active"/>
> >>   <expression ... type="string" id="on_nodes_dbx_first_head-expr" value="AH"/>
> >> </rule>
> >>
> >> Result:
> >> Call cib_replace failed (-203): Update does not conform to the
> >> configured schema
> >>
> >> Everything works when I remove the "boolean-op" attribute and leave
> >> only one expression.
> >> What do I do wrong when creating rules?
> >
> > boolean_op
> >
> > Underbar not dash :-)
>
> Good spotting, but I think a more useful error message would be desired ;-)
>
> >> Pacemaker 1.1.16-1.el6
> >> Written by Andrew Beekhof
> >>
> >> Thanks in advance for any help,
> >>
> >> --
> >> Best Regards,
> >>
> >> Radoslaw Garbacz
> >> XtremeData Incorporated

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
[ClusterLabs] CIB configuration: role with many expressions - error 203
Hi,

I have a problem when creating rules with many expressions:

Result:
Call cib_replace failed (-203): Update does not conform to the configured schema

Everything works when I remove the "boolean-op" attribute and leave only one expression.
What do I do wrong when creating rules?

Pacemaker 1.1.16-1.el6
Written by Andrew Beekhof

Thanks in advance for any help,

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
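For reference, the shape that eventually worked in this thread combines 'boolean-op' spelled with a dash and, crucially, a unique id on each expression. The attribute names 'node_type'/'node_group' below are placeholders, since the original XML was stripped from the archive; only the ids and values come from the thread:

```xml
<rule id="on_nodes_dbx_first_head" score="INFINITY" boolean-op="and">
  <!-- each expression needs its own unique id -->
  <expression attribute="node_type" operation="eq" type="string"
              id="on_nodes_dbx_first_head-expr-0" value="Active"/>
  <expression attribute="node_group" operation="eq" type="string"
              id="on_nodes_dbx_first_head-expr-1" value="AH"/>
</rule>
```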
Re: [ClusterLabs] Antw: emergency stop does not honor resources ordering constraints (?)
Thank you. Loss of quorum could indeed be intentional behavior; however, I experience the same situation when there is a monitoring failure, or when the parameter "no-quorum-policy" is set to "ignore", i.e.:

- normal pacemaker service stop, or 'crm_resource' stop for all resources: A -> B -> C
- lost quorum (with 'no-quorum-policy=ignore'), or 'crm_resource' stop for all resources when one of the resources reported a "monitor" error: not ordered stop

I will double-check my tests; however, it would be helpful to know whether it is as it is supposed to be.

On Wed, Dec 7, 2016 at 1:40 AM, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
> >>> Radoslaw Garbacz wrote on 06.12.2016 at 18:50 in message:
> > Hi,
> >
> > I have encountered a problem with pacemaker resource shutdown in case
> > of (what seems like) any emergency situation, when order constraints
> > are not honored.
> > I would be grateful for any information on whether this behavior is
> > intentional or should not happen (i.e. a testing issue rather than
> > pacemaker behavior). It would also be helpful to know if there is any
> > configuration parameter altering this, or whether there can be any
> > reason (cluster event) triggering a not-ordered resource stop.
> >
> > Thanks,
> >
> > To illustrate the issue I provide an example below and my collected
> > data. My environment uses the resource cloning feature - maybe this
> > contributes to my tests' outcome.
> >
> > * Example:
> > - having resources ordered with constraints: A -> B -> C
> > - when stopping with the 'crm_resource' command (all at once),
> >   resources are stopped: C, B, A
> > - when stopping by terminating pacemaker, resources are stopped: C, B, A
> > - when there is a monitoring error or quorum is lost: no order is
> >   honored, e.g. B, C, A
>
> Hi!
>
> If the node does not have quorum, it cannot do any cluster operations
> (IMHO). Instead it will try to commit suicide, maybe with the help of
> self-fencing. So I think this case is normal for no quorum.
>
> Ulrich

> > * Version details:
> > Pacemaker 1.1.15-1.1f8e642.git.el6
> > Corosync Cluster Engine, version '2.4.1.2-0da1'
> >
> > * My ordering constraints:
> > Ordering Constraints:
> >   dbx_first_primary then dbx_head_head (kind:Mandatory)
> >   dbx_first_primary-clone then dbx_head_head (kind:Mandatory)
> >   dbx_head_head then dbx_mounts_nodes (kind:Mandatory)
> >   dbx_head_head then dbx_mounts_nodes-clone (kind:Mandatory)
> >   dbx_mounts_nodes then dbx_bind_mounts_nodes (kind:Mandatory)
> >   dbx_mounts_nodes-clone then dbx_bind_mounts_nodes-clone (kind:Mandatory)
> >   dbx_bind_mounts_nodes then dbx_nfs_nodes (kind:Mandatory)
> >   dbx_bind_mounts_nodes-clone then dbx_nfs_nodes-clone (kind:Mandatory)
> >   dbx_nfs_nodes then dbx_gss_datas (kind:Mandatory)
> >   dbx_nfs_nodes-clone then dbx_gss_datas-clone (kind:Mandatory)
> >   dbx_gss_datas then dbx_nfs_mounts_datas (kind:Mandatory)
> >   dbx_gss_datas-clone then dbx_nfs_mounts_datas-clone (kind:Mandatory)
> >   dbx_nfs_mounts_datas then dbx_swap_nodes (kind:Mandatory)
> >   dbx_nfs_mounts_datas-clone then dbx_swap_nodes-clone (kind:Mandatory)
> >   dbx_swap_nodes then dbx_sync_head (kind:Mandatory)
> >   dbx_swap_nodes-clone then dbx_sync_head (kind:Mandatory)
> >   dbx_sync_head then dbx_dbx_datas (kind:Mandatory)
> >   dbx_sync_head then dbx_dbx_datas-clone (kind:Mandatory)
> >   dbx_dbx_datas then dbx_dbx_head (kind:Mandatory)
> >   dbx_dbx_datas-clone then dbx_dbx_head (kind:Mandatory)
> >   dbx_dbx_head then dbx_web_head (kind:Mandatory)
> >   dbx_web_head then dbx_ready_primary (kind:Mandatory)
> >   dbx_web_head then dbx_ready_primary-clone (kind:Mandatory)
> >
> > * Pacemaker stop (OK):
> > ready.ocf.sh(dbx_ready_primary)[18639]: 2016/12/06_15:40:32 INFO: ready_stop: Stopping resource
> > mng.ocf.sh(dbx_mng_head)[20312]: 2016/12/06_15:40:44 INFO: mng_stop: Stopping resource
> > web.ocf.sh(dbx_web_head)[20310]: 2016/12/06_15:40:44 INFO: dbxcl_stop: Stopping resource
> > dbx.ocf.sh(dbx_dbx_head)[20569]: 2016/12/06_15:40:46 INFO: dbxcl_stop: Stopping resource
> > sync.ocf.sh(dbx_sync_head)[20719]: 2016/12/06_15:40:54 INFO: sync_stop: Stopping resource
> > swap.ocf.sh(dbx_swap_nodes)[21053]: 2016/12/06_15:40:56 INFO: swap_stop: Stopping resource
> > nfs.ocf.sh(dbx_nf
[ClusterLabs] emergency stop does not honor resources ordering constraints (?)
Hi,

I have encountered a problem with pacemaker resource shutdown: in (what seems like) any emergency situation, ordering constraints are not honored. I would be grateful for any information on whether this behavior is intentional or should not happen (i.e. a testing issue rather than pacemaker behavior). It would also be helpful to know if there is any configuration parameter altering this, or whether any cluster event can trigger an unordered resource stop.

Thanks,

To illustrate the issue I provide an example below and my collected data. My environment uses the resource cloning feature - maybe this contributes to my tests' outcome.

* Example:
- having resources ordered with constraints: A -> B -> C
- when stopping with the 'crm_resources' command (all at once), resources are stopped: C, B, A
- when stopping by terminating pacemaker, resources are stopped: C, B, A
- when there is a monitoring error or quorum is lost, no order is honored, e.g.: B, C, A

* Version details:
Pacemaker 1.1.15-1.1f8e642.git.el6
Corosync Cluster Engine, version '2.4.1.2-0da1'

* My ordering constraints:
Ordering Constraints:
  dbx_first_primary then dbx_head_head (kind:Mandatory)
  dbx_first_primary-clone then dbx_head_head (kind:Mandatory)
  dbx_head_head then dbx_mounts_nodes (kind:Mandatory)
  dbx_head_head then dbx_mounts_nodes-clone (kind:Mandatory)
  dbx_mounts_nodes then dbx_bind_mounts_nodes (kind:Mandatory)
  dbx_mounts_nodes-clone then dbx_bind_mounts_nodes-clone (kind:Mandatory)
  dbx_bind_mounts_nodes then dbx_nfs_nodes (kind:Mandatory)
  dbx_bind_mounts_nodes-clone then dbx_nfs_nodes-clone (kind:Mandatory)
  dbx_nfs_nodes then dbx_gss_datas (kind:Mandatory)
  dbx_nfs_nodes-clone then dbx_gss_datas-clone (kind:Mandatory)
  dbx_gss_datas then dbx_nfs_mounts_datas (kind:Mandatory)
  dbx_gss_datas-clone then dbx_nfs_mounts_datas-clone (kind:Mandatory)
  dbx_nfs_mounts_datas then dbx_swap_nodes (kind:Mandatory)
  dbx_nfs_mounts_datas-clone then dbx_swap_nodes-clone (kind:Mandatory)
  dbx_swap_nodes then dbx_sync_head (kind:Mandatory)
  dbx_swap_nodes-clone then dbx_sync_head (kind:Mandatory)
  dbx_sync_head then dbx_dbx_datas (kind:Mandatory)
  dbx_sync_head then dbx_dbx_datas-clone (kind:Mandatory)
  dbx_dbx_datas then dbx_dbx_head (kind:Mandatory)
  dbx_dbx_datas-clone then dbx_dbx_head (kind:Mandatory)
  dbx_dbx_head then dbx_web_head (kind:Mandatory)
  dbx_web_head then dbx_ready_primary (kind:Mandatory)
  dbx_web_head then dbx_ready_primary-clone (kind:Mandatory)

* Pacemaker stop (OK):
ready.ocf.sh(dbx_ready_primary)[18639]: 2016/12/06_15:40:32 INFO: ready_stop: Stopping resource
mng.ocf.sh(dbx_mng_head)[20312]: 2016/12/06_15:40:44 INFO: mng_stop: Stopping resource
web.ocf.sh(dbx_web_head)[20310]: 2016/12/06_15:40:44 INFO: dbxcl_stop: Stopping resource
dbx.ocf.sh(dbx_dbx_head)[20569]: 2016/12/06_15:40:46 INFO: dbxcl_stop: Stopping resource
sync.ocf.sh(dbx_sync_head)[20719]: 2016/12/06_15:40:54 INFO: sync_stop: Stopping resource
swap.ocf.sh(dbx_swap_nodes)[21053]: 2016/12/06_15:40:56 INFO: swap_stop: Stopping resource
nfs.ocf.sh(dbx_nfs_nodes)[21151]: 2016/12/06_15:40:58 INFO: nfs_stop: Stopping resource
dbx_mounts.ocf.sh(dbx_bind_mounts_nodes)[21344]: 2016/12/06_15:40:59 INFO: dbx_mounts_stop: Stopping resource
dbx_mounts.ocf.sh(dbx_mounts_nodes)[21767]: 2016/12/06_15:41:01 INFO: dbx_mounts_stop: Stopping resource
head.ocf.sh(dbx_head_head)[22213]: 2016/12/06_15:41:04 INFO: head_stop: Stopping resource
first.ocf.sh(dbx_first_primary)[22999]: 2016/12/06_15:41:11 INFO: first_stop: Stopping resource

* Quorum lost:
sync.ocf.sh(dbx_sync_head)[23099]: 2016/12/06_16:42:04 INFO: sync_stop: Stopping resource
nfs.ocf.sh(dbx_nfs_nodes)[23102]: 2016/12/06_16:42:04 INFO: nfs_stop: Stopping resource
mng.ocf.sh(dbx_mng_head)[23101]: 2016/12/06_16:42:04 INFO: mng_stop: Stopping resource
ready.ocf.sh(dbx_ready_primary)[23104]: 2016/12/06_16:42:04 INFO: ready_stop: Stopping resource
web.ocf.sh(dbx_web_head)[23344]: 2016/12/06_16:42:04 INFO: dbxcl_stop: Stopping resource
dbx_mounts.ocf.sh(dbx_bind_mounts_nodes)[23664]: 2016/12/06_16:42:05 INFO: dbx_mounts_stop: Stopping resource
dbx_mounts.ocf.sh(dbx_mounts_nodes)[24459]: 2016/12/06_16:42:08 INFO: dbx_mounts_stop: Stopping resource
head.ocf.sh(dbx_head_head)[25036]: 2016/12/06_16:42:11 INFO: head_stop: Stopping resource
swap.ocf.sh(dbx_swap_nodes)[27491]: 2016/12/06_16:43:08 INFO: swap_stop: Stopping resource

-- Best Regards, Radoslaw Garbacz XtremeData Incorporation ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
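With a chain of mandatory orderings like the one above, a clean shutdown is expected to stop resources in the reverse of their start order - the behavior seen in the "Pacemaker stop (OK)" log, and precisely what the "Quorum lost" log violates. A minimal sketch of that expectation (the three resource names are taken from the constraint list; the reversal itself is the whole point):

```shell
# With mandatory ordering A -> B -> C, an orderly stop must run in the
# reverse of the start order: the last resource started stops first.
# tac reverses the start list to produce the expected stop order.
printf '%s\n' dbx_first_primary dbx_head_head dbx_mounts_nodes | tac
```

Anything else in the stop log (as in the quorum-lost excerpt above) means the ordering constraints were not consulted for that transition.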
Re: [ClusterLabs] Pacemaker remote - invalid message detected, endian mismatch
d: ( lrmd.c:523 ) warning: send_client_notify:Notification of client remote-lrmd-ip-10-203-186-119:3121/c20a9a4e-b919-4e8a-8167-0cfa846fb24c failed Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: ( services.c:461 )info: cancel_recurring_action:Cancelling ocf operation dbx_nfs_mounts_datas_monitor_137000 Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:361 ) trace: crm_remote_send:Sending len[0]=40, start=6d726c3c Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:237 ) trace: crm_send_tls:Message size: 40 Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:246 ) error: crm_send_tls:Connection terminated rc = -10 Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:237 ) trace: crm_send_tls:Message size: 903 Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:246 ) error: crm_send_tls:Connection terminated rc = -10 Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:364 ) error: crm_remote_send:Failed to send remote msg, rc = -10 Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (lrmd_client.:584 ) error: lrmd_tls_send_msg:Failed to send remote lrmd tls msg, rc = -10 Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: ( lrmd.c:523 ) warning: send_client_notify:Notification of client remote-lrmd-ip-10-203-186-119:3121/c20a9a4e-b919-4e8a-8167-0cfa846fb24c failed Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: ( services.c:461 )info: cancel_recurring_action:Cancelling ocf operation dbx_gss_datas_monitor_127000 On Fri, Sep 30, 2016 at 4:53 PM, Jan Pokorný wrote: > On 30/09/16 11:28 -0500, Radoslaw Garbacz wrote: > > I have posted a question about this error attached to another thread, but > > because it was old and there is no answer I thought it could have been > > missed, so I am sorry for repeating it. > > > > Regarding the problem. 
> > I have a cluster, and when the cluster gets bigger (around 40 remote > nodes) > > some remote nodes go offline after a while and their logs report some > > message errors, there is no indication about anything wrong in the other > > logs. > > I believe I would have a plausible explanation provided it may happen > (not sure now, perhaps the ipc proxy setup would allow it) that two > messages via the same connection are transmitted, with the second one > being read as part of the first one. > > Could you please try running pacemaker_remoted with > "PCMK_trace_files=remote.c" in the respective "sysconfig" file? > > > Details: > > - 40 ec2 m3.xlarge nodes, 1 corosync ring member, 39 remote > > - maybe irrelevant, but either "cib" or "pengine" process goes to ~100% > CPU > > - it does not happen immediately > > - smaller cluster (~20 remote nodes) does not show any problems > > - pacemaker: 1.1.15-1.1f8e642.git.el6.x86_64 > > - corosync: 2.4.1-1.2.0da1.el6.x86_64 > > - libqb-1.0.0-1.28.4dff.el6.x86_64 > > - CentOS 6 > > > > Logs: > > > > [...] 
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > > crm_abort:crm_remote_header: Triggered assert at remote.c:119 : > > endian == ENDIAN_LOCAL > > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > > crm_remote_header:Invalid message detected, endian mismatch: > > badadbbd is neither 63646330 nor the swab'd 30636463 > > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > > crm_abort:crm_remote_header: Triggered assert at remote.c:119 : > > endian == ENDIAN_LOCAL > > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > > crm_remote_header:Invalid message detected, endian mismatch: > > badadbbd is neither 63646330 nor the swab'd 30636463 > > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > > crm_abort:crm_remote_header: Triggered assert at remote.c:119 : > > endian == ENDIAN_LOCAL > > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > > crm_remote_header:Invalid message detected, endian mismatch: > > badadbbd is neither 63646330 nor the swab'd 30636463 > > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: > > lrmd_remote_client_msg: Client disconnect detected in tls msg > dispatcher. > > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: > > ipc_proxy_remove_provider:ipc proxy connection for client > > ca8df213-6da7-4c42-8cb3-b8bc0887f2ce pid 21815 destroyed because cluster > > node disconnected. > > Sep
[ClusterLabs] Pacemaker remote - invalid message detected, endian mismatch
Hi, I have posted a question about this error attached to another thread, but because it was old and there is no answer I thought it could have been missed, so I am sorry for repeating it. Regarding the problem. I have a cluster, and when the cluster gets bigger (around 40 remote nodes) some remote nodes go offline after a while and their logs report some message errors, there is no indication about anything wrong in the other logs. Details: - 40 ec2 m3.xlarge nodes, 1 corosync ring member, 39 remote - maybe irrelevant, but either "cib" or "pengine" process goes to ~100% CPU - it does not happen immediately - smaller cluster (~20 remote nodes) does not show any problems - pacemaker: 1.1.15-1.1f8e642.git.el6.x86_64 - corosync: 2.4.1-1.2.0da1.el6.x86_64 - libqb-1.0.0-1.28.4dff.el6.x86_64 - CentOS 6 Logs: [...] Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_abort:crm_remote_header: Triggered assert at remote.c:119 : endian == ENDIAN_LOCAL Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_remote_header:Invalid message detected, endian mismatch: badadbbd is neither 63646330 nor the swab'd 30636463 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_abort:crm_remote_header: Triggered assert at remote.c:119 : endian == ENDIAN_LOCAL Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_remote_header:Invalid message detected, endian mismatch: badadbbd is neither 63646330 nor the swab'd 30636463 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_abort:crm_remote_header: Triggered assert at remote.c:119 : endian == ENDIAN_LOCAL Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_remote_header:Invalid message detected, endian mismatch: badadbbd is neither 63646330 nor the swab'd 30636463 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher. 
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: ipc_proxy_remove_provider:ipc proxy connection for client ca8df213-6da7-4c42-8cb3-b8bc0887f2ce pid 21815 destroyed because cluster node disconnected. Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: cancel_recurring_action: Cancelling ocf operation monitor_all_monitor_191000 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_send_tls: Connection terminated rc = -53 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_send_tls: Connection terminated rc = -10 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_remote_send: Failed to send remote msg, rc = -10 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: lrmd_tls_send_msg:Failed to send remote lrmd tls msg, rc = -10 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: warning: send_client_notify: Notification of client remote-lrmd-ip-10-237-223-67:3121/b6034d3a-e296-492f-b296-725735d17e22 failed Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-ip-10-237-223-67:3121 id: b6034d3a-e296-492f-b296- 725735d17e22 Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error: ipc_proxy_accept: No ipc providers available for uid 0 gid 0 Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error: handle_new_connection:Error in connection setup (19626-21815-14): Remote I/O error (121) Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error: ipc_proxy_accept: No ipc providers available for uid 0 gid 0 Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error: handle_new_connection:Error in connection setup (19626-21815-14): Remote I/O error (121) [...] 
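For reference, the "endian mismatch" errors above come from a sanity check on the remote protocol header: the header carries a magic word that must equal the local value 63646330 or its byte-swapped form (what a peer of opposite endianness would send). A value such as badadbbd that matches neither indicates a corrupted or misaligned stream, not a real endianness difference. A small sketch reproducing the swab'd value quoted in the log:

```shell
# Byte-swap the 32-bit magic word 0x63646330 and print it; the result
# matches the "swab'd 30636463" value in the pacemaker_remoted log.
magic=$((0x63646330))
swab=$(( ((magic & 0xff) << 24) | ((magic & 0xff00) << 8) \
       | ((magic >> 8) & 0xff00) | ((magic >> 24) & 0xff) ))
printf '%08x\n' "$swab"
```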
-- Best Regards, Radoslaw Garbacz XtremeData Incorporation
Re: [ClusterLabs] pacemaker_remoted XML parse error
Just to add a maybe helpful observation: either the "cib" or the "pengine" process goes to ~100% CPU when these remote node errors happen. On Tue, Sep 27, 2016 at 2:36 PM, Radoslaw Garbacz < radoslaw.garb...@xtremedatainc.com> wrote: > Hi, > > I encountered the same problem with pacemaker built from github at around > August 22. > > Remote nodes go offline occasionally and stay so, their logs show the same > errors. The cluster is on AWS ec2 instances, the network works and is an > unlikely reason. > > Have there been any commits on github recently (after August 22) addressing > this issue? > > > Logs: > [...] > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_abort:crm_remote_header: Triggered assert at remote.c:119 : > endian == ENDIAN_LOCAL > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_remote_header:Invalid message detected, endian mismatch: > badadbbd is neither 63646330 nor the swab'd 30636463 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_abort:crm_remote_header: Triggered assert at remote.c:119 : > endian == ENDIAN_LOCAL > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_remote_header:Invalid message detected, endian mismatch: > badadbbd is neither 63646330 nor the swab'd 30636463 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_abort:crm_remote_header: Triggered assert at remote.c:119 : > endian == ENDIAN_LOCAL > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_remote_header:Invalid message detected, endian mismatch: > badadbbd is neither 63646330 nor the swab'd 30636463 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: > lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher. 
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: > ipc_proxy_remove_provider:ipc proxy connection for client > ca8df213-6da7-4c42-8cb3-b8bc0887f2ce pid 21815 destroyed because cluster > node disconnected. > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: > cancel_recurring_action: Cancelling ocf operation > monitor_all_monitor_191000 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_send_tls: Connection terminated rc = -53 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_send_tls: Connection terminated rc = -10 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_remote_send: Failed to send remote msg, rc = -10 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > lrmd_tls_send_msg:Failed to send remote lrmd tls msg, rc = -10 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: warning: > send_client_notify: Notification of client > remote-lrmd-ip-10-237-223-67:3121/b6034d3a-e296-492f-b296-725735d17e22 > failed > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: notice: > lrmd_remote_client_destroy: LRMD client disconnecting remote client > - name: remote-lrmd-ip-10-237-223-67:3121 id: b6034d3a-e296-492f-b296- > 725735d17e22 > Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error: > ipc_proxy_accept: No ipc providers available for uid 0 gid 0 > Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error: > handle_new_connection:Error in connection setup (19626-21815-14): > Remote I/O error (121) > Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error: > ipc_proxy_accept: No ipc providers available for uid 0 gid 0 > Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error: > handle_new_connection:Error in connection setup (19626-21815-14): > Remote I/O error (121) > [...] 
> > > > > On Thu, Jun 9, 2016 at 12:24 AM, Narayanamoorthy Srinivasan < > narayanamoort...@gmail.com> wrote: > >> Don't see any issues in network traffic. >> >> Some more logs where the XML tags are incomplete: >> >> 2016-06-09T03:06:03.096449+05:30 d18-fb-7b-18-f1-8e >> pacemaker_remoted[6153]:error: Partial >> > operation="stop" crm-debug-origin="do_update_resource" >> crm_feature_set="3.0.10" transition-key="225:116:0:8fbf >> 83fd-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;225:116: >> 0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" on_node="d00-50-56-94-24-dd" >> call-id="489" rc-code="0" op-status="0" interval="0" last-run="1459491026" >> last-rc-change="1459491026" exec-time=
Re: [ClusterLabs] pacemaker_remoted XML parse error
gt; self-fencing. >>> > Appreciate if someone throws light on what could be the issue and the >>> fix. >>> > >>> > OS - SLES 12 SP1 >>> > Pacemaker Remote version - pacemaker-remote-1.1.13-14.7.x86_64 >>> > >>> > 2016-06-08T14:11:46.009073+05:30 d18-fb-7b-18-f1-8e >>> > pacemaker_remoted[6190]:error: XML Error: Entity: line 1: parser >>> > error : AttValue: ' expected >>> > 2016-06-08T14:11:46.009314+05:30 d18-fb-7b-18-f1-8e >>> > pacemaker_remoted[6190]:error: XML Error: >>> > key="neutron-ha-tool_monitor_0" operation="monitor" >>> > crm-debug-origin="do_update_ >>> > 2016-06-08T14:11:46.009443+05:30 d18-fb-7b-18-f1-8e >>> > pacemaker_remoted[6190]:error: XML Error: >>> > ^ >>> > 2016-06-08T14:11:46.009567+05:30 d18-fb-7b-18-f1-8e >>> > pacemaker_remoted[6190]:error: XML Error: Entity: line 1: parser >>> > error : attributes construct error >>> > 2016-06-08T14:11:46.009697+05:30 d18-fb-7b-18-f1-8e >>> > pacemaker_remoted[6190]:error: XML Error: >>> > key="neutron-ha-tool_monitor_0" operation="monitor" >>> > crm-debug-origin="do_update_ >>> > 2016-06-08T14:11:46.009824+05:30 d18-fb-7b-18-f1-8e >>> > pacemaker_remoted[6190]:error: XML Error: >>> > ^ >>> > 2016-06-08T14:11:46.009948+05:30 d18-fb-7b-18-f1-8e >>> > pacemaker_remoted[6190]:error: XML Error: Entity: line 1: parser >>> > error : Couldn't find end of Start Tag lrm_rsc_op line 1 >>> > 2016-06-08T14:11:46.010070+05:30 d18-fb-7b-18-f1-8e >>> > pacemaker_remoted[6190]:error: XML Error: >>> > key="neutron-ha-tool_monitor_0" operation="monitor" >>> > crm-debug-origin="do_update_ >>> > 2016-06-08T14:11:46.010191+05:30 d18-fb-7b-18-f1-8e >>> > pacemaker_remoted[6190]:error: XML Error: >>> > ^ >>> > 2016-06-08T14:11:46.010460+05:30 d18-fb-7b-18-f1-8e >>> > pacemaker_remoted[6190]:error: XML Error: Entity: line 1: parser >>> > error : Premature end of data in tag lrm_resource line 1 >>> > 2016-06-08T14:11:46.010718+05:30 d18-fb-7b-18-f1-8e >>> > pacemaker_remoted[6190]:error: XML Error: >>> > 
key="neutron-ha-tool_monitor_0" operation="monitor" >>> > crm-debug-origin="do_update_ >>> > 2016-06-08T14:11:46.010977+05:30 d18-fb-7b-18-f1-8e >>> > pacemaker_remoted[6190]:error: XML Error: >>> > ^ >>> > 2016-06-08T14:11:46.011234+05:30 d18-fb-7b-18-f1-8e >>> > pacemaker_remoted[6190]:error: XML Error: Entity: line 1: parser >>> > error : Premature end of data in tag lrm_resources line 1 >>> > >>> > >>> > -- >>> > Thanks & Regards >>> > Moorthy >>> >>> This sounds like the network traffic between the cluster nodes and the >>> remote nodes is being corrupted. Have there been any network changes >>> lately? Switch/firewall/etc. equipment/settings? MTU? >>> >>> You could try using a packet sniffer such as wireshark to see if the >>> traffic looks abnormal in some way. The payload is XML so it should be >>> more or less readable. >>> >>> >>> ___ >>> Users mailing list: Users@clusterlabs.org >>> http://clusterlabs.org/mailman/listinfo/users >>> >>> Project Home: http://www.clusterlabs.org >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf >>> Bugs: http://bugs.clusterlabs.org >>> >> >> >> >> -- >> Thanks & Regards >> Moorthy >> > > > > -- > Thanks & Regards > Moorthy > > ___ > Users mailing list: Users@clusterlabs.org > http://clusterlabs.org/mailman/listinfo/users > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > > -- Best Regards, Radoslaw Garbacz XtremeData Incorporation ___ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Antw: Re: corosync/pacemaker on ~100 nodes cluster
Indeed, the cluster is quite sluggish when responding to events, but still acceptable for me, since the priority is to have it running with many nodes. In my case the network is quite heavily used, but the shared storage was limited. The settings that worked for the 55 nodes I tested were enough to make it run, but are not reasonable as a long-term solution (hence my post). For me, "pacemaker-remote" seems to be the way to go beyond the 16-ish "corosync" limit. On Thu, Aug 25, 2016 at 1:19 AM, Ulrich Windl < ulrich.wi...@rz.uni-regensburg.de> wrote: > Hi! > > I have two questions: > 1) TOTEM being a ring protocol will have to pass each message to every > node, one after the other, right? Wouldn't a significant delay in message > processing happen? > 2) If you use some shared storage (shared disks), how do you provide > sufficient bandwidth? I'm assuming that 99 of 100 nodes don't have an > idle/standby role in the cluster. > > Regards, > Ulrich > > >>> Radoslaw Garbacz wrote on > 24.08.2016 at > 19:49 in message > : > > Hi, > > > > Thank you for the advice. Indeed, seems like Pacemaker Remote will solve > my > > big cluster problem. > > > > With regard to your questions about my current solution, I scale corosync > > parameters based on the number of nodes, additionally modifying some of > the > > kernel network parameters. Tests I did let me select certain corosync > > settings, which work, but are possibly not the best (cluster is quite > slow > > when reacting to some quorum related events). > > > > The problem seems to be only related to cluster start, once running, any > > operations such as node lost/reconnect, agents creation/start/stop work > > well. Memory and network seem important with regard to the hardware. 
> > > > Below are settings I used for my latest test (the largest working > cluster I > > tried): > > * latest pacemaker/corosync > > * 55 c3.4xlarge nodes (amazon cloud) > > * 55 active nodes, 552 resources in a cluster > > * kernel settings: > > net.core.wmem_max=12582912 > > net.core.rmem_max=12582912 > > net.ipv4.tcp_rmem= 10240 87380 12582912 > > net.ipv4.tcp_wmem= 10240 87380 12582912 > > net.ipv4.tcp_window_scaling = 1 > > net.ipv4.tcp_timestamps = 1 > > net.ipv4.tcp_sack = 1 > > net.ipv4.tcp_no_metrics_save = 1 > > net.core.netdev_max_backlog = 5000 > > > > * corosync settings: > > token: 12000 > > consensus: 16000 > > join: 1500 > > send_join: 80 > > merge: 2000 > > downcheck: 2000 > > max_network_delay: 150 # for azure > > > > Best regards, > > > > > > On Tue, Aug 23, 2016 at 12:00 PM, Ken Gaillot > wrote: > > > >> On 08/23/2016 11:46 AM, Klaus Wenninger wrote: > >> > On 08/23/2016 06:26 PM, Radoslaw Garbacz wrote: > >> >> Hi, > >> >> > >> >> I would like to ask for settings (and hardware requirements) to have > >> >> corosync/pacemaker running on about 100 nodes cluster. > >> > Actually I had thought that 16 would be the limit for full > >> > pacemaker-cluster-nodes. > >> > For larger deployments pacemaker-remote should be the way to go. Were > >> > you speaking of a cluster with remote-nodes? > >> > > >> > Regards, > >> > Klaus > >> >> > >> >> For now some nodes get totally frozen (high CPU, high network usage), > >> >> so that even login is not possible. By manipulating > >> >> corosync/pacemaker/kernel parameters I managed to run it on ~40 nodes > >> >> cluster, but I am not sure which parameters are critical, how to make > >> >> it more responsive and how to make the number of nodes even bigger. > >> > >> 16 is a practical limit without special hardware and tuning, so that's > >> often what companies that offer support for clusters will accept. 
> >> > >> I know people have gone well higher than 16 with a lot of optimization, > >> but I think somewhere between 32 and 64 corosync can't keep up with the > >> messages. Your 40 nodes sounds about right. I'd be curious to hear what > >> you had to do (with hardware, OS tuning, and corosync tuning) to get > >> that far. > >> > >> As Klaus mentioned, Pacemaker Remote is the preferred way to go beyond > >> that currently: > >> > >> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html- > >> single/Pacemaker_Remote/index.html >
Re: [ClusterLabs] corosync/pacemaker on ~100 nodes cluser
Hi,

Thank you for the advice. Indeed, it seems like Pacemaker Remote will solve my big-cluster problem.

With regard to your questions about my current solution: I scale corosync parameters based on the number of nodes, additionally modifying some of the kernel network parameters. Tests I did let me select certain corosync settings, which work but are possibly not the best (the cluster is quite slow when reacting to some quorum-related events).

The problem seems to be related only to cluster start; once running, operations such as node lost/reconnect and agent creation/start/stop work well. Memory and network seem important with regard to the hardware.

Below are the settings I used for my latest test (the largest working cluster I tried):
* latest pacemaker/corosync
* 55 c3.4xlarge nodes (amazon cloud)
* 55 active nodes, 552 resources in a cluster
* kernel settings:
net.core.wmem_max=12582912
net.core.rmem_max=12582912
net.ipv4.tcp_rmem= 10240 87380 12582912
net.ipv4.tcp_wmem= 10240 87380 12582912
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 5000

* corosync settings:
token: 12000
consensus: 16000
join: 1500
send_join: 80
merge: 2000
downcheck: 2000
max_network_delay: 150 # for azure

Best regards,

On Tue, Aug 23, 2016 at 12:00 PM, Ken Gaillot wrote: > On 08/23/2016 11:46 AM, Klaus Wenninger wrote: > > On 08/23/2016 06:26 PM, Radoslaw Garbacz wrote: > >> Hi, > >> > >> I would like to ask for settings (and hardware requirements) to have > >> corosync/pacemaker running on about 100 nodes cluster. > > Actually I had thought that 16 would be the limit for full > > pacemaker-cluster-nodes. > > For larger deployments pacemaker-remote should be the way to go. Were > > you speaking of a cluster with remote-nodes? > > > > Regards, > > Klaus > >> > >> For now some nodes get totally frozen (high CPU, high network usage), > >> so that even login is not possible. 
By manipulating > >> corosync/pacemaker/kernel parameters I managed to run it on ~40 nodes > >> cluster, but I am not sure which parameters are critical, how to make > >> it more responsive and how to make the number of nodes even bigger. > > 16 is a practical limit without special hardware and tuning, so that's > often what companies that offer support for clusters will accept. > > I know people have gone well higher than 16 with a lot of optimization, > but I think somewhere between 32 and 64 corosync can't keep up with the > messages. Your 40 nodes sounds about right. I'd be curious to hear what > you had to do (with hardware, OS tuning, and corosync tuning) to get > that far. > > As Klaus mentioned, Pacemaker Remote is the preferred way to go beyond > that currently: > > http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html- > single/Pacemaker_Remote/index.html > > >> Thanks, > >> > >> -- > >> Best Regards, > >> > >> Radoslaw Garbacz > >> XtremeData Incorporation > > ___ > Users mailing list: Users@clusterlabs.org > http://clusterlabs.org/mailman/listinfo/users > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > -- Best Regards, Radoslaw Garbacz XtremeData Incorporation ___ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
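For readers wanting to try the poster's corosync timing values: these directives belong in the totem { } section of /etc/corosync/corosync.conf. A sketch of how the quoted numbers would look there (the values are the poster's own, tuned for a 55-node cloud cluster, not general recommendations; unrelated directives such as interface and crypto settings are omitted):

```conf
totem {
    version: 2
    token: 12000
    consensus: 16000
    join: 1500
    send_join: 80
    merge: 2000
    downcheck: 2000
    max_network_delay: 150    # the poster notes this was for Azure
}
```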
[ClusterLabs] corosync/pacemaker on ~100 nodes cluster
Hi, I would like to ask for settings (and hardware requirements) to have corosync/pacemaker running on a cluster of about 100 nodes. For now some nodes get totally frozen (high CPU, high network usage), so that even login is not possible. By manipulating corosync/pacemaker/kernel parameters I managed to run it on a ~40 node cluster, but I am not sure which parameters are critical, how to make it more responsive, and how to make the number of nodes even bigger. Thanks, -- Best Regards, Radoslaw Garbacz XtremeData Incorporation
Re: [ClusterLabs] libqb 0.17.1 - segfault at 1b8
Thank you, On Mon, May 2, 2016 at 4:05 PM, Ken Gaillot wrote: > On 05/02/2016 03:45 PM, Jan Pokorný wrote: > > Hello Radoslaw, > > > > On 02/05/16 11:47 -0500, Radoslaw Garbacz wrote: > >> When testing pacemaker I encountered a start error, which seems to be > >> related to reported libqb segmentation fault. > >> - cluster started and acquired quorum > >> - some nodes failed to connect to CIB, and lost membership as a result > >> - restart solved the problem > >> > >> Segmentation fault reports libqb library in version 0.17.1, a standard > >> package provided for CentOS.6. > > > > Chances are that you are running into this nasty bug: > > https://bugzilla.redhat.com/show_bug.cgi?id=1114852 > > > >> Please let me know if the problem is known, and if there is a remedy > (e.g. > >> using the latest libqb). > > > > Try libqb >= 0.17.2. > > > > [...] > > > >> Logs from /var/log/messages: > >> > >> Apr 22 15:46:41 (...) pacemakerd[90]: notice: Additional logging > >> available in /var/log/pacemaker.log > >> Apr 22 15:46:41 (...) pacemakerd[90]: notice: Configured corosync > to > >> accept connections from group 498: Library error (2) > > > > IIRC, that last line ^ was one of the symptoms. > > Yes, that does look like the culprit. The root cause is libqb being > unable to handle 6-digit PIDs, which we can see in the above logs -- > "[90]". > > As a workaround, you can lower /proc/sys/kernel/pid_max (aka > kernel.pid_max sysctl variable), if you don't want to install a newer > libqb before CentOS 6.8 is released, which will have the fix. 
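Ken's workaround can be checked from the shell; a minimal sketch
(read-only by default — the actual sysctl write is shown commented out,
needs root, and is only sensible until libqb >= 0.17.2 is installed):

```shell
# The libqb bug bites when PIDs grow to 6 digits, which can only
# happen once kernel.pid_max exceeds 99999.
pid_max=$(cat /proc/sys/kernel/pid_max 2>/dev/null || echo 32768)
if [ "$pid_max" -gt 99999 ]; then
    echo "pid_max=$pid_max: 6-digit PIDs possible; libqb < 0.17.2 may crash"
    # Workaround (as root): sysctl -w kernel.pid_max=99999
else
    echo "pid_max=$pid_max: 6-digit PIDs cannot occur"
fi
```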
--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
[ClusterLabs] libqb 0.17.1 - segfault at 1b8
Apr 22 15:46:41 (...) pacemakerd[90]: notice: Stopping pengine: Sent -15 to process 96
Apr 22 15:46:41 (...) pengine[96]: notice: Invoking handler for signal 15: Terminated
Apr 22 15:46:41 (...) pacemakerd[90]: notice: Stopping attrd: Sent -15 to process 98
Apr 22 15:46:41 (...) pacemakerd[90]: error: Managed process 98 (attrd) dumped core
Apr 22 15:46:41 (...) pacemakerd[90]: error: The attrd process (98) terminated with signal 11 (core=1)
Apr 22 15:46:41 (...) pacemakerd[90]: notice: Stopping lrmd: Sent -15 to process 94
Apr 22 15:46:41 (...) lrmd[94]: notice: Invoking handler for signal 15: Terminated
Apr 22 15:46:41 (...) pacemakerd[90]: notice: Stopping stonith-ng: Sent -15 to process 93
Apr 22 15:46:41 (...) kernel: [17169.121628] attrd[98]: segfault at 1b8 ip 7f3a98f66181 sp 7ffe33407380 error 4 in libqb.so.0.17.1[7f3a98f57000+21000]
Apr 22 15:46:50 (...) stonith-ng[93]: error: Could not connect to the CIB service: Transport endpoint is not connected (-107)
Apr 22 15:46:50 (...) stonith-ng[93]: notice: Invoking handler for signal 15: Terminated
Apr 22 15:46:50 (...) pacemakerd[90]: notice: Shutdown complete
Apr 22 15:46:50 (...) pacemakerd[90]: notice: Attempting to inhibit respawning after fatal error

Logs from corosync log:

Apr 22 15:46:22 [93582] (...) corosync notice [MAIN  ] Corosync Cluster Engine exiting normally
Apr 22 15:46:40 [47] (...) corosync notice [MAIN  ] Corosync Cluster Engine ('2.3.5.12-a71e'): started and ready to provide service.
Apr 22 15:46:40 [47] (...) corosync info [MAIN  ] Corosync built-in features: dbus pie relro bindnow
Apr 22 15:46:40 [47] (...) corosync notice [TOTEM ] Initializing transport (UDP/IP Unicast).
Apr 22 15:46:40 [47] (...) corosync notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
Apr 22 15:46:40 [47] (...) corosync notice [TOTEM ] The network interface [(...)] is now up.
Apr 22 15:46:40 [47] (...) corosync notice [SERV  ] Service engine loaded: corosync configuration map access [0]
Apr 22 15:46:40 [47] (...) corosync info [QB    ] server name: cmap
Apr 22 15:46:40 [47] (...) corosync notice [SERV  ] Service engine loaded: corosync configuration service [1]
Apr 22 15:46:40 [47] (...) corosync info [QB    ] server name: cfg
Apr 22 15:46:40 [47] (...) corosync notice [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Apr 22 15:46:40 [47] (...) corosync info [QB    ] server name: cpg
Apr 22 15:46:40 [47] (...) corosync notice [SERV  ] Service engine loaded: corosync profile loading service [4]
Apr 22 15:46:40 [47] (...) corosync notice [QUORUM] Using quorum provider corosync_votequorum
Apr 22 15:46:40 [47] (...) corosync notice [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Apr 22 15:46:40 [47] (...) corosync info [QB    ] server name: votequorum
Apr 22 15:46:40 [47] (...) corosync notice [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Apr 22 15:46:40 [47] (...) corosync info [QB    ] server name: quorum
Apr 22 15:46:40 [47] (...) corosync notice [TOTEM ] adding new UDPU member {(...)}
Apr 22 15:46:40 [47] (...) corosync notice [TOTEM ] adding new UDPU member {(...)}
Apr 22 15:46:40 [47] (...) corosync notice [TOTEM ] adding new UDPU member {(...)}
Apr 22 15:46:40 [47] (...) corosync notice [TOTEM ] adding new UDPU member {(...)}
Apr 22 15:46:40 [47] (...) corosync notice [TOTEM ] A new membership ((...):660) was formed. Members joined: 3
Apr 22 15:46:40 [47] (...) corosync notice [QUORUM] Members[1]: 3
Apr 22 15:46:40 [47] (...) corosync notice [MAIN  ] Completed service synchronization, ready to provide service.
Apr 22 15:46:40 [47] (...) corosync notice [TOTEM ] A new membership ((...):664) was formed. Members joined: 4 2 1
Apr 22 15:46:40 [47] (...) corosync notice [QUORUM] This node is within the primary component and will provide service.
Apr 22 15:46:40 [47] (...) corosync notice [QUORUM] Members[4]: 3 4 2 1
Apr 22 15:46:40 [47] (...) corosync notice [MAIN  ] Completed service synchronization, ready to provide service.
Apr 22 15:46:41 [47] (...) corosync error [MAIN  ] Denied connection attempt from 498:498
Apr 22 15:46:41 [47] (...) corosync error [QB    ] Invalid IPC credentials (48-95-2).
Apr 22 15:46:41 [47] (...) corosync error [MAIN  ] Denied connection attempt from 498:498
Apr 22 15:46:41 [47] (...) corosync error [QB    ] Invalid IPC credentials (48-92-2).
Apr 22 15:46:41 [47] (...) corosync error [MAIN  ] Denied connection attempt from 498:498
Apr 22 15:46:41 [47] (...) corosync error [QB    ] Invalid IPC credentials (48-98-2).

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
Re: [ClusterLabs] required nodes for quorum policy
Thank you Christine and Andrei,

I took a look at the corosync quorum policy configuration options, and
actually I would need a more conservative approach, i.e. to consider the
cluster quorate only if all the nodes are present - any node loss is a
quorum loss event for me.

At present I check this in an agent, but it would be helpful if pacemaker
took care of it for me. I know that this is not the requirement pacemaker
was designed for (i.e. it does not use the full power of this cluster
environment), but for now the application we use cannot handle any node
loss.

On Tue, Nov 10, 2015 at 2:13 AM, Christine Caulfield wrote:
> On 09/11/15 22:20, Radoslaw Garbacz wrote:
> > Hi,
> >
> > I have a question regarding the policy to check for cluster quorum for
> > corosync+pacemaker.
> >
> > As far as I know at present it is always (expected_votes)/2 + 1. Seems
> > like "qdiskd" has an option to change it, but it is not clear to me if
> > corosync 2.x supports a different quorum device.
>
> corosync 2 does not currently support any other quorum devices. But
> watch this space
>
> > What are my options if I wanted to configure a cluster with a different
> > quorum policy (compilation options are acceptable)?
>
> Have a read of the votequorum(5) man page, there are options for
> auto_tie_breaker, and maybe others, that might be useful to you.
>
> Chrissie

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
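For readers wondering where such tuning lives: a purely illustrative
corosync.conf fragment with the votequorum(5) options mentioned above
(values are placeholders; note that stock votequorum has no "all nodes
required" policy — the closest knobs only change how ties and initial
membership are treated):

```
quorum {
    provider: corosync_votequorum
    expected_votes: 3

    # Break ties deterministically in even-split partitions (by default
    # the partition containing the lowest node ID keeps quorum).
    auto_tie_breaker: 1

    # Require all nodes to have been seen at least once before the
    # cluster becomes quorate for the first time.
    wait_for_all: 1
}
```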
Re: [ClusterLabs] [Pacemaker] large cluster - failure recovery
Thank you. Indeed the latest corosync and pacemaker do work with large
clusters - some tuning is required though.

By "working" I mean also recovering after a node loss/regain, which was
the major issue before, when corosync worked (established a recovered
membership) but pacemaker was not able to sync the CIB - it still needs
some time and CPU power to do so though.

It works for me on a 34-node cluster with a few hundred resources (I
haven't tested bigger yet).

On Thu, Nov 19, 2015 at 2:43 AM, Cédric Dufour - Idiap Research Institute <
cedric.duf...@idiap.ch> wrote:

> [coming over from the old mailing list pacema...@oss.clusterlabs.org;
> sorry for any thread discrepancy]
>
> Hello,
>
> We've also set up a fairly large cluster - 24 nodes / 348 resources
> (pacemaker 1.1.12, corosync 1.4.7) - and pacemaker 1.1.12 is definitely
> the minimum version you'll want, thanks to changes in how the CIB is
> handled.
>
> If you're going to handle a large number (~several hundreds) of resources,
> you may need to concern yourself with the CIB size as well.
> You may want to have a look at pp. 17-18 of the document I wrote to
> describe our setup: http://cedric.dufour.name/cv/download/idiap_havc2.pdf
>
> Currently, I would consider that with 24 nodes / 348 resources, we are
> close to the limit of what our cluster can handle, the bottleneck being
> CPU (core) power for CIB/CRM handling. Our "worst performing nodes" (out
> of the 24 in the cluster) are Xeon E7-2830 @ 2.13GHz.
> The main issue we currently face is when a DC is taken out and a new one
> must be elected: CPU goes 100% for several tens of seconds (even minutes),
> during which the cluster is totally unresponsive.
> Fortunately, resources themselves just sit tight and remain available (I
> can't say about those which would need to be migrated because of being
> collocated with the DC; we manually avoid that situation when performing
> maintenance that may affect the DC).
>
> I'm looking forward to migrating to corosync 2+ (there are some backports
> available for Debian/Jessie) and seeing if this would allow us to push the
> limit further. Unfortunately, I can't say for sure, as I have only a
> limited understanding of how Pacemaker/Corosync work and where CPU is
> bound to become a bottleneck.
>
> [UPDATE] Thanks Ken for the Pacemaker Remote pointer; I'm heading on to
> have a look at that
>
> 'Hope it can help,
>
> Cédric
>
> On 04/11/15 23:26, Radoslaw Garbacz wrote:
>
> Thank you, will give it a try.
>
> On Wed, Nov 4, 2015 at 12:50 PM, Trevor Hemsley
> wrote:
>
>> On 04/11/15 18:41, Radoslaw Garbacz wrote:
>> > Details:
>> > OS: CentOS 6
>> > Pacemaker: Pacemaker 1.1.9-1512.el6
>> > Corosync: Corosync Cluster Engine, version '2.3.2'
>>
>> yum update
>>
>> Pacemaker is currently 1.1.12 and corosync 1.4.7 on CentOS 6. There were
>> major improvements in speed with later versions of pacemaker.
>>
>> Trevor

--
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
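Cédric's point about CIB size can be eyeballed with a one-liner; a minimal
sketch (guarded so it degrades gracefully on hosts without the pacemaker
CLI tools installed):

```shell
# Dump the live CIB and report its size in bytes; CIBs in the hundreds
# of kilobytes make DC elections and resyncs noticeably expensive.
if command -v cibadmin >/dev/null 2>&1; then
    cibadmin --query | wc -c
else
    echo "cibadmin not found (pacemaker CLI tools not installed)"
fi
```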
[ClusterLabs] required nodes for quorum policy
Hi,

I have a question regarding the policy to check for cluster quorum for
corosync+pacemaker.

As far as I know at present it is always (expected_votes)/2 + 1. Seems
like "qdiskd" has an option to change it, but it is not clear to me if
corosync 2.x supports a different quorum device.

What are my options if I wanted to configure a cluster with a different
quorum policy (compilation options are acceptable)?

Thanks in advance,

--
Best Regards,

Radoslaw Garbacz
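The majority rule described above is simple enough to tabulate; a small
illustrative script (the floor-division formula is the standard votequorum
majority calculation, not code from this thread):

```python
def majority_quorum(expected_votes: int) -> int:
    """Votes needed for quorum under the default majority rule:
    floor(expected_votes / 2) + 1."""
    return expected_votes // 2 + 1

for n in (2, 3, 16, 32, 100):
    print(f"{n} votes -> quorum at {majority_quorum(n)}")
```

Note that a 2-node cluster needs both votes under this rule, which is why
votequorum provides special handling (two_node) for that case.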
Re: [ClusterLabs] large cluster - failure recovery
Thank you Ken and Digimer for all your suggestions. On Wed, Nov 4, 2015 at 2:32 PM, Ken Gaillot wrote: > On 11/04/2015 12:55 PM, Digimer wrote: > > On 04/11/15 01:50 PM, Radoslaw Garbacz wrote: > >> Hi, > >> > >> I have a cluster of 32 nodes, and after some tuning was able to have it > >> started and running, > > > > This is not supported by RH for a reasons; it's hard to get the timing > > right. SUSE supports up to 32 nodes, but they must be doing some serious > > magic behind the scenes. > > > > I would *strongly* recommend dividing this up into a few smaller > > clusters... 8 nodes per cluster would be max I'd feel comfortable with. > > You need your cluster to solve more problems than it causes... > > Hi Radoslaw, > > RH supports up to 16. 32 should be possible with recent > pacemaker+corosync versions and careful tuning, but it's definitely > leading-edge. > > An alternative with pacemaker 1.1.10+ (1.1.12+ recommended) is Pacemaker > Remote, which easily scales to dozens of nodes: > > http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html > > Pacemaker Remote is a really good approach once you start pushing the > limits of cluster nodes. Probably better than trying to get corosync to > handle more nodes. (There are long-term plans for improving corosync's > scalability, but that doesn't help you now.) > > >> but it does not recover from a node disconnect-connect failure. > >> It regains quorum, but CIB does not recover to a synchronized state and > >> "cibadmin -Q" times out. > >> > >> Is there anything with corosync or pacemaker parameters I can do to make > >> it recover from such a situation > >> (everything works for smaller clusters). > >> > >> In my case it is OK for a node to disconnect (all the major resources > >> are shutdown) > >> and later reconnect the cluster (the running monitoring agent will > >> cleanup and restart major resources if needed), > >> so I do not have STONITH configured. 
> >> > >> Details: > >> OS: CentOS 6 > >> Pacemaker: Pacemaker 1.1.9-1512.el6 > > > > Upgrade. > > If you can upgrade to the latest CentOS 6.7, you can get a much newer > Pacemaker. But Pacemaker is probably not limiting your cluster nodes; > the newer version's main benefit would be Pacemaker Remote support. (Of > course there are plenty of bug fixes and new features as well.) > > >> Corosync: Corosync Cluster Engine, version '2.3.2' > > > > This is not supported on EL6 at all. Please stick with corosync 1.4 and > > use the cman pluging as the quorum provider. > > CentOS is self-supported anyway, so if you're willing to handle your own > upgrades and such, nothing wrong with compiling. But corosync is up to > 2.3.5 so you're already behind. :) I'd recommend compiling libqb 0.17.2 > if you're compiling recent corosync and/or pacemaker. > > Alternatively, CentOS 7 will have recent versions of everything. > > >> Corosync configuration: > >> token: 1 > >> #token_retransmits_before_loss_const: 10 > >> consensus: 15000 > >> join: 1000 > >> send_join: 80 > >> merge: 1000 > >> downcheck: 2000 > >> #rrp_problem_count_timeout: 5000 > >> max_network_delay: 150 # for azure > >> > >> > >> Some logs: > >> > >> [...] 
> >> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: > >> cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local not > >> applied to 1.9275.1: current "epoch" is greater than required > >> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: > >> update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application > >> of an update diff failed (-1006) > >> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: > >> cib_process_diff: Diff 1.9255.1 -> 1.9256.1 from local not > >> applied to 1.9275.1: current "epoch" is greater than required > >> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: > >> update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application > >> of an update diff failed (-1006) > >> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: > >> cib_process_diff: Diff 1.9256.1 -> 1.9257.1 from local not > >> applied to 1.9275.1: current "epoch" is
[ClusterLabs] large cluster - failure recovery
Hi,

I have a cluster of 32 nodes, and after some tuning was able to have it
started and running, but it does not recover from a node
disconnect-connect failure. It regains quorum, but the CIB does not
recover to a synchronized state and "cibadmin -Q" times out.

Is there anything with corosync or pacemaker parameters I can do to make
it recover from such a situation (everything works for smaller clusters)?

In my case it is OK for a node to disconnect (all the major resources are
shut down) and later reconnect the cluster (the running monitoring agent
will clean up and restart major resources if needed), so I do not have
STONITH configured.

Details:
OS: CentOS 6
Pacemaker: Pacemaker 1.1.9-1512.el6
Corosync: Corosync Cluster Engine, version '2.3.2'

Corosync configuration:
    token: 1
    #token_retransmits_before_loss_const: 10
    consensus: 15000
    join: 1000
    send_join: 80
    merge: 1000
    downcheck: 2000
    #rrp_problem_count_timeout: 5000
    max_network_delay: 150 # for azure

Some logs:

[...]
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local not applied to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: cib_process_diff: Diff 1.9255.1 -> 1.9256.1 from local not applied to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: cib_process_diff: Diff 1.9256.1 -> 1.9257.1 from local not applied to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: cib_process_diff: Diff 1.9257.1 -> 1.9258.1 from local not applied to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-1006)
[...]

[...]
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error: cib_native_perform_op_delegate: Couldn't perform cib_query operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error: get_cib_copy: Couldnt retrieve the CIB
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error: cib_native_perform_op_delegate: Couldn't perform cib_query operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175 crm_mon: error: get_cib_copy: Couldnt retrieve the CIB
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM] Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [QUORUM] Members[32]: 14 20 31 30 8 25 18 7 4
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice [MAIN  ] Completed service synchronization, ready to provide service.
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM] Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice [QUORUM] Members[32]: 14 20 31 30 8 25 18 7 4
[...]

[...]
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: info: apply_xml_diff: Digest mis-match: expected 01192e5118739b7c33c23f7645da3f45, calculated f8028c0c98526179ea5df0a2ba0d09de
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: warning: cib_process_diff: Diff 1.15046.2 -> 1.15046.3 from local not applied to 1.15046.2: Failed application of an update diff
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice: update_cib_cache_cb: [cib_diff_notify] Patch aborted: Application of an update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: notice: cib_process_diff: Diff 1.15046.2 -> 1.15046.3 from local not applied to 1.15046.3: current "num_updates" is greater than required
[...]

ps. Sorry if this should have been posted on the corosync newsgroup; it is
just the CIB synchronization that fails, so this group seemed to me the
right place.

--
Best Regards,

Radoslaw Garbacz