Re: [ClusterLabs] CIB: op-status=4 ?

2017-05-23 Thread Radoslaw Garbacz
Thanks, your explanation is very helpful considering that it happens rarely
and only on the first boot after VMs are created.

On Mon, May 22, 2017 at 9:34 PM, Ken Gaillot  wrote:

> On 05/19/2017 02:03 PM, Radoslaw Garbacz wrote:
> > Hi,
> >
> > I have some more information regarding this issue (pacemaker debug logs).
> >
> > Firstly, I have not mentioned some probably important facts:
> > 1) this happens rarely
> > 2) this happens only on first boot
> > 3) turning on debug in corosync/pacemaker significantly reduced the
> > frequency of this happening, i.e. without debug on every ~7th cluster
> > creation, with debug on every ~66th cluster creation.
> >
> > This is a 3-node cluster on Azure Cloud, and it does not seem like the
> > resource agent is reporting an error, because all nodes log proper "not
> > running" results:
> >
> > The resource in question is named "dbx_head_head".
> >
> > node1)
> > May 19 13:15:41 [6872] olegdbx39-vm-0 stonith-ng:debug:
> > xml_patch_version_check:Can apply patch 2.5.32 to 2.5.31
> > head.ocf.sh(dbx_head_head)[7717]:
> > 2017/05/19_13:15:42 DEBUG: head_monitor: return 7
> > May 19 13:15:42 [6873] olegdbx39-vm-0   lrmd:debug:
> > operation_finished:dbx_head_head_monitor_0:7717 - exited with rc=7
> > May 19 13:15:42 [6873] olegdbx39-vm-0   lrmd:debug:
> > operation_finished:dbx_head_head_monitor_0:7717:stderr [ -- empty
> -- ]
> > May 19 13:15:42 [6873] olegdbx39-vm-0   lrmd:debug:
> > operation_finished:dbx_head_head_monitor_0:7717:stdout [ -- empty
> -- ]
> > May 19 13:15:42 [6873] olegdbx39-vm-0   lrmd:debug:
> > log_finished:finished - rsc:dbx_head_head action:monitor call_id:14
> > pid:7717 exit-code:7 exec-time:932ms queue-time:0ms
> >
> >
> > node2)
> > May 19 13:15:41 [6266] olegdbx39-vm02 stonith-ng:debug:
> > xml_patch_version_check:Can apply patch 2.5.31 to 2.5.30
> > head.ocf.sh(dbx_head_head)[6485]:
> > 2017/05/19_13:15:41 DEBUG: head_monitor: return 7
> > May 19 13:15:41 [6267] olegdbx39-vm02   lrmd:debug:
> > operation_finished:dbx_head_head_monitor_0:6485 - exited with rc=7
> > May 19 13:15:41 [6267] olegdbx39-vm02   lrmd:debug:
> > operation_finished:dbx_head_head_monitor_0:6485:stderr [ -- empty
> -- ]
> > May 19 13:15:41 [6267] olegdbx39-vm02   lrmd:debug:
> > operation_finished:dbx_head_head_monitor_0:6485:stdout [ -- empty
> -- ]
> > May 19 13:15:41 [6267] olegdbx39-vm02   lrmd:debug:
> > log_finished:finished - rsc:dbx_head_head action:monitor call_id:14
> > pid:6485 exit-code:7 exec-time:790ms queue-time:0ms
> > May 19 13:15:41 [6266] olegdbx39-vm02 stonith-ng:debug:
> > xml_patch_version_check:Can apply patch 2.5.32 to 2.5.31
> > May 19 13:15:41 [6266] olegdbx39-vm02 stonith-ng:debug:
> > xml_patch_version_check:Can apply patch 2.5.33 to 2.5.32
> >
> >
> > node3)
> > == the logs here are different - there is no probing, just a stop attempt
> > (with a proper exit code) ==
> >
> > == reporting a non-existent resource ==
> >
> > May 19 13:15:29 [6293] olegdbx39-vm03   lrmd:debug:
> > process_lrmd_message:Processed lrmd_rsc_info operation from
> > d2c8a871-410a-4006-be52-ee684c0a5f38: rc=0, reply=0, notify=0
> > May 19 13:15:29 [6293] olegdbx39-vm03   lrmd:debug:
> > process_lrmd_message:Processed lrmd_rsc_exec operation from
> > d2c8a871-410a-4006-be52-ee684c0a5f38: rc=10, reply=1, notify=0
> > May 19 13:15:29 [6293] olegdbx39-vm03   lrmd:debug:
> > log_execute:executing - rsc:dbx_first_datas action:monitor call_id:10
> > May 19 13:15:29 [6293] olegdbx39-vm03   lrmd: info:
> > process_lrmd_get_rsc_info:Resource 'dbx_head_head' not found (2
> > active resources)
>
> FYI, this is normal. It just means the lrmd hasn't been asked to do
> anything with this resource before, so it's not found in the lrmd's memory.
>
> > May 19 13:15:29 [6293] olegdbx39-vm03   lrmd:debug:
> > process_lrmd_message:Processed lrmd_rsc_info operation from
> > d2c8a871-410a-4006-be52-ee684c0a5f38: rc=0, reply=0, notify=0
> > May 19 13:15:29 [6293] olegdbx39-vm03   lrmd: info:
> > process_lrmd_rsc_register:Added 'dbx_head_head' to the rsc list (3
> > active resources)
> > May 19 13:15:40 [6293] olegdbx39-vm03   lrmd:debug:
> > process_lrmd_message:   

Re: [ClusterLabs] CIB: op-status=4 ?

2017-05-19 Thread Radoslaw Garbacz
r for
dbx_mounts_nodes:0 on olegdbx39-vm03: unknown (189)
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_nfs_mounts_datas on
olegdbx39-vm03 to dbx_nfs_mounts_datas:0
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
determine_op_status:dbx_nfs_mounts_datas_monitor_0 on
olegdbx39-vm03 returned 'unknown' (189) instead of the expected value:
'not running' (7)
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:  warning:
unpack_rsc_op_failure:Processing failed op monitor for
dbx_nfs_mounts_datas:0 on olegdbx39-vm03: unknown (189)
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_ready_primary on
olegdbx39-vm03 to dbx_ready_primary:0
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_first_datas on
olegdbx39-vm-0 to dbx_first_datas:1
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_swap_nodes on
olegdbx39-vm-0 to dbx_swap_nodes:0
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_mounts_nodes on
olegdbx39-vm-0 to dbx_mounts_nodes:1
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_bind_mounts_nodes on
olegdbx39-vm-0 to dbx_bind_mounts_nodes:1
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_nfs_mounts_datas on
olegdbx39-vm-0 to dbx_nfs_mounts_datas:0
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_nfs_nodes on olegdbx39-vm-0
to dbx_nfs_nodes:0
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_ready_primary on
olegdbx39-vm-0 to dbx_ready_primary:0
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_first_datas on
olegdbx39-vm02 to dbx_first_datas:1
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_swap_nodes on
olegdbx39-vm02 to dbx_swap_nodes:0
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_nfs_mounts_datas on
olegdbx39-vm02 to dbx_nfs_mounts_datas:0
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_mounts_nodes on
olegdbx39-vm02 to dbx_mounts_nodes:1
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_bind_mounts_nodes on
olegdbx39-vm02 to dbx_bind_mounts_nodes:1
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_nfs_nodes on
olegdbx39-vm02 to dbx_nfs_nodes:0
May 19 13:15:42 [8114] olegdbx39-vm-0crm_mon:debug:
find_anonymous_clone:Internally renamed dbx_ready_primary on
olegdbx39-vm02 to dbx_ready_primary:0
[...]


Thanks in advance,


On Thu, May 18, 2017 at 4:37 PM, Ken Gaillot  wrote:

> On 05/17/2017 06:10 PM, Radoslaw Garbacz wrote:
> > Hi,
> >
> > I have a question regarding the 'op-status'
> > attribute getting value 4.
> >
> > In my case I see strange behavior, where resources get these "monitor"
> > operation entries in the CIB with op-status=4, and they do not seem to
> > have been called (exec-time=0).
> >
> > What does 'op-status' = 4 mean?
>
> The action had an error status
>
> >
> > I would appreciate some elaboration regarding this, since this is
> > interpreted by pacemaker as an error, which causes logs:
> > crm_mon:error: unpack_rsc_op:Preventing dbx_head_head from
> > re-starting anywhere: operation monitor failed 'not configured' (6)
>
> The rc-code="6" is the more interesting number; it's the result returned
> by the resource agent. As you can see above, it means "not configured".
> What that means exactly is up to the resource agent's interpretation.
>
> > and I am pretty sure the resource agent was not called (no logs,
> > exec-time=0)
>
> Normally this could only come from the resource agent.
>
> However there are two cases where pacemaker generates this error itself:
> if the resource definition in the CIB is invalid; and if your version of
> pacemaker was compiled with support for reading sensitive parameter
> values from a file, but that file could not be read.
>
> It doesn't sound like your case is either one of those though, since
> they would prevent the resource from even starting. Most likely it's
> coming from the resource agent. I'd look at the resource agent source
> code and see where it can return OCF_ERR_CONFIGURED.

Re: [ClusterLabs] CIB: op-status=4 ?

2017-05-18 Thread Radoslaw Garbacz
Thanks,

On Thu, May 18, 2017 at 4:37 PM, Ken Gaillot  wrote:

> On 05/17/2017 06:10 PM, Radoslaw Garbacz wrote:
> > Hi,
> >
> > I have a question regarding the 'op-status'
> > attribute getting value 4.
> >
> > In my case I see strange behavior, where resources get these "monitor"
> > operation entries in the CIB with op-status=4, and they do not seem to
> > have been called (exec-time=0).
> >
> > What does 'op-status' = 4 mean?
>
> The action had an error status
>
> >
> > I would appreciate some elaboration regarding this, since this is
> > interpreted by pacemaker as an error, which causes logs:
> > crm_mon:error: unpack_rsc_op:Preventing dbx_head_head from
> > re-starting anywhere: operation monitor failed 'not configured' (6)
>
> The rc-code="6" is the more interesting number; it's the result returned
> by the resource agent. As you can see above, it means "not configured".
> What that means exactly is up to the resource agent's interpretation.
>
> > and I am pretty sure the resource agent was not called (no logs,
> > exec-time=0)
>
> Normally this could only come from the resource agent.
>
> However there are two cases where pacemaker generates this error itself:
> if the resource definition in the CIB is invalid; and if your version of
> pacemaker was compiled with support for reading sensitive parameter
> values from a file, but that file could not be read.
>
> It doesn't sound like your case is either one of those though, since
> they would prevent the resource from even starting. Most likely it's
> coming from the resource agent. I'd look at the resource agent source
> code and see where it can return OCF_ERR_CONFIGURED.
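
As a rough illustration of the kind of place to look (a minimal sketch, not
the actual head.ocf.sh agent - the parameter name and pid file below are made
up):

  # OCF exit codes as defined by the OCF spec
  OCF_SUCCESS=0; OCF_ERR_CONFIGURED=6; OCF_NOT_RUNNING=7

  head_monitor() {
      # "not configured" (6): a mandatory parameter is missing or invalid,
      # which tells pacemaker the resource cannot run anywhere
      if [ -z "$OCF_RESKEY_required_param" ]; then
          return $OCF_ERR_CONFIGURED
      fi
      # "not running" (7): the normal probe result before the first start
      if [ ! -f /var/run/head.pid ] || \
         ! kill -0 "$(cat /var/run/head.pid)" 2>/dev/null; then
          return $OCF_NOT_RUNNING
      fi
      return $OCF_SUCCESS
  }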
>
> > There are two aspects of this:
> >
> > 1) a harmless one (pacemaker does not seem to be bothered by it), which I guess
> > indicates cancelled monitoring operations:
> > op-status=4, rc-code=189
>
> This error means the connection between the crmd and lrmd daemons was
> lost -- most commonly, that shows up for operations that were pending at
> shutdown.
>
> >
> > * Example:
> >  > operation_key="dbx_first_datas_monitor_0" operation="monitor"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.12"
> > transition-key="38:0:7:c8b63d9d-9c70-4f99-aa1b-e993de6e4739"
> > transition-magic="4:189;38:0:7:c8b63d9d-9c70-4f99-aa1b-e993de6e4739"
> > on_node="olegdbx61-vm01" call-id="10" rc-code="189" op-status="4"
> > interval="0" last-run="1495057378" last-rc-change="1495057378"
> > exec-time="0" queue-time="0" op-digest="f6bd1386a336e8e6ee25ecb651a9ef
> b6"/>
> >
> >
> > 2) an error-level one (op-status=4, rc-code=6), which generates logs:
> > crm_mon:error: unpack_rsc_op:Preventing dbx_head_head from
> > re-starting anywhere: operation monitor failed 'not configured' (6)
> >
> > * Example:
> >  > operation_key="dbx_head_head_monitor_0" operation="monitor"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.12"
> > transition-key="39:0:7:c8b63d9d-9c70-4f99-aa1b-e993de6e4739"
> > transition-magic="4:6;39:0:7:c8b63d9d-9c70-4f99-aa1b-e993de6e4739"
> > on_node="olegdbx61-vm01" call-id="9" rc-code="6"
> > op-status="4" interval="0" last-run="1495057389"
> > last-rc-change="1495057389" exec-time="0" queue-time="0"
> > op-digest="60cdc9db1c5b77e8dba698d3d0c8cda8"/>
> >
> >
> > Could it be some hardware (VM hypervisor) issue?
> >
> >
> > Thanks in advance,
> >
> > --
> > Best Regards,
> >
> > Radoslaw Garbacz
> > XtremeData Incorporated
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] CIB: op-status=4 ?

2017-05-17 Thread Radoslaw Garbacz
Hi,

I have a question regarding the 'op-status'
attribute getting value 4.

In my case I see strange behavior, where resources get these "monitor"
operation entries in the CIB with op-status=4, and they do not seem to
have been called (exec-time=0).

What does 'op-status' = 4 mean?

I would appreciate some elaboration regarding this, since this is
interpreted by pacemaker as an error, which causes logs:
crm_mon:error: unpack_rsc_op:Preventing dbx_head_head from
re-starting anywhere: operation monitor failed 'not configured' (6)

and I am pretty sure the resource agent was not called (no logs,
exec-time=0)

There are two aspects of this:

1) a harmless one (pacemaker does not seem to be bothered by it), which I guess
indicates cancelled monitoring operations:
op-status=4, rc-code=189

* Example:



2) an error-level one (op-status=4, rc-code=6), which generates logs:
crm_mon:error: unpack_rsc_op:Preventing dbx_head_head from
re-starting anywhere: operation monitor failed 'not configured' (6)

* Example:



Could it be some hardware (VM hypervisor) issue?


Thanks in advance,

-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] pacemaker daemon shutdown time with lost remote node

2017-04-28 Thread Radoslaw Garbacz
Hi,

I have a question regarding pacemaker daemon shutdown
procedure/configuration.

In my case, when a remote node is lost, pacemaker needs exactly 10 minutes to
shut down, during which nothing is logged.
So my questions:
1. What is pacemaker doing during this time?
2. How can I make it shorter?


Changed Pacemaker Configuration:
- cluster-delay
- dc-deadtime


Pacemaker Logs:
Apr 28 17:38:08 [17689] ip-10-41-177-183 pacemakerd:   notice:
crm_signal_dispatch: Caught 'Terminated' signal | 15 (invoking handler)
Apr 28 17:38:08 [17689] ip-10-41-177-183 pacemakerd:   notice:
pcmk_shutdown_worker:Shutting down Pacemaker
Apr 28 17:38:08 [17689] ip-10-41-177-183 pacemakerd:   notice:
stop_child:  Stopping crmd | sent signal 15 to process 17698
Apr 28 17:48:07 [17695] ip-10-41-177-183   lrmd: info:
cancel_recurring_action: Cancelling ocf operation
monitor_head_monitor_191000
Apr 28 17:48:07 [17695] ip-10-41-177-183   lrmd: info:
log_execute: executing - rsc:monitor_head action:stop call_id:130
[...]
Apr 28 17:48:07 [17689] ip-10-41-177-183 pacemakerd: info: main:
Exiting pacemakerd
Apr 28 17:48:07 [17689] ip-10-41-177-183 pacemakerd: info:
crm_xml_cleanup: Cleaning up memory from libxml2


Pacemaker built from github: 1.16


Help greatly appreciated.

-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] nodes ID assignment issue

2017-04-13 Thread Radoslaw Garbacz
Hi,

I have a question regarding building the CIB nodes scope and specifically the
assignment of node IDs.
It seems like the preexisting scope is not honored and nodes can get
replaced based on check-in order.

I pre-create the nodes scope because it is faster than setting parameters
for all the nodes later (when the number of nodes is large).

From the listings below, one can see that the node with ID=1 was replaced with
another node (uname), but not its options. This situation causes
problems when resource assignment is based on rules involving node options.

Is there a way to prevent this rearrangement of 'uname'? If not, is there a
way to make the options follow 'uname'? Or maybe the problem is somewhere
else - in the corosync configuration perhaps?
Is the corosync 'nodeid' enforced to also be the CIB node 'id'?
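
For reference, the kind of explicit pinning being asked about can be sketched
in corosync.conf (addresses, names and IDs below are placeholders, not the
actual configuration); whether pacemaker then reuses these IDs as the CIB
node 'id' is exactly the question:

  nodelist {
      node {
          ring0_addr: 10.0.0.11
          name: node-01
          nodeid: 1
      }
      node {
          ring0_addr: 10.0.0.12
          name: node-02
          nodeid: 2
      }
  }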


Thanks in advance,


Below is CIB committed before nodes check-in:


  

  
  
  

  
  

  
  
  

  
  

  
  
  

  
  

  
  
  

  
  

  
  
  

  




And automatic changes after nodes check-in:


  

  
  
  

  
  

  
  
  

  
  

  
  
  

  
  

  
  
  

  
  

  
  
  

  




-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] cloned resources ordering and remote nodes problem

2017-04-13 Thread Radoslaw Garbacz
Thank you, however in my case this parameter does not change the described
behavior.

I have a more detailed example:
order: res_A-clone -> res_B-clone -> res_C
When "res_C" is not on the node on which the "res_A" instance failed, it will
not be restarted; only all instances of "res_A" and "res_B" will be.

I implemented a workaround by modifying "res_C": I made it cloned as well, and
now it is restarted.


My Pacemaker 1.1.16-1.el6
System: CentOS 6

Regards,


​
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] cloned resources ordering and remote nodes problem

2017-04-06 Thread Radoslaw Garbacz
Hi,


I have a question regarding resources order settings.

Having cloned resources "res_1-clone" and "res_2-clone",
and the defined order: first "res_1-clone" then "res_2-clone".

When I have a monitoring failure on a remote node with "res_1" (an instance
of "res_1-clone"), which causes all dependent resources to be restarted,
only the instances on this remote node are restarted, not the ones on
other nodes.

Is this intentional behavior, and if so, is there a way to make all
instances of the cloned resource restart in such a case?
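
For reference, one knob that is often relevant to clone-wide versus per-node
ordering is the clone meta-attribute 'interleave'; a sketch of setting it with
crm_resource, assuming the dependent clone is "res_2-clone" (the follow-up
above notes that the suggested parameter did not change this behavior in my
tests):

  crm_resource --resource res_2-clone --meta \
      --set-parameter interleave --parameter-value false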

I can provide more details regarding the CIB configuration when needed.

Pacemaker 1.1.16-1.el6
OS: Linux CentOS 6


Thanks in advance,

-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] cloned resource not deployed on all matching nodes

2017-03-28 Thread Radoslaw Garbacz
Thanks,

On Tue, Mar 28, 2017 at 2:37 PM, Ken Gaillot  wrote:

> On 03/28/2017 01:26 PM, Radoslaw Garbacz wrote:
> > Hi,
> >
> > I have a situation when a cloned resource is being deployed only on some
> > of the nodes, even though this resource is similar to others, which are
> > being deployed according to location rules properly.
> >
> > Please take a look at the configuration below and let me know if there
> > is anything to do to make the resource "dbx_nfs_mounts_datas" (which is
> > a primitive of "dbx_nfs_mounts_datas-clone") being deployed on all 4
> > nodes matching its location rules.
>
> Look in your logs for "pengine:" messages. They will list the decisions
> made about where to start resources, then have a message about
> "Calculated transition ... saving inputs in ..." with a file name.
>
> You can run crm_simulate on that file to see why the decisions were
> made. The output is somewhat difficult to follow, but "crm_simulate -Ssx
> $FILENAME" will show every score that went into the decision.
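
As a concrete (hypothetical) example of that, assuming the log names a file
such as /var/lib/pacemaker/pengine/pe-input-42.bz2 (path and number are
placeholders):

  crm_simulate -Ssx /var/lib/pacemaker/pengine/pe-input-42.bz2 \
      | grep dbx_nfs_mounts_datas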
>
> >
> >
> > Thanks in advance,
> >
> >
> >
> > * Configuration:
> > ** Nodes:
> > 
> >   
> > 
> >   
> >   
> >   
> > 
> >   
> >   
> > 
> >   
> >   
> >   
> > 
> >   
> >   
> > 
> >   
> >   
> >   
> > 
> >   
> >   
> > 
> >   
> >   
> >   
> > 
> >   
> >   
> > 
> >   
> >   
> >   
> > 
> >   
> > 
> >
> >
> >
> > ** Resource in question:
> >   
> > http://dbx_mounts.ocf.sh>" class="ocf" provider="dbxcl">
> >> id="dbx_nfs_mounts_datas-instance_attributes">
> >  ...
> >   
> >   
> >  ...
> >   
> > 
> > 
> >> id="dbx_nfs_mounts_datas-meta_attributes-target-role"/>
> >> id="dbx_nfs_mounts_datas-meta_attributes-clone-max"/>
> > 
> >   
> >
> >
> >
> > ** Resource location
> >> rsc="dbx_nfs_mounts_datas">
> >  > id="on_nodes_dbx_nfs_mounts_datas-INFINITY" boolean-op="and">
> >> id="on_nodes_dbx_nfs_mounts_datas-INFINITY-0-expr" value="Active"/>
> >> id="on_nodes_dbx_nfs_mounts_datas-INFINITY-1-expr" value="AD"/>
> > 
> >  > id="on_nodes_dbx_nfs_mounts_datas--INFINITY" boolean-op="or">
> >> id="on_nodes_dbx_nfs_mounts_datas--INFINITY-0-expr" value="Active"/>
> >> id="on_nodes_dbx_nfs_mounts_datas--INFINITY-1-expr" value="AD"/>
> > 
> >   
> >
> >
> >
> > ** Status on properly deployed node:
> >> type="dbx_mounts.ocf.sh <http://dbx_mounts.ocf.sh>" class="ocf"
> > provider="dbxcl">
> >  > operation_key="dbx_nfs_mounts_datas_start_0" operation="start"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.12"
> > transition-key="156:0:0:d817e2a2-50fb-4462-bd6b-118d1d7b8ecd"
> > transition-magic="0:0;156:0:0:d817e2a2-50fb-4462-bd6b-118d1d7b8ecd"
> > on_node="ip-10-180-227-53" call-id="85" rc-code="0" op-status="0"
> > interval="0" last-run="1490720995" last-rc-change="1490720995"
> > exec-time="733" queue-time="0"
> > op-digest="e95785e3e2d043b0bda24c5bd4655317" op-force-restart=""
> > op-restart-digest="f2317cad3d54cec5d7d7aa7d0bf35cf8"/>
> >  > operation_key="dbx_nfs_mounts_datas_monitor_137000" operation="monitor"
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.12"
> > transition-key="157:0:0:d817e2a2-50fb-4462-bd6b-118d1d7b8ecd"
> > transition-magic="0:0;157:0:0:d817e2a2-50fb-4462-bd6b-118d1d7b8ecd"
> > on_node="ip-10-

[ClusterLabs] cloned resource not deployed on all matching nodes

2017-03-28 Thread Radoslaw Garbacz
Hi,

I have a situation where a cloned resource is deployed on only some of
the nodes, even though this resource is similar to others, which are
deployed properly according to their location rules.

Please take a look at the configuration below and let me know if there is
anything I can do to make the resource "dbx_nfs_mounts_datas" (which is a
primitive of "dbx_nfs_mounts_datas-clone") be deployed on all 4 nodes
matching its location rules.


Thanks in advance,



* Configuration:
** Nodes:

  

  
  
  

  
  

  
  
  

  
  

  
  
  

  
  

  
  
  

  
  

  
  
  

  




** Resource in question:
  

  
 ...
  
  
 ...
  


  
  

  



** Resource location
  

  
  


  
  

  



** Status on properly deployed node:
  


  



** Status on not properly deployed node:
  

  



-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: CIB configuration: role with many expressions - error 203

2017-03-22 Thread Radoslaw Garbacz
Thanks, just found that out as well.

On Wed, Mar 22, 2017 at 9:39 AM, Ken Gaillot  wrote:

> On 03/22/2017 09:26 AM, Radoslaw Garbacz wrote:
> > I have tried also as 'boolean_op', sorry did not mention this in the
> > original post (just as a remark the documentation for pacemaker has both
> > forms).
>
> *smacks forehead*
>
> Yep, the documentation needs to be fixed. You were right the first time,
> it's "boolean-op" with a dash.
>
> Looking at your example again, I think the problem is that you're using
> the same ID for both expressions. The ID must be unique.
>
> >
> > To make it work I have to remove additional "" and leave
> > only one.
> >
> > To summarize:
> > - having no "boolean..." attribute and a single "expression" - works
> > - having "boolean-op" and a single "expression" - works
> >
> > - having "boolean_op" and a single "expression" - does not work
> > - having either "boolean-op" or "boolean_op" or no such phrase at all
> > with more than one "expression" - does not work
> >
> >
> >
> > I have found the reason: expressions IDs within a rule is the same, once
> > I made it unique it works.
> >
> >
> > Thanks,
> >
> >
> > On Wed, Mar 22, 2017 at 2:06 AM, Ulrich Windl
> >  > <mailto:ulrich.wi...@rz.uni-regensburg.de>> wrote:
> >
> > >>> Ken Gaillot mailto:kgail...@redhat.com>>
> > schrieb am 22.03.2017 um 00:18 in Nachricht
> > <94b7e5fd-cb65-4775-71df-ca8983629...@redhat.com
> > <mailto:94b7e5fd-cb65-4775-71df-ca8983629...@redhat.com>>:
> > > On 03/21/2017 11:20 AM, Radoslaw Garbacz wrote:
> > >> Hi,
> > >>
> > >> I have a problem when creating rules with many expressions:
> > >>
> > >>  
> > >>  > >> boolean-op="and">
> > >>type="string"
> > >> id="on_nodes_dbx_first_head-expr" value="Active"/>
> > >>type="string"
> > >> id="on_nodes_dbx_first_head-expr" value="AH"/>
> > >> 
> > >>   
> > >>
> > >> Result:
> > >> Call cib_replace failed (-203): Update does not conform to the
> > >> configured schema
> > >>
> > >> Everything works when I remove "boolean-op" attribute and leave
> only one
> > >> expression.
> > >> What do I do wrong when creating rules?
> > >
> > > boolean_op
> > >
> > > Underbar not dash :-)
> >
> > Good spotting, but I think a more useful error message would be
> > desired ;-)
> >
> > >
> > >>
> > >>
> > >> Pacemaker 1.1.16-1.el6
> > >> Written by Andrew Beekhof
> > >>
> > >>
> > >> Thank in advance for any help,
> > >>
> > >> --
> > >> Best Regards,
> > >>
> > >> Radoslaw Garbacz
> > >> XtremeData Incorporated
> > >
> > > ___
> > > Users mailing list: Users@clusterlabs.org
> > <mailto:Users@clusterlabs.org>
> > > http://lists.clusterlabs.org/mailman/listinfo/users
> > <http://lists.clusterlabs.org/mailman/listinfo/users>
> > >
> > > Project Home: http://www.clusterlabs.org
> > > Getting started:
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
> > > Bugs: http://bugs.clusterlabs.org
> >
> >
> >
> >
> >
> > ___
> > Users mailing list: Users@clusterlabs.org  Users@clusterlabs.org>
> > http://lists.clusterlabs.org/mailman/listinfo/users
> > <http://lists.clusterlabs.org/mailman/listinfo/users>
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started:
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > <http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf>
> > Bugs: http://bugs.clusterlabs.org
> >
> >
> >
> >
> > --
> > Best Regards,
> >
> > Radoslaw Garbacz
> > XtremeData Incorporated
> >
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: CIB configuration: role with many expressions - error 203

2017-03-22 Thread Radoslaw Garbacz
I have also tried it as 'boolean_op'; sorry, I did not mention this in the
original post (just as a remark, the pacemaker documentation has both
forms).

To make it work I have to remove the additional "expression" element and leave
only one.

To summarize:
- having no "boolean..." attribute and a single "expression" - works
- having "boolean-op" and a single "expression" - works

- having "boolean_op" and a single "expression" - does not work
- having either "boolean-op" or "boolean_op" or no such phrase at all with
more than one "expression" - does not work



I have found the reason: the expression IDs within a rule were the same; once I
made them unique, it works.
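
For completeness, the working form of the rule - "boolean-op" with a dash and
unique expression IDs (the node attribute names here are placeholders for the
real ones):

  <rule id="on_nodes_dbx_first_head" boolean-op="and">
    <expression id="on_nodes_dbx_first_head-expr-1" attribute="node_type"
                operation="eq" type="string" value="Active"/>
    <expression id="on_nodes_dbx_first_head-expr-2" attribute="node_group"
                operation="eq" type="string" value="AH"/>
  </rule>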


Thanks,


On Wed, Mar 22, 2017 at 2:06 AM, Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> Ken Gaillot  wrote on 22.03.2017 at 00:18 in
> message
> <94b7e5fd-cb65-4775-71df-ca8983629...@redhat.com>:
> > On 03/21/2017 11:20 AM, Radoslaw Garbacz wrote:
> >> Hi,
> >>
> >> I have a problem when creating rules with many expressions:
> >>
> >>  
> >>  >> boolean-op="and">
> >>>> id="on_nodes_dbx_first_head-expr" value="Active"/>
> >>>> id="on_nodes_dbx_first_head-expr" value="AH"/>
> >> 
> >>   
> >>
> >> Result:
> >> Call cib_replace failed (-203): Update does not conform to the
> >> configured schema
> >>
> >> Everything works when I remove "boolean-op" attribute and leave only one
> >> expression.
> >> What do I do wrong when creating rules?
> >
> > boolean_op
> >
> > Underbar not dash :-)
>
> Good spotting, but I think a more useful error message would be desired ;-)
>
> >
> >>
> >>
> >> Pacemaker 1.1.16-1.el6
> >> Written by Andrew Beekhof
> >>
> >>
> >> Thank in advance for any help,
> >>
> >> --
> >> Best Regards,
> >>
> >> Radoslaw Garbacz
> >> XtremeData Incorporated
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
>
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] CIB configuration: role with many expressions - error 203

2017-03-21 Thread Radoslaw Garbacz
Hi,

I have a problem when creating rules with many expressions:

 

  
  

  

Result:
Call cib_replace failed (-203): Update does not conform to the configured
schema

Everything works when I remove the "boolean-op" attribute and leave only one
expression.
What am I doing wrong when creating rules?


Pacemaker 1.1.16-1.el6
Written by Andrew Beekhof


Thanks in advance for any help,

-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporated
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: emergency stop does not honor resources ordering constraints (?)

2016-12-09 Thread Radoslaw Garbacz
Thank you, loss of quorum could indeed be an intentional behavior, however
I experience the same situation when there is a monitoring failure or when
the parameter "no-quorum-policy" is set to "ignore", i.e.
- normal pacemaker service stop or 'crm_resources' stop for all resources:
A -> B -> C
- lost quorum (with 'no-quorum-policy=ignore') or 'crm_resources' stop for
all resources, when one of the resources reported a "monitor" error: the stop
is not ordered

I will double-check my tests, however it would be helpful to know if, by
chance, it is as it is supposed to be.


On Wed, Dec 7, 2016 at 1:40 AM, Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> >>> Radoslaw Garbacz  wrote on
> 06.12.2016 at
> 18:50 in message
> :
> > Hi,
> >
> > I have encountered a problem with pacemaker resources shutdown in case of
> > (seems like) any emergency situation, when order constraints are not
> > honored.
> > I would be grateful for any information, whether this behavior is
> > intentional or should not happen (i.e. some testing issue rather then
> > pacemaker behavior). It would also be helpful to know if there is any
> > configuration parameter altering this, or whether there can be any reason
> > (cluster event) triggering not ordered resources stop.
> >
> > Thanks,
> >
> > To illustrate the issue I provide an example below and my collected data.
> > My environment uses resources cloning feature - maybe this contributes to
> > my tests outcome.
> >
> >
> > * Example:
> > - having resources ordered with constraints: A -> B -> C
> > - when stopping with 'crm_resources' command (all at once) resources are
> > stopped: C, B, A
> > - when stopping by terminating pacemaker resources are stopped: C, B, A
> > - when there is a monitoring error or quorum lost: no order is honored
> e.g.
> > B, C, A
>
> Hi!
>
> If the node does not have quorum, it cannot do any cluster operations
> (IMHO). Instead it will try to commit suicide, maby with the help of
> self-fencing. So I think this case is normal for no quorum.
>
> Ulrich
>
> >
> >
> >
> > * Version details:
> > Pacemaker 1.1.15-1.1f8e642.git.el6
> > Corosync Cluster Engine, version '2.4.1.2-0da1'
> >
> >
> >
> > * My ordering constraints:
> > Ordering Constraints:
> >   dbx_first_primary then dbx_head_head (kind:Mandatory)
> >   dbx_first_primary-clone then dbx_head_head (kind:Mandatory)
> >   dbx_head_head then dbx_mounts_nodes (kind:Mandatory)
> >   dbx_head_head then dbx_mounts_nodes-clone (kind:Mandatory)
> >   dbx_mounts_nodes then dbx_bind_mounts_nodes (kind:Mandatory)
> >   dbx_mounts_nodes-clone then dbx_bind_mounts_nodes-clone
> (kind:Mandatory)
> >   dbx_bind_mounts_nodes then dbx_nfs_nodes (kind:Mandatory)
> >   dbx_bind_mounts_nodes-clone then dbx_nfs_nodes-clone (kind:Mandatory)
> >   dbx_nfs_nodes then dbx_gss_datas (kind:Mandatory)
> >   dbx_nfs_nodes-clone then dbx_gss_datas-clone (kind:Mandatory)
> >   dbx_gss_datas then dbx_nfs_mounts_datas (kind:Mandatory)
> >   dbx_gss_datas-clone then dbx_nfs_mounts_datas-clone (kind:Mandatory)
> >   dbx_nfs_mounts_datas then dbx_swap_nodes (kind:Mandatory)
> >   dbx_nfs_mounts_datas-clone then dbx_swap_nodes-clone (kind:Mandatory)
> >   dbx_swap_nodes then dbx_sync_head (kind:Mandatory)
> >   dbx_swap_nodes-clone then dbx_sync_head (kind:Mandatory)
> >   dbx_sync_head then dbx_dbx_datas (kind:Mandatory)
> >   dbx_sync_head then dbx_dbx_datas-clone (kind:Mandatory)
> >   dbx_dbx_datas then dbx_dbx_head (kind:Mandatory)
> >   dbx_dbx_datas-clone then dbx_dbx_head (kind:Mandatory)
> >   dbx_dbx_head then dbx_web_head (kind:Mandatory)
> >   dbx_web_head then dbx_ready_primary (kind:Mandatory)
> >   dbx_web_head then dbx_ready_primary-clone (kind:Mandatory)
> >
> >
> >
> > * Pacemaker stop (OK):
> > ready.ocf.sh(dbx_ready_primary)[18639]: 2016/12/06_15:40:32 INFO:
> > ready_stop: Stopping resource
> > mng.ocf.sh(dbx_mng_head)[20312]:2016/12/06_15:40:44 INFO:
> mng_stop:
> > Stopping resource
> > web.ocf.sh(dbx_web_head)[20310]:2016/12/06_15:40:44 INFO:
> > dbxcl_stop: Stopping resource
> > dbx.ocf.sh(dbx_dbx_head)[20569]:2016/12/06_15:40:46 INFO:
> > dbxcl_stop: Stopping resource
> > sync.ocf.sh(dbx_sync_head)[20719]:  2016/12/06_15:40:54 INFO:
> > sync_stop: Stopping resource
> > swap.ocf.sh(dbx_swap_nodes)[21053]: 2016/12/06_15:40:56 INFO:
> > swap_stop: Stopping resource
> > nfs.ocf.sh(dbx_nf

[ClusterLabs] emergency stop does not honor resources ordering constraints (?)

2016-12-06 Thread Radoslaw Garbacz
Hi,

I have encountered a problem with pacemaker resource shutdown in (what seems
like) any emergency situation, where order constraints are not
honored.
I would be grateful for any information on whether this behavior is
intentional or should not happen (i.e. a testing issue rather than
pacemaker behavior). It would also be helpful to know if there is any
configuration parameter altering this, or whether there can be any reason
(cluster event) triggering an unordered resource stop.

Thanks,

To illustrate the issue I provide an example below and my collected data.
My environment uses the resource cloning feature - maybe this contributes to
my tests' outcome.


* Example:
- having resources ordered with constraints: A -> B -> C
- when stopping with the 'crm_resources' command (all at once), resources are
stopped: C, B, A
- when stopping by terminating pacemaker, resources are stopped: C, B, A
- when there is a monitoring error or quorum is lost: no order is honored,
e.g. B, C, A



* Version details:
Pacemaker 1.1.15-1.1f8e642.git.el6
Corosync Cluster Engine, version '2.4.1.2-0da1'



* My ordering constraints:
Ordering Constraints:
  dbx_first_primary then dbx_head_head (kind:Mandatory)
  dbx_first_primary-clone then dbx_head_head (kind:Mandatory)
  dbx_head_head then dbx_mounts_nodes (kind:Mandatory)
  dbx_head_head then dbx_mounts_nodes-clone (kind:Mandatory)
  dbx_mounts_nodes then dbx_bind_mounts_nodes (kind:Mandatory)
  dbx_mounts_nodes-clone then dbx_bind_mounts_nodes-clone (kind:Mandatory)
  dbx_bind_mounts_nodes then dbx_nfs_nodes (kind:Mandatory)
  dbx_bind_mounts_nodes-clone then dbx_nfs_nodes-clone (kind:Mandatory)
  dbx_nfs_nodes then dbx_gss_datas (kind:Mandatory)
  dbx_nfs_nodes-clone then dbx_gss_datas-clone (kind:Mandatory)
  dbx_gss_datas then dbx_nfs_mounts_datas (kind:Mandatory)
  dbx_gss_datas-clone then dbx_nfs_mounts_datas-clone (kind:Mandatory)
  dbx_nfs_mounts_datas then dbx_swap_nodes (kind:Mandatory)
  dbx_nfs_mounts_datas-clone then dbx_swap_nodes-clone (kind:Mandatory)
  dbx_swap_nodes then dbx_sync_head (kind:Mandatory)
  dbx_swap_nodes-clone then dbx_sync_head (kind:Mandatory)
  dbx_sync_head then dbx_dbx_datas (kind:Mandatory)
  dbx_sync_head then dbx_dbx_datas-clone (kind:Mandatory)
  dbx_dbx_datas then dbx_dbx_head (kind:Mandatory)
  dbx_dbx_datas-clone then dbx_dbx_head (kind:Mandatory)
  dbx_dbx_head then dbx_web_head (kind:Mandatory)
  dbx_web_head then dbx_ready_primary (kind:Mandatory)
  dbx_web_head then dbx_ready_primary-clone (kind:Mandatory)



* Pacemaker stop (OK):
ready.ocf.sh(dbx_ready_primary)[18639]: 2016/12/06_15:40:32 INFO:
ready_stop: Stopping resource
mng.ocf.sh(dbx_mng_head)[20312]:2016/12/06_15:40:44 INFO: mng_stop:
Stopping resource
web.ocf.sh(dbx_web_head)[20310]:2016/12/06_15:40:44 INFO:
dbxcl_stop: Stopping resource
dbx.ocf.sh(dbx_dbx_head)[20569]:2016/12/06_15:40:46 INFO:
dbxcl_stop: Stopping resource
sync.ocf.sh(dbx_sync_head)[20719]:  2016/12/06_15:40:54 INFO:
sync_stop: Stopping resource
swap.ocf.sh(dbx_swap_nodes)[21053]: 2016/12/06_15:40:56 INFO:
swap_stop: Stopping resource
nfs.ocf.sh(dbx_nfs_nodes)[21151]:   2016/12/06_15:40:58 INFO: nfs_stop:
Stopping resource
dbx_mounts.ocf.sh(dbx_bind_mounts_nodes)[21344]:2016/12/06_15:40:59
INFO: dbx_mounts_stop: Stopping resource
dbx_mounts.ocf.sh(dbx_mounts_nodes)[21767]: 2016/12/06_15:41:01 INFO:
dbx_mounts_stop: Stopping resource
head.ocf.sh(dbx_head_head)[22213]:  2016/12/06_15:41:04 INFO:
head_stop: Stopping resource
first.ocf.sh(dbx_first_primary)[22999]: 2016/12/06_15:41:11 INFO:
first_stop: Stopping resource



* Quorum lost:
sync.ocf.sh(dbx_sync_head)[23099]:  2016/12/06_16:42:04 INFO:
sync_stop: Stopping resource
nfs.ocf.sh(dbx_nfs_nodes)[23102]:   2016/12/06_16:42:04 INFO: nfs_stop:
Stopping resource
mng.ocf.sh(dbx_mng_head)[23101]:2016/12/06_16:42:04 INFO: mng_stop:
Stopping resource
ready.ocf.sh(dbx_ready_primary)[23104]: 2016/12/06_16:42:04 INFO:
ready_stop: Stopping resource
web.ocf.sh(dbx_web_head)[23344]:2016/12/06_16:42:04 INFO:
dbxcl_stop: Stopping resource
dbx_mounts.ocf.sh(dbx_bind_mounts_nodes)[23664]:2016/12/06_16:42:05
INFO: dbx_mounts_stop: Stopping resource
dbx_mounts.ocf.sh(dbx_mounts_nodes)[24459]: 2016/12/06_16:42:08 INFO:
dbx_mounts_stop: Stopping resource
head.ocf.sh(dbx_head_head)[25036]:  2016/12/06_16:42:11 INFO:
head_stop: Stopping resource
swap.ocf.sh(dbx_swap_nodes)[27491]: 2016/12/06_16:43:08 INFO:
swap_stop: Stopping resource


-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker remote - invalid message detected, endian mismatch

2016-10-18 Thread Radoslaw Garbacz
d: (
lrmd.c:523   ) warning: send_client_notify:Notification of client
remote-lrmd-ip-10-203-186-119:3121/c20a9a4e-b919-4e8a-8167-0cfa846fb24c
failed
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
services.c:461   )info: cancel_recurring_action:Cancelling ocf
operation dbx_nfs_mounts_datas_monitor_137000
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:361   )   trace: crm_remote_send:Sending len[0]=40,
start=6d726c3c
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:237   )   trace: crm_send_tls:Message size: 40
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:246   )   error: crm_send_tls:Connection terminated rc = -10
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:237   )   trace: crm_send_tls:Message size: 903
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:246   )   error: crm_send_tls:Connection terminated rc = -10
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
remote.c:364   )   error: crm_remote_send:Failed to send remote msg, rc
= -10
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted:
(lrmd_client.:584   )   error: lrmd_tls_send_msg:Failed to send remote
lrmd tls msg, rc = -10
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
lrmd.c:523   ) warning: send_client_notify:Notification of client
remote-lrmd-ip-10-203-186-119:3121/c20a9a4e-b919-4e8a-8167-0cfa846fb24c
failed
Oct 18 16:05:15 [10504] ip-10-203-186-119 pacemaker_remoted: (
services.c:461   )info: cancel_recurring_action:Cancelling ocf
operation dbx_gss_datas_monitor_127000






On Fri, Sep 30, 2016 at 4:53 PM, Jan Pokorný  wrote:

> On 30/09/16 11:28 -0500, Radoslaw Garbacz wrote:
> > I have posted a question about this error attached to another thread, but
> > because it was old and there is no answer I thought it could have been
> > missed, so I am sorry for repeating it.
> >
> > Regarding the problem.
> > I have a cluster, and when the cluster gets bigger (around 40 remote
> nodes)
> > some remote nodes go offline after a while and their logs report some
> > message errors, there is no indication about anything wrong in the other
> > logs.
>
> I believe I would have a plausible explanation provided it may happen
> (not sure now, perhaps the ipc proxy setup would allow it) that two
> messages via the same connection are transmitted, with the second one
> being read as part of the first one.
>
> Could you please try running pacemaker_remoted with
> "PCMK_trace_files=remote.c" in the respective "sysconfig" file?
>
> > Details:
> > - 40 ec2 m3.xlarge nodes, 1 corosync ring member, 39 remote
> > - maybe irrelevant, but either "cib" or "pengine" process goes to ~100%
> CPU
> > - it does not happen immediately
> > - smaller cluster (~20 remote nodes) does not show any problems
> > - pacemaker: 1.1.15-1.1f8e642.git.el6.x86_64
> > - corosync: 2.4.1-1.2.0da1.el6.x86_64
> > - libqb-1.0.0-1.28.4dff.el6.x86_64
> > - CentOS 6
> >
> > Logs:
> >
> > [...]
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> > crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
> > endian == ENDIAN_LOCAL
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> > crm_remote_header:Invalid message detected, endian mismatch:
> > badadbbd is neither 63646330 nor the swab'd 30636463
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> > crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
> > endian == ENDIAN_LOCAL
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> > crm_remote_header:Invalid message detected, endian mismatch:
> > badadbbd is neither 63646330 nor the swab'd 30636463
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> > crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
> > endian == ENDIAN_LOCAL
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> > crm_remote_header:Invalid message detected, endian mismatch:
> > badadbbd is neither 63646330 nor the swab'd 30636463
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
> > lrmd_remote_client_msg:   Client disconnect detected in tls msg
> dispatcher.
> > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
> > ipc_proxy_remove_provider:ipc proxy connection for client
> > ca8df213-6da7-4c42-8cb3-b8bc0887f2ce pid 21815 destroyed because cluster
> > node disconnected.
> > Sep

[ClusterLabs] Pacemaker remote - invalid message detected, endian mismatch

2016-09-30 Thread Radoslaw Garbacz
Hi,

I have posted a question about this error attached to another thread, but
because it was old and there was no answer, I thought it might have been
missed, so I am sorry for repeating it.

Regarding the problem:
I have a cluster, and when the cluster gets bigger (around 40 remote nodes),
some remote nodes go offline after a while and their logs report some
message errors; there is no indication of anything wrong in the other
logs.

Details:
- 40 ec2 m3.xlarge nodes, 1 corosync ring member, 39 remote
- maybe irrelevant, but either "cib" or "pengine" process goes to ~100% CPU
- it does not happen immediately
- smaller cluster (~20 remote nodes) does not show any problems
- pacemaker: 1.1.15-1.1f8e642.git.el6.x86_64
- corosync: 2.4.1-1.2.0da1.el6.x86_64
- libqb-1.0.0-1.28.4dff.el6.x86_64
- CentOS 6

Logs:

[...]
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
endian == ENDIAN_LOCAL
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
crm_remote_header:Invalid message detected, endian mismatch:
badadbbd is neither 63646330 nor the swab'd 30636463
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
endian == ENDIAN_LOCAL
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
crm_remote_header:Invalid message detected, endian mismatch:
badadbbd is neither 63646330 nor the swab'd 30636463
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
endian == ENDIAN_LOCAL
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
crm_remote_header:Invalid message detected, endian mismatch:
badadbbd is neither 63646330 nor the swab'd 30636463
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
lrmd_remote_client_msg:   Client disconnect detected in tls msg dispatcher.
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
ipc_proxy_remove_provider:ipc proxy connection for client
ca8df213-6da7-4c42-8cb3-b8bc0887f2ce pid 21815 destroyed because cluster
node disconnected.
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
cancel_recurring_action:  Cancelling ocf operation
monitor_all_monitor_191000
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
crm_send_tls: Connection terminated rc = -53
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
crm_send_tls: Connection terminated rc = -10
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
crm_remote_send:  Failed to send remote msg, rc = -10
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
lrmd_tls_send_msg:Failed to send remote lrmd tls msg, rc = -10
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:  warning:
send_client_notify:   Notification of client
remote-lrmd-ip-10-237-223-67:3121/b6034d3a-e296-492f-b296-725735d17e22
failed
Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:   notice:
lrmd_remote_client_destroy:   LRMD client disconnecting remote client -
name: remote-lrmd-ip-10-237-223-67:3121 id: b6034d3a-e296-492f-b296-
725735d17e22
Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error:
ipc_proxy_accept: No ipc providers available for uid 0 gid 0
Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error:
handle_new_connection:Error in connection setup (19626-21815-14):
Remote I/O error (121)
Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error:
ipc_proxy_accept: No ipc providers available for uid 0 gid 0
Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error:
handle_new_connection:Error in connection setup (19626-21815-14):
Remote I/O error (121)
[...]



-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemaker_remoted XML parse error

2016-09-28 Thread Radoslaw Garbacz
Just to add a possibly helpful observation: either the "cib" or the "pengine"
process goes to ~100% CPU when these remote node errors happen.

On Tue, Sep 27, 2016 at 2:36 PM, Radoslaw Garbacz <
radoslaw.garb...@xtremedatainc.com> wrote:

> Hi,
>
> I encountered the same problem with pacemaker built from github at around
> August 22.
>
> Remote nodes go offline occasionally and stay so, their logs show same
> errors. The cluster is on AWS ec2 instances, the network works and is an
> unlikely reason.
>
> Have there been any commits on github recently (after August 22) addressing
> this issue?
>
>
> Logs:
> [...]
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
> endian == ENDIAN_LOCAL
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_remote_header:Invalid message detected, endian mismatch:
> badadbbd is neither 63646330 nor the swab'd 30636463
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
> endian == ENDIAN_LOCAL
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_remote_header:Invalid message detected, endian mismatch:
> badadbbd is neither 63646330 nor the swab'd 30636463
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_abort:crm_remote_header: Triggered assert at remote.c:119 :
> endian == ENDIAN_LOCAL
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_remote_header:Invalid message detected, endian mismatch:
> badadbbd is neither 63646330 nor the swab'd 30636463
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
> lrmd_remote_client_msg:   Client disconnect detected in tls msg dispatcher.
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
> ipc_proxy_remove_provider:ipc proxy connection for client
> ca8df213-6da7-4c42-8cb3-b8bc0887f2ce pid 21815 destroyed because cluster
> node disconnected.
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info:
> cancel_recurring_action:  Cancelling ocf operation
> monitor_all_monitor_191000
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_send_tls: Connection terminated rc = -53
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_send_tls: Connection terminated rc = -10
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> crm_remote_send:  Failed to send remote msg, rc = -10
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> lrmd_tls_send_msg:Failed to send remote lrmd tls msg, rc = -10
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:  warning:
> send_client_notify:   Notification of client
> remote-lrmd-ip-10-237-223-67:3121/b6034d3a-e296-492f-b296-725735d17e22
> failed
> Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:   notice:
> lrmd_remote_client_destroy:   LRMD client disconnecting remote client
> - name: remote-lrmd-ip-10-237-223-67:3121 id: b6034d3a-e296-492f-b296-
> 725735d17e22
> Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> ipc_proxy_accept: No ipc providers available for uid 0 gid 0
> Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> handle_new_connection:Error in connection setup (19626-21815-14):
> Remote I/O error (121)
> Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> ipc_proxy_accept: No ipc providers available for uid 0 gid 0
> Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error:
> handle_new_connection:Error in connection setup (19626-21815-14):
> Remote I/O error (121)
> [...]
>
>
>
>
> On Thu, Jun 9, 2016 at 12:24 AM, Narayanamoorthy Srinivasan <
> narayanamoort...@gmail.com> wrote:
>
>> Don't see any issues in network traffic.
>>
>> Some more logs where the XML tags are incomplete:
>>
>> 2016-06-09T03:06:03.096449+05:30 d18-fb-7b-18-f1-8e
>> pacemaker_remoted[6153]:error: Partial
>> > operation="stop" crm-debug-origin="do_update_resource"
>> crm_feature_set="3.0.10" transition-key="225:116:0:8fbf
>> 83fd-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;225:116:
>> 0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" on_node="d00-50-56-94-24-dd"
>> call-id="489" rc-code="0" op-status="0" interval="0" last-run="1459491026"
>> last-rc-change="1459491026" exec-time=

Re: [ClusterLabs] pacemaker_remoted XML parse error

2016-09-27 Thread Radoslaw Garbacz
> self-fencing.
>>> > Appreciate if someone throws light on what could be the issue and the
>>> fix.
>>> >
>>> > OS - SLES 12 SP1
>>> > Pacemaker Remote version - pacemaker-remote-1.1.13-14.7.x86_64
>>> >
>>> > 2016-06-08T14:11:46.009073+05:30 d18-fb-7b-18-f1-8e
>>> > pacemaker_remoted[6190]:error: XML Error: Entity: line 1: parser
>>> > error : AttValue: ' expected
>>> > 2016-06-08T14:11:46.009314+05:30 d18-fb-7b-18-f1-8e
>>> > pacemaker_remoted[6190]:error: XML Error:
>>> > key="neutron-ha-tool_monitor_0" operation="monitor"
>>> > crm-debug-origin="do_update_
>>> > 2016-06-08T14:11:46.009443+05:30 d18-fb-7b-18-f1-8e
>>> > pacemaker_remoted[6190]:error: XML Error:
>>> >  ^
>>> > 2016-06-08T14:11:46.009567+05:30 d18-fb-7b-18-f1-8e
>>> > pacemaker_remoted[6190]:error: XML Error: Entity: line 1: parser
>>> > error : attributes construct error
>>> > 2016-06-08T14:11:46.009697+05:30 d18-fb-7b-18-f1-8e
>>> > pacemaker_remoted[6190]:error: XML Error:
>>> > key="neutron-ha-tool_monitor_0" operation="monitor"
>>> > crm-debug-origin="do_update_
>>> > 2016-06-08T14:11:46.009824+05:30 d18-fb-7b-18-f1-8e
>>> > pacemaker_remoted[6190]:error: XML Error:
>>> >  ^
>>> > 2016-06-08T14:11:46.009948+05:30 d18-fb-7b-18-f1-8e
>>> > pacemaker_remoted[6190]:error: XML Error: Entity: line 1: parser
>>> > error : Couldn't find end of Start Tag lrm_rsc_op line 1
>>> > 2016-06-08T14:11:46.010070+05:30 d18-fb-7b-18-f1-8e
>>> > pacemaker_remoted[6190]:error: XML Error:
>>> > key="neutron-ha-tool_monitor_0" operation="monitor"
>>> > crm-debug-origin="do_update_
>>> > 2016-06-08T14:11:46.010191+05:30 d18-fb-7b-18-f1-8e
>>> > pacemaker_remoted[6190]:error: XML Error:
>>> >  ^
>>> > 2016-06-08T14:11:46.010460+05:30 d18-fb-7b-18-f1-8e
>>> > pacemaker_remoted[6190]:error: XML Error: Entity: line 1: parser
>>> > error : Premature end of data in tag lrm_resource line 1
>>> > 2016-06-08T14:11:46.010718+05:30 d18-fb-7b-18-f1-8e
>>> > pacemaker_remoted[6190]:error: XML Error:
>>> > key="neutron-ha-tool_monitor_0" operation="monitor"
>>> > crm-debug-origin="do_update_
>>> > 2016-06-08T14:11:46.010977+05:30 d18-fb-7b-18-f1-8e
>>> > pacemaker_remoted[6190]:error: XML Error:
>>> >  ^
>>> > 2016-06-08T14:11:46.011234+05:30 d18-fb-7b-18-f1-8e
>>> > pacemaker_remoted[6190]:error: XML Error: Entity: line 1: parser
>>> > error : Premature end of data in tag lrm_resources line 1
>>> >
>>> >
>>> > --
>>> > Thanks & Regards
>>> > Moorthy
>>>
>>> This sounds like the network traffic between the cluster nodes and the
>>> remote nodes is being corrupted. Have there been any network changes
>>> lately? Switch/firewall/etc. equipment/settings? MTU?
>>>
>>> You could try using a packet sniffer such as wireshark to see if the
>>> traffic looks abnormal in some way. The payload is XML so it should be
>>> more or less readable.
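
For anyone wanting to try that, a minimal capture along those lines (a sketch
only: it assumes the default pacemaker_remoted TCP port 3121 and an interface
named eth0, both of which may differ):

tcpdump -i eth0 -s 0 -w pcmk-remote.pcap 'tcp port 3121'

The resulting pcmk-remote.pcap can then be opened in wireshark to look for
retransmissions, truncation, or MTU-related fragmentation.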
>>>
>>>
>>> ___
>>> Users mailing list: Users@clusterlabs.org
>>> http://clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>
>>
>>
>> --
>> Thanks & Regards
>> Moorthy
>>
>
>
>
> --
> Thanks & Regards
> Moorthy
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>


-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: corosync/pacemaker on ~100 nodes cluster

2016-09-02 Thread Radoslaw Garbacz
Indeed, the cluster is quite sluggish when responding to events, but that is
still acceptable for me, since the priority is to have it running with
many nodes. In my case the network is quite heavily used, but the shared
storage was limited. The settings that worked for the 55 nodes I tested
were chosen just to get the cluster running and are not reasonable as a
long-term solution (hence my post). For me, "pacemaker-remote" seems to be
the way to go beyond the roughly 16-node "corosync" limit.
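
As a rough sketch of that direction (names and values are illustrative only),
a remote node is added as an ocf:pacemaker:remote resource once
pacemaker_remoted is running on the remote host and /etc/pacemaker/authkey is
shared with it, e.g. with pcs:

pcs resource create remote1 ocf:pacemaker:remote server=remote1 reconnect_interval=60 op monitor interval=30s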


On Thu, Aug 25, 2016 at 1:19 AM, Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> Hi!
>
> I have two questions:
> 1) TOTEM being a ring protocol will have to pass each message to every
> node, one after the other, right? Wouldn't a significant delay in message
> processing happen?
> 2) If you use some shared storage (shared disks), how do you provide
> sufficient bandwidth? I'm assuming that 99 of 100 nodes don't have an
> idle/standby role in the cluster.
>
> Regards,
> Ulrich
>
> >>> Radoslaw Garbacz  wrote on
> 24.08.2016 at
> 19:49 in message
> :
> > Hi,
> >
> > Thank you for the advice. Indeed, seems like Pacemaker Remote will solve
> my
> > big cluster problem.
> >
> > With regard to your questions about my current solution, I scale corosync
> > parameters based on the number of nodes, additionally modifying some of
> the
> > kernel network parameters. Tests I did let me select certain corosync
> > settings, which works, but are possibly not the best (cluster is quite
> slow
> > when reacting to some quorum related events).
> >
> > The problem seems to be only related to cluster start, once running, any
> > operations such as node lost/reconnect, agents creation/start/stop work
> > well. Memory and network seems important with regard to the hardware.
> >
> > Below are settings I used for my latest test (the largest working
> cluster I
> > tried):
> > * latest pacemaker/corosync
> > * 55 c3.4xlarge nodes (amazon cloud)
> > * 55 active nodes, 552 resources in a cluster
> > * kernel settings:
> > net.core.wmem_max=12582912
> > net.core.rmem_max=12582912
> > net.ipv4.tcp_rmem= 10240 87380 12582912
> > net.ipv4.tcp_wmem= 10240 87380 12582912
> > net.ipv4.tcp_window_scaling = 1
> > net.ipv4.tcp_timestamps = 1
> > net.ipv4.tcp_sack = 1
> > net.ipv4.tcp_no_metrics_save = 1
> > net.core.netdev_max_backlog = 5000
> >
> > * corosync settings:
> > token: 12000
> > consensus: 16000
> > join: 1500
> > send_join: 80
> > merge: 2000
> > downcheck: 2000
> > max_network_delay: 150 # for azure
> >
> > Best regards,
> >
> >
> > On Tue, Aug 23, 2016 at 12:00 PM, Ken Gaillot 
> wrote:
> >
> >> On 08/23/2016 11:46 AM, Klaus Wenninger wrote:
> >> > On 08/23/2016 06:26 PM, Radoslaw Garbacz wrote:
> >> >> Hi,
> >> >>
> >> >> I would like to ask for settings (and hardware requirements) to have
> >> >> corosync/pacemaker running on about 100 nodes cluster.
> >> > Actually I had thought that 16 would be the limit for full
> >> > pacemaker-cluster-nodes.
> >> > For larger deployments pacemaker-remote should be the way to go. Were
> >> > you speaking of a cluster with remote-nodes?
> >> >
> >> > Regards,
> >> > Klaus
> >> >>
> >> >> For now some nodes get totally frozen (high CPU, high network usage),
> >> >> so that even login is not possible. By manipulating
> >> >> corosync/pacemaker/kernel parameters I managed to run it on ~40 nodes
> >> >> cluster, but I am not sure which parameters are critical, how to make
> >> >> it more responsive and how to make the number of nodes even bigger.
> >>
> >> 16 is a practical limit without special hardware and tuning, so that's
> >> often what companies that offer support for clusters will accept.
> >>
> >> I know people have gone well higher than 16 with a lot of optimization,
> >> but I think somewhere between 32 and 64 corosync can't keep up with the
> >> messages. Your 40 nodes sounds about right. I'd be curious to hear what
> >> you had to do (with hardware, OS tuning, and corosync tuning) to get
> >> that far.
> >>
> >> As Klaus mentioned, Pacemaker Remote is the preferred way to go beyond
> >> that currently:
> >>
> >> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-
> >> single/Pacemaker_Remote/index.html
>

Re: [ClusterLabs] corosync/pacemaker on ~100 nodes cluster

2016-08-24 Thread Radoslaw Garbacz
Hi,

Thank you for the advice. Indeed, it seems that Pacemaker Remote will solve my
big cluster problem.

With regard to your questions about my current solution, I scale the corosync
parameters based on the number of nodes and additionally modify some of the
kernel network parameters. The tests I did let me select corosync settings
that work, but they are probably not optimal (the cluster is quite slow when
reacting to some quorum-related events).

The problem seems to be related only to cluster start; once running,
operations such as node loss/reconnect and agent creation/start/stop work
well. As for hardware, memory and network seem to be the important factors.

Below are settings I used for my latest test (the largest working cluster I
tried):
* latest pacemaker/corosync
* 55 c3.4xlarge nodes (amazon cloud)
* 55 active nodes, 552 resources in a cluster
* kernel settings:
net.core.wmem_max=12582912
net.core.rmem_max=12582912
net.ipv4.tcp_rmem= 10240 87380 12582912
net.ipv4.tcp_wmem= 10240 87380 12582912
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 5000

* corosync settings:
token: 12000
consensus: 16000
join: 1500
send_join: 80
merge: 2000
downcheck: 2000
max_network_delay: 150 # for azure
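
For completeness, the corosync values above live in the totem section of
/etc/corosync/corosync.conf; a minimal sketch of that section (interface and
nodelist details omitted, timer values in milliseconds):

totem {
    version: 2
    transport: udpu
    token: 12000
    consensus: 16000
    join: 1500
    send_join: 80
    merge: 2000
    downcheck: 2000
    max_network_delay: 150
}

The kernel settings go into /etc/sysctl.conf (or a file under /etc/sysctl.d/)
and are applied with "sysctl -p".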

Best regards,


On Tue, Aug 23, 2016 at 12:00 PM, Ken Gaillot  wrote:

> On 08/23/2016 11:46 AM, Klaus Wenninger wrote:
> > On 08/23/2016 06:26 PM, Radoslaw Garbacz wrote:
> >> Hi,
> >>
> >> I would like to ask for settings (and hardware requirements) to have
> >> corosync/pacemaker running on about 100 nodes cluster.
> > Actually I had thought that 16 would be the limit for full
> > pacemaker-cluster-nodes.
> > For larger deployments pacemaker-remote should be the way to go. Were
> > you speaking of a cluster with remote-nodes?
> >
> > Regards,
> > Klaus
> >>
> >> For now some nodes get totally frozen (high CPU, high network usage),
> >> so that even login is not possible. By manipulating
> >> corosync/pacemaker/kernel parameters I managed to run it on ~40 nodes
> >> cluster, but I am not sure which parameters are critical, how to make
> >> it more responsive and how to make the number of nodes even bigger.
>
> 16 is a practical limit without special hardware and tuning, so that's
> often what companies that offer support for clusters will accept.
>
> I know people have gone well higher than 16 with a lot of optimization,
> but I think somewhere between 32 and 64 corosync can't keep up with the
> messages. Your 40 nodes sounds about right. I'd be curious to hear what
> you had to do (with hardware, OS tuning, and corosync tuning) to get
> that far.
>
> As Klaus mentioned, Pacemaker Remote is the preferred way to go beyond
> that currently:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-
> single/Pacemaker_Remote/index.html
>
> >> Thanks,
> >>
> >> --
> >> Best Regards,
> >>
> >> Radoslaw Garbacz
> >> XtremeData Incorporation
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] corosync/pacemaker on ~100 nodes cluster

2016-08-23 Thread Radoslaw Garbacz
Hi,

I would like to ask for settings (and hardware requirements) to have
corosync/pacemaker running on a cluster of about 100 nodes.

For now some nodes get totally frozen (high CPU, high network usage), so
that even logging in is not possible. By tuning corosync/pacemaker/kernel
parameters I managed to run it on a ~40-node cluster, but I am not sure
which parameters are critical, how to make it more responsive, and how to
push the node count even higher.

Thanks,

-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] libqb 0.17.1 - segfault at 1b8

2016-05-03 Thread Radoslaw Garbacz
Thank you,

On Mon, May 2, 2016 at 4:05 PM, Ken Gaillot  wrote:

> On 05/02/2016 03:45 PM, Jan Pokorný wrote:
> > Hello Radoslaw,
> >
> > On 02/05/16 11:47 -0500, Radoslaw Garbacz wrote:
> >> When testing pacemaker I encountered a start error, which seems to be
> >> related to a reported libqb segmentation fault:
> >> - the cluster started and acquired quorum
> >> - some nodes failed to connect to the CIB, and lost membership as a result
> >> - a restart solved the problem
> >>
> >> The segmentation fault reports the libqb library in version 0.17.1, the
> >> standard package provided for CentOS 6.
> >
> > Chances are that you are running into this nasty bug:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1114852
> >
> >> Please let me know if the problem is known, and if  there is a remedy
> (e.g.
> >> using the latest libqb).
> >
> > Try libqb >= 0.17.2.
> >
> > [...]
> >
> >> Logs from /var/log/messages:
> >>
> >> Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Additional logging
> >> available in /var/log/pacemaker.log
> >> Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Configured corosync
> to
> >> accept connections from group 498: Library error (2)
> >
> > IIRC, that last line ^ was one of the symptoms.
>
> Yes, that does look like the culprit. The root cause is libqb being
> unable to handle 6-digit PIDs, which we can see in the above logs --
> "[90]".
>
> As a workaround, you can lower /proc/sys/kernel/pid_max (aka
> kernel.pid_max sysctl variable), if you don't want to install a newer
> libqb before CentOS 6.8 is released, which will have the fix.
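
A minimal sketch of that workaround (the value is illustrative; 65535 keeps
PIDs at 5 digits):

sysctl -w kernel.pid_max=65535
echo "kernel.pid_max = 65535" >> /etc/sysctl.conf   # persist across reboots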
>
> _______
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] libqb 0.17.1 - segfault at 1b8

2016-05-02 Thread Radoslaw Garbacz
Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Stopping pengine: Sent
-15 to process 96
Apr 22 15:46:41 (...) pengine[96]:   notice: Invoking handler for
signal 15: Terminated
Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Stopping attrd: Sent
-15 to process 98
Apr 22 15:46:41 (...) pacemakerd[90]:error: Managed process 98
(attrd) dumped core
Apr 22 15:46:41 (...) pacemakerd[90]:error: The attrd process
(98) terminated with signal 11 (core=1)
Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Stopping lrmd: Sent -15
to process 94
Apr 22 15:46:41 (...) lrmd[94]:   notice: Invoking handler for signal
15: Terminated
Apr 22 15:46:41 (...) pacemakerd[90]:   notice: Stopping stonith-ng:
Sent -15 to process 93
Apr 22 15:46:41 (...) kernel: [17169.121628] attrd[98]: segfault at 1b8
ip 7f3a98f66181 sp 7ffe33407380 error 4 in
libqb.so.0.17.1[7f3a98f57000+21000]
Apr 22 15:46:50 (...) stonith-ng[93]:error: Could not connect to
the CIB service: Transport endpoint is not connected (-107)
Apr 22 15:46:50 (...) stonith-ng[93]:   notice: Invoking handler for
signal 15: Terminated
Apr 22 15:46:50 (...) pacemakerd[90]:   notice: Shutdown complete
Apr 22 15:46:50 (...) pacemakerd[90]:   notice: Attempting to inhibit
respawning after fatal error




Logs from corosync log:

Apr 22 15:46:22 [93582] (...) corosync notice  [MAIN  ] Corosync Cluster
Engine exiting normally
Apr 22 15:46:40 [47] (...) corosync notice  [MAIN  ] Corosync Cluster
Engine ('2.3.5.12-a71e'): started and ready to provide service.
Apr 22 15:46:40 [47] (...) corosync info[MAIN  ] Corosync built-in
features: dbus pie relro bindnow
Apr 22 15:46:40 [47] (...) corosync notice  [TOTEM ] Initializing
transport (UDP/IP Unicast).
Apr 22 15:46:40 [47] (...) corosync notice  [TOTEM ] Initializing
transmit/receive security (NSS) crypto: none hash: none
Apr 22 15:46:40 [47] (...) corosync notice  [TOTEM ] The network
interface [(...)] is now up.
Apr 22 15:46:40 [47] (...) corosync notice  [SERV  ] Service engine
loaded: corosync configuration map access [0]
Apr 22 15:46:40 [47] (...) corosync info[QB] server name: cmap
Apr 22 15:46:40 [47] (...) corosync notice  [SERV  ] Service engine
loaded: corosync configuration service [1]
Apr 22 15:46:40 [47] (...) corosync info[QB] server name: cfg
Apr 22 15:46:40 [47] (...) corosync notice  [SERV  ] Service engine
loaded: corosync cluster closed process group service v1.01 [2]
Apr 22 15:46:40 [47] (...) corosync info[QB] server name: cpg
Apr 22 15:46:40 [47] (...) corosync notice  [SERV  ] Service engine
loaded: corosync profile loading service [4]
Apr 22 15:46:40 [47] (...) corosync notice  [QUORUM] Using quorum
provider corosync_votequorum
Apr 22 15:46:40 [47] (...) corosync notice  [SERV  ] Service engine
loaded: corosync vote quorum service v1.0 [5]
Apr 22 15:46:40 [47] (...) corosync info[QB] server name:
votequorum
Apr 22 15:46:40 [47] (...) corosync notice  [SERV  ] Service engine
loaded: corosync cluster quorum service v0.1 [3]
Apr 22 15:46:40 [47] (...) corosync info[QB] server name: quorum
Apr 22 15:46:40 [47] (...) corosync notice  [TOTEM ] adding new UDPU
member {(...)}
Apr 22 15:46:40 [47] (...) corosync notice  [TOTEM ] adding new UDPU
member {(...)}
Apr 22 15:46:40 [47] (...) corosync notice  [TOTEM ] adding new UDPU
member {(...)}
Apr 22 15:46:40 [47] (...) corosync notice  [TOTEM ] adding new UDPU
member {(...)}
Apr 22 15:46:40 [47] (...) corosync notice  [TOTEM ] A new membership
((...):660) was formed. Members joined: 3
Apr 22 15:46:40 [47] (...) corosync notice  [QUORUM] Members[1]: 3
Apr 22 15:46:40 [47] (...) corosync notice  [MAIN  ] Completed service
synchronization, ready to provide service.
Apr 22 15:46:40 [47] (...) corosync notice  [TOTEM ] A new membership
((...):664) was formed. Members joined: 4 2 1
Apr 22 15:46:40 [47] (...) corosync notice  [QUORUM] This node is
within the primary component and will provide service.
Apr 22 15:46:40 [47] (...) corosync notice  [QUORUM] Members[4]: 3 4 2 1
Apr 22 15:46:40 [47] (...) corosync notice  [MAIN  ] Completed service
synchronization, ready to provide service.
Apr 22 15:46:41 [47] (...) corosync error   [MAIN  ] Denied connection
attempt from 498:498
Apr 22 15:46:41 [47] (...) corosync error   [QB] Invalid IPC
credentials (48-95-2).
Apr 22 15:46:41 [47] (...) corosync error   [MAIN  ] Denied connection
attempt from 498:498
Apr 22 15:46:41 [47] (...) corosync error   [QB] Invalid IPC
credentials (48-92-2).
Apr 22 15:46:41 [47] (...) corosync error   [MAIN  ] Denied connection
attempt from 498:498
Apr 22 15:46:41 [47] (...) corosync error   [QB] Invalid IPC
credentials (48-98-2).



-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation

Re: [ClusterLabs] required nodes for quorum policy

2015-11-19 Thread Radoslaw Garbacz
Thank you Christine and Andrei,

I took a look at the corosync quorum policy configuration options, and
actually I would need a more conservative approach, i.e. to consider the
cluster quorate only if all the nodes are present - any node loss is a
quorum loss event for me. At present I check this in an agent, but it would
be helpful if pacemaker took care of it for me.

I know that it is not the requirement pacemaker was designed for (i.e. it
does not use the full power of this cluster environment), but for now the
application we use cannot handle any node loss.
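
For reference, a minimal votequorum sketch along the lines Christine pointed
me at (illustrative, for a 3-node cluster); note that wait_for_all only
requires all nodes to be seen before quorum is granted for the first time, it
does not turn a later node loss into a quorum loss:

quorum {
    provider: corosync_votequorum
    expected_votes: 3
    wait_for_all: 1
}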


On Tue, Nov 10, 2015 at 2:13 AM, Christine Caulfield 
wrote:

> On 09/11/15 22:20, Radoslaw Garbacz wrote:
> > Hi,
> >
> > I have a question regarding the policy to check for cluster quorum for
> > corosync+pacemaker.
> >
> > As far as I know at present it is always (excpected_votes)/2 + 1. Seems
> > like "qdiskd" has an option to change it, but it is not clear to me if
> > corosync 2.x supports different quorum device.
>
> corosync 2 does not currently support any other quorum devices. But
> watch this space 
>
> > What are my options if I wanted to configure cluster with a different
> > quorum policy (compilation options are acceptable)?
> >
>
> Have a read of the votequorum(5) man page, there are options for
> auto_tie_breaker, and maybe others, that might be useful to you.
>
>
> Chrissie
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Pacemaker] large cluster - failure recovery

2015-11-19 Thread Radoslaw Garbacz
Thank you.

Indeed, the latest corosync and pacemaker do work with large clusters -
some tuning is required, though.
By working I also mean recovering after a node loss/regain, which was the
major issue before: corosync worked (it re-established membership), but
pacemaker was not able to sync the CIB - it still needs some time and CPU
power to do so, though.

It works for me on a 34-node cluster with a few hundred resources (I
haven't tested bigger yet).



On Thu, Nov 19, 2015 at 2:43 AM, Cédric Dufour - Idiap Research Institute <
cedric.duf...@idiap.ch> wrote:

> [coming over from the old mailing list pacema...@oss.clusterlabs.org;
> sorry for any thread discrepancy]
>
> Hello,
>
> We've also set up a fairly large cluster - 24 nodes / 348 resources
> (pacemaker 1.1.12, corosync 1.4.7) - and pacemaker 1.1.12 is definitely the
> minimum version you'll want, thanks to changes on how the CIB is handled.
>
> If you're going to handle a large number (~several hundreds) of resources
> as well, you may need to concern yourself with the CIB size as well.
> You may want to have a look at pp.17-18 of the document I wrote to
> describe our setup: http://cedric.dufour.name/cv/download/idiap_havc2.pdf
>
> Currently, I would consider that with 24 nodes / 348 resources, we are
> close to the limit of what our cluster can handle, the bottleneck being
> CPU(core) power for CIB/CRM handling. Our "worst performing nodes" (out of
> the 24 in the cluster) are Xeon E7-2830 @ 2.13GHz.
> The main issue we currently face is when a DC is taken out and a new one
> must be elected: CPU goes to 100% for several tens of seconds (even minutes),
> during which the cluster is totally unresponsive. Fortunately, resources
> themselves just sit tight and remain available (I can't say about those
> that would need to be migrated because they are collocated with the DC; we
> manually avoid that situation when performing maintenance that may affect
> the DC).
>
> I'm looking forward to migrating to corosync 2+ (there are some backports
> available for debian/Jessie) and seeing if this would allow pushing the limit
> further. Unfortunately, I can't say for sure, as I have only a limited
> understanding of how Pacemaker/Corosync work and where the CPU is bound to
> become a bottleneck.
>
> [UPDATE] Thanks Ken for the Pacemaker Remote pointer; I'm heading off to
> have a look at that.
>
> 'Hope it can help,
>
> Cédric
>
> On 04/11/15 23:26, Radoslaw Garbacz wrote:
>
> Thank you, will give it a try.
>
> On Wed, Nov 4, 2015 at 12:50 PM, Trevor Hemsley 
> wrote:
>
>> On 04/11/15 18:41, Radoslaw Garbacz wrote:
>> > Details:
>> > OS: CentOS 6
>> > Pacemaker: Pacemaker 1.1.9-1512.el6
>> > Corosync: Corosync Cluster Engine, version '2.3.2'
>>
>> yum update
>>
>> Pacemaker is currently 1.1.12 and corosync 1.4.7 on CentOS 6. There were
>> major improvements in speed with later versions of pacemaker.
>>
>> Trevor
>>
>> ___
>> Pacemaker mailing list: pacema...@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>
>
>
> --
> Best Regards,
>
> Radoslaw Garbacz
> XtremeData Incorporation
>
>
> ___
> Pacemaker mailing list: 
> Pacemaker@oss.clusterlabs.orghttp://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>


-- 
Best Regards,

Radoslaw Garbacz
XtremeData Incorporation
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] required nodes for quorum policy

2015-11-09 Thread Radoslaw Garbacz
Hi,

I have a question regarding the policy to check for cluster quorum for
corosync+pacemaker.

As far as I know, at present it is always (expected_votes)/2 + 1. It seems
that "qdiskd" has an option to change it, but it is not clear to me whether
corosync 2.x supports a different quorum device.

What are my options if I wanted to configure the cluster with a different
quorum policy (compilation options are acceptable)?

Thanks in advance,

-- 
Best Regards,

Radoslaw Garbacz
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] large cluster - failure recovery

2015-11-04 Thread Radoslaw Garbacz
Thank you Ken and Digimer for all your suggestions.

On Wed, Nov 4, 2015 at 2:32 PM, Ken Gaillot  wrote:

> On 11/04/2015 12:55 PM, Digimer wrote:
> > On 04/11/15 01:50 PM, Radoslaw Garbacz wrote:
> >> Hi,
> >>
> >> I have a cluster of 32 nodes, and after some tuning was able to have it
> >> started and running,
> >
> > This is not supported by RH for a reason; it's hard to get the timing
> > right. SUSE supports up to 32 nodes, but they must be doing some serious
> > magic behind the scenes.
> >
> > I would *strongly* recommend dividing this up into a few smaller
> > clusters... 8 nodes per cluster would be max I'd feel comfortable with.
> > You need your cluster to solve more problems than it causes...
>
> Hi Radoslaw,
>
> RH supports up to 16. 32 should be possible with recent
> pacemaker+corosync versions and careful tuning, but it's definitely
> leading-edge.
>
> An alternative with pacemaker 1.1.10+ (1.1.12+ recommended) is Pacemaker
> Remote, which easily scales to dozens of nodes:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html
>
> Pacemaker Remote is a really good approach once you start pushing the
> limits of cluster nodes. Probably better than trying to get corosync to
> handle more nodes. (There are long-term plans for improving corosync's
> scalability, but that doesn't help you now.)
>
> >> but it does not recover from a node disconnect-connect failure.
> >> It regains quorum, but CIB does not recover to a synchronized state and
> >> "cibadmin -Q" times out.
> >>
> >> Is there anything with corosync or pacemaker parameters I can do to make
> >> it recover from such a situation
> >> (everything works for smaller clusters).
> >>
> >> In my case it is OK for a node to disconnect (all the major resources
> >> are shutdown)
> >> and later reconnect the cluster (the running monitoring agent will
> >> cleanup and restart major resources if needed),
> >> so I do not have STONITH configured.
> >>
> >> Details:
> >> OS: CentOS 6
> >> Pacemaker: Pacemaker 1.1.9-1512.el6
> >
> > Upgrade.
>
> If you can upgrade to the latest CentOS 6.7, you can get a much newer
> Pacemaker. But Pacemaker is probably not limiting your cluster nodes;
> the newer version's main benefit would be Pacemaker Remote support. (Of
> course there are plenty of bug fixes and new features as well.)
>
> >> Corosync: Corosync Cluster Engine, version '2.3.2'
> >
> > This is not supported on EL6 at all. Please stick with corosync 1.4 and
> > use the cman pluging as the quorum provider.
>
> CentOS is self-supported anyway, so if you're willing to handle your own
> upgrades and such, nothing wrong with compiling. But corosync is up to
> 2.3.5 so you're already behind. :) I'd recommend compiling libqb 0.17.2
> if you're compiling recent corosync and/or pacemaker.
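
In case it helps anyone building from source, roughly what that looks like
(the tarball name/version is illustrative):

tar xzf libqb-0.17.2.tar.gz
cd libqb-0.17.2
./configure --prefix=/usr
make && make install

with corosync and pacemaker then configured against the freshly installed libqb.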
>
> Alternatively, CentOS 7 will have recent versions of everything.
>
> >> Corosync configuration:
> >> token: 1
> >> #token_retransmits_before_loss_const: 10
> >> consensus: 15000
> >> join: 1000
> >> send_join: 80
> >> merge: 1000
> >> downcheck: 2000
> >> #rrp_problem_count_timeout: 5000
> >> max_network_delay: 150 # for azure
> >>
> >>
> >> Some logs:
> >>
> >> [...]
> >> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> >> cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local not
> >> applied to 1.9275.1: current "epoch" is greater than required
> >> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> >> update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application
> >> of an update diff failed (-1006)
> >> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> >> cib_process_diff: Diff 1.9255.1 -> 1.9256.1 from local not
> >> applied to 1.9275.1: current "epoch" is greater than required
> >> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> >> update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application
> >> of an update diff failed (-1006)
> >> Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
> >> cib_process_diff: Diff 1.9256.1 -> 1.9257.1 from local not
> >> applied to 1.9275.1: current "epoch" is 

[ClusterLabs] large cluster - failure recovery

2015-11-04 Thread Radoslaw Garbacz
Hi,

I have a cluster of 32 nodes, and after some tuning I was able to get it
started and running,
but it does not recover from a node disconnect-reconnect failure.
It regains quorum, but the CIB does not recover to a synchronized state and
"cibadmin -Q" times out.

Is there anything I can do with corosync or pacemaker parameters to make it
recover from such a situation?
(Everything works for smaller clusters.)

In my case it is OK for a node to disconnect (all the major resources are
shut down)
and later reconnect to the cluster (the running monitoring agent will clean up
and restart major resources if needed),
so I do not have STONITH configured.

Details:
OS: CentOS 6
Pacemaker: Pacemaker 1.1.9-1512.el6
Corosync: Corosync Cluster Engine, version '2.3.2'


Corosync configuration:
token: 1
#token_retransmits_before_loss_const: 10
consensus: 15000
join: 1000
send_join: 80
merge: 1000
downcheck: 2000
#rrp_problem_count_timeout: 5000
max_network_delay: 150 # for azure


Some logs:

[...]
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
cib_process_diff: Diff 1.9254.1 -> 1.9255.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
cib_process_diff: Diff 1.9255.1 -> 1.9256.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
cib_process_diff: Diff 1.9256.1 -> 1.9257.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
cib_process_diff: Diff 1.9257.1 -> 1.9258.1 from local not applied
to 1.9275.1: current "epoch" is greater than required
Nov 04 17:50:18 [7985] ip-10-142-181-98 stonith-ng:   notice:
update_cib_cache_cb:  [cib_diff_notify] Patch aborted: Application of
an update diff failed (-1006)
[...]

[...]
Nov 04 17:43:24 [12176] ip-10-109-145-175crm_mon:error:
cib_native_perform_op_delegate: Couldn't perform cib_query
operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175crm_mon:error:
get_cib_copy:   Couldnt retrieve the CIB
Nov 04 17:43:24 [12176] ip-10-109-145-175crm_mon:error:
cib_native_perform_op_delegate: Couldn't perform cib_query
operation (timeout=120s): Operation already in progress (-114)
Nov 04 17:43:24 [12176] ip-10-109-145-175crm_mon:error:
get_cib_copy:   Couldnt retrieve the CIB
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
Members[32]: 14 20 31 30 8 25 18 7 4
Nov 04 17:47:40 [10599] ip-10-109-145-175 corosync notice  [MAIN  ]
Completed service synchronization, ready to provide service.
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
Members[32]: 3 27 11 29 23 21 24 9 17 12 32 13 2 10 16 15 6 28 19 1 22 26 5\
Nov 04 18:06:55 [10599] ip-10-109-145-175 corosync notice  [QUORUM]
Members[32]: 14 20 31 30 8 25 18 7 4
[...]

[...]
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:   notice:
update_cib_cache_cb:[cib_diff_notify] Patch aborted: Application of an
update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng: info:
apply_xml_diff: Digest mis-match: expected
01192e5118739b7c33c23f7645da3f45, calculated
f8028c0c98526179ea5df0a2ba0d09de
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:  warning:
cib_process_diff:   Diff 1.15046.2 -> 1.15046.3 from local not applied
to 1.15046.2: Failed application of an update diff
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:   notice:
update_cib_cache_cb:[cib_diff_notify] Patch aborted: Application of an
update diff failed (-1006)
Nov 04 18:21:15 [17749] ip-10-178-149-131 stonith-ng:   notice:
cib_process_diff:   Diff 1.15046.2 -> 1.15046.3 from local not applied
to 1.15046.3: current "num_updates" is greater than required
[...]


P.S. Sorry if this should have been posted on the corosync list; it is just
that the CIB synchronization fails, so this group seemed to me the right place.

-- 
Best Regards,

Radoslaw Garbacz
___
Users mai