[ClusterLabs] Failover caused by internal error?

2016-11-25 Thread Sven Moeller
Hi,

today we encountered a failover on our NFS cluster. Our first suspicion was a 
hardware outage, but that was not the case. The failing node was fenced (rebooted), 
and the failover itself went as expected. So far so good. But while digging through 
the logs of the failed node I found error messages saying that lrmd was not 
responding, that crmd could not recover from an internal error, and a generic 
Pacemaker error (201). See the logs below.

It seems that one of the two corosync rings was flapping, but that shouldn't be 
the cause of such behavior, should it?
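
For reference, the ring state can be checked on the node itself with 
corosync-cfgtool; the output below is only an illustrative sketch, the addresses 
are placeholders:

# corosync-cfgtool -s
Printing ring status.
Local node ID 168230914
RING ID 0
        id      = 192.168.x.x
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.x.x.x
        status  = ring 1 active with no faults

A ring that stays FAULTY can be re-enabled with "corosync-cfgtool -r"; in the 
log below corosync recovered ring 1 on its own about a second later.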

The cluster is running on openSUSE 13.2 with the following packages installed:

# rpm -qa | grep -Ei "(cluster|pacemaker|coro)"
pacemaker-1.1.12.git20140904.266d5c2-1.5.x86_64
cluster-glue-1.0.12-14.2.1.x86_64
corosync-2.3.4-1.2.x86_64
pacemaker-cts-1.1.12.git20140904.266d5c2-1.5.x86_64
libpacemaker3-1.1.12.git20140904.266d5c2-1.5.x86_64
libcorosync4-2.3.4-1.2.x86_64
pacemaker-cli-1.1.12.git20140904.266d5c2-1.5.x86_64

Storage devices are connected via fibre channel using multipath.
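
(If the storage layer were suspected, the multipath topology could be inspected 
with, for example:

# multipath -ll

This is just a generic check; device names and WWIDs are of course site specific.)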

Regards,
Sven

2016-11-25T10:42:49.499255+01:00 nfs2 systemd[1]: Reloading.
2016-11-25T10:42:56.53+01:00 nfs2 corosync[30260]:   [TOTEM ] Marking ringid 1 interface 10.x.x.x FAULTY
2016-11-25T10:42:57.334657+01:00 nfs2 corosync[30260]:   [TOTEM ] Automatically recovered ring 1
2016-11-25T10:43:39.507268+01:00 nfs2 crmd[7661]:   notice: process_lrm_event: Operation NFS-Server_monitor_3: unknown error (node=nfs2, call=103, rc=1, cib-update=54, confirmed=false)
2016-11-25T10:43:39.521944+01:00 nfs2 crmd[7661]:    error: crm_ipc_read: Connection to lrmd failed
2016-11-25T10:43:39.524644+01:00 nfs2 crmd[7661]:    error: mainloop_gio_callback: Connection to lrmd[0x1128200] closed (I/O condition=17)
2016-11-25T10:43:39.525093+01:00 nfs2 pacemakerd[30267]:    error: pcmk_child_exit: Child process lrmd (7660) exited: Operation not permitted (1)
2016-11-25T10:43:39.525554+01:00 nfs2 pacemakerd[30267]:   notice: pcmk_process_exit: Respawning failed child process: lrmd
2016-11-25T10:43:39.525956+01:00 nfs2 crmd[7661]:     crit: lrm_connection_destroy: LRM Connection failed
2016-11-25T10:43:39.526383+01:00 nfs2 crmd[7661]:    error: do_log: FSA: Input I_ERROR from lrm_connection_destroy() received in state S_NOT_DC
2016-11-25T10:43:39.526784+01:00 nfs2 crmd[7661]:   notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=lrm_connection_destroy ]
2016-11-25T10:43:39.527186+01:00 nfs2 crmd[7661]:  warning: do_recover: Fast-tracking shutdown in response to errors
2016-11-25T10:43:39.527569+01:00 nfs2 crmd[7661]:    error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
2016-11-25T10:43:39.527952+01:00 nfs2 crmd[7661]:    error: lrm_state_verify_stopped: 1 resources were active at shutdown.
2016-11-25T10:43:39.528330+01:00 nfs2 crmd[7661]:   notice: do_lrm_control: Disconnected from the LRM
2016-11-25T10:43:39.528732+01:00 nfs2 crmd[7661]:   notice: terminate_cs_connection: Disconnecting from Corosync
2016-11-25T10:43:39.547847+01:00 nfs2 lrmd[29607]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
2016-11-25T10:43:39.637693+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7c0
2016-11-25T10:43:39.638403+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7c0
2016-11-25T10:43:39.641012+01:00 nfs2 crmd[7661]:    error: crmd_fast_exit: Could not recover from internal error
2016-11-25T10:43:39.649180+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7c4
2016-11-25T10:43:39.649926+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7c4
2016-11-25T10:43:39.651809+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7c9
2016-11-25T10:43:39.652751+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7c9
2016-11-25T10:43:39.659130+01:00 nfs2 pacemakerd[30267]:    error: pcmk_child_exit: Child process crmd (7661) exited: Generic Pacemaker error (201)
2016-11-25T10:43:39.660663+01:00 nfs2 pacemakerd[30267]:   notice: pcmk_process_exit: Respawning failed child process: crmd
2016-11-25T10:43:39.661114+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7ca
2016-11-25T10:43:39.662825+01:00 nfs2 corosync[30260]:   [TOTEM ] Retransmit List: 7cb
2016-11-25T10:43:39.672065+01:00 nfs2 crmd[29609]:   notice: crm_add_logfile: Additional logging available in /var/log/pacemaker.log
2016-11-25T10:43:39.673427+01:00 nfs2 crmd[29609]:   notice: main: CRM Git Version: 1.1.12.git20140904.266d5c2
2016-11-25T10:43:39.684597+01:00 nfs2 crmd[29609]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
2016-11-25T10:43:39.703718+01:00 nfs2 crmd[29609]:   notice: get_node_name: Could not obtain a node name for corosync nodeid 168230914
2016-11-25T10:43:39.713944+01:00 nfs2 crmd[29609]:   notice: get_node_name: Defaulting to uname -n for the local corosync node name
2016-11-25T10:43:39.724509+01:00 nfs2 stonithd[30270]:   notice: can_fence_host_with_device: 

Re: [ClusterLabs] Adding and removing a node dynamically

2015-10-02 Thread Sven Moeller
Hi,

what do you mean by add or remove? Do you want to remove a node from the cluster 
completely, so that it is no longer a cluster member at all? Or do you just want 
to take it out temporarily for maintenance?
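
Just so we're talking about the same thing: with crmsh the two cases would look 
roughly like this (nodeX is a placeholder, so treat this only as a sketch):

Temporarily, e.g. for maintenance:

  crm node standby nodeX
  crm node online nodeX

Permanently removing it (stop pacemaker and corosync on that node first, then 
run this on one of the remaining nodes):

  crm_node -R nodeX --force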

Regards,
sven

On 02.10.2015 at 09:47, Vijay Partha wrote:
>
> Hi,
>  
> I would like to add and remove a node dynamically in Pacemaker. Which commands 
> do I need to run to do this?
>  
> Thanking you
>
> -- 
> With Regards
> P.Vijay