On 11/04/2013, at 7:57 PM, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
>>>> Andrew Beekhof <and...@beekhof.net> wrote on 11.04.2013 at 01:05 in
>>>> message <0550d2cf-e56d-4693-97cf-43c46df64...@beekhof.net>:
>
>> On 10/04/2013, at 11:54 PM, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> wrote:
>>
>>> Hi!
>>>
>>> I had a situation where one node was periodically fenced when the
>>> network was busy. The node being fenced tried to restart crmd after
>>> some problem, and shortly after rejoining the cluster, it was fenced.
>>
>> The message "Apr 5 14:14:14 h01 crmd: [13080]: ERROR: do_recover: Action
>> A_RECOVER (0000000001000000) not supported" is normal, but it should really
>> be changed, as it is misleading.
>>
>> The "real" error is above it:
>>
>>> Apr 5 14:14:14 h01 crmd: [13080]: ERROR: tengine_stonith_notify: We were
>>> alegedly just fenced by h05 for h05!
>>
>> The rest is pacemaker saying "holy heck" and trying to get out of there
>> asap.
>>
>> What agent are you using for fencing? Doesn't sound very reliable.
>>
> [...]
>
> We use sbd, and it works very reliably; in fact it fences the nodes more
> often than we like ;-)
>
> I guess there is some issue with cLVM and mirroring that floods the network
> and slows down the machine. Then some bugs in pacemaker seem to surface ;-)
>
> One example (note the delicate ordering of tokens ;-):
> crmd: [9801]: info: delete_resource: Removing resource prm_xen_v02 for
> 20705_crm_resource (root) on h01
> crmd: [9801]: WARN: decode_transition_key: Bad UUID (crm-resource-20705) in
> sscanf result (3) for 0:0:crm-resource-20705

No, that's not an ordering issue. We fixed that bug in crm_resource a while
ago.

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems