On 11/04/2013, at 7:57 PM, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> 
wrote:

>>>> Andrew Beekhof <and...@beekhof.net> wrote on 11.04.2013 at 01:05 in 
>>>> message
> <0550d2cf-e56d-4693-97cf-43c46df64...@beekhof.net>:
> 
>> On 10/04/2013, at 11:54 PM, Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de> 
>> wrote:
>> 
>>> Hi!
>>> 
>>> I had a situation where one node was periodically fenced when the network 
>> was busy. The node being fenced tried to restart crmd after some problem, 
>> and shortly after rejoining the cluster, it was fenced.
>> 
>> 
>> The message "Apr  5 14:14:14 h01 crmd: [13080]: ERROR: do_recover: Action 
>> A_RECOVER (0000000001000000) not supported" is normal but should really be 
>> changed as it is misleading.
>> 
>> The "real" error is above it:
>> 
>>> Apr  5 14:14:14 h01 crmd: [13080]: ERROR: tengine_stonith_notify: We were 
>> alegedly just fenced by h05 for h05!
>> 
>> The rest is Pacemaker saying "holy heck" and trying to get out of there 
>> asap.
>> What agent are you using for fencing?  It doesn't sound very reliable.
>> 
> [...]
> 
> We use sbd, and it works very reliably; in fact it fences the nodes more 
> often than we like ;-)
> 
> I guess it is some issue with cLVM and mirroring that floods the network and 
> slows down the machine. Then some bugs in Pacemaker seem to surface ;-)
> 
> One example (note the delicate ordering of tokens ;-):
> crmd: [9801]: info: delete_resource: Removing resource prm_xen_v02 for 
> 20705_crm_resource (root) on h01
> crmd: [9801]: WARN: decode_transition_key: Bad UUID (crm-resource-20705) in 
> sscanf result (3) for 0:0:crm-resource-20705

No, that's not an ordering issue.
We fixed that bug in crm_resource a while ago.
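
The warning text itself points at the cause: the field reported as the UUID is "crm-resource-20705", which does not look like a UUID at all. A minimal Python sketch of that kind of check follows. It assumes the key is colon-separated and that its last field should be a 36-character UUID; the function name and the 36-character rule are illustrative assumptions, not Pacemaker's actual code.

```python
UUID_LEN = 36  # assumed length of a canonical UUID string (8-4-4-4-12 plus hyphens)


def check_transition_key(key):
    """Split a colon-separated key and report whether its last field is UUID-sized.

    Hypothetical helper for illustration only; it does not reproduce
    Pacemaker's decode_transition_key().
    """
    fields = key.split(":")
    last = fields[-1]
    return {
        "fields": len(fields),              # number of colon-separated fields
        "last": last,                       # the field treated as the UUID
        "uuid_ok": len(last) == UUID_LEN,   # length check only, no format check
    }


# The key from the log above: three fields, and the last one is not UUID-sized.
result = check_transition_key("0:0:crm-resource-20705")
print(result)
```

Under these assumptions the key from the log fails because of what the last field contains, not because of the order in which the fields appear.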

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
