Zemke, Kai wrote:
> Hi,
> 
> I'm running a two-node failover cluster. Yesterday the cluster tried to 
> manage a state transition. In the log files I found the following entries:
> 
> heartbeat[6905]: 2009/02/10_21:45:55 WARN: node nagios-drbd2: is dead
> 
> heartbeat[6905]: 2009/02/10_21:45:55 info: Link nagios-drbd2:eth1 dead.
> 
> A few minutes later the node that was still alive tried to take over the 
> resources and created the following entries in the log file (the resource 
> "ipaddress" is an example; there are many more entries for the other 
> resources that were running on the cluster):
> 
> pengine[7370]: 2009/02/10_21:45:59 WARN: custom_action: Action 
> resource_nagios_ipaddress_stop_0 on nagios-drbd2 is unrunnable (offline)
> 
> pengine[7370]: 2009/02/10_21:45:59 WARN: custom_action: Marking node 
> nagios-drbd2 unclean
> 
> Furthermore, there are several entries saying:
> 
> stonithd[6916]: 2009/02/10_21:46:30 ERROR: Failed to STONITH the node 
> nagios-drbd2: optype=RESET, op_result=TIMEOUT
> 
> The stonith is running via ssh on a direct link between the two nodes. Since 
> Node2 was down, the shutdown command never reached its destination.

Which is why ssh stonith is not meant for production.

> My questions are:
> 
> Why did the surviving node try to stop resources on a cluster node that is 
> considered dead?
> 
> Why did STONITH try to shut down a node that is considered down? (For safety 
> reasons, I think.)

It is considered dead, but that is only an assumption, not a fact. By
shooting it (turning it off or rebooting it), the cluster turns that
assumption into a fact.

> Shouldn't the resources just be started on the alive node without any further 
> action?

Not until the cluster "knows" the other node is dead. Who knows what is
going on over there if the node cannot be reached.
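
To illustrate the ordering, here is a rough sketch in Python. It is not how
heartbeat or pengine is actually implemented; NodeState, plan_takeover and
the fence callable are made-up names, and the only point is that resources
are started only after fencing has been confirmed.

```python
from enum import Enum


class NodeState(Enum):
    ONLINE = "online"
    UNCLEAN = "unclean"  # presumed dead, but not confirmed
    FENCED = "fenced"    # confirmed dead by a successful STONITH


def plan_takeover(peer_state, fence):
    """Return (new peer state, actions) for the surviving node.

    `fence` is a hypothetical callable that returns True only when the
    STONITH device confirms the peer was reset or powered off.
    """
    if peer_state is NodeState.ONLINE:
        # Peer is healthy, nothing to take over.
        return peer_state, []
    if peer_state is NodeState.UNCLEAN:
        if not fence():
            # E.g. ssh stonith timing out because the peer is unreachable:
            # the peer stays unclean and resources stay where they are.
            return NodeState.UNCLEAN, ["retry fencing"]
        peer_state = NodeState.FENCED
    # Only a fenced peer makes it safe to start its resources here.
    return peer_state, ["start resources"]
```

With a fence callable that always times out, as in your ssh case, the sketch
never gets past "retry fencing", which matches the repeated stonithd errors
in your log.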

> Did I miss something in the default behaviour of heartbeat? Maybe a timeout?
> 
> Would a hardware STONITH device solve such problems in the future?

Yes.

Regards
Dominik
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
