Hi,

On Sun, Feb 24, 2008 at 10:51:25PM +0100, Johan Hoeke wrote:
> Dejan Muhamedagic wrote:
> 
> > 
> > On Fri, Feb 22, 2008 at 05:29:08PM +0100, Johan Hoeke wrote:
> >> Dejan Muhamedagic wrote:
> >>> Hi,
> >> <snip>
> >>> But the stonith resource is monitored? How did it fail?
> 
> an incomplete iptables config was pushed by mistake; it caused errors
> during an upgrade from 2.1.2 to 2.1.3:
> 
> (2 node cluster, hosts cauchy and condorcet)
> 
> 08:51 condorcet is updated to heartbeat 2.1.3 and is coming up from a
> reboot. Bad iptables rules are activated, and condorcet can't see
> cauchy's heartbeat:
> 
> Feb 13 08:51:58 condorcet heartbeat: [3752]: WARN: node cauchy.uvt.nl:
> is dead
> 
> 08:52 condorcet shoots cauchy
> 
> *The story would have ended here if the stonith action had been power
> off, or if heartbeat hadn't been started on reboot, but alas. We will
> choose one of the two options to avoid future trouble.*
> 
> Feb 13 08:52:36 condorcet pengine: [4549]: WARN: stage6: Scheduling Node
> cauchy.uvt.nl for STONITH
> 
> Feb 13 08:52:44 condorcet tengine: [4548]: info: te_fence_node:
> Executing reboot fencing operation (34) on cauchy.uvt.nl (timeout=50000)
> 
> Feb 13 08:52:44 condorcet stonithd: [4542]: info:
> stonith_operate_locally::2375: sending fencing op (RESET) for
> cauchy.uvt.nl to device external (rsc_id=R_ilo_cauchy:0, pid=4752)
> 
> Feb 13 08:52:44 condorcet pengine: [4549]: notice: StartRsc:
> condorcet.uvt.nl  Start R_san_oradata
> 
> condorcet starts the resource that mounts the SAN disk
> *I would very much prefer that it waits until the stonith is done
> and has succeeded before it does this!*

This is obviously a bug. It has already been reported and
supposedly fixed before 2.1.3. Please attach this report to the
bug below and reopen it:

http://developerbugs.linux-foundation.org/show_bug.cgi?id=1768

> Feb 13 08:52:44 condorcet pengine: [4549]: notice: StartRsc:
> condorcet.uvt.nl  Start R_san_oradata
> 
> *only now, at 08:52:49, has the stonith succeeded*
> 
> Feb 13 08:52:49 condorcet stonithd: [4542]: info: Succeeded to STONITH
> the node cauchy.uvt.nl: optype=RESET. whodoit: condorcet.uvt.nl
> 
> condorcet continues to mount the SAN partition, as it should:
> 
> Feb 13 08:52:44 condorcet Filesystem[4805]: [4835]: INFO: Running start
> for /dev/mapper/san-oradata on /var/oracle/oradata
> 
> BUT:
> 
> 08:55 cauchy comes up, the bad iptables settings activate, and cauchy
> can't see heartbeat from condorcet:
> 
> Feb 13 08:55:42 cauchy heartbeat: [3801]: WARN: node condorcet.uvt.nl:
> is dead
> 
> 08:56 cauchy wants to mount the f/o attached SAN partition:
> *Ideally, this should only be done if cauchy is sure that condorcet
> is really dead, i.e. after the stonith has succeeded!*
> 
> Feb 13 08:56:23 cauchy pengine: [4596]: notice: StartRsc:  cauchy.uvt.nl
>        Start R_san_oradata
> 
> The stonith action starts after the resource for the SAN partition:
> 
> Feb 13 08:56:30 cauchy stonithd: [4589]: info: client tengine [pid:
> 4595] want a STONITH operation RESET to node condorcet.uvt.nl.
> 
> Feb 13 08:56:30 cauchy tengine: [4595]: info: te_fence_node: Executing
> reboot fencing operation (32) on condorcet.uvt.nl (timeout=50000)
> 
> *at this moment in time the filesystem is corrupted because it is
> mounted on both nodes at the same time*
> 
> Feb 13 08:56:30 cauchy Filesystem[4823]: [4852]: INFO: Running start for
> /dev/mapper/san-oradata on /var/oracle/oradata
> 
> this action times out:
> 
> Feb 13 08:57:20 cauchy stonithd: [4589]: ERROR: Failed to STONITH the
> node condorcet.uvt.nl: optype=RESET, op_result=TIMEOUT
> Feb 13 08:57:20 cauchy tengine: [4595]: ERROR: tengine_stonith_callback:
> Stonith of condorcet.uvt.nl failed (2)... aborting transition.

And here the CRM waited for the stonith to finish. Strange.

> but that is no longer relevant.
> 
> conclusion:
> 
> As Dejan mentioned, setting heartbeat not to start automatically, or
> changing the stonith action to power off would have saved the day.

This should be documented as best practice for two-node clusters.
IIRC, there has already been some discussion of this issue on the
list.
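
For example, something along these lines (untested here, and the exact
option names/values can differ between releases, so please double-check
against your version before applying):

  # Make a fenced node stay down instead of being rebooted
  # (I believe the pengine option is stonith-action; newer code
  # calls the value "off" rather than "poweroff")
  crm_attribute -t crm_config -n stonith-action -v poweroff

  # And/or keep heartbeat from starting automatically at boot, so a
  # fenced node stays down until somebody has had a look at it
  chkconfig heartbeat off            # Red Hat/SUSE style init
  update-rc.d -f heartbeat remove    # Debian style init

Either of the two should be enough to keep the fenced node from coming
back on its own and shooting the survivor.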

> I am curious about the timing of some of the actions though,
> particularly that a node seems to continue with its start actions even
> though the success or failure of the stonith action has not been
> confirmed. Could be that I'm interpreting the logs incorrectly.

Your interpretation's right.

You should also try ciblint to check the CIB.
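
Something like this should do (I'm assuming here that ciblint accepts
a CIB file as an argument; check its usage output for the exact
invocation in your release):

  # dump the live CIB to a file, then run the lint checks on it
  cibadmin -Q > /tmp/cib.xml
  ciblint /tmp/cib.xml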

Thanks,

Dejan

> thanks and regards,
> 
> Johan

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
