Hi Keisuke-san,
On Thu, Mar 18, 2010 at 08:34:54PM +0900, Keisuke MORI wrote:
> Hi Dejan,
> Thank you for your comments.
>
> 2010/3/16 Dejan Muhamedagic <[email protected]>:
> >> I'm going to describe the issue of the Subject: and would like to
> >> suggest some changes to the agents package (and possibly Pacemaker, too).
> >
> > Which part of pacemaker? As far as I can see only resource agents
> > and init scripts (heartbeat/corosync/openais) are involved.
>
> Pacemaker also creates /var/run/heartbeat/rsctmp at startup,
> so I thought that changing the directory might affect Pacemaker, too.
> http://hg.clusterlabs.org/pacemaker/1.0/file/e27fa62efc86/lib/ais/plugin.c#l586
Missed that one somehow. It probably got in there because openais
wasn't creating the temporary directory. But it's a kludge and
should move to the init script.
> >> When a node crashes and is rebooted, a stale state file is
> >> left over after the reboot, and hence the RA misbehaves as if the
> >> resource were already started when the cluster is launched again
> >> for the recovery.
> >
> > What exactly did you observe, i.e. in which way did the resource
> > agents misbehave?
>
> The RA returns an (unexpected) OCF_SUCCESS for the probe
> even though the resource was not started yet on the node.
>
> Steps to reproduce:
> - suppose we have a 2-node cluster, running one Dummy resource on node1.
> - do 'reboot -f' on node1 (induce a node failure)
> - Pacemaker moves the resource to node2 (expected)
> - do '/etc/init.d/corosync start' on node1 (restart the cluster for
> the recovery)
> - Pacemaker probes for the resource on node1, and it does succeed
> (unexpected)
>
> Here is an excerpt from the log on node2 (the DC) at the time of the probe.
> The full hb_report log is attached.
> {{{
> Mar 18 19:12:34 millvalley pengine: [3180]: notice: unpack_rsc_op:
> Operation prmDummy_monitor_0 found resource prmDummy active on node1
> Mar 18 19:12:34 millvalley pengine: [3180]: ERROR: native_add_running:
> Resource ocf::Dummy:prmDummy appears to be active on 2 nodes.
> Mar 18 19:12:34 millvalley pengine: [3180]: WARN: See
> http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
> information.
> }}}
Yes, that's bad enough.
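
For the record, here is a rough model of why the probe succeeds. It is
only a sketch in Python, not the agent's actual shell code, and it
assumes that Dummy's monitor does nothing more than check for its state
file. The function names, the "Dummy-<instance>.state" naming and the
temporary directory (standing in for /var/run/heartbeat/rsctmp) are
assumptions made for illustration.

```python
import os
import tempfile

# Standard OCF return codes.
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


def state_file(rsctmp, instance):
    """Path of the per-instance state file (naming is an assumption)."""
    return os.path.join(rsctmp, "Dummy-%s.state" % instance)


def start(rsctmp, instance):
    """Model of 'start': the resource is 'running' once the file exists."""
    open(state_file(rsctmp, instance), "w").close()
    return OCF_SUCCESS


def monitor(rsctmp, instance):
    """Model of 'monitor'/probe: the file is the only evidence consulted."""
    if os.path.isfile(state_file(rsctmp, instance)):
        return OCF_SUCCESS
    return OCF_NOT_RUNNING


if __name__ == "__main__":
    rsctmp = tempfile.mkdtemp()  # stand-in for the rsctmp directory
    start(rsctmp, "prmDummy")
    # 'reboot -f' happens here: no stop operation runs, and the file
    # survives because nothing clears the directory at boot.
    print(monitor(rsctmp, "prmDummy"))  # probe after the reboot
```

As long as the directory is not cleared at boot, the probe has no way to
tell a leftover file from a running resource.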
> >> FHS defines that any files under a subdirectory of /var/run
> >> should be removed at the OS bootup time.
> >
> > I hope that the standard is really followed.
>
> FHS is a part of LSB, so I believe most of the major Linux
> distributions follow this (and even Solaris does, too).
> At least RHEL5 behaves exactly so, as far as I can see.
>
>
> >> Unfortunately a second-level subdirectory is out of scope, so
> >> you cannot rely on its removal (and that's the case for
> >> /var/run/heartbeat/rsctmp).
> >
> > OK. Yes, the scheme you suggest is probably better than what we
> > currently have.
>
> Great, so will you agree with me on the changes?
Yes, I'd agree.
> Then probably I should write the patch for the changes.
>
> Or, if you have a better solution or any improvements,
> we can discuss it further.
We'd need to coordinate this with all projects (corosync,
pacemaker, heartbeat, glue, agents). That would probably be the
most difficult part.
Cheers,
Dejan
>
> Regards,
>
> Keisuke MORI
> _______________________________________________________
> Linux-HA-Dev: [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/