Re: [Pacemaker] Exec Failure issues.

2011-10-21 Thread Florian Haas
On 2011-10-19 11:56, James Horsfall (CTR) wrote:
> The allow-migrate was just something I was trying, and as you point out
> it produced that "unimplemented feature" error. Without the allow-migrate
> option I just get an "exec timeout error", which is doubly frustrating,
> since it's simply failing to unload the IP addresses from the failed
> node. It knows it's supposed to migrate,

Correct.

> it starts to

Semi-correct. It only attempts to shut down the resource on one node. It
never tries to start on the other.

> and then goes nuts

Far from it.

> into the error timeout "unmanaged" state.

Working perfectly as designed. If a resource fails to stop, the cluster
has to assume that it still has access to shared resources. It is thus
unsafe to recover the resource on another node, and the cluster freezes
that resource.

> At that point I have to
> essentially restart the nodes to clear the error or clear the fail
> counts from the "IPS" element just to watch it explode all over again.
> 
>  The setup of stonith seems extreme by its description.

STONITH serves three purposes:

1. remove a node from the cluster that has stopped responding;
2. remove a node from a cluster that is not relinquishing resources
despite being told to;
3. lock access to shared resources while said removal is pending.

You're hitting #2, and you don't have STONITH configured where you should.
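
Just to illustrate the shape of a STONITH configuration -- and this is a
rough sketch off the top of my head, assuming your boxes have IPMI or some
other out-of-band power control; every value below is a placeholder, not
anything from your setup -- in crm shell syntax it might look roughly like
this:

  primitive fence-nodeA stonith:external/ipmi \
    params hostname=nodeA ipaddr=192.0.2.1 userid=admin passwd=secret interface=lan \
    op monitor interval=60s
  location l-fence-nodeA fence-nodeA -inf: nodeA
  property stonith-enabled=true

One fencing primitive per node, each constrained to never run on the node
it is supposed to shoot.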

> Will this
> properly bring nodes back online after cables are re-plugged?

Of course.

> I'm not sharing data; I'm just setting up a network pass-through server,
> basically a router in a sense.

Note: what follows is not safe for general use. If anyone pulls this
from the list archives and implements it in their own cluster, don't
blame me for any unexpected results, up to and including a meteor
strike, unless you've asked at http://www.hastexo.com/help and we've
actually given you a green light that this is OK to use.

See
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-operations.html.
What you can do -- but let me reiterate, I strongly recommend against
this; you really should set up STONITH instead -- is add
"on-fail=ignore" to the definition of the "stop" operation for those
resources. That may, of course, lead to duplicate IP addresses flying
around your cluster. Have I mentioned that I recommend against this?
Well, I recommend against this.
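
For concreteness only -- the names, address, netmask and timeouts below
are placeholders I just made up, not anything from your configuration --
an IPaddr2 resource with its stop failures ignored would look roughly
like this in crm shell syntax:

  primitive ip-service ocf:heartbeat:IPaddr2 \
    params ip=192.0.2.10 cidr_netmask=24 \
    op monitor interval=10s timeout=20s \
    op stop interval=0 timeout=20s on-fail=ignore

Again: with this in place a node that fails to release the address keeps
it, and the peer may bring up the same address anyway. You have been
warned.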

Note also that I've never seen an exec timeout error in the stop op for
IPaddr2 in the wild, except for really trivial setup errors, and I've
deployed or reviewed scores of clusters using it. So another possibility
is that your testing procedure is so far off anything that would ever
happen in the real world that you're simply creating an uncaught error
condition.

I'd have to look at the logs to be sure, though; the procedure for
submitting those is explained at the "help" URL I mentioned above.

Hope this helps,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] [Problem] The attrd does not sometimes stop.

2011-10-21 Thread Alan Robertson

On 10/20/2011 07:30 PM, renayama19661...@ybb.ne.jp wrote:

Hi Alan,

Thank you for comment.

We are trying to reproduce the problem too, and will send a report.
However, the problem has not reappeared so far.

I gather that the folks on the test team for my project have it happen 
fairly often when they're in a certain stage of testing.  I expect to 
get some hb_report output from them in a week or two.  I have put in a 
link to Andrew's bug system from ours so that hopefully when the time 
comes we will be able to remember what to do ;-)
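
(For anyone following along: what I'll ask them to run is roughly the
following -- the time window and destination path are just examples, not
anything from their setup --

  hb_report -f "2011-10-20 18:00" -t "2011-10-20 20:00" /tmp/attrd-hang-report

which collects logs and cluster state from the nodes into a tarball we
can attach to the bug.)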


We had not narrowed it down to attrd being the component that didn't 
stop - but looking at the logs for what they did report, it seemed like 
the likely suspect.  I had already decided that it looked like the most 
likely candidate before I saw your email.


They had put in a workaround of just killing everything - which of 
course works ;-).  At the place where it hung, all the resources were 
already stopped, so it was safe - just a bit of overkill (beyond the 
minimum necessary).
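
(To be clear about what "killing everything" amounts to: this is my
sketch of their workaround, not a recommendation, and it was only
tolerable because every resource was already stopped when it hung.
Daemon names are from memory for the 1.1 series; check ps for the exact
set on your build:

  killall -9 crmd pengine attrd cib stonithd lrmd
  /etc/init.d/corosync stop    # or /etc/init.d/heartbeat stop, per stack

A hung stop is still a bug; this just clears the wreckage.)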



--
Alan Robertson

"Openness is the foundation and preservative of friendship...  Let me claim from you 
at all times your undisguised opinions." - William Wilberforce



[Pacemaker] Anybody successful with SBD on Debian?

2011-10-21 Thread mark - pacemaker list
Hello,

I'm trying to get STONITH via SBD working on Debian Squeeze.  All of the
components seem to be there, and I've very carefully followed the guide at
http://www.linux-ha.org/wiki/SBD_Fencing. Where things seem to fall down is
that there's nothing at all in corosync's init script to start SBD, and
according to the SBD page it should be started by the cluster's init script.
What I end up with is Pacemaker showing that my stonith-SBD primitive is
alive and running, when in reality there is no sbd process on any node at
all.


Last updated: Fri Oct 21 11:17:16 2011
Stack: openais
Current DC: xen2 - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
3 Resources configured.


Online: [ xen2 xen1 ]

 historyslave3 (ocf::heartbeat:Xen): Started xen2
 historydb (ocf::heartbeat:Xen): Started xen1
 stonith-SBD (stonith:external/sbd): Started xen2


Has anybody else encountered this and solved it already?  I like the idea of
corosync's script handling it, so the cluster stack doesn't even start if the
SBD device is unavailable.  However, since the Debian init script doesn't
know about SBD, is it safe to simply start it manually on boot?  If so, I
suppose it needs an init script rather than an rc.local entry, so it gets
started before corosync?
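
For what it's worth, what I had in mind is something along these lines --
completely untested, the device path is a placeholder, and the sbd flags
are just my reading of the SBD_Fencing page, so please correct me if the
invocation is wrong:

  #!/bin/sh
  ### BEGIN INIT INFO
  # Provides:          sbd
  # Required-Start:    $local_fs
  # Required-Stop:     $local_fs
  # X-Start-Before:    corosync
  # Default-Start:     2 3 4 5
  # Default-Stop:      0 1 6
  # Short-Description: Start the SBD fencing daemon before corosync
  ### END INIT INFO

  SBD_DEVICE="/dev/disk/by-id/REPLACE-with-your-shared-disk"
  SBD_OPTS="-W"   # use the hardware watchdog, as the wiki page suggests

  case "$1" in
    start)
      /usr/sbin/sbd -d "$SBD_DEVICE" $SBD_OPTS watch
      ;;
    stop)
      killall sbd
      ;;
  esac

That way sbd would be up (or at least complain at boot) before corosync
and Pacemaker ever start.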

Thanks,
Mark