Re: [ClusterLabs] [Pacemaker] Beginner | Resources stuck unloading

2015-12-18 Thread Tyler Hampton
>But generally, stonith-enabled=false can lead to error recovery problems
and make trouble harder to diagnose. If you can take the time to get
stonith working, it should at least stop your first problem from causing
further problems.

Yeah, I suspect a lot of my broken post-failover cluster state is down to
not having implemented fencing yet. The nodes are VMs running on a Proxmox
host, and I was hoping to get a proof of concept of failover working before
implementing STONITH. Right now I'm mostly figuring out how to recover
cluster state.
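In case it helps when you get to it: for guests on a Proxmox host, the
fence-agents package on some distros ships a fence_pve agent that power-cycles
VMs through the Proxmox API. A rough, hypothetical crmsh sketch -- every
address, credential, and VM ID below is a placeholder, not something from this
thread, and you should check which agents your Ubuntu 14.04 packages actually
install (e.g. with `stonith_admin --list-installed`):

```
crm configure primitive st-pve stonith:fence_pve \
    params ip=192.0.2.10 login=root@pam passwd=secret \
    pcmk_host_map="node1:101;node2:102"   # cluster node -> Proxmox VM ID
crm configure property stonith-enabled=true
```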

>If you're using corosync 2, you can set "two_node: 1" in corosync.conf,
and delete the no-quorum-policy=ignore setting in Pacemaker. It won't
make a huge difference, but corosync 2 can handle it better now.

I'll look into doing this. I am running Corosync 2 and Pacemaker 1.1.10,
as provided by Ubuntu 14.04's repositories.

>If you are doing a planned failover, a better way would be to put the
node into standby mode first, then stop pacemaker.

Yeah, I figured this out later. Putting the node into standby first gave me
a higher success rate when failing over resources.

Right now it's just so difficult to get the cluster back to two online
nodes with one node running resources. I've tried a ground-zero approach
where I kill every process and every service Pacemaker is supposed to
manage and then start everything up again, and I've tried clearing node
state, but I keep getting NODE: OFFLINE and crmd refusing to stop itself.
There are a lot of tutorials on getting a cluster running, but not many
guides for when your cluster is fubar.
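For what it's worth, with crmsh a recovery attempt might look roughly like
the sequence below. The resource and node names are placeholders, and you
should verify each subcommand exists in your crmsh version before relying
on it:

```
crm_mon -1                   # one-shot view of nodes and resources
crm resource cleanup p_vip   # forget the failure history for one resource
crm node clearstate node1    # tell the cluster a dead node really is down
crm resource start p_vip     # retry the start once the state is clean
```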

Thanks so much for your advice.

On Sun, Dec 13, 2015 at 10:18 PM, Tyler Hampton wrote:

> Hi!
>
> I'm currently trying to semi-follow Sebastien Han's blog post on
> implementing HA with Ceph rbd volumes and I am hitting some walls. The
> difference between what I'm trying to do and the blog post is that I'm
> trying to implement an active/passive instead of an active/active.
>
> I am able to get the two nodes to recognize each other and for a single
> node to assume resources. However, the setup is fairly finicky (I'm
> assuming due to my ignorance) and I can't get it to work most of the time.
>
> When I do get a pair and try to fail over (service pacemaker stop) the
> node that I'm stopping pacemaker on fails to unload its controlled
> resources and goes into a loop. A 'proper' failover has only happened twice.
>
> pacemaker stop output (with log output):
> https://gist.github.com/howdoicomputer/d88e224f6fead4623efc
>
> resource configuration:
> https://gist.github.com/howdoicomputer/a6f846eb54c3024a5be9
>
> Any help is greatly appreciated.
>
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Pacemaker] Beginner | Resources stuck unloading

2015-12-16 Thread Ken Gaillot
On 12/14/2015 12:18 AM, Tyler Hampton wrote:
> Hi!
> 
> I'm currently trying to semi-follow Sebastien Han's blog post on
> implementing HA with Ceph rbd volumes and I am hitting some walls. The
> difference between what I'm trying to do and the blog post is that I'm
> trying to implement an active/passive instead of an active/active.
> 
> I am able to get the two nodes to recognize each other and for a single
> node to assume resources. However, the setup is fairly finicky (I'm
> assuming due to my ignorance) and I can't get it to work most of the time.
> 
> When I do get a pair and try to fail over (service pacemaker stop) the node
> that I'm stopping pacemaker on fails to unload its controlled resources and
> goes into a loop. A 'proper' failover has only happened twice.
> 
> pacemaker stop output (with log output):
> https://gist.github.com/howdoicomputer/d88e224f6fead4623efc
> 
> resource configuration:
> https://gist.github.com/howdoicomputer/a6f846eb54c3024a5be9
> 
> Any help is greatly appreciated.

Hopefully someone with more ceph or upstart experience can give you more
specifics.

But generally, stonith-enabled=false can lead to error recovery problems
and make trouble harder to diagnose. If you can take the time to get
stonith working, it should at least stop your first problem from causing
further problems.

If you're using corosync 2, you can set "two_node: 1" in corosync.conf,
and delete the no-quorum-policy=ignore setting in Pacemaker. It won't
make a huge difference, but corosync 2 can handle it better now.
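For reference, the corosync 2 side of that would look something like the
snippet below (votequorum is corosync 2's quorum provider; note that
two_node also enables wait_for_all by default, so a rebooted node waits to
see its peer once before regaining quorum):

```
quorum {
    provider: corosync_votequorum
    two_node: 1
}
```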

If you are doing a planned failover, a better way would be to put the
node into standby mode first, then stop pacemaker. That ensures all
resources are successfully failed over first, and when the node comes
back, it lets you decide when it's ready to host resources again (by
taking it out of standby mode), which gives you time for
administration/troubleshooting/whatever reason you took it down.
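With crmsh, that planned-failover sequence might look like this (the node
name is a placeholder):

```
crm node standby node1    # drain resources to the peer first
crm_mon -1                # confirm everything is running on the other node
service pacemaker stop    # safe now; nothing left for this node to unload
# ...maintenance, reboot, whatever took the node down...
service pacemaker start
crm node online node1     # let it host resources again when you're ready
```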
