Re: [Pacemaker] pacemaker shutdown issue after OS change

2014-02-16 Thread Andrew Beekhof

On 31 Jan 2014, at 3:18 am, Pascal BERTON  wrote:

> Hi !
>  
> I recently moved my PCMK clusters from a RHEL6.0 equivalent to SL6.4, and 
> with that from pcmk 1.1.2+corosync 1.3.3 to the pcmk 1.1.10+corosync 1.4.1 
> that ship with SL6.
> Until now, I managed my pcmk+corosync layers directly from my own init 
> scripts, so I globally disabled pacemaker using chkconfig. It was my scripts' 
> duty to start the cluster layer at boot and stop it at node shutdown, and it 
> all worked fine for a couple of years!
> With the new platform, I now see something odd at node shutdown: although 
> both the pacemaker and corosync services are disabled for every runlevel 
> (and, BTW, if I disable my own service they won't start at boot), 
> pacemaker's init script is called as soon as I run the shutdown -r command.
> After digging into what's happening, I can clearly see that, although 
> disabled, pacemaker is still called immediately and does its job of stopping 
> the cluster and the related resources. Then, when my own script runs (1st in 
> the shutdown script list) and tries to interact with the cluster layer, it is 
> too late: the cluster has already died and my script's actions are not 
> handled correctly.
>  
> I can work around that by integrating my own shutdown sequence directly into 
> the pacemaker init script, but I'd prefer to keep things clean by maintaining 
> them separately.
>  
> Is this behavior something expected

No

> (Or is it me? Wouldn't be the first time... :) )? What launches that disabled 
> script? Can I change that?

Very good question.
It could be chkconfig doing something "smart" with the LSB metadata:

# Default-Start:  2 3 4 5
# Default-Stop:   0 1 6

There was a subsequent commit to replace that with:

# Default-Start:
# Default-Stop:

which may help, as may running:

/sbin/chkconfig --del pacemaker

instead of

/sbin/chkconfig pacemaker off
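
To check which variant of the LSB header an installed init script carries, a minimal Python sketch follows. It assumes the header sits between the standard LSB `### BEGIN INIT INFO` / `### END INIT INFO` markers. The function name `lsb_defaults` and the sample text are invented for illustration and are not part of pacemaker or chkconfig.

```python
import re

# Matches the two LSB keywords discussed in this thread.
HEADER_LINE = re.compile(r"^#\s*(Default-Start|Default-Stop):(.*)$")


def lsb_defaults(script_text):
    """Return the Default-Start/Default-Stop runlevels found in the LSB block."""
    defaults = {}
    in_block = False
    for line in script_text.splitlines():
        if line.startswith("### BEGIN INIT INFO"):
            in_block = True
        elif line.startswith("### END INIT INFO"):
            break
        elif in_block:
            match = HEADER_LINE.match(line)
            if match:
                # An empty value yields an empty runlevel list.
                defaults[match.group(1)] = match.group(2).split()
    return defaults


if __name__ == "__main__":
    # Hypothetical excerpt; a real script has more keywords in this block.
    sample = """#!/bin/bash
### BEGIN INIT INFO
# Provides:       pacemaker
# Default-Start:  2 3 4 5
# Default-Stop:   0 1 6
### END INIT INFO
"""
    print(lsb_defaults(sample))
```

On a real node the text would come from the init script itself, for example `open("/etc/init.d/pacemaker").read()`; that path is an assumption and may differ per distribution. Empty lists for both keys would indicate the newer header.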


>  
> Thanks for your help !
>  
> Regards,
>  
> Pascal.
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org





Re: [Pacemaker] Pacemaker shutdown issue

2009-11-30 Thread Tony Bunce
>That looks like a problem in the resource agent. Most probably
>you hit the bug 2219 which has been fixed on November 9.

I applied the patch and that appears to have fixed the problem!  I haven't 
tried a reboot yet but I can migrate between nodes without any issue.


>This most probably came from the crm shell. The check has been relaxed
>in the meantime (see bugzilla ).

That's exactly the problem I was seeing.  I switched to the correct monitor 
commands (including the role) and that fixed the problem.
Both clusterlabs.org and drbd.org show the syntax without a role specified.

Thanks again for the help.  It looks like there is all kinds of good info in 
bugzilla.  I'll be sure to check that out first when I run into a problem.  (It 
doesn't look like Google or Bing index the bugzilla site)

-Tony



Re: [Pacemaker] Pacemaker shutdown issue

2009-11-30 Thread Dejan Muhamedagic
Hi,

On Mon, Nov 30, 2009 at 12:03:22PM -0500, Tony Bunce wrote:
> >The upgrade should really be transparent. What problems did you
> >encounter with nfsserver?
> 
> Whenever one of the nodes takes over the nfs resource, it doesn't start up 
> the first time and gives this error:
> 
> nfs_server_monitor_0 (node=nfs1, call=11, rc=2, status=complete): invalid 
> parameter

That looks like a problem in the resource agent. Most probably
you hit the bug 2219 which has been fixed on November 9.

> If I run this command it starts up instantly and doesn't have any problems
> until the service gets migrated again:
> crm_resource -C -r nfs_server
> 
> 
> Here is that resource from my config:
> primitive nfs_server ocf:heartbeat:nfsserver \
> params nfs_init_script="/etc/init.d/nfs" \
> params nfs_notify_cmd="/sbin/rpc.statd" \
> params nfs_shared_infodir="/var/lib/nfs" \
> params nfs_ip="10.1.1.150" \
> op monitor interval="30s"
> 
> 
> I haven't tested yet but was going to switch from ocf:heartbeat:nfsserver to
> lsb:nfs to see if that fixes the problem.

My recommendation would be to stay with the OCF resource agents
whenever possible. Those should in general be of higher quality
than init scripts.

> I also had something like this in my config:
> primitive drbd_r0 ocf:heartbeat:drbd \
> params drbd_resource="r0" \
>   op monitor="30s"
> 
> That also gave me an error (I think it was "action monitor_0 does not exist").

This most probably came from the crm shell. The check has been relaxed
in the meantime (see bugzilla ).

> I think that needs to be switched to this:
> primitive drbd_r0 ocf:linbit:drbd \
>   params drbd_resource="r0" \
>   op monitor interval="29s" role="Master" timeout="30s" \
>   op monitor interval="30s" role="Slave" timeout="30s"

Yes, that's what I expected the right usage would be.
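
The two monitor operations in that definition use different intervals (29s and 30s). Pacemaker expects each monitor operation on a resource to have a distinct interval, so the per-role monitors cannot share one. A minimal Python sketch of a sanity check for this follows. The helper names `monitor_ops` and `duplicate_intervals` are invented for illustration, the parsing is a naive regex over crm-shell-style text, and it is not a substitute for the crm shell's own validation.

```python
import re
from collections import Counter

# Naive matchers for crm-shell-style text; not a full parser.
OP_MONITOR = re.compile(r"op\s+monitor\s+(.*)")
ATTR = re.compile(r'(\w+)="([^"]*)"')


def monitor_ops(config_text):
    """Collect the attributes of every 'op monitor ...' line as dicts."""
    ops = []
    for line in config_text.splitlines():
        match = OP_MONITOR.search(line)
        if match:
            ops.append(dict(ATTR.findall(match.group(1))))
    return ops


def duplicate_intervals(ops):
    """Return the interval values that are used by more than one monitor op."""
    counts = Counter(op["interval"] for op in ops if "interval" in op)
    return sorted(value for value, n in counts.items() if n > 1)


if __name__ == "__main__":
    config = '''primitive drbd_r0 ocf:linbit:drbd \\
      params drbd_resource="r0" \\
      op monitor interval="29s" role="Master" timeout="30s" \\
      op monitor interval="30s" role="Slave" timeout="30s"
    '''
    ops = monitor_ops(config)
    print(ops)
    print(duplicate_intervals(ops))
```

An empty list from `duplicate_intervals` means no two monitor operations share an interval.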

Thanks,

Dejan



Re: [Pacemaker] Pacemaker shutdown issue

2009-11-30 Thread Tony Bunce
>The upgrade should really be transparent. What problems did you
>encounter with nfsserver?

Whenever one of the nodes takes over the nfs resource, it doesn't start up 
the first time and gives this error:

nfs_server_monitor_0 (node=nfs1, call=11, rc=2, status=complete): invalid 
parameter

If I run this command it starts up instantly and doesn't have any problems 
until the service gets migrated again:
crm_resource -C -r nfs_server


Here is that resource from my config:
primitive nfs_server ocf:heartbeat:nfsserver \
params nfs_init_script="/etc/init.d/nfs" \
params nfs_notify_cmd="/sbin/rpc.statd" \
params nfs_shared_infodir="/var/lib/nfs" \
params nfs_ip="10.1.1.150" \
op monitor interval="30s"


I haven't tested yet but was going to switch from ocf:heartbeat:nfsserver to 
lsb:nfs to see if that fixes the problem.


I also had something like this in my config:
primitive drbd_r0 ocf:heartbeat:drbd \
params drbd_resource="r0" \
op monitor="30s"

That also gave me an error (I think it was "action monitor_0 does not exist").

I think that needs to be switched to this:
primitive drbd_r0 ocf:linbit:drbd \
  params drbd_resource="r0" \
  op monitor interval="29s" role="Master" timeout="30s" \
  op monitor interval="30s" role="Slave" timeout="30s"



Re: [Pacemaker] Pacemaker shutdown issue

2009-11-30 Thread Dejan Muhamedagic
Hi,

On Mon, Nov 30, 2009 at 11:12:55AM -0500, Tony Bunce wrote:
> I have updated to pacemaker 1.0.6 and openais 1.1.0 and it looks like that
> might have fixed the problem.
> 
> The upgrade doesn't like some of my configuration (ocf:heartbeat:nfsserver
> isn't happy)

The upgrade should really be transparent. What problems did you
encounter with nfsserver?

Thanks,

Dejan


> but once I get that worked out I'll be able to confirm that the
> new version fixes the issue.
> 
> > if you don't see log messages of the form "lrmd.*stop.*" then there is 
> > probably a bug.
> 
> I don't think I was seeing those messages before the upgrade. 
> 
> Thanks for the help!
> 
> -Tony
> 


Re: [Pacemaker] Pacemaker shutdown issue

2009-11-30 Thread Tony Bunce
I have updated to pacemaker 1.0.6 and openais 1.1.0 and it looks like that 
might have fixed the problem.

The upgrade doesn't like some of my configuration (ocf:heartbeat:nfsserver 
isn't happy) but once I get that worked out I'll be able to confirm that the 
new version fixes the issue.

> if you don't see log messages of the form "lrmd.*stop.*" then there is 
> probably a bug.

I don't think I was seeing those messages before the upgrade. 

Thanks for the help!

-Tony



Re: [Pacemaker] Pacemaker shutdown issue

2009-11-30 Thread Dejan Muhamedagic
Hi,

On Mon, Nov 30, 2009 at 12:04:00AM -0500, Tony Bunce wrote:
> Hi Everyone,
> 
> I'm having an issue with pacemaker and was hoping someone could point me in 
> the right direction.
> 
> I'm using pacemaker with openais on a set of NFS servers.  Every time I 
> reboot the primary I get a split brain in DRBD.
> 
> From what I can tell, when openais is shutting down it doesn't stop the 
> services it is controlling, so as far as DRBD is concerned it is the same as 
> a hard shutdown.
> 
> I can reproduce the problem by stopping OpenAIS ("service openais stop" or 
> "/etc/init.d/openais stop") and seeing that the controlled services (DRBD, 
> file systems, nfs, etc.) are still running.
> 
> I think this is the exact same problem:
> http://www.gossamer-threads.com/lists/linuxha/pacemaker/59384
> 
> Version Info:
> CentOS 5.4 x64
> drbd83-8.3.2-6.el5_3
> openais-0.80.6-8.el5_4.1
> pacemaker-1.0.5-4.1
> 
> Is there something special that needs to be configured so that when openais
> stops it stops all of the resources?

No. The sequence of events is that openais tells crmd that shutdown
is pending, then crmd will try to stop all resources which are running
on the node. It may happen, usually with resources which are broken for
whatever reason, that the shutdown is escalated and that crmd gives up
on waiting for resources to stop. At any rate, if you don't see log
messages of the form "lrmd.*stop.*" then there is probably a bug.
Please make a hb_report and file a bugzilla.
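
A minimal Python sketch of the log check described above: it scans log lines for the "lrmd.*stop.*" pattern. The function name `find_stop_messages` and the sample lines are invented for illustration; real lrmd messages and the log file location will differ per system.

```python
import re

# Pattern quoted in the reply above.
STOP_PATTERN = re.compile(r"lrmd.*stop.*")


def find_stop_messages(lines):
    """Return the log lines in which lrmd reports a stop operation."""
    return [line.rstrip("\n") for line in lines if STOP_PATTERN.search(line)]


if __name__ == "__main__":
    # Invented sample lines, only to exercise the pattern.
    sample = [
        "Nov 30 00:01:02 nfs1 lrmd: info: rsc:nfs_server: stop\n",
        "Nov 30 00:01:03 nfs1 crmd: info: shutdown requested\n",
    ]
    for hit in find_stop_messages(sample):
        print(hit)
```

To run it against a real log, pass an open file object instead of the sample list, e.g. `find_stop_messages(open(path))`, where `path` is whichever file the cluster stack logs to on that node. An empty result after a shutdown attempt would match the "probably a bug" case.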

Thanks,

Dejan

> Thanks for the help!
> 
> -Tony
