Re: [Pacemaker] monitor operation stopped running

2010-12-15 Thread Andrew Beekhof
On Wed, Dec 15, 2010 at 8:30 AM, Chris Picton  wrote:
> On Tue, 14 Dec 2010 18:55:06 +0100, Dejan Muhamedagic wrote:
>
>> Hi,
>>
>> On Tue, Dec 14, 2010 at 12:16:22PM +0200, Chris Picton wrote:
>>> Hi
>>>
>>> I have noticed this happening a few times on several of my clusters.
>>> The monitor operation for some resources stops running, and thus
>>> resource failures are not detected.  If I edit the cib and change
>>> something regarding the resource (generally the monitor interval),
>>> the resource starts monitoring again, detects the failure and
>>> restarts correctly.
>>>
>>> I am using pacemaker 1.0.9 live, and 1.0.10 in test.
>>>
>>> This has happened with both clone and non-clone resources.
>>>
>>> I have attached a log which shows the behaviour.  I have a resource
>>> (megaswitch) running cloned over 6 nodes.
>>>
>>> Until 06:48:22, the monitor is running correctly (the app logs the
>>> "Deleting context for MONTEST-" line when the monitor is run). After
>>> that, the monitor is not run again on this node.
>>>
>>> I have the logs for the other nodes, if they are needed to try and
>>> debug this.
>>
>> Nov 28 06:48:26 sbc-tpna2-01 crmd: [4863]: info: do_lrm_invoke: Removing
>> resource megaswitch:3 from the LRM
>> Nov 28 06:48:26 sbc-tpna2-01 crmd: [4863]: info: do_lrm_invoke: Resource
>> 'megaswitch:3' deleted for 19511_crm_resource on sbc-tpna2-06.ecntelecoms.za.net
>> Nov 28 06:48:26 sbc-tpna2-01 crmd: [4863]: info: notify_deleted: Notifying
>> 19511_crm_resource on sbc-tpna2-06.ecntelecoms.za.net that megaswitch:3
>> was deleted
>>
>> Somebody/something on sbc-tpna2-06.ecntelecoms.za.net ran crm_resource
>> (or perhaps the crm shell) and removed megaswitch from LRM. Any
>> suspicious cron jobs over there?
>
> on sbc-tpna2-06
> ---
> Nov 28 06:48:19 sbc-tpna2-06 crm_resource: [19476]: info: Invoked:
> crm_resource -C -r group_megaswitch:0 -H sbc-tpna2-01.ecntelecoms.za.net
> Nov 28 06:48:21 sbc-tpna2-06 crm_resource: [19482]: info: Invoked:
> crm_resource -C -r group_megaswitch:1 -H sbc-tpna2-01.ecntelecoms.za.net
> Nov 28 06:48:24 sbc-tpna2-06 crm_resource: [19506]: info: Invoked:
> crm_resource -C -r group_megaswitch:2 -H sbc-tpna2-01.ecntelecoms.za.net
> Nov 28 06:48:24 sbc-tpna2-06 crmd: [29893]: ERROR: send_msg_via_ipc:
> Unknown Sub-system (19482_crm_resource)... discarding message.
> Nov 28 06:48:24 sbc-tpna2-06 crmd: [29893]: ERROR: send_msg_via_ipc:
> Unknown Sub-system (19482_crm_resource)... discarding message.
> Nov 28 06:48:26 sbc-tpna2-06 crm_resource: [19511]: info: Invoked:
> crm_resource -C -r group_megaswitch:3 -H sbc-tpna2-01.ecntelecoms.za.net
> Nov 28 06:48:27 sbc-tpna2-06 cib: [19512]: info: write_cib_contents:
> Archived previous version as /var/lib/heartbeat/crm/cib-21.raw
> Nov 28 06:48:27 sbc-tpna2-06 cib: [19512]: info: write_cib_contents:
> Wrote version 0.232.0 of the CIB to disk (digest:
> 6aaa4d35d37a179b8f42c7045220690a)
> Nov 28 06:48:27 sbc-tpna2-06 cib: [19512]: info: retrieveCib: Reading
> cluster configuration from: /var/lib/heartbeat/crm/cib.tmgWhm (digest: /
> var/lib/heartbeat/crm/cib.NqXOtl)
> Nov 28 06:48:27 sbc-tpna2-06 cib: [29889]: info: Managed
> write_cib_contents process 19512 exited with return code 0.
> Nov 28 06:48:27 sbc-tpna2-06 attrd: [29892]: info: attrd_ha_callback:
> flush message from sbc-tpna2-01.ecntelecoms.za.net
> Nov 28 06:48:27 sbc-tpna2-06 cib: [19527]: info: write_cib_contents:
> Archived previous version as /var/lib/heartbeat/crm/cib-22.raw
> Nov 28 06:48:27 sbc-tpna2-06 cib: [19527]: info: write_cib_contents:
> Wrote version 0.233.0 of the CIB to disk (digest:
> 8e39a0b125878ab28f8bed81789f5a59)
> Nov 28 06:48:27 sbc-tpna2-06 cib: [19527]: info: retrieveCib: Reading
> cluster configuration from: /var/lib/heartbeat/crm/cib.mwt8EZ (digest: /
> var/lib/heartbeat/crm/cib.hZ74d0)
> Nov 28 06:48:27 sbc-tpna2-06 cib: [29889]: info: Managed
> write_cib_contents process 19527 exited with return code 0.
> Nov 28 06:48:28 sbc-tpna2-06 crm_resource: [19528]: info: Invoked:
> crm_resource -C -r group_megaswitch:4 -H sbc-tpna2-01.ecntelecoms.za.net
> Nov 28 06:48:30 sbc-tpna2-06 crm_resource: [19534]: info: Invoked:
> crm_resource -C -r group_megaswitch:5 -H sbc-tpna2-01.ecntelecoms.za.net
>
>
> It looks like a 'crm resource cleanup megaswitch-clone' command was
> executed
>
> The other nodes all log similar entries:
> ---
> sbc-tpna2-05.ecntelecoms.za.net.16.small:Nov 28 06:49:17 sbc-tpna2-05
> crmd: [30350]: info: do_lrm_invoke: Removing resource megaswitch:4 from
> the LRM
> sbc-tpna2-05.ecntelecoms.za.net.16.small-Nov 28 06:49:17 sbc-tpna2-05
> crmd: [30350]: info: do_lrm_invoke: Resource 'megaswitch:4' deleted for
> 19697_crm_resource on sbc-tpna2-06.ecntelecoms.za.net
> sbc-tpna2-05.ecntelecoms.za.net.16.small-Nov 28 06:49:17 sbc-tpna2-05
> crmd: [30350]: info: notify_deleted: Notifying 19697_crm_resource on sbc-
> tpna2-06.ecntelecoms.za.net that megaswitch:4 was deleted
> --
>
>
> So I have 2 questions:
> 1) 

Re: [Pacemaker] [Problem]The movement of the resource is not possible.

2010-12-15 Thread Andrew Beekhof
On Thu, Dec 16, 2010 at 8:15 AM,   wrote:
> Hi Andrew,
>
>> > For 1.0, I will ask Mr. Mori to do the backport.
>> > Will you revise 1.1?
>>
>> Yes, I have it modified locally. I'll push it out soon.
>
> I confirmed the revision in your PM1.1.
>
>  * http://hg.clusterlabs.org/pacemaker/1.1/rev/862936c5bca3
>
> I would like PM1.0 to reflect this revision as well.
>
> I will ask Mr. Mori to do the revision, but is there any problem with
> applying the same patch to PM1.0?

No, the patch should be sufficient if applied to 1.0.

>
> Best Regards,
> Hideo Yamauchi.
>
>
> --- Andrew Beekhof  wrote:
>
>> On Thu, Dec 2, 2010 at 1:44 AM,   wrote:
>> > Hi Andrew,
>> >
>> >> > Can 1.0 reflect this revision?
>> >> > Or is that impossible because of other side effects?
>> >>
>> >> I have no objection to it being added to 1.0, it should be safe.
>> >
>> > Thanks.
>> >
>> > For 1.0, I will ask Mr. Mori to do the backport.
>> > Will you revise 1.1?
>>
>> Yes, I have it modified locally. I'll push it out soon.
>>
>> >
>> > Best Regards,
>> > Hideo Yamauchi.
>> >
>> >
>> > --- Andrew Beekhof  wrote:
>> >
>> >> On Mon, Nov 29, 2010 at 5:11 AM,  wrote:
>> >> > Hi Andrew,
>> >> >
>> >> > Sorry, my response was late.
>> >> >
>> >> >> I think the smartest thing to do here is drop the cib_scope_local flag 
>> >> >> from -f
>> >> >
>> >> >        if(do_force) {
>> >> >                crm_debug("Forcing...");
>> >> > /*             cib_options |= cib_scope_local|cib_quorum_override; */
>> >> >                cib_options |= cib_quorum_override;
>> >> >        }
>> >> >
>> >> >
>> >> > I confirmed the behaviour with this revision applied.
>> >> > The resource now moves correctly.
>> >> >
>> >> > Can 1.0 reflect this revision?
>> >> > Or is that impossible because of other side effects?
>> >>
>> >> I have no objection to it being added to 1.0, it should be safe.
>> >>
>> >> >
>> >> > Best Regards,
>> >> > Hideo Yamauchi.
>> >> >
>> >> > --- Andrew Beekhof  wrote:
>> >> >
>> >> >> 2010/11/8 :
>> >> >> > Hi,
>> >> >> >
>> >> >> > On a simple two-node cluster, a resource failed (monitor error).
>> >> >> >
>> >> >> > 
>> >> >> > Last updated: Mon Nov  8 10:16:50 2010
>> >> >> > Stack: Heartbeat
>> >> >> > Current DC: srv02 (f80f87fd-cc09-43c7-80bc-8d9e96de376b) - partition 
>> >> >> > WITHOUT quorum
>> >> >> > Version: 1.0.9-0a40fd0cb9f2fcedef9d1967115c912314c57438
>> >> >> > 2 Nodes configured, unknown expected votes
>> >> >> > 1 Resources configured.
>> >> >> > 
>> >> >> >
>> >> >> > Online: [ srv01 srv02 ]
>> >> >> >
>> >> >> >  Resource Group: grpDummy
>> >> >> >      prmDummy1-1       (ocf::heartbeat:Dummy):       Started srv02
>> >> >> >      prmDummy1-2       (ocf::heartbeat:Dummy):       Started srv02
>> >> >> >      prmDummy1-3       (ocf::heartbeat:Dummy):       Started srv02
>> >> >> >      prmDummy1-4       (ocf::heartbeat:Dummy):       Started srv02
>> >> >> >
>> >> >> > Migration summary:
>> >> >> > * Node srv02:
>> >> >> > * Node srv01:
>> >> >> >   prmDummy1-1: migration-threshold=1 fail-count=1
>> >> >> >
>> >> >> > Failed actions:
>> >> >> >    prmDummy1-1_monitor_3 (node=srv01, call=7, rc=7, status=complete): not running
>> >> >> >
>> >> >> >
>> >> >> > After the resource failed over, I ran the following commands
>> >> >> > consecutively.
>> >> >> >
>> >> >> > [r...@srv01 ~]# crm_resource -C -r prmDummy1-1 -N srv01; crm_resource -M -r grpDummy -N srv01 -f -Q
>> >> >> >
>> >> >> > 
>> >> >> > Last updated: Mon Nov  8 10:17:33 2010
>> >> >> > Stack: Heartbeat
>> >> >> > Current DC: srv02 (f80f87fd-cc09-43c7-80bc-8d9e96de376b) - partition 
>> >> >> > WITHOUT quorum
>> >> >> > Version: 1.0.9-0a40fd0cb9f2fcedef9d1967115c912314c57438
>> >> >> > 2 Nodes configured, unknown expected votes
>> >> >> > 1 Resources configured.
>> >> >> > 
>> >> >> >
>> >> >> > Online: [ srv01 srv02 ]
>> >> >> >
>> >> >> >  Resource Group: grpDummy
>> >> >> >      prmDummy1-1       (ocf::heartbeat:Dummy):       Started srv02
>> >> >> >      prmDummy1-2       (ocf::heartbeat:Dummy):       Started srv02
>> >> >> >      prmDummy1-3       (ocf::heartbeat:Dummy):       Started srv02
>> >> >> >      prmDummy1-4       (ocf::heartbeat:Dummy):       Started srv02
>> >> >> >
>> >> >> > Migration summary:
>> >> >> > * Node srv02:
>> >> >> > * Node srv01:
>> >> >> >
>> >> >> > But, the resource does not move to a srv01 node.
>> >> >> >
>> >> >> > Does the "crm_resource -M" command have to be run after waiting
>> >> >> > for the S_IDLE state?
>> >> >> >
>> >> >> > Or is this phenomenon a bug?
>> >> >> >
>> >> >> >  * I attach the hb_report archive.
>> >> >>
>> >> >> So the problem here is that not only does -f enable logic in
>> >> >> move_resource(), but also

Re: [Pacemaker] [Problem]The movement of the resource is not possible.

2010-12-15 Thread renayama19661014
Hi Andrew,

> > For 1.0, I will ask Mr. Mori to do the backport.
> > Will you revise 1.1?
> 
> Yes, I have it modified locally. I'll push it out soon.

I confirmed the revision in your PM1.1.

 * http://hg.clusterlabs.org/pacemaker/1.1/rev/862936c5bca3

I would like PM1.0 to reflect this revision as well.

I will ask Mr. Mori to do the revision, but is there any problem with applying
the same patch to PM1.0?

Best Regards,
Hideo Yamauchi.


--- Andrew Beekhof  wrote:

> On Thu, Dec 2, 2010 at 1:44 AM,   wrote:
> > Hi Andrew,
> >
> >> > Can 1.0 reflect this revision?
> >> > Or is that impossible because of other side effects?
> >>
> >> I have no objection to it being added to 1.0, it should be safe.
> >
> > Thanks.
> >
> > For 1.0, I will ask Mr. Mori to do the backport.
> > Will you revise 1.1?
> 
> Yes, I have it modified locally. I'll push it out soon.
> 
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> >
> > --- Andrew Beekhof  wrote:
> >
> >> On Mon, Nov 29, 2010 at 5:11 AM,  wrote:
> >> > Hi Andrew,
> >> >
> >> > Sorry, my response was late.
> >> >
> >> >> I think the smartest thing to do here is drop the cib_scope_local flag 
> >> >> from -f
> >> >
> >> >        if(do_force) {
> >> >                crm_debug("Forcing...");
> >> > /*             cib_options |= cib_scope_local|cib_quorum_override; */
> >> >                cib_options |= cib_quorum_override;
> >> >        }
> >> >
> >> >
> >> > I confirmed the behaviour with this revision applied.
> >> > The resource now moves correctly.
> >> >
> >> > Can 1.0 reflect this revision?
> >> > Or is that impossible because of other side effects?
> >>
> >> I have no objection to it being added to 1.0, it should be safe.
> >>
> >> >
> >> > Best Regards,
> >> > Hideo Yamauchi.
> >> >
> >> > --- Andrew Beekhof  wrote:
> >> >
> >> >> 2010/11/8 :
> >> >> > Hi,
> >> >> >
> >> >> > On a simple two-node cluster, a resource failed (monitor error).
> >> >> >
> >> >> > 
> >> >> > Last updated: Mon Nov  8 10:16:50 2010
> >> >> > Stack: Heartbeat
> >> >> > Current DC: srv02 (f80f87fd-cc09-43c7-80bc-8d9e96de376b) - partition 
> >> >> > WITHOUT quorum
> >> >> > Version: 1.0.9-0a40fd0cb9f2fcedef9d1967115c912314c57438
> >> >> > 2 Nodes configured, unknown expected votes
> >> >> > 1 Resources configured.
> >> >> > 
> >> >> >
> >> >> > Online: [ srv01 srv02 ]
> >> >> >
> >> >> >  Resource Group: grpDummy
> >> >> >      prmDummy1-1       (ocf::heartbeat:Dummy):       Started srv02
> >> >> >      prmDummy1-2       (ocf::heartbeat:Dummy):       Started srv02
> >> >> >      prmDummy1-3       (ocf::heartbeat:Dummy):       Started srv02
> >> >> >      prmDummy1-4       (ocf::heartbeat:Dummy):       Started srv02
> >> >> >
> >> >> > Migration summary:
> >> >> > * Node srv02:
> >> >> > * Node srv01:
> >> >> >   prmDummy1-1: migration-threshold=1 fail-count=1
> >> >> >
> >> >> > Failed actions:
> >> >> >    prmDummy1-1_monitor_3 (node=srv01, call=7, rc=7, status=complete): not running
> >> >> >
> >> >> >
> >> >> > After the resource failed over, I ran the following commands
> >> >> > consecutively.
> >> >> >
> >> >> > [r...@srv01 ~]# crm_resource -C -r prmDummy1-1 -N srv01; crm_resource -M -r grpDummy -N srv01 -f -Q
> >> >> >
> >> >> > 
> >> >> > Last updated: Mon Nov  8 10:17:33 2010
> >> >> > Stack: Heartbeat
> >> >> > Current DC: srv02 (f80f87fd-cc09-43c7-80bc-8d9e96de376b) - partition 
> >> >> > WITHOUT quorum
> >> >> > Version: 1.0.9-0a40fd0cb9f2fcedef9d1967115c912314c57438
> >> >> > 2 Nodes configured, unknown expected votes
> >> >> > 1 Resources configured.
> >> >> > 
> >> >> >
> >> >> > Online: [ srv01 srv02 ]
> >> >> >
> >> >> >  Resource Group: grpDummy
> >> >> >      prmDummy1-1       (ocf::heartbeat:Dummy):       Started srv02
> >> >> >      prmDummy1-2       (ocf::heartbeat:Dummy):       Started srv02
> >> >> >      prmDummy1-3       (ocf::heartbeat:Dummy):       Started srv02
> >> >> >      prmDummy1-4       (ocf::heartbeat:Dummy):       Started srv02
> >> >> >
> >> >> > Migration summary:
> >> >> > * Node srv02:
> >> >> > * Node srv01:
> >> >> >
> >> >> > But, the resource does not move to a srv01 node.
> >> >> >
> >> >> > Does the "crm_resource -M" command have to be run after waiting
> >> >> > for the S_IDLE state?
> >> >> >
> >> >> > Or is this phenomenon a bug?
> >> >> >
> >> >> >  * I attach the hb_report archive.
> >> >>
> >> >> So the problem here is that not only does -f enable logic in
> >> >> move_resource(), but also
> >> >>
> >> >>                cib_options |= cib_scope_local|cib_quorum_override;
> >> >>
> >> >> Combined with the fact that crm_resource -C is not synchronous in 1.0,
> >> >> if you run -M on a non-DC node, the updates hit the local cib while
> 
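As an aside on the S_IDLE question above, a hedged shell sketch of how the
two crm_resource calls could be serialised by waiting for the DC to return
to S_IDLE (the awk/grep parsing of crmadmin's output is an assumption about
its exact format):

    # clean up the failed monitor first
    crm_resource -C -r prmDummy1-1 -N srv01
    # find the current DC and wait until its crmd reports S_IDLE
    DC=$(crmadmin -D | awk '{print $NF}')
    until crmadmin -S "$DC" | grep -q S_IDLE; do
        sleep 1
    done
    # now request the move
    crm_resource -M -r grpDummy -N srv01 -f -Q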

Re: [Pacemaker] pacemaker + corosync in the cloud

2010-12-15 Thread Steven Dake
On 12/14/2010 05:14 PM, ruslan usifov wrote:
> Hi
> 
> Is it possible to use pacemaker based on corosync in cloud hosting
> like Amazon or SoftLayer?
> 
> 
> 

Yes, with corosync 1.3.0 in udpu mode.  The udpu mode avoids the use of
multicast, allowing operation in Amazon's cloud.
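For reference, a minimal sketch of the corosync.conf totem section for
udpu transport (the bind network and member addresses below are made-up
examples, not values from this thread):

    totem {
        version: 2
        transport: udpu                    # unicast UDP instead of multicast
        interface {
            ringnumber: 0
            bindnetaddr: 10.0.0.0          # example: the VMs' private subnet
            member {
                memberaddr: 10.0.0.11      # example: first cluster node
            }
            member {
                memberaddr: 10.0.0.12      # example: second cluster node
            }
        }
    }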

Regards
-steve



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] colocation issues (isnt it always)

2010-12-15 Thread Dejan Muhamedagic
On Wed, Dec 15, 2010 at 09:33:54AM -0700, Patrick H. wrote:
> 
> 
> Sent: Wed Dec 15 2010 09:10:12 GMT-0700 (Mountain Standard Time)
> From: Dejan Muhamedagic 
> To: The Pacemaker cluster resource manager 
> Subject: Re: [Pacemaker] colocation issues (isnt it always)
> >Hi,
> >
> >On Mon, Dec 13, 2010 at 07:08:52PM -0700, Patrick H. wrote:
> >>Sent: Mon Dec 13 2010 15:19:48 GMT-0700 (Mountain Standard Time)
> >>From: Pavlos Parissis 
> >>To: The Pacemaker cluster resource manager 
> >>Subject: Re: [Pacemaker] colocation issues (isnt it always)
> >>>If you put all of them in a group and have the nfs_sdb1 as last
> >>>resource you will manage to have what you want with a very simple
> >>>configuration
> >>>BTW, I used your conf and in my case all resources started on the same node
> >>I futzed around with it some more and the problem was the nfsserver
> >>resource. It wasn't properly detecting that it wasn't running on the
> >>'nas02' node. When I first added the resource it wasn't in the
> >>colocation rule, so it started up on 'nas02' (or it thought it did),
> >>and then I added the colocation rule. Well the monitor action was
> >>reporting that the service was running when it really wasn't. So
> >>every time it went to shut it down and move it to another node, it
> >>failed because it thought it was still running. I ended up writing my
> >>own script with a working monitor function and it moved over just
> >>fine.
> >
> >nfsserver actually uses your distribution init script:
> >
> >nfs_init_script (string, [/etc/init.d/nfsserver]): Init script for nfsserver
> >The default init script shipped with the Linux distro.
> >The nfsserver resource agent offloads the start/stop/monitor
> >work to the init script because the procedure to start/stop/monitor nfsserver
> >varies on different Linux distro.
> >
> >It looks like you need to report a bug to your vendor for the
> >NFS server init script.
> >
> No, the problem wasn't the nfs server init script; that worked
> perfectly, and the script I wrote to replace nfsserver uses the
> system's init script as well. The problem was the nfsserver script
> improperly thinking that nfs was active even when it wasn't. The path
> where nfs stores its state data (/var/lib/nfs) didn't even exist and
> it still said it was running. That was the problem.

In that case you should open a bugzilla at 
http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Linux-HA

Please attach all relevant information, best to use hb_report.

Thanks,

Dejan

> >Thanks,
> >
> >Dejan
> >


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] continue starting chain with failed group resources

2010-12-15 Thread Dejan Muhamedagic
On Tue, Dec 14, 2010 at 06:18:16PM -0700, Patrick H. wrote:
> 
> 
> Sent: Tue Dec 14 2010 11:37:06 GMT-0700 (Mountain Standard Time)
> From: Dejan Muhamedagic 
> To: The Pacemaker cluster resource manager 
> Subject: Re: [Pacemaker] continue starting chain with failed group
> resources
> >Hi,
> >
> >On Mon, Dec 13, 2010 at 10:43:36PM -0700, Patrick H. wrote:
> >>After tinkering with this for a few hours I finally have something working.
> >>
> >>colocation co-raid inf: ( md_raid iscsi_1 iscsi_2 iscsi_3 )
> >
> >This should be noop. You'd want something like this, I think:
> >
> >colocation co-raid inf: md_raid ( iscsi_1 iscsi_2 iscsi_3 )
> >
> No, that makes the md_raid service depend on all the iscsi services
> being started, which I don't want.

Yes, of course. It's just that in the given context, that seems
to be the only sensible relation between the resources.

> >>order or-raid 0: ( iscsi_1 iscsi_2 iscsi_3 ) md_raid
> >>
> >>Got rid of the group, changed the score on the order to 0, and
> >>changed the grouping of both the colocation and order. This
> >>*appears* to function as intended, but if anyone can point out any
> >>pitfalls I'd appreciate it
> >>
> >>-Patrick
> >>
> >>Sent: Mon Dec 13 2010 21:12:04 GMT-0700 (Mountain Standard Time)
> >>From: Patrick H. 
> >>To: The Pacemaker cluster resource manager 
> >>Subject: [Pacemaker] continue starting chain with failed group resources
> >>>Is there a way to continue down a chain of starting resources once
> >>>a previous resource has tried to start, whether or not the attempt
> >>>was successful?
> >
> >No, that's currently not possible to express. I think that you
> >should take the iSCSI resources out of the cluster and let them
> >start on boot _before_ the cluster manager. If there are not
> >enough disks, then the md_raid resource is going to fail.
> Can't do that either. When the node that is currently using the iscsi
> services fails, they have to be migrated over to another host so it
> can assemble them into a raid array. If they're not being managed by
> pacemaker, they won't migrate.

Perhaps you can then set on_fail=fence for, say, a filesystem which
sits on top of this md_raid.
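A hedged crm shell sketch of that suggestion (the Filesystem primitive,
device and mount point are made-up examples, and on-fail=fence only does
something useful if STONITH is configured; note the crm shell spells the
operation attribute on-fail, with a hyphen):

    primitive fs_md0 ocf:heartbeat:Filesystem \
        params device="/dev/md0" directory="/srv/data" fstype="ext3" \
        op monitor interval="20s" timeout="40s" on-fail="fence"
    order or-fs inf: md_raid fs_md0
    colocation co-fs inf: fs_md0 md_raid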

> I made a few more tweaks to the configuration I posted earlier and
> it seems to work pretty well, with only one exception.
> colocation co-raid inf: ( md_raid iscsi_1 iscsi_2 iscsi_3 )

If this collocation makes a difference, then I really don't know
what it is.

> order or-raid_start 0: ( iscsi_1:start iscsi_2:start iscsi_3:start )
> md_raid:start
> order or-raid_stop inf: md_raid:stop ( iscsi_1:stop iscsi_2:stop
> iscsi_3:stop )
> 
> That makes it so that when they start up, they start in order, but
> it isn't required that every iscsi start before md_raid, just that
> they try to start.

That's not how advisory order is defined, i.e. it has an effect
only in case both resources are to be started or stopped. For
instance, if all iscsi resources fail, the md_raid one would
continue to run. See Configuration Explained or Ordering
Explained doc.
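To make the distinction concrete with the resource names from this thread
(a sketch, not a drop-in replacement for the posted configuration):

    # advisory (score 0): ordering applies only when both sides happen
    # to be scheduled in the same transition
    order or-raid_advisory 0: ( iscsi_1 iscsi_2 iscsi_3 ) md_raid

    # mandatory (score inf): md_raid must also stop/restart whenever
    # the iscsi resources are stopped
    order or-raid_mandatory inf: ( iscsi_1 iscsi_2 iscsi_3 ) md_raid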

> Then when they stop, it's mandatory that they stop in that order so
> that no iscsi service will stop while md_raid is still running.
> 
> The exception I mentioned is a bug in the policy engine. Bug 2435.
> The policy engine allows resources within a colocation set to start
> on other nodes. So if I were to stop one of the iscsi services, and
> then start it again, it might start on a different node. Unless this
> bug gets fixed soon, I'll probably modify the iscsi script so that

That bug is marked as fixed. If you think it's not fixed, then
you should reopen it.

> all the iscsi devices are under 1 resource.

Yes, that may be one option. Probably not too difficult to modify
the RA.

Thanks,

Dejan

> >Thanks,
> >
> >Dejan
> >
> >>>I've got 3 iSCSI resources which are in a group, and then an md
> >>>raid-5 array as another resource. I have the raid array resource
> >>>set to start after the group with a colocation rule, but it will
> >>>only start if the whole group comes up. Since this is raid-5, we
> >>>can obviously handle some disk failure and start up anyway. So how
> >>>do I get it to try to start it up once all the iSCSI resources
> >>>have tried to start? I went looking through the docs and didn't
> >>>find anything.
> >>>
> >>>Note: there will be other resources in the chain (like mounting
> >>>the filesystem) that I don't want to try and start if the raid
> >>>array resource didn't start.
> >>>
> >>>

Re: [Pacemaker] colocation issues (isnt it always)

2010-12-15 Thread Patrick H.



Sent: Wed Dec 15 2010 09:10:12 GMT-0700 (Mountain Standard Time)
From: Dejan Muhamedagic 
To: The Pacemaker cluster resource manager 
Subject: Re: [Pacemaker] colocation issues (isnt it always)

Hi,

On Mon, Dec 13, 2010 at 07:08:52PM -0700, Patrick H. wrote:

Sent: Mon Dec 13 2010 15:19:48 GMT-0700 (Mountain Standard Time)
From: Pavlos Parissis 
To: The Pacemaker cluster resource manager 
Subject: Re: [Pacemaker] colocation issues (isnt it always)


If you put all of them in a group and have the nfs_sdb1 as last
resource you will manage to have what you want with a very simple
configuration
BTW, I used your conf and in my case all resources started on the same node

I futzed around with it some more and the problem was the nfsserver
resource. It wasn't properly detecting that it wasn't running on the
'nas02' node. When I first added the resource it wasn't in the
colocation rule, so it started up on 'nas02' (or it thought it did),
and then I added the colocation rule. Well the monitor action was
reporting that the service was running when it really wasn't. So
every time it went to shut it down and move it to another node, it
failed because it thought it was still running. I ended up writing my
own script with a working monitor function and it moved over just
fine.



nfsserver actually uses your distribution init script:

nfs_init_script (string, [/etc/init.d/nfsserver]): Init script for nfsserver
The default init script shipped with the Linux distro.
The nfsserver resource agent offloads the start/stop/monitor
work to the init script because the procedure to start/stop/monitor nfsserver
varies on different Linux distro.

It looks like you need to report a bug to your vendor for the
NFS server init script.

No, the problem wasn't the nfs server init script; that worked perfectly,
and the script I wrote to replace nfsserver uses the system's init script
as well. The problem was the nfsserver script improperly thinking that nfs
was active even when it wasn't. The path where nfs stores its state data
(/var/lib/nfs) didn't even exist and it still said it was running. That
was the problem.

Thanks,

Dejan

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] colocation issues (isnt it always)

2010-12-15 Thread Dejan Muhamedagic
Hi,

On Mon, Dec 13, 2010 at 07:08:52PM -0700, Patrick H. wrote:
> Sent: Mon Dec 13 2010 15:19:48 GMT-0700 (Mountain Standard Time)
> From: Pavlos Parissis 
> To: The Pacemaker cluster resource manager 
> Subject: Re: [Pacemaker] colocation issues (isnt it always)
> >If you put all of them in a group and have the nfs_sdb1 as last
> >resource you will manage to have what you want with a very simple
> >configuration
> >BTW, I used your conf and in my case all resources started on the same node
> I futzed around with it some more and the problem was the nfsserver
> resource. It wasn't properly detecting that it wasn't running on the
> 'nas02' node. When I first added the resource it wasn't in the
> colocation rule, so it started up on 'nas02' (or it thought it did),
> and then I added the colocation rule. Well the monitor action was
> reporting that the service was running when it really wasn't. So
> every time it went to shut it down and move it to another node, it
> failed because it thought it was still running. I ended up writing my
> own script with a working monitor function and it moved over just
> fine.

nfsserver actually uses your distribution init script:

nfs_init_script (string, [/etc/init.d/nfsserver]): Init script for nfsserver
The default init script shipped with the Linux distro.
The nfsserver resource agent offloads the start/stop/monitor
work to the init script because the procedure to start/stop/monitor nfsserver
varies on different Linux distro.

It looks like you need to report a bug to your vendor for the
NFS server init script.
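For reference, a hedged crm shell sketch of how the nfsserver RA is
typically wired up (the init script path, shared info directory and IP
below are made-up examples; adjust them for the distribution in question):

    primitive p_nfsserver ocf:heartbeat:nfsserver \
        params nfs_init_script="/etc/init.d/nfsserver" \
               nfs_shared_infodir="/var/lib/nfs" \
               nfs_ip="192.168.1.100" \
        op monitor interval="30s"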

Thanks,

Dejan

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker