Re: [Pacemaker] [PATCH] pingd checks pidfile on start

2012-03-29 Thread Takatoshi MATSUO
Hi Andrew

> Any chance you could redo this as a github pull request? :-D

Thanks for your reply.
I sent a pull request.

Regards,
Takatoshi MATSUO

2012/3/29 Andrew Beekhof :
> Any chance you could redo this as a github pull request? :-D
>
> On Wed, Mar 14, 2012 at 6:49 PM, Takatoshi MATSUO  
> wrote:
>> Hi
>>
>> I use pacemaker 1.0.11 and pingd RA.
>> Occasionally, pingd's first monitor fails right after start.
>>
>> It seems that the main cause is that the pingd daemon returns 0 before
>> creating its pidfile, and the RA doesn't check the pidfile on start
>> (a sketch of such a check follows the test output below).
>>
>> test script
>> -
>> while true; do
>>    killall pingd; sleep 3
>>    rm -f /tmp/pingd.pid; sleep 1
>>    /usr/lib64/heartbeat/pingd -D -p /tmp/pingd.pid -a ping_status -d 0 -m 100 -h 192.168.0.1
>>    echo $?
>>    ls /tmp/pingd.pid; sleep .1
>>    ls /tmp/pingd.pid
>> done
>> -
>>
>> result
>> -
>> 0
>> /tmp/pingd.pid
>> /tmp/pingd.pid
>> 0
>> ls: cannot access /tmp/pingd.pid:  No such file or directory   <- NG
>> /tmp/pingd.pid
>> 0
>> /tmp/pingd.pid
>> /tmp/pingd.pid
>> 0
>> /tmp/pingd.pid
>> /tmp/pingd.pid
>> 0
>> /tmp/pingd.pid
>> /tmp/pingd.pid
>> 0
>> ls: cannot access /tmp/pingd.pid: No such file or directory   <- NG
>> /tmp/pingd.pid
>> --
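A minimal sketch of the kind of start-time check being proposed (hypothetical: the parameter and helper names are assumed, it presumes the usual ocf-shellfuncs are sourced, and the real change is in the attached patch):

    # Hypothetical helper for the RA's start action: after launching the
    # pingd daemon, wait briefly for the pidfile instead of trusting the
    # daemon's exit code alone. Parameter name OCF_RESKEY_pidfile assumed.
    pingd_wait_pidfile() {
        local tries=0
        while [ ! -f "$OCF_RESKEY_pidfile" ]; do
            tries=$((tries + 1))
            if [ "$tries" -gt 20 ]; then
                ocf_log err "pingd did not create $OCF_RESKEY_pidfile"
                return $OCF_ERR_GENERIC
            fi
            sleep 0.5
        done
        return $OCF_SUCCESS
    }

Called right after the daemon is launched in the RA's start action, a check like this would close the window shown in the NG lines above.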
>>
>> Please consider the attached patch for pacemaker-1.0.
>>
>> Regards,
>> Takatoshi MATSUO
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CIB not saved

2012-03-29 Thread Fiorenza Meini


Normally we log an error at startup if we can't write there... did
this not happen?


Yes, it happened. I saw a warning while writing the CIB... but only after I had
written to this mailing list :)


Regards

--

Fiorenza Meini
Spazio Web S.r.l.

V. Dante Alighieri, 10 - 13900 Biella
Tel.: 015.2431982 - 015.9526066
Fax: 015.2522600
Reg. Imprese, CF e P.I.: 02414430021
Iscr. REA: BI - 188936
Iscr. CCIAA: Biella - 188936
Cap. Soc.: 30.000,00 Euro i.v.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Nodes not rejoining cluster

2012-03-29 Thread Andrew Beekhof
Gotta have logs.  From all 3 nodes mentioned.
Only then can we determine if the problem is at the corosync or
pacemaker layer - which is the prerequisite for figuring out what to
do next :)
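A report covering the incident can be generated on each node with something like the following (the destination name is an example; exact options vary between crm_report versions):

    crm_report -f "2012-03-29 18:00" /tmp/nodes-not-rejoining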

On Fri, Mar 30, 2012 at 1:30 PM, Gregg Stock  wrote:
> I had a circuit breaker go out and take two of the 5 nodes in my cluster
> down. Now that they're back up and running, they are not rejoining the
> cluster.
>
> Here is what I get from crm_mon -1
>
> Nodes 1, 2 and 3 (itchy, scratchy and walter) show the following:
> 
> Last updated: Thu Mar 29 19:04:05 2012
> Last change: Thu Mar 29 19:04:03 2012 via cibadmin on walter
> Stack: openais
> Current DC: walter - partition with quorum
> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
> 5 Nodes configured, 5 expected votes
> 9 Resources configured.
> 
>
> Online: [ itchy scratchy walter butthead timmy ]
>
>
> On butthead I get
>
> 
> Last updated: Thu Mar 29 19:04:24 2012
> Last change: Thu Mar 29 18:42:09 2012 via cibadmin on itchy
> Stack: openais
> Current DC: NONE
> 5 Nodes configured, 5 expected votes
> 9 Resources configured.
> 
>
> OFFLINE: [ itchy scratchy walter butthead timmy ]
>
>
> On Timmy, I get
>
> 
> Last updated: Thu Mar 29 19:04:20 2012
> Last change:
> Current DC: NONE
> 0 Nodes configured, unknown expected votes
> 0 Resources configured.
> 
>
>
> I don't have anything important running yet, so I can do a full clean-up of
> everything if needed.
>
> I also get some weird behavior with timmy. I brought this node up with the
> host name timmy.example.com and later changed the host name to timmy, but when
> the cluster is offline, timmy.example.com shows up as offline. I enter crm
> node delete timmy.example.com and it goes away until timmy goes offline
> again.
>
> Thanks,
> Gregg Stock
>
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Nodes not rejoining cluster

2012-03-29 Thread Gregg Stock
I had a circuit breaker go out and take two of the 5 nodes in my cluster 
down. Now that they're back up and running, they are not rejoining the 
cluster.


Here is what I get from crm_mon -1

Nodes 1, 2 and 3 (itchy, scratchy and walter) show the following:

Last updated: Thu Mar 29 19:04:05 2012
Last change: Thu Mar 29 19:04:03 2012 via cibadmin on walter
Stack: openais
Current DC: walter - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
5 Nodes configured, 5 expected votes
9 Resources configured.


Online: [ itchy scratchy walter butthead timmy ]


On butthead I get


Last updated: Thu Mar 29 19:04:24 2012
Last change: Thu Mar 29 18:42:09 2012 via cibadmin on itchy
Stack: openais
Current DC: NONE
5 Nodes configured, 5 expected votes
9 Resources configured.


OFFLINE: [ itchy scratchy walter butthead timmy ]


On Timmy, I get


Last updated: Thu Mar 29 19:04:20 2012
Last change:
Current DC: NONE
0 Nodes configured, unknown expected votes
0 Resources configured.



I don't have anything important running yet, so I can do a full clean-up
of everything if needed.


I also get some weird behavior with timmy. I brought this node up with 
the host name timmy.example.com and later changed the host name to timmy, 
but when the cluster is offline, timmy.example.com shows up as offline. I 
enter crm node delete timmy.example.com and it goes away until timmy 
goes offline again.


Thanks,
Gregg Stock


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Problem] The cluster fails in the stop of the node.

2012-03-29 Thread renayama19661014
Hi Andrew,

> This appears to be resolved with 1.1.7, perhaps look for a patch to backport?

I will check the behaviour with Pacemaker 1.1.7,
and I will discuss the backport with Mr. Mori.

Best Regards,
Hideo Yamauchi.

--- On Thu, 2012/3/29, Andrew Beekhof  wrote:

> This appears to be resolved with 1.1.7, perhaps look for a patch to backport?
> 
> On Tue, Mar 27, 2012 at 4:46 PM,   wrote:
> > Hi All,
> >
> > When we set a group resource within a Master/Slave resource, we found a
> > problem where a node could not stop.
> >
> > This problem occurs in Pacemaker 1.0.11.
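A minimal configuration of that shape, reconstructed from the status and log output below (resource names are taken from the logs; the meta attributes are assumed), would look something like:

    primitive prmStateful1 ocf:pacemaker:Stateful
    primitive prmStateful2 ocf:pacemaker:Stateful
    group testMsGroup01 prmStateful1 prmStateful2
    ms msGroup01 testMsGroup01 \
        meta master-max="1" clone-max="2" notify="true"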
> >
> > We confirmed a problem in the following procedure.
> >
> > Step1) Start all nodes.
> >
> > 
> > Last updated: Tue Mar 27 14:35:16 2012
> > Stack: Heartbeat
> > Current DC: test2 (b645c456-af78-429e-a40a-279ed063b97d) - partition 
> > WITHOUT quorum
> > Version: 1.0.12-unknown
> > 2 Nodes configured, unknown expected votes
> > 4 Resources configured.
> > 
> >
> > Online: [ test1 test2 ]
> >
> >  Master/Slave Set: msGroup01
> >     Masters: [ test1 ]
> >     Slaves: [ test2 ]
> >  Resource Group: testGroup
> >     prmDummy1  (ocf::pacemaker:Dummy): Started test1
> >     prmDummy2  (ocf::pacemaker:Dummy): Started test1
> >  Resource Group: grpStonith1
> >     prmStonithN1       (stonith:external/ssh): Started test2
> >  Resource Group: grpStonith2
> >     prmStonithN2       (stonith:external/ssh): Started test1
> >
> > Migration summary:
> > * Node test2:
> > * Node test1:
> >
> > Step2) Stop Slave node.
> >
> > [root@test2 ~]# service heartbeat stop
> > Stopping High-Availability services: Done.
> >
> > Step3) Stop the Master node. However, the Master node loops and does not
> > stop.
> >
> > (snip)
> > Mar 27 14:38:06 test1 crmd: [21443]: WARN: run_graph: Transition 3 
> > (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=23, 
> > Source=/var/lib/pengine/pe-input-3.bz2): Terminated
> > Mar 27 14:38:06 test1 crmd: [21443]: ERROR: te_graph_trigger: Transition 
> > failed: terminated
> > Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Graph 3 (30 actions 
> > in 30 synapses): batch-limit=30 jobs, network-delay=6ms
> > Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 0 is 
> > pending (priority: 0)
> > Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:     [Action 12]: 
> > Pending (id: testMsGroup01:0_stop_0, type: pseduo, priority: 0)
> > Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 14]: 
> > Completed (id: testMsGroup01:0_demote_0, type: pseduo, priority: 0)
> > Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 32]: 
> > Pending (id: msGroup01_stop_0, type: pseduo, priority: 0)
> > Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 1 is 
> > pending (priority: 0)
> > Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:     [Action 13]: 
> > Pending (id: testMsGroup01:0_stopped_0, type: pseduo, priority: 0)
> > Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 8]: 
> > Pending (id: prmStateful1:0_stop_0, loc: test1, priority: 0)
> > Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 9]: 
> > Pending (id: prmStateful2:0_stop_0, loc: test1, priority: 0)
> > Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_elem:      * [Input 12]: 
> > Pending (id: testMsGroup01:0_stop_0, type: pseduo, priority: 0)
> > Mar 27 14:38:06 test1 crmd: [21443]: WARN: print_graph: Synapse 2 was 
> > confirmed (priority: 0)
> > (snip)
> >
> > I attach data of hb_report.
> >
> > Best Regards,
> > Hideo Yamauchi.
> > ___
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
> 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Patch]Patch for crmd-transition-delay processing.

2012-03-29 Thread renayama19661014
Hi Andrew,

Thank you for comment.

> The patch makes sense, could you resend as a github pull request? :-D

All right!! I will send it when it is ready.
Please wait.

Best Regards,
Hideo Yamauchi.

--- On Thu, 2012/3/29, Andrew Beekhof  wrote:

> The patch makes sense, could you resend as a github pull request? :-D
> 
> On Thu, Mar 22, 2012 at 8:18 PM,   wrote:
> > Hi All,
> >
> > Sorry
> >
> > My patch was wrong.
> > I send a right patch.
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> > --- On Thu, 2012/3/22, renayama19661...@ybb.ne.jp 
> >  wrote:
> >
> >> Hi All,
> >>
> >> The crmd-transition-delay option exists to wait for attribute updates that
> >> arrive late.
> >>
> >> However, crmd cannot wait for the attribute properly, because the timer is
> >> not reset when a delayed attribute update arrives after the timer has
> >> already been set.
> >>
> >> As a result, the resource may not be placed correctly.
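For context, crmd-transition-delay is a cluster property; it is typically set like this (the value here is only an example):

    crm configure property crmd-transition-delay="2s"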
> >>
> >> I wrote a patch for Pacemaker 1.0.12.
> >>
> >> This patch blocks tengine's processing while a crmd-transition-delay
> >> timer is set, so tengine handles the pengine's instructions only after
> >> the crmd-transition-delay timer has definitely expired.
> >>
> >>
> >> With this patch, the start of a resource may be slightly delayed.
> >> However, it ensures the resource is placed correctly according to its
> >> constraints.
> >>
> >>  * I think a similar correction is necessary for the development
> >> version of Pacemaker.
> >>
> >> Best Regards,
> >> Hideo Yamauchi.
> >
> > ___
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
> 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Issue with ordering

2012-03-29 Thread Andrew Beekhof
On Thu, Mar 29, 2012 at 7:07 PM, Vladislav Bogdanov
 wrote:
> Hi Andrew, all,
>
> I'm continuing experiments with lustre on stacked drbd, and see
> following problem:
>
> I have one drbd resource (ms-drbd-testfs-mdt) stacked on top of
> another (ms-drbd-testfs-mdt-left), and have the following constraints
> between them:
>
> colocation drbd-testfs-mdt-with-drbd-testfs-mdt-left inf:
> ms-drbd-testfs-mdt ms-drbd-testfs-mdt-left:Master
> order drbd-testfs-mdt-after-drbd-testfs-mdt-left inf:
> ms-drbd-testfs-mdt-left:promote ms-drbd-testfs-mdt:start
>
> Then I have filesystem mounted on top of ms-drbd-testfs-mdt
> (testfs-mdt resource).
>
> colocation testfs-mdt-with-drbd-testfs-mdt inf: testfs-mdt
> ms-drbd-testfs-mdt:Master
> order testfs-mdt-after-drbd-testfs-mdt inf:
> ms-drbd-testfs-mdt:promote testfs-mdt:start
>
> When I trigger event which causes many resources to stop (including
> these three), LogActions output look like:
>
> LogActions: Stop    drbd-local#011(lustre01-left)
> LogActions: Stop    drbd-stacked#011(Started lustre02-left)
> LogActions: Stop    drbd-testfs-local#011(Started lustre03-left)
> LogActions: Stop    drbd-testfs-stacked#011(Started lustre04-left)
> LogActions: Stop    lustre#011(Started lustre04-left)
> LogActions: Stop    mgs#011(Started lustre01-left)
> LogActions: Stop    testfs#011(Started lustre03-left)
> LogActions: Stop    testfs-mdt#011(Started lustre01-left)
> LogActions: Stop    testfs-ost#011(Started lustre01-left)
> LogActions: Stop    testfs-ost0001#011(Started lustre02-left)
> LogActions: Stop    testfs-ost0002#011(Started lustre03-left)
> LogActions: Stop    testfs-ost0003#011(Started lustre04-left)
> LogActions: Stop    drbd-mgs:0#011(Master lustre01-left)
> LogActions: Stop    drbd-mgs:1#011(Slave lustre02-left)
> LogActions: Stop    drbd-testfs-mdt:0#011(Master lustre01-left)
> LogActions: Stop    drbd-testfs-mdt-left:0#011(Master lustre01-left)
> LogActions: Stop    drbd-testfs-mdt-left:1#011(Slave lustre02-left)
> LogActions: Stop    drbd-testfs-ost:0#011(Master lustre01-left)
> LogActions: Stop    drbd-testfs-ost-left:0#011(Master lustre01-left)
> LogActions: Stop    drbd-testfs-ost-left:1#011(Slave lustre02-left)
> LogActions: Stop    drbd-testfs-ost0001:0#011(Master lustre02-left)
> LogActions: Stop    drbd-testfs-ost0001-left:0#011(Master lustre02-left)
> LogActions: Stop    drbd-testfs-ost0001-left:1#011(Slave lustre01-left)
> LogActions: Stop    drbd-testfs-ost0002:0#011(Master lustre03-left)
> LogActions: Stop    drbd-testfs-ost0002-left:0#011(Master lustre03-left)
> LogActions: Stop    drbd-testfs-ost0002-left:1#011(Slave lustre04-left)
> LogActions: Stop    drbd-testfs-ost0003:0#011(Master lustre04-left)
> LogActions: Stop    drbd-testfs-ost0003-left:0#011(Master lustre04-left)
> LogActions: Stop    drbd-testfs-ost0003-left:1#011(Slave lustre03-left)
>
> For some reason demote is not run on both mdt drbd resources (should
> it?), so the drbd RA prints a warning about that.

So it's not just a logging error; the demote really isn't scheduled?
That would be bad, can you file a bug please?
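For reference, whether the demote actions were actually scheduled can be double-checked by replaying the saved PE input from the DC with crm_simulate; the path and file number below are placeholders, and exact options vary by version:

    crm_simulate -S -x /var/lib/pengine/pe-input-NNN.bz2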

>
> What I see then is that the cluster tries to stop ms-drbd-testfs-mdt-left
> before ms-drbd-testfs-mdt.
>
> Moreover, the testfs-mdt filesystem resource is not stopped before
> drbd-testfs-mdt is stopped.
>
> I have advisory ordering constraints between mdt and ost filesystem
> resources, so all OSTs are stopped before the mdt. Thus the mdt stop is
> delayed a bit. Maybe this influences what happens.
>
> I'm pretty sure I have correct constraints for at least these three
> resources, so it looks like a bug, because mandatory ordering is not
> preserved.
>
> I can produce a report for this.
>
> Best,
> Vladislav
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CIB not saved

2012-03-29 Thread Andrew Beekhof
On Thu, Mar 29, 2012 at 8:45 PM, Fiorenza Meini  wrote:
> On 29/03/2012 10:12, Rasto Levrinc wrote:
>
>> On Thu, Mar 29, 2012 at 9:54 AM, Fiorenza Meini  wrote:
>>>
>>> Hi there,
>>> a strange thing happened to my two-node cluster: I rebooted both machines
>>> at the same time, and when the OS came up again, no resources were configured
>>> anymore, as if it were a fresh installation. Why?
>>> It was explained to me that the configuration of resources managed by
>>> pacemaker should be in a file called cib.xml, but I cannot find it in the
>>> system. Do I have to specify any particular option in the configuration
>>> file?
>>
>>
>> Normally you shouldn't worry about it. cib.xml is stored in
>> /var/lib/heartbeat/crm/ or similar, and the directory should have
>> hacluster:haclient permissions. What distro is it and how did you install
>> it?
>>
>> Rasto
>>
>
> Thanks, it was a permissions problem.

Normally we log an error at startup if we can't write there... did
this not happen?

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] OCF_RESKEY_CRM_meta_{ordered,notify,interleave}

2012-03-29 Thread Andrew Beekhof
On Fri, Mar 30, 2012 at 1:47 AM, Florian Haas  wrote:
> Lars (lmb), or Andrew -- maybe one of you remembers what this was all about.
>
> In this commit, Lars enabled the
> OCF_RESKEY_CRM_meta_{ordered,notify,interleave} attributes to be
> injected into the environment of RAs:
> https://github.com/ClusterLabs/pacemaker/commit/b0ba01f61086f073be69db3e6beb0914642f79d9
>
> Then that change was almost immediately backed out:
> https://github.com/ClusterLabs/pacemaker/commit/b33d3bf5376ab59baa435086c803b9fdaf6de504

Because it was felt that RAs shouldn't need to know.
Those options change pacemaker's behaviour, not the RAs'.

But subsequently, in lf#2391, you convinced us to add notify since it
allowed the drbd agent to error out if they were not turned on.
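A minimal sketch of that kind of guard inside an RA (an assumption for illustration, not the actual drbd agent code; it presumes ocf-shellfuncs has been sourced as usual):

    # Refuse to run unless the clone/ms resource was configured with notify=true.
    if ! ocf_is_true "$OCF_RESKEY_CRM_meta_notify"; then
        ocf_log err "This resource agent requires notify=true on the master/clone resource."
        exit $OCF_ERR_CONFIGURED
    fi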

>
> And since then, at some point evidently only interleave and notify
> made it back in. Any specific reason for omitting ordered? I happen to
> have a pretty good use case for an ordered-clone RA, and it would be
> handy to be able to test whether clone ordering has been enabled.

I'd need more information.  The RA shouldn't need to care, I would have
thought. The ordering happens in the PE/crmd; the RA should just do
what it's told.
>
> All insights are much appreciated.
>
> Cheers,
> Florian
>
> --
> Need help with High Availability?
> http://www.hastexo.com/now
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] OCF_RESKEY_CRM_meta_{ordered,notify,interleave}

2012-03-29 Thread Florian Haas
Lars (lmb), or Andrew -- maybe one of you remembers what this was all about.

In this commit, Lars enabled the
OCF_RESKEY_CRM_meta_{ordered,notify,interleave} attributes to be
injected into the environment of RAs:
https://github.com/ClusterLabs/pacemaker/commit/b0ba01f61086f073be69db3e6beb0914642f79d9

Then that change was almost immediately backed out:
https://github.com/ClusterLabs/pacemaker/commit/b33d3bf5376ab59baa435086c803b9fdaf6de504

And since then, at some point evidently only interleave and notify
made it back in. Any specific reason for omitting ordered? I happen to
have a pretty good use case for an ordered-clone RA, and it would be
handy to be able to test whether clone ordering has been enabled.

All insights are much appreciated.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] VirtualDomain Shutdown Timeout

2012-03-29 Thread Andrew Martin
Hi Andrew, 


Thanks, that sounds good. I am using the Ubuntu HA ppa, so I will wait for a 
1.1.7 package to become available. 


Andrew 

- Original Message -

From: "Andrew Beekhof"  
To: "The Pacemaker cluster resource manager"  
Sent: Thursday, March 29, 2012 1:08:21 AM 
Subject: Re: [Pacemaker] VirtualDomain Shutdown Timeout 

On Sun, Mar 25, 2012 at 6:27 AM, Andrew Martin  wrote: 
> Hello, 
> 
> I have configured a KVM virtual machine primitive using Pacemaker 1.1.6 and 
> Heartbeat 3.0.5 on Ubuntu 10.04 Server using DRBD as the storage device (so 
> there is no shared storage, no live-migration): 
> primitive p_vm ocf:heartbeat:VirtualDomain \ 
> params config="/vmstore/config/vm.xml" \ 
> meta allow-migrate="false" \ 
> op start interval="0" timeout="180s" \ 
> op stop interval="0" timeout="120s" \ 
> op monitor interval="10" timeout="30" 
> 
> I would expect the following events to happen on failover on the "from" node 
> (the migration source) if the VM hangs while shutting down: 
> 1. VirtualDomain issues "virsh shutdown vm" to gracefully shutdown the VM 
> 2. pacemaker waits 120 seconds for the timeout specified in the "op stop" 
> timeout 
> 3. VirtualDomain waits a bit less than 120 seconds to see if it will 
> gracefully shutdown. Once it gets to almost 120 seconds, it issues "virsh 
> destroy vm" to hard stop the VM. 
> 4. pacemaker wakes up from the 120 second timeout and sees that the VM has 
> stopped and proceeds with the failover 
> 
> However, I observed that VirtualDomain seems to be using the timeout from 
> the "op start" line, 180 seconds, yet pacemaker uses the 120 second timeout. 
> Thus, the VM is still running after the pacemaker timeout is reached and so 
> the node is STONITHed. Here is the relevant section of code from 
> /usr/lib/ocf/resource.d/heartbeat/VirtualDomain: 
> VirtualDomain_Stop() { 
> local i 
> local status 
> local shutdown_timeout 
> local out ex 
> 
> VirtualDomain_Status 
> status=$? 
> 
> case $status in 
> $OCF_SUCCESS) 
> if ! ocf_is_true $OCF_RESKEY_force_stop; then 
> # Issue a graceful shutdown request 
> ocf_log info "Issuing graceful shutdown request for domain 
> ${DOMAIN_NAME}." 
> virsh $VIRSH_OPTIONS shutdown ${DOMAIN_NAME} 
> # The "shutdown_timeout" we use here is the operation 
> # timeout specified in the CIB, minus 5 seconds 
> shutdown_timeout=$(( $NOW + 
> ($OCF_RESKEY_CRM_meta_timeout/1000) -5 )) 
> # Loop on status until we reach $shutdown_timeout 
> while [ $NOW -lt $shutdown_timeout ]; do 
> 
> Doesn't $OCF_RESKEY_CRM_meta_timeout correspond to the timeout value in the 
> "op stop ..." line? 

It should; however, there was a bug in 1.1.6 where this wasn't the case. 
The relevant patch is: 
https://github.com/beekhof/pacemaker/commit/fcfe6fe 

Or you could try 1.1.7 
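For illustration, with the 120s stop timeout from the primitive above and the meta timeout passed through correctly, the RA's deadline works out as:

    shutdown_timeout = NOW + (OCF_RESKEY_CRM_meta_timeout / 1000) - 5
                     = NOW + (120000 / 1000) - 5
                     = NOW + 115 seconds

so the graceful shutdown is given roughly 115 seconds before virsh destroy is issued, leaving about 5 seconds of headroom before Pacemaker's own 120-second stop timeout expires.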

> 
> How can I optimize my pacemaker configuration so that the VM will attempt to 
> gracefully shutdown and then at worst case destroy the VM before the 
> pacemaker timeout is reached? Moreover, is there anything I can do inside of 
> the VM (another Ubuntu 10.04 install) to optimize/speed up the shutdown 
> process? 
> 
> Thanks, 
> 
> Andrew 
> 
> 
> ___ 
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 
> 

___ 
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker 

Project Home: http://www.clusterlabs.org 
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
Bugs: http://bugs.clusterlabs.org 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Issue with ordering

2012-03-29 Thread Florian Haas
On Thu, Mar 29, 2012 at 11:40 AM, Vladislav Bogdanov
 wrote:
> Hi Florian,
>
> 29.03.2012 11:54, Florian Haas wrote:
>> On Thu, Mar 29, 2012 at 10:07 AM, Vladislav Bogdanov
>>  wrote:
>>> Hi Andrew, all,
>>>
>>> I'm continuing experiments with lustre on stacked drbd, and see
>>> following problem:
>>
>> At the risk of going off topic, can you explain *why* you want to do
>> this? If you need a distributed, replicated filesystem with
>> asynchronous replication capability (the latter presumably for DR),
>> why not use a Distributed-Replicated GlusterFS volume with
>> geo-replication?
>
> I need fast POSIX fs scalable to tens of petabytes with support for
> fallocate() and friends to prevent fragmentation.
>
> I generally agree with Linus about FUSE and userspace filesystems in
> general, so that is not an option.

I generally agree with Linus and just about everyone else that
filesystems shouldn't require invasive core kernel patches. But I
digress. :)

> Using any API except what VFS provides via syscalls+glibc is not an
> option too because I need access to files from various scripted
> languages including shell and directly from a web server written in C.
> Having bindings for them all is a real overkill. And it all is in
> userspace again.
>
> So I generally have choice of CEPH, Lustre, GPFS and PVFS.
>
> CEPH is still very alpha, so I can't rely on it, although I keep my eye
> on it.
>
> GPFS is not an option because it is not free and produced by IBM (can't
> say which of these two is more important ;) )
>
> Can't remember why exactly PVFS is a no-go, their site is down right
> now. Probably userspace server implementation (although some examples
> like nfs server discredit idea of in-kernel servers, I still believe
> this is a way to go).

Ceph is 100% userspace server side, jftr. :) And it has no async
replication capability at this point, which you seem to be after.

> Lustre is widely deployed, predictable and stable. It fully runs in
> kernel space. Although Oracle did its best to bury Lustre development,
> it is actively developed by whamcloud and company. They have builds for
> EL6, so I'm pretty happy with this. Lustre doesn't have any replication
> built-in so I need to add it on a lower layer (no rsync, no rsync, no
> rsync ;) ). DRBD suits my needs for a simple HA.
>
> But I also need datacenter-level HA, that's why I evaluate stacked DRBD
> and tickets with booth.
>
> So, frankly speaking, I decided to go with Lustre not because it is so
> cool (it has many-many niceties), but because all others I know do not
> suit my needs at all due to various reasons.
>
> Hope this clarifies my point,

It does. Doesn't necessarily mean I agree, but the point you're making is fine.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker + Oracle

2012-03-29 Thread emmanuel segura
cat /etc/oratab

And maybe you can post your log :-)
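For reference, the command as posted contains typos ("configureprimitive" and "inetrval"); a correctly formed equivalent would look something like the sketch below (the sid value is taken from the post, and the agent normally reads the Oracle home and user from /etc/oratab, hence the request above):

    crm configure primitive Oracle ocf:heartbeat:oracle \
        params sid="OracleDB" \
        op monitor interval="120s"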

On 29 March 2012 13:53, Ruwan Fernando  wrote:

> Hi,
> I'm working with a Pacemaker active/passive cluster and need to use Oracle
> as a resource in Pacemaker. My resource command is
> crm configureprimitive Oracle ocf:heartbeat:oracle params sid=OracleDB op
> monitor inetrval=120s
> but it did not work for me.
>
> Can someone help out on this matter?
>
> Regards,
> Ruwan
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>


-- 
this is my life and I live it as long as God wills
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Pacemaker + Oracle

2012-03-29 Thread Ruwan Fernando
Hi,
I'm working with a Pacemaker active/passive cluster and need to use Oracle as
a resource in Pacemaker. My resource command is
crm configureprimitive Oracle ocf:heartbeat:oracle params sid=OracleDB op
monitor inetrval=120s
but it did not work for me.

Can someone help out on this matter?

Regards,
Ruwan
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CIB not saved

2012-03-29 Thread Fiorenza Meini

On 29/03/2012 10:12, Rasto Levrinc wrote:

On Thu, Mar 29, 2012 at 9:54 AM, Fiorenza Meini  wrote:

Hi there,
a strange thing happened to my two-node cluster: I rebooted both machines at
the same time, and when the OS came up again, no resources were configured
anymore, as if it were a fresh installation. Why?
It was explained to me that the configuration of resources managed by
pacemaker should be in a file called cib.xml, but I cannot find it in the
system. Do I have to specify any particular option in the configuration file?


Normally you shouldn't worry about it. cib.xml is stored in
/var/lib/heartbeat/crm/ or similar, and the directory should have
hacluster:haclient permissions. What distro is it and how did you install
it?

Rasto



Thanks, it was a permissions problem.

Regards
--

Fiorenza Meini
Spazio Web S.r.l.

V. Dante Alighieri, 10 - 13900 Biella
Tel.: 015.2431982 - 015.9526066
Fax: 015.2522600
Reg. Imprese, CF e P.I.: 02414430021
Iscr. REA: BI - 188936
Iscr. CCIAA: Biella - 188936
Cap. Soc.: 30.000,00 Euro i.v.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Issue with ordering

2012-03-29 Thread Vladislav Bogdanov
Hi Florian,

29.03.2012 11:54, Florian Haas wrote:
> On Thu, Mar 29, 2012 at 10:07 AM, Vladislav Bogdanov
>  wrote:
>> Hi Andrew, all,
>>
>> I'm continuing experiments with lustre on stacked drbd, and see
>> following problem:
> 
> At the risk of going off topic, can you explain *why* you want to do
> this? If you need a distributed, replicated filesystem with
> asynchronous replication capability (the latter presumably for DR),
> why not use a Distributed-Replicated GlusterFS volume with
> geo-replication?

I need fast POSIX fs scalable to tens of petabytes with support for
fallocate() and friends to prevent fragmentation.

I generally agree with Linus about FUSE and userspace filesystems in
general, so that is not an option.

Using any API except what VFS provides via syscalls+glibc is not an
option either, because I need access to files from various scripted
languages including shell and directly from a web server written in C.
Having bindings for them all is a real overkill. And it all is in
userspace again.

So I generally have choice of CEPH, Lustre, GPFS and PVFS.

CEPH is still very alpha, so I can't rely on it, although I keep my eye
on it.

GPFS is not an option because it is not free and produced by IBM (can't
say which of these two is more important ;) )

Can't remember why exactly PVFS is a no-go, their site is down right
now. Probably userspace server implementation (although some examples
like nfs server discredit idea of in-kernel servers, I still believe
this is a way to go).

Lustre is widely deployed, predictable and stable. It fully runs in
kernel space. Although Oracle did its best to bury Lustre development,
it is actively developed by whamcloud and company. They have builds for
EL6, so I'm pretty happy with this. Lustre doesn't have any replication
built-in so I need to add it on a lower layer (no rsync, no rsync, no
rsync ;) ). DRBD suits my needs for a simple HA.

But I also need datacenter-level HA, that's why I evaluate stacked DRBD
and tickets with booth.

So, frankly speaking, I decided to go with Lustre not because it is so
cool (it has many-many niceties), but because all others I know do not
suit my needs at all due to various reasons.

Hope this clarifies my point,

Best,
Vladislav

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Issue with ordering

2012-03-29 Thread Florian Haas
On Thu, Mar 29, 2012 at 10:07 AM, Vladislav Bogdanov
 wrote:
> Hi Andrew, all,
>
> I'm continuing experiments with lustre on stacked drbd, and see
> following problem:

At the risk of going off topic, can you explain *why* you want to do
this? If you need a distributed, replicated filesystem with
asynchronous replication capability (the latter presumably for DR),
why not use a Distributed-Replicated GlusterFS volume with
geo-replication?

Note that I know next to nothing about your actual detailed
requirements, so GlusterFS may well be non-ideal for you and my
suggestion may thus be moot, but it would be nice if you could explain
why you're doing this.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CIB not saved

2012-03-29 Thread Rasto Levrinc
On Thu, Mar 29, 2012 at 9:54 AM, Fiorenza Meini  wrote:
> Hi there,
> a strange thing happened to my two-node cluster: I rebooted both machines at
> the same time, and when the OS came up again, no resources were configured
> anymore, as if it were a fresh installation. Why?
> It was explained to me that the configuration of resources managed by
> pacemaker should be in a file called cib.xml, but I cannot find it in the
> system. Do I have to specify any particular option in the configuration file?

Normally you shouldn't worry about it. cib.xml is stored in
/var/lib/heartbeat/crm/ or similar, and the directory should have
hacluster:haclient permissions. What distro is it and how did you install
it?
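A quick way to check and repair that (the path is the typical heartbeat location mentioned above; the mode shown is an assumption, adjust for your distro):

    ls -ld /var/lib/heartbeat/crm
    chown -R hacluster:haclient /var/lib/heartbeat/crm
    chmod 750 /var/lib/heartbeat/crm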

Rasto

-- 
Dipl.-Ing. Rastislav Levrinc
rasto.levr...@gmail.com
Linux Cluster Management Console
http://lcmc.sf.net/

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Issue with ordering

2012-03-29 Thread Vladislav Bogdanov
Hi Andrew, all,

I'm continuing experiments with lustre on stacked drbd, and see
following problem:

I have one drbd resource (ms-drbd-testfs-mdt) stacked on top of
another (ms-drbd-testfs-mdt-left), and have the following constraints
between them:

colocation drbd-testfs-mdt-with-drbd-testfs-mdt-left inf:
ms-drbd-testfs-mdt ms-drbd-testfs-mdt-left:Master
order drbd-testfs-mdt-after-drbd-testfs-mdt-left inf:
ms-drbd-testfs-mdt-left:promote ms-drbd-testfs-mdt:start

Then I have filesystem mounted on top of ms-drbd-testfs-mdt
(testfs-mdt resource).

colocation testfs-mdt-with-drbd-testfs-mdt inf: testfs-mdt
ms-drbd-testfs-mdt:Master
order testfs-mdt-after-drbd-testfs-mdt inf:
ms-drbd-testfs-mdt:promote testfs-mdt:start

When I trigger event which causes many resources to stop (including
these three), LogActions output look like:

LogActions: Stop    drbd-local#011(lustre01-left)
LogActions: Stop    drbd-stacked#011(Started lustre02-left)
LogActions: Stop    drbd-testfs-local#011(Started lustre03-left)
LogActions: Stop    drbd-testfs-stacked#011(Started lustre04-left)
LogActions: Stop    lustre#011(Started lustre04-left)
LogActions: Stop    mgs#011(Started lustre01-left)
LogActions: Stop    testfs#011(Started lustre03-left)
LogActions: Stop    testfs-mdt#011(Started lustre01-left)
LogActions: Stop    testfs-ost#011(Started lustre01-left)
LogActions: Stop    testfs-ost0001#011(Started lustre02-left)
LogActions: Stop    testfs-ost0002#011(Started lustre03-left)
LogActions: Stop    testfs-ost0003#011(Started lustre04-left)
LogActions: Stop    drbd-mgs:0#011(Master lustre01-left)
LogActions: Stop    drbd-mgs:1#011(Slave lustre02-left)
LogActions: Stop    drbd-testfs-mdt:0#011(Master lustre01-left)
LogActions: Stop    drbd-testfs-mdt-left:0#011(Master lustre01-left)
LogActions: Stop    drbd-testfs-mdt-left:1#011(Slave lustre02-left)
LogActions: Stop    drbd-testfs-ost:0#011(Master lustre01-left)
LogActions: Stop    drbd-testfs-ost-left:0#011(Master lustre01-left)
LogActions: Stop    drbd-testfs-ost-left:1#011(Slave lustre02-left)
LogActions: Stop    drbd-testfs-ost0001:0#011(Master lustre02-left)
LogActions: Stop    drbd-testfs-ost0001-left:0#011(Master lustre02-left)
LogActions: Stop    drbd-testfs-ost0001-left:1#011(Slave lustre01-left)
LogActions: Stop    drbd-testfs-ost0002:0#011(Master lustre03-left)
LogActions: Stop    drbd-testfs-ost0002-left:0#011(Master lustre03-left)
LogActions: Stop    drbd-testfs-ost0002-left:1#011(Slave lustre04-left)
LogActions: Stop    drbd-testfs-ost0003:0#011(Master lustre04-left)
LogActions: Stop    drbd-testfs-ost0003-left:0#011(Master lustre04-left)
LogActions: Stop    drbd-testfs-ost0003-left:1#011(Slave lustre03-left)

For some reason demote is not run on both mdt drbd resources (should
it?), so the drbd RA prints a warning about that.

What I see then is that the cluster tries to stop ms-drbd-testfs-mdt-left
before ms-drbd-testfs-mdt.

Moreover, the testfs-mdt filesystem resource is not stopped before
drbd-testfs-mdt is stopped.

I have advisory ordering constraints between mdt and ost filesystem
resources, so all OSTs are stopped before the mdt. Thus the mdt stop is delayed
a bit. Maybe this influences what happens.

I'm pretty sure I have correct constraints for at least these three
resources, so it looks like a bug, because mandatory ordering is not
preserved.

I can produce a report for this.

Best,
Vladislav

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] CIB not saved

2012-03-29 Thread Fiorenza Meini

Hi there,
a strange thing happened to my two-node cluster: I rebooted both machines 
at the same time, and when the OS came up again, no resources were configured 
anymore, as if it were a fresh installation. Why?
It was explained to me that the configuration of resources managed by 
pacemaker should be in a file called cib.xml, but I cannot find it in the 
system. Do I have to specify any particular option in the configuration file?


Thanks and regards
--

Fiorenza Meini
Spazio Web S.r.l.

V. Dante Alighieri, 10 - 13900 Biella
Tel.: 015.2431982 - 015.9526066
Fax: 015.2522600
Reg. Imprese, CF e P.I.: 02414430021
Iscr. REA: BI - 188936
Iscr. CCIAA: Biella - 188936
Cap. Soc.: 30.000,00 Euro i.v.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Migration of "lower" resource causes dependent resources to restart

2012-03-29 Thread Vladislav Bogdanov
29.03.2012 10:07, Andrew Beekhof wrote:
> On Thu, Mar 29, 2012 at 5:43 PM, Vladislav Bogdanov
>  wrote:
>> 29.03.2012 09:35, Andrew Beekhof wrote:
>>> On Thu, Mar 29, 2012 at 5:28 PM, Vladislav Bogdanov
>>>  wrote:
 Hi Andrew, all,

 Pacemaker restarts resources when a resource they depend on (ordering
 only, no colocation) is migrated.

 I mean that when I do crm resource migrate lustre, I get

 LogActions: Migrate lustre#011(Started lustre03-left -> lustre04-left)
 LogActions: Restart mgs#011(Started lustre01-left)

 I only have one ordering constraint for these two resources:

 order mgs-after-lustre inf: lustre:start mgs:start

 This reminds me of what happened with reload in the past (a dependent resource
 restart when a "lower" resource is reloaded).

 Shouldn't this be changed? Migration usually means that service is not
 interrupted...
>>>
>>> Is that strictly true?  Always?
>>
>> This probably depends on implementation.
>> With qemu live migration - yes.
> 
> So there will be no point at which, for example, pinging the VM's ip
> address fails?

Even existing connections are preserved.
Small delays during the last migration phase are still possible, but they
are minor (around 100-200 milliseconds while the context is switched
and the IP is announced from the other node). And packets are not lost, just
delayed a bit.

I have corosync/pacemaker udpu clusters in VMs, and even corosync is
happy when the VM it runs on is migrated to another node (with some token
tuning; see the sketch below).
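A sketch of that kind of token tuning in corosync.conf (only the relevant directive is shown, and the value is an example; it simply needs to exceed the pause at the end of the live migration):

    totem {
        version: 2
        # raise the token timeout so membership is not lost during the
        # short pause at the end of a live migration (value assumed)
        token: 10000
    }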

> 
>> With pacemaker:Dummy (with meta allow-migrate="true") probably yes too...
>>
>>> My understanding was that although A thinks the migration happens
>>> instantaneously, it is in fact more likely to be pause+migrate+resume,
>>> and anyone trying to talk to A during that time is
>>> going to be disappointed.
>>
>>
>>>
>>> ___
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Migration of "lower" resource causes dependent resources to restart

2012-03-29 Thread Andrew Beekhof
On Thu, Mar 29, 2012 at 5:43 PM, Vladislav Bogdanov
 wrote:
> 29.03.2012 09:35, Andrew Beekhof wrote:
>> On Thu, Mar 29, 2012 at 5:28 PM, Vladislav Bogdanov
>>  wrote:
>>> Hi Andrew, all,
>>>
>>> Pacemaker restarts resources when a resource they depend on (ordering
>>> only, no colocation) is migrated.
>>>
>>> I mean that when I do crm resource migrate lustre, I get
>>>
>>> LogActions: Migrate lustre#011(Started lustre03-left -> lustre04-left)
>>> LogActions: Restart mgs#011(Started lustre01-left)
>>>
>>> I only have one ordering constraint for these two resources:
>>>
>>> order mgs-after-lustre inf: lustre:start mgs:start
>>>
>>> This reminds me of what happened with reload in the past (a dependent resource
>>> restart when a "lower" resource is reloaded).
>>>
>>> Shouldn't this be changed? Migration usually means that service is not
>>> interrupted...
>>
>> Is that strictly true?  Always?
>
> This probably depends on implementation.
> With qemu live migration - yes.

So there will be no point at which, for example, pinging the VM's ip
address fails?

> With pacemaker:Dummy (with meta allow-migrate="true") probably yes too...
>
>> My understanding was that although A thinks the migration happens
>> instantaneously, it is in fact more likely to be pause+migrate+resume,
>> and anyone trying to talk to A during that time is
>> going to be disappointed.
>
>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org