Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrey Groshev


14.01.2014, 07:47, "Andrew Beekhof" :
> Ok, here's what happens:
>
> 1. node2 is lost
> 2. fencing of node2 starts
> 3. node2 reboots (and cluster starts)
> 4. node2 returns to the membership
> 5. node2 is marked as a cluster member
> 6. DC tries to bring it into the cluster, but needs to cancel the active 
> transition first.
>    Which is a problem since the node2 fencing operation is part of that
> 7. node2 is in a transition (pending) state until fencing passes or fails
> 8a. fencing fails: transition completes and the node joins the cluster
>
> That's in theory, except we automatically try again, which isn't appropriate.
> This should be relatively easy to fix.
>
> 8b. fencing passes: the node is incorrectly marked as offline
>
> This I have no idea how to fix yet.
>
> On another note, it doesn't look like this agent works at all.
> The node has been back online for a long time and the agent is still timing 
> out after 10 minutes.
> So "Once the script makes sure that the victim will rebooted and again 
> available via ssh - it exit with 0." does not seem true.

Damn. Looks like you're right. At some point I broke my agent and didn't 
notice. I'll have to work out what went wrong.

> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof  wrote:
>
>>  Apart from anything else, your timeout needs to be bigger:
>>
>>  Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
>> commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 
>> 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
>> 'st1' returned: -62 (Timer expired)
>>
>>  On 14 Jan 2014, at 7:18 am, Andrew Beekhof  wrote:
>>>  On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:
  13.01.2014, 02:51, "Andrew Beekhof" :
>  On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>>  10.01.2014, 14:31, "Andrey Groshev" :
>>>  10.01.2014, 14:01, "Andrew Beekhof" :
  On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
>   10.01.2014, 05:29, "Andrew Beekhof" :
>>    On 9 Jan 2014, at 11:11 pm, Andrey Groshev  
>> wrote:
>>> 08.01.2014, 06:22, "Andrew Beekhof" :
 On 29 Nov 2013, at 7:17 pm, Andrey Groshev  
 wrote:
>  Hi, ALL.
>
>  I'm still trying to cope with the fact that after the fence 
> - node hangs in "pending".
 Please define "pending".  Where did you see this?
>>> In crm_mon:
>>> ..
>>> Node dev-cluster2-node2 (172793105): pending
>>> ..
>>>
>>> The experiment was like this:
>>> Four nodes in the cluster.
>>> On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>> After that, the remaining nodes keep rebooting it, under various 
>>> pretexts: "softly whistling", "fly low", "not a cluster member!" ...
>>> Then "Too many failures" falls out in the log.
>>> All this time the status in crm_mon is "pending",
>>> changing to "UNCLEAN" depending on the wind direction.
>>> Much time has passed and I cannot accurately describe the behaviour...
>>>
>>> Now I am in the following state:
>>> I tried to locate the problem and came up with this:
>>> I set a big value in the property stonith-timeout="600s"
>>> and got the following behaviour:
>>> 1. pkill -4 corosync
>>> 2. The node with the DC calls my fence agent "sshbykey".
>>> 3. It sends a reboot to the victim and waits until she comes back to 
>>> life again.
>>    Hmmm what version of pacemaker?
>>    This sounds like a timing issue that we fixed a while back
>   It was version 1.1.11 from December 3.
>   I will now do a full update and retest.
  That should be recent enough.  Can you create a crm_report the next 
 time you reproduce?
>>>  Of course yes. Little delay :)
>>>
>>>  ..
>>>  cc1: warnings being treated as errors
>>>  upstart.c: In function ‘upstart_job_property’:
>>>  upstart.c:264: error: implicit declaration of function 
>>> ‘g_variant_lookup_value’
>>>  upstart.c:264: error: nested extern declaration of 
>>> ‘g_variant_lookup_value’
>>>  upstart.c:264: error: assignment makes pointer from integer without a 
>>> cast
>>>  gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>  gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>  make[1]: *** [all-recursive] Error 1
>>>  make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>  make: *** [core] Error 1
>>>
>>>  I'm trying to solve this problem.
>>  It is not getting solved quickly...
>>
>>  
>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>  g_variant_lookup_value () Since 2.28
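
A quick way to check the installed glib before attempting the build (assuming 
pkg-config is present on the build host):

    pkg-config --modversion glib-2.0   # must report 2.28 or later for g_variant_lookup_value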

Re: [Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 4:33 pm, renayama19661...@ybb.ne.jp wrote:

> Hi Andrew,
> 
 Are you using the new attrd code or the legacy stuff?
>>> 
>>> I use new attrd.
>> 
>> And the values are not being sent to the cib at the same time? 
> 
> As far as I could see...
> When a node's attrd was late transmitting its attribute, the attrd leader 
> seemed to send the attributes to the cib without waiting for it.

And you have a delay configured?  And this value was set prior to that delay 
expiring?

> 
 Only the new code makes (or at least should do) crmd-transition-delay 
 redundant.
>>> 
>>> To me, the new attrd did not seem to make crmd-transition-delay 
>>> unnecessary.
>>> I will report the details again.
>>> # Probably it will end up in Bugzilla. . .
>> 
>> Sounds good
> 
> All right!
> 
> Many Thanks!
> Hideo Yamauch.
> 
> --- On Tue, 2014/1/14, Andrew Beekhof  wrote:
> 
>> 
>> On 14 Jan 2014, at 4:13 pm, renayama19661...@ybb.ne.jp wrote:
>> 
>>> Hi Andrew,
>>> 
>>> Thank you for comments.
>>> 
 Are you using the new attrd code or the legacy stuff?
>>> 
>>> I use new attrd.
>> 
>> And the values are not being sent to the cib at the same time? 
>> 
>>> 
 
 If you're not using corosync 2.x or see:
 
  crm_notice("Starting mainloop...");
 
 then its the old code.  The new code could also be used with CMAN but 
 isn't configured to build for in that situation.
 
 Only the new code makes (or at least should do) crmd-transition-delay 
 redundant.
>>> 
>>> It did not seem to work so that new attrd dispensed with 
>>> crmd-transition-delay to me.
>>> I report the details again.
>>> # Probably it will be Bugzilla. . .
>> 
>> Sounds good
>> 
>>> 
>>> Best Regards,
>>> Hideo Yamauchi.
>>> 
>>> --- On Tue, 2014/1/14, Andrew Beekhof  wrote:
>>> 
 
 On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote:
 
> Hi All,
> 
> I contributed next bugzilla by a problem to occur for the difference of 
> the timing of the attribute update by attrd before.
> * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528
> 
> We can evade this problem now by using crmd-transition-delay parameter.
> 
> I confirmed whether I could evade this problem by renewed attrd recently.
> * In latest attrd, one became a leader and seemed to come to update an 
> attribute.
> 
> However, latest attrd does not seem to substitute for 
> crmd-transition-delay.
> * I contribute detailed log later.
> 
> We are dissatisfied with continuing using crmd-transition-delay.
> Is there the plan when attrd handles this problem well in the future?
 
 Are you using the new attrd code or the legacy stuff?
 
 If you're not using corosync 2.x or see:
 
  crm_notice("Starting mainloop...");
 
 then its the old code.  The new code could also be used with CMAN but 
 isn't configured to build for in that situation.
 
 Only the new code makes (or at least should do) crmd-transition-delay 
 redundant.
 
>> 
>> 



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)

2014-01-13 Thread renayama19661014
Hi Andrew,

> >> Are you using the new attrd code or the legacy stuff?
> > 
> > I use new attrd.
> 
> And the values are not being sent to the cib at the same time? 

As far as I could see...
When a node's attrd was late transmitting its attribute, the attrd leader 
seemed to send the attributes to the cib without waiting for it.

> >> Only the new code makes (or at least should do) crmd-transition-delay 
> >> redundant.
> > 
> > It did not seem to work so that new attrd dispensed with 
> > crmd-transition-delay to me.
> > I report the details again.
> > # Probably it will be Bugzilla. . .
> 
> Sounds good

All right!

Many Thanks!
Hideo Yamauch.

--- On Tue, 2014/1/14, Andrew Beekhof  wrote:

> 
> On 14 Jan 2014, at 4:13 pm, renayama19661...@ybb.ne.jp wrote:
> 
> > Hi Andrew,
> > 
> > Thank you for comments.
> > 
> >> Are you using the new attrd code or the legacy stuff?
> > 
> > I use new attrd.
> 
> And the values are not being sent to the cib at the same time? 
> 
> > 
> >> 
> >> If you're not using corosync 2.x or see:
> >> 
> >>     crm_notice("Starting mainloop...");
> >> 
> >> then its the old code.  The new code could also be used with CMAN but 
> >> isn't configured to build for in that situation.
> >> 
> >> Only the new code makes (or at least should do) crmd-transition-delay 
> >> redundant.
> > 
> > It did not seem to work so that new attrd dispensed with 
> > crmd-transition-delay to me.
> > I report the details again.
> > # Probably it will be Bugzilla. . .
> 
> Sounds good
> 
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > --- On Tue, 2014/1/14, Andrew Beekhof  wrote:
> > 
> >> 
> >> On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote:
> >> 
> >>> Hi All,
> >>> 
> >>> I contributed next bugzilla by a problem to occur for the difference of 
> >>> the timing of the attribute update by attrd before.
> >>> * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528
> >>> 
> >>> We can evade this problem now by using crmd-transition-delay parameter.
> >>> 
> >>> I confirmed whether I could evade this problem by renewed attrd recently.
> >>> * In latest attrd, one became a leader and seemed to come to update an 
> >>> attribute.
> >>> 
> >>> However, latest attrd does not seem to substitute for 
> >>> crmd-transition-delay.
> >>> * I contribute detailed log later.
> >>> 
> >>> We are dissatisfied with continuing using crmd-transition-delay.
> >>> Is there the plan when attrd handles this problem well in the future?
> >> 
> >> Are you using the new attrd code or the legacy stuff?
> >> 
> >> If you're not using corosync 2.x or see:
> >> 
> >>     crm_notice("Starting mainloop...");
> >> 
> >> then its the old code.  The new code could also be used with CMAN but 
> >> isn't configured to build for in that situation.
> >> 
> >> Only the new code makes (or at least should do) crmd-transition-delay 
> >> redundant.
> >> 
> 
> 



Re: [Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 4:13 pm, renayama19661...@ybb.ne.jp wrote:

> Hi Andrew,
> 
> Thank you for comments.
> 
>> Are you using the new attrd code or the legacy stuff?
> 
> I use new attrd.

And the values are not being sent to the cib at the same time? 

> 
>> 
>> If you're not using corosync 2.x or see:
>> 
>> crm_notice("Starting mainloop...");
>> 
>> then its the old code.  The new code could also be used with CMAN but isn't 
>> configured to build for in that situation.
>> 
>> Only the new code makes (or at least should do) crmd-transition-delay 
>> redundant.
> 
> It did not seem to work so that new attrd dispensed with 
> crmd-transition-delay to me.
> I report the details again.
> # Probably it will be Bugzilla. . .

Sounds good

> 
> Best Regards,
> Hideo Yamauchi.
> 
> --- On Tue, 2014/1/14, Andrew Beekhof  wrote:
> 
>> 
>> On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote:
>> 
>>> Hi All,
>>> 
>>> I contributed next bugzilla by a problem to occur for the difference of the 
>>> timing of the attribute update by attrd before.
>>> * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528
>>> 
>>> We can evade this problem now by using crmd-transition-delay parameter.
>>> 
>>> I confirmed whether I could evade this problem by renewed attrd recently.
>>> * In latest attrd, one became a leader and seemed to come to update an 
>>> attribute.
>>> 
>>> However, latest attrd does not seem to substitute for crmd-transition-delay.
>>> * I contribute detailed log later.
>>> 
>>> We are dissatisfied with continuing using crmd-transition-delay.
>>> Is there the plan when attrd handles this problem well in the future?
>> 
>> Are you using the new attrd code or the legacy stuff?
>> 
>> If you're not using corosync 2.x or see:
>> 
>> crm_notice("Starting mainloop...");
>> 
>> then its the old code.  The new code could also be used with CMAN but isn't 
>> configured to build for in that situation.
>> 
>> Only the new code makes (or at least should do) crmd-transition-delay 
>> redundant.
>> 





Re: [Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)

2014-01-13 Thread renayama19661014
Hi Andrew,

Thank you for comments.

> Are you using the new attrd code or the legacy stuff?

I use new attrd.

> 
> If you're not using corosync 2.x or see:
> 
>     crm_notice("Starting mainloop...");
> 
> then it's the old code.  The new code could also be used with CMAN but isn't 
> configured to build in that situation.
> 
> Only the new code makes (or at least should do) crmd-transition-delay 
> redundant.

To me, the new attrd did not seem to make crmd-transition-delay unnecessary.
I will report the details again.
# Probably it will end up in Bugzilla. . .

Best Regards,
Hideo Yamauchi.

--- On Tue, 2014/1/14, Andrew Beekhof  wrote:

> 
> On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote:
> 
> > Hi All,
> > 
> > I contributed next bugzilla by a problem to occur for the difference of the 
> > timing of the attribute update by attrd before.
> > * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528
> > 
> > We can evade this problem now by using crmd-transition-delay parameter.
> > 
> > I confirmed whether I could evade this problem by renewed attrd recently.
> > * In latest attrd, one became a leader and seemed to come to update an 
> > attribute.
> > 
> > However, latest attrd does not seem to substitute for crmd-transition-delay.
> > * I contribute detailed log later.
> > 
> > We are dissatisfied with continuing using crmd-transition-delay.
> > Is there the plan when attrd handles this problem well in the future?
> 
> Are you using the new attrd code or the legacy stuff?
> 
> If you're not using corosync 2.x or see:
> 
>     crm_notice("Starting mainloop...");
> 
> then its the old code.  The new code could also be used with CMAN but isn't 
> configured to build for in that situation.
> 
> Only the new code makes (or at least should do) crmd-transition-delay 
> redundant.
> 



Re: [Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 3:52 pm, renayama19661...@ybb.ne.jp wrote:

> Hi All,
> 
> I contributed next bugzilla by a problem to occur for the difference of the 
> timing of the attribute update by attrd before.
> * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528
> 
> We can evade this problem now by using crmd-transition-delay parameter.
> 
> I confirmed whether I could evade this problem by renewed attrd recently.
> * In latest attrd, one became a leader and seemed to come to update an 
> attribute.
> 
> However, latest attrd does not seem to substitute for crmd-transition-delay.
> * I contribute detailed log later.
> 
> We are dissatisfied with continuing using crmd-transition-delay.
> Is there the plan when attrd handles this problem well in the future?

Are you using the new attrd code or the legacy stuff?

If you're not using corosync 2.x or see:

crm_notice("Starting mainloop...");

then it's the old code.  The new code could also be used with CMAN but isn't 
configured to build in that situation.

Only the new code makes (or at least should do) crmd-transition-delay redundant.
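
A rough way to check which attrd a node is actually running is to look for 
that message in the logs; the log path below is an assumption, adjust it to 
wherever corosync/pacemaker log on your system:

    # if this matches, the legacy attrd (and its old update behaviour) is in use
    grep "attrd.*Starting mainloop" /var/log/cluster/corosync.log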




Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 3:41 pm, Brian J. Murrell (brian)  
wrote:

> On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
>> 
>> The local cib hasn't caught up yet by the looks of it.
> 
> Should crm_resource actually be [mis-]reporting as if it were
> knowledgeable when it's not though?  IOW is this expected behaviour or
> should it be considered a bug?  Should I open a ticket?

It doesn't know that it doesn't know.
Does it show anything as running?  Any nodes as online?

I'd not expect that it stays in that situation for more than a second or two...

> 
>> You could compare 'cibadmin -Ql' with 'cibadmin -Q'
> 
> Is there no other way to force crm_resource to be truthful/accurate or
> silent if it cannot be truthful/accurate?  Having to run this kind of
> pre-check before every crm_resource --locate seems like it's going to
> drive overhead up quite a bit.

True.

> 
> Maybe I am using the wrong tool for the job.  Is there a better tool
> than crm_resource to ascertain, with full truthfullness (or silence if
> truthfullness is not possible), where resources are running?

We could add an option to force crm_resource to use the master instance instead 
of the local one I guess.
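
In the meantime, a minimal sketch of such a pre-check, assuming the standard 
pacemaker CLI tools and a hypothetical resource name my_resource:

    master_cib=$(cibadmin -Q)    # CIB as replicated cluster-wide
    local_cib=$(cibadmin -Ql)    # CIB held by the local node
    if [ "$master_cib" = "$local_cib" ]; then
        crm_resource --locate --resource my_resource
    else
        echo "local CIB not in sync yet; try again shortly" >&2
    fi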




[Pacemaker] A resource starts with a standby node.(Latest attrd does not serve as the crmd-transition-delay parameter)

2014-01-13 Thread renayama19661014
Hi All,

I previously filed the Bugzilla entry below about a problem caused by 
differences in the timing of attribute updates by attrd.
 * https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2528

We can currently work around this problem by using the crmd-transition-delay 
parameter.

I recently checked whether the renewed attrd avoids this problem.
 * In the latest attrd, one instance becomes the leader and appears to perform 
the attribute updates.

However, the latest attrd does not seem to be a substitute for 
crmd-transition-delay.
 * I will contribute a detailed log later.

We are not happy about having to keep using crmd-transition-delay.
Is there a plan for attrd to handle this problem properly in the future?

Best Regards,
Hideo Yamauchi.
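
For reference, crmd-transition-delay is an ordinary cluster property, so the 
workaround can be applied with the crmsh syntax used elsewhere in this thread; 
the 2s value below is only an illustration:

    crm configure property crmd-transition-delay="2s"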




Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-13 Thread Brian J. Murrell (brian)
On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
> 
> The local cib hasn't caught up yet by the looks of it.

Should crm_resource actually be [mis-]reporting as if it were
knowledgeable when it's not though?  IOW is this expected behaviour or
should it be considered a bug?  Should I open a ticket?

> You could compare 'cibadmin -Ql' with 'cibadmin -Q'

Is there no other way to force crm_resource to be truthful/accurate or
silent if it cannot be truthful/accurate?  Having to run this kind of
pre-check before every crm_resource --locate seems like it's going to
drive overhead up quite a bit.

Maybe I am using the wrong tool for the job.  Is there a better tool
than crm_resource to ascertain, with full truthfullness (or silence if
truthfullness is not possible), where resources are running?

Cheers,
b.






Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 3:34 pm, Andrey Groshev  wrote:

> 
> 
> 14.01.2014, 06:25, "Andrew Beekhof" :
>> Apart from anything else, your timeout needs to be bigger:
>> 
>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
>> commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 
>> 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
>> 'st1' returned: -62 (Timer expired)
>> 
> 
> Bigger than that?

See my other email, the agent is broken.

> By :21 node2 had long since booted and was (almost) working.

Exactly, so why didn't the agent return?

> #cat /var/log/cluster/mystonith.log
> .
> Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
> STONITH DEBUG(): getinfo-devdescr
> Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
> STONITH DEBUG(): getinfo-devid
> Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
> STONITH DEBUG(): getinfo-xml
> Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
> STONITH 
> DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): 
> getconfignames
> Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
> STONITH 
> DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): 
> status
> Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
> STONITH 
> DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): 
> getconfignames
> Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
> STONITH 
> DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): 
> reset dev-cluster2-node2.unix.tensor.ru
> Mon Jan 13 12:11:37 MSK 2014 Now boot time 1389256739, send reboot
> ...
> 
>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof  wrote:
>> 
>>>  On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:
  13.01.2014, 02:51, "Andrew Beekhof" :
>  On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>>  10.01.2014, 14:31, "Andrey Groshev" :
>>>  10.01.2014, 14:01, "Andrew Beekhof" :
   On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
>10.01.2014, 05:29, "Andrew Beekhof" :
>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev  
>> wrote:
>>>  08.01.2014, 06:22, "Andrew Beekhof" :
  On 29 Nov 2013, at 7:17 pm, Andrey Groshev  
 wrote:
>   Hi, ALL.
> 
>   I'm still trying to cope with the fact that after the fence 
> - node hangs in "pending".
  Please define "pending".  Where did you see this?
>>>  In crm_mon:
>>>  ..
>>>  Node dev-cluster2-node2 (172793105): pending
>>>  ..
>>> 
>>>  The experiment was like this:
>>>  Four nodes in cluster.
>>>  On one of them kill corosync or pacemakerd (signal 4 or 6 or 
>>> 11).
>>>  Thereafter, the remaining start it constantly reboot, under 
>>> various pretexts, "softly whistling", "fly low", "not a cluster 
>>> member!" ...
>>>  Then in the log fell out "Too many failures "
>>>  All this time in the status in crm_mon is "pending".
>>>  Depending on the wind direction changed to "UNCLEAN"
>>>  Much time has passed and I can not accurately describe the 
>>> behavior...
>>> 
>>>  Now I am in the following state:
>>>  I tried locate the problem. Came here with this.
>>>  I set big value in property stonith-timeout="600s".
>>>  And got the following behavior:
>>>  1. pkill -4 corosync
>>>  2. from node with DC call my fence agent "sshbykey"
>>>  3. It sends reboot victim and waits until she comes to life 
>>> again.
>> Hmmm what version of pacemaker?
>> This sounds like a timing issue that we fixed a while back
>Was a version 1.1.11 from December 3.
>Now try full update and retest.
   That should be recent enough.  Can you create a crm_report the next 
 time you reproduce?
>>>  Of course yes. Little delay :)
>>> 
>>>  ..
>>>  cc1: warnings being treated as errors
>>>  upstart.c: In function ‘upstart_job_property’:
>>>  upstart.c:264: error: implicit declaration of function 
>>> ‘g_variant_lookup_value’
>>>  upstart.c:264: error: nested extern declaration of 
>>> ‘g_variant_lookup_value’
>>>  upstart.c:264: error: assignment makes pointer from integer without a 
>>> cast
>>>  gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>  gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>  make[1]: *** [all-recursive] Error 1
>>>  make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>  mak

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrey Groshev


14.01.2014, 07:00, "Andrew Beekhof" :
> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof  wrote:
>
>>  Apart from anything else, your timeout needs to be bigger:
>>
>>  Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
>> commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 
>> 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
>> 'st1' returned: -62 (Timer expired)
>
> also:
>
> Jan 13 12:04:54 [17226] dev-cluster2-node1.unix.tensor.ru    pengine: ( 
> utils.c:723   )   error: unpack_operation: Specifying on_fail=fence and 
> stonith-enabled=false makes no sense

After the full config is loaded, the option changes to stonith-enabled=true.

>>  On 14 Jan 2014, at 7:18 am, Andrew Beekhof  wrote:
>>>  On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:
  13.01.2014, 02:51, "Andrew Beekhof" :
>  On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>>  10.01.2014, 14:31, "Andrey Groshev" :
>>>  10.01.2014, 14:01, "Andrew Beekhof" :
  On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
>   10.01.2014, 05:29, "Andrew Beekhof" :
>>    On 9 Jan 2014, at 11:11 pm, Andrey Groshev  
>> wrote:
>>> 08.01.2014, 06:22, "Andrew Beekhof" :
 On 29 Nov 2013, at 7:17 pm, Andrey Groshev  
 wrote:
>  Hi, ALL.
>
>  I'm still trying to cope with the fact that after the fence 
> - node hangs in "pending".
 Please define "pending".  Where did you see this?
>>> In crm_mon:
>>> ..
>>> Node dev-cluster2-node2 (172793105): pending
>>> ..
>>>
>>> The experiment was like this:
>>> Four nodes in cluster.
>>> On one of them kill corosync or pacemakerd (signal 4 or 6 or 
>>> 11).
>>> Thereafter, the remaining start it constantly reboot, under 
>>> various pretexts, "softly whistling", "fly low", "not a cluster 
>>> member!" ...
>>> Then in the log fell out "Too many failures "
>>> All this time in the status in crm_mon is "pending".
>>> Depending on the wind direction changed to "UNCLEAN"
>>> Much time has passed and I can not accurately describe the 
>>> behavior...
>>>
>>> Now I am in the following state:
>>> I tried locate the problem. Came here with this.
>>> I set big value in property stonith-timeout="600s".
>>> And got the following behavior:
>>> 1. pkill -4 corosync
>>> 2. from node with DC call my fence agent "sshbykey"
>>> 3. It sends reboot victim and waits until she comes to life 
>>> again.
>>    Hmmm what version of pacemaker?
>>    This sounds like a timing issue that we fixed a while back
>   Was a version 1.1.11 from December 3.
>   Now try full update and retest.
  That should be recent enough.  Can you create a crm_report the next 
 time you reproduce?
>>>  Of course yes. Little delay :)
>>>
>>>  ..
>>>  cc1: warnings being treated as errors
>>>  upstart.c: In function ‘upstart_job_property’:
>>>  upstart.c:264: error: implicit declaration of function 
>>> ‘g_variant_lookup_value’
>>>  upstart.c:264: error: nested extern declaration of 
>>> ‘g_variant_lookup_value’
>>>  upstart.c:264: error: assignment makes pointer from integer without a 
>>> cast
>>>  gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>  gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>  make[1]: *** [all-recursive] Error 1
>>>  make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>  make: *** [core] Error 1
>>>
>>>  I'm trying to solve this a problem.
>>  Do not get solved quickly...
>>
>>  
>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>  g_variant_lookup_value () Since 2.28
>>
>>  # yum list installed glib2
>>  Loaded plugins: fastestmirror, rhnplugin, security
>>  This system is receiving updates from RHN Classic or Red Hat Satellite.
>>  Loading mirror speeds from cached hostfile
>>  Installed Packages
>>  glib2.x86_64    2.26.1-3.el6    installed
>>
>>  # cat /etc/issue
>>  CentOS release 6.5 (Final)
>>  Kernel \r on an \m
>  Can you try this patch?
>  Upstart jobs wont work, but the code will compile
>
>  diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>  index 831e7cf..195c3a4 100644
>  --- a/lib/services/upstart.c
>  +++ b/lib/services/upstart.c
>  @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>  static 

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrey Groshev


14.01.2014, 06:25, "Andrew Beekhof" :
> Apart from anything else, your timeout needs to be bigger:
>
> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
> commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 2 
> from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
> 'st1' returned: -62 (Timer expired)
>

Bigger than that?
By :21 node2 had long since booted and was (almost) working.
#cat /var/log/cluster/mystonith.log
.
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH DEBUG(): getinfo-devdescr
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH DEBUG(): getinfo-devid
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH DEBUG(): getinfo-xml
Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH 
DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): 
getconfignames
Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH 
DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): status
Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH 
DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): 
getconfignames
Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH 
DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): reset 
dev-cluster2-node2.unix.tensor.ru
Mon Jan 13 12:11:37 MSK 2014 Now boot time 1389256739, send reboot
...

> On 14 Jan 2014, at 7:18 am, Andrew Beekhof  wrote:
>
>>  On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:
>>>  13.01.2014, 02:51, "Andrew Beekhof" :
  On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>  10.01.2014, 14:31, "Andrey Groshev" :
>>  10.01.2014, 14:01, "Andrew Beekhof" :
>>>   On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
    10.01.2014, 05:29, "Andrew Beekhof" :
> On 9 Jan 2014, at 11:11 pm, Andrey Groshev  
> wrote:
>>  08.01.2014, 06:22, "Andrew Beekhof" :
>>>  On 29 Nov 2013, at 7:17 pm, Andrey Groshev  
>>> wrote:
   Hi, ALL.

   I'm still trying to cope with the fact that after the fence 
 - node hangs in "pending".
>>>  Please define "pending".  Where did you see this?
>>  In crm_mon:
>>  ..
>>  Node dev-cluster2-node2 (172793105): pending
>>  ..
>>
>>  The experiment was like this:
>>  Four nodes in cluster.
>>  On one of them kill corosync or pacemakerd (signal 4 or 6 or 
>> 11).
>>  Thereafter, the remaining start it constantly reboot, under 
>> various pretexts, "softly whistling", "fly low", "not a cluster 
>> member!" ...
>>  Then in the log fell out "Too many failures "
>>  All this time in the status in crm_mon is "pending".
>>  Depending on the wind direction changed to "UNCLEAN"
>>  Much time has passed and I can not accurately describe the 
>> behavior...
>>
>>  Now I am in the following state:
>>  I tried locate the problem. Came here with this.
>>  I set big value in property stonith-timeout="600s".
>>  And got the following behavior:
>>  1. pkill -4 corosync
>>  2. from node with DC call my fence agent "sshbykey"
>>  3. It sends reboot victim and waits until she comes to life 
>> again.
> Hmmm what version of pacemaker?
> This sounds like a timing issue that we fixed a while back
    Was a version 1.1.11 from December 3.
    Now try full update and retest.
>>>   That should be recent enough.  Can you create a crm_report the next 
>>> time you reproduce?
>>  Of course yes. Little delay :)
>>
>>  ..
>>  cc1: warnings being treated as errors
>>  upstart.c: In function ‘upstart_job_property’:
>>  upstart.c:264: error: implicit declaration of function 
>> ‘g_variant_lookup_value’
>>  upstart.c:264: error: nested extern declaration of 
>> ‘g_variant_lookup_value’
>>  upstart.c:264: error: assignment makes pointer from integer without a 
>> cast
>>  gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>  gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>  make[1]: *** [all-recursive] Error 1
>>  make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>  make: *** [core] Error 1
>>
>>  I'm trying to solve this a problem.
>  Do not get solved quickly...
>
>  
> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>  g_variant_lookup_value () Since 2.28
>
>  # yum list installed

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrew Beekhof
Ok, here's what happens:

1. node2 is lost
2. fencing of node2 starts
3. node2 reboots (and cluster starts)
4. node2 returns to the membership
5. node2 is marked as a cluster member
6. DC tries to bring it into the cluster, but needs to cancel the active 
transition first.
   Which is a problem since the node2 fencing operation is part of that
7. node2 is in a transition (pending) state until fencing passes or fails
8a. fencing fails: transition completes and the node joins the cluster

That's in theory, except we automatically try again, which isn't appropriate.
This should be relatively easy to fix.

8b. fencing passes: the node is incorrectly marked as offline

This I have no idea how to fix yet.
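
While a node is stuck in step 7, the situation can be watched from the DC with 
something like the following (stonith_admin --history needs a reasonably 
recent 1.1.x build):

    crm_mon -1                                                  # shows the node as "pending"
    stonith_admin --history dev-cluster2-node2.unix.tensor.ru   # shows the fencing operations recorded for that host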


On another note, it doesn't look like this agent works at all.
The node has been back online for a long time and the agent is still timing out 
after 10 minutes.
So "Once the script makes sure that the victim will rebooted and again 
available via ssh - it exit with 0." does not seem true.
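
For what it's worth, the kind of check the agent is supposed to perform could 
look roughly like the sketch below. It is not the actual sshbykey code; it 
assumes passwordless ssh to the victim and detects the reboot by comparing 
/proc/stat btime:

    victim="dev-cluster2-node2.unix.tensor.ru"
    old_boot=$(ssh "$victim" "awk '/^btime/ {print \$2}' /proc/stat" 2>/dev/null)
    # ... issue the reboot here ...
    while :; do
        new_boot=$(ssh -o ConnectTimeout=5 "$victim" \
                   "awk '/^btime/ {print \$2}' /proc/stat" 2>/dev/null)
        if [ -n "$new_boot" ] && [ "$new_boot" != "$old_boot" ]; then
            exit 0    # victim has rebooted and answers over ssh again
        fi
        sleep 5
    done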

On 14 Jan 2014, at 1:19 pm, Andrew Beekhof  wrote:

> Apart from anything else, your timeout needs to be bigger:
> 
> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
> commands.c:1321  )   error: log_operation:   Operation 'reboot' [11331] (call 
> 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
> 'st1' returned: -62 (Timer expired)
> 
> 
> On 14 Jan 2014, at 7:18 am, Andrew Beekhof  wrote:
> 
>> 
>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:
>> 
>>> 
>>> 
>>> 13.01.2014, 02:51, "Andrew Beekhof" :
 On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
 
> 10.01.2014, 14:31, "Andrey Groshev" :
>> 10.01.2014, 14:01, "Andrew Beekhof" :
>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
  10.01.2014, 05:29, "Andrew Beekhof" :
>   On 9 Jan 2014, at 11:11 pm, Andrey Groshev  wrote:
>>08.01.2014, 06:22, "Andrew Beekhof" :
>>>On 29 Nov 2013, at 7:17 pm, Andrey Groshev  
>>> wrote:
 Hi, ALL.
 
 I'm still trying to cope with the fact that after the fence - 
 node hangs in "pending".
>>>Please define "pending".  Where did you see this?
>>In crm_mon:
>>..
>>Node dev-cluster2-node2 (172793105): pending
>>..
>> 
>>The experiment was like this:
>>Four nodes in cluster.
>>On one of them kill corosync or pacemakerd (signal 4 or 6 or 11).
>>Thereafter, the remaining start it constantly reboot, under 
>> various pretexts, "softly whistling", "fly low", "not a cluster 
>> member!" ...
>>Then in the log fell out "Too many failures "
>>All this time in the status in crm_mon is "pending".
>>Depending on the wind direction changed to "UNCLEAN"
>>Much time has passed and I can not accurately describe the 
>> behavior...
>> 
>>Now I am in the following state:
>>I tried locate the problem. Came here with this.
>>I set big value in property stonith-timeout="600s".
>>And got the following behavior:
>>1. pkill -4 corosync
>>2. from node with DC call my fence agent "sshbykey"
>>3. It sends reboot victim and waits until she comes to life again.
>   Hmmm what version of pacemaker?
>   This sounds like a timing issue that we fixed a while back
  Was a version 1.1.11 from December 3.
  Now try full update and retest.
>>> That should be recent enough.  Can you create a crm_report the next 
>>> time you reproduce?
>> Of course yes. Little delay :)
>> 
>> ..
>> cc1: warnings being treated as errors
>> upstart.c: In function ‘upstart_job_property’:
>> upstart.c:264: error: implicit declaration of function 
>> ‘g_variant_lookup_value’
>> upstart.c:264: error: nested extern declaration of 
>> ‘g_variant_lookup_value’
>> upstart.c:264: error: assignment makes pointer from integer without a 
>> cast
>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>> make: *** [core] Error 1
>> 
>> I'm trying to solve this a problem.
> Do not get solved quickly...
> 
> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
> g_variant_lookup_value () Since 2.28
> 
> # yum list installed glib2
> Loaded plugins: fastestmirror, rhnplugin, security
> This system is receiving updates from RHN Classic or Red Hat Satellite.
> Loading mirror speeds from cached hostfile
> Installed Packages
> glib2.x86_64  
>>

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 1:19 pm, Andrew Beekhof  wrote:

> Apart from anything else, your timeout needs to be bigger:
> 
> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
> commands.c:1321  )   error: log_operation:   Operation 'reboot' [11331] (call 
> 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
> 'st1' returned: -62 (Timer expired)
> 

also:

Jan 13 12:04:54 [17226] dev-cluster2-node1.unix.tensor.rupengine: ( 
utils.c:723   )   error: unpack_operation:  Specifying on_fail=fence and 
stonith-enabled=false makes no sense


> 
> On 14 Jan 2014, at 7:18 am, Andrew Beekhof  wrote:
> 
>> 
>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:
>> 
>>> 
>>> 
>>> 13.01.2014, 02:51, "Andrew Beekhof" :
 On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
 
> 10.01.2014, 14:31, "Andrey Groshev" :
>> 10.01.2014, 14:01, "Andrew Beekhof" :
>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
  10.01.2014, 05:29, "Andrew Beekhof" :
>   On 9 Jan 2014, at 11:11 pm, Andrey Groshev  wrote:
>>08.01.2014, 06:22, "Andrew Beekhof" :
>>>On 29 Nov 2013, at 7:17 pm, Andrey Groshev  
>>> wrote:
 Hi, ALL.
 
 I'm still trying to cope with the fact that after the fence - 
 node hangs in "pending".
>>>Please define "pending".  Where did you see this?
>>In crm_mon:
>>..
>>Node dev-cluster2-node2 (172793105): pending
>>..
>> 
>>The experiment was like this:
>>Four nodes in cluster.
>>On one of them kill corosync or pacemakerd (signal 4 or 6 or 11).
>>Thereafter, the remaining start it constantly reboot, under 
>> various pretexts, "softly whistling", "fly low", "not a cluster 
>> member!" ...
>>Then in the log fell out "Too many failures "
>>All this time in the status in crm_mon is "pending".
>>Depending on the wind direction changed to "UNCLEAN"
>>Much time has passed and I can not accurately describe the 
>> behavior...
>> 
>>Now I am in the following state:
>>I tried locate the problem. Came here with this.
>>I set big value in property stonith-timeout="600s".
>>And got the following behavior:
>>1. pkill -4 corosync
>>2. from node with DC call my fence agent "sshbykey"
>>3. It sends reboot victim and waits until she comes to life again.
>   Hmmm what version of pacemaker?
>   This sounds like a timing issue that we fixed a while back
  Was a version 1.1.11 from December 3.
  Now try full update and retest.
>>> That should be recent enough.  Can you create a crm_report the next 
>>> time you reproduce?
>> Of course yes. Little delay :)
>> 
>> ..
>> cc1: warnings being treated as errors
>> upstart.c: In function ‘upstart_job_property’:
>> upstart.c:264: error: implicit declaration of function 
>> ‘g_variant_lookup_value’
>> upstart.c:264: error: nested extern declaration of 
>> ‘g_variant_lookup_value’
>> upstart.c:264: error: assignment makes pointer from integer without a 
>> cast
>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>> make: *** [core] Error 1
>> 
>> I'm trying to solve this a problem.
> Do not get solved quickly...
> 
> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
> g_variant_lookup_value () Since 2.28
> 
> # yum list installed glib2
> Loaded plugins: fastestmirror, rhnplugin, security
> This system is receiving updates from RHN Classic or Red Hat Satellite.
> Loading mirror speeds from cached hostfile
> Installed Packages
> glib2.x86_64    2.26.1-3.el6    installed
> 
> # cat /etc/issue
> CentOS release 6.5 (Final)
> Kernel \r on an \m
 
 Can you try this patch?
 Upstart jobs wont work, but the code will compile
 
 diff --git a/lib/services/upstart.c b/lib/services/upstart.c
 index 831e7cf..195c3a4 100644
 --- a/lib/services/upstart.c
 +++ b/lib/services/upstart.c
 @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
 static char *
 upstart_job_property(const char *obj, const gchar * iface, const char 
 *name)
 {
 +char *output = NULL;
 +
 +#if !GLIB_CHECK_VERSION(2,28,0)
 +static bool err = TRUE;
 +
 +if(err) {
 +crm_err("This version of 

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrew Beekhof
Apart from anything else, your timeout needs to be bigger:

Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 
2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
'st1' returned: -62 (Timer expired)
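
For the record, raising the cluster-wide fencing timeout is a one-liner; 900s 
is only an illustration, pick something comfortably longer than a full reboot 
cycle:

    crm configure property stonith-timeout=900s

The st1 device can also be given its own pcmk_reboot_timeout parameter if the 
Pacemaker version in use supports it.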


On 14 Jan 2014, at 7:18 am, Andrew Beekhof  wrote:

> 
> On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:
> 
>> 
>> 
>> 13.01.2014, 02:51, "Andrew Beekhof" :
>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>>> 
 10.01.2014, 14:31, "Andrey Groshev" :
> 10.01.2014, 14:01, "Andrew Beekhof" :
>>  On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
>>>   10.01.2014, 05:29, "Andrew Beekhof" :
On 9 Jan 2014, at 11:11 pm, Andrey Groshev  wrote:
> 08.01.2014, 06:22, "Andrew Beekhof" :
>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev  
>> wrote:
>>>  Hi, ALL.
>>> 
>>>  I'm still trying to cope with the fact that after the fence - 
>>> node hangs in "pending".
>> Please define "pending".  Where did you see this?
> In crm_mon:
> ..
> Node dev-cluster2-node2 (172793105): pending
> ..
> 
> The experiment was like this:
> Four nodes in cluster.
> On one of them kill corosync or pacemakerd (signal 4 or 6 or 11).
> Thereafter, the remaining start it constantly reboot, under 
> various pretexts, "softly whistling", "fly low", "not a cluster 
> member!" ...
> Then in the log fell out "Too many failures "
> All this time in the status in crm_mon is "pending".
> Depending on the wind direction changed to "UNCLEAN"
> Much time has passed and I can not accurately describe the 
> behavior...
> 
> Now I am in the following state:
> I tried locate the problem. Came here with this.
> I set big value in property stonith-timeout="600s".
> And got the following behavior:
> 1. pkill -4 corosync
> 2. from node with DC call my fence agent "sshbykey"
> 3. It sends reboot victim and waits until she comes to life again.
Hmmm what version of pacemaker?
This sounds like a timing issue that we fixed a while back
>>>   Was a version 1.1.11 from December 3.
>>>   Now try full update and retest.
>>  That should be recent enough.  Can you create a crm_report the next 
>> time you reproduce?
> Of course yes. Little delay :)
> 
> ..
> cc1: warnings being treated as errors
> upstart.c: In function ‘upstart_job_property’:
> upstart.c:264: error: implicit declaration of function 
> ‘g_variant_lookup_value’
> upstart.c:264: error: nested extern declaration of 
> ‘g_variant_lookup_value’
> upstart.c:264: error: assignment makes pointer from integer without a cast
> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/root/ha/pacemaker/lib'
> make: *** [core] Error 1
> 
> I'm trying to solve this a problem.
 Do not get solved quickly...
 
 https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
 g_variant_lookup_value () Since 2.28
 
 # yum list installed glib2
 Loaded plugins: fastestmirror, rhnplugin, security
 This system is receiving updates from RHN Classic or Red Hat Satellite.
 Loading mirror speeds from cached hostfile
 Installed Packages
 glib2.x86_64    2.26.1-3.el6    installed
 
 # cat /etc/issue
 CentOS release 6.5 (Final)
 Kernel \r on an \m
>>> 
>>> Can you try this patch?
>>> Upstart jobs wont work, but the code will compile
>>> 
>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>> index 831e7cf..195c3a4 100644
>>> --- a/lib/services/upstart.c
>>> +++ b/lib/services/upstart.c
>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>> static char *
>>> upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>> {
>>> +char *output = NULL;
>>> +
>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>> +static bool err = TRUE;
>>> +
>>> +if(err) {
>>> +crm_err("This version of glib is too old to support upstart jobs");
>>> +err = FALSE;
>>> +}
>>> +#else
>>> GError *error = NULL;
>>> GDBusProxy *proxy;
>>> GVariant *asv = NULL;
>>> GVariant *value = NULL;
>>> GVariant *_ret = NULL;
>>> -char *output = NULL;
>>> 
>>> crm_info("Calling GetAll on %s", obj);
>>> proxy = get_proxy(obj, B

Re: [Pacemaker] pgsql RA - slave is in HS:ASYNC status and won't promote

2014-01-13 Thread 東一彦

Hi,

> but after some tests something went wrong and I don't know what, why, or 
> how to get it back working ... now when I start crm, the master is PRI, but 
> the slave goes into the HS:ASYNC state .. and when the master fails, the 
> slave goes into the HS:alone state
It is PostgreSQL that decides whether a node is "sync" or "async".
The pgsql RA simply displays the result of the following SQL:

  select application_name,upper(state),upper(sync_state) from 
pg_stat_replication;
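
That query can also be run directly on the current master to see how each 
standby is attached; the postgres superuser below is an assumption:

    psql -U postgres -x -c "select application_name, upper(state), upper(sync_state) from pg_stat_replication;"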

So, first, please check PostgreSQL's log.



It is possible that the data has become inconsistent.
You can resolve the inconsistency with the following procedure:

 http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#after_fail-over
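
The pgsql-status and pgsql-data-status values the RA maintains can also be 
watched on the cluster side with a one-shot crm_mon that includes node 
attributes:

    crm_mon -1 -A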


Regards,
Kazuhiko HIGASHI

(2014/01/10 17:48), Tomáš Vajrauch wrote:

Hi,

I am trying to run a PostgreSQL cluster with streaming replication using the 
pgsql RA and Pacemaker.
I succeeded once: the master was PRI, the slave was HS:sync, and failover 
worked as it should (the slave became master).
But after some tests something went wrong and I don't know what, why, or how 
to get it back working ... now when I start crm, the master is PRI but the 
slave goes into the HS:ASYNC state .. and when the master fails, the slave 
goes into the HS:alone state.

Can somebody please give me a hint about what I should do or what I should 
look for?

Thanks a lot for any help
Tomas

my configuration:

node jboss-test \
 attributes pgsql-data-status="LATEST"
node jboss-test2 \
 attributes pgsql-data-status="STREAMING|ASYNC"
primitive pgsql ocf:heartbeat:pgsql \
 params pgctl="/opt/postgres/9.3/bin/pg_ctl" psql="/opt/postgres/9.3/bin/psql" pgdata="/opt/postgres/9.3/data/" 
rep_mode="sync" node_list="jboss-test jboss-test2" restore_command="cp /opt/postgres/9.3/data/pg_archive/%f %p" 
primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip="172.16.111.120" stop_escalate="0" \
 op start interval="0s" timeout="60s" on-fail="restart" \
 op stop interval="0s" timeout="60s" on-fail="block" \
 op monitor interval="11s" timeout="60s" on-fail="restart" \
 op monitor interval="10s" role="Master" timeout="60s" 
on-fail="restart" \
 op promote interval="0s" timeout="60s" on-fail="restart" \
 op demote interval="0s" timeout="60s" on-fail="block" \
 op notify interval="0s" timeout="60s"
primitive pingCheck ocf:pacemaker:ping \
 params name="default_ping_set" host_list="172.16.0.1" multiplier="100" 
\
 op start interval="0s" timeout="60s" on-fail="restart" \
 op monitor interval="2s" timeout="60s" on-fail="restart" \
 op stop interval="0s" timeout="60s" on-fail="ignore"
primitive vip-master ocf:heartbeat:IPaddr2 \
 params ip="172.16.111.110" nic="eth0" cidr_netmask="24" \
 op start interval="0s" timeout="60s" on-fail="restart" \
 op monitor interval="10s" timeout="60s" on-fail="restart" \
 op stop interval="0s" timeout="60s" on-fail="block"
primitive vip-rep ocf:heartbeat:IPaddr2 \
 params ip="172.16.111.120" nic="eth0" cidr_netmask="24" \
 meta migration-threshold="0" \
 op start interval="0s" timeout="60s" on-fail="stop" \
 op monitor interval="10s" timeout="60s" on-fail="restart" \
 op stop interval="0s" timeout="60s" on-fail="block"
primitive vip-slave ocf:heartbeat:IPaddr2 \
 params ip="172.16.111.111" nic="eth0" cidr_netmask="24" \
 meta resource-stickiness="1" \
 op start interval="0s" timeout="60s" on-fail="restart" \
 op monitor interval="10s" timeout="60s" on-fail="restart" \
 op stop interval="0s" timeout="60s" on-fail="block"
group master-group vip-master vip-rep \
 meta ordered="false"
ms msPostgresql pgsql \
 meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" 
notify="true"
clone clnPingCheck pingCheck
location rsc_location-1 vip-slave \
 rule $id="rsc_location-1-rule" 200: pgsql-status eq HS:sync \
 rule $id="rsc_location-1-rule-0" 190: pgsql-status eq HS:async \
 rule $id="rsc_location-1-rule-1" 100: pgsql-status eq PRI \
 rule $id="rsc_location-1-rule-2" -inf: not_defined pgsql-status \
 rule $id="rsc_location-1-rule-3" -inf: pgsql-status ne HS:sync and 
pgsql-status ne PRI and pgsql-status ne HS:async
location rsc_location-2 msPostgresql \
 rule $id="rsc_location-3-rule" -inf: not_defined default_ping_set or 
default_ping_set lt 100
colocation rsc_colocation-1 inf: msPostgresql clnPingCheck
colocation rsc_colocation-2 inf: master-group msPostgresql:Master
order rsc_order-1 0: clnPingCheck msPostgresql
order rsc_order-2 0: msPostgresql:promote master-group:start symmetrical=false
order rsc_order-3 0: msPostgresql:demote master-group:stop symmetrical=false
property $id="cib-bootstrap-options" \
 no-quorum-policy="ignore" \
 stonith-enabled="false" \
 crmd-transition-delay="0s" \
 dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb6

Re: [Pacemaker] Location / Colocation constraints issue

2014-01-13 Thread Andrew Beekhof

On 19 Dec 2013, at 1:08 am, Gaëtan Slongo  wrote:

> Hi !
> 
> I'm currently building a 2 node cluster for firewalling.
> I would like to run shorewall on both the master and the "slave" node.
> I have tried many things but nothing works as expected; the Shorewall
> configurations themselves are fine.
> What I want is to start shorewall-standby on the other node as soon as
> my DRBD resources are "Slave" or "Stopped".
> Could you please give me a bit of help with this problem?

It will be something like:

colocation XXX -inf: shorewall-standby drbd_master_slave_ServicesConfigs1:Master
colocation YYY -inf: shorewall-standby drbd_master_slave_ServicesLogs1:Master
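
Once those constraints are added, the resulting placement can be 
sanity-checked against the live CIB without moving anything, for example:

    crm_simulate -sL    # -s shows allocation scores, -L reads the live cluster state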

> 
> Here is my current config
> 
> Thanks
> 
> 
> node keskonrix1 \
>attributes standby="off"
> node keskonrix2 \
>attributes standby="off"
> primitive VIPDMZ ocf:heartbeat:IPaddr2 \
>     params ip="10.0.1.1" nic="eth2" cidr_netmask="24" iflabel="VIPDMZ" \
>     op monitor interval="30s" timeout="30s"
> primitive VIPEXPL ocf:heartbeat:IPaddr2 \
>     params ip="10.0.2.2" nic="eth3" cidr_netmask="28" iflabel="VIPEXPL" \
>     op monitor interval="30s" timeout="30s"
> primitive VIPLAN ocf:heartbeat:IPaddr2 \
>     params ip="192.168.1.248" nic="br0" cidr_netmask="16" iflabel="VIPLAN" \
>     op monitor interval="30s" timeout="30s"
> primitive VIPNET ocf:heartbeat:IPaddr2 \
>     params ip="XX.XX.XX.XX" nic="eth1" cidr_netmask="29" iflabel="VIPDMZ" \
>     op monitor interval="30s" timeout="30s"
> primitive VIPPDA ocf:heartbeat:IPaddr2 \
>     params ip="XX.XX.XX.XX" nic="eth1" cidr_netmask="29" iflabel="VIPPDA" \
>     op monitor interval="30s" timeout="30s"
> primitive apache2 lsb:apache2 \
>     op start interval="0" timeout="15s"
> primitive bind9 lsb:bind9 \
>     op start interval="0" timeout="15s"
> primitive dansguardian lsb:dansguardian \
>     op start interval="0" timeout="30s" on-fail="ignore"
> primitive drbd-ServicesConfigs1 ocf:linbit:drbd \
>     params drbd_resource="services-configs1" \
>     op monitor interval="29s" role="Master" \
>     op monitor interval="31s" role="Slave"
> primitive drbd-ServicesLogs1 ocf:linbit:drbd \
>     params drbd_resource="services-logs1" \
>     op monitor interval="29s" role="Master" \
>     op monitor interval="31s" role="Slave"
> primitive fs_ServicesConfigs1 ocf:heartbeat:Filesystem \
>     params device="/dev/drbd/by-res/services-configs1" directory="/drbd/services-configs1/" fstype="ext4" options="noatime,nodiratime" \
>     meta target-role="Started"
> primitive fs_ServicesLogs1 ocf:heartbeat:Filesystem \
>     params device="/dev/drbd/by-res/services-logs1" directory="/drbd/services-logs1/" fstype="ext4" options="noatime,nodiratime" \
>     meta target-role="Started"
> primitive ipsec-setkey lsb:setkey \
>     op start interval="0" timeout="30s"
> primitive links_ServicesConfigs1 heartbeat:drbdlinks \
>     meta target-role="Started"
> primitive openvpn lsb:openvpn \
>     op monitor interval="10" timeout="30s" \
>     meta target-role="Started"
> primitive racoon lsb:racoon \
>     op start interval="0" timeout="30s"
> primitive shorewall lsb:shorewall \
>     op start interval="0" timeout="30s" \
>     meta target-role="Started"
> primitive shorewall-standby lsb:shorewall \
>     op start interval="0" timeout="30s"
> primitive squid lsb:squid \
>     op start interval="0" timeout="15s" \
>     op stop interval="0" timeout="120s"
> group IPS-Services1 VIPLAN VIPDMZ VIPPDA VIPEXPL VIPNET \
>     meta target-role="Started"
> group IPSec ipsec-setkey racoon
> group Services1 bind9 squid dansguardian apache2 openvpn shorewall
> group ServicesData1 fs_ServicesConfigs1 fs_ServicesLogs1 links_ServicesConfigs1
> ms drbd_master_slave_ServicesConfigs1 drbd-ServicesConfigs1 \
>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" globally-unique="false" notify="true" target-role="Master"
> ms drbd_master_slave_ServicesLogs1 drbd-ServicesLogs1 \
>     meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" globally-unique="false" notify="true" target-role="Master"
> colocation Services1_on_drbd inf: drbd_master_slave_ServicesConfigs1:Master drbd_master_slave_ServicesLogs1:Master ServicesData1 IPS-Services1 Services1 IPSec
> colocation start-shorewall_standby-on-passive-node -inf: shorewall-standby shorewall
> order all_drbd inf: shorewall-standby:stop drbd_master_slave_ServicesConfigs1:promote drbd_master_slave_ServicesLogs1:promote ServicesData1:start IPS-Services1:start IPSec:start Services1:start
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>     cluster-infrastructure="openais" \
>     expected-quorum-votes="2" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
>     resource

Re: [Pacemaker] [Linux-HA] Better way to change master in 3 node pgsql cluster

2014-01-13 Thread Andrew Beekhof

On 13 Jan 2014, at 8:32 pm, Andrey Rogovsky  wrote:

> Hi
> 
> I have a 3 node postgresql cluster.
> It works well, but I have some trouble with changing the master.
> 
> For now, if I need to change the master, I must:
> 1) Stop PGSQL on each node and the cluster service
> 2) Set up new manual PGSQL replication
> 3) Change the attributes on each node to point to the new master
> 4) Stop PGSQL on each node
> 5) Clean up the resource and start the cluster service
> 
> It takes a lot of time. Is there a better way to change the master?

Newer versions support:

   crm_resource --resource msPostgresql --ban --master --host a.geocluster.e-autopay.com
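
A rough sketch of the typical sequence (assuming pacemaker >= 1.1.10 and
reusing the resource/host names from this thread; adjust to taste):

   # push the master role away from its current location
   crm_resource --resource msPostgresql --ban --master --host a.geocluster.e-autopay.com

   # once the promotion has settled, remove the constraint the ban created
   crm_resource --resource msPostgresql --clear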

> 
> 
> 
> This is my cluster service status:
> Node Attributes:
> * Node a.geocluster.e-autopay.com:
>+ master-pgsql:0   : 1000
>+ pgsql-data-status   : LATEST
>+ pgsql-master-baseline   : 2F90
>+ pgsql-status : PRI
> * Node c.geocluster.e-autopay.com:
>+ master-pgsql:0   : 1000
>+ pgsql-data-status   : SYNC
>+ pgsql-status : STOP
> * Node b.geocluster.e-autopay.com:
>+ master-pgsql:0   : 1000
>+ pgsql-data-status   : SYNC
>+ pgsql-status : STOP
> 
> I used http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster for my 3
> node cluster, without hard stickiness.
> Now I have a strange situation: all nodes stay slaves:
> 
> Last updated: Sat Dec  7 04:33:47 2013
> Last change: Sat Dec  7 12:56:23 2013 via crmd on a
> Stack: openais
> Current DC: c - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 5 Nodes configured, 3 expected votes
> 4 Resources configured.
> 
> 
> Online: [ a c b ]
> 
> Master/Slave Set: msPostgresql [pgsql]
> Slaves: [ a c b ]
> 
> My config is:
> node a \
> attributes pgsql-data-status="DISCONNECT"
> node b \
> attributes pgsql-data-status="DISCONNECT"
> node c \
> attributes pgsql-data-status="DISCONNECT"
> primitive pgsql ocf:heartbeat:pgsql \
> params pgctl="/usr/lib/postgresql/9.3/bin/pg_ctl" psql="/usr/bin/psql"
> pgdata="/var/lib/postgresql/9.3/main" start_opt="-p 5432" rep_mode="sync"
> node_list="a b c" restore_command="cp /var/lib/postgresql/9.3/pg_archive/%f
> %p" master_ip="192.168.10.200" restart_on_promote="true"
> config="/etc/postgresql/9.3/main/postgresql.conf" \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op monitor interval="4s" timeout="60s" on-fail="restart" \
> op monitor interval="3s" role="Master" timeout="60s" on-fail="restart" \
> op promote interval="0s" timeout="60s" on-fail="restart" \
> op demote interval="0s" timeout="60s" on-fail="stop" \
> op stop interval="0s" timeout="60s" on-fail="block" \
> op notify interval="0s" timeout="60s"
> primitive pgsql-master-ip ocf:heartbeat:IPaddr2 \
> params ip="192.168.10.200" nic="peervpn0" \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op monitor interval="10s" timeout="60s" on-fail="restart" \
> op stop interval="0s" timeout="60s" on-fail="block"
> group master pgsql-master-ip
> ms msPostgresql pgsql \
> meta master-max="1" master-node-max="1" clone-max="3" clone-node-max="1"
> notify="true"
> colocation set_ip inf: master msPostgresql:Master
> order ip_down 0: msPostgresql:demote master:stop symmetrical=false
> order ip_up 0: msPostgresql:promote master:start symmetrical=false
> property $id="cib-bootstrap-options" \
> dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="3" \
> no-quorum-policy="ignore" \
> stonith-enabled="false" \
> crmd-transition-delay="0" \
> last-lrm-refresh="1386404222"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="100" \
> migration-threshold="1"


Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-13 Thread Andrew Beekhof

On 14 Jan 2014, at 5:13 am, Brian J. Murrell (brian)  wrote:

> Hi,
> 
> I found a situation using pacemaker 1.1.10 on RHEL6.5 where the output
> of "crm_resource -L" is not trust-able, shortly after a node is booted.
> 
> Here is the output from crm_resource -L on one of the nodes in a two
> node cluster (the one that was not rebooted):
> 
> st-fencing(stonith:fence_foo):Started 
> res1  (ocf::foo:Target):  Started 
> res2  (ocf::foo:Target):  Started 
> 
> Here is the output from the same command on the other node in the two
> node cluster right after it was rebooted:
> 
> st-fencing(stonith:fence_foo):Stopped 
> res1  (ocf::foo:Target):  Stopped 
> res2  (ocf::foo:Target):  Stopped 
> 
> These were collected at the same time (within the same second) on the
> two nodes.
> 
> Clearly the rebooted node is not telling the truth.  Perhaps the truth
> for it is "I don't know", which would be fair enough but that's not what
> pacemaker is asserting there.
> 
> So, how do I know (i.e. programmatically -- what command can I issue to
> know) if and when crm_resource can be trusted to be truthful?

The local cib hasn't caught up yet by the looks of it.
You could compare 'cibadmin -Ql' with 'cibadmin -Q'
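
A minimal way to script that check (just a sketch: compare the version
stamp on the <cib> element from the local copy with the cluster-wide
query, and wait until they agree):

   cibadmin -Ql | head -n1   # local copy:  <cib epoch=... num_updates=...>
   cibadmin -Q  | head -n1   # cluster-wide view, for comparison
   # when the epoch/num_updates attributes match, the local CIB has
   # caught up and crm_resource -L reflects the real cluster state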

> 
> b.
> 
> 
> 
> 


Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrew Beekhof

On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:

> 
> 
> 13.01.2014, 02:51, "Andrew Beekhof" :
>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>> 
>>>  10.01.2014, 14:31, "Andrey Groshev" :
  10.01.2014, 14:01, "Andrew Beekhof" :
>   On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
>>10.01.2014, 05:29, "Andrew Beekhof" :
>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev  wrote:
  08.01.2014, 06:22, "Andrew Beekhof" :
>  On 29 Nov 2013, at 7:17 pm, Andrey Groshev  
> wrote:
>>   Hi, ALL.
>> 
>>   I'm still trying to cope with the fact that after the fence - 
>> node hangs in "pending".
>  Please define "pending".  Where did you see this?
  In crm_mon:
  ..
  Node dev-cluster2-node2 (172793105): pending
  ..
 
  The experiment was like this:
  Four nodes in cluster.
  On one of them kill corosync or pacemakerd (signal 4 or 6 oк 11).
  Thereafter, the remaining start it constantly reboot, under 
 various pretexts, "softly whistling", "fly low", "not a cluster 
 member!" ...
  Then in the log fell out "Too many failures "
  All this time in the status in crm_mon is "pending".
  Depending on the wind direction changed to "UNCLEAN"
  Much time has passed and I can not accurately describe the 
 behavior...
 
  Now I am in the following state:
  I tried locate the problem. Came here with this.
  I set big value in property stonith-timeout="600s".
  And got the following behavior:
  1. pkill -4 corosync
  2. from node with DC call my fence agent "sshbykey"
  3. It sends reboot victim and waits until she comes to life again.
>>> Hmmm what version of pacemaker?
>>> This sounds like a timing issue that we fixed a while back
>>Was a version 1.1.11 from December 3.
>>Now try full update and retest.
>   That should be recent enough.  Can you create a crm_report the next 
> time you reproduce?
  Of course yes. Little delay :)
 
  ..
  cc1: warnings being treated as errors
  upstart.c: In function ‘upstart_job_property’:
  upstart.c:264: error: implicit declaration of function 
 ‘g_variant_lookup_value’
  upstart.c:264: error: nested extern declaration of 
 ‘g_variant_lookup_value’
  upstart.c:264: error: assignment makes pointer from integer without a cast
  gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
  gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
  make[1]: *** [all-recursive] Error 1
  make[1]: Leaving directory `/root/ha/pacemaker/lib'
  make: *** [core] Error 1
 
  I'm trying to solve this a problem.
>>>  Do not get solved quickly...
>>> 
>>>  
>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>  g_variant_lookup_value () Since 2.28
>>> 
>>>  # yum list installed glib2
>>>  Loaded plugins: fastestmirror, rhnplugin, security
>>>  This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>  Loading mirror speeds from cached hostfile
>>>  Installed Packages
>>>  glib2.x86_64  
>>> 2.26.1-3.el6   
>>> installed
>>> 
>>>  # cat /etc/issue
>>>  CentOS release 6.5 (Final)
>>>  Kernel \r on an \m
>> 
>> Can you try this patch?
>> Upstart jobs won't work, but the code will compile
>> 
>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>> index 831e7cf..195c3a4 100644
>> --- a/lib/services/upstart.c
>> +++ b/lib/services/upstart.c
>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>  static char *
>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>  {
>> +char *output = NULL;
>> +
>> +#if !GLIB_CHECK_VERSION(2,28,0)
>> +static bool err = TRUE;
>> +
>> +if(err) {
>> +crm_err("This version of glib is too old to support upstart jobs");
>> +err = FALSE;
>> +}
>> +#else
>>  GError *error = NULL;
>>  GDBusProxy *proxy;
>>  GVariant *asv = NULL;
>>  GVariant *value = NULL;
>>  GVariant *_ret = NULL;
>> -char *output = NULL;
>> 
>>  crm_info("Calling GetAll on %s", obj);
>>  proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * 
>> iface, const char *name)
>> 
>>  g_object_unref(proxy);
>>  g_variant_unref(_ret);
>> +#endif
>>  return output;
>>  }
>> 
> 
> OK :) I patched the source.
> Typed "make rc" - the same error.

Because it's not building your local changes

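One way to make sure local patches actually get compiled is to build
straight from the patched checkout rather than from the release tarball
that "make rc" repackages (a generic autotools sketch; the prefixes are
illustrative only):

   ./autogen.sh
   ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
   make
   make install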
> Made a new copy via "fetch" - the same error.
> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not
> exist, it downloads it; otherwise it uses the existing archive.

[Pacemaker] crm_resource -L not trustable right after restart

2014-01-13 Thread Brian J. Murrell (brian)
Hi,

I found a situation using pacemaker 1.1.10 on RHEL6.5 where the output
of "crm_resource -L" is not trust-able, shortly after a node is booted.

Here is the output from crm_resource -L on one of the nodes in a two
node cluster (the one that was not rebooted):

 st-fencing (stonith:fence_foo):Started 
 res1   (ocf::foo:Target):  Started 
 res2   (ocf::foo:Target):  Started 

Here is the output from the same command on the other node in the two
node cluster right after it was rebooted:

 st-fencing (stonith:fence_foo):Stopped 
 res1   (ocf::foo:Target):  Stopped 
 res2   (ocf::foo:Target):  Stopped 

These were collected at the same time (within the same second) on the
two nodes.

Clearly the rebooted node is not telling the truth.  Perhaps the truth
for it is "I don't know", which would be fair enough but that's not what
pacemaker is asserting there.

So, how do I know (i.e. programmatically -- what command can I issue to
know) if and when crm_resource can be trusted to be truthful?

b.






Re: [Pacemaker] Location / Colocation constraints issue

2014-01-13 Thread Gaëtan Slongo

  
  
Hi!

Thanks for your answer.
I'm not trying to use shorewall as an ms resource. Let me explain:
I have 2 nodes. All resources are always on the same node (using a
group and constraints), but what I want to do is to start shorewall
on the "passive" node.
How could I do that simply? I tried to use constraints, but it is
not working well.

Gaëtan

  
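For reference, the pattern usually suggested for this is an anti-colocation
(score -inf) between the standby copy and the active stack, plus an order
constraint, along these lines (a rough, untested sketch reusing the resource
names from the configuration quoted further down):

   colocation shorewall-standby-avoids-active -inf: shorewall-standby Services1
   order shorewall-standby-after-active inf: Services1:start shorewall-standby:start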
On 22/12/13 13:24, emmanuel segura wrote:

Your shorewall cannot handle ms Master and Slave operations because it
is an LSB script. If you want your script to act as an ms resource the
way drbd does, look at that agent and write an OCF resource agent.

2013/12/22 Gaëtan Slongo 
  

Hi!

Does anyone have an idea?

Thanks!


On 18/12/13 15:08, Gaëtan Slongo wrote:

Hi!

I'm currently building a 2-node cluster for firewalling.
I would like to run shorewall both on the master and on the "slave"
node. I tried many things, but nothing works as expected. The shorewall
configurations are good.
What I want to do is to start the standby shorewall on the other node as
soon as my drbd resources are "Slave" or "Stopped".
Could you please give me a bit of help with this problem?

Here is my current config

Thanks


node keskonrix1 \
    attributes standby="off"
node keskonrix2 \
    attributes standby="off"
primitive VIPDMZ ocf:heartbeat:IPaddr2 \
    params ip="10.0.1.1" nic="eth2" cidr_netmask="24" iflabel="VIPDMZ" \
    op monitor interval="30s" timeout="30s"
primitive VIPEXPL ocf:heartbeat:IPaddr2 \
    params ip="10.0.2.2" nic="eth3" cidr_netmask="28" iflabel="VIPEXPL" \
    op monitor interval="30s" timeout="30s"
primitive VIPLAN ocf:heartbeat:IPaddr2 \
    params ip="192.168.1.248" nic="br0" cidr_netmask="16" iflabel="VIPLAN" \
    op monitor interval="30s" timeout="30s"
primitive VIPNET ocf:heartbeat:IPaddr2 \
    params ip="XX.XX.XX.XX" nic="eth1" cidr_netmask="29" iflabel="VIPDMZ" \
    op monitor interval="30s" timeout="30s"
primitive VIPPDA ocf:heartbeat:IPaddr2 \
    params ip="XX.XX.XX.XX" nic="eth1" cidr_netmask="29" iflabel="VIPPDA" \
    op monitor interval="30s" timeout="30s"
primitive apache2 lsb:apache2 \
    op start interval="0" timeout="15s"
primitive bind9 lsb:bind9 \
    op start interval="0" timeout="15s"
primitive dansguardian lsb:dansguardian \
    op start interval="0" timeout="30s" on-fail="ignore"
primitive drbd-ServicesConfigs1 ocf:linbit:drbd \
    params drbd_resource="services-configs1" \
    op monitor interval="29s" role="Master" \
    op monitor interval="31s" role="Slave"
primitive drbd-ServicesLogs1 ocf:linbit:drbd \
    params drbd_resource="services-logs1" \
    op monitor interval="29s" role="Master" \
    op monitor interval="31s" role="Slave"
primitive fs_ServicesConfigs1 ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/services-configs1" directory="/drbd/services-configs1/" fstype="ext4" options="noatime,nodiratime" \
    meta target-role="Started"
primitive fs_ServicesLogs1 ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/services-logs1" directory="/drbd/services-logs1/" fstype="ext4" options="noatime,nodiratime" \
    meta target-role="Started"
primitive ipsec-setkey lsb:setkey \
    op start interval="0" timeout="30s"
primitive links_ServicesConfigs1 heartbeat:drbdlinks \
    meta target-role="Started"
primitive openvpn lsb:openvpn \
    op monitor interval="10" timeout="30s" \
    meta target-role="Started"
primitive racoon lsb:racoon \
    op start interval="0" timeout="30s"
primitive shorewall lsb:shorewall \
    op start interval="0" timeout="30s" \
    meta target-role="Started"
primitive shorewall-standby lsb:shorewall \
    op start interval="0" timeout="30s"
primitive squid lsb:squid \
    op start interval="0" timeout="15s" \
    op stop interval="0" timeout="120s"
group IPS-Services1 VIPLAN VIPDMZ VIPPDA VIPEXPL VIPNET \
    meta target-role="Started"
group IPSec ipsec-setkey racoon
group Services1 bind9 squid dansguardian apache2 openvpn shorewall
group ServicesData1 fs_ServicesConfigs1 fs_ServicesLogs1 links_ServicesConfigs1
ms drbd_master_slave_ServicesConfigs1 drbd-ServicesConfigs1 \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" globally-unique="false" notify="true" target-role="Master"
ms drbd_master_slave_ServicesLogs1 drbd-ServicesLogs1 \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" globally-unique="false" notify="true" target-role="Master"
colocation Services1_on_drbd inf: drbd_master_slave_ServicesConfigs1:Master drbd_mas

Re: [Pacemaker] How to configure heartbeat using private network?

2014-01-13 Thread Lars Marowsky-Bree
On 2014-01-12T18:53:50, John Wei  wrote:

> I believe corosync does support this. Can someone point me to the document
> on how to do this.

Just configure corosync to use the private network interface via the
bindnetaddr.
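
For example, something along these lines in corosync.conf (a corosync 1.x
style sketch; the addresses are placeholders for your private subnet):

   totem {
       version: 2
       interface {
           ringnumber: 0
           bindnetaddr: 192.168.10.0    # network address of the private interface
           mcastaddr: 226.94.1.1
           mcastport: 5405
       }
   }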



-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




Re: [Pacemaker] again "return code", now in crm_attribute

2014-01-13 Thread Andrey Groshev


13.01.2014, 02:51, "Andrew Beekhof" :
> On 10 Jan 2014, at 6:18 pm, Andrey Groshev  wrote:
>
>>  10.01.2014, 10:15, "Andrew Beekhof" :
>>>  On 10 Jan 2014, at 4:38 pm, Andrey Groshev  wrote:
   10.01.2014, 09:06, "Andrew Beekhof" :
>   On 10 Jan 2014, at 3:51 pm, Andrey Groshev  wrote:
>>    10.01.2014, 03:28, "Andrew Beekhof" :
>>>    On 9 Jan 2014, at 4:44 pm, Andrey Groshev  wrote:
 09.01.2014, 02:39, "Andrew Beekhof" :
>  On 18 Dec 2013, at 11:55 pm, Andrey Groshev  
> wrote:
>>   Hi, Andrew and ALL.
>>
>>   I'm sorry, but I again found an error. :)
>>   Crux of the problem:
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled 
>> --query; echo $?
>>   scope=crm_config  name=stonith-enabled value=true
>>   0
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled 
>> --update firstval ; echo $?
>>   0
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled 
>> --query; echo $?
>>   scope=crm_config  name=stonith-enabled value=firstval
>>   0
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled  
>> --update secondval --lifetime=reboot ; echo $?
>>   0
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled 
>> --query; echo $?
>>   scope=crm_config  name=stonith-enabled value=firstval
>>   0
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled  
>> --update thirdval --lifetime=forever ; echo $?
>>   0
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled 
>> --query; echo $?
>>   scope=crm_config  name=stonith-enabled value=firstval
>>   0
>>
>>   Ie if specify the lifetime of an attribute, then a attribure 
>> is not updated.
>>
>>   If impossible setup the lifetime of the attribute when it is 
>> installing, it must be return an error.
>  Agreed. I'll reproduce and get back to you.
 How, I was able to review code, problem comes when used both 
 options "--type" and options "--lifetime".
 One variant in "case" without break;
 Unfortunately, I did not have time to dive into the logic.
>>>    Actually, the logic is correct.  The command:
>>>
>>>    # crm_attribute --type crm_config --attr-name stonith-enabled  
>>> --update secondval --lifetime=reboot ; echo $?
>>>
>>>    is invalid.  You only get to specify --type OR --lifetime, not both.
>>>    By specifying --lifetime, you're creating a node attribute, not a 
>>> cluster proprerty.
>>    With this, I do not argue. I think that should be the exit code is 
>> NOT ZERO, ie it's error!
>   No, its setting a value, just not where you thought (or where you're 
> looking for it in the next command).
>
>   Its the same as writing:
>
> crm_attribute --type crm_config --type status --attr-name 
> stonith-enabled  --update secondval; echo $?
>
>   Only the last value for --type wins
   Because of this confusion is obtained. Here is an example of the old 
 cluster:
   #crm_attribute --type crm_config --attr-name test1  --update val1 
 --lifetime=reboot ; echo $?
   0
   # cibadmin -Q|grep test1
    >>> value="val1"/>
   Win "--lifetime" ?
>>>  Yes.  Because it was specified last.
   Is not it easier to produce an error when trying to use incompatible 
 options?
>>>  They're not incompatible. They're aliases for each other in a different 
>>> context.
>>  Ok. I understood you . Let's say you're right . :)
>>  In the end , if you change the order of words in the human language, then 
>> meaning of a sentence can change.
>>  But suppose, I have a strange desire, but it may be represented and write.
>>
>>  I say "crm_attribute - attr-name attr1 - update val1 - lifetime = reboot - 
>> type crm_config".
>>
>>  I mean, that ...
>>  "I want to set some attribute to a cluster , and this attribute should 
>> disappear if the cluster is restarted. "
>
> This functionality does/can not exist for anything other than node attributes.

I.e. you return "NOT OK" to me? ;-)

>>  If I сhange the order of arguments, the meaning of a sentence is still not 
>> change.
>>  But what do you say to that? 100% chance you will say that the sentence is 
>> not correct. Why ? Because the "lifetime" is not quite a time of life and 
>> can not be used in the context of the properties of the cluster ? Ie You 
>> gave me back " 1" and crm_attribute returned "0" :)
   Then there is this uncertainty and "was meant...", "was meant...", 
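
To summarise the distinction being argued about (example commands only;
"somenode" and "test1" are placeholders):

   # cluster property: lives in the crm_config section and persists
   crm_attribute --type crm_config --attr-name stonith-enabled --update false

   # transient node attribute: --lifetime reboot puts it in the status
   # section, so it disappears when that node restarts
   crm_attribute --node somenode --attr-name test1 --update val1 --lifetime reboot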

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrey Groshev


13.01.2014, 02:51, "Andrew Beekhof" :
> On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>
>>  10.01.2014, 14:31, "Andrey Groshev" :
>>>  10.01.2014, 14:01, "Andrew Beekhof" :
   On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
>    10.01.2014, 05:29, "Andrew Beekhof" :
>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev  wrote:
>>>  08.01.2014, 06:22, "Andrew Beekhof" :
  On 29 Nov 2013, at 7:17 pm, Andrey Groshev  
 wrote:
>   Hi, ALL.
>
>   I'm still trying to cope with the fact that after the fence - 
> node hangs in "pending".
  Please define "pending".  Where did you see this?
>>>  In crm_mon:
>>>  ..
>>>  Node dev-cluster2-node2 (172793105): pending
>>>  ..
>>>
>>>  The experiment was like this:
>>>  Four nodes in cluster.
>>>  On one of them kill corosync or pacemakerd (signal 4 or 6 oк 11).
>>>  Thereafter, the remaining start it constantly reboot, under 
>>> various pretexts, "softly whistling", "fly low", "not a cluster 
>>> member!" ...
>>>  Then in the log fell out "Too many failures "
>>>  All this time in the status in crm_mon is "pending".
>>>  Depending on the wind direction changed to "UNCLEAN"
>>>  Much time has passed and I can not accurately describe the 
>>> behavior...
>>>
>>>  Now I am in the following state:
>>>  I tried locate the problem. Came here with this.
>>>  I set big value in property stonith-timeout="600s".
>>>  And got the following behavior:
>>>  1. pkill -4 corosync
>>>  2. from node with DC call my fence agent "sshbykey"
>>>  3. It sends reboot victim and waits until she comes to life again.
>> Hmmm what version of pacemaker?
>> This sounds like a timing issue that we fixed a while back
>    Was a version 1.1.11 from December 3.
>    Now try full update and retest.
   That should be recent enough.  Can you create a crm_report the next time 
 you reproduce?
>>>  Of course yes. Little delay :)
>>>
>>>  ..
>>>  cc1: warnings being treated as errors
>>>  upstart.c: In function ‘upstart_job_property’:
>>>  upstart.c:264: error: implicit declaration of function 
>>> ‘g_variant_lookup_value’
>>>  upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>  upstart.c:264: error: assignment makes pointer from integer without a cast
>>>  gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>  gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>  make[1]: *** [all-recursive] Error 1
>>>  make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>  make: *** [core] Error 1
>>>
>>>  I'm trying to solve this a problem.
>>  Do not get solved quickly...
>>
>>  
>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>  g_variant_lookup_value () Since 2.28
>>
>>  # yum list installed glib2
>>  Loaded plugins: fastestmirror, rhnplugin, security
>>  This system is receiving updates from RHN Classic or Red Hat Satellite.
>>  Loading mirror speeds from cached hostfile
>>  Installed Packages
>>  glib2.x86_64  
>> 2.26.1-3.el6   
>> installed
>>
>>  # cat /etc/issue
>>  CentOS release 6.5 (Final)
>>  Kernel \r on an \m
>
> Can you try this patch?
> Upstart jobs won't work, but the code will compile
>
> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
> index 831e7cf..195c3a4 100644
> --- a/lib/services/upstart.c
> +++ b/lib/services/upstart.c
> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>  static char *
>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>  {
> +    char *output = NULL;
> +
> +#if !GLIB_CHECK_VERSION(2,28,0)
> +    static bool err = TRUE;
> +
> +    if(err) {
> +    crm_err("This version of glib is too old to support upstart jobs");
> +    err = FALSE;
> +    }
> +#else
>  GError *error = NULL;
>  GDBusProxy *proxy;
>  GVariant *asv = NULL;
>  GVariant *value = NULL;
>  GVariant *_ret = NULL;
> -    char *output = NULL;
>
>  crm_info("Calling GetAll on %s", obj);
>  proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * 
> iface, const char *name)
>
>  g_object_unref(proxy);
>  g_variant_unref(_ret);
> +#endif
>  return output;
>  }
>

OK :) I patched the source.
Typed "make rc" - the same error.
Made a new copy via "fetch" - the same error.
It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not
exist, it downloads it; otherwise it uses the existing archive.
Cut-down log ...

# make rc
make TAG=Pacemaker-1.1.11-rc3 rpm
make[1]: Entering directory `/root/ha/pacemaker'
rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pa

[Pacemaker] Better way to change master in 3 node pgsql cluster

2014-01-13 Thread Andrey Rogovsky
Hi

I have a 3 node postgresql cluster.
It works well, but I have some trouble with changing the master.

For now, if I need to change the master, I must:
1) Stop PGSQL on each node and the cluster service
2) Set up new manual PGSQL replication
3) Change the attributes on each node to point to the new master
4) Stop PGSQL on each node
5) Clean up the resource and start the cluster service

It takes a lot of time. Is there a better way to change the master?



This is my cluster service status:
Node Attributes:
* Node a.geocluster.e-autopay.com:
+ master-pgsql:0   : 1000
+ pgsql-data-status   : LATEST
+ pgsql-master-baseline   : 2F90
+ pgsql-status : PRI
* Node c.geocluster.e-autopay.com:
+ master-pgsql:0   : 1000
+ pgsql-data-status   : SYNC
+ pgsql-status : STOP
* Node b.geocluster.e-autopay.com:
+ master-pgsql:0   : 1000
+ pgsql-data-status   : SYNC
+ pgsql-status : STOP

I used http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster for my 3
node cluster, without hard stickiness.
Now I have a strange situation: all nodes stay slaves:

Last updated: Sat Dec  7 04:33:47 2013
Last change: Sat Dec  7 12:56:23 2013 via crmd on a
Stack: openais
Current DC: c - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
5 Nodes configured, 3 expected votes
4 Resources configured.


Online: [ a c b ]

 Master/Slave Set: msPostgresql [pgsql]
 Slaves: [ a c b ]

My config is:
node a \
attributes pgsql-data-status="DISCONNECT"
node b \
attributes pgsql-data-status="DISCONNECT"
node c \
attributes pgsql-data-status="DISCONNECT"
primitive pgsql ocf:heartbeat:pgsql \
params pgctl="/usr/lib/postgresql/9.3/bin/pg_ctl" psql="/usr/bin/psql"
pgdata="/var/lib/postgresql/9.3/main" start_opt="-p 5432" rep_mode="sync"
node_list="a b c" restore_command="cp /var/lib/postgresql/9.3/pg_archive/%f
%p" master_ip="192.168.10.200" restart_on_promote="true"
config="/etc/postgresql/9.3/main/postgresql.conf" \
op start interval="0s" timeout="60s" on-fail="restart" \
op monitor interval="4s" timeout="60s" on-fail="restart" \
op monitor interval="3s" role="Master" timeout="60s" on-fail="restart" \
op promote interval="0s" timeout="60s" on-fail="restart" \
op demote interval="0s" timeout="60s" on-fail="stop" \
op stop interval="0s" timeout="60s" on-fail="block" \
op notify interval="0s" timeout="60s"
primitive pgsql-master-ip ocf:heartbeat:IPaddr2 \
params ip="192.168.10.200" nic="peervpn0" \
op start interval="0s" timeout="60s" on-fail="restart" \
op monitor interval="10s" timeout="60s" on-fail="restart" \
op stop interval="0s" timeout="60s" on-fail="block"
group master pgsql-master-ip
ms msPostgresql pgsql \
meta master-max="1" master-node-max="1" clone-max="3" clone-node-max="1"
notify="true"
colocation set_ip inf: master msPostgresql:Master
order ip_down 0: msPostgresql:demote master:stop symmetrical=false
order ip_up 0: msPostgresql:promote master:start symmetrical=false
property $id="cib-bootstrap-options" \
dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
cluster-infrastructure="openais" \
expected-quorum-votes="3" \
no-quorum-policy="ignore" \
stonith-enabled="false" \
crmd-transition-delay="0" \
last-lrm-refresh="1386404222"
rsc_defaults $id="rsc-options" \
resource-stickiness="100" \
migration-threshold="1"