Re: [Pacemaker] corosync [TOTEM ] Process pause detected for 577 ms

2014-04-29 Thread Andrey Groshev
24.04.2014, 21:12, "emmanuel segura":
> Hello List,
> I have these two lines in my cluster logs; can somebody help me understand what they mean?

Hi. Is this on bare metal or in a virtual environment?

> :: corosync [TOTEM ] Process pause detected for 577 ms, flushing membership messages.
> corosync [TOTEM ] Process pause detected for 538 ms, flushing membership messages.
> corosync [TOTEM ] A processor failed, forming new configuration. ::
>
> I know the "corosync [TOTEM ] A processor failed, forming new configuration" message appears when the totem token is definitely lost.
> Thanks,
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Best practice for quorum nodes

2014-04-25 Thread Andrey Groshev
Hi, Andrew.
Not sure that my version is better, but I did it like this...
In short, I use a conventional postgresql cluster configuration.
The key part of the resource hierarchy is a clone of the "ping" primitive.
I add the following rule to it:

location checkquorumnode clnPingCheck \
    rule $id="checkquorumnode-rule" -inf: not_defined thisquorumnode or thisquorumnode eq yes

The cluster is symmetric, but resources are not run where this attribute is not defined.
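For completeness, setting that node attribute with crmsh might look like the sketch below (node names are hypothetical illustrations, not from the original post):

    # mark the ordinary nodes so the -inf rule above does not match them
    crm node attribute node1 set thisquorumnode no
    crm node attribute node2 set thisquorumnode no
    # the quorum node either gets "yes" or is simply left without the attribute
    crm node attribute quorumnode set thisquorumnode yes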


18.04.2014, 18:46, "Andrew Martin" :
> Hello,
>
> I've read several guides about how to configure a 3-node cluster with one 
> node that can't actually run the resources, but just serves as a quorum node. 
> One practice for configuring this node is to put it in "standby", which 
> prevents it from running resources. In my experience, this seems to work 
> pretty well; however, from time to time I see these errors appear in my 
> pacemaker logs:
> Preventing  from re-starting on : operation monitor failed 'not 
> installed' (rc=5)
>
> Is there a better way to designate a node as a quorum node, so that resources 
> do not attempt to start or re-start on it? Perhaps a combination of setting 
> it in "standby" mode and a resource constraint to prevent the resources from 
> running on it? Or, is there a better way to set it up?
>
> Thanks,
>
> Andrew
>


[Pacemaker] behavior when do fail start resource

2014-03-25 Thread Andrey Groshev
Hi, ALL!
Some time ago I saw, somewhere, a description of this behavior:
"When a resource fails to start, the fail-count is set to INFINITY and the
resource is not started again,
even if 'on-fail=restart' is set for the start operation."
Where could I have seen it?
And is there a way to change this behavior?
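For reference, the behavior described here is controlled by the start-failure-is-fatal cluster property; a minimal sketch of changing it, assuming crmsh:

    # with start-failure-is-fatal=true (the default), a single failed start
    # sets the fail-count to INFINITY; with false, failed starts count like
    # any other failure, governed by migration-threshold
    crm configure property start-failure-is-fatal=false
    crm configure rsc_defaults migration-threshold=3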



Re: [Pacemaker] hangs pending

2014-03-19 Thread Andrey Groshev


20.03.2014, 07:13, "Andrew Beekhof" :
> On 19 Mar 2014, at 4:00 pm, Andrey Groshev  wrote:
>
>>  19.03.2014, 03:29, "Andrew Beekhof" :
>>>  On 19 Mar 2014, at 6:19 am, Andrey Groshev  wrote:
>>>>   12.03.2014, 02:53, "Andrew Beekhof" :
>>>>>   Sorry for the delay, sometimes it takes a while to rebuild the 
>>>>> necessary context
>>>>   I'm sorry too for the delayed answer.
>>>>   I switched to using "upstart" for initializing corosync and pacemaker
>>>> (with respawn).
>>>>   Now the behavior of the system has changed, and it suits me (so far :) ).
>>>>   I have to kill crmd/lrmd in an infinite loop before STONITH shoots;
>>>> otherwise they respawn very quickly and nothing happens.
>>>>
>>>>   Of course, I still found another way to hang the system.
>>>>   It requires only one idiot:
>>>>   1. He decides to update pacemaker (and/or erase a service he doesn't
>>>> understand).
>>>>   2. Then he kills the corosync process or simply reboots the server.
>>>>   That's it! This node will remain hung in "pending".
>>>  While trying to shutdown?
>>>  Our spec files shut pacemaker down prior to upgrades FWIW.
>>  Not so simple... we have a national tradition of caring for and cherishing
>> idiots. Therefore they are clever, quirky, and unpredictable. ;)
>>  He can simply delete the package's files without uninstalling it.
>>  (In reality, it may be just a crash of the file system.)
>
> Dunno, I think hanging is somewhat reasonable behaviour if parts of pacemaker 
> have been removed :)
> It didn't get fenced though?

Yes, that is logical.
But I have another idea.

The "client part" of the STONITH agent does not depend on pacemaker
(at least in my case).
On the other nodes STONITH should still work,
i.e. they can restart / power off the node.
And I think we should be able to configure different cluster behavior for this:
if STONITH can shut down / restart the node, but the resources do not start,
DO NOT mark the node UNCLEAN. Assume that it is up, but without resources.

Next, there is a chicken-and-egg problem!
Suppose I removed pacemaker myself, or the file system crashed, on the node
that was "master" for STONITH.
The idea is that the STONITH resource should start on another node, but this
does not happen, because one node is hanging in "pending".
I have reproduced this behavior.



Re: [Pacemaker] hangs pending

2014-03-18 Thread Andrey Groshev


19.03.2014, 03:29, "Andrew Beekhof" :
> On 19 Mar 2014, at 6:19 am, Andrey Groshev  wrote:
>
>>  12.03.2014, 02:53, "Andrew Beekhof" :
>>>  Sorry for the delay, sometimes it takes a while to rebuild the necessary 
>>> context
>>  I'm sorry too for the delayed answer.
>>  I switched to using "upstart" for initializing corosync and pacemaker (with
>> respawn).
>>  Now the behavior of the system has changed, and it suits me (so far :) ).
>>  I have to kill crmd/lrmd in an infinite loop before STONITH shoots;
>> otherwise they respawn very quickly and nothing happens.
>>
>>  Of course, I still found another way to hang the system.
>>  It requires only one idiot:
>>  1. He decides to update pacemaker (and/or erase a service he doesn't understand).
>>  2. Then he kills the corosync process or simply reboots the server.
>>  That's it! This node will remain hung in "pending".
>
> While trying to shutdown?
> Our spec files shut pacemaker down prior to upgrades FWIW.

Not so simple... we have a national tradition of caring for and cherishing idiots.
Therefore they are clever, quirky, and unpredictable. ;)
He can simply delete the package's files without uninstalling it.
(In reality, it may be just a crash of the file system.)

>
>>  And the worst thing... if at least one node hangs in "pending", then
>> promote/demote and other resource management do not work.
>>
>>  Yes, there is one oddity with the slow startup of some resources, but IMHO
>> it is not very critical.
>>  I'll try to correct it later; right now I'm writing documentation for the project.
>>>  On 5 Mar 2014, at 4:42 pm, Andrey Groshev  wrote:
>>>>   05.03.2014, 04:04, "Andrew Beekhof" :
>>>>>   On 25 Feb 2014, at 8:30 pm, Andrey Groshev  wrote:
>>>>>>    21.02.2014, 12:04, "Andrey Groshev" :
>>>>>>>    21.02.2014, 05:53, "Andrew Beekhof" :
>>>>>>>> On 19 Feb 2014, at 7:53 pm, Andrey Groshev  
>>>>>>>> wrote:
>>>>>>>>>  19.02.2014, 09:49, "Andrew Beekhof" :
>>>>>>>>>>  On 19 Feb 2014, at 4:18 pm, Andrey Groshev  
>>>>>>>>>> wrote:
>>>>>>>>>>>   19.02.2014, 09:08, "Andrew Beekhof" :
>>>>>>>>>>>>   On 19 Feb 2014, at 4:00 pm, Andrey Groshev 
>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>    19.02.2014, 06:48, "Andrew Beekhof" :
>>>>>>>>>>>>>>    On 18 Feb 2014, at 11:05 pm, Andrey Groshev 
>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>> Hi, ALL and Andrew!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Today was a good day - I did a lot of killing, and a lot
>>>>>>>>>>>>>>> of shooting back at me.
>>>>>>>>>>>>>>> In general - I am happy (almost like an elephant)   :)
>>>>>>>>>>>>>>> Besides the resources, eight processes on the node matter
>>>>>>>>>>>>>>> to me:
>>>>>>>>>>>>>>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>>>>>>>>> I killed them with different signals (4, 6, 11 and even
>>>>>>>>>>>>>>> 9).
>>>>>>>>>>>>>>> The behavior does not depend on the signal number - that's good.
>>>>>>>>>>>>>>> If STONITH sends a reboot to the node, it reboots and
>>>>>>>>>>>>>>> rejoins the cluster - that's good too.
>>>>>>>>>>>>>>> But the behavior differs depending on which daemon is
>>>>>>>>>>>>>>> killed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> They fall into four groups:
>>>>>>>>>>>>>>> 1. corosync,cib - STONITH works 100%.
>>>>>>>>>>>>>>> Killing with any signal triggers STONITH and a reboot.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. lrmd,crmd - strange

Re: [Pacemaker] hangs pending

2014-03-18 Thread Andrey Groshev


12.03.2014, 02:53, "Andrew Beekhof" :
> Sorry for the delay, sometimes it takes a while to rebuild the necessary 
> context

I'm sorry too for the delayed answer.
I switched to using "upstart" for initializing corosync and pacemaker (with
respawn).
Now the behavior of the system has changed, and it suits me (so far :) ).
I have to kill crmd/lrmd in an infinite loop before STONITH shoots;
otherwise they respawn very quickly and nothing happens.
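That test loop might look roughly like this (a sketch; the signal and timing are illustrative assumptions, not from the original message):

    # keep re-killing lrmd faster than upstart can respawn it,
    # until the peers notice and STONITH fires
    while true; do pkill -9 lrmd; sleep 0.1; done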

Of course, I still found another way to hang the system.
It requires only one idiot:
1. He decides to update pacemaker (and/or erase a service he doesn't understand).
2. Then he kills the corosync process or simply reboots the server.
That's it! This node will remain hung in "pending".
And the worst thing... if at least one node hangs in "pending", then
promote/demote and other resource management do not work.

Yes, there is one oddity with the slow startup of some resources, but IMHO it is
not very critical.
I'll try to correct it later; right now I'm writing documentation for the project.
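As an aside, the upstart respawn setup described above might look roughly like this (a sketch; the job file path and the start/stop conditions are assumptions, not taken from this thread):

    # /etc/init/pacemaker.conf - hypothetical upstart job
    description "Pacemaker with automatic respawn"
    start on started corosync
    stop on stopping corosync
    respawn
    exec /usr/sbin/pacemakerd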



>
> On 5 Mar 2014, at 4:42 pm, Andrey Groshev  wrote:
>
>>  05.03.2014, 04:04, "Andrew Beekhof" :
>>>  On 25 Feb 2014, at 8:30 pm, Andrey Groshev  wrote:
>>>>   21.02.2014, 12:04, "Andrey Groshev" :
>>>>>   21.02.2014, 05:53, "Andrew Beekhof" :
>>>>>>    On 19 Feb 2014, at 7:53 pm, Andrey Groshev  wrote:
>>>>>>> 19.02.2014, 09:49, "Andrew Beekhof" :
>>>>>>>> On 19 Feb 2014, at 4:18 pm, Andrey Groshev  
>>>>>>>> wrote:
>>>>>>>>>  19.02.2014, 09:08, "Andrew Beekhof" :
>>>>>>>>>>  On 19 Feb 2014, at 4:00 pm, Andrey Groshev  
>>>>>>>>>> wrote:
>>>>>>>>>>>   19.02.2014, 06:48, "Andrew Beekhof" :
>>>>>>>>>>>>   On 18 Feb 2014, at 11:05 pm, Andrey Groshev 
>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>    Hi, ALL and Andrew!
>>>>>>>>>>>>>
>>>>>>>>>>>>>    Today was a good day - I did a lot of killing, and a lot
>>>>>>>>>>>>> of shooting back at me.
>>>>>>>>>>>>>    In general - I am happy (almost like an elephant)   :)
>>>>>>>>>>>>>    Besides the resources, eight processes on the node matter
>>>>>>>>>>>>> to me:
>>>>>>>>>>>>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>>>>>>>    I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>>>>>>>    The behavior does not depend on the signal number - that's good.
>>>>>>>>>>>>>    If STONITH sends a reboot to the node, it reboots and
>>>>>>>>>>>>> rejoins the cluster - that's good too.
>>>>>>>>>>>>>    But the behavior differs depending on which daemon is killed.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    They fall into four groups:
>>>>>>>>>>>>>    1. corosync,cib - STONITH works 100%.
>>>>>>>>>>>>>    Killing with any signal triggers STONITH and a reboot.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    2. lrmd,crmd - strange STONITH behavior.
>>>>>>>>>>>>>    Sometimes STONITH is called - with the corresponding reaction.
>>>>>>>>>>>>>    Sometimes the daemon restarts and the resources restart, with
>>>>>>>>>>>>> a large delay for MS:pgsql.
>>>>>>>>>>>>>    One time, after crmd restarted, pgsql did not restart.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    3. stonithd,attrd,pengine - STONITH not needed.
>>>>>>>>>>>>>    These daemons simply restart; the resources stay running.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    4. pacemakerd - nothing happens.
>>>>>>>>>>>>>    And after that I can kill any process of the third group.
>>>>>>>>>>>>> They do not restart.
>>>>>>>>>>>

Re: [Pacemaker] hangs pending

2014-03-04 Thread Andrey Groshev


05.03.2014, 04:04, "Andrew Beekhof" :
> On 25 Feb 2014, at 8:30 pm, Andrey Groshev  wrote:
>
>>  21.02.2014, 12:04, "Andrey Groshev" :
>>>  21.02.2014, 05:53, "Andrew Beekhof" :
>>>>   On 19 Feb 2014, at 7:53 pm, Andrey Groshev  wrote:
>>>>>    19.02.2014, 09:49, "Andrew Beekhof" :
>>>>>>    On 19 Feb 2014, at 4:18 pm, Andrey Groshev  wrote:
>>>>>>> 19.02.2014, 09:08, "Andrew Beekhof" :
>>>>>>>>     On 19 Feb 2014, at 4:00 pm, Andrey Groshev  
>>>>>>>> wrote:
>>>>>>>>>  19.02.2014, 06:48, "Andrew Beekhof" :
>>>>>>>>>>  On 18 Feb 2014, at 11:05 pm, Andrey Groshev  
>>>>>>>>>> wrote:
>>>>>>>>>>>   Hi, ALL and Andrew!
>>>>>>>>>>>
>>>>>>>>>>>   Today was a good day - I did a lot of killing, and a lot of
>>>>>>>>>>> shooting back at me.
>>>>>>>>>>>   In general - I am happy (almost like an elephant)   :)
>>>>>>>>>>>   Besides the resources, eight processes on the node matter to
>>>>>>>>>>> me: corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>>>>>   I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>>>>>   The behavior does not depend on the signal number - that's good.
>>>>>>>>>>>   If STONITH sends a reboot to the node, it reboots and rejoins
>>>>>>>>>>> the cluster - that's good too.
>>>>>>>>>>>   But the behavior differs depending on which daemon is killed.
>>>>>>>>>>>
>>>>>>>>>>>   They fall into four groups:
>>>>>>>>>>>   1. corosync,cib - STONITH works 100%.
>>>>>>>>>>>   Killing with any signal triggers STONITH and a reboot.
>>>>>>>>>>>
>>>>>>>>>>>   2. lrmd,crmd - strange STONITH behavior.
>>>>>>>>>>>   Sometimes STONITH is called - with the corresponding reaction.
>>>>>>>>>>>   Sometimes the daemon restarts and the resources restart, with a
>>>>>>>>>>> large delay for MS:pgsql.
>>>>>>>>>>>   One time, after crmd restarted, pgsql did not restart.
>>>>>>>>>>>
>>>>>>>>>>>   3. stonithd,attrd,pengine - STONITH not needed.
>>>>>>>>>>>   These daemons simply restart; the resources stay running.
>>>>>>>>>>>
>>>>>>>>>>>   4. pacemakerd - nothing happens.
>>>>>>>>>>>   And after that I can kill any process of the third group. They
>>>>>>>>>>> do not restart.
>>>>>>>>>>>   Generally: don't touch corosync,cib, and maybe lrmd,crmd.
>>>>>>>>>>>
>>>>>>>>>>>   What do you think about this?
>>>>>>>>>>>   The main question of this topic we have decided.
>>>>>>>>>>>   But this varied behavior is another big problem.
>>>>>>>>>>>
>>>>>>>>>>>   Forgot the logs: http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>>>>>>>>  Which of the various conditions above do the logs cover?
>>>>>>>>>  All the variants, over one day.
>>>>>>>> Are you trying to torture me?
>>>>>>>> Can you give me a rough idea what happened when?
>>>>>>> No - that's 8 processes times 4 signals, with repeats of the
>>>>>>> experiments and unknown outcomes :)
>>>>>>> It's easier to run new experiments and collect individual new logs.
>>>>>>> Which variant is more interesting?
>>>>>>    The long delay in restarting pgsql.
>>>>>>    Everything else seems correct.
>>>>>    It didn't even try to start pgsql.
>>>>>    In the logs there are three tests:
>>>>>    kill -s4 <lrmd pid>.
>>>>>    1. STONITH
>>>>>    2. STONITH
>>>>>    3.

Re: [Pacemaker] hangs pending

2014-03-03 Thread Andrey Groshev
Good morning.
I got confused during the night - attaching another test log.

The point of this test is to shut the cluster down cleanly:
that is, sequentially put each node into standby
and turn off the services (pacemaker & corosync),
then do it in reverse order -
sequentially start the services and bring the nodes out of standby.
The second node hangs at the "pending" stage.
Most worryingly, the subsequent nodes, even with "online" status, do not start
services.
Morning logs - http://send2me.ru/pcmk-04-Mar-2014-2.tar.bz2
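The sequence being tested would look roughly like this (a sketch with hypothetical node names, assuming crmsh and sysvinit-style service scripts):

    # shut down cleanly, one node at a time
    crm node standby node1
    service pacemaker stop
    service corosync stop
    # ...repeat for the remaining nodes, then bring them back in reverse:
    service corosync start
    service pacemaker start
    crm node online node1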

04.03.2014, 02:13, "Andrey Groshev" :
> Hi!
> I thought that all the bugs had already been caught. :)
> But today (tonight, actually) I built the latest git PCMK with the upstart
> addition. And again caught the "pending" hang.
> Logs: http://send2me.ru/pcmk-04-Mar-2014.tar.bz2
>
> 24.02.2014, 03:44, "Andrew Beekhof" :
>
>>  On 22 Feb 2014, at 7:07 pm, Andrey Groshev  wrote:
>>>   21.02.2014, 04:00, "Andrew Beekhof" :
>>>>   On 20 Feb 2014, at 10:04 pm, Andrey Groshev  wrote:
>>>>>    20.02.2014, 13:57, "Andrew Beekhof" :
>>>>>>    On 20 Feb 2014, at 5:33 pm, Andrey Groshev  wrote:
>>>>>>> 20.02.2014, 01:22, "Andrew Beekhof" :
>>>>>>>> On 20 Feb 2014, at 4:18 am, Andrey Groshev  
>>>>>>>> wrote:
>>>>>>>>>  19.02.2014, 06:47, "Andrew Beekhof" :
>>>>>>>>>>  On 18 Feb 2014, at 9:29 pm, Andrey Groshev  
>>>>>>>>>> wrote:
>>>>>>>>>>>   Hi, ALL and Andrew!
>>>>>>>>>>>
>>>>>>>>>>>   Today was a good day - I did a lot of killing, and a lot of
>>>>>>>>>>> shooting back at me.
>>>>>>>>>>>   In general - I am happy (almost like an elephant)   :)
>>>>>>>>>>>   Besides the resources, eight processes on the node matter to
>>>>>>>>>>> me: corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>>>>>   I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>>>>>   The behavior does not depend on the signal number - that's good.
>>>>>>>>>>>   If STONITH sends a reboot to the node, it reboots and rejoins
>>>>>>>>>>> the cluster - that's good too.
>>>>>>>>>>>   But the behavior differs depending on which daemon is killed.
>>>>>>>>>>>
>>>>>>>>>>>   They fall into four groups:
>>>>>>>>>>>   1. corosync,cib - STONITH works 100%.
>>>>>>>>>>>   Killing with any signal triggers STONITH and a reboot.
>>>>>>>>>>  excellent
>>>>>>>>>>>   3. stonithd,attrd,pengine - STONITH not needed.
>>>>>>>>>>>   These daemons simply restart; the resources stay running.
>>>>>>>>>>  right
>>>>>>>>>>>   2. lrmd,crmd - strange STONITH behavior.
>>>>>>>>>>>   Sometimes STONITH is called - with the corresponding reaction.
>>>>>>>>>>>   Sometimes the daemon restarts
>>>>>>>>>>  The daemon will always try to restart; the only variable is how
>>>>>>>>>> long it takes the peer to notice and initiate fencing.
>>>>>>>>>>  If the failure happens just before they're due to receive a
>>>>>>>>>> totem token, the failure will be very quickly detected and the node
>>>>>>>>>> fenced.
>>>>>>>>>>  If the failure happens just after, then detection will take
>>>>>>>>>> longer - giving the node longer to recover and not be fenced.
>>>>>>>>>>
>>>>>>>>>>  So fence/not fence is normal and to be expected.
>>>>>>>>>>>   and restart resources with a large delay for MS:pgsql.
>>>>>>>>>>>   One time, after crmd restarted, pgsql did not restart.
>>>>>>>>>>  I would not expect pgsql to ever restart - if the RA does its
>>>>>>>>>> job properly anyway.
>>>>>>>>>>  In the case the node is not fenced, the crmd will respawn and
>>>>>>>>>> the PE will request that it re-detect the state of all resources.
>>>>>>>>>

Re: [Pacemaker] hangs pending

2014-03-03 Thread Andrey Groshev
Hi!
I thought that all the bugs had already been caught. :)
But today (tonight, actually) I built the latest git PCMK with the upstart addition.
And again caught the "pending" hang.
Logs: http://send2me.ru/pcmk-04-Mar-2014.tar.bz2

24.02.2014, 03:44, "Andrew Beekhof" :
> On 22 Feb 2014, at 7:07 pm, Andrey Groshev  wrote:
>
>>  21.02.2014, 04:00, "Andrew Beekhof" :
>>>  On 20 Feb 2014, at 10:04 pm, Andrey Groshev  wrote:
>>>>   20.02.2014, 13:57, "Andrew Beekhof" :
>>>>>   On 20 Feb 2014, at 5:33 pm, Andrey Groshev  wrote:
>>>>>>    20.02.2014, 01:22, "Andrew Beekhof" :
>>>>>>>    On 20 Feb 2014, at 4:18 am, Andrey Groshev  wrote:
>>>>>>>> 19.02.2014, 06:47, "Andrew Beekhof" :
>>>>>>>>> On 18 Feb 2014, at 9:29 pm, Andrey Groshev  
>>>>>>>>> wrote:
>>>>>>>>>>  Hi, ALL and Andrew!
>>>>>>>>>>
>>>>>>>>>>  Today was a good day - I did a lot of killing, and a lot of
>>>>>>>>>> shooting back at me.
>>>>>>>>>>  In general - I am happy (almost like an elephant)   :)
>>>>>>>>>>  Besides the resources, eight processes on the node matter to me:
>>>>>>>>>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>>>>  I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>>>>  The behavior does not depend on the signal number - that's good.
>>>>>>>>>>  If STONITH sends a reboot to the node, it reboots and rejoins
>>>>>>>>>> the cluster - that's good too.
>>>>>>>>>>  But the behavior differs depending on which daemon is killed.
>>>>>>>>>>
>>>>>>>>>>  They fall into four groups:
>>>>>>>>>>  1. corosync,cib - STONITH works 100%.
>>>>>>>>>>  Killing with any signal triggers STONITH and a reboot.
>>>>>>>>> excellent
>>>>>>>>>>  3. stonithd,attrd,pengine - STONITH not needed.
>>>>>>>>>>  These daemons simply restart; the resources stay running.
>>>>>>>>> right
>>>>>>>>>>  2. lrmd,crmd - strange STONITH behavior.
>>>>>>>>>>  Sometimes STONITH is called - with the corresponding reaction.
>>>>>>>>>>  Sometimes the daemon restarts
>>>>>>>>> The daemon will always try to restart; the only variable is how
>>>>>>>>> long it takes the peer to notice and initiate fencing.
>>>>>>>>> If the failure happens just before they're due to receive a totem
>>>>>>>>> token, the failure will be very quickly detected and the node fenced.
>>>>>>>>> If the failure happens just after, then detection will take
>>>>>>>>> longer - giving the node longer to recover and not be fenced.
>>>>>>>>>
>>>>>>>>> So fence/not fence is normal and to be expected.
>>>>>>>>>>  and restart resources with a large delay for MS:pgsql.
>>>>>>>>>>  One time, after crmd restarted, pgsql did not restart.
>>>>>>>>> I would not expect pgsql to ever restart - if the RA does its job
>>>>>>>>> properly anyway.
>>>>>>>>> In the case the node is not fenced, the crmd will respawn and the
>>>>>>>>> PE will request that it re-detect the state of all resources.
>>>>>>>>>
>>>>>>>>> If the agent reports "all good", then there is nothing more to do.
>>>>>>>>> If the agent is not reporting "all good", you should really be
>>>>>>>>> asking why.
>>>>>>>>>>  4. pacemakerd - nothing happens.
>>>>>>>>> On non-systemd based machines, correct.
>>>>>>>>>
>>>>>>>>> On a systemd based machine pacemakerd is respawned and reattaches
>>>>>>>>> to the existing daemons.
>>>>>>>>> Any subsequent daemon failure will be detected and the daemon
>>>>>>

Re: [Pacemaker] [corosync] corosync Segmentation fault.

2014-02-26 Thread Andrey Groshev


26.02.2014, 16:11, "Jan Friesse" :
> Andrey,
> can you please try the patch "[PATCH] votequorum: Properly
> initialize atb and atb_string" which I've sent to the ML (it should be there
> soon)?

Yes. Service is running. Thanks.

# corosync-quorumtool -l

Membership information
--
Nodeid  Votes Name
 172793104  1 dev-cluster2-node1 (local)


Continuing the tests.
In the messages log I see:

Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15480]: [error] trying to recv 
chunk of size 1024 but 4030249 available
Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15497]: [error] trying to recv 
chunk of size 1024 but 40489 available
Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15514]: [error] Corrupt 
blackbox: File header hash (436212587) does not match calculated hash 
(-1660939413)
Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15531]: [error] Corrupt 
blackbox: File header hash (8328043) does not match calculated hash (-905964693)
Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15548]: [error] Corrupt 
blackbox: File header hash (12651) does not match calculated hash (21972)
.

At that time libqb was being built. Are these from its tests, or real errors?
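Incidentally, a dumped blackbox can be inspected by hand with the qb-blackbox tool from libqb; a sketch, assuming corosync's default dump location:

    # decode the flight-recorder data corosync wrote on the fatal error
    qb-blackbox /var/lib/corosync/fdata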


> Thanks,
>   Honza
>
> Andrey Groshev wrote:
>
>>  26.02.2014, 12:11, "Jan Friesse" :
>>>  Andrey,
>>>  what version of corosync and libqb are you using?
>>>
>>>  Can you please attach output from valgrind (and gdb backtrace)?
>>  ...
>>  1314    qb_loop_run (corosync_poll_handle);
>>  (gdb) n
>>
>>  Program received signal SIGSEGV, Segmentation fault.
>>  0x771e581c in free () from /lib64/libc.so.6
>>  (gdb) bt
>>  #0  0x771e581c in free () from /lib64/libc.so.6
>>  #1  0x77fe77ec in votequorum_readconfig (runtime=<optimized out>) at votequorum.c:1293
>>  #2  0x77fe8300 in votequorum_exec_init_fn (api=<optimized out>) at votequorum.c:2115
>>  #3  0x77feeb7b in corosync_service_link_and_init (corosync_api=0x78200980, service=0x78200760) at service.c:139
>>  #4  0x77fe4197 in votequorum_init (api=0x78200980, q_set_quorate_fn=0x77fda5b0) at votequorum.c:2255
>>  #5  0x77fda42f in quorum_exec_init_fn (api=0x78200980) at vsf_quorum.c:280
>>  #6  0x77feeb7b in corosync_service_link_and_init (corosync_api=0x78200980, service=0x78200c40) at service.c:139
>>  #7  0x77feede9 in corosync_service_defaults_link_and_init (corosync_api=0x78200980) at service.c:348
>>  #8  0x77fe9621 in main_service_ready () at main.c:978
>>  #9  0x77b90b0f in main_iface_change_fn (context=0x77f73010, iface_addr=<optimized out>, iface_no=0) at totemsrp.c:4672
>>  #10 0x77b8a734 in timer_function_netif_check_timeout (data=0x78304f10) at totemudp.c:672
>>  #11 0x777289f8 in ?? () from /usr/lib64/libqb.so.0
>>  #12 0x77727016 in qb_loop_run () from /usr/lib64/libqb.so.0
>>  #13 0x77fea930 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1314
>>
>>  Unfortunately, I have not used valgrind before.
>>  It either hangs, or quickly ends with:
>>
>>  # valgrind /usr/sbin/corosync -f
>>  ==2137== Memcheck, a memory error detector
>>  ==2137== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
>>  ==2137== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
>>  ==2137== Command: /usr/sbin/corosync -f
>>  ==2137==
>>  ==2137==
>>  ==2137== HEAP SUMMARY:
>>  ==2137== in use at exit: 29,876 bytes in 193 blocks
>>  ==2137==   total heap usage: 890 allocs, 697 frees, 100,824 bytes allocated
>>  ==2137==
>>  ==2137== LEAK SUMMARY:
>>  ==2137==    definitely lost: 0 bytes in 0 blocks
>>  ==2137==    indirectly lost: 0 bytes in 0 blocks
>>  ==2137==  possibly lost: 539 bytes in 22 blocks
>>  ==2137==    still reachable: 29,337 bytes in 171 blocks
>>  ==2137== suppressed: 0 bytes in 0 blocks
>>  ==2137== Rerun with --leak-check=full to see details of leaked memory
>>  ==2137==
>>  ==2137== For counts of detected and suppressed errors, rerun with: -v
>>  ==2137== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 12 from 6)
>>
>>  Now I'm reading the valgrind manual.
>>>  Thanks,
>>>    Honza
>>>
>>>  Andrey Groshev wrote:
>>>>   Hi, ALL.
>>>>   Either I've confused something, or something broke after updating a
>>>> package (or by my own doing),
>>>>   but corosync is being killed by a segmentation fault.
>>>>   Did I understand correctly tha

Re: [Pacemaker] [corosync] corosync Segmentation fault.

2014-02-26 Thread Andrey Groshev


26.02.2014, 12:11, "Jan Friesse" :
> Andrey,
> what version of corosync and libqb are you using?
>
> Can you please attach output from valgrind (and gdb backtrace)?
...
1314    qb_loop_run (corosync_poll_handle);
(gdb) n

Program received signal SIGSEGV, Segmentation fault.
0x771e581c in free () from /lib64/libc.so.6
(gdb) bt
#0  0x771e581c in free () from /lib64/libc.so.6
#1  0x77fe77ec in votequorum_readconfig (runtime=<optimized out>) at votequorum.c:1293
#2  0x77fe8300 in votequorum_exec_init_fn (api=<optimized out>) at votequorum.c:2115
#3  0x77feeb7b in corosync_service_link_and_init (corosync_api=0x78200980, service=0x78200760) at service.c:139
#4  0x77fe4197 in votequorum_init (api=0x78200980, q_set_quorate_fn=0x77fda5b0) at votequorum.c:2255
#5  0x77fda42f in quorum_exec_init_fn (api=0x78200980) at vsf_quorum.c:280
#6  0x77feeb7b in corosync_service_link_and_init (corosync_api=0x78200980, service=0x78200c40) at service.c:139
#7  0x77feede9 in corosync_service_defaults_link_and_init (corosync_api=0x78200980) at service.c:348
#8  0x77fe9621 in main_service_ready () at main.c:978
#9  0x77b90b0f in main_iface_change_fn (context=0x77f73010, iface_addr=<optimized out>, iface_no=0) at totemsrp.c:4672
#10 0x77b8a734 in timer_function_netif_check_timeout (data=0x78304f10) at totemudp.c:672
#11 0x777289f8 in ?? () from /usr/lib64/libqb.so.0
#12 0x77727016 in qb_loop_run () from /usr/lib64/libqb.so.0
#13 0x77fea930 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:1314

Unfortunately, I have not used valgrind before.
It either hangs, or quickly ends with:

# valgrind /usr/sbin/corosync -f
==2137== Memcheck, a memory error detector
==2137== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==2137== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==2137== Command: /usr/sbin/corosync -f
==2137== 
==2137== 
==2137== HEAP SUMMARY:
==2137== in use at exit: 29,876 bytes in 193 blocks
==2137==   total heap usage: 890 allocs, 697 frees, 100,824 bytes allocated
==2137== 
==2137== LEAK SUMMARY:
==2137==    definitely lost: 0 bytes in 0 blocks
==2137==    indirectly lost: 0 bytes in 0 blocks
==2137==      possibly lost: 539 bytes in 22 blocks
==2137==    still reachable: 29,337 bytes in 171 blocks
==2137==         suppressed: 0 bytes in 0 blocks
==2137== Rerun with --leak-check=full to see details of leaked memory
==2137== 
==2137== For counts of detected and suppressed errors, rerun with: -v
==2137== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 12 from 6)

Now I'm reading the valgrind manual.
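Per valgrind's own hint above, a fuller invocation might look like this (standard valgrind options; a sketch, not from the original message):

    # show details of leaked memory and where uninitialized values came from
    valgrind --leak-check=full --track-origins=yes /usr/sbin/corosync -f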

>
> Thanks,
>   Honza
>
> Andrey Groshev wrote:
>
>>  Hi, ALL.
>>  Either I've confused something, or something broke after updating a
>> package (or by my own doing),
>>  but corosync is being killed by a segmentation fault.
>>  Did I understand correctly that the libqb library is not linked properly?
>>
>>  .
>>
>>  (gdb) n
>>  [New Thread 0x74b2b700 (LWP 9014)]
>>  1266    if ((flock_err = corosync_flock (corosync_lock_file, getpid 
>> ())) != COROSYNC_DONE_EXIT) {
>>  (gdb) n
>>  1280    totempg_initialize (
>>  (gdb) n
>>  1284    totempg_service_ready_register (
>>  (gdb) n
>>  1287    totempg_groups_initialize (
>>  (gdb) n
>>  1292    totempg_groups_join (
>>  (gdb) n
>>  1307    schedwrk_init (
>>  (gdb) n
>>  1314    qb_loop_run (corosync_poll_handle);
>>  (gdb) n
>>
>>  Program received signal SIGSEGV, Segmentation fault.
>>  0x771e581c in free () from /lib64/libc.so.6
>>  (gdb)


[Pacemaker] corosync Segmentation fault.

2014-02-25 Thread Andrey Groshev
Hi, ALL.
Either I've confused something, or something broke after updating a package
(or by my own doing),
but corosync is being killed by a segmentation fault.
Did I understand correctly that the libqb library is not linked properly?

.

(gdb) n
[New Thread 0x74b2b700 (LWP 9014)]
1266        if ((flock_err = corosync_flock (corosync_lock_file, getpid ())) != COROSYNC_DONE_EXIT) {
(gdb) n
1280        totempg_initialize (
(gdb) n
1284        totempg_service_ready_register (
(gdb) n
1287        totempg_groups_initialize (
(gdb) n
1292        totempg_groups_join (
(gdb) n
1307        schedwrk_init (
(gdb) n
1314        qb_loop_run (corosync_poll_handle);
(gdb) n

Program received signal SIGSEGV, Segmentation fault.
0x771e581c in free () from /lib64/libc.so.6
(gdb) 



Re: [Pacemaker] hangs pending

2014-02-25 Thread Andrey Groshev


21.02.2014, 12:04, "Andrey Groshev" :
> 21.02.2014, 05:53, "Andrew Beekhof" :
>
>>  On 19 Feb 2014, at 7:53 pm, Andrey Groshev  wrote:
>>>   19.02.2014, 09:49, "Andrew Beekhof" :
>>>>   On 19 Feb 2014, at 4:18 pm, Andrey Groshev  wrote:
>>>>>    19.02.2014, 09:08, "Andrew Beekhof" :
>>>>>>    On 19 Feb 2014, at 4:00 pm, Andrey Groshev  wrote:
>>>>>>> 19.02.2014, 06:48, "Andrew Beekhof" :
>>>>>>>> On 18 Feb 2014, at 11:05 pm, Andrey Groshev  
>>>>>>>> wrote:
>>>>>>>>>  Hi, ALL and Andrew!
>>>>>>>>>
>>>>>>>>>  Today was a good day - I did a lot of killing, and a lot of
>>>>>>>>> shooting back at me.
>>>>>>>>>  In general - I am happy (almost like an elephant)   :)
>>>>>>>>>  Besides the resources, eight processes on the node matter to me:
>>>>>>>>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>>>  I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>>>  The behavior does not depend on the signal number - that's good.
>>>>>>>>>  If STONITH sends a reboot to the node, it reboots and rejoins
>>>>>>>>> the cluster - that's good too.
>>>>>>>>>  But the behavior differs depending on which daemon is killed.
>>>>>>>>>
>>>>>>>>>  They fall into four groups:
>>>>>>>>>  1. corosync,cib - STONITH works 100%.
>>>>>>>>>  Killing with any signal triggers STONITH and a reboot.
>>>>>>>>>
>>>>>>>>>  2. lrmd,crmd - strange STONITH behavior.
>>>>>>>>>  Sometimes STONITH is called - with the corresponding reaction.
>>>>>>>>>  Sometimes the daemon restarts and the resources restart, with a
>>>>>>>>> large delay for MS:pgsql.
>>>>>>>>>  One time, after crmd restarted, pgsql did not restart.
>>>>>>>>>
>>>>>>>>>  3. stonithd,attrd,pengine - STONITH not needed.
>>>>>>>>>  These daemons simply restart; the resources stay running.
>>>>>>>>>
>>>>>>>>>  4. pacemakerd - nothing happens.
>>>>>>>>>  And after that I can kill any process of the third group. They
>>>>>>>>> do not restart.
>>>>>>>>>  Generally: don't touch corosync,cib, and maybe lrmd,crmd.
>>>>>>>>>
>>>>>>>>>  What do you think about this?
>>>>>>>>>  The main question of this topic we have decided.
>>>>>>>>>  But this varied behavior is another big problem.
>>>>>>>>>
>>>>>>>>>  Forgot the logs: http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>>>>>> Which of the various conditions above do the logs cover?
>>>>>>> All the variants, over one day.
>>>>>>    Are you trying to torture me?
>>>>>>    Can you give me a rough idea what happened when?
>>>>>    No - that's 8 processes times 4 signals, with repeats of the
>>>>> experiments and unknown outcomes :)
>>>>>    It's easier to run new experiments and collect individual new logs.
>>>>>    Which variant is more interesting?
>>>>   The long delay in restarting pgsql.
>>>>   Everything else seems correct.
>>>   It didn't even try to start pgsql.
>>>   In the logs there are three tests:
>>>   kill -s4 <lrmd pid>.
>>>   1. STONITH
>>>   2. STONITH
>>>   3. hangs
>>  It's waiting on a value for default_ping_set
>>
>>  It seems we're calling monitor for pingCheck but for some reason it's not
>> performing an update:
>>
>>  # grep 2632.*lrmd.*pingCheck 
>> /Users/beekhof/Downloads/pcmk-Wed-19-Feb-2014/dev-cluster2-node2.unix.tensor.ru/corosync.log
>>  Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru   lrmd: 
>> info: process_lrmd_get_rsc_info: Resource 'pingCheck' not found (3 active 
>> resources)
>>  Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru   lrmd: 
>> info: process_lrmd_get_rsc_info: Resource 'pingCheck:

Re: [Pacemaker] hangs pending

2014-02-22 Thread Andrey Groshev


21.02.2014, 04:00, "Andrew Beekhof" :
> On 20 Feb 2014, at 10:04 pm, Andrey Groshev  wrote:
>
>>  20.02.2014, 13:57, "Andrew Beekhof" :
>>>  On 20 Feb 2014, at 5:33 pm, Andrey Groshev  wrote:
>>>>   20.02.2014, 01:22, "Andrew Beekhof" :
>>>>>   On 20 Feb 2014, at 4:18 am, Andrey Groshev  wrote:
>>>>>>    19.02.2014, 06:47, "Andrew Beekhof" :
>>>>>>>    On 18 Feb 2014, at 9:29 pm, Andrey Groshev  wrote:
>>>>>>>> Hi, ALL and Andrew!
>>>>>>>>
>>>>>>>> Today was a good day - I did a lot of killing, and a lot of
>>>>>>>> shooting back at me.
>>>>>>>> In general - I am happy (almost like an elephant)   :)
>>>>>>>> Besides the resources, eight processes on the node matter to me:
>>>>>>>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>> I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>> The behavior does not depend on the signal number - that's good.
>>>>>>>> If STONITH sends a reboot to the node, it reboots and rejoins the
>>>>>>>> cluster - that's good too.
>>>>>>>> But the behavior differs depending on which daemon is killed.
>>>>>>>>
>>>>>>>> They fall into four groups:
>>>>>>>> 1. corosync,cib - STONITH works 100%.
>>>>>>>> Killing with any signal triggers STONITH and a reboot.
>>>>>>>    excellent
>>>>>>>> 3. stonithd,attrd,pengine - STONITH not needed.
>>>>>>>> These daemons simply restart; the resources stay running.
>>>>>>>    right
>>>>>>>> 2. lrmd,crmd - strange STONITH behavior.
>>>>>>>> Sometimes STONITH is called - with the corresponding reaction.
>>>>>>>> Sometimes the daemon restarts
>>>>>>>    The daemon will always try to restart; the only variable is how long
>>>>>>> it takes the peer to notice and initiate fencing.
>>>>>>>    If the failure happens just before they're due to receive a totem
>>>>>>> token, the failure will be very quickly detected and the node fenced.
>>>>>>>    If the failure happens just after, then detection will take longer -
>>>>>>> giving the node longer to recover and not be fenced.
>>>>>>>
>>>>>>>    So fence/not fence is normal and to be expected.
>>>>>>>> and restart resources with a large delay for MS:pgsql.
>>>>>>>> One time, after crmd restarted, pgsql did not restart.
>>>>>>>    I would not expect pgsql to ever restart - if the RA does its job
>>>>>>> properly anyway.
>>>>>>>    In the case the node is not fenced, the crmd will respawn and the
>>>>>>> PE will request that it re-detect the state of all resources.
>>>>>>>
>>>>>>>    If the agent reports "all good", then there is nothing more to do.
>>>>>>>    If the agent is not reporting "all good", you should really be
>>>>>>> asking why.
>>>>>>>> 4. pacemakerd - nothing happens.
>>>>>>>    On non-systemd based machines, correct.
>>>>>>>
>>>>>>>    On a systemd based machine pacemakerd is respawned and reattaches to
>>>>>>> the existing daemons.
>>>>>>>    Any subsequent daemon failure will be detected and the daemon
>>>>>>> respawned.
>>>>>>    And! I almost forgot about IT!
>>>>>>    Do any other (NORMAL) variants, methods, ideas exist?
>>>>>>    Without that ... @$%#$%&$%^&$%^&##@#$$^$%& !
>>>>>>    Otherwise it's a complete epic fail ;)
>>>>>   -ENOPARSE
>>>>   OK, I'll set aside my personal attitude toward "systemd".
>>>>   Let me explain.
>>>>
>>>>   Somewhere at the beginning of this topic I wrote:
>>>>   A.G.: Who knows who runs lrmd?
>>>>   A.B.: Pacemakerd.
>>>>   That's point one!
>>>>
>>>>   Let's see the list of processes:
>>>>   #ps -axf
>>>>   .
>>>>   6067 ?    Ssl    7:24 corosync

Re: [Pacemaker] hangs pending

2014-02-21 Thread Andrey Groshev


21.02.2014, 05:53, "Andrew Beekhof" :
> On 19 Feb 2014, at 7:53 pm, Andrey Groshev  wrote:
>
>>  19.02.2014, 09:49, "Andrew Beekhof" :
>>>  On 19 Feb 2014, at 4:18 pm, Andrey Groshev  wrote:
>>>>   19.02.2014, 09:08, "Andrew Beekhof" :
>>>>>   On 19 Feb 2014, at 4:00 pm, Andrey Groshev  wrote:
>>>>>>    19.02.2014, 06:48, "Andrew Beekhof" :
>>>>>>>    On 18 Feb 2014, at 11:05 pm, Andrey Groshev  wrote:
>>>>>>>> Hi, ALL and Andrew!
>>>>>>>>
>>>>>>>> Today was a good day - I did a lot of killing, and a lot of
>>>>>>>> shooting back at me.
>>>>>>>> In general - I am happy (almost like an elephant)   :)
>>>>>>>> Besides the resources, eight processes on the node matter to me:
>>>>>>>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>> I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>> The behavior does not depend on the signal number - that's good.
>>>>>>>> If STONITH sends a reboot to the node, it reboots and rejoins the
>>>>>>>> cluster - that's good too.
>>>>>>>> But the behavior differs depending on which daemon is killed.
>>>>>>>>
>>>>>>>> They fall into four groups:
>>>>>>>> 1. corosync,cib - STONITH works 100%.
>>>>>>>> Killing with any signal triggers STONITH and a reboot.
>>>>>>>>
>>>>>>>> 2. lrmd,crmd - strange STONITH behavior.
>>>>>>>> Sometimes STONITH is called - with the corresponding reaction.
>>>>>>>> Sometimes the daemon restarts and the resources restart, with a
>>>>>>>> large delay for MS:pgsql.
>>>>>>>> One time, after crmd restarted, pgsql did not restart.
>>>>>>>>
>>>>>>>> 3. stonithd,attrd,pengine - STONITH not needed.
>>>>>>>> These daemons simply restart; the resources stay running.
>>>>>>>>
>>>>>>>> 4. pacemakerd - nothing happens.
>>>>>>>> And after that I can kill any process of the third group. They do
>>>>>>>> not restart.
>>>>>>>> Generally: don't touch corosync,cib, and maybe lrmd,crmd.
>>>>>>>>
>>>>>>>> What do you think about this?
>>>>>>>> The main question of this topic we have decided.
>>>>>>>> But this varied behavior is another big problem.
>>>>>>>>
>>>>>>>> Forgot the logs: http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>>>>>    Which of the various conditions above do the logs cover?
>>>>>>    All the variants, over one day.
>>>>>   Are you trying to torture me?
>>>>>   Can you give me a rough idea what happened when?
>>>>   No - that's 8 processes times 4 signals, with repeats of the experiments
>>>> and unknown outcomes :)
>>>>   It's easier to run new experiments and collect individual new logs.
>>>>   Which variant is more interesting?
>>>  The long delay in restarting pgsql.
>>>  Everything else seems correct.
>>  It didn't even try to start pgsql.
>>  In the logs there are three tests:
>>  kill -s4 <lrmd pid>.
>>  1. STONITH
>>  2. STONITH
>>  3. hangs
>
> It's waiting on a value for default_ping_set
>
> It seems we're calling monitor for pingCheck but for some reason it's not
> performing an update:
>
> # grep 2632.*lrmd.*pingCheck 
> /Users/beekhof/Downloads/pcmk-Wed-19-Feb-2014/dev-cluster2-node2.unix.tensor.ru/corosync.log
> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru   lrmd: 
> info: process_lrmd_get_rsc_info: Resource 'pingCheck' not found (3 active 
> resources)
> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru   lrmd: 
> info: process_lrmd_get_rsc_info: Resource 'pingCheck:3' not found (3 active 
> resources)
> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru   lrmd: 
> info: process_lrmd_rsc_register: Added 'pingCheck' to the rsc list (4 active 
> resources)
> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru   lrmd:    
> debug: log_execute: executing - rsc:pingCheck action:monitor call_id:19
> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru   

Re: [Pacemaker] hangs pending

2014-02-20 Thread Andrey Groshev


21.02.2014, 10:18, "Andrew Beekhof" :
> btw. What's with all these entries:
>
> Feb 19 10:49:27 [1641] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
> Feb 19 10:49:27 [1641] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 19 10:49:27 [1772] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_log_init: Changed active directory to 
> /var/lib/heartbeat/cores/hacluster
> Feb 19 10:49:27 [1772] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 19 10:49:29 [1851] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
> Feb 19 10:49:29 [1851] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 19 10:49:35 [2130] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
> Feb 19 10:49:35 [2130] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 19 10:49:35 [2191] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
> Feb 19 10:49:35 [2191] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 19 10:49:40 [2288] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
> Feb 19 10:49:40 [2288] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 19 10:49:45 [2388] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
> Feb 19 10:49:45 [2388] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_xml_cleanup: Cleaning up memory from libxml2
> Feb 19 10:49:51 [2468] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
> Feb 19 10:49:51 [2468] dev-cluster2-node2.unix.tensor.ru pacemakerd: 
> info: crm_xml_cleanup: Cleaning up memory from libxml2
>
> are you calling pacemakerd for some reason?
>


No, in this test I did not touch pacemakerd.
Only kill -4 on the lrmd pid.

> On 19 Feb 2014, at 7:53 pm, Andrey Groshev  wrote:
>
>>  19.02.2014, 09:49, "Andrew Beekhof" :
>>>  On 19 Feb 2014, at 4:18 pm, Andrey Groshev  wrote:
>>>>   19.02.2014, 09:08, "Andrew Beekhof" :
>>>>>   On 19 Feb 2014, at 4:00 pm, Andrey Groshev  wrote:
>>>>>>    19.02.2014, 06:48, "Andrew Beekhof" :
>>>>>>>    On 18 Feb 2014, at 11:05 pm, Andrey Groshev  wrote:
>>>>>>>> Hi, ALL and Andrew!
>>>>>>>>
>>>>>>>> Today was a good day - I did a lot of killing, and a lot of
>>>>>>>> shooting back at me.
>>>>>>>> In general - I am happy (almost like an elephant)   :)
>>>>>>>> Besides the resources, eight processes on the node matter to me:
>>>>>>>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>> I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>> The behavior does not depend on the signal number - that's good.
>>>>>>>> If STONITH sends a reboot to the node, it reboots and rejoins the
>>>>>>>> cluster - that's good too.
>>>>>>>> But the behavior differs depending on which daemon is killed.
>>>>>>>>
>>>>>>>> They fall into four groups:
>>>>>>>> 1. corosync,cib - STONITH works 100%.
>>>>>>>> Killing with any signal triggers STONITH and a reboot.
>>>>>>>>
>>>>>>>> 2. lrmd,crmd - strange STONITH behavior.
>>>>>>>> Sometimes STONITH is called - with the corresponding reaction.
>>>>>>>> Sometimes the daemon restarts and the resources restart, with a
>>>>>>>> large delay for MS:pgsql.
>>>>>>>> One time, after crmd restarted, pgsql did not restart.
>>>>>>>>
>>>>>>>> 3. stonithd,attrd,pengine - STONITH not needed.
>>>>>>>> These daemons simply restart; the resources stay running.
>>>>>>>

Re: [Pacemaker] hangs pending

2014-02-20 Thread Andrey Groshev


20.02.2014, 13:57, "Andrew Beekhof" :
> On 20 Feb 2014, at 5:33 pm, Andrey Groshev  wrote:
>
>>  20.02.2014, 01:22, "Andrew Beekhof" :
>>>  On 20 Feb 2014, at 4:18 am, Andrey Groshev  wrote:
>>>>   19.02.2014, 06:47, "Andrew Beekhof" :
>>>>>   On 18 Feb 2014, at 9:29 pm, Andrey Groshev  wrote:
>>>>>>    Hi, ALL and Andrew!
>>>>>>
>>>>>>    Today was a good day - I did a lot of killing, and a lot of shooting
>>>>>> back at me.
>>>>>>    In general - I am happy (almost like an elephant)   :)
>>>>>>    Besides the resources, eight processes on the node matter to me:
>>>>>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>    I killed them with different signals (4, 6, 11 and even 9).
>>>>>>    The behavior does not depend on the signal number - that's good.
>>>>>>    If STONITH sends a reboot to the node, it reboots and rejoins the
>>>>>> cluster - that's good too.
>>>>>>    But the behavior differs depending on which daemon is killed.
>>>>>>
>>>>>>    They fall into four groups:
>>>>>>    1. corosync,cib - STONITH works 100%.
>>>>>>    Killing with any signal triggers STONITH and a reboot.
>>>>>   excellent
>>>>>>    3. stonithd,attrd,pengine - STONITH not needed.
>>>>>>    These daemons simply restart; the resources stay running.
>>>>>   right
>>>>>>    2. lrmd,crmd - strange STONITH behavior.
>>>>>>    Sometimes STONITH is called - with the corresponding reaction.
>>>>>>    Sometimes the daemon restarts
>>>>>   The daemon will always try to restart; the only variable is how long it
>>>>> takes the peer to notice and initiate fencing.
>>>>>   If the failure happens just before they're due to receive a totem
>>>>> token, the failure will be very quickly detected and the node fenced.
>>>>>   If the failure happens just after, then detection will take longer -
>>>>> giving the node longer to recover and not be fenced.
>>>>>
>>>>>   So fence/not fence is normal and to be expected.
>>>>>>    and restart resources with a large delay for MS:pgsql.
>>>>>>    One time, after crmd restarted, pgsql did not restart.
>>>>>   I would not expect pgsql to ever restart - if the RA does its job
>>>>> properly anyway.
>>>>>   In the case the node is not fenced, the crmd will respawn and the
>>>>> PE will request that it re-detect the state of all resources.
>>>>>
>>>>>   If the agent reports "all good", then there is nothing more to do.
>>>>>   If the agent is not reporting "all good", you should really be asking
>>>>> why.
>>>>>>    4. pacemakerd - nothing happens.
>>>>>   On non-systemd based machines, correct.
>>>>>
>>>>>   On a systemd based machine pacemakerd is respawned and reattaches to
>>>>> the existing daemons.
>>>>>   Any subsequent daemon failure will be detected and the daemon respawned.
>>>>   And! I almost forgot about IT!
>>>>   Do any other (NORMAL) variants, methods, ideas exist?
>>>>   Without that ... @$%#$%&$%^&$%^&##@#$$^$%& !
>>>>   Otherwise it's a complete epic fail ;)
>>>  -ENOPARSE
>>  OK, I'll set aside my personal attitude toward "systemd".
>>  Let me explain.
>>
>>  Somewhere at the beginning of this topic I wrote:
>>  A.G.: Who knows who runs lrmd?
>>  A.B.: Pacemakerd.
>>  That's point one!
>>
>>  Let's see the list of processes:
>>  #ps -axf
>>  .
>>  6067 ?    Ssl    7:24 corosync
>>  6092 ?    S      0:25 pacemakerd
>>  6094 ?    Ss   116:13  \_ /usr/libexec/pacemaker/cib
>>  6095 ?    Ss     0:25  \_ /usr/libexec/pacemaker/stonithd
>>  6096 ?    Ss     1:27  \_ /usr/libexec/pacemaker/lrmd
>>  6097 ?    Ss     0:49  \_ /usr/libexec/pacemaker/attrd
>>  6098 ?    Ss     0:25  \_ /usr/libexec/pacemaker/pengine
>>  6099 ?    Ss     0:29  \_ /usr/libexec/pacemaker/crmd
>>  .
>>  That's point two!
>
> What's two?  I don't follow.
In the sense that it spawns the other processes. But that does not matter.


>>  And more, more...
>>  Now you must understand why I want this process to always be running.
>>  Even

Re: [Pacemaker] hangs pending

2014-02-19 Thread Andrey Groshev


20.02.2014, 01:22, "Andrew Beekhof" :
> On 20 Feb 2014, at 4:18 am, Andrey Groshev  wrote:
>
>>  19.02.2014, 06:47, "Andrew Beekhof" :
>>>  On 18 Feb 2014, at 9:29 pm, Andrey Groshev  wrote:
>>>>   Hi, ALL and Andrew!
>>>>
>>>>   Today is a good day - I killed a lot, and a lot of shooting at me.
>>>>   In general - I am happy (almost like an elephant)   :)
>>>>   Except resources on the node are important to me eight processes: 
>>>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>   I killed them with different signals (4,6,11 and even 9).
>>>>   Behavior does not depend of number signal - it's good.
>>>>   If STONITH send reboot to the node - it rebooted and rejoined the 
>>>> cluster - too it's good.
>>>>   But the behavior is different from killing various demons.
>>>>
>>>>   Turned four groups:
>>>>   1. corosync,cib - STONITH work 100%.
>>>>   Kill via any signals - call STONITH and reboot.
>>>  excellent
>>>>   3. stonithd,attrd,pengine - not need STONITH
>>>>   This daemons simple restart, resources - stay running.
>>>  right
>>>>   2. lrmd,crmd - strange behavior STONITH.
>>>>   Sometimes called STONITH - and the corresponding reaction.
>>>>   Sometimes restart daemon
>>>  The daemon will always try to restart, the only variable is how long it 
>>> takes the peer to notice and initiate fencing.
>>>  If the failure happens just before a they're due to receive totem token, 
>>> the failure will be very quickly detected and the node fenced.
>>>  If the failure happens just after, then detection will take longer - 
>>> giving the node longer to recover and not be fenced.
>>>
>>>  So fence/not fence is normal and to be expected.
>>>>   and restart resources with large delay MS:pgsql.
>>>>   One time after restart crmd - pgsql don't restart.
>>>  I would not expect pgsql to ever restart - if the RA does its job properly 
>>> anyway.
>>>  In the case the node is not fenced, the crmd will respawn and the the PE 
>>> will request that it re-detect the state of all resources.
>>>
>>>  If the agent reports "all good", then there is nothing more to do.
>>>  If the agent is not reporting "all good", you should really be asking why.
>>>>   4. pacemakerd - nothing happens.
>>>  On non-systemd based machines, correct.
>>>
>>>  On a systemd based machine pacemakerd is respawned and reattaches to the 
>>> existing daemons.
>>>  Any subsequent daemon failure will be detected and the daemon respawned.
>>  And! I almost forgot about IT!
>>  Exist another (NORMAL) the variants, the methods, the ideas?
>>  Without this  ... @$%#$%&$%^&$%^&##@#$$^$%& !
>>  Otherwise - it's a full epic fail ;)
>
> -ENOPARSE

OK, I'll set aside my personal attitude toward "systemd".
Let me explain.

Somewhere at the beginning of this topic I wrote:
A.G.: Who knows who runs lrmd?
A.B.: Pacemakerd.
That's point one!

Let's see the list of processes:
#ps -axf
.
 6067 ?        Ssl    7:24 corosync
 6092 ?        S      0:25 pacemakerd
 6094 ?        Ss   116:13  \_ /usr/libexec/pacemaker/cib
 6095 ?        Ss     0:25  \_ /usr/libexec/pacemaker/stonithd
 6096 ?        Ss     1:27  \_ /usr/libexec/pacemaker/lrmd
 6097 ?        Ss     0:49  \_ /usr/libexec/pacemaker/attrd
 6098 ?        Ss     0:25  \_ /usr/libexec/pacemaker/pengine
 6099 ?        Ss     0:29  \_ /usr/libexec/pacemaker/crmd
.
That's point two!
And more, more...
Now you must understand why I want this process to always be running.
I think no one here needs that explained!

And now you say "pacemakerd works nicely, but only on systemd distros"!!!
What should I do now?
* Integrate systemd into CentOS?
* Migrate to Fedora?
* Buy RHEL7!?
Each variant is great, but none of them fits me.

P.S. And I'm not even talking about the distros which have not migrated to
systemd (and will not).
Do not be offended! We do the same:
we build a secret military factory,
put a large concrete fence around it,
top the wall with barbed wire, but forget to install the gates. :)


>>>>   And then I can kill any process of the third group. They do not restart.
>>>  Until they become needed.
>>>  Eg. if the DC goes to invoke the policy engine, that will fail causing the 
>>> crmd to fail and the node to be fenced.
>>>>   Generally: don't touch corosync,cib, and maybe lrmd,crmd.
>>>>
>>>>   What

Re: [Pacemaker] hangs pending

2014-02-19 Thread Andrey Groshev


19.02.2014, 06:47, "Andrew Beekhof" :
> On 18 Feb 2014, at 9:29 pm, Andrey Groshev  wrote:
>
>>  Hi, ALL and Andrew!
>>
>>  Today was a good day - I did a lot of killing, and a lot of shooting back at me.
>>  In general - I am happy (almost like an elephant)   :)
>>  Besides the resources, eight processes on the node matter to me:
>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>  I killed them with different signals (4, 6, 11 and even 9).
>>  The behavior does not depend on the signal number - that's good.
>>  If STONITH sends a reboot to the node, it reboots and rejoins the cluster -
>> that's good too.
>>  But the behavior differs depending on which daemon is killed.
>>
>>  They fall into four groups:
>>  1. corosync,cib - STONITH works 100%.
>>  Killing with any signal triggers STONITH and a reboot.
>
> excellent
>
>>  3. stonithd,attrd,pengine - STONITH not needed.
>>  These daemons simply restart; the resources stay running.
>
> right
>
>>  2. lrmd,crmd - strange STONITH behavior.
>>  Sometimes STONITH is called - with the corresponding reaction.
>>  Sometimes the daemon restarts
>
> The daemon will always try to restart; the only variable is how long it takes
> the peer to notice and initiate fencing.
> If the failure happens just before they're due to receive a totem token, the
> failure will be very quickly detected and the node fenced.
> If the failure happens just after, then detection will take longer - giving
> the node longer to recover and not be fenced.
>
> So fence/not fence is normal and to be expected.
>
>>  and restart resources with a large delay for MS:pgsql.
>>  One time, after crmd restarted, pgsql did not restart.
>
> I would not expect pgsql to ever restart - if the RA does its job properly 
> anyway.
> In the case the node is not fenced, the crmd will respawn and the PE will
> request that it re-detect the state of all resources.
>
> If the agent reports "all good", then there is nothing more to do.
> If the agent is not reporting "all good", you should really be asking why.
>
>>  4. pacemakerd - nothing happens.
>
> On non-systemd based machines, correct.
>
> On a systemd based machine pacemakerd is respawned and reattaches to the 
> existing daemons.
> Any subsequent daemon failure will be detected and the daemon respawned.

And! I almost forgot about IT!
Does another (NORMAL) variant, method, or idea exist?
Without this  ... @$%#$%&$%^&$%^&##@#$$^$%& !
Otherwise - it's a complete epic fail ;)

>>  And then I can kill any process of the third group. They do not restart.
>
> Until they become needed.
> Eg. if the DC goes to invoke the policy engine, that will fail causing the 
> crmd to fail and the node to be fenced.
>
>>  Generaly don't touch corosync,cib and maybe lrmd,crmd.
>>
>>  What do you think about this?
>>  The main question of this topic - we decided.
>>  But this varied behavior - another big problem.
>>
>>  17.02.2014, 08:52, "Andrey Groshev" :
>>>  17.02.2014, 02:27, "Andrew Beekhof" :
>>>>   With no quick follow-up, dare one hope that means the patch worked? :-)
>>>  Hi,
>>>  No, unfortunately the chief changed my plans on Friday and all day I was 
>>> engaged in a parallel project.
>>>  I hope that today have time to carry out the necessary tests.
>>>>   On 14 Feb 2014, at 3:37 pm, Andrey Groshev  wrote:
>>>>>    Yes, of course. Now beginning build world and test )
>>>>>
>>>>>    14.02.2014, 04:41, "Andrew Beekhof" :
>>>>>>    The previous patch wasn't quite right.
>>>>>>    Could you try this new one?
>>>>>>
>>>>>>   http://paste.fedoraproject.org/77123/13923376/
>>>>>>
>>>>>>    [11:23 AM] beekhof@f19 ~/Development/sources/pacemaker/devel ☺ # git 
>>>>>> diff
>>>>>>    diff --git a/crmd/callbacks.c b/crmd/callbacks.c
>>>>>>    index ac4b905..d49525b 100644
>>>>>>    --- a/crmd/callbacks.c
>>>>>>    +++ b/crmd/callbacks.c
>>>>>>    @@ -199,8 +199,7 @@ peer_update_callback(enum crm_status_type type, 
>>>>>> crm_node_t * node, const void *d
>>>>>> stop_te_timer(down->timer);
>>>>>>
>>>>>> flags |= node_update_join | node_update_expected;
>>>>>>    -    crm_update_peer_join(__FUNCTION__, node, 
>>>>>> crm

Re: [Pacemaker] hangs pending

2014-02-19 Thread Andrey Groshev


19.02.2014, 09:49, "Andrew Beekhof" :
> On 19 Feb 2014, at 4:18 pm, Andrey Groshev  wrote:
>
>>  19.02.2014, 09:08, "Andrew Beekhof" :
>>>  On 19 Feb 2014, at 4:00 pm, Andrey Groshev  wrote:
>>>>   19.02.2014, 06:48, "Andrew Beekhof" :
>>>>>   On 18 Feb 2014, at 11:05 pm, Andrey Groshev  wrote:
>>>>>>    Hi, ALL and Andrew!
>>>>>>
>>>>>>    Today is a good day - I killed a lot, and a lot of shooting at me.
>>>>>>    In general - I am happy (almost like an elephant)   :)
>>>>>>    Except resources on the node are important to me eight processes: 
>>>>>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>    I killed them with different signals (4,6,11 and even 9).
>>>>>>    Behavior does not depend of number signal - it's good.
>>>>>>    If STONITH send reboot to the node - it rebooted and rejoined the 
>>>>>> cluster - too it's good.
>>>>>>    But the behavior is different from killing various demons.
>>>>>>
>>>>>>    Turned four groups:
>>>>>>    1. corosync,cib - STONITH work 100%.
>>>>>>    Kill via any signals - call STONITH and reboot.
>>>>>>
>>>>>>    2. lrmd,crmd - strange behavior STONITH.
>>>>>>    Sometimes called STONITH - and the corresponding reaction.
>>>>>>    Sometimes restart daemon and restart resources with large delay 
>>>>>> MS:pgsql.
>>>>>>    One time after restart crmd - pgsql don't restart.
>>>>>>
>>>>>>    3. stonithd,attrd,pengine - not need STONITH
>>>>>>    This daemons simple restart, resources - stay running.
>>>>>>
>>>>>>    4. pacemakerd - nothing happens.
>>>>>>    And then I can kill any process of the third group. They do not 
>>>>>> restart.
>>>>>>    Generaly don't touch corosync,cib and maybe lrmd,crmd.
>>>>>>
>>>>>>    What do you think about this?
>>>>>>    The main question of this topic - we decided.
>>>>>>    But this varied behavior - another big problem.
>>>>>>
>>>>>>    Forgot logs http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>>>   Which of the various conditions above do the logs cover?
>>>>   All various in day.
>>>  Are you trying to torture me?
>>>  Can you give me a rough idea what happened when?
>>  No, that's 8 processes times 4 signals, with the experiments repeated and the
>> outcomes unknown :)
>>  It's easier to run new experiments and collect separate new logs.
>>  Which variant is more interesting?
>
> The long delay in restarting pgsql.
> Everything else seems correct.
>

It didn't even try to start pgsql.
The logs contain three tests of
kill -s 4 <pid of lrmd>:
1. STONITH
2. STONITH
3. hangs
http://send2me.ru/pcmk-Wed-19-Feb-2014.tar.bz2

> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] hangs pending

2014-02-18 Thread Andrey Groshev


19.02.2014, 09:49, "Andrew Beekhof" :
> On 19 Feb 2014, at 4:18 pm, Andrey Groshev  wrote:
>
>>  19.02.2014, 09:08, "Andrew Beekhof" :
>>>  On 19 Feb 2014, at 4:00 pm, Andrey Groshev  wrote:
>>>>   19.02.2014, 06:48, "Andrew Beekhof" :
>>>>>   On 18 Feb 2014, at 11:05 pm, Andrey Groshev  wrote:
>>>>>>    Hi, ALL and Andrew!
>>>>>>
>>>>>>    Today is a good day - I killed a lot, and a lot of shooting at me.
>>>>>>    In general - I am happy (almost like an elephant)   :)
>>>>>>    Except resources on the node are important to me eight processes: 
>>>>>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>    I killed them with different signals (4,6,11 and even 9).
>>>>>>    Behavior does not depend of number signal - it's good.
>>>>>>    If STONITH send reboot to the node - it rebooted and rejoined the 
>>>>>> cluster - too it's good.
>>>>>>    But the behavior is different from killing various demons.
>>>>>>
>>>>>>    Turned four groups:
>>>>>>    1. corosync,cib - STONITH work 100%.
>>>>>>    Kill via any signals - call STONITH and reboot.
>>>>>>
>>>>>>    2. lrmd,crmd - strange behavior STONITH.
>>>>>>    Sometimes called STONITH - and the corresponding reaction.
>>>>>>    Sometimes restart daemon and restart resources with large delay 
>>>>>> MS:pgsql.
>>>>>>    One time after restart crmd - pgsql don't restart.
>>>>>>
>>>>>>    3. stonithd,attrd,pengine - not need STONITH
>>>>>>    This daemons simple restart, resources - stay running.
>>>>>>
>>>>>>    4. pacemakerd - nothing happens.
>>>>>>    And then I can kill any process of the third group. They do not 
>>>>>> restart.
>>>>>>    Generaly don't touch corosync,cib and maybe lrmd,crmd.
>>>>>>
>>>>>>    What do you think about this?
>>>>>>    The main question of this topic - we decided.
>>>>>>    But this varied behavior - another big problem.
>>>>>>
>>>>>>    Forgot logs http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>>>   Which of the various conditions above do the logs cover?
>>>>   All various in day.
>>>  Are you trying to torture me?
>>>  Can you give me a rough idea what happened when?
>>  No, that's 8 processes times 4 signals, with the experiments repeated and the
>> outcomes unknown :)
>>  It's easier to run new experiments and collect separate new logs.
>>  Which variant is more interesting?
>
> The long delay in restarting pgsql.
> Everything else seems correct.
>
> ,
Now running the tests: the first and second ended in STONITH; in the third, lrmd restarted
and I am waiting.
clonePing works - but it is "stateless".
I'll wait for pgsql to start and then build a crm_report (10 minutes already).
Meanwhile I see in crm_simulate -sL:
...
   debug: native_assign_node:   All nodes for resource pgsql:3 are unavailable, 
unclean or shutting down (dev-cluster2-node2: 1, -100)
   debug: native_assign_node:   Could not allocate a node for pgsql:3
info: native_color: Resource pgsql:3 cannot run anywhere
   debug: clone_color:  Allocated 3 msPostgresql instances of a possible 4
..
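For anyone who lands in this "cannot run anywhere" state: the usual way to make the PE
re-probe is to clear the resource's stale operation history. A sketch, assuming a
1.1-era crm_resource and the resource/node names used in this thread:

# wipe msPostgresql's recorded state on the stuck node so the PE re-probes it
crm_resource --cleanup --resource msPostgresql --node dev-cluster2-node2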

> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] hangs pending

2014-02-18 Thread Andrey Groshev


19.02.2014, 09:08, "Andrew Beekhof" :
> On 19 Feb 2014, at 4:00 pm, Andrey Groshev  wrote:
>
>>  19.02.2014, 06:48, "Andrew Beekhof" :
>>>  On 18 Feb 2014, at 11:05 pm, Andrey Groshev  wrote:
>>>>   Hi, ALL and Andrew!
>>>>
>>>>   Today is a good day - I killed a lot, and a lot of shooting at me.
>>>>   In general - I am happy (almost like an elephant)   :)
>>>>   Except resources on the node are important to me eight processes: 
>>>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>   I killed them with different signals (4,6,11 and even 9).
>>>>   Behavior does not depend of number signal - it's good.
>>>>   If STONITH send reboot to the node - it rebooted and rejoined the 
>>>> cluster - too it's good.
>>>>   But the behavior is different from killing various demons.
>>>>
>>>>   Turned four groups:
>>>>   1. corosync,cib - STONITH work 100%.
>>>>   Kill via any signals - call STONITH and reboot.
>>>>
>>>>   2. lrmd,crmd - strange behavior STONITH.
>>>>   Sometimes called STONITH - and the corresponding reaction.
>>>>   Sometimes restart daemon and restart resources with large delay MS:pgsql.
>>>>   One time after restart crmd - pgsql don't restart.
>>>>
>>>>   3. stonithd,attrd,pengine - not need STONITH
>>>>   This daemons simple restart, resources - stay running.
>>>>
>>>>   4. pacemakerd - nothing happens.
>>>>   And then I can kill any process of the third group. They do not restart.
>>>>   Generaly don't touch corosync,cib and maybe lrmd,crmd.
>>>>
>>>>   What do you think about this?
>>>>   The main question of this topic - we decided.
>>>>   But this varied behavior - another big problem.
>>>>
>>>>   Forgot logs http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>  Which of the various conditions above do the logs cover?
>>  All of the variants, over the course of the day.
>
> Are you trying to torture me?
> Can you give me a rough idea what happened when?

No, that's 8 processes times 4 signals, with the experiments repeated and the
outcomes unknown :)
It's easier to run new experiments and collect separate new logs.
Which variant is more interesting?

> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] hangs pending

2014-02-18 Thread Andrey Groshev


19.02.2014, 06:48, "Andrew Beekhof" :
> On 18 Feb 2014, at 11:05 pm, Andrey Groshev  wrote:
>
>>  Hi, ALL and Andrew!
>>
>>  Today is a good day - I killed a lot, and a lot of shooting at me.
>>  In general - I am happy (almost like an elephant)   :)
>>  Except resources on the node are important to me eight processes: 
>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>  I killed them with different signals (4,6,11 and even 9).
>>  Behavior does not depend of number signal - it's good.
>>  If STONITH send reboot to the node - it rebooted and rejoined the cluster - 
>> too it's good.
>>  But the behavior is different from killing various demons.
>>
>>  Turned four groups:
>>  1. corosync,cib - STONITH work 100%.
>>  Kill via any signals - call STONITH and reboot.
>>
>>  2. lrmd,crmd - strange behavior STONITH.
>>  Sometimes called STONITH - and the corresponding reaction.
>>  Sometimes restart daemon and restart resources with large delay MS:pgsql.
>>  One time after restart crmd - pgsql don't restart.
>>
>>  3. stonithd,attrd,pengine - not need STONITH
>>  This daemons simple restart, resources - stay running.
>>
>>  4. pacemakerd - nothing happens.
>>  And then I can kill any process of the third group. They do not restart.
>>  Generaly don't touch corosync,cib and maybe lrmd,crmd.
>>
>>  What do you think about this?
>>  The main question of this topic - we decided.
>>  But this varied behavior - another big problem.
>>
>>  Forgot logs http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>
> Which of the various conditions above do the logs cover?
>

All of the variants, over the course of the day.

> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] hangs pending

2014-02-18 Thread Andrey Groshev
Hi, ALL and Andrew!

Today was a good day - I killed a lot, and was shot at a lot in return.
In general - I am happy (almost like an elephant)   :)
Besides the resources, eight processes on the node matter to me:
corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd.
I killed them with different signals (4, 6, 11 and even 9).
The behavior does not depend on the signal number - that's good.
If STONITH sends a reboot to the node - it reboots and rejoins the cluster -
that's also good.
But the behavior differs depending on which daemon is killed.

Four groups emerged:
1. corosync, cib - STONITH works 100%.
Kill with any signal - STONITH is called and the node reboots.

2. lrmd, crmd - strange STONITH behavior.
Sometimes STONITH is called - with the corresponding reaction.
Sometimes the daemon restarts, and the MS:pgsql resource restarts with a large delay.
One time, after crmd restarted, pgsql did not restart.

3. stonithd, attrd, pengine - no STONITH needed.
These daemons simply restart; the resources stay running.

4. pacemakerd - nothing happens.
And afterwards I can kill any process of the third group. They do not restart.
Generally, don't touch corosync, cib and maybe lrmd, crmd.

What do you think about this?
The main question of this topic, we have resolved.
But this varied behavior is another big problem.

Forgot the logs: http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
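For reproducibility, each experiment above boils down to one line on the victim node -
a sketch, with the daemon and signal chosen purely for illustration:

# kill one pacemaker daemon with one signal, then watch the reaction
kill -s 4 "$(pidof lrmd)"
crm_mon -1    # run on a surviving node to see fencing / restart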


> 17.02.2014, 08:52, "Andrey Groshev" :
>
>>  17.02.2014, 02:27, "Andrew Beekhof" :
>>>   With no quick follow-up, dare one hope that means the patch worked? :-)
>>  Hi,
>>  No, unfortunately the chief changed my plans on Friday and all day I was 
>> engaged in a parallel project.
>>  I hope that today have time to carry out the necessary tests.
>>>   On 14 Feb 2014, at 3:37 pm, Andrey Groshev  wrote:
>>>>    Yes, of course. Now beginning build world and test )
>>>>
>>>>    14.02.2014, 04:41, "Andrew Beekhof" :
>>>>>    The previous patch wasn't quite right.
>>>>>    Could you try this new one?
>>>>>
>>>>>   http://paste.fedoraproject.org/77123/13923376/
>>>>>
>>>>>    [11:23 AM] beekhof@f19 ~/Development/sources/pacemaker/devel ☺ # git 
>>>>> diff
>>>>>    diff --git a/crmd/callbacks.c b/crmd/callbacks.c
>>>>>    index ac4b905..d49525b 100644
>>>>>    --- a/crmd/callbacks.c
>>>>>    +++ b/crmd/callbacks.c
>>>>>    @@ -199,8 +199,7 @@ peer_update_callback(enum crm_status_type type, 
>>>>> crm_node_t * node, const void *d
>>>>> stop_te_timer(down->timer);
>>>>>
>>>>> flags |= node_update_join | node_update_expected;
>>>>>    -    crm_update_peer_join(__FUNCTION__, node, 
>>>>> crm_join_none);
>>>>>    -    crm_update_peer_expected(__FUNCTION__, node, 
>>>>> CRMD_JOINSTATE_DOWN);
>>>>>    +    crmd_peer_down(node, FALSE);
>>>>> check_join_state(fsa_state, __FUNCTION__);
>>>>>
>>>>> update_graph(transition_graph, down);
>>>>>    diff --git a/crmd/crmd_utils.h b/crmd/crmd_utils.h
>>>>>    index bc472c2..1a2577a 100644
>>>>>    --- a/crmd/crmd_utils.h
>>>>>    +++ b/crmd/crmd_utils.h
>>>>>    @@ -100,6 +100,7 @@ void crmd_join_phase_log(int level);
>>>>> const char *get_timer_desc(fsa_timer_t * timer);
>>>>> gboolean too_many_st_failures(void);
>>>>> void st_fail_count_reset(const char * target);
>>>>>    +void crmd_peer_down(crm_node_t *peer, bool full);
>>>>>
>>>>> #  define fsa_register_cib_callback(id, flag, data, fn) do {  
>>>>> \
>>>>> fsa_cib_conn->cmds->register_callback(    
>>>>>   \
>>>>>    diff --git a/crmd/te_actions.c b/crmd/te_actions.c
>>>>>    index f31d4ec..3bfce59 100644
>>>>>    --- a/crmd/te_actions.c
>>>>>    +++ b/crmd/te_actions.c
>>>>>    @@ -80,11 +80,8 @@ send_stonith_update(crm_action_t * action, const 
>>>>> char *target, const char *uuid)
>>>>> crm_info("Recording uuid '%s' for node '%s'", uuid, target);
>>>>> peer->uuid = strdup(uuid);
>>>>> }
>>>>>    -    crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>    -    crm_update_peer_state(__FUN

Re: [Pacemaker] hangs pending

2014-02-18 Thread Andrey Groshev
Hi, ALL and Andrew!

Today was a good day - I killed a lot, and was shot at a lot in return.
In general - I am happy (almost like an elephant)   :)
Besides the resources, eight processes on the node matter to me:
corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd.
I killed them with different signals (4, 6, 11 and even 9).
The behavior does not depend on the signal number - that's good.
If STONITH sends a reboot to the node - it reboots and rejoins the cluster -
that's also good.
But the behavior differs depending on which daemon is killed.

Four groups emerged:
1. corosync, cib - STONITH works 100%.
Kill with any signal - STONITH is called and the node reboots.

2. lrmd, crmd - strange STONITH behavior.
Sometimes STONITH is called - with the corresponding reaction.
Sometimes the daemon restarts, and the MS:pgsql resource restarts with a large delay.
One time, after crmd restarted, pgsql did not restart.

3. stonithd, attrd, pengine - no STONITH needed.
These daemons simply restart; the resources stay running.

4. pacemakerd - nothing happens.
And afterwards I can kill any process of the third group. They do not restart.
Generally, don't touch corosync, cib and maybe lrmd, crmd.

What do you think about this?
The main question of this topic, we have resolved.
But this varied behavior is another big problem.




17.02.2014, 08:52, "Andrey Groshev" :
> 17.02.2014, 02:27, "Andrew Beekhof" :
>
>>  With no quick follow-up, dare one hope that means the patch worked? :-)
>
> Hi,
> No, unfortunately the chief changed my plans on Friday and all day I was 
> engaged in a parallel project.
> I hope that today have time to carry out the necessary tests.
>
>>  On 14 Feb 2014, at 3:37 pm, Andrey Groshev  wrote:
>>>   Yes, of course. Now beginning build world and test )
>>>
>>>   14.02.2014, 04:41, "Andrew Beekhof" :
>>>>   The previous patch wasn't quite right.
>>>>   Could you try this new one?
>>>>
>>>>  http://paste.fedoraproject.org/77123/13923376/
>>>>
>>>>   [11:23 AM] beekhof@f19 ~/Development/sources/pacemaker/devel ☺ # git diff
>>>>   diff --git a/crmd/callbacks.c b/crmd/callbacks.c
>>>>   index ac4b905..d49525b 100644
>>>>   --- a/crmd/callbacks.c
>>>>   +++ b/crmd/callbacks.c
>>>>   @@ -199,8 +199,7 @@ peer_update_callback(enum crm_status_type type, 
>>>> crm_node_t * node, const void *d
>>>>    stop_te_timer(down->timer);
>>>>
>>>>    flags |= node_update_join | node_update_expected;
>>>>   -    crm_update_peer_join(__FUNCTION__, node, crm_join_none);
>>>>   -    crm_update_peer_expected(__FUNCTION__, node, 
>>>> CRMD_JOINSTATE_DOWN);
>>>>   +    crmd_peer_down(node, FALSE);
>>>>    check_join_state(fsa_state, __FUNCTION__);
>>>>
>>>>    update_graph(transition_graph, down);
>>>>   diff --git a/crmd/crmd_utils.h b/crmd/crmd_utils.h
>>>>   index bc472c2..1a2577a 100644
>>>>   --- a/crmd/crmd_utils.h
>>>>   +++ b/crmd/crmd_utils.h
>>>>   @@ -100,6 +100,7 @@ void crmd_join_phase_log(int level);
>>>>    const char *get_timer_desc(fsa_timer_t * timer);
>>>>    gboolean too_many_st_failures(void);
>>>>    void st_fail_count_reset(const char * target);
>>>>   +void crmd_peer_down(crm_node_t *peer, bool full);
>>>>
>>>>    #  define fsa_register_cib_callback(id, flag, data, fn) do {    
>>>>   \
>>>>    fsa_cib_conn->cmds->register_callback(  
>>>> \
>>>>   diff --git a/crmd/te_actions.c b/crmd/te_actions.c
>>>>   index f31d4ec..3bfce59 100644
>>>>   --- a/crmd/te_actions.c
>>>>   +++ b/crmd/te_actions.c
>>>>   @@ -80,11 +80,8 @@ send_stonith_update(crm_action_t * action, const char 
>>>> *target, const char *uuid)
>>>>    crm_info("Recording uuid '%s' for node '%s'", uuid, target);
>>>>    peer->uuid = strdup(uuid);
>>>>    }
>>>>   -    crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>   -    crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>   -    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>   -    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>
>>>>   +    crmd_peer_down(peer, TRUE);
>>>>    node_state =
>>>>    do_update_node_cib(peer,
>>>&g

Re: [Pacemaker] hangs pending

2014-02-16 Thread Andrey Groshev
17.02.2014, 02:27, "Andrew Beekhof" :
> With no quick follow-up, dare one hope that means the patch worked? :-)
>
Hi,
No, unfortunately the chief changed my plans on Friday, and I spent all day
on a parallel project.
I hope that today I will have time to carry out the necessary tests.


> On 14 Feb 2014, at 3:37 pm, Andrey Groshev  wrote:
>
>>  Yes, of course. Now beginning build world and test )
>>
>>  14.02.2014, 04:41, "Andrew Beekhof" :
>>>  The previous patch wasn't quite right.
>>>  Could you try this new one?
>>>
>>> http://paste.fedoraproject.org/77123/13923376/
>>>
>>>  [11:23 AM] beekhof@f19 ~/Development/sources/pacemaker/devel ☺ # git diff
>>>  diff --git a/crmd/callbacks.c b/crmd/callbacks.c
>>>  index ac4b905..d49525b 100644
>>>  --- a/crmd/callbacks.c
>>>  +++ b/crmd/callbacks.c
>>>  @@ -199,8 +199,7 @@ peer_update_callback(enum crm_status_type type, 
>>> crm_node_t * node, const void *d
>>>   stop_te_timer(down->timer);
>>>
>>>   flags |= node_update_join | node_update_expected;
>>>  -    crm_update_peer_join(__FUNCTION__, node, crm_join_none);
>>>  -    crm_update_peer_expected(__FUNCTION__, node, 
>>> CRMD_JOINSTATE_DOWN);
>>>  +    crmd_peer_down(node, FALSE);
>>>   check_join_state(fsa_state, __FUNCTION__);
>>>
>>>   update_graph(transition_graph, down);
>>>  diff --git a/crmd/crmd_utils.h b/crmd/crmd_utils.h
>>>  index bc472c2..1a2577a 100644
>>>  --- a/crmd/crmd_utils.h
>>>  +++ b/crmd/crmd_utils.h
>>>  @@ -100,6 +100,7 @@ void crmd_join_phase_log(int level);
>>>   const char *get_timer_desc(fsa_timer_t * timer);
>>>   gboolean too_many_st_failures(void);
>>>   void st_fail_count_reset(const char * target);
>>>  +void crmd_peer_down(crm_node_t *peer, bool full);
>>>
>>>   #  define fsa_register_cib_callback(id, flag, data, fn) do {  
>>> \
>>>   fsa_cib_conn->cmds->register_callback(  \
>>>  diff --git a/crmd/te_actions.c b/crmd/te_actions.c
>>>  index f31d4ec..3bfce59 100644
>>>  --- a/crmd/te_actions.c
>>>  +++ b/crmd/te_actions.c
>>>  @@ -80,11 +80,8 @@ send_stonith_update(crm_action_t * action, const char 
>>> *target, const char *uuid)
>>>   crm_info("Recording uuid '%s' for node '%s'", uuid, target);
>>>   peer->uuid = strdup(uuid);
>>>   }
>>>  -    crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>  -    crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>  -    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>  -    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>
>>>  +    crmd_peer_down(peer, TRUE);
>>>   node_state =
>>>   do_update_node_cib(peer,
>>>  node_update_cluster | node_update_peer | 
>>> node_update_join |
>>>  diff --git a/crmd/te_utils.c b/crmd/te_utils.c
>>>  index ad7e573..0c92e95 100644
>>>  --- a/crmd/te_utils.c
>>>  +++ b/crmd/te_utils.c
>>>  @@ -247,10 +247,7 @@ tengine_stonith_notify(stonith_t * st, 
>>> stonith_event_t * st_event)
>>>
>>>   }
>>>
>>>  -    crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>  -    crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>  -    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>  -    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>  +    crmd_peer_down(peer, TRUE);
>>>    }
>>>   }
>>>
>>>  diff --git a/crmd/utils.c b/crmd/utils.c
>>>  index 3988cfe..2df53ab 100644
>>>  --- a/crmd/utils.c
>>>  +++ b/crmd/utils.c
>>>  @@ -1077,3 +1077,13 @@ update_attrd_remote_node_removed(const char *host, 
>>> const char *user_name)
>>>   crm_trace("telling attrd to clear attributes for remote host %s", 
>>> host);
>>>   update_attrd_helper(host, NULL, NULL, user_name, TRUE, 'C');
>>>   }
>>>  +
>>>  +void crmd_peer_down(crm_node_t *peer, bool full)
>>>  +{
>>>  +    if(full && peer->state == NULL) {
>>>  +    crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>&

Re: [Pacemaker] hangs pending

2014-02-13 Thread Andrey Groshev
Yes, of course. Now starting to build everything and test )

14.02.2014, 04:41, "Andrew Beekhof" :
> The previous patch wasn't quite right.
> Could you try this new one?
>
>    http://paste.fedoraproject.org/77123/13923376/
>
> [11:23 AM] beekhof@f19 ~/Development/sources/pacemaker/devel ☺ # git diff
> diff --git a/crmd/callbacks.c b/crmd/callbacks.c
> index ac4b905..d49525b 100644
> --- a/crmd/callbacks.c
> +++ b/crmd/callbacks.c
> @@ -199,8 +199,7 @@ peer_update_callback(enum crm_status_type type, 
> crm_node_t * node, const void *d
>  stop_te_timer(down->timer);
>
>  flags |= node_update_join | node_update_expected;
> -    crm_update_peer_join(__FUNCTION__, node, crm_join_none);
> -    crm_update_peer_expected(__FUNCTION__, node, 
> CRMD_JOINSTATE_DOWN);
> +    crmd_peer_down(node, FALSE);
>  check_join_state(fsa_state, __FUNCTION__);
>
>  update_graph(transition_graph, down);
> diff --git a/crmd/crmd_utils.h b/crmd/crmd_utils.h
> index bc472c2..1a2577a 100644
> --- a/crmd/crmd_utils.h
> +++ b/crmd/crmd_utils.h
> @@ -100,6 +100,7 @@ void crmd_join_phase_log(int level);
>  const char *get_timer_desc(fsa_timer_t * timer);
>  gboolean too_many_st_failures(void);
>  void st_fail_count_reset(const char * target);
> +void crmd_peer_down(crm_node_t *peer, bool full);
>
>  #  define fsa_register_cib_callback(id, flag, data, fn) do {  \
>  fsa_cib_conn->cmds->register_callback(  \
> diff --git a/crmd/te_actions.c b/crmd/te_actions.c
> index f31d4ec..3bfce59 100644
> --- a/crmd/te_actions.c
> +++ b/crmd/te_actions.c
> @@ -80,11 +80,8 @@ send_stonith_update(crm_action_t * action, const char 
> *target, const char *uuid)
>  crm_info("Recording uuid '%s' for node '%s'", uuid, target);
>  peer->uuid = strdup(uuid);
>  }
> -    crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
> -    crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
> -    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
> -    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>
> +    crmd_peer_down(peer, TRUE);
>  node_state =
>  do_update_node_cib(peer,
> node_update_cluster | node_update_peer | 
> node_update_join |
> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
> index ad7e573..0c92e95 100644
> --- a/crmd/te_utils.c
> +++ b/crmd/te_utils.c
> @@ -247,10 +247,7 @@ tengine_stonith_notify(stonith_t * st, stonith_event_t * 
> st_event)
>
>  }
>
> -    crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
> -    crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
> -    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
> -    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
> +    crmd_peer_down(peer, TRUE);
>   }
>  }
>
> diff --git a/crmd/utils.c b/crmd/utils.c
> index 3988cfe..2df53ab 100644
> --- a/crmd/utils.c
> +++ b/crmd/utils.c
> @@ -1077,3 +1077,13 @@ update_attrd_remote_node_removed(const char *host, 
> const char *user_name)
>  crm_trace("telling attrd to clear attributes for remote host %s", host);
>  update_attrd_helper(host, NULL, NULL, user_name, TRUE, 'C');
>  }
> +
> +void crmd_peer_down(crm_node_t *peer, bool full)
> +{
> +    if(full && peer->state == NULL) {
> +    crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
> +    crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
> +    }
> +    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
> +    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
> +}
>
> On 16 Jan 2014, at 7:24 pm, Andrey Groshev  wrote:
>
>>  16.01.2014, 01:30, "Andrew Beekhof" :
>>>  On 16 Jan 2014, at 12:41 am, Andrey Groshev  wrote:
>>>>   15.01.2014, 02:53, "Andrew Beekhof" :
>>>>>   On 15 Jan 2014, at 12:15 am, Andrey Groshev  wrote:
>>>>>>    14.01.2014, 10:00, "Andrey Groshev" :
>>>>>>>    14.01.2014, 07:47, "Andrew Beekhof" :
>>>>>>>> Ok, here's what happens:
>>>>>>>>
>>>>>>>> 1. node2 is lost
>>>>>>>> 2. fencing of node2 starts
>>>>>>>> 3. node2 reboots (and cluster starts)
>>>>>>>> 4. node2 returns to the membership
>>>>>>>> 5. node2 is marked as a c

Re: [Pacemaker] hangs pending

2014-02-06 Thread Andrey Groshev
Hi, Andrew and ALL!

Andrew, we haven't buried this topic, have we?

16.01.2014, 12:32, "Andrey Groshev" :
> 16.01.2014, 01:30, "Andrew Beekhof" :
>
>>  On 16 Jan 2014, at 12:41 am, Andrey Groshev  wrote:
>>>   15.01.2014, 02:53, "Andrew Beekhof" :
>>>>   On 15 Jan 2014, at 12:15 am, Andrey Groshev  wrote:
>>>>>    14.01.2014, 10:00, "Andrey Groshev" :
>>>>>>    14.01.2014, 07:47, "Andrew Beekhof" :
>>>>>>> Ok, here's what happens:
>>>>>>>
>>>>>>> 1. node2 is lost
>>>>>>> 2. fencing of node2 starts
>>>>>>> 3. node2 reboots (and cluster starts)
>>>>>>> 4. node2 returns to the membership
>>>>>>> 5. node2 is marked as a cluster member
>>>>>>> 6. DC tries to bring it into the cluster, but needs to cancel the 
>>>>>>> active transition first.
>>>>>>>    Which is a problem since the node2 fencing operation is part of 
>>>>>>> that
>>>>>>> 7. node2 is in a transition (pending) state until fencing passes or 
>>>>>>> fails
>>>>>>> 8a. fencing fails: transition completes and the node joins the 
>>>>>>> cluster
>>>>>>>
>>>>>>> Thats in theory, except we automatically try again. Which isn't 
>>>>>>> appropriate.
>>>>>>> This should be relatively easy to fix.
>>>>>>>
>>>>>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>>>>>
>>>>>>> This I have no idea how to fix yet.
>>>>>>>
>>>>>>> On another note, it doesn't look like this agent works at all.
>>>>>>> The node has been back online for a long time and the agent is 
>>>>>>> still timing out after 10 minutes.
>>>>>>> So "Once the script makes sure that the victim will rebooted and 
>>>>>>> again available via ssh - it exit with 0." does not seem true.
>>>>>>    Damn. Looks like you're right. At some time I broke my agent and had 
>>>>>> not noticed it. Who will understand.
>>>>>    I repaired my agent - after send reboot he is wait STDIN.
>>>>>    Returned "normally" a behavior - hangs "pending", until manually send 
>>>>> reboot. :)
>>>>   Right. Now you're in case 8b.
>>>>
>>>>   Can you try this patch:  http://paste.fedoraproject.org/68450/38973966
>>>   Killed all day experiences.
>>>   It turns out here that:
>>>   1. Did cluster.
>>>   2. On the node-2 send signal (-4) - killed corosink
>>>   3. From node-1 (there DC) - stonith sent reboot
>>>   4. Noda rebooted and resources start.
>>>   5. Again. On the node-2 send signal (-4) - killed corosink
>>>   6. Again. From node-1 (there DC) - stonith sent reboot
>>>   7. Noda-2 rebooted and hangs in "pending"
>>>   8. Waiting, waiting. manually reboot.
>>>   9. Noda-2 reboot and raised resources start.
>>>   10. GOTO p.2
>>  Logs?
>
> Yesterday I wrote an additional email explaining why I didn't attach the logs.
> Please read it; it contains a few more questions.
> Today it again began to hang, continuing around the same cycle.
> Logs here: http://send2me.ru/crmrep2.tar.bz2
>
>>>>>    New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>>>>     On 14 Jan 2014, at 1:19 pm, Andrew Beekhof  
>>>>>>> wrote:
>>>>>>>>  Apart from anything else, your timeout needs to be bigger:
>>>>>>>>
>>>>>>>>  Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru 
>>>>>>>> stonith-ng: (  commands.c:1321  )   error: log_operation: Operation 
>>>>>>>> 'reboot' [11331] (call 2 from crmd.17227) for host 
>>>>>>>> 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 
>>>>>>>> (Timer expired)
>>>>>>>>
>>>>>>>>  On 14 Jan 2014, at 7:18 am, Andrew Beekhof  
>>>>>>>> wrote:
>>>>>>>>>  On 13 Jan 2014, at 8:31 pm, Andrey Groshev  
>>>>>>>>> wrote:
>>

Re: [Pacemaker] Announce: SNMP agent for pacemaker

2014-01-22 Thread Andrey Groshev


22.01.2014, 14:39, "Michael Schwartzkopff" :
> On Wednesday, 22 January 2014, 14:31:57, you wrote:
>
>>  22.01.2014, 12:43, "Michael Schwartzkopff" :
>>>  Hi,
>>>
>>>  I am working on a SNMP agent for pacemaker. it is written in perl. At the
>>>  moment it is in an alpha stadium.
Does it call crm_mon/crm_resource locally on each node?
>
> Partly yes. But in most cases I read the CIB and parse the config and status
> part. I included memshare caching to minimize impact.

Don't think that I'm trying to dissuade you from your plan :)
I had the same idea...
But for now we use a solution via an "external script" in Zabbix.
Do you have a compelling reason to use SNMP?
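For reference, the "read the CIB and parse it" approach needs very little machinery;
a sketch of the idea - the XPath is illustrative, not Michael's actual code:

# dump the live CIB once, then pull the node states out of the status section
cibadmin --query > /tmp/cib.xml
xmllint --xpath '//node_state/@crmd' /tmp/cib.xml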

>
> With kind regards,
>
> Michael Schwartzkopff
>
> --
> [*] sys4 AG
>
> http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
> Franziskanerstraße 15, 81669 München
>
> Registered office: Munich, Munich District Court: HRB 199263
> Management board: Patrick Ben Koetter, Marc Schiffbauer
> Chairman of the supervisory board: Florian Kirstein

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Announce: SNMP agent for pacemaker

2014-01-22 Thread Andrey Groshev


22.01.2014, 12:43, "Michael Schwartzkopff" :
> Hi,
>
> I am working on a SNMP agent for pacemaker. it is written in perl. At the
> moment it is in an alpha stadium.

Does it call crm_mon/crm_resource locally on each node?

> Any volunteers for testing?
>
> Please respond direct to me to get the code. Thanks.
>
> With kind regards,
>
> Michael Schwartzkopff
>
> --
> [*] sys4 AG
>
> http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
> Franziskanerstraße 15, 81669 München
>
> Registered office: Munich, Munich District Court: HRB 199263
> Management board: Patrick Ben Koetter, Marc Schiffbauer
> Chairman of the supervisory board: Florian Kirstein
>
> ,
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] hangs pending

2014-01-16 Thread Andrey Groshev


16.01.2014, 01:30, "Andrew Beekhof" :
> On 16 Jan 2014, at 12:41 am, Andrey Groshev  wrote:
>
>>  15.01.2014, 02:53, "Andrew Beekhof" :
>>>  On 15 Jan 2014, at 12:15 am, Andrey Groshev  wrote:
>>>>   14.01.2014, 10:00, "Andrey Groshev" :
>>>>>   14.01.2014, 07:47, "Andrew Beekhof" :
>>>>>>    Ok, here's what happens:
>>>>>>
>>>>>>    1. node2 is lost
>>>>>>    2. fencing of node2 starts
>>>>>>    3. node2 reboots (and cluster starts)
>>>>>>    4. node2 returns to the membership
>>>>>>    5. node2 is marked as a cluster member
>>>>>>    6. DC tries to bring it into the cluster, but needs to cancel the 
>>>>>> active transition first.
>>>>>>   Which is a problem since the node2 fencing operation is part of 
>>>>>> that
>>>>>>    7. node2 is in a transition (pending) state until fencing passes or 
>>>>>> fails
>>>>>>    8a. fencing fails: transition completes and the node joins the cluster
>>>>>>
>>>>>>    Thats in theory, except we automatically try again. Which isn't 
>>>>>> appropriate.
>>>>>>    This should be relatively easy to fix.
>>>>>>
>>>>>>    8b. fencing passes: the node is incorrectly marked as offline
>>>>>>
>>>>>>    This I have no idea how to fix yet.
>>>>>>
>>>>>>    On another note, it doesn't look like this agent works at all.
>>>>>>    The node has been back online for a long time and the agent is still 
>>>>>> timing out after 10 minutes.
>>>>>>    So "Once the script makes sure that the victim will rebooted and 
>>>>>> again available via ssh - it exit with 0." does not seem true.
>>>>>   Damn. Looks like you're right. At some time I broke my agent and had 
>>>>> not noticed it. Who will understand.
>>>>   I repaired my agent - after send reboot he is wait STDIN.
>>>>   Returned "normally" a behavior - hangs "pending", until manually send 
>>>> reboot. :)
>>>  Right. Now you're in case 8b.
>>>
>>>  Can you try this patch:  http://paste.fedoraproject.org/68450/38973966
>>  The experiments ate the whole day.
>>  It turns out like this:
>>  1. Built the cluster.
>>  2. On node-2, sent signal 4 - killed corosync.
>>  3. From node-1 (the DC) - stonith sent a reboot.
>>  4. The node rebooted and the resources started.
>>  5. Again: on node-2, sent signal 4 - killed corosync.
>>  6. Again: from node-1 (the DC) - stonith sent a reboot.
>>  7. Node-2 rebooted and hangs in "pending".
>>  8. Waited and waited, then manually rebooted.
>>  9. Node-2 rebooted and the resources started.
>>  10. GOTO step 2.
>
> Logs?

Yesterday I wrote an additional email explaining why I didn't attach the logs.
Please read it; it contains a few more questions.
Today it again began to hang, continuing around the same cycle.
Logs here: http://send2me.ru/crmrep2.tar.bz2

>>>>   New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>>>    On 14 Jan 2014, at 1:19 pm, Andrew Beekhof  wrote:
>>>>>>> Apart from anything else, your timeout needs to be bigger:
>>>>>>>
>>>>>>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru 
>>>>>>> stonith-ng: (  commands.c:1321  )   error: log_operation: Operation 
>>>>>>> 'reboot' [11331] (call 2 from crmd.17227) for host 
>>>>>>> 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 
>>>>>>> (Timer expired)
>>>>>>>
>>>>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof  
>>>>>>> wrote:
>>>>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev  
>>>>>>>> wrote:
>>>>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" :
>>>>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev  
>>>>>>>>>> wrote:
>>>>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" :
>>>>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" :
>>>>>>>>>>>>> On 10 Jan 2014, at 5

Re: [Pacemaker] hangs pending

2014-01-15 Thread Andrey Groshev


15.01.2014, 02:53, "Andrew Beekhof" :
> On 15 Jan 2014, at 12:15 am, Andrey Groshev  wrote:
>
>>  14.01.2014, 10:00, "Andrey Groshev" :
>>>  14.01.2014, 07:47, "Andrew Beekhof" :
>>>>   Ok, here's what happens:
>>>>
>>>>   1. node2 is lost
>>>>   2. fencing of node2 starts
>>>>   3. node2 reboots (and cluster starts)
>>>>   4. node2 returns to the membership
>>>>   5. node2 is marked as a cluster member
>>>>   6. DC tries to bring it into the cluster, but needs to cancel the active 
>>>> transition first.
>>>>  Which is a problem since the node2 fencing operation is part of that
>>>>   7. node2 is in a transition (pending) state until fencing passes or fails
>>>>   8a. fencing fails: transition completes and the node joins the cluster
>>>>
>>>>   Thats in theory, except we automatically try again. Which isn't 
>>>> appropriate.
>>>>   This should be relatively easy to fix.
>>>>
>>>>   8b. fencing passes: the node is incorrectly marked as offline
>>>>
>>>>   This I have no idea how to fix yet.
>>>>
>>>>   On another note, it doesn't look like this agent works at all.
>>>>   The node has been back online for a long time and the agent is still 
>>>> timing out after 10 minutes.
>>>>   So "Once the script makes sure that the victim will rebooted and again 
>>>> available via ssh - it exit with 0." does not seem true.
>>>  Damn. Looks like you're right. At some time I broke my agent and had not 
>>> noticed it. Who will understand.
>>  I repaired my agent - after sending the reboot, it now waits on STDIN.
>>  The "normal" behavior has returned - it hangs in "pending" until I manually
>> send a reboot. :)
>
> Right. Now you're in case 8b.
>
> Can you try this patch:  http://paste.fedoraproject.org/68450/38973966

An addition to the previous email (I had to leave work).
I would add this:
1. Your "te_utils.c" got about 20 lines bigger.
2. crm_mon -Anfc behaves strangely during an election/re-election.
It may display the node names and their statuses, but for a moment not
display the list of resources and their statuses.
On the next tick everything shows as normal. Something like this:

node1 - online
  pgsql: master
node2 - pending
  pgsql: started
node3 - online
node4 - online
 pgsql: started.
.

I.e. it looks as if the pgsql resource was restarted on node3. In reality nothing
happened.

3. More about crm_mon output and statuses.
It showed nodes as unclean, while resources on those nodes had the status "started".
This is misleading.


4. crm_report... I am not attaching one, because before noon there were a lot of
unnecessary, unsystematic tests.
And after dinner crm_report ate all the memory, pacemaker stopped responding, and the
node was killed via stonith.
Tomorrow (already today) I will run only the necessary series of tests.
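One way to keep crm_report from chewing through a whole day of logs (and all the memory)
is to bound the collection window - a sketch, with the times made up for illustration:

# collect only the two hours around the incident instead of the whole day
crm_report -f "2014-01-15 12:00:00" -t "2014-01-15 14:00:00" /tmp/pcmk-incident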



>
>>  New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>   On 14 Jan 2014, at 1:19 pm, Andrew Beekhof  wrote:
>>>>>    Apart from anything else, your timeout needs to be bigger:
>>>>>
>>>>>    Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: 
>>>>> (  commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] 
>>>>> (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' 
>>>>> with device 'st1' returned: -62 (Timer expired)
>>>>>
>>>>>    On 14 Jan 2014, at 7:18 am, Andrew Beekhof  wrote:
>>>>>>    On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:
>>>>>>>    13.01.2014, 02:51, "Andrew Beekhof" :
>>>>>>>>    On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>>>>>>>>>    10.01.2014, 14:31, "Andrey Groshev" :
>>>>>>>>>>    10.01.2014, 14:01, "Andrew Beekhof" :
>>>>>>>>>>>    On 10 Jan 2014, at 5:03 pm, Andrey Groshev  
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" :
>>>>>>>>>>>>>  On 9 Jan 2014, at 11:11 pm, Andrey Groshev 
>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>   08.01.2014, 06:22, "Andrew Beekhof" :
>>>>>>>>>>>>>>>   On 29 Nov 2013, at 7:17 pm, Andrey G

Re: [Pacemaker] hangs pending

2014-01-15 Thread Andrey Groshev


15.01.2014, 02:53, "Andrew Beekhof" :
> On 15 Jan 2014, at 12:15 am, Andrey Groshev  wrote:
>
>>  14.01.2014, 10:00, "Andrey Groshev" :
>>>  14.01.2014, 07:47, "Andrew Beekhof" :
>>>>   Ok, here's what happens:
>>>>
>>>>   1. node2 is lost
>>>>   2. fencing of node2 starts
>>>>   3. node2 reboots (and cluster starts)
>>>>   4. node2 returns to the membership
>>>>   5. node2 is marked as a cluster member
>>>>   6. DC tries to bring it into the cluster, but needs to cancel the active 
>>>> transition first.
>>>>  Which is a problem since the node2 fencing operation is part of that
>>>>   7. node2 is in a transition (pending) state until fencing passes or fails
>>>>   8a. fencing fails: transition completes and the node joins the cluster
>>>>
>>>>   Thats in theory, except we automatically try again. Which isn't 
>>>> appropriate.
>>>>   This should be relatively easy to fix.
>>>>
>>>>   8b. fencing passes: the node is incorrectly marked as offline
>>>>
>>>>   This I have no idea how to fix yet.
>>>>
>>>>   On another note, it doesn't look like this agent works at all.
>>>>   The node has been back online for a long time and the agent is still 
>>>> timing out after 10 minutes.
>>>>   So "Once the script makes sure that the victim will rebooted and again 
>>>> available via ssh - it exit with 0." does not seem true.
>>>  Damn. Looks like you're right. At some time I broke my agent and had not 
>>> noticed it. Who will understand.
>>  I repaired my agent - after sending the reboot, it now waits on STDIN.
>>  The "normal" behavior has returned - it hangs in "pending" until I manually
>> send a reboot. :)
>
> Right. Now you're in case 8b.
>
> Can you try this patch:  http://paste.fedoraproject.org/68450/38973966


The experiments ate the whole day.
It turns out like this:
1. Built the cluster.
2. On node-2, sent signal 4 - killed corosync.
3. From node-1 (the DC) - stonith sent a reboot.
4. The node rebooted and the resources started.
5. Again: on node-2, sent signal 4 - killed corosync.
6. Again: from node-1 (the DC) - stonith sent a reboot.
7. Node-2 rebooted and hangs in "pending".
8. Waited and waited, then manually rebooted.
9. Node-2 rebooted and the resources started.
10. GOTO step 2.
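For the record, step 2 above is a single command on node-2 (signal 4 is SIGILL; signals
6, 9 and 11 behaved the same earlier in this thread):

# crash corosync the hard way and let the DC fence this node
kill -4 "$(pidof corosync)"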



>>  New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>   On 14 Jan 2014, at 1:19 pm, Andrew Beekhof  wrote:
>>>>>    Apart from anything else, your timeout needs to be bigger:
>>>>>
>>>>>    Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: 
>>>>> (  commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] 
>>>>> (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' 
>>>>> with device 'st1' returned: -62 (Timer expired)
>>>>>
>>>>>    On 14 Jan 2014, at 7:18 am, Andrew Beekhof  wrote:
>>>>>>    On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:
>>>>>>>    13.01.2014, 02:51, "Andrew Beekhof" :
>>>>>>>>    On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>>>>>>>>>    10.01.2014, 14:31, "Andrey Groshev" :
>>>>>>>>>>    10.01.2014, 14:01, "Andrew Beekhof" :
>>>>>>>>>>>    On 10 Jan 2014, at 5:03 pm, Andrey Groshev  
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" :
>>>>>>>>>>>>>  On 9 Jan 2014, at 11:11 pm, Andrey Groshev 
>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>   08.01.2014, 06:22, "Andrew Beekhof" :
>>>>>>>>>>>>>>>   On 29 Nov 2013, at 7:17 pm, Andrey Groshev 
>>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>>>    Hi, ALL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    I'm still trying to cope with the fact that after the 
>>>>>>>>>>>>>>>> fence - node hangs in "pending".
>>>>>>>>>>>>>>>   Please define "pending".  Whe

Re: [Pacemaker] hangs pending

2014-01-14 Thread Andrey Groshev


14.01.2014, 10:00, "Andrey Groshev" :
> 14.01.2014, 07:47, "Andrew Beekhof" :
>
>>  Ok, here's what happens:
>>
>>  1. node2 is lost
>>  2. fencing of node2 starts
>>  3. node2 reboots (and cluster starts)
>>  4. node2 returns to the membership
>>  5. node2 is marked as a cluster member
>>  6. DC tries to bring it into the cluster, but needs to cancel the active 
>> transition first.
>> Which is a problem since the node2 fencing operation is part of that
>>  7. node2 is in a transition (pending) state until fencing passes or fails
>>  8a. fencing fails: transition completes and the node joins the cluster
>>
>>  Thats in theory, except we automatically try again. Which isn't appropriate.
>>  This should be relatively easy to fix.
>>
>>  8b. fencing passes: the node is incorrectly marked as offline
>>
>>  This I have no idea how to fix yet.
>>
>>  On another note, it doesn't look like this agent works at all.
>>  The node has been back online for a long time and the agent is still timing 
>> out after 10 minutes.
>>  So "Once the script makes sure that the victim will rebooted and again 
>> available via ssh - it exit with 0." does not seem true.
>
> Damn. Looks like you're right. At some point I broke my agent and hadn't
> noticed. I'll sort it out.

I repaired my agent - after sending the reboot, it now waits on STDIN.
The "normal" behavior has returned - it hangs in "pending" until I manually send a reboot.
:)
New logs: http://send2me.ru/crmrep1.tar.bz2
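Since the agent's contract here is "exit 0 only once the victim answers ssh again", its
reboot action boils down to something like this - an illustrative sketch, not Andrey's
actual script ($victim is an assumed variable):

# reboot the victim, then block until it is reachable over ssh again
ssh "$victim" reboot
until ssh -o ConnectTimeout=5 "$victim" true 2>/dev/null; do
    sleep 5
done
exit 0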

>
>>  On 14 Jan 2014, at 1:19 pm, Andrew Beekhof  wrote:
>>>   Apart from anything else, your timeout needs to be bigger:
>>>
>>>   Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
>>> commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 
>>> 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
>>> 'st1' returned: -62 (Timer expired)
>>>
>>>   On 14 Jan 2014, at 7:18 am, Andrew Beekhof  wrote:
>>>>   On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:
>>>>>   13.01.2014, 02:51, "Andrew Beekhof" :
>>>>>>   On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>>>>>>>   10.01.2014, 14:31, "Andrey Groshev" :
>>>>>>>>   10.01.2014, 14:01, "Andrew Beekhof" :
>>>>>>>>>   On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
>>>>>>>>>>    10.01.2014, 05:29, "Andrew Beekhof" :
>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev  
>>>>>>>>>>> wrote:
>>>>>>>>>>>>  08.01.2014, 06:22, "Andrew Beekhof" :
>>>>>>>>>>>>>  On 29 Nov 2013, at 7:17 pm, Andrey Groshev 
>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>   Hi, ALL.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   I'm still trying to cope with the fact that after the 
>>>>>>>>>>>>>> fence - node hangs in "pending".
>>>>>>>>>>>>>  Please define "pending".  Where did you see this?
>>>>>>>>>>>>  In crm_mon:
>>>>>>>>>>>>  ..
>>>>>>>>>>>>  Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>  ..
>>>>>>>>>>>>
>>>>>>>>>>>>  The experiment was like this:
>>>>>>>>>>>>  Four nodes in cluster.
>>>>>>>>>>>>  On one of them kill corosync or pacemakerd (signal 4 or 6 oк 
>>>>>>>>>>>> 11).
>>>>>>>>>>>>  Thereafter, the remaining start it constantly reboot, under 
>>>>>>>>>>>> various pretexts, "softly whistling", "fly low", "not a cluster 
>>>>>>>>>>>> member!" ...
>>>>>>>>>>>>  Then in the log fell out "Too many failures "
>>>>>>>>>>>>  All this time in the status in crm_mon is "pending".
>>>>>>>>>>>&

Re: [Pacemaker] [Linux-HA] Better way to change master in 3 node pgsql cluster

2014-01-14 Thread Andrey Groshev
14.01.2014, 15:37, "Andrey Rogovsky" :
I understand it. So, is there no better way to change the master without a
cluster software update?

You can send a node into standby (crm_standby).
All resources will move to another node (if available).
The pgsql resource agent will promote PostgreSQL on another node.
In any case, you must resync the ex-master.

2014/1/14 Andrey Groshev <gre...@yandex.ru>
14.01.2014, 12:39, "Andrey Rogovsky" <a.rogov...@gmail.com>:
I use Debian 7 and got:
Reconnecting...
root@a:~# crm_resource --resource msPostgresql --ban --master --host a.geocluster.e-autopay.com
crm_resource: unrecognized option '--ban'

No other way to move the master?

2014/1/13 Andrew Beekhof <and...@beekhof.net>
On 13 Jan 2014, at 8:32 pm, Andrey Rogovsky <a.rogov...@gmail.com> wrote:

> Hi
>
> I have a 3 node postgresql cluster.
> It works well. But I have some trouble with changing the master.
>
> For now, if I need to change the master, I must:
> 1) Stop PGSQL on each node and the cluster service
> 2) Set up a new manual PGSQL replication
> 3) Change attributes on each node to point to the new master
> 4) Stop PGSQL on each node
> 5) Cleanup the resource and start the cluster service
>
> It takes a lot of time. Does a better way to change the master exist?

Newer versions support:

    crm_resource --resource msPostgresql --ban --master --host a.geocluster.e-autopay.com

> This is my cluster service status:
> Node Attributes:
> * Node a.geocluster.e-autopay.com:
>    + master-pgsql:0                   : 1000
>    + pgsql-data-status                : LATEST
>    + pgsql-master-baseline            : 2F90
>    + pgsql-status                     : PRI
> * Node c.geocluster.e-autopay.com:
>    + master-pgsql:0                   : 1000
>    + pgsql-data-status                : SYNC
>    + pgsql-status                     : STOP
> * Node b.geocluster.e-autopay.com:
>    + master-pgsql:0                   : 1000
>    + pgsql-data-status                : SYNC
>    + pgsql-status                     : STOP
>
> I used http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster for my 3
> node cluster without a hard stick.
> Now I have a strange situation - all nodes stay slave:
> Last updated: Sat Dec  7 04:33:47 2013
> Last change: Sat Dec  7 12:56:23 2013 via crmd on a
> Stack: openais
> Current DC: c - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff

You use version 1.1.7. Option "--ban" was added in 1.1.9.
See: https://github.com/ClusterLabs/pacemaker/blob/master/ChangeLog

> 5 Nodes configured, 3 expected votes
> 4 Resources configured.
>
> Online: [ a c b ]
>
> Master/Slave Set: msPostgresql [pgsql]
>     Slaves: [ a c b ]
>
> My config is:
> node a \
> attributes pgsql-data-status="DISCONNECT"
> node b \
> attributes pgsql-data-status="DISCONNECT"
> node c \
> attributes pgsql-data-status="DISCONNECT"
> primitive pgsql ocf:heartbeat:pgsql \
> params pgctl="/usr/lib/postgresql/9.3/bin/pg_ctl" psql="/usr/bin/psql"
> pgdata="/var/lib/postgresql/9.3/main" start_opt="-p 5432" rep_mode="sync"
> node_list="a b c" restore_command="cp /var/lib/postgresql/9.3/pg_archive/%f
> %p" master_ip="192.168.10.200" restart_on_promote="true"
> config="/etc/postgresql/9.3/main/postgresql.conf" \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op monitor interval="4s" timeout="60s" on-fail="restart" \
> op monitor interval="3s" role="Master" timeout="60s" on-fail="restart" \
> op promote interval="0s" timeout="60s" on-fail="restart" \
> op demote interval="0s" timeout="60s" on-fail="stop" \
> op stop interval="0s" timeout="60s" on-fail="block" \
> op notify interval="0s" timeout="60s"
> primitive pgsql-master-ip ocf:heartbeat:IPaddr2 \
> params ip="192.168.10.200" nic="peervpn0" \
> op start interval="0s" timeout="60s" on-fail="restart" \
> op monitor interval="10s" timeout="60s" on-fail="restart" \
> op stop interval="0s" timeout="60s" on-fail="block"
> group master pgsql-master-ip
> ms msPostgresql pgsql \
> meta master-max="1" master-node-max="1" clone-max="3" clone-node-max="1"
> notify="true"
> colocation set_ip inf: master msPostgresql:Master
> order ip_down 0: msPostgresql:demote master:stop symmetrical=false
> order ip_up 0: msPostgr
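Spelled out, the standby approach might look like this - a sketch using crmsh, with the
node name taken from this thread:

# move the master away by putting its current node in standby...
crm node standby a.geocluster.e-autopay.com
# ...wait for the promotion to complete elsewhere, then bring the node back as a slave
crm node online a.geocluster.e-autopay.com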

Re: [Pacemaker] [Linux-HA] Better way to change master in 3 node pgsql cluster

2014-01-14 Thread Andrey Groshev
  14.01.2014, 12:39, "Andrey Rogovsky" :
I use Debian 7 and got:
Reconnecting...
root@a:~# crm_resource --resource msPostgresql --ban --master --host a.geocluster.e-autopay.com
crm_resource: unrecognized option '--ban'

No other way to move master?

2014/1/13 Andrew Beekhof

On 13 Jan 2014, at 8:32 pm, Andrey Rogovsky  wrote:

> Hi
>
> I have a 3 node postgresql cluster.
> It works well, but I have some trouble changing the master.
>
> For now, if I need to change the master, I must:
> 1) Stop PGSQL on each node and the cluster service
> 2) Set up new manual PGSQL replication
> 3) Change attributes on each node to point to the new master
> 4) Stop PGSQL on each node
> 5) Clean up the resource and start the cluster service
>
> It takes a lot of time. Is there a better way to change the master?

Newer versions support:

    crm_resource --resource msPostgresql --ban --master --host a.geocluster.e-autopay.com

> This is my cluster service status:
> Node Attributes:
> * Node a.geocluster.e-autopay.com:
>    + master-pgsql:0                  : 1000
>    + pgsql-data-status               : LATEST
>    + pgsql-master-baseline           : 2F90
>    + pgsql-status                    : PRI
> * Node c.geocluster.e-autopay.com:
>    + master-pgsql:0                  : 1000
>    + pgsql-data-status               : SYNC
>    + pgsql-status                    : STOP
> * Node b.geocluster.e-autopay.com:
>    + master-pgsql:0                  : 1000
>    + pgsql-data-status               : SYNC
>    + pgsql-status                    : STOP
>
> I was using http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster for my 3
> node cluster without hard stik.
> Now I have a strange situation: all nodes stay slave:
>
> Last updated: Sat Dec  7 04:33:47 2013
> Last change: Sat Dec  7 12:56:23 2013 via crmd on a
> Stack: openais
> Current DC: c - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff

You use the 1.1.7 version. Option "--ban" was added in 1.1.9.
See: https://github.com/ClusterLabs/pacemaker/blob/master/ChangeLog

> 5 Nodes configured, 3 expected votes
> 4 Resources configured.
>
> Online: [ a c b ]
>
> Master/Slave Set: msPostgresql [pgsql]
>     Slaves: [ a c b ]
>
> My config is:
> node a \
>   attributes pgsql-data-status="DISCONNECT"
> node b \
>   attributes pgsql-data-status="DISCONNECT"
> node c \
>   attributes pgsql-data-status="DISCONNECT"
> primitive pgsql ocf:heartbeat:pgsql \
>   params pgctl="/usr/lib/postgresql/9.3/bin/pg_ctl" psql="/usr/bin/psql" \
>     pgdata="/var/lib/postgresql/9.3/main" start_opt="-p 5432" rep_mode="sync" \
>     node_list="a b c" restore_command="cp /var/lib/postgresql/9.3/pg_archive/%f %p" \
>     master_ip="192.168.10.200" restart_on_promote="true" \
>     config="/etc/postgresql/9.3/main/postgresql.conf" \
>   op start interval="0s" timeout="60s" on-fail="restart" \
>   op monitor interval="4s" timeout="60s" on-fail="restart" \
>   op monitor interval="3s" role="Master" timeout="60s" on-fail="restart" \
>   op promote interval="0s" timeout="60s" on-fail="restart" \
>   op demote interval="0s" timeout="60s" on-fail="stop" \
>   op stop interval="0s" timeout="60s" on-fail="block" \
>   op notify interval="0s" timeout="60s"
> primitive pgsql-master-ip ocf:heartbeat:IPaddr2 \
>   params ip="192.168.10.200" nic="peervpn0" \
>   op start interval="0s" timeout="60s" on-fail="restart" \
>   op monitor interval="10s" timeout="60s" on-fail="restart" \
>   op stop interval="0s" timeout="60s" on-fail="block"
> group master pgsql-master-ip
> ms msPostgresql pgsql \
>   meta master-max="1" master-node-max="1" clone-max="3" clone-node-max="1" notify="true"
> colocation set_ip inf: master msPostgresql:Master
> order ip_down 0: msPostgresql:demote master:stop symmetrical=false
> order ip_up 0: msPostgresql:promote master:start symmetrical=false
> property $id="cib-bootstrap-options" \
>   dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>   cluster-infrastructure="openais" \
>   expected-quorum-votes="3" \
>   no-quorum-policy="ignore" \
>   stonith-enabled="false" \
>   crmd-transition-delay="0" \
>   last-lrm-refresh="1386404222"
> rsc_defaults $id="rsc-options" \
>   resource-stickiness="100" \
>   migration-threshold="1"
>
> ___
> Linux-HA mailing list
> linux...@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
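
For readers on 1.1.9 or later, the full move-and-release cycle looks roughly
like this (a sketch; the --clear form and the exact behavior are assumptions
based on the same ChangeLog, so check crm_resource --help on your build):

# crm_resource --resource msPostgresql --ban --master --host a.geocluster.e-autopay.com
# crm_mon -1
# crm_resource --resource msPostgresql --clear --master --host a.geocluster.e-autopay.com

The ban is stored as a -INFINITY location constraint, so clearing it
afterwards matters; otherwise the master role can never move back to that node.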

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrey Groshev


14.01.2014, 07:47, "Andrew Beekhof" :
> Ok, here's what happens:
>
> 1. node2 is lost
> 2. fencing of node2 starts
> 3. node2 reboots (and cluster starts)
> 4. node2 returns to the membership
> 5. node2 is marked as a cluster member
> 6. DC tries to bring it into the cluster, but needs to cancel the active 
> transition first.
>    Which is a problem since the node2 fencing operation is part of that
> 7. node2 is in a transition (pending) state until fencing passes or fails
> 8a. fencing fails: transition completes and the node joins the cluster
>
> That's the theory, except we automatically try again, which isn't appropriate.
> This should be relatively easy to fix.
>
> 8b. fencing passes: the node is incorrectly marked as offline
>
> This I have no idea how to fix yet.
>
> On another note, it doesn't look like this agent works at all.
> The node has been back online for a long time and the agent is still timing 
> out after 10 minutes.
> So "Once the script makes sure that the victim will rebooted and again 
> available via ssh - it exit with 0." does not seem true.

Damn. Looks like you're right. At some point I broke my agent and did not 
notice it. I will figure out what happened.
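
For reference, a minimal sketch of the kind of ssh-based reset agent being
discussed (hypothetical; the real sshbykey agent is not shown in this thread).
It illustrates the contract above: exit 0 only after the victim has
demonstrably rebooted, checked here by comparing boot times:

#!/bin/sh
# Sketch only: assumes passwordless ssh as root to the victim.
victim="$1"

boot_time() {
    # boot time = current epoch seconds minus uptime seconds
    ssh -o ConnectTimeout=5 "root@$victim" \
        'echo $(( $(date +%s) - $(cut -d. -f1 /proc/uptime) ))' 2>/dev/null
}

old_boot=$(boot_time) || exit 1
ssh "root@$victim" 'reboot'

# Succeed only once the node answers again AND its boot time has moved,
# i.e. it really went down and came back (allow a little clock jitter).
i=0
while [ $i -lt 120 ]; do
    sleep 5
    new_boot=$(boot_time)
    if [ -n "$new_boot" ] && [ "$new_boot" -gt "$((old_boot + 2))" ]; then
        exit 0
    fi
    i=$((i + 1))
done
exit 1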

> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof  wrote:
>
>>  Apart from anything else, your timeout needs to be bigger:
>>
>>  Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
>> commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 
>> 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
>> 'st1' returned: -62 (Timer expired)
>>
>>  On 14 Jan 2014, at 7:18 am, Andrew Beekhof  wrote:
>>>  On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:
>>>>  13.01.2014, 02:51, "Andrew Beekhof" :
>>>>>  On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>>>>>>  10.01.2014, 14:31, "Andrey Groshev" :
>>>>>>>  10.01.2014, 14:01, "Andrew Beekhof" :
>>>>>>>>  On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
>>>>>>>>>   10.01.2014, 05:29, "Andrew Beekhof" :
>>>>>>>>>>    On 9 Jan 2014, at 11:11 pm, Andrey Groshev  
>>>>>>>>>> wrote:
>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" :
>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev  
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>  Hi, ALL.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  I'm still trying to cope with the fact that after the fence 
>>>>>>>>>>>>> - node hangs in "pending".
>>>>>>>>>>>> Please define "pending".  Where did you see this?
>>>>>>>>>>> In crm_mon:
>>>>>>>>>>> ..
>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>> ..
>>>>>>>>>>>
>>>>>>>>>>> The experiment was like this:
>>>>>>>>>>> Four nodes in cluster.
>>>>>>>>>>> On one of them kill corosync or pacemakerd (signal 4, 6 or
>>>>>>>>>>> 11).
>>>>>>>>>>> Thereafter, the remaining start it constantly reboot, under 
>>>>>>>>>>> various pretexts, "softly whistling", "fly low", "not a cluster 
>>>>>>>>>>> member!" ...
>>>>>>>>>>> Then in the log fell out "Too many failures "
>>>>>>>>>>> All this time in the status in crm_mon is "pending".
>>>>>>>>>>> Depending on the wind direction changed to "UNCLEAN"
>>>>>>>>>>> Much time has passed and I can not accurately describe the 
>>>>>>>>>>> behavior...
>>>>>>>>>>>
>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>> I tried locate the problem. Came here with this.
>>>>>>>>>>> I set big value in property stonith-timeout="600s".
>>>>

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrey Groshev


14.01.2014, 07:00, "Andrew Beekhof" :
> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof  wrote:
>
>>  Apart from anything else, your timeout needs to be bigger:
>>
>>  Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
>> commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 
>> 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
>> 'st1' returned: -62 (Timer expired)
>
> also:
>
> Jan 13 12:04:54 [17226] dev-cluster2-node1.unix.tensor.ru    pengine: ( 
> utils.c:723   )   error: unpack_operation: Specifying on_fail=fence and 
> stonith-enabled=false makes no sense

After the full config is loaded, the option changes to stonith-enabled=true
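
A quick way to double-check what the running cluster believes, using the same
tools that appear later in this thread:

# crm_attribute --type crm_config --attr-name stonith-enabled --query
# crm_verify -L

crm_verify -L re-runs the policy engine checks against the live CIB, which is
where the unpack_operation complaint above comes from.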

>>  On 14 Jan 2014, at 7:18 am, Andrew Beekhof  wrote:
>>>  On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:
>>>>  13.01.2014, 02:51, "Andrew Beekhof" :
>>>>>  On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>>>>>>  10.01.2014, 14:31, "Andrey Groshev" :
>>>>>>>  10.01.2014, 14:01, "Andrew Beekhof" :
>>>>>>>>  On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
>>>>>>>>>   10.01.2014, 05:29, "Andrew Beekhof" :
>>>>>>>>>>    On 9 Jan 2014, at 11:11 pm, Andrey Groshev  
>>>>>>>>>> wrote:
>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" :
>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev  
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>  Hi, ALL.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  I'm still trying to cope with the fact that after the fence 
>>>>>>>>>>>>> - node hangs in "pending".
>>>>>>>>>>>> Please define "pending".  Where did you see this?
>>>>>>>>>>> In crm_mon:
>>>>>>>>>>> ..
>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>> ..
>>>>>>>>>>>
>>>>>>>>>>> The experiment was like this:
>>>>>>>>>>> Four nodes in cluster.
>>>>>>>>>>  On one of them kill corosync or pacemakerd (signal 4, 6 or
>>>>>>>>>> 11).
>>>>>>>>>>> Thereafter, the remaining start it constantly reboot, under 
>>>>>>>>>>> various pretexts, "softly whistling", "fly low", "not a cluster 
>>>>>>>>>>> member!" ...
>>>>>>>>>>> Then in the log fell out "Too many failures "
>>>>>>>>>>> All this time in the status in crm_mon is "pending".
>>>>>>>>>>> Depending on the wind direction changed to "UNCLEAN"
>>>>>>>>>>> Much time has passed and I can not accurately describe the 
>>>>>>>>>>> behavior...
>>>>>>>>>>>
>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>> I tried locate the problem. Came here with this.
>>>>>>>>>>> I set big value in property stonith-timeout="600s".
>>>>>>>>>>> And got the following behavior:
>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>> 2. from node with DC call my fence agent "sshbykey"
>>>>>>>>>>> 3. It sends reboot victim and waits until she comes to life 
>>>>>>>>>>> again.
>>>>>>>>>>    Hmmm what version of pacemaker?
>>>>>>>>>>    This sounds like a timing issue that we fixed a while back
>>>>>>>>>   Was a version 1.1.11 from December 3.
>>>>>>>>>   Now try full update and retest.
>>>>>>>>  That should be recent enough.  Can you create a crm_report the next 
>>>>>>>> time you reproduce?
>>>>>>>  Of course yes. Little delay :)
>>>>>>

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrey Groshev


14.01.2014, 06:25, "Andrew Beekhof" :
> Apart from anything else, your timeout needs to be bigger:
>
> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: (  
> commands.c:1321  )   error: log_operation: Operation 'reboot' [11331] (call 2 
> from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 
> 'st1' returned: -62 (Timer expired)
>

Bigger than that?
By 12:21 node2 had long since booted and was working (almost).
#cat /var/log/cluster/mystonith.log
.
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH DEBUG(): getinfo-devdescr
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH DEBUG(): getinfo-devid
Mon Jan 13 11:48:43 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH DEBUG(): getinfo-xml
Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH 
DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): 
getconfignames
Mon Jan 13 11:48:46 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH 
DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): status
Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH 
DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): 
getconfignames
Mon Jan 13 12:11:37 MSK 2014 greenx dev-cluster2-node1.unix.tensor.ru () 
STONITH 
DEBUG(/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru): reset 
dev-cluster2-node2.unix.tensor.ru
Mon Jan 13 12:11:37 MSK 2014 Now boot time 1389256739, send reboot
...

> On 14 Jan 2014, at 7:18 am, Andrew Beekhof  wrote:
>
>>  On 13 Jan 2014, at 8:31 pm, Andrey Groshev  wrote:
>>>  13.01.2014, 02:51, "Andrew Beekhof" :
>>>>  On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>>>>>  10.01.2014, 14:31, "Andrey Groshev" :
>>>>>>  10.01.2014, 14:01, "Andrew Beekhof" :
>>>>>>>   On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
>>>>>>>>    10.01.2014, 05:29, "Andrew Beekhof" :
>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev  
>>>>>>>>> wrote:
>>>>>>>>>>  08.01.2014, 06:22, "Andrew Beekhof" :
>>>>>>>>>>>  On 29 Nov 2013, at 7:17 pm, Andrey Groshev  
>>>>>>>>>>> wrote:
>>>>>>>>>>>>   Hi, ALL.
>>>>>>>>>>>>
>>>>>>>>>>>>   I'm still trying to cope with the fact that after the fence 
>>>>>>>>>>>> - node hangs in "pending".
>>>>>>>>>>>  Please define "pending".  Where did you see this?
>>>>>>>>>>  In crm_mon:
>>>>>>>>>>  ..
>>>>>>>>>>  Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>  ..
>>>>>>>>>>
>>>>>>>>>>  The experiment was like this:
>>>>>>>>>>  Four nodes in cluster.
>>>>>>>>>>  On one of them kill corosync or pacemakerd (signal 4, 6 or
>>>>>>>>>> 11).
>>>>>>>>>>  Thereafter, the remaining start it constantly reboot, under 
>>>>>>>>>> various pretexts, "softly whistling", "fly low", "not a cluster 
>>>>>>>>>> member!" ...
>>>>>>>>>>  Then in the log fell out "Too many failures "
>>>>>>>>>>  All this time in the status in crm_mon is "pending".
>>>>>>>>>>  Depending on the wind direction changed to "UNCLEAN"
>>>>>>>>>>  Much time has passed and I can not accurately describe the 
>>>>>>>>>> behavior...
>>>>>>>>>>
>>>>>>>>>>  Now I am in the following state:
>>>>>>>>>>  I tried locate the problem. Came here with this.
>>>>>>>>>>  I set big value in property stonith-timeout="600s".
>>>>>>>>>>  And got the following behavior:
>>>>>>>>>>  1. pkill -4 corosync
>>>>>>>>>>  2. from node with DC call my fence agent "sshbykey"
>>>

Re: [Pacemaker] again "return code", now in crm_attribute

2014-01-13 Thread Andrey Groshev


13.01.2014, 02:51, "Andrew Beekhof" :
> On 10 Jan 2014, at 6:18 pm, Andrey Groshev  wrote:
>
>>  10.01.2014, 10:15, "Andrew Beekhof" :
>>>  On 10 Jan 2014, at 4:38 pm, Andrey Groshev  wrote:
>>>>   10.01.2014, 09:06, "Andrew Beekhof" :
>>>>>   On 10 Jan 2014, at 3:51 pm, Andrey Groshev  wrote:
>>>>>>    10.01.2014, 03:28, "Andrew Beekhof" :
>>>>>>>    On 9 Jan 2014, at 4:44 pm, Andrey Groshev  wrote:
>>>>>>>> 09.01.2014, 02:39, "Andrew Beekhof" :
>>>>>>>>>  On 18 Dec 2013, at 11:55 pm, Andrey Groshev  
>>>>>>>>> wrote:
>>>>>>>>>>   Hi, Andrew and ALL.
>>>>>>>>>>
>>>>>>>>>>   I'm sorry, but I again found an error. :)
>>>>>>>>>>   Crux of the problem:
>>>>>>>>>>
>>>>>>>>>>   # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>>>>>> --query; echo $?
>>>>>>>>>>   scope=crm_config  name=stonith-enabled value=true
>>>>>>>>>>   0
>>>>>>>>>>
>>>>>>>>>>   # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>>>>>> --update firstval ; echo $?
>>>>>>>>>>   0
>>>>>>>>>>
>>>>>>>>>>   # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>>>>>> --query; echo $?
>>>>>>>>>>   scope=crm_config  name=stonith-enabled value=firstval
>>>>>>>>>>   0
>>>>>>>>>>
>>>>>>>>>>   # crm_attribute --type crm_config --attr-name stonith-enabled  
>>>>>>>>>> --update secondval --lifetime=reboot ; echo $?
>>>>>>>>>>   0
>>>>>>>>>>
>>>>>>>>>>   # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>>>>>> --query; echo $?
>>>>>>>>>>   scope=crm_config  name=stonith-enabled value=firstval
>>>>>>>>>>   0
>>>>>>>>>>
>>>>>>>>>>   # crm_attribute --type crm_config --attr-name stonith-enabled  
>>>>>>>>>> --update thirdval --lifetime=forever ; echo $?
>>>>>>>>>>   0
>>>>>>>>>>
>>>>>>>>>>   # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>>>>>> --query; echo $?
>>>>>>>>>>   scope=crm_config  name=stonith-enabled value=firstval
>>>>>>>>>>   0
>>>>>>>>>>
>>>>>>>>>>   Ie if specify the lifetime of an attribute, then a attribure 
>>>>>>>>>> is not updated.
>>>>>>>>>>
>>>>>>>>>>   If impossible setup the lifetime of the attribute when it is 
>>>>>>>>>> installing, it must be return an error.
>>>>>>>>>  Agreed. I'll reproduce and get back to you.
>>>>>>>> How, I was able to review code, problem comes when used both 
>>>>>>>> options "--type" and options "--lifetime".
>>>>>>>> One variant in "case" without break;
>>>>>>>> Unfortunately, I did not have time to dive into the logic.
>>>>>>>    Actually, the logic is correct.  The command:
>>>>>>>
>>>>>>>    # crm_attribute --type crm_config --attr-name stonith-enabled  
>>>>>>> --update secondval --lifetime=reboot ; echo $?
>>>>>>>
>>>>>>>    is invalid.  You only get to specify --type OR --lifetime, not both.
>>>>>>>    By specifying --lifetime, you're creating a node attribute, not a 
>>>>>>> cluster proprerty.
>>>>>>    With this, I do not argue. I think that should be the exit code is 
>>>>>> NOT ZERO, ie it's error!
>>>>>   No, its setting a value, just not where you thought (or

Re: [Pacemaker] hangs pending

2014-01-13 Thread Andrey Groshev


13.01.2014, 02:51, "Andrew Beekhof" :
> On 10 Jan 2014, at 9:55 pm, Andrey Groshev  wrote:
>
>>  10.01.2014, 14:31, "Andrey Groshev" :
>>>  10.01.2014, 14:01, "Andrew Beekhof" :
>>>>   On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
>>>>>    10.01.2014, 05:29, "Andrew Beekhof" :
>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev  wrote:
>>>>>>>  08.01.2014, 06:22, "Andrew Beekhof" :
>>>>>>>>  On 29 Nov 2013, at 7:17 pm, Andrey Groshev  
>>>>>>>> wrote:
>>>>>>>>>   Hi, ALL.
>>>>>>>>>
>>>>>>>>>   I'm still trying to cope with the fact that after the fence - 
>>>>>>>>> node hangs in "pending".
>>>>>>>>  Please define "pending".  Where did you see this?
>>>>>>>  In crm_mon:
>>>>>>>  ..
>>>>>>>  Node dev-cluster2-node2 (172793105): pending
>>>>>>>  ..
>>>>>>>
>>>>>>>  The experiment was like this:
>>>>>>>  Four nodes in cluster.
>>>>>>>  On one of them kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>>>>  Thereafter, the remaining start it constantly reboot, under 
>>>>>>> various pretexts, "softly whistling", "fly low", "not a cluster 
>>>>>>> member!" ...
>>>>>>>  Then in the log fell out "Too many failures "
>>>>>>>  All this time in the status in crm_mon is "pending".
>>>>>>>  Depending on the wind direction changed to "UNCLEAN"
>>>>>>>  Much time has passed and I can not accurately describe the 
>>>>>>> behavior...
>>>>>>>
>>>>>>>  Now I am in the following state:
>>>>>>>  I tried locate the problem. Came here with this.
>>>>>>>  I set big value in property stonith-timeout="600s".
>>>>>>>  And got the following behavior:
>>>>>>>  1. pkill -4 corosync
>>>>>>>  2. from node with DC call my fence agent "sshbykey"
>>>>>>>  3. It sends reboot victim and waits until she comes to life again.
>>>>>> Hmmm what version of pacemaker?
>>>>>> This sounds like a timing issue that we fixed a while back
>>>>>    Was a version 1.1.11 from December 3.
>>>>>    Now try full update and retest.
>>>>   That should be recent enough.  Can you create a crm_report the next time 
>>>> you reproduce?
>>>  Of course yes. Little delay :)
>>>
>>>  ..
>>>  cc1: warnings being treated as errors
>>>  upstart.c: In function ‘upstart_job_property’:
>>>  upstart.c:264: error: implicit declaration of function 
>>> ‘g_variant_lookup_value’
>>>  upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>  upstart.c:264: error: assignment makes pointer from integer without a cast
>>>  gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>  gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>  make[1]: *** [all-recursive] Error 1
>>>  make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>  make: *** [core] Error 1
>>>
>>>  I'm trying to solve this a problem.
>>  Do not get solved quickly...
>>
>>  
>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>  g_variant_lookup_value () Since 2.28
>>
>>  # yum list installed glib2
>>  Loaded plugins: fastestmirror, rhnplugin, security
>>  This system is receiving updates from RHN Classic or Red Hat Satellite.
>>  Loading mirror speeds from cached hostfile
>>  Installed Packages
>>  glib2.x86_64  
>> 2.26.1-3.el6   
>> installed
>>
>>  # cat /etc/issue
>>  CentOS release 6.5 (Final)
>>  Kernel \r on an \m
>
> Can you try this patch?
> Upstart jobs wont work, but the code will compile
>
> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
> index 831e7cf..195c3a4 100644
> --- a/lib/services/u

Re: [Pacemaker] hangs pending

2014-01-10 Thread Andrey Groshev


10.01.2014, 14:31, "Andrey Groshev" :
> 10.01.2014, 14:01, "Andrew Beekhof" :
>
>>  On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
>>>   10.01.2014, 05:29, "Andrew Beekhof" :
>>>>    On 9 Jan 2014, at 11:11 pm, Andrey Groshev  wrote:
>>>>>     08.01.2014, 06:22, "Andrew Beekhof" :
>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev  wrote:
>>>>>>>  Hi, ALL.
>>>>>>>
>>>>>>>  I'm still trying to cope with the fact that after the fence - node 
>>>>>>> hangs in "pending".
>>>>>> Please define "pending".  Where did you see this?
>>>>> In crm_mon:
>>>>> ..
>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>> ..
>>>>>
>>>>> The experiment was like this:
>>>>> Four nodes in cluster.
>>>>> On one of them kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>> Thereafter, the remaining start it constantly reboot, under various 
>>>>> pretexts, "softly whistling", "fly low", "not a cluster member!" ...
>>>>> Then in the log fell out "Too many failures "
>>>>> All this time in the status in crm_mon is "pending".
>>>>> Depending on the wind direction changed to "UNCLEAN"
>>>>> Much time has passed and I can not accurately describe the behavior...
>>>>>
>>>>> Now I am in the following state:
>>>>> I tried locate the problem. Came here with this.
>>>>> I set big value in property stonith-timeout="600s".
>>>>> And got the following behavior:
>>>>> 1. pkill -4 corosync
>>>>> 2. from node with DC call my fence agent "sshbykey"
>>>>> 3. It sends reboot victim and waits until she comes to life again.
>>>>    Hmmm what version of pacemaker?
>>>>    This sounds like a timing issue that we fixed a while back
>>>   Was a version 1.1.11 from December 3.
>>>   Now try full update and retest.
>>  That should be recent enough.  Can you create a crm_report the next time 
>> you reproduce?
>
> Of course yes. Little delay :)
>
> ..
> cc1: warnings being treated as errors
> upstart.c: In function ‘upstart_job_property’:
> upstart.c:264: error: implicit declaration of function 
> ‘g_variant_lookup_value’
> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
> upstart.c:264: error: assignment makes pointer from integer without a cast
> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/root/ha/pacemaker/lib'
> make: *** [core] Error 1
>
> I'm trying to solve this a problem.


It will not get solved quickly...

https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
g_variant_lookup_value () Since 2.28

# yum list installed glib2
Loaded plugins: fastestmirror, rhnplugin, security
This system is receiving updates from RHN Classic or Red Hat Satellite.
Loading mirror speeds from cached hostfile
Installed Packages
glib2.x86_64  
2.26.1-3.el6   
installed

# cat /etc/issue
CentOS release 6.5 (Final)
Kernel \r on an \m
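
Before rebuilding, the mismatch can be confirmed from the shell (a sketch,
assuming pkg-config and the glib2 development package are present; the output
lines are what this box would presumably print):

# pkg-config --modversion glib-2.0
2.26.1
# pkg-config --atleast-version=2.28 glib-2.0 && echo "new enough" || echo "too old"
too old

So g_variant_lookup_value() simply does not exist in the glib2 shipped with
CentOS 6.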


>>>>>   Once the script makes sure that the victim will rebooted and again 
>>>>> available via ssh - it exit with 0.
>>>>>   All command is logged both the victim and the killer - all right.
>>>>> 4. A little later, the status of the (victim) nodes in crm_mon 
>>>>> changes to online.
>>>>> 5. BUT... not one resource don't start! Despite the fact that 
>>>>> "crm_simalate -sL" shows the correct resource to start:
>>>>>   * Start   pingCheck:3  (dev-cluster2-node2)
>>>>> 6. In this state, we spend the next 600 seconds.
>>>>>   After completing this timeout causes another node (not DC) decides 
>>>>> to kill again our victim.
>>>>>   All command again is logged both the victim and the killer - All 
>>>>> documented :)
>>>>> 7. NOW all resource started in right sequence.
>>>>&g

Re: [Pacemaker] hangs pending

2014-01-10 Thread Andrey Groshev


10.01.2014, 14:01, "Andrew Beekhof" :
> On 10 Jan 2014, at 5:03 pm, Andrey Groshev  wrote:
>
>>  10.01.2014, 05:29, "Andrew Beekhof" :
>>>   On 9 Jan 2014, at 11:11 pm, Andrey Groshev  wrote:
>>>>    08.01.2014, 06:22, "Andrew Beekhof" :
>>>>>    On 29 Nov 2013, at 7:17 pm, Andrey Groshev  wrote:
>>>>>> Hi, ALL.
>>>>>>
>>>>>> I'm still trying to cope with the fact that after the fence - node 
>>>>>> hangs in "pending".
>>>>>    Please define "pending".  Where did you see this?
>>>>    In crm_mon:
>>>>    ..
>>>>    Node dev-cluster2-node2 (172793105): pending
>>>>    ..
>>>>
>>>>    The experiment was like this:
>>>>    Four nodes in cluster.
>>>>    On one of them kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>    Thereafter, the remaining start it constantly reboot, under various 
>>>> pretexts, "softly whistling", "fly low", "not a cluster member!" ...
>>>>    Then in the log fell out "Too many failures "
>>>>    All this time in the status in crm_mon is "pending".
>>>>    Depending on the wind direction changed to "UNCLEAN"
>>>>    Much time has passed and I can not accurately describe the behavior...
>>>>
>>>>    Now I am in the following state:
>>>>    I tried locate the problem. Came here with this.
>>>>    I set big value in property stonith-timeout="600s".
>>>>    And got the following behavior:
>>>>    1. pkill -4 corosync
>>>>    2. from node with DC call my fence agent "sshbykey"
>>>>    3. It sends reboot victim and waits until she comes to life again.
>>>   Hmmm what version of pacemaker?
>>>   This sounds like a timing issue that we fixed a while back
>>  Was a version 1.1.11 from December 3.
>>  Now try full update and retest.
>
> That should be recent enough.  Can you create a crm_report the next time you 
> reproduce?
>

Of course yes. Little delay :) 

..
cc1: warnings being treated as errors
upstart.c: In function ‘upstart_job_property’:
upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
upstart.c:264: error: assignment makes pointer from integer without a cast
gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/ha/pacemaker/lib'
make: *** [core] Error 1

I'm trying to solve this problem.


>>>>  Once the script makes sure that the victim will rebooted and again 
>>>> available via ssh - it exit with 0.
>>>>  All command is logged both the victim and the killer - all right.
>>>>    4. A little later, the status of the (victim) nodes in crm_mon changes 
>>>> to online.
>>>>    5. BUT... not one resource don't start! Despite the fact that 
>>>> "crm_simalate -sL" shows the correct resource to start:
>>>>  * Start   pingCheck:3  (dev-cluster2-node2)
>>>>    6. In this state, we spend the next 600 seconds.
>>>>  After completing this timeout causes another node (not DC) decides to 
>>>> kill again our victim.
>>>>  All command again is logged both the victim and the killer - All 
>>>> documented :)
>>>>    7. NOW all resource started in right sequence.
>>>>
>>>>    I almost happy, but I do not like: two reboots and 10 minutes of 
>>>> waiting ;)
>>>>    And if something happens on another node, this the behavior is 
>>>> superimposed on old and not any resources not start until the last node 
>>>> will not reload twice.
>>>>
>>>>    I tried understood this behavior.
>>>>    As I understand it:
>>>>    1. Ultimately, in ./lib/fencing/st_client.c call 
>>>> internal_stonith_action_execute().
>>>>    2. It make fork and pipe from tham.
>>>>    3. Async call mainloop_child_add with callback to  
>>>> stonith_action_async_done.
>>>>    4. Add timeout  g_timeout_add to TERM and KILL signals.
>>>>
>>>>    If all right must - call stonith_action_async_done, remove timeout.
>>>>    For some reason this does not ha

Re: [Pacemaker] again "return code", now in crm_attribute

2014-01-09 Thread Andrey Groshev


10.01.2014, 10:15, "Andrew Beekhof" :
> On 10 Jan 2014, at 4:38 pm, Andrey Groshev  wrote:
>
>>  10.01.2014, 09:06, "Andrew Beekhof" :
>>>  On 10 Jan 2014, at 3:51 pm, Andrey Groshev  wrote:
>>>>   10.01.2014, 03:28, "Andrew Beekhof" :
>>>>>   On 9 Jan 2014, at 4:44 pm, Andrey Groshev  wrote:
>>>>>>    09.01.2014, 02:39, "Andrew Beekhof" :
>>>>>>> On 18 Dec 2013, at 11:55 pm, Andrey Groshev  
>>>>>>> wrote:
>>>>>>>>  Hi, Andrew and ALL.
>>>>>>>>
>>>>>>>>  I'm sorry, but I again found an error. :)
>>>>>>>>  Crux of the problem:
>>>>>>>>
>>>>>>>>  # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>>>> --query; echo $?
>>>>>>>>  scope=crm_config  name=stonith-enabled value=true
>>>>>>>>  0
>>>>>>>>
>>>>>>>>  # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>>>> --update firstval ; echo $?
>>>>>>>>  0
>>>>>>>>
>>>>>>>>  # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>>>> --query; echo $?
>>>>>>>>  scope=crm_config  name=stonith-enabled value=firstval
>>>>>>>>  0
>>>>>>>>
>>>>>>>>  # crm_attribute --type crm_config --attr-name stonith-enabled  
>>>>>>>> --update secondval --lifetime=reboot ; echo $?
>>>>>>>>  0
>>>>>>>>
>>>>>>>>  # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>>>> --query; echo $?
>>>>>>>>  scope=crm_config  name=stonith-enabled value=firstval
>>>>>>>>  0
>>>>>>>>
>>>>>>>>  # crm_attribute --type crm_config --attr-name stonith-enabled  
>>>>>>>> --update thirdval --lifetime=forever ; echo $?
>>>>>>>>  0
>>>>>>>>
>>>>>>>>  # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>>>> --query; echo $?
>>>>>>>>  scope=crm_config  name=stonith-enabled value=firstval
>>>>>>>>  0
>>>>>>>>
>>>>>>>>  Ie if specify the lifetime of an attribute, then a attribure is 
>>>>>>>> not updated.
>>>>>>>>
>>>>>>>>  If impossible setup the lifetime of the attribute when it is 
>>>>>>>> installing, it must be return an error.
>>>>>>> Agreed. I'll reproduce and get back to you.
>>>>>>    How, I was able to review code, problem comes when used both options 
>>>>>> "--type" and options "--lifetime".
>>>>>>    One variant in "case" without break;
>>>>>>    Unfortunately, I did not have time to dive into the logic.
>>>>>   Actually, the logic is correct.  The command:
>>>>>
>>>>>   # crm_attribute --type crm_config --attr-name stonith-enabled  --update 
>>>>> secondval --lifetime=reboot ; echo $?
>>>>>
>>>>>   is invalid.  You only get to specify --type OR --lifetime, not both.
>>>>>   By specifying --lifetime, you're creating a node attribute, not a 
>>>>> cluster proprerty.
>>>>   With this, I do not argue. I think that should be the exit code is NOT 
>>>> ZERO, ie it's error!
>>>  No, its setting a value, just not where you thought (or where you're 
>>> looking for it in the next command).
>>>
>>>  Its the same as writing:
>>>
>>>    crm_attribute --type crm_config --type status --attr-name 
>>> stonith-enabled  --update secondval; echo $?
>>>
>>>  Only the last value for --type wins
>>  Because of this confusion is obtained. Here is an example of the old 
>> cluster:
>>  #crm_attribute --type crm_config --attr-name test1  --update val1 
>> --lifetime=reboot ; echo $?
>>  0
>>  # cibadmin -Q|grep test1
>>   
>

Re: [Pacemaker] hangs pending

2014-01-09 Thread Andrey Groshev
10.01.2014, 05:29, "Andrew Beekhof" :

>  On 9 Jan 2014, at 11:11 pm, Andrey Groshev  wrote:
>>   08.01.2014, 06:22, "Andrew Beekhof" :
>>>   On 29 Nov 2013, at 7:17 pm, Andrey Groshev  wrote:
>>>>    Hi, ALL.
>>>>
>>>>    I'm still trying to cope with the fact that after the fence - node 
>>>> hangs in "pending".
>>>   Please define "pending".  Where did you see this?
>>   In crm_mon:
>>   ..
>>   Node dev-cluster2-node2 (172793105): pending
>>   ..
>>
>>   The experiment was like this:
>>   Four nodes in cluster.
>>   On one of them kill corosync or pacemakerd (signal 4, 6 or 11).
>>   Thereafter, the remaining start it constantly reboot, under various 
>> pretexts, "softly whistling", "fly low", "not a cluster member!" ...
>>   Then in the log fell out "Too many failures "
>>   All this time in the status in crm_mon is "pending".
>>   Depending on the wind direction changed to "UNCLEAN"
>>   Much time has passed and I can not accurately describe the behavior...
>>
>>   Now I am in the following state:
>>   I tried locate the problem. Came here with this.
>>   I set big value in property stonith-timeout="600s".
>>   And got the following behavior:
>>   1. pkill -4 corosync
>>   2. from node with DC call my fence agent "sshbykey"
>>   3. It sends reboot victim and waits until she comes to life again.
>  Hmmm what version of pacemaker?
>  This sounds like a timing issue that we fixed a while back

It was a 1.1.11 version from December 3.
I will now do a full update and retest.

>> Once the script makes sure that the victim will rebooted and again 
>> available via ssh - it exit with 0.
>> All command is logged both the victim and the killer - all right.
>>   4. A little later, the status of the (victim) nodes in crm_mon changes to 
>> online.
>>   5. BUT... not one resource don't start! Despite the fact that 
>> "crm_simalate -sL" shows the correct resource to start:
>> * Start   pingCheck:3  (dev-cluster2-node2)
>>   6. In this state, we spend the next 600 seconds.
>> After completing this timeout causes another node (not DC) decides to 
>> kill again our victim.
>> All command again is logged both the victim and the killer - All 
>> documented :)
>>   7. NOW all resource started in right sequence.
>>
>>   I almost happy, but I do not like: two reboots and 10 minutes of waiting ;)
>>   And if something happens on another node, this the behavior is 
>> superimposed on old and not any resources not start until the last node will 
>> not reload twice.
>>
>>   I tried understood this behavior.
>>   As I understand it:
>>   1. Ultimately, in ./lib/fencing/st_client.c call 
>> internal_stonith_action_execute().
>>   2. It make fork and pipe from tham.
>>   3. Async call mainloop_child_add with callback to  
>> stonith_action_async_done.
>>   4. Add timeout  g_timeout_add to TERM and KILL signals.
>>
>>   If all right must - call stonith_action_async_done, remove timeout.
>>   For some reason this does not happen. I sit and think 
>>>>    At this time, there are constant re-election.
>>>>    Also, I noticed the difference when you start pacemaker.
>>>>    At normal startup:
>>>>    * corosync
>>>>    * pacemakerd
>>>>    * attrd
>>>>    * pengine
>>>>    * lrmd
>>>>    * crmd
>>>>    * cib
>>>>
>>>>    When hangs start:
>>>>    * corosync
>>>>    * pacemakerd
>>>>    * attrd
>>>>    * pengine
>>>>    * crmd
>>>>    * lrmd
>>>>    * cib.
>>>   Are you referring to the order of the daemons here?
>>>   The cib should not be at the bottom in either case.
>>>>    Who knows who runs lrmd?
>>>   Pacemakerd.
>>>>    ___
>>>>    Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>    http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>>    Project Home: http://www.clusterlabs.org
>>>>    Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>    Bugs: http://bugs.clusterlabs.org
>>>   ,
>>>   ___
>>>   Pacemaker mailing list: Pacemake

Re: [Pacemaker] Breaking dependency loop && stonith

2014-01-09 Thread Andrey Groshev


10.01.2014, 02:36, "Andrew Beekhof" :
> On 9 Jan 2014, at 5:05 pm, Andrey Groshev  wrote:
>
>>  08.01.2014, 06:15, "Andrew Beekhof" :
>>>  On 27 Nov 2013, at 12:26 am, Andrey Groshev  wrote:
>>>>   Hi, ALL.
>>>>
>>>>   I want to clarify two more questions.
>>>>   After stonith reboot - this node hangs with status "pending".
>>>>   The logs found string .
>>>>
>>>>  info: rsc_merge_weights:    pgsql:1: Breaking dependency loop at 
>>>> msPostgresql
>>>>  info: rsc_merge_weights:    pgsql:2: Breaking dependency loop at 
>>>> msPostgresql
>>>>
>>>>   This means that breaking search the depends, because they are no more.
>>>>   Or interrupted by an infinite loop for search the dependency?
>>>  The second one, but it has nothing to do with a node being in the 
>>> "pending" state.
>>>  Where did you see this?
>>  Ok, I've already understood this the problem.
>>  I have "location" for right promote|demote resource.
>>  And too same logic trough "collocation"/"order".
>>  As I thought, they do the same thing
>
> No, collocation and ordering are orthogonal concepts and do not at all do the 
> same thing.
> See the docs.

Yes, I put it wrong. I meant the resulting logic of the cluster behavior.

>
>>  and collisions should not happen.
>>  At least on the old cluster it works :)
>>  Now I have removed all unnecessary.
>>>>   And two.
>>>>   Do I need to clone the stonith resource now (In PCMK 1.1.11)?
>>>  No.
>>>>   On the one hand, I see this resource on all nodes through command.
>>>>   # cibadmin -Q|grep stonith
>>>>  >>> id="cib-bootstrap-options-stonith-enabled"/>
>>>>    
>>>>    
>>>>    
>>>>    
>>>>   (without pending node)
>>>  Like all resources, we check all nodes at startup to see if it is already 
>>> active.
>>>>   On the other hand, another command I see only one instance on a 
>>>> particular node.
>>>>   # crm_verify -L
>>>>  info: main: =#=#=#=#= Getting XML =#=#=#=#=
>>>>  info: main: Reading XML from: live cluster
>>>>  info: validate_with_relaxng:    Creating RNG parser context
>>>>  info: determine_online_status_fencing:  Node dev-cluster2-node4 
>>>> is active
>>>>  info: determine_online_status:  Node dev-cluster2-node4 is online
>>>>  info: determine_online_status_fencing:  - Node dev-cluster2-node1 
>>>> is not ready to run resources
>>>>  info: determine_online_status_fencing:  Node dev-cluster2-node2 
>>>> is active
>>>>  info: determine_online_status:  Node dev-cluster2-node2 is online
>>>>  info: determine_online_status_fencing:  Node dev-cluster2-node3 
>>>> is active
>>>>  info: determine_online_status:  Node dev-cluster2-node3 is online
>>>>  info: determine_op_status:  Operation monitor found resource 
>>>> pingCheck:0 active on dev-cluster2-node4
>>>>  info: native_print: VirtualIP   (ocf::heartbeat:IPaddr2): 
>>>>   Started dev-cluster2-node4
>>>>  info: clone_print:   Master/Slave Set: msPostgresql [pgsql]
>>>>  info: short_print:   Masters: [ dev-cluster2-node4 ]
>>>>  info: short_print:   Slaves: [ dev-cluster2-node2 
>>>> dev-cluster2-node3 ]
>>>>  info: short_print:   Stopped: [ dev-cluster2-node1 ]
>>>>  info: clone_print:   Clone Set: clnPingCheck [pingCheck]
>>>>  info: short_print:   Started: [ dev-cluster2-node2 
>>>> dev-cluster2-node3 dev-cluster2-node4 ]
>>>>  info: short_print:   Stopped: [ dev-cluster2-node1 ]
>>>>  info: native_print: st1 (stonith:external/sshbykey):    
>>>> Started dev-cluster2-node4
>>>>  info: native_color: Resource pingCheck:3 cannot run anywhere
>>>>  info: native_color: Resource pgsql:3 cannot run anywhere
>>>>  info: rsc_merge_weights:    pgsql:1: Breaking dependency loop at 
>>>> msPostgresql
>>>>  info: rsc_merge_weights:    pgsql:2: Breaking dependency loop at 
>>>> msPostgresql
>>>>  info: master_color: Promoting pgsql:0

Re: [Pacemaker] again "return code", now in crm_attribute

2014-01-09 Thread Andrey Groshev


10.01.2014, 03:28, "Andrew Beekhof" :
> On 9 Jan 2014, at 4:44 pm, Andrey Groshev  wrote:
>
>>  09.01.2014, 02:39, "Andrew Beekhof" :
>>>   On 18 Dec 2013, at 11:55 pm, Andrey Groshev  wrote:
>>>>    Hi, Andrew and ALL.
>>>>
>>>>    I'm sorry, but I again found an error. :)
>>>>    Crux of the problem:
>>>>
>>>>    # crm_attribute --type crm_config --attr-name stonith-enabled --query; 
>>>> echo $?
>>>>    scope=crm_config  name=stonith-enabled value=true
>>>>    0
>>>>
>>>>    # crm_attribute --type crm_config --attr-name stonith-enabled --update 
>>>> firstval ; echo $?
>>>>    0
>>>>
>>>>    # crm_attribute --type crm_config --attr-name stonith-enabled --query; 
>>>> echo $?
>>>>    scope=crm_config  name=stonith-enabled value=firstval
>>>>    0
>>>>
>>>>    # crm_attribute --type crm_config --attr-name stonith-enabled  --update 
>>>> secondval --lifetime=reboot ; echo $?
>>>>    0
>>>>
>>>>    # crm_attribute --type crm_config --attr-name stonith-enabled --query; 
>>>> echo $?
>>>>    scope=crm_config  name=stonith-enabled value=firstval
>>>>    0
>>>>
>>>>    # crm_attribute --type crm_config --attr-name stonith-enabled  --update 
>>>> thirdval --lifetime=forever ; echo $?
>>>>    0
>>>>
>>>>    # crm_attribute --type crm_config --attr-name stonith-enabled --query; 
>>>> echo $?
>>>>    scope=crm_config  name=stonith-enabled value=firstval
>>>>    0
>>>>
>>>>    Ie if specify the lifetime of an attribute, then a attribure is not 
>>>> updated.
>>>>
>>>>    If impossible setup the lifetime of the attribute when it is 
>>>> installing, it must be return an error.
>>>   Agreed. I'll reproduce and get back to you.
>>  How, I was able to review code, problem comes when used both options 
>> "--type" and options "--lifetime".
>>  One variant in "case" without break;
>>  Unfortunately, I did not have time to dive into the logic.
>
> Actually, the logic is correct.  The command:
>
> # crm_attribute --type crm_config --attr-name stonith-enabled  --update 
> secondval --lifetime=reboot ; echo $?
>
> is invalid.  You only get to specify --type OR --lifetime, not both.
> By specifying --lifetime, you're creating a node attribute, not a cluster 
> proprerty.

With this I do not argue, but I think the exit code should then be NON-ZERO, 
i.e. an error!

>>>>    And if possible then the value should be established.
>>>>    In general, something is wrong.
>>>>    Denser unfortunately not yet looked, because I struggle with "STONITH" 
>>>> :)
>>>>
>>>>    P.S. Andrew! Late to congratulate you on your new addition to the 
>>>> family.
>>>>    This fine time - now you will have toys which was not in your childhood.
>>>>
>>>>    ___
>>>>    Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>    http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>>    Project Home: http://www.clusterlabs.org
>>>>    Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>    Bugs: http://bugs.clusterlabs.org
>>>   ,
>>>   ___
>>>   Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>   http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>>   Project Home: http://www.clusterlabs.org
>>>   Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>   Bugs: http://bugs.clusterlabs.org
>>  ___
>>  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org
>
> ,
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] again "return code", now in crm_attribute

2014-01-09 Thread Andrey Groshev


10.01.2014, 09:06, "Andrew Beekhof" :
> On 10 Jan 2014, at 3:51 pm, Andrey Groshev  wrote:
>
>>  10.01.2014, 03:28, "Andrew Beekhof" :
>>>  On 9 Jan 2014, at 4:44 pm, Andrey Groshev  wrote:
>>>>   09.01.2014, 02:39, "Andrew Beekhof" :
>>>>>    On 18 Dec 2013, at 11:55 pm, Andrey Groshev  wrote:
>>>>>> Hi, Andrew and ALL.
>>>>>>
>>>>>> I'm sorry, but I again found an error. :)
>>>>>> Crux of the problem:
>>>>>>
>>>>>> # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>> --query; echo $?
>>>>>> scope=crm_config  name=stonith-enabled value=true
>>>>>> 0
>>>>>>
>>>>>> # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>> --update firstval ; echo $?
>>>>>> 0
>>>>>>
>>>>>> # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>> --query; echo $?
>>>>>> scope=crm_config  name=stonith-enabled value=firstval
>>>>>> 0
>>>>>>
>>>>>> # crm_attribute --type crm_config --attr-name stonith-enabled  
>>>>>> --update secondval --lifetime=reboot ; echo $?
>>>>>> 0
>>>>>>
>>>>>> # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>> --query; echo $?
>>>>>> scope=crm_config  name=stonith-enabled value=firstval
>>>>>> 0
>>>>>>
>>>>>> # crm_attribute --type crm_config --attr-name stonith-enabled  
>>>>>> --update thirdval --lifetime=forever ; echo $?
>>>>>> 0
>>>>>>
>>>>>> # crm_attribute --type crm_config --attr-name stonith-enabled 
>>>>>> --query; echo $?
>>>>>> scope=crm_config  name=stonith-enabled value=firstval
>>>>>> 0
>>>>>>
>>>>>> Ie if specify the lifetime of an attribute, then a attribure is not 
>>>>>> updated.
>>>>>>
>>>>>> If impossible setup the lifetime of the attribute when it is 
>>>>>> installing, it must be return an error.
>>>>>    Agreed. I'll reproduce and get back to you.
>>>>   How, I was able to review code, problem comes when used both options 
>>>> "--type" and options "--lifetime".
>>>>   One variant in "case" without break;
>>>>   Unfortunately, I did not have time to dive into the logic.
>>>  Actually, the logic is correct.  The command:
>>>
>>>  # crm_attribute --type crm_config --attr-name stonith-enabled  --update 
>>> secondval --lifetime=reboot ; echo $?
>>>
>>>  is invalid.  You only get to specify --type OR --lifetime, not both.
>>>  By specifying --lifetime, you're creating a node attribute, not a cluster 
>>> proprerty.
>>  With this, I do not argue. I think that should be the exit code is NOT 
>> ZERO, ie it's error!
>
> No, its setting a value, just not where you thought (or where you're looking 
> for it in the next command).
>
> Its the same as writing:
>
>   crm_attribute --type crm_config --type status --attr-name stonith-enabled  
> --update secondval; echo $?
>
> Only the last value for --type wins
>

This is where the confusion comes from. Here is an example from the old cluster:
#crm_attribute --type crm_config --attr-name test1  --update val1 --lifetime=reboot ; echo $?
0
# cibadmin -Q|grep test1
  
So "--lifetime" wins?
Would it not be easier to produce an error when incompatible options are used
together? Otherwise you are left with this uncertainty and endless guessing
about "what was meant".
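
To make the difference visible, one can query the individual CIB sections
(a sketch; the attribute names are made up, and crm_attribute defaulting to
the local node for --lifetime is an assumption to verify on your build):

# crm_attribute --type crm_config --attr-name prop1 --update val1
# cibadmin -Q -o crm_config | grep prop1      <- the cluster property lands here
# crm_attribute --attr-name attr1 --update val2 --lifetime=reboot
# cibadmin -Q -o status | grep attr1          <- the transient node attribute lands here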


>>>>>> And if possible then the value should be established.
>>>>>> In general, something is wrong.
>>>>>> Denser unfortunately not yet looked, because I struggle with 
>>>>>> "STONITH" :)
>>>>>>
>>>>>> P.S. Andrew! Late to congratulate you on your new addition to the 
>>>>>> family.
>>>>>> This fine time - now you will have toys which was not in your 
>>>>>> childhood.
>>>>>>
>>>>>> 

Re: [Pacemaker] hangs pending

2014-01-09 Thread Andrey Groshev


08.01.2014, 06:22, "Andrew Beekhof" :
> On 29 Nov 2013, at 7:17 pm, Andrey Groshev  wrote:
>
>>  Hi, ALL.
>>
>>  I'm still trying to cope with the fact that after the fence - node hangs in 
>> "pending".
>
> Please define "pending".  Where did you see this?
In crm_mon:
..
Node dev-cluster2-node2 (172793105): pending
..


The experiment was like this:
Four nodes in the cluster.
On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
Thereafter the remaining nodes constantly reboot it, under various pretexts: 
"softly whistling", "fly low", "not a cluster member!" ...
Then "Too many failures" fell out in the log.
All this time the status in crm_mon is "pending".
Depending on the wind direction it changed to "UNCLEAN".
Much time has passed and I cannot accurately describe the behavior...

Now I am in the following state:
I tried to locate the problem and came here with this.
I set a big value in the property stonith-timeout="600s".
And got the following behavior:
1. pkill -4 corosync
2. From the node holding DC, my fence agent "sshbykey" is called.
3. It sends a reboot to the victim and waits until it comes to life again. 
   Once the script has made sure that the victim rebooted and is again 
available via ssh, it exits with 0. 
   All commands are logged on both the victim and the killer - all right.
4. A little later, the status of the victim node in crm_mon changes to 
online.
5. BUT... not one resource starts! Despite the fact that "crm_simulate 
-sL" shows the correct resource to start:
   * Start   pingCheck:3  (dev-cluster2-node2)
6. In this state we spend the next 600 seconds. 
   When this timeout expires, another node (not the DC) decides to kill 
our victim again. 
   All commands are again logged on both the victim and the killer - all 
documented :)
7. NOW all resources start in the right sequence.

I am almost happy, but I do not like it: two reboots and 10 minutes of waiting ;)
And if something happens on another node, this behavior is superimposed on 
the old one, and no resources start until the last node has rebooted twice.
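
The 600 seconds of waiting correspond to the stonith-timeout="600s" property
mentioned above; once the agent reliably reports completion, it could
presumably be lowered again, for example:

# crm_attribute --type crm_config --attr-name stonith-timeout --update 120s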

I tried to understand this behavior.
As I understand it:
1. Ultimately, ./lib/fencing/st_client.c calls 
internal_stonith_action_execute().
2. It forks and creates a pipe to the child.
3. It asynchronously registers the child via mainloop_child_add with a 
callback to stonith_action_async_done.
4. It adds timeouts via g_timeout_add for the TERM and KILL signals.

If all goes right, stonith_action_async_done must be called and the timeout removed.
For some reason this does not happen. I sit and think ...




>>  At this time, there are constant re-election.
>>  Also, I noticed the difference when you start pacemaker.
>>  At normal startup:
>>  * corosync
>>  * pacemakerd
>>  * attrd
>>  * pengine
>>  * lrmd
>>  * crmd
>>  * cib
>>
>>  When hangs start:
>>  * corosync
>>  * pacemakerd
>>  * attrd
>>  * pengine
>>  * crmd
>>  * lrmd
>>  * cib.
>
> Are you referring to the order of the daemons here?
> The cib should not be at the bottom in either case.
>
>>  Who knows who runs lrmd?
>
> Pacemakerd.
>
>>  ___
>>  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org
>
> ,
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Breaking dependency loop && stonith

2014-01-08 Thread Andrey Groshev


08.01.2014, 06:15, "Andrew Beekhof" :
> On 27 Nov 2013, at 12:26 am, Andrey Groshev  wrote:
>
>>  Hi, ALL.
>>
>>  I want to clarify two more questions.
>>  After stonith reboot - this node hangs with status "pending".
>>  The logs found string .
>>
>> info: rsc_merge_weights:    pgsql:1: Breaking dependency loop at 
>> msPostgresql
>> info: rsc_merge_weights:    pgsql:2: Breaking dependency loop at 
>> msPostgresql
>>
>>  This means that breaking search the depends, because they are no more.
>>  Or interrupted by an infinite loop for search the dependency?
>
> The second one, but it has nothing to do with a node being in the "pending" 
> state.
> Where did you see this?

Ok, I've already understood this problem.
I have a "location" constraint to promote|demote the resource on the right node,
and the same logic again through "colocation"/"order".
As I thought they did the same thing, collisions should not happen.
At least on the old cluster it works :)
Now I have removed everything unnecessary.
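
For illustration, the overlap described above looks roughly like this (a
sketch in crm syntax; the rule and constraint names are hypothetical, based
on the resources shown in the crm_verify output below):

location master-on-node4 msPostgresql \
  rule $role="Master" 100: #uname eq dev-cluster2-node4
colocation vip-with-master inf: VirtualIP msPostgresql:Master
order vip-after-promote inf: msPostgresql:promote VirtualIP:start

The location rule and the colocation/order pair pull in the same direction,
but they are evaluated independently, which is presumably why dropping the
redundant ones removed the collisions.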


>
>>  And two.
>>  Do I need to clone the stonith resource now (In PCMK 1.1.11)?
>
> No.
>
>>  On the one hand, I see this resource on all nodes through command.
>>  # cibadmin -Q|grep stonith
>> > id="cib-bootstrap-options-stonith-enabled"/>
>>   
>>   
>>   
>>   
>>  (without pending node)
>
> Like all resources, we check all nodes at startup to see if it is already 
> active.
>
>>  On the other hand, another command I see only one instance on a particular 
>> node.
>>  # crm_verify -L
>> info: main: =#=#=#=#= Getting XML =#=#=#=#=
>> info: main: Reading XML from: live cluster
>> info: validate_with_relaxng:    Creating RNG parser context
>> info: determine_online_status_fencing:  Node dev-cluster2-node4 is 
>> active
>> info: determine_online_status:  Node dev-cluster2-node4 is online
>> info: determine_online_status_fencing:  - Node dev-cluster2-node1 is 
>> not ready to run resources
>> info: determine_online_status_fencing:  Node dev-cluster2-node2 is 
>> active
>> info: determine_online_status:  Node dev-cluster2-node2 is online
>> info: determine_online_status_fencing:  Node dev-cluster2-node3 is 
>> active
>> info: determine_online_status:  Node dev-cluster2-node3 is online
>> info: determine_op_status:  Operation monitor found resource pingCheck:0 
>> active on dev-cluster2-node4
>> info: native_print: VirtualIP   (ocf::heartbeat:IPaddr2):    
>>    Started dev-cluster2-node4
>> info: clone_print:   Master/Slave Set: msPostgresql [pgsql]
>> info: short_print:   Masters: [ dev-cluster2-node4 ]
>> info: short_print:   Slaves: [ dev-cluster2-node2 dev-cluster2-node3 
>> ]
>> info: short_print:   Stopped: [ dev-cluster2-node1 ]
>> info: clone_print:   Clone Set: clnPingCheck [pingCheck]
>> info: short_print:   Started: [ dev-cluster2-node2 
>> dev-cluster2-node3 dev-cluster2-node4 ]
>> info: short_print:   Stopped: [ dev-cluster2-node1 ]
>> info: native_print: st1 (stonith:external/sshbykey):    
>> Started dev-cluster2-node4
>> info: native_color: Resource pingCheck:3 cannot run anywhere
>> info: native_color: Resource pgsql:3 cannot run anywhere
>> info: rsc_merge_weights:    pgsql:1: Breaking dependency loop at 
>> msPostgresql
>> info: rsc_merge_weights:    pgsql:2: Breaking dependency loop at 
>> msPostgresql
>> info: master_color: Promoting pgsql:0 (Master dev-cluster2-node4)
>> info: master_color: msPostgresql: Promoted 1 instances of a 
>> possible 1 to master
>> info: LogActions:   Leave   VirtualIP   (Started dev-cluster2-node4)
>> info: LogActions:   Leave   pgsql:0 (Master dev-cluster2-node4)
>> info: LogActions:   Leave   pgsql:1 (Slave dev-cluster2-node2)
>> info: LogActions:   Leave   pgsql:2 (Slave dev-cluster2-node3)
>> info: LogActions:   Leave   pgsql:3 (Stopped)
>> info: LogActions:   Leave   pingCheck:0 (Started dev-cluster2-node4)
>> info: LogActions:   Leave   pingCheck:1 (Started dev-cluster2-node2)
>> info: LogActions:   Leave   pingCheck:2 (Started dev-cluster2-node3)
>> info: LogActions:   Leave   pingCheck:3 (Stopped)
>> info: LogActions:   Leave   st1 (Started dev-cluster2-node4)
>>
>>  Howev

Re: [Pacemaker] again "return code", now in crm_attribute

2014-01-08 Thread Andrey Groshev
09.01.2014, 02:39, "Andrew Beekhof" :

>  On 18 Dec 2013, at 11:55 pm, Andrey Groshev  wrote:
>>   Hi, Andrew and ALL.
>>
>>   I'm sorry, but I again found an error. :)
>>   Crux of the problem:
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled --query; 
>> echo $?
>>   scope=crm_config  name=stonith-enabled value=true
>>   0
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled --update 
>> firstval ; echo $?
>>   0
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled --query; 
>> echo $?
>>   scope=crm_config  name=stonith-enabled value=firstval
>>   0
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled  --update 
>> secondval --lifetime=reboot ; echo $?
>>   0
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled --query; 
>> echo $?
>>   scope=crm_config  name=stonith-enabled value=firstval
>>   0
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled  --update 
>> thirdval --lifetime=forever ; echo $?
>>   0
>>
>>   # crm_attribute --type crm_config --attr-name stonith-enabled --query; 
>> echo $?
>>   scope=crm_config  name=stonith-enabled value=firstval
>>   0
>>
>>   Ie if specify the lifetime of an attribute, then a attribure is not 
>> updated.
>>
>>   If impossible setup the lifetime of the attribute when it is installing, 
>> it must be return an error.
>  Agreed. I'll reproduce and get back to you.

From what I was able to see while reviewing the code, the problem comes when 
both the "--type" and "--lifetime" options are used.
One branch in a "case" statement is missing a break;
Unfortunately, I did not have time to dive into the logic.

>>   And if possible then the value should be established.
>>   In general, something is wrong.
>>   Denser unfortunately not yet looked, because I struggle with "STONITH" :)
>>
>>   P.S. Andrew! Late to congratulate you on your new addition to the family.
>>   This fine time - now you will have toys which was not in your childhood.
>>
>>   ___
>>   Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>   http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>   Project Home: http://www.clusterlabs.org
>>   Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>   Bugs: http://bugs.clusterlabs.org
>  ,
>  ___
>  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
>  Project Home: http://www.clusterlabs.org
>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>  Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] again "return code", now in crm_attribute

2013-12-18 Thread Andrey Groshev
Hi, Andrew and ALL.

I'm sorry, but I have found an error again. :)
The crux of the problem:

# crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $?
scope=crm_config  name=stonith-enabled value=true
0

# crm_attribute --type crm_config --attr-name stonith-enabled --update firstval 
; echo $?
0

# crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $?
scope=crm_config  name=stonith-enabled value=firstval
0

# crm_attribute --type crm_config --attr-name stonith-enabled  --update 
secondval --lifetime=reboot ; echo $?
0

# crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $?
scope=crm_config  name=stonith-enabled value=firstval
0

# crm_attribute --type crm_config --attr-name stonith-enabled  --update 
thirdval --lifetime=forever ; echo $?
0

# crm_attribute --type crm_config --attr-name stonith-enabled --query; echo $?
scope=crm_config  name=stonith-enabled value=firstval
0

I.e., if you specify the lifetime of an attribute, then the attribute is not updated.

If it is impossible to set the lifetime of the attribute while setting it, an 
error must be returned.

And if it is possible, then the value should be set.
In general, something is wrong.
Unfortunately I have not looked deeper yet, because I am struggling with "STONITH" :)

P.S. Andrew! Belated congratulations on the new addition to your family. 
This is a fine time - now you will have the toys you didn't have in your childhood.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crmsh: New syntax for location constraints, suggestions / comments

2013-12-13 Thread Andrey Groshev


13.12.2013, 14:27, "Lars Marowsky-Bree" :
> On 2013-12-13T13:51:27, Andrey Groshev  wrote:
>
>>  I just thought that I was missing something in "location", like: node=any :)
>
> Can you describe what this is supposed to achieve?
>
> "any" is the default for symmetric clusters anyway.

For example, in an asymmetric cluster, to start a resource on all nodes 
without writing a "location" constraint for each one - see the sketch below.
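
Today, as far as I understand, you have to spell this out with a rule that 
happens to match every node. A sketch (untested, resource name made up; 
#uname is defined on every node):

location l_run_anywhere clnMyResource \
        rule $id="l_run_anywhere-rule" 100: defined #uname

A "node=any" shorthand would say the same thing more obviously.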

> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, 
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] parametr "kind" in "order"

2013-12-13 Thread Andrey Groshev
Hi All,
I see this in Pacemaker Explained:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-ordering.html#_mandatory_ordering

What is the parameter "kind"? It is marked "since 1.1.2".
Is this an unrealized plan, or is it something I do not understand?
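
As far as I can tell from the documentation, "kind" replaces the numeric score 
on order constraints: score=0 roughly corresponds to kind="Optional", and 
score=INFINITY to kind="Mandatory". In the XML it would look something like 
this (my guess, untested, ids made up):

<rsc_order id="ord-ping-then-pgsql" first="clnPingCheck"
           then="msPostgresql" kind="Mandatory"/>

But I would like confirmation.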

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crmsh: New syntax for location constraints, suggestions / comments

2013-12-13 Thread Andrey Groshev
Hi,

I just thought that I was missing something in "location", like: node=any :)

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Time to get ready for 1.1.11

2013-12-12 Thread Andrey Groshev
And why not include it? 
https://github.com/beekhof/pacemaker/commit/a4bdc9a 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource

2013-12-03 Thread Andrey Groshev


04.12.2013, 03:30, "David Vossel" :
> - Original Message -
>
>>  From: "Brian J. Murrell" 
>>  To: pacema...@clusterlabs.org
>>  Sent: Monday, December 2, 2013 2:50:41 PM
>>  Subject: [Pacemaker] catch-22: can't fence node A because node A has the 
>> fencing resource
>>
>>  So, I'm migrating my working pacemaker configuration from 1.1.7 to
>>  1.1.10 and am finding what appears to be a new behavior in 1.1.10.
>>
>>  If a given node is running a fencing resource and that node goes AWOL,
>>  it needs to be fenced (of course).  But any other node trying to take
>>  over the fencing resource to fence it appears to first want the current
>>  owner of the fencing resource to fence the node.  Of course that can't
>>  happen since the node that needs to do the fencing is AWOL.
>>
>>  So while I can buy into the general policy that a node needs to be
>>  fenced in order to take over it's resources, fencing resources have to
>>  be excepted from this or there can be this catch-22.
>
> We did away with all of the policy engine logic involved with trying to move 
> fencing devices off of the target node before executing the fencing action. 
> Behind the scenes all fencing devices are now essentially clones.  If the 
> target node to be fenced has a fencing device running on it, that device can 
> execute anywhere in the cluster to avoid the "suicide" situation.
>
> When you are looking at crm_mon output and see a fencing device is running on 
> a specific node, all that really means is that we are going to attempt to 
> execute fencing actions for that device from that node first. 

Means... means... means... 
There are basic principles of programming, one of which is "obvious is better 
than non-obvious."


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] hangs pending

2013-11-29 Thread Andrey Groshev
Hi, ALL.

I'm still trying to cope with the fact that after the fence the node hangs in 
"pending".
During this time there are constant re-elections.
Also, I noticed a difference in how pacemaker starts.
At normal startup:
* corosync
* pacemakerd
* attrd
* pengine
* lrmd
* crmd
* cib

When the hang occurs, the start order is:
* corosync
* pacemakerd
* attrd
* pengine
* crmd
* lrmd
* cib.

Does anyone know what starts lrmd?
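
(A quick check with standard tools - a sketch; if the two numbers match, then 
pacemakerd is the parent:)

# parent PID of lrmd:
ps -o ppid= -p $(pgrep -x lrmd)
# PID of pacemakerd:
pgrep -x pacemakerd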

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Breaking dependency loop && stonith

2013-11-26 Thread Andrey Groshev
Hi, ALL.

I want to clarify two more questions.
After a stonith reboot this node hangs with status "pending".
In the logs I found these strings:

info: rsc_merge_weights:pgsql:1: Breaking dependency loop at 
msPostgresql
info: rsc_merge_weights:pgsql:2: Breaking dependency loop at 
msPostgresql

Does this mean the dependency search stops because there are no more 
dependencies?
Or is an infinite loop in the dependency search being interrupted?

And the second question:
Do I need to clone the stonith resource now (in PCMK 1.1.11)?
On the one hand, I see this resource on all nodes via this command:
# cibadmin -Q|grep stonith

  
  
  
  
(except the pending node)

On the other hand, with another command I see only one instance, on one particular node:
# crm_verify -L
info: main: =#=#=#=#= Getting XML =#=#=#=#=
info: main: Reading XML from: live cluster
info: validate_with_relaxng:Creating RNG parser context
info: determine_online_status_fencing:  Node dev-cluster2-node4 is 
active
info: determine_online_status:  Node dev-cluster2-node4 is online
info: determine_online_status_fencing:  - Node dev-cluster2-node1 is 
not ready to run resources
info: determine_online_status_fencing:  Node dev-cluster2-node2 is 
active
info: determine_online_status:  Node dev-cluster2-node2 is online
info: determine_online_status_fencing:  Node dev-cluster2-node3 is 
active
info: determine_online_status:  Node dev-cluster2-node3 is online
info: determine_op_status:  Operation monitor found resource pingCheck:0 
active on dev-cluster2-node4
info: native_print: VirtualIP   (ocf::heartbeat:IPaddr2):   
Started dev-cluster2-node4
info: clone_print:   Master/Slave Set: msPostgresql [pgsql]
info: short_print:   Masters: [ dev-cluster2-node4 ]
info: short_print:   Slaves: [ dev-cluster2-node2 dev-cluster2-node3 ]
info: short_print:   Stopped: [ dev-cluster2-node1 ]
info: clone_print:   Clone Set: clnPingCheck [pingCheck]
info: short_print:   Started: [ dev-cluster2-node2 dev-cluster2-node3 
dev-cluster2-node4 ]
info: short_print:   Stopped: [ dev-cluster2-node1 ]
info: native_print: st1 (stonith:external/sshbykey):Started 
dev-cluster2-node4
info: native_color: Resource pingCheck:3 cannot run anywhere
info: native_color: Resource pgsql:3 cannot run anywhere
info: rsc_merge_weights:pgsql:1: Breaking dependency loop at 
msPostgresql
info: rsc_merge_weights:pgsql:2: Breaking dependency loop at 
msPostgresql
info: master_color: Promoting pgsql:0 (Master dev-cluster2-node4)
info: master_color: msPostgresql: Promoted 1 instances of a 
possible 1 to master
info: LogActions:   Leave   VirtualIP   (Started dev-cluster2-node4)
info: LogActions:   Leave   pgsql:0 (Master dev-cluster2-node4)
info: LogActions:   Leave   pgsql:1 (Slave dev-cluster2-node2)
info: LogActions:   Leave   pgsql:2 (Slave dev-cluster2-node3)
info: LogActions:   Leave   pgsql:3 (Stopped)
info: LogActions:   Leave   pingCheck:0 (Started dev-cluster2-node4)
info: LogActions:   Leave   pingCheck:1 (Started dev-cluster2-node2)
info: LogActions:   Leave   pingCheck:2 (Started dev-cluster2-node3)
info: LogActions:   Leave   pingCheck:3 (Stopped)
info: LogActions:   Leave   st1 (Started dev-cluster2-node4)


However, if I do make it a "clone" - the result is the same mess.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] some questions about STONITH

2013-11-25 Thread Andrey Groshev
>...snip...
>>  Run the next test:
>>  # stonith_admin --reboot=dev-cluster2-node2
>>  The node reboots, but resources don't start.
>>  In the crm_mon status - Node dev-cluster2-node2 (172793105): pending.
>>  And it hangs there.
>
> That is *probably* a race - the node reboots too fast, or still
> communicates for a bit after the fence has supposedly completed (if it's
> not a reboot -nf, but a mere reboot). We have had problems here in the
> past.
>
> You may want to file a proper bug report with crm_report included, and
> preferably corosync/pacemaker debugging enabled.

It was found that it does not hang forever.
A timeout triggered - after 20 minutes.
crm_report archive - http://send2me.ru/pen2.tar.bz2
Of course the logs contain many entries of this type:

pgsql:1: Breaking dependency loop at msPostgresql

But what becomes of this dependency after the timeout, I do not understand.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] exit code crm_attibute

2013-11-21 Thread Andrey Groshev
Hi, Andrew!

I'm trying to find the source of my problems. 
This trouble exists only with "--query".
I studied crm_attribute.c.
IMHO, in the call
rc = read_attr_delegate(the_cib, type, dest_node, set_type, set_name,
 attr_id, attr_name, &read_value, TRUE, NULL);

I think that dest_node == NULL,

since the following piece of code ignores the return value:

238    if (pcmk_ok != query_node_uuid(the_cib, dest_uname, &dest_node, 
           &is_remote_node)) {
239        fprintf(stderr, "Could not map name=%s to a UUID\n", dest_uname);
240    }

Maybe it should look like this?
238    rc = query_node_uuid(the_cib, dest_uname, &dest_node, &is_remote_node);
239    if (rc != pcmk_ok) {
240        fprintf(stderr, "Could not map name=%s to a UUID\n", dest_uname);
241        return crm_exit(rc);
242    }
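
Until it is fixed, a possible interim guard for scripts (a sketch; NODE and 
ATTR are placeholders):

out=$(crm_attribute --type nodes --node-uname "$NODE" \
      --attr-name "$ATTR" --query 2>&1)
if printf '%s\n' "$out" | grep -q 'Could not map'; then
    echo "UUID lookup failed for $NODE" >&2
    exit 1
fi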


19.11.2013, 16:12, "Andrey Groshev" :
> Hellow Andrew!
>
> I'm sorry, I forgot about this thread, and have now come across the same 
> problem again.
> # crm_attribute --type nodes --node-uname fackename.node.org --attr-name 
> notexistattibute --query > /dev/null; echo $?
> Could not map name=fackename.node.org to a UUID
> 0
>
> Version PCMK 1.1.11
>
> 23.09.2013, 08:23, "Andrew Beekhof" :
>
>>  On 20/09/2013, at 5:53 PM, Andrey Groshev  wrote:
>>>   Hi again!
>>>
>>>   Today I again encountered strange behavior.
>>>   I queried a non-existent attribute of an existing node.
>>>
>>>   # crm_attribute --type nodes --node-uname exist.node.domain.com 
>>> --attr-name notexistattibute --query  ; echo $?
>>>   Could not map name=dev-cluster2-node2.unix.tensor.ru to a UUID
>>>   0
>>>
>>>   That is, it complained to STDERR, but the exit code is 0.
>>  That probably shouldn't happen.  Version?
>>>   ___
>>>   Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>   http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>>   Project Home: http://www.clusterlabs.org
>>>   Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>   Bugs: http://bugs.clusterlabs.org
>>  ,
>>  ___
>>  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] some questions about STONITH

2013-11-19 Thread Andrey Groshev


19.11.2013, 23:17, "Lars Marowsky-Bree" :
> On 2013-11-19T23:06:04, Andrey Groshev  wrote:
>
>>>  First, like digimer wrote, clearly stonith-by-ssh is useless for
>>>  production since you can't fence nodes that are having problems. But for
>>>  testing, it's worth a try.
>>  Maybe I do not understand the term "fence" quite correctly.
>
> A "fence" request is executed when a node is deemed to be in an
> untrustworthy state - when a stop has failed, or when a network error
> occurs. Note that in the last case, login via ssh is obviously no longer
> possible at all.

In the last case the node is conditionally fenced. )
As I understand it, by "fence" you all mean powering the node off or 
disconnecting it from the network. Yes?

> With the new fence-topology, you could try ssh first before escalating
> to a real fencing mechanism, but why bother?
>
>>>  Note that cluster-glue actually does include an external/ssh script.
>>>  You're reinventing the wheel ;-)
>>  I've seen your script, thanks for the example.
>>  But my wheels are hard! :)
>>  I need authorization by key, but I do not want to mix those keys with 
>> /root/.ssh/...
>
> Why not extend the existing agent rather than writing your own?
Your code is very much tied to the host list.
I was not sure I could quickly realize my idea based on your code.
I will certainly share my code if it turns out to be something worthwhile and 
I'm not ashamed to show it. :)

>
>>  It does not matter to me which server reboots, as long as the key matches.
>>  I know for certain that the server was rebooted.
>
> I'm not sure about the first sentence; clearly you care which server is
> rebooted, namely the one the cluster wants to have rebooted (or powered
> off), right? That must be a misunderstanding.

That's right!
In my case each cluster has a unique private key.
This key is only for the nodes of this cluster.
Hence, I do not check whether the node exists or is a member.
IMHO, the main task of STONITH is to shoot.
It shoots fine.
If it cannot do this - it returns an error.
But it will try to "reboot" the target even if it's an NSA server.  ;-)

> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, 
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] some questions about STONITH

2013-11-19 Thread Andrey Groshev


19.11.2013, 22:30, "Lars Marowsky-Bree" :
> On 2013-11-19T22:10:29, Andrey Groshev  wrote:
>
> First, like digimer wrote, clearly stonith-by-ssh is useless for
> production since you can't fence nodes that are having problems. But for
> testing, it's worth a try.

Maybe I do not understand the term "fence" quite correctly.
If I want, I can restart the server with a software profile that turns off all 
services except SSH.

> Note that cluster-glue actually does include an external/ssh script.
> You're reinventing the wheel ;-)

I've seen your script, thanks for the example.
But my wheels are hard! :)
I need authorization by key, but I do not want to mix those keys with 
/root/.ssh/...
It does not matter to me which server reboots, as long as the key matches.
I know for certain that the server was rebooted.

>>  Run the next test:
>>  # stonith_admin --reboot=dev-cluster2-node2
>>  The node reboots, but resources don't start.
>>  In the crm_mon status - Node dev-cluster2-node2 (172793105): pending.
>>  And it hangs there.
>
> That is *probably* a race - the node reboots too fast, or still
> communicates for a bit after the fence has supposedly completed (if it's
> not a reboot -nf, but a mere reboot). We have had problems here in the
> past.
>
> You may want to file a proper bug report with crm_report included, and
> preferably corosync/pacemaker debugging enabled.

I'll try to do it in the morning.

>>  2.
>>  There is a slight discrepancy in the Pacemaker Expl. and stonith_admin 
>> --help.
>>  stonith_admin --reboot nodename.
>>  In one case, the sign of equality is, in other - no.
>>  Not very important, because operate both.
>
> Yeah, like you said, both work. So it's not actually a problem.
>
> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, 
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] some questions about STONITH

2013-11-19 Thread Andrey Groshev
Hi everyone again.

I have started experimenting with STONITH.
I wrote a little STONITH external script.
Its main points:
* it sends the command "reboot" over SSH, authenticating with a key.
* the script takes a single parameter - the path to the private key.
* any node can send a reboot to any node (even itself) - a rough skeleton is 
sketched below.
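
The skeleton follows the usual external-plugin contract (a sketch, not my 
actual script; cluster-glue passes the action as $1, the target as $2, and 
parameters such as path2key via environment variables - the getinfo-* actions 
are omitted here):

#!/bin/sh
# sketch of an external/ssh-style STONITH plugin; error handling trimmed
action="$1"
target="$2"

case "$action" in
    reset|off)
        # reboot the target over ssh, authenticating with the configured key
        ssh -i "$path2key" -o ConnectTimeout=10 "root@$target" reboot
        ;;
    on)
        exit 1          # cannot power on a switched-off host over ssh
        ;;
    status)
        exit 0          # the "device" (ssh itself) is always available
        ;;
    gethosts)
        exit 0          # no static host list (pcmk_host_check="none")
        ;;
    getconfignames)
        echo "path2key"
        ;;
    *)
        exit 1
        ;;
esac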

In the crm config it looks like this:
property $id="cib-bootstrap-options" \
stonith-enabled="true"
primitive st1 stonith:external/sshbykey \
params path2key="/opt/cluster_tools_2/keys/root@dev-cluster2-master" 
pcmk_host_check="none"
clone cloneStonith st1

I ran the first test - OK, the node was rebooted and the resources started.
# export path2key=/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru
# stonith -t external/sshbykey -E dev-cluster2-node1
info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset 
dev-cluster2-node1' output: Now boot time 1384850888, send reboot

info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset 
dev-cluster2-node1' output: Daration: 1340 sec.

info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset 
dev-cluster2-node1' output: GOOD NEWS: dev-cluster2-node1 booted in 1384864288

Don't pay attention to the "Duration": it is caused by the time jump before 
time synchronization between the virtual machine and the server. What matters 
here is that the value changed, not the specific number of seconds. Subsequent 
reboots take 10 - 20 sec.

But further on there are problems and questions. :)
1. 
Run the next test:
# stonith_admin --reboot=dev-cluster2-node2
The node reboots, but resources don't start.
In the crm_mon status - Node dev-cluster2-node2 (172793105): pending.
And it hangs there.
Next, if I reboot this node again from the console, or via stonith, or via 
stonith_admin (the same command!) - the resources start.

Portions of the logs:
   trace: unpack_status:Processing node id=172793105, 
uname=dev-cluster2-node2
   trace: find_xml_node:Could not find transient_attributes in 
node_state.
   trace: unpack_instance_attributes:   No instance attributes
   trace: unpack_status:determining node state
   trace: determine_online_status_fencing:  dev-cluster2-node2: 
in_cluster=false, is_peer=online, join=down, expected=down, term=0
info: determine_online_status_fencing:  - Node dev-cluster2-node2 is 
not ready to run resources
   trace: determine_online_status:  Node dev-cluster2-node2 is offline

   
   
   trace: unpack_status:Processing lrm resource entries on healthy 
node: dev-cluster2-node2
   trace: find_xml_node:Could not find lrm in node_state.
   trace: find_xml_node:Could not find lrm_resources in .
   trace: unpack_lrm_resources: Unpacking resources on 
dev-cluster2-node2

   ..
   trace: can_run_resources:dev-cluster2-node2: online=0, unclean=0, 
standby=1, maintenance=0
   trace: check_actions:Skipping param check for dev-cluster2-node2: 
cant run resources
...
   trace: native_color: Pre-allloc: VirtualIP allocation score on 
dev-cluster2-node2: 0
...

Why this behavior?
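
(To dig further, one could also ask stonithd what it has recorded for the 
target - a sketch, assuming this build of stonith_admin already supports the 
history option:)

# stonith_admin --history dev-cluster2-node2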

2. 
There is a slight discrepancy between Pacemaker Explained and stonith_admin --help:
stonith_admin --reboot nodename. 
In one case there is an equals sign, in the other there is not.
Not very important, because both forms work.
But when you start working and something goes wrong, you begin to suspect 
every little oddity. :)

3. 
Andrew! You promised a post about debugging STONITH.

4. (to ALL)
Also, please tell me the real arguments against using SSH for STONITH.
I have my own guesses and thoughts, but I would like to hear about your experience.

My environment:
corosync-2.3.2
resource-agents-3.9.5
pacemaker 1.1.11

Thanks in advance,
Andrey Groshev

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] exit code crm_attibute

2013-11-19 Thread Andrey Groshev
Hellow Andrew!

I'm sorry, I forgot about this thread, and have now come across the same problem again.
# crm_attribute --type nodes --node-uname fackename.node.org --attr-name 
notexistattibute --query > /dev/null; echo $?
Could not map name=fackename.node.org to a UUID
0

Version PCMK 1.1.11

23.09.2013, 08:23, "Andrew Beekhof" :
> On 20/09/2013, at 5:53 PM, Andrey Groshev  wrote:
>
>>  Hi again!
>>
>>  Today I again encountered strange behavior.
>>  I queried a non-existent attribute of an existing node.
>>
>>  # crm_attribute --type nodes --node-uname exist.node.domain.com --attr-name 
>> notexistattibute --query  ; echo $?
>>  Could not map name=dev-cluster2-node2.unix.tensor.ru to a UUID
>>  0
>>
>>  That is, it complained to STDERR, but the exit code is 0.
>
> That probably shouldn't happen.  Version?
>
>>  ___
>>  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org
>
> ,
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] why pacemaker does not control the resources

2013-11-18 Thread Andrey Groshev


15.11.2013, 03:19, "Andrew Beekhof" :
> On 14 Nov 2013, at 5:06 pm, Andrey Groshev  wrote:
>
>>  14.11.2013, 02:22, "Andrew Beekhof" :
>>>  On 14 Nov 2013, at 6:13 am, Andrey Groshev  wrote:
>>>>   13.11.2013, 03:22, "Andrew Beekhof" :
>>>>>   On 12 Nov 2013, at 4:42 pm, Andrey Groshev  wrote:
>>>>>>    11.11.2013, 03:44, "Andrew Beekhof" :
>>>>>>>    On 8 Nov 2013, at 7:49 am, Andrey Groshev  wrote:
>>>>>>>> Hi, PPL!
>>>>>>>> I need help. I do not understand... Why has stopped working.
>>>>>>>> This configuration work on other cluster, but on corosync1.
>>>>>>>>
>>>>>>>> So... cluster postgres with master/slave.
>>>>>>>> Classic config as in wiki.
>>>>>>>> I build cluster, start, he is working.
>>>>>>>> Next I kill postgres on Master with 6 signal, as if "disk space 
>>>>>>>> left"
>>>>>>>>
>>>>>>>> # pkill -6 postgres
>>>>>>>> # ps axuww|grep postgres
>>>>>>>> root  9032  0.0  0.1 103236   860 pts/0    S+   00:37   0:00 
>>>>>>>> grep postgres
>>>>>>>>
>>>>>>>> PostgreSQL die, But crm_mon shows that the master is still running.
>>>>>>>>
>>>>>>>> Last updated: Fri Nov  8 00:42:08 2013
>>>>>>>> Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on 
>>>>>>>> dev-cluster2-node4
>>>>>>>> Stack: corosync
>>>>>>>> Current DC: dev-cluster2-node4 (172793107) - partition with quorum
>>>>>>>> Version: 1.1.10-1.el6-368c726
>>>>>>>> 3 Nodes configured
>>>>>>>> 7 Resources configured
>>>>>>>>
>>>>>>>> Node dev-cluster2-node2 (172793105): online
>>>>>>>>    pingCheck   (ocf::pacemaker:ping):  Started
>>>>>>>>    pgsql   (ocf::heartbeat:pgsql): Started
>>>>>>>> Node dev-cluster2-node3 (172793106): online
>>>>>>>>    pingCheck   (ocf::pacemaker:ping):  Started
>>>>>>>>    pgsql   (ocf::heartbeat:pgsql): Started
>>>>>>>> Node dev-cluster2-node4 (172793107): online
>>>>>>>>    pgsql   (ocf::heartbeat:pgsql): Master
>>>>>>>>    pingCheck   (ocf::pacemaker:ping):  Started
>>>>>>>>    VirtualIP   (ocf::heartbeat:IPaddr2):   Started
>>>>>>>>
>>>>>>>> Node Attributes:
>>>>>>>> * Node dev-cluster2-node2:
>>>>>>>>    + default_ping_set  : 100
>>>>>>>>    + master-pgsql  : -INFINITY
>>>>>>>>    + pgsql-data-status : STREAMING|ASYNC
>>>>>>>>    + pgsql-status  : HS:async
>>>>>>>> * Node dev-cluster2-node3:
>>>>>>>>    + default_ping_set  : 100
>>>>>>>>    + master-pgsql  : -INFINITY
>>>>>>>>    + pgsql-data-status : STREAMING|ASYNC
>>>>>>>>    + pgsql-status  : HS:async
>>>>>>>> * Node dev-cluster2-node4:
>>>>>>>>    + default_ping_set  : 100
>>>>>>>>    + master-pgsql  : 1000
>>>>>>>>    + pgsql-data-status : LATEST
>>>>>>>>    + pgsql-master-baseline : 0278
>>>>>>>>    + pgsql-status  : PRI
>>>>>>>>
>>>>>>>> Migration summary:
>>>>>>>> * Node dev-cluster2-node4:
>>>>>>>> * Node dev-cluster2-node2:
>>>>>>>> * Node dev-cluster2-node3:
>>>>>>>>
>>>>>>>> Tickets:
>>>>>>>>
>>>>>>>> CONFIG:
>>>>>>>>  

Re: [Pacemaker] why pacemaker does not control the resources

2013-11-13 Thread Andrey Groshev


14.11.2013, 02:22, "Andrew Beekhof" :
> On 14 Nov 2013, at 6:13 am, Andrey Groshev  wrote:
>
>>  13.11.2013, 03:22, "Andrew Beekhof" :
>>>  On 12 Nov 2013, at 4:42 pm, Andrey Groshev  wrote:
>>>>   11.11.2013, 03:44, "Andrew Beekhof" :
>>>>>   On 8 Nov 2013, at 7:49 am, Andrey Groshev  wrote:
>>>>>>    Hi, PPL!
>>>>>>    I need help. I do not understand... Why has stopped working.
>>>>>>    This configuration work on other cluster, but on corosync1.
>>>>>>
>>>>>>    So... cluster postgres with master/slave.
>>>>>>    Classic config as in wiki.
>>>>>>    I build cluster, start, he is working.
>>>>>>    Next I kill postgres on Master with 6 signal, as if "disk space left"
>>>>>>
>>>>>>    # pkill -6 postgres
>>>>>>    # ps axuww|grep postgres
>>>>>>    root  9032  0.0  0.1 103236   860 pts/0    S+   00:37   0:00 grep 
>>>>>> postgres
>>>>>>
>>>>>>    PostgreSQL die, But crm_mon shows that the master is still running.
>>>>>>
>>>>>>    Last updated: Fri Nov  8 00:42:08 2013
>>>>>>    Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on 
>>>>>> dev-cluster2-node4
>>>>>>    Stack: corosync
>>>>>>    Current DC: dev-cluster2-node4 (172793107) - partition with quorum
>>>>>>    Version: 1.1.10-1.el6-368c726
>>>>>>    3 Nodes configured
>>>>>>    7 Resources configured
>>>>>>
>>>>>>    Node dev-cluster2-node2 (172793105): online
>>>>>>   pingCheck   (ocf::pacemaker:ping):  Started
>>>>>>   pgsql   (ocf::heartbeat:pgsql): Started
>>>>>>    Node dev-cluster2-node3 (172793106): online
>>>>>>   pingCheck   (ocf::pacemaker:ping):  Started
>>>>>>   pgsql   (ocf::heartbeat:pgsql): Started
>>>>>>    Node dev-cluster2-node4 (172793107): online
>>>>>>   pgsql   (ocf::heartbeat:pgsql): Master
>>>>>>   pingCheck   (ocf::pacemaker:ping):  Started
>>>>>>   VirtualIP   (ocf::heartbeat:IPaddr2):   Started
>>>>>>
>>>>>>    Node Attributes:
>>>>>>    * Node dev-cluster2-node2:
>>>>>>   + default_ping_set  : 100
>>>>>>   + master-pgsql  : -INFINITY
>>>>>>   + pgsql-data-status : STREAMING|ASYNC
>>>>>>   + pgsql-status  : HS:async
>>>>>>    * Node dev-cluster2-node3:
>>>>>>   + default_ping_set  : 100
>>>>>>   + master-pgsql  : -INFINITY
>>>>>>   + pgsql-data-status : STREAMING|ASYNC
>>>>>>   + pgsql-status  : HS:async
>>>>>>    * Node dev-cluster2-node4:
>>>>>>   + default_ping_set  : 100
>>>>>>   + master-pgsql  : 1000
>>>>>>   + pgsql-data-status : LATEST
>>>>>>   + pgsql-master-baseline : 0278
>>>>>>   + pgsql-status  : PRI
>>>>>>
>>>>>>    Migration summary:
>>>>>>    * Node dev-cluster2-node4:
>>>>>>    * Node dev-cluster2-node2:
>>>>>>    * Node dev-cluster2-node3:
>>>>>>
>>>>>>    Tickets:
>>>>>>
>>>>>>    CONFIG:
>>>>>>    node $id="172793105" dev-cluster2-node2. \
>>>>>>   attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>>>>    node $id="172793106" dev-cluster2-node3. \
>>>>>>   attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>>>>    node $id="172793107" dev-cluster2-node4. \
>>>>>>   attributes pgsql-data-status="LATEST"
>>>>>>    primitive VirtualIP ocf:heartbeat:IPaddr2 \
>>>>>>   params ip="10.76.157.194" \
>>>>>>   op start interval="0"

Re: [Pacemaker] why pacemaker does not control the resources

2013-11-13 Thread Andrey Groshev


13.11.2013, 03:22, "Andrew Beekhof" :
> On 12 Nov 2013, at 4:42 pm, Andrey Groshev  wrote:
>
>>  11.11.2013, 03:44, "Andrew Beekhof" :
>>>  On 8 Nov 2013, at 7:49 am, Andrey Groshev  wrote:
>>>>   Hi, PPL!
>>>>   I need help. I do not understand... Why has stopped working.
>>>>   This configuration work on other cluster, but on corosync1.
>>>>
>>>>   So... cluster postgres with master/slave.
>>>>   Classic config as in wiki.
>>>>   I build cluster, start, he is working.
>>>>   Next I kill postgres on Master with 6 signal, as if "disk space left"
>>>>
>>>>   # pkill -6 postgres
>>>>   # ps axuww|grep postgres
>>>>   root  9032  0.0  0.1 103236   860 pts/0    S+   00:37   0:00 grep 
>>>> postgres
>>>>
>>>>   PostgreSQL die, But crm_mon shows that the master is still running.
>>>>
>>>>   Last updated: Fri Nov  8 00:42:08 2013
>>>>   Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on 
>>>> dev-cluster2-node4
>>>>   Stack: corosync
>>>>   Current DC: dev-cluster2-node4 (172793107) - partition with quorum
>>>>   Version: 1.1.10-1.el6-368c726
>>>>   3 Nodes configured
>>>>   7 Resources configured
>>>>
>>>>   Node dev-cluster2-node2 (172793105): online
>>>>  pingCheck   (ocf::pacemaker:ping):  Started
>>>>  pgsql   (ocf::heartbeat:pgsql): Started
>>>>   Node dev-cluster2-node3 (172793106): online
>>>>  pingCheck   (ocf::pacemaker:ping):  Started
>>>>  pgsql   (ocf::heartbeat:pgsql): Started
>>>>   Node dev-cluster2-node4 (172793107): online
>>>>  pgsql   (ocf::heartbeat:pgsql): Master
>>>>  pingCheck   (ocf::pacemaker:ping):  Started
>>>>  VirtualIP   (ocf::heartbeat:IPaddr2):   Started
>>>>
>>>>   Node Attributes:
>>>>   * Node dev-cluster2-node2:
>>>>  + default_ping_set  : 100
>>>>  + master-pgsql  : -INFINITY
>>>>  + pgsql-data-status : STREAMING|ASYNC
>>>>  + pgsql-status  : HS:async
>>>>   * Node dev-cluster2-node3:
>>>>  + default_ping_set  : 100
>>>>  + master-pgsql  : -INFINITY
>>>>  + pgsql-data-status : STREAMING|ASYNC
>>>>  + pgsql-status  : HS:async
>>>>   * Node dev-cluster2-node4:
>>>>  + default_ping_set  : 100
>>>>  + master-pgsql  : 1000
>>>>  + pgsql-data-status : LATEST
>>>>  + pgsql-master-baseline : 0278
>>>>  + pgsql-status  : PRI
>>>>
>>>>   Migration summary:
>>>>   * Node dev-cluster2-node4:
>>>>   * Node dev-cluster2-node2:
>>>>   * Node dev-cluster2-node3:
>>>>
>>>>   Tickets:
>>>>
>>>>   CONFIG:
>>>>   node $id="172793105" dev-cluster2-node2. \
>>>>  attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>>   node $id="172793106" dev-cluster2-node3. \
>>>>  attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>>   node $id="172793107" dev-cluster2-node4. \
>>>>  attributes pgsql-data-status="LATEST"
>>>>   primitive VirtualIP ocf:heartbeat:IPaddr2 \
>>>>  params ip="10.76.157.194" \
>>>>  op start interval="0" timeout="60s" on-fail="stop" \
>>>>  op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>>  op stop interval="0" timeout="60s" on-fail="block"
>>>>   primitive pgsql ocf:heartbeat:pgsql \
>>>>  params pgctl="/usr/pgsql-9.1/bin/pg_ctl" 
>>>> psql="/usr/pgsql-9.1/bin/psql" pgdata="/var/lib/pgsql/9.1/data" 
>>>> tmpdir="/tmp/pg" start_opt="-p 5432" 
>>>> logfile="/var/lib/pgsql/9.1//pgstartup.log" rep_mode="async" node_list=" 
>>>> dev-cluster2-node2. dev-cluster2-node3. dev-c

Re: [Pacemaker] why pacemaker does not control the resources

2013-11-11 Thread Andrey Groshev


11.11.2013, 03:44, "Andrew Beekhof" :
> On 8 Nov 2013, at 7:49 am, Andrey Groshev  wrote:
>
>>  Hi, PPL!
>>  I need help. I do not understand... Why has stopped working.
>>  This configuration work on other cluster, but on corosync1.
>>
>>  So... cluster postgres with master/slave.
>>  Classic config as in wiki.
>>  I build cluster, start, he is working.
>>  Next I kill postgres on Master with 6 signal, as if "disk space left"
>>
>>  # pkill -6 postgres
>>  # ps axuww|grep postgres
>>  root  9032  0.0  0.1 103236   860 pts/0    S+   00:37   0:00 grep 
>> postgres
>>
>>  PostgreSQL die, But crm_mon shows that the master is still running.
>>
>>  Last updated: Fri Nov  8 00:42:08 2013
>>  Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on 
>> dev-cluster2-node4
>>  Stack: corosync
>>  Current DC: dev-cluster2-node4 (172793107) - partition with quorum
>>  Version: 1.1.10-1.el6-368c726
>>  3 Nodes configured
>>  7 Resources configured
>>
>>  Node dev-cluster2-node2 (172793105): online
>> pingCheck   (ocf::pacemaker:ping):  Started
>> pgsql   (ocf::heartbeat:pgsql): Started
>>  Node dev-cluster2-node3 (172793106): online
>> pingCheck   (ocf::pacemaker:ping):  Started
>> pgsql   (ocf::heartbeat:pgsql): Started
>>  Node dev-cluster2-node4 (172793107): online
>> pgsql   (ocf::heartbeat:pgsql): Master
>> pingCheck   (ocf::pacemaker:ping):  Started
>> VirtualIP   (ocf::heartbeat:IPaddr2):   Started
>>
>>  Node Attributes:
>>  * Node dev-cluster2-node2:
>> + default_ping_set  : 100
>> + master-pgsql  : -INFINITY
>> + pgsql-data-status : STREAMING|ASYNC
>> + pgsql-status  : HS:async
>>  * Node dev-cluster2-node3:
>> + default_ping_set  : 100
>> + master-pgsql  : -INFINITY
>> + pgsql-data-status : STREAMING|ASYNC
>> + pgsql-status  : HS:async
>>  * Node dev-cluster2-node4:
>> + default_ping_set  : 100
>> + master-pgsql  : 1000
>> + pgsql-data-status : LATEST
>> + pgsql-master-baseline : 0278
>> + pgsql-status  : PRI
>>
>>  Migration summary:
>>  * Node dev-cluster2-node4:
>>  * Node dev-cluster2-node2:
>>  * Node dev-cluster2-node3:
>>
>>  Tickets:
>>
>>  CONFIG:
>>  node $id="172793105" dev-cluster2-node2. \
>> attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>  node $id="172793106" dev-cluster2-node3. \
>> attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>  node $id="172793107" dev-cluster2-node4. \
>> attributes pgsql-data-status="LATEST"
>>  primitive VirtualIP ocf:heartbeat:IPaddr2 \
>> params ip="10.76.157.194" \
>> op start interval="0" timeout="60s" on-fail="stop" \
>> op monitor interval="10s" timeout="60s" on-fail="restart" \
>> op stop interval="0" timeout="60s" on-fail="block"
>>  primitive pgsql ocf:heartbeat:pgsql \
>> params pgctl="/usr/pgsql-9.1/bin/pg_ctl" 
>> psql="/usr/pgsql-9.1/bin/psql" pgdata="/var/lib/pgsql/9.1/data" 
>> tmpdir="/tmp/pg" start_opt="-p 5432" 
>> logfile="/var/lib/pgsql/9.1//pgstartup.log" rep_mode="async" node_list=" 
>> dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. " 
>> restore_command="gzip -cd 
>> /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz > %p" 
>> primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 
>> keepalives_count=5" master_ip="10.76.157.194" \
>> op start interval="0" timeout="60s" on-fail="restart" \
>> op monitor interval="5s" timeout="61s" on-fail="restart" \
>> op monitor interval="1s" role="Master" timeout="62s" 
>> on-fail="restart" \
>> op promote interval="0" timeout="63s" on-fail="restart" \
>> op demote interval="0" timeout="64s" 

[Pacemaker] why pacemaker does not control the resources

2013-11-07 Thread Andrey Groshev
Hi, PPL!
I need help. I do not understand why this has stopped working.
This configuration works on another cluster, but with corosync1.

So... a postgres cluster with master/slave.
Classic config as in the wiki.
I build the cluster, start it, and it works.
Next I kill postgres on the Master with signal 6, as if "no disk space left":

# pkill -6 postgres
# ps axuww|grep postgres
root  9032  0.0  0.1 103236   860 pts/0S+   00:37   0:00 grep postgres 

PostgreSQL dies, but crm_mon shows that the master is still running.

Last updated: Fri Nov  8 00:42:08 2013
Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on dev-cluster2-node4
Stack: corosync
Current DC: dev-cluster2-node4 (172793107) - partition with quorum
Version: 1.1.10-1.el6-368c726
3 Nodes configured
7 Resources configured


Node dev-cluster2-node2 (172793105): online
pingCheck   (ocf::pacemaker:ping):  Started
pgsql   (ocf::heartbeat:pgsql): Started
Node dev-cluster2-node3 (172793106): online
pingCheck   (ocf::pacemaker:ping):  Started
pgsql   (ocf::heartbeat:pgsql): Started
Node dev-cluster2-node4 (172793107): online
pgsql   (ocf::heartbeat:pgsql): Master
pingCheck   (ocf::pacemaker:ping):  Started
VirtualIP   (ocf::heartbeat:IPaddr2):   Started

Node Attributes:
* Node dev-cluster2-node2:
+ default_ping_set  : 100
+ master-pgsql  : -INFINITY 
+ pgsql-data-status : STREAMING|ASYNC
+ pgsql-status  : HS:async  
* Node dev-cluster2-node3:
+ default_ping_set  : 100
+ master-pgsql  : -INFINITY 
+ pgsql-data-status : STREAMING|ASYNC
+ pgsql-status  : HS:async  
* Node dev-cluster2-node4:
+ default_ping_set  : 100
+ master-pgsql  : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 0278
+ pgsql-status  : PRI

Migration summary:
* Node dev-cluster2-node4: 
* Node dev-cluster2-node2: 
* Node dev-cluster2-node3: 

Tickets:

CONFIG:
node $id="172793105" dev-cluster2-node2. \
attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
node $id="172793106" dev-cluster2-node3. \
attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
node $id="172793107" dev-cluster2-node4. \
attributes pgsql-data-status="LATEST"
primitive VirtualIP ocf:heartbeat:IPaddr2 \
params ip="10.76.157.194" \
op start interval="0" timeout="60s" on-fail="stop" \
op monitor interval="10s" timeout="60s" on-fail="restart" \
op stop interval="0" timeout="60s" on-fail="block"
primitive pgsql ocf:heartbeat:pgsql \
params pgctl="/usr/pgsql-9.1/bin/pg_ctl" psql="/usr/pgsql-9.1/bin/psql" 
pgdata="/var/lib/pgsql/9.1/data" tmpdir="/tmp/pg" start_opt="-p 5432" 
logfile="/var/lib/pgsql/9.1//pgstartup.log" rep_mode="async" node_list=" 
dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. " 
restore_command="gzip -cd /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz 
> %p" primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 
keepalives_count=5" master_ip="10.76.157.194" \
op start interval="0" timeout="60s" on-fail="restart" \
op monitor interval="5s" timeout="61s" on-fail="restart" \
op monitor interval="1s" role="Master" timeout="62s" on-fail="restart" \
op promote interval="0" timeout="63s" on-fail="restart" \
op demote interval="0" timeout="64s" on-fail="stop" \
op stop interval="0" timeout="65s" on-fail="block" \
op notify interval="0" timeout="66s"
primitive pingCheck ocf:pacemaker:ping \
params name="default_ping_set" host_list="10.76.156.1" multiplier="100" 
\
op start interval="0" timeout="60s" on-fail="restart" \
op monitor interval="10s" timeout="60s" on-fail="restart" \
op stop interval="0" timeout="60s" on-fail="ignore"
ms msPostgresql pgsql \
meta master-max="1" master-node-max="1" clone-node-max="1" 
notify="true" target-role="Master" clone-max="3"
clone clnPingCheck pingCheck \
meta clone-max="3"
location l0_DontRunPgIfNotPingGW msPostgresql \
rule $id="l0_DontRunPgIfNotPingGW-rule" -inf: not_defined 
default_ping_set or default_ping_set lt 100
colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck
colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master
order rsc_order-1 0: clnPingCheck msPostgresql
order rsc_order-2 0: msPostgresql:promote VirtualIP:start symmetrical=false
order rsc_order-3 0: msPostgresql:demote VirtualIP:stop symmetrical=false
property $id="cib-bootstrap-options" \
dc-version="1.1.10-1.el6-368c726" \
cluster-infrastructure="corosync" \
stonith-enabled="false" \
no-quorum-policy="stop"
rsc_defaults $id="rsc-options" \
resource-stickiness="INFINITY" \
   

[Pacemaker] What value should be in the $OCF_RESKEY_CRM_meta_notify_slave_uname when a quorum is lost?

2013-11-05 Thread Andrey Groshev
Hi All!
I am interested in this subject because the following situation happens.
I built a cluster on four nodes with a postgres master/slave configuration,
set no-quorum-policy=stop,
ran the cluster, and conducted an experiment - turned off two of the nodes.
Resources were stopped, but not in the way I expected.

The resource-agents code contains the following lines:

if  [ "$1" = "master" -a "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " " ]; then
    ocf_log info "Removing $PGSQL_LOCK."
    rm -f $PGSQL_LOCK
fi

And when the "lost quorum" event occurs, this variable does not equal "empty".
It equals the last existing slave.
What should the correct value be?

I wrote to Takatoshi; he said it was Pacemaker that was not behaving correctly.
https://github.com/t-matsuo/resource-agents/issues/28


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] exit code crm_attibute

2013-09-20 Thread Andrey Groshev
Hi again!

Today I again encountered strange behavior.
I queried a non-existent attribute of an existing node.

# crm_attribute --type nodes --node-uname exist.node.domain.com --attr-name 
notexistattibute --query  ; echo $?
Could not map name=dev-cluster2-node2.unix.tensor.ru to a UUID
0

That is, it complained to STDERR, but the exit code is 0.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-09-10 Thread Andrey Groshev
Hello Christine, Andrew and all.

I'm sorry - I was a little unwell, so I did not answer.
How shall we conclude this stream of messages?
Who will change: corosync or pacemaker?


05.09.2013, 15:49, "Christine Caulfield" :
> On 05/09/13 11:33, Andrew Beekhof wrote:
>
>>  On 05/09/2013, at 6:37 PM, Christine Caulfield  wrote:
>>>  On 03/09/13 22:03, Andrew Beekhof wrote:
>>>>  On 03/09/2013, at 11:49 PM, Christine Caulfield  
>>>> wrote:
>>>>>  On 03/09/13 05:20, Andrew Beekhof wrote:
>>>>>>  On 02/09/2013, at 5:27 PM, Andrey Groshev  wrote:
>>>>>>>  30.08.2013, 07:18, "Andrew Beekhof" :
>>>>>>>>  On 29/08/2013, at 7:31 PM, Andrey Groshev  wrote:
>>>>>>>>>    29.08.2013, 12:25, "Andrey Groshev" :
>>>>>>>>>>    29.08.2013, 02:55, "Andrew Beekhof" :
>>>>>>>>>>> On 28/08/2013, at 5:38 PM, Andrey Groshev  
>>>>>>>>>>> wrote:
>>>>>>>>>>>>  28.08.2013, 04:06, "Andrew Beekhof" :
>>>>>>>>>>>>>  On 27/08/2013, at 1:13 PM, Andrey Groshev  
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>   27.08.2013, 05:39, "Andrew Beekhof" :
>>>>>>>>>>>>>>>   On 26/08/2013, at 3:09 PM, Andrey Groshev 
>>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>>>    26.08.2013, 03:34, "Andrew Beekhof" 
>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>    On 23/08/2013, at 9:39 PM, Andrey Groshev 
>>>>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Today I try remake my test cluster from cman to 
>>>>>>>>>>>>>>>>>> corosync2.
>>>>>>>>>>>>>>>>>> I drew attention to the following:
>>>>>>>>>>>>>>>>>> If I reset cluster with cman through cibadmin 
>>>>>>>>>>>>>>>>>> --erase --force
>>>>>>>>>>>>>>>>>> In cib is still there exist names of nodes.
>>>>>>>>>>>>>>>>>    Yes, the cluster puts back entries for all the nodes 
>>>>>>>>>>>>>>>>> it know about automagically.
>>>>>>>>>>>>>>>>>> cibadmin -Ql
>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>    
>>>>>>>>>>>>>>>>>>  >>>>>>>>>>>>>>>>> uname="dev-cluster2-node2"/>
>>>>>>>>>>>>>>>>>>  >>>>>>>>>>>>>>>>> uname="dev-cluster2-node4"/>
>>>>>>>>>>>>>>>>>>  >>>>>>>>>>>>>>>>> uname="dev-cluster2-node3"/>
>>>>>>>>>>>>>>>>>>    
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Even if cman and pacemaker running only one node.
>>>>>>>>>>>>>>>>>    I'm assuming all three are configured in cluster.conf?
>>>>>>>>>>>>>>>>    Yes, there exist list nodes.
>>>>>>>>>>>>>>>>>> And if I do too on cluster with corosync2
>>>>>>>>>>>>>>>>>> I see only names of nodes which run corosync and 
>>>>>>>>>>>>>>>>>> pacemaker.
>>>>>>>>>>>>>>>>>    Si

Re: [Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-09-03 Thread Andrey Groshev


04.09.2013, 01:06, "Andrew Beekhof" :
> On 03/09/2013, at 11:46 PM, Andrey Groshev  wrote:
>
>>  03.09.2013, 08:27, "Andrew Beekhof" :
>>>  On 02/09/2013, at 5:27 PM, Andrey Groshev  wrote:
>>>>   30.08.2013, 07:18, "Andrew Beekhof" :
>>>>>   On 29/08/2013, at 7:31 PM, Andrey Groshev  wrote:
>>>>>>    29.08.2013, 12:25, "Andrey Groshev" :
>>>>>>>    29.08.2013, 02:55, "Andrew Beekhof" :
>>>>>>>>     On 28/08/2013, at 5:38 PM, Andrey Groshev  wrote:
>>>>>>>>>  28.08.2013, 04:06, "Andrew Beekhof" :
>>>>>>>>>>  On 27/08/2013, at 1:13 PM, Andrey Groshev  
>>>>>>>>>> wrote:
>>>>>>>>>>>   27.08.2013, 05:39, "Andrew Beekhof" :
>>>>>>>>>>>>   On 26/08/2013, at 3:09 PM, Andrey Groshev  
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>    26.08.2013, 03:34, "Andrew Beekhof" :
>>>>>>>>>>>>>>    On 23/08/2013, at 9:39 PM, Andrey Groshev 
>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Today I try remake my test cluster from cman to 
>>>>>>>>>>>>>>> corosync2.
>>>>>>>>>>>>>>> I drew attention to the following:
>>>>>>>>>>>>>>> If I reset cluster with cman through cibadmin --erase 
>>>>>>>>>>>>>>> --force
>>>>>>>>>>>>>>> In cib is still there exist names of nodes.
>>>>>>>>>>>>>>    Yes, the cluster puts back entries for all the nodes it 
>>>>>>>>>>>>>> know about automagically.
>>>>>>>>>>>>>>> cibadmin -Ql
>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>    
>>>>>>>>>>>>>>>  >>>>>>>>>>>>>> uname="dev-cluster2-node2"/>
>>>>>>>>>>>>>>>  >>>>>>>>>>>>>> uname="dev-cluster2-node4"/>
>>>>>>>>>>>>>>>  >>>>>>>>>>>>>> uname="dev-cluster2-node3"/>
>>>>>>>>>>>>>>>    
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Even if cman and pacemaker running only one node.
>>>>>>>>>>>>>>    I'm assuming all three are configured in cluster.conf?
>>>>>>>>>>>>>    Yes, there exist list nodes.
>>>>>>>>>>>>>>> And if I do too on cluster with corosync2
>>>>>>>>>>>>>>> I see only names of nodes which run corosync and 
>>>>>>>>>>>>>>> pacemaker.
>>>>>>>>>>>>>>    Since you're not included your config, I can only guess 
>>>>>>>>>>>>>> that your corosync.conf does not have a nodelist.
>>>>>>>>>>>>>>    If it did, you should get the same behaviour.
>>>>>>>>>>>>>    I try and expected_node and nodelist.
>>>>>>>>>>>>   And it didn't work? What version of pacemaker?
>>>>>>>>>>>   It does not work as I expected.
>>>>>>>>>>  Thats because you've used IP addresses in the node list.
>>>>>>>>>>  ie.
>>>>>>>>>>
>>>>>>>>>>  node {
>>>>>>>>>>    ring0_addr: 10.76.157.17
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>>

Re: [Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-09-03 Thread Andrey Groshev


03.09.2013, 17:52, "Christine Caulfield" :
> On 03/09/13 05:20, Andrew Beekhof wrote:
>
>>  On 02/09/2013, at 5:27 PM, Andrey Groshev  wrote:
>>>  30.08.2013, 07:18, "Andrew Beekhof" :
>>>>  On 29/08/2013, at 7:31 PM, Andrey Groshev  wrote:
>>>>>    29.08.2013, 12:25, "Andrey Groshev" :
>>>>>>    29.08.2013, 02:55, "Andrew Beekhof" :
>>>>>>> On 28/08/2013, at 5:38 PM, Andrey Groshev  wrote:
>>>>>>>>  28.08.2013, 04:06, "Andrew Beekhof" :
>>>>>>>>>  On 27/08/2013, at 1:13 PM, Andrey Groshev  
>>>>>>>>> wrote:
>>>>>>>>>>   27.08.2013, 05:39, "Andrew Beekhof" :
>>>>>>>>>>>   On 26/08/2013, at 3:09 PM, Andrey Groshev  
>>>>>>>>>>> wrote:
>>>>>>>>>>>>    26.08.2013, 03:34, "Andrew Beekhof" :
>>>>>>>>>>>>>    On 23/08/2013, at 9:39 PM, Andrey Groshev 
>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Today I try remake my test cluster from cman to 
>>>>>>>>>>>>>> corosync2.
>>>>>>>>>>>>>> I drew attention to the following:
>>>>>>>>>>>>>> If I reset cluster with cman through cibadmin --erase 
>>>>>>>>>>>>>> --force
>>>>>>>>>>>>>> In cib is still there exist names of nodes.
>>>>>>>>>>>>>    Yes, the cluster puts back entries for all the nodes it 
>>>>>>>>>>>>> know about automagically.
>>>>>>>>>>>>>> cibadmin -Ql
>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>    
>>>>>>>>>>>>>>  >>>>>>>>>>>>> uname="dev-cluster2-node2"/>
>>>>>>>>>>>>>>  >>>>>>>>>>>>> uname="dev-cluster2-node4"/>
>>>>>>>>>>>>>>  >>>>>>>>>>>>> uname="dev-cluster2-node3"/>
>>>>>>>>>>>>>>    
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Even if cman and pacemaker running only one node.
>>>>>>>>>>>>>    I'm assuming all three are configured in cluster.conf?
>>>>>>>>>>>>    Yes, there exist list nodes.
>>>>>>>>>>>>>> And if I do too on cluster with corosync2
>>>>>>>>>>>>>> I see only names of nodes which run corosync and 
>>>>>>>>>>>>>> pacemaker.
>>>>>>>>>>>>>    Since you're not included your config, I can only guess 
>>>>>>>>>>>>> that your corosync.conf does not have a nodelist.
>>>>>>>>>>>>>    If it did, you should get the same behaviour.
>>>>>>>>>>>>    I try and expected_node and nodelist.
>>>>>>>>>>>   And it didn't work? What version of pacemaker?
>>>>>>>>>>   It does not work as I expected.
>>>>>>>>>  Thats because you've used IP addresses in the node list.
>>>>>>>>>  ie.
>>>>>>>>>
>>>>>>>>>  node {
>>>>>>>>>    ring0_addr: 10.76.157.17
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>>  try including the node name as well, eg.
>>>>>>>>>
>>>>>>>>>  node {
>>>>>>>>>    name: dev-cluster2-node2
>>>>>>>>>    ring0_addr: 10.76.157.17
>>>>>>&

Re: [Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-09-03 Thread Andrey Groshev


03.09.2013, 08:27, "Andrew Beekhof" :
> On 02/09/2013, at 5:27 PM, Andrey Groshev  wrote:
>
>>  30.08.2013, 07:18, "Andrew Beekhof" :
>>>  On 29/08/2013, at 7:31 PM, Andrey Groshev  wrote:
>>>>   29.08.2013, 12:25, "Andrey Groshev" :
>>>>>   29.08.2013, 02:55, "Andrew Beekhof" :
>>>>>>    On 28/08/2013, at 5:38 PM, Andrey Groshev  wrote:
>>>>>>> 28.08.2013, 04:06, "Andrew Beekhof" :
>>>>>>>>     On 27/08/2013, at 1:13 PM, Andrey Groshev  wrote:
>>>>>>>>>  27.08.2013, 05:39, "Andrew Beekhof" :
>>>>>>>>>>  On 26/08/2013, at 3:09 PM, Andrey Groshev  
>>>>>>>>>> wrote:
>>>>>>>>>>>   26.08.2013, 03:34, "Andrew Beekhof" :
>>>>>>>>>>>>   On 23/08/2013, at 9:39 PM, Andrey Groshev  
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>    Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>>    Today I try remake my test cluster from cman to corosync2.
>>>>>>>>>>>>>    I drew attention to the following:
>>>>>>>>>>>>>    If I reset cluster with cman through cibadmin --erase 
>>>>>>>>>>>>> --force
>>>>>>>>>>>>>    In cib is still there exist names of nodes.
>>>>>>>>>>>>   Yes, the cluster puts back entries for all the nodes it know 
>>>>>>>>>>>> about automagically.
>>>>>>>>>>>>>    cibadmin -Ql
>>>>>>>>>>>>>    .
>>>>>>>>>>>>>   
>>>>>>>>>>>>> >>>>>>>>>>>> uname="dev-cluster2-node2"/>
>>>>>>>>>>>>> >>>>>>>>>>>> uname="dev-cluster2-node4"/>
>>>>>>>>>>>>> >>>>>>>>>>>> uname="dev-cluster2-node3"/>
>>>>>>>>>>>>>   
>>>>>>>>>>>>>    
>>>>>>>>>>>>>
>>>>>>>>>>>>>    Even if cman and pacemaker running only one node.
>>>>>>>>>>>>   I'm assuming all three are configured in cluster.conf?
>>>>>>>>>>>   Yes, there exist list nodes.
>>>>>>>>>>>>>    And if I do too on cluster with corosync2
>>>>>>>>>>>>>    I see only names of nodes which run corosync and pacemaker.
>>>>>>>>>>>>   Since you're not included your config, I can only guess that 
>>>>>>>>>>>> your corosync.conf does not have a nodelist.
>>>>>>>>>>>>   If it did, you should get the same behaviour.
>>>>>>>>>>>   I try and expected_node and nodelist.
>>>>>>>>>>  And it didn't work? What version of pacemaker?
>>>>>>>>>  It does not work as I expected.
>>>>>>>> Thats because you've used IP addresses in the node list.
>>>>>>>> ie.
>>>>>>>>
>>>>>>>> node {
>>>>>>>>   ring0_addr: 10.76.157.17
>>>>>>>> }
>>>>>>>>
>>>>>>>> try including the node name as well, eg.
>>>>>>>>
>>>>>>>> node {
>>>>>>>>   name: dev-cluster2-node2
>>>>>>>>   ring0_addr: 10.76.157.17
>>>>>>>> }
>>>>>>> The same thing.
>>>>>>    I don't know what to say.  I tested it here yesterday and it worked 
>>>>>> as expected.
>>>>>   I found that the reason that You and I have different results - I did 
>>>>> not have reverse DNS zone for these nodes.
>>>>>   I know what it should be, but (PACEMAKER + CMAN) worked without a 
>>>>

Re: [Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-09-02 Thread Andrey Groshev


30.08.2013, 07:18, "Andrew Beekhof" :
> On 29/08/2013, at 7:31 PM, Andrey Groshev  wrote:
>
>>  29.08.2013, 12:25, "Andrey Groshev" :
>>>  29.08.2013, 02:55, "Andrew Beekhof" :
>>>>   On 28/08/2013, at 5:38 PM, Andrey Groshev  wrote:
>>>>>    28.08.2013, 04:06, "Andrew Beekhof" :
>>>>>>    On 27/08/2013, at 1:13 PM, Andrey Groshev  wrote:
>>>>>>> 27.08.2013, 05:39, "Andrew Beekhof" :
>>>>>>>>     On 26/08/2013, at 3:09 PM, Andrey Groshev  wrote:
>>>>>>>>>  26.08.2013, 03:34, "Andrew Beekhof" :
>>>>>>>>>>  On 23/08/2013, at 9:39 PM, Andrey Groshev  
>>>>>>>>>> wrote:
>>>>>>>>>>>   Hello,
>>>>>>>>>>>
>>>>>>>>>>>   Today I try remake my test cluster from cman to corosync2.
>>>>>>>>>>>   I drew attention to the following:
>>>>>>>>>>>   If I reset cluster with cman through cibadmin --erase --force
>>>>>>>>>>>   In cib is still there exist names of nodes.
>>>>>>>>>>  Yes, the cluster puts back entries for all the nodes it know 
>>>>>>>>>> about automagically.
>>>>>>>>>>>   cibadmin -Ql
>>>>>>>>>>>   .
>>>>>>>>>>>   <nodes>
>>>>>>>>>>>     <node id="..." uname="dev-cluster2-node2"/>
>>>>>>>>>>>     <node id="..." uname="dev-cluster2-node4"/>
>>>>>>>>>>>     <node id="..." uname="dev-cluster2-node3"/>
>>>>>>>>>>>   </nodes>
>>>>>>>>>>>   .
>>>>>>>>>>>
>>>>>>>>>>>   Even if cman and pacemaker running only one node.
>>>>>>>>>>  I'm assuming all three are configured in cluster.conf?
>>>>>>>>>  Yes, there exist list nodes.
>>>>>>>>>>>   And if I do too on cluster with corosync2
>>>>>>>>>>>   I see only names of nodes which run corosync and pacemaker.
>>>>>>>>>>  Since you're not included your config, I can only guess that 
>>>>>>>>>> your corosync.conf does not have a nodelist.
>>>>>>>>>>  If it did, you should get the same behaviour.
>>>>>>>>>  I try and expected_node and nodelist.
>>>>>>>> And it didn't work? What version of pacemaker?
>>>>>>> It does not work as I expected.
>>>>>>    Thats because you've used IP addresses in the node list.
>>>>>>    ie.
>>>>>>
>>>>>>    node {
>>>>>>  ring0_addr: 10.76.157.17
>>>>>>    }
>>>>>>
>>>>>>    try including the node name as well, eg.
>>>>>>
>>>>>>    node {
>>>>>>  name: dev-cluster2-node2
>>>>>>  ring0_addr: 10.76.157.17
>>>>>>    }
>>>>>    The same thing.
>>>>   I don't know what to say.  I tested it here yesterday and it worked as 
>>>> expected.
>>>  I found that the reason that You and I have different results - I did not 
>>> have reverse DNS zone for these nodes.
>>>  I know what it should be, but (PACEMAKER + CMAN) worked without a reverse 
>>> area!
>>  Hasty. Deleted all. Reinstalled. Configured. Not working again. Damn!
>
> It would have surprised me... pacemaker 1.1.11 doesn't do any dns lookups - 
> reverse or otherwise.
> Can you set
>
>  PCMK_trace_files=corosync.c
>
> in your environment and retest?
>
> On RHEL6 that means putting the following in /etc/sysconfig/pacemaker
>   export PCMK_trace_files=corosync.c
>
> It should produce additional logging[1] that will help diagnose the issue.
>
> [1] http://blog.clusterlabs.org/blog/2013/pacemaker-logging/
>
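
(A sketch of the suggested change; the restart step is an assumption, not part of the instructions above:)

    # /etc/sysconfig/pacemaker (RHEL 6)
    export PCMK_trace_files=corosync.c

    # restart pacemaker so the trace setting takes effect
    service pacemaker restart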

Hello, Andrew.

You misunderstood me a little.
I wrote that I had rushed to judgment.
After I created the reverse DNS zone, the cluster behaved correctly.
BUT aft

Re: [Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-08-29 Thread Andrey Groshev


29.08.2013, 12:25, "Andrey Groshev" :
> 29.08.2013, 02:55, "Andrew Beekhof" :
>
>>  On 28/08/2013, at 5:38 PM, Andrey Groshev  wrote:
>>>   28.08.2013, 04:06, "Andrew Beekhof" :
>>>>   On 27/08/2013, at 1:13 PM, Andrey Groshev  wrote:
>>>>>    27.08.2013, 05:39, "Andrew Beekhof" :
>>>>>>    On 26/08/2013, at 3:09 PM, Andrey Groshev  wrote:
>>>>>>> 26.08.2013, 03:34, "Andrew Beekhof" :
>>>>>>>> On 23/08/2013, at 9:39 PM, Andrey Groshev  wrote:
>>>>>>>>>  Hello,
>>>>>>>>>
>>>>>>>>>  Today I try remake my test cluster from cman to corosync2.
>>>>>>>>>  I drew attention to the following:
>>>>>>>>>  If I reset cluster with cman through cibadmin --erase --force
>>>>>>>>>  In cib is still there exist names of nodes.
>>>>>>>> Yes, the cluster puts back entries for all the nodes it know about 
>>>>>>>> automagically.
>>>>>>>>>  cibadmin -Ql
>>>>>>>>>  .
>>>>>>>>>  <nodes>
>>>>>>>>>    <node id="..." uname="dev-cluster2-node2"/>
>>>>>>>>>    <node id="..." uname="dev-cluster2-node4"/>
>>>>>>>>>    <node id="..." uname="dev-cluster2-node3"/>
>>>>>>>>>  </nodes>
>>>>>>>>>  .
>>>>>>>>>
>>>>>>>>>  Even if cman and pacemaker running only one node.
>>>>>>>> I'm assuming all three are configured in cluster.conf?
>>>>>>> Yes, there exist list nodes.
>>>>>>>>>  And if I do too on cluster with corosync2
>>>>>>>>>  I see only names of nodes which run corosync and pacemaker.
>>>>>>>> Since you're not included your config, I can only guess that your 
>>>>>>>> corosync.conf does not have a nodelist.
>>>>>>>> If it did, you should get the same behaviour.
>>>>>>> I try and expected_node and nodelist.
>>>>>>    And it didn't work? What version of pacemaker?
>>>>>    It does not work as I expected.
>>>>   Thats because you've used IP addresses in the node list.
>>>>   ie.
>>>>
>>>>   node {
>>>> ring0_addr: 10.76.157.17
>>>>   }
>>>>
>>>>   try including the node name as well, eg.
>>>>
>>>>   node {
>>>> name: dev-cluster2-node2
>>>> ring0_addr: 10.76.157.17
>>>>   }
>>>   The same thing.
>>  I don't know what to say.  I tested it here yesterday and it worked as 
>> expected.
>
> I found that the reason that You and I have different results - I did not 
> have reverse DNS zone for these nodes.
> I know what it should be, but (PACEMAKER + CMAN) worked without a reverse 
> area!
>

Hasty. Deleted all. Reinstalled. Configured. Not working again. Damn!

>>>   # corosync-cmapctl |grep nodelist
>>>   nodelist.local_node_pos (u32) = 2
>>>   nodelist.node.0.name (str) = dev-cluster2-node2
>>>   nodelist.node.0.ring0_addr (str) = 10.76.157.17
>>>   nodelist.node.1.name (str) = dev-cluster2-node3
>>>   nodelist.node.1.ring0_addr (str) = 10.76.157.18
>>>   nodelist.node.2.name (str) = dev-cluster2-node4
>>>   nodelist.node.2.ring0_addr (str) = 10.76.157.19
>>>
>>>   # corosync-quorumtool -s
>>>   Quorum information
>>>   --
>>>   Date: Wed Aug 28 11:29:49 2013
>>>   Quorum provider:  corosync_votequorum
>>>   Nodes:    1
>>>   Node ID:  172793107
>>>   Ring ID:  52
>>>   Quorate:  No
>>>
>>>   Votequorum information
>>>   --
>>>   Expected votes:   3
>>>   Highest expected: 3
>>>   Total votes:  1
>>>   Quorum:   2 Activity blocked
>>>   Flags:
>>>
>>>   Membership information
>>>   --
>>>  Nodeid  Votes Name
>>>   172793107  1 dev-cluster2-node4 (local)
>>

Re: [Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-08-29 Thread Andrey Groshev


29.08.2013, 02:55, "Andrew Beekhof" :
> On 28/08/2013, at 5:38 PM, Andrey Groshev  wrote:
>
>>  28.08.2013, 04:06, "Andrew Beekhof" :
>>>  On 27/08/2013, at 1:13 PM, Andrey Groshev  wrote:
>>>>   27.08.2013, 05:39, "Andrew Beekhof" :
>>>>>   On 26/08/2013, at 3:09 PM, Andrey Groshev  wrote:
>>>>>>    26.08.2013, 03:34, "Andrew Beekhof" :
>>>>>>>    On 23/08/2013, at 9:39 PM, Andrey Groshev  wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Today I try remake my test cluster from cman to corosync2.
>>>>>>>> I drew attention to the following:
>>>>>>>> If I reset cluster with cman through cibadmin --erase --force
>>>>>>>> In cib is still there exist names of nodes.
>>>>>>>    Yes, the cluster puts back entries for all the nodes it know about 
>>>>>>> automagically.
>>>>>>>> cibadmin -Ql
>>>>>>>> .
>>>>>>>>    <nodes>
>>>>>>>>      <node id="..." uname="dev-cluster2-node2"/>
>>>>>>>>      <node id="..." uname="dev-cluster2-node4"/>
>>>>>>>>      <node id="..." uname="dev-cluster2-node3"/>
>>>>>>>>    </nodes>
>>>>>>>>    .
>>>>>>>>
>>>>>>>> Even if cman and pacemaker running only one node.
>>>>>>>    I'm assuming all three are configured in cluster.conf?
>>>>>>    Yes, there exist list nodes.
>>>>>>>> And if I do too on cluster with corosync2
>>>>>>>> I see only names of nodes which run corosync and pacemaker.
>>>>>>>    Since you're not included your config, I can only guess that your 
>>>>>>> corosync.conf does not have a nodelist.
>>>>>>>    If it did, you should get the same behaviour.
>>>>>>    I try and expected_node and nodelist.
>>>>>   And it didn't work? What version of pacemaker?
>>>>   It does not work as I expected.
>>>  Thats because you've used IP addresses in the node list.
>>>  ie.
>>>
>>>  node {
>>>    ring0_addr: 10.76.157.17
>>>  }
>>>
>>>  try including the node name as well, eg.
>>>
>>>  node {
>>>    name: dev-cluster2-node2
>>>    ring0_addr: 10.76.157.17
>>>  }
>>  The same thing.
>
> I don't know what to say.  I tested it here yesterday and it worked as 
> expected.

I found the reason that you and I have different results - I did not have a
reverse DNS zone for these nodes.
I know what it should be, but (PACEMAKER + CMAN) worked without a reverse zone!

>
>>  # corosync-cmapctl |grep nodelist
>>  nodelist.local_node_pos (u32) = 2
>>  nodelist.node.0.name (str) = dev-cluster2-node2
>>  nodelist.node.0.ring0_addr (str) = 10.76.157.17
>>  nodelist.node.1.name (str) = dev-cluster2-node3
>>  nodelist.node.1.ring0_addr (str) = 10.76.157.18
>>  nodelist.node.2.name (str) = dev-cluster2-node4
>>  nodelist.node.2.ring0_addr (str) = 10.76.157.19
>>
>>  # corosync-quorumtool -s
>>  Quorum information
>>  --
>>  Date: Wed Aug 28 11:29:49 2013
>>  Quorum provider:  corosync_votequorum
>>  Nodes:    1
>>  Node ID:  172793107
>>  Ring ID:  52
>>  Quorate:  No
>>
>>  Votequorum information
>>  --
>>  Expected votes:   3
>>  Highest expected: 3
>>  Total votes:  1
>>  Quorum:   2 Activity blocked
>>  Flags:
>>
>>  Membership information
>>  --
>> Nodeid  Votes Name
>>  172793107  1 dev-cluster2-node4 (local)
>>
>>  # cibadmin -Q
>>  <cib epoch="..." validate-with="pacemaker-1.2" crm_feature_set="3.0.7" cib-last-written="Wed 
>> Aug 28 11:24:06 2013" update-origin="dev-cluster2-node4" 
>> update-client="crmd" have-quorum="0" dc-uuid="172793107">
>>    <configuration>
>>      <crm_config>
>>        <cluster_property_set id="cib-bootstrap-options">
>>          <nvpair id="..." name="dc-version" value="1.1.11-1.el6-4f672bc"/>
>>          <nvpair id="..." name="cluster-infrastructure" value="corosync"/>
>>        </cluster_property_set>
>>      </crm_config>

Re: [Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-08-28 Thread Andrey Groshev


28.08.2013, 04:06, "Andrew Beekhof" :
> On 27/08/2013, at 1:13 PM, Andrey Groshev  wrote:
>
>>  27.08.2013, 05:39, "Andrew Beekhof" :
>>>  On 26/08/2013, at 3:09 PM, Andrey Groshev  wrote:
>>>>   26.08.2013, 03:34, "Andrew Beekhof" :
>>>>>   On 23/08/2013, at 9:39 PM, Andrey Groshev  wrote:
>>>>>>    Hello,
>>>>>>
>>>>>>    Today I try remake my test cluster from cman to corosync2.
>>>>>>    I drew attention to the following:
>>>>>>    If I reset cluster with cman through cibadmin --erase --force
>>>>>>    In cib is still there exist names of nodes.
>>>>>   Yes, the cluster puts back entries for all the nodes it know about 
>>>>> automagically.
>>>>>>    cibadmin -Ql
>>>>>>    .
>>>>>>    <nodes>
>>>>>>      <node id="..." uname="dev-cluster2-node2"/>
>>>>>>      <node id="..." uname="dev-cluster2-node4"/>
>>>>>>      <node id="..." uname="dev-cluster2-node3"/>
>>>>>>    </nodes>
>>>>>>    .
>>>>>>
>>>>>>    Even if cman and pacemaker running only one node.
>>>>>   I'm assuming all three are configured in cluster.conf?
>>>>   Yes, there exist list nodes.
>>>>>>    And if I do too on cluster with corosync2
>>>>>>    I see only names of nodes which run corosync and pacemaker.
>>>>>   Since you're not included your config, I can only guess that your 
>>>>> corosync.conf does not have a nodelist.
>>>>>   If it did, you should get the same behaviour.
>>>>   I try and expected_node and nodelist.
>>>  And it didn't work? What version of pacemaker?
>>  It does not work as I expected.
>
> Thats because you've used IP addresses in the node list.
> ie.
>
> node {
>   ring0_addr: 10.76.157.17
> }
>
> try including the node name as well, eg.
>
> node {
>   name: dev-cluster2-node2
>   ring0_addr: 10.76.157.17
> }

The same thing.

# corosync-cmapctl |grep nodelist
nodelist.local_node_pos (u32) = 2
nodelist.node.0.name (str) = dev-cluster2-node2
nodelist.node.0.ring0_addr (str) = 10.76.157.17
nodelist.node.1.name (str) = dev-cluster2-node3
nodelist.node.1.ring0_addr (str) = 10.76.157.18
nodelist.node.2.name (str) = dev-cluster2-node4
nodelist.node.2.ring0_addr (str) = 10.76.157.19

# corosync-quorumtool -s
Quorum information
--
Date: Wed Aug 28 11:29:49 2013
Quorum provider:  corosync_votequorum
Nodes:    1
Node ID:  172793107
Ring ID:  52
Quorate:  No

Votequorum information
--
Expected votes:   3
Highest expected: 3
Total votes:  1
Quorum:   2 Activity blocked
Flags:

Membership information
--
Nodeid  Votes Name
 172793107  1 dev-cluster2-node4 (local)


# cibadmin -Q
<cib epoch="..." validate-with="pacemaker-1.2" crm_feature_set="3.0.7"
 cib-last-written="Wed Aug 28 11:24:06 2013" update-origin="dev-cluster2-node4"
 update-client="crmd" have-quorum="0" dc-uuid="172793107">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="..." name="dc-version" value="1.1.11-1.el6-4f672bc"/>
        <nvpair id="..." name="cluster-infrastructure" value="corosync"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="172793107" uname="dev-cluster2-node4"/>
    </nodes>
    <resources/>
    <constraints/>
  </configuration>
  <status>
    ...
  </status>
</cib>

>>  I figured out a way get around this, but it would be easier to do if the 
>> CIB has worked as a with CMAN.
>>  I just do not start the main resource if the attribute is not defined or it 
>> is not true.
>>  This slightly changes the logic of the cluster.
>>  But I'm not sure what the correct behavior.
>>
>>  libqb 0.14.4
>>  corosync 2.3.1
>>  pacemaker 1.1.11
>>
>>  All build from source in previews week.
>>>>   Now in corosync.conf:
>>>>
>>>>   totem {
>>>>  version: 2
>>>>  crypto_cipher: none
>>>>  crypto_hash: none
>>>>  interface {
>>>>  ringnumber: 0
>>>>   bindnetaddr: 10.76.157.18
>>>>   mcastaddr: 239.94.1.56
>>>>  mcastport: 5405
>>>>  ttl: 1
>>>>  }
>>>>   }
>>>>   logging {
>>>>  fileline: off
>>>>  to_stderr: no
>>>>  to_logfile: yes
>>>>  logfile: /var/log/cluster/corosync.log
>>>>  to_syslog: yes
>>>>  debug: on
>>>>  timestamp: on
>>>>  logger_subsys {
>>>>  subsys: QUORUM
>>>>  debug: on
>>>>  }
>>>>   }

Re: [Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-08-26 Thread Andrey Groshev


27.08.2013, 05:39, "Andrew Beekhof" :
> On 26/08/2013, at 3:09 PM, Andrey Groshev  wrote:
>
>>  26.08.2013, 03:34, "Andrew Beekhof" :
>>>  On 23/08/2013, at 9:39 PM, Andrey Groshev  wrote:
>>>>   Hello,
>>>>
>>>>   Today I try remake my test cluster from cman to corosync2.
>>>>   I drew attention to the following:
>>>>   If I reset cluster with cman through cibadmin --erase --force
>>>>   In cib is still there exist names of nodes.
>>>  Yes, the cluster puts back entries for all the nodes it know about 
>>> automagically.
>>>>   cibadmin -Ql
>>>>   .
>>>>   <nodes>
>>>>     <node id="..." uname="dev-cluster2-node2"/>
>>>>     <node id="..." uname="dev-cluster2-node4"/>
>>>>     <node id="..." uname="dev-cluster2-node3"/>
>>>>   </nodes>
>>>>   .
>>>>
>>>>   Even if cman and pacemaker running only one node.
>>>  I'm assuming all three are configured in cluster.conf?
>>  Yes, there exist list nodes.
>>>>   And if I do too on cluster with corosync2
>>>>   I see only names of nodes which run corosync and pacemaker.
>>>  Since you're not included your config, I can only guess that your 
>>> corosync.conf does not have a nodelist.
>>>  If it did, you should get the same behaviour.
>>  I try and expected_node and nodelist.
>
> And it didn't work? What version of pacemaker?

It does not work as I expected.
I figured out a way to get around this, but it would be easier to do if the CIB
worked as it does with CMAN.
I just do not start the main resource if the attribute is not defined or it is
not true.
This slightly changes the logic of the cluster.
But I'm not sure that this is the correct behavior.

libqb 0.14.4
corosync 2.3.1
pacemaker 1.1.11 

All build from source in previews week.

>
>>  Now in corosync.conf:
>>
>>  totem {
>> version: 2
>> crypto_cipher: none
>> crypto_hash: none
>> interface {
>> ringnumber: 0
>>  bindnetaddr: 10.76.157.18
>>  mcastaddr: 239.94.1.56
>> mcastport: 5405
>> ttl: 1
>> }
>>  }
>>  logging {
>> fileline: off
>> to_stderr: no
>> to_logfile: yes
>> logfile: /var/log/cluster/corosync.log
>> to_syslog: yes
>> debug: on
>> timestamp: on
>> logger_subsys {
>> subsys: QUORUM
>> debug: on
>> }
>>  }
>>  quorum {
>> provider: corosync_votequorum
>>  }
>>  nodelist {
>>  node {
>>  ring0_addr: 10.76.157.17
>>  }
>>  node {
>>  ring0_addr: 10.76.157.18
>>  }
>>  node {
>>  ring0_addr: 10.76.157.19
>>  }
>>  }
>>
>>  ___
>>  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org
>
> ,
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-08-25 Thread Andrey Groshev


26.08.2013, 03:34, "Andrew Beekhof" :
> On 23/08/2013, at 9:39 PM, Andrey Groshev  wrote:
>
>>  Hello,
>>
>>  Today I try remake my test cluster from cman to corosync2.
>>  I drew attention to the following:
>>  If I reset cluster with cman through cibadmin --erase --force
>>  In cib is still there exist names of nodes.
>
> Yes, the cluster puts back entries for all the nodes it know about 
> automagically.
>
>>  cibadmin -Ql
>>  .
>>  <nodes>
>>    <node id="..." uname="dev-cluster2-node2"/>
>>    <node id="..." uname="dev-cluster2-node4"/>
>>    <node id="..." uname="dev-cluster2-node3"/>
>>  </nodes>
>>  .
>>
>>  Even if cman and pacemaker running only one node.
>
> I'm assuming all three are configured in cluster.conf?

Yes, the list of nodes exists there.

>
>>  And if I do too on cluster with corosync2
>>  I see only names of nodes which run corosync and pacemaker.
>
> Since you're not included your config, I can only guess that your 
> corosync.conf does not have a nodelist.
> If it did, you should get the same behaviour.

I tried both expected_votes and a nodelist. Now in corosync.conf:

totem {
        version: 2
        crypto_cipher: none
        crypto_hash: none
        interface {
                ringnumber: 0
                bindnetaddr: 10.76.157.18
                mcastaddr: 239.94.1.56
                mcastport: 5405
                ttl: 1
        }
}
logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        to_syslog: yes
        debug: on
        timestamp: on
        logger_subsys {
                subsys: QUORUM
                debug: on
        }
}
quorum {
        provider: corosync_votequorum
}
nodelist {
        node {
                ring0_addr: 10.76.157.17
        }
        node {
                ring0_addr: 10.76.157.18
        }
        node {
                ring0_addr: 10.76.157.19
        }
}
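
(For comparison, the nodelist that the replies above converge on - each node entry carrying a name as well as an address, with the names taken from the corosync-cmapctl output quoted earlier in this thread - would be:)

nodelist {
        node {
                name: dev-cluster2-node2
                ring0_addr: 10.76.157.17
        }
        node {
                name: dev-cluster2-node3
                ring0_addr: 10.76.157.18
        }
        node {
                name: dev-cluster2-node4
                ring0_addr: 10.76.157.19
        }
}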





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] different behavior cibadmin -Ql with cman and corosync2

2013-08-23 Thread Andrey Groshev
Hello,

Today I try remake my test cluster from cman to corosync2.
I drew attention to the following:
If I reset cluster with cman through cibadmin --erase --force
In cib is still there exist names of nodes.

cibadmin -Ql
.
<nodes>
  <node id="..." uname="dev-cluster2-node2"/>
  <node id="..." uname="dev-cluster2-node4"/>
  <node id="..." uname="dev-cluster2-node3"/>
</nodes>
.

Even if cman and pacemaker are running on only one node.


And if I do the same on a cluster with corosync2,
I see only the names of the nodes which are running corosync and pacemaker.
I'll explain why this is inconvenient.
I need to set an attribute before starting a resource.
On a cluster with cman I can do it, but with corosync2 I can't.
Is there another way?
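
(One possible workaround, sketched under assumptions: the node id and the attribute name "pre_start_flag" are placeholders, not taken from this thread. The idea is to pre-create the node entry in the CIB so an attribute can be set before pacemaker starts on that node:)

    # pre-create the node entry in the CIB's nodes section
    cibadmin -o nodes -C -X '<node id="..." uname="dev-cluster2-node3"/>'
    # the attribute can then be set even though the node has not joined yet
    crm_attribute -N dev-cluster2-node3 -n pre_start_flag -v yes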

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] why not updated http://clusterlabs.org/rpm-next/ ..... ?

2013-08-20 Thread Andrey Groshev
Hello Andrew!
Why not updated http://clusterlabs.org/rpm-next/* ?

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] why pacemaker stops cman?

2013-07-12 Thread Andrey Groshev
# service pacemaker stop
Waiting for shutdown of managed resources. [  OK  ]
Signaling Pacemaker Cluster Manager to terminate:  [  OK  ]
Waiting for cluster services to unload:..  [  OK  ]
Stopping cluster:
   Leaving fence domain... [  OK  ]
   Stopping gfs_controld...[  OK  ]
   Stopping dlm_controld...[  OK  ]
   Stopping fenced...  [  OK  ]
   Stopping cman...[  OK  ]
   Waiting for corosync to shutdown:   [  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs...  [  OK  ]

# service cman status
corosync dead but subsys locked


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] again trouble with quorum (now with cman)

2013-07-12 Thread Andrey Groshev


12.07.2013, 09:15, "Digimer" :
> On 12/07/13 00:59, Andrey Groshev wrote:
>
>>  11.07.2013, 18:48, "Digimer" :
>>>  You need fencing. Specifically, cman blocks when a fence is called and
>>>  won't unblock until it's told that a fence completed successfully.
>>>  Configure cluster.conf to use 'fence_pcmk', which tells cman to pass
>>>  fence requests to pacemaker, and then configure (and test!) stonith in
>>>  pacemaker.
>>>
>>>  If you have just two nodes, be sure to also set '<cman two_node="1" expected_votes="1" />'.
>>>
>>>  digime
>>>
>>>  On 11/07/13 09:35, Andrey Groshev wrote:
>>  I understand that it may be correct to do so...
>>  But why so difficult?
>>  Assume, I make a small HA cluster in my garage.
>>  And, I not have a managed switch or managed UPS.
>>  I can not corrupt the data.
>>  I just need to returning node as soon as possible started responding.
>
> First, fencing is not difficult, it's just one of the parts of
> clustering to learn. Second, the cluster software has no concept of
> "unimportant cluster". It treats every cluster as enterprise class.
> Third, if a service can run on both nodes without coordination with one
> another, then you don't need the cluster stack at all.
>
> Assuming you actually need to keep the service on one node or the other,
> you need to have a way to make sure the "lost" node really is lost. You
> are not allowed to make assumptions. "The only thing you know if what
> you don't know". Fencing puts a node in an unknown state (disconnected?
> frozen? crashed? blown to pieces?) and puts it into a known state,
> "off". This ensures the service can never run on both nodes at the same
> time.
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?

Well, without true fence devices, must I use fence_pcmk?
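
(For reference, a minimal cluster.conf sketch that routes cman's fence requests to Pacemaker via fence_pcmk, following the quickstart page cited in this thread; the cluster and node names are placeholders. Note that fence_pcmk only redirects - a real stonith device still has to be configured, and tested, in Pacemaker:)

<cluster name="mycluster" config_version="1">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="node1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" nodeid="2">
      <fence>
        <method name="pcmk-redirect">
          <device name="pcmk" port="node2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>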

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] again trouble with quorum (now with cman)

2013-07-11 Thread Andrey Groshev


11.07.2013, 18:48, "Digimer" :
> You need fencing. Specifically, cman blocks when a fence is called and
> won't unblock until it's told that a fence completed successfully.
> Configure cluster.conf to use 'fence_pcmk', which tells cman to pass
> fence requests to pacemaker, and then configure (and test!) stonith in
> pacemaker.
>
> If you have just two nodes, be sure to also set '<cman two_node="1" expected_votes="1" />'.
>
> digime
>
> On 11/07/13 09:35, Andrey Groshev wrote:
>

I understand that it may be correct to do so... 
But why so difficult?
Assume I make a small HA cluster in my garage.
And I do not have a managed switch or a managed UPS.
I cannot corrupt the data.
I just need the returning node to start responding as soon as possible.


>>  Hi again!
>>  I've played enough with corosync 2.3.x. nothing good yet.
>>  Now I try build cluster with corosync/cman/pacemaker.
>>  I started with http://clusterlabs.org/quickstart-redhat.html as saw Andrew.
>>  More made my config in pacemaker (with stonith disabled).
>>  The cluster started and I began tests.
>>
>>  And now got new trouble.
>>  Node not return in a cluster after a network cable disconnect/reconnect.
>>
>>  First, I found that the fenced kill cman tool.
>>  But I do not want that.
>>  Behavior that I want to achieve:
>>  Let after reconnecting start all the resources.
>>  And the resource manager itself with all cope.
>>
>>  I disable the use fenced, as written in the documentation fenced(8).
>>  In /etc/cluster/cluster.conf I wrote .
>>  The Fenced all the same running, but corosync and pacemaker but continued 
>> to work.
>>  And resource do not start.
>>
>>  Now I see to quorum. On fenced node corosync-quorum show than only self.
>>  Why ? After restart pacemaker and cman - all rigth.
>>  Why such a behavior in the quorum and how to fix it?
>>
>>  Best regards.
>>
>>  ___
>>  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] again trouble with quorum (now with cman)

2013-07-11 Thread Andrey Groshev
Hi again!
I've played enough with corosync 2.3.x; nothing good yet.
Now I'm trying to build a cluster with corosync/cman/pacemaker.
I started with http://clusterlabs.org/quickstart-redhat.html, as Andrew suggested.
Then I made my config in pacemaker (with stonith disabled).
The cluster started and I began tests.

And now I've got new trouble.
A node does not return to the cluster after a network cable disconnect/reconnect.

First, I found that fenced kills the cman tool.
But I do not want that.
The behavior that I want to achieve:
after reconnecting, all the resources start,
and the resource manager itself copes with everything.

I disabled the use of fenced, as written in the documentation fenced(8).
In /etc/cluster/cluster.conf I wrote .
Fenced is still running all the same, but corosync and pacemaker continued to
work.
And the resources do not start.

Now I look at the quorum. On the fenced node, corosync-quorumtool shows only itself.
Why? After restarting pacemaker and cman, everything is all right.
Why such a behavior in the quorum and how to fix it?

Best regards.





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] setup advice

2013-07-03 Thread Andrey Groshev


03.07.2013, 16:26, "Takatoshi MATSUO" :
> Hi Andrey
>
> 2013/7/3 Andrey Groshev :
>
>>  03.07.2013, 06:43, "Takatoshi MATSUO" :
>>>  Hi Stefano
>>>
>>>  2013/7/2 Stefano Sasso :
>>>>   Hello folks,
>>>> I have the following setup in mind, but I need some advice and one 
>>>> hint on
>>>>   how to realize a particular function.
>>>>
>>>>   I have a N (>= 2) nodes cluster, with data storage on postgresql.
>>>>   I would like to manage postgres master-slave replication in this way: one
>>>>   node is the "master", one is the "slave", and the others are "standby"
>>>>   nodes.
>>>>   If the master fails, the slave becomes the master, and one of the standby
>>>>   becomes the slave.
>>>>   If the slave fails, one of the standby becomes the new slave.
>>>  Does "standby" mean that PostgreSQL is stopped ?
>>>  If Master doesn't have WAL files which new slave needs,
>>>  new slave can't connect master.
>>>
>>>  How do you solve it ?
>>>  copy data or wal-archive on start automatically ?
>>>  It may cause timed-out if PostgreSQL has large database.
>>>>   If one of the "standby" fails, no problem :)
>>>>   I can correctly manage this configuration with ms and a custom script 
>>>> (using
>>>>   ocf:pacemaker:Stateful as example). If the cluster is already 
>>>> operational,
>>>>   the failover works fine.
>>>>
>>>>   My problem is about cluster start-up: in fact, only the previous running
>>>>   master and slave own the most updated data; so I would like that the new
>>>>   master should be the "old master" (or, even, the old slave), and the new
>>>>   slave should be the "old slave" (but this one is not mandatory). The
>>>>   important thing is that the new master should have up-to-date data.
>>>>   This should happen even if the servers are booted up with some minutes of
>>>>   delay between them. (users are very stupid sometimes).
>>>  Latest pgsql RA embraces these ideas to manage replication.
>>>
>>>   1. First boot
>>>  RA compares data and promotes PostgreSQL which has latest data.
>>>  The number of comparison can be changed  using xlog_check_count parameter.
>>>  If monitor interval is 10 sec and xlog_check_count is 360, RA can wait
>>>  1 hour to promote :)
>>  But in this case, when master dies, election a new master will continue one 
>> hour too.
>>  Is that right?
>
> No, if slave's data is up to date, master changes slave's master-score.
> So pacemaker stops master and promote slave immediately when master dies.
>

The wait is in the function have_master_right.

(snip)
    # get xlog locations of all nodes
    for node in ${NODE_LIST}; do
        output=`$CRM_ATTR_REBOOT -N "$node" -n \
                "$PGSQL_XLOG_LOC_NAME" -G -q 2>/dev/null`
(snip)
    if [ "$new" -ge "$OCF_RESKEY_xlog_check_count" ]; then
        newestXlog=`printf "$newfile\n" | sort -t " " -k 2,3 -r | \
                    head -1 | cut -d " " -f 2`
        if [ "$newestXlog" = "$mylocation" ]; then
            ocf_log info "I have a master right."
            $CRM_MASTER -v $PROMOTE_ME
            return 0
        fi
        change_data_status "$NODENAME" "DISCONNECT"
        ocf_log info "I don't have correct master data."
        # reset counter
        rm -f ${XLOG_NOTE_FILE}.*
        printf "$newfile\n" > ${XLOG_NOTE_FILE}.0
    fi

    return 1
}

As I understand it, the xlog is checked on all nodes $OCF_RESKEY_xlog_check_count
more times.
And this function is called from pgsql_replication_monitor, which in turn is
called from pgsql_monitor.
That is, until "monitor" has been called another $OCF_RESKEY_xlog_check_count
times, have_master_right will not return true.
I am recalling the structure of your code from memory :)
Or am I wrong?
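
(For illustration, the two knobs being discussed, in a crm configuration sketch; paths, node names and most values are placeholders - only xlog_check_count and the monitor interval follow the example above. With interval="10s" and xlog_check_count="360", the RA waits 10 s x 360 = 3600 s, i.e. one hour, before the first promotion:)

primitive pgsql ocf:heartbeat:pgsql \
    params pgctl="/usr/pgsql-9.2/bin/pg_ctl" pgdata="/var/lib/pgsql/9.2/data" \
           rep_mode="sync" node_list="node1 node2 node3" \
           xlog_check_count="360" \
    op monitor interval="10s" timeout="60s" on-fail="restart"
ms msPostgresql pgsql \
    meta master-max="1" clone-max="3" notify="true"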



>>>  2. Second boot
>>>  Master manages slave's data using attribute with "-l forever" option.
>>>  So RA can't start PostgreSQL, if the node has no latest data.
>>>>   My idea is the following:
>>>>   the MS resource is not started when the cluster comes up, but on startup
>>>>   there will only be one "arbitrator" resource (started on only one node).
>>>>   This resource reads from somewhere which was the previous ma

Re: [Pacemaker] Monitor and standby

2013-07-03 Thread Andrey Groshev


03.07.2013, 14:26, "Denis Witt" :
> Hi List,
>
> we have a two node cluster (test1-node1, test1-node2) with an additional
> quorum node (test1). On all nodes MySQL is running. test1-node1 and
> test1-node2 sharing the MySQL-Database via DRBD, so only one Node
> should run MySQL. On test1 there is a MySQL-Slave connected to
> test1-node1/test1-node2. test1 is always in Standby-Mode.
>
> The problem is now that the MySQL-Slave on test1 is shut down by crmd:
>
> Jul  3 12:05:12 test2 crmd: [5945]: info: te_rsc_command: Initiating action 
> 22: monitor p_mysql_monitor_0 on test2 (local)
> Jul  3 12:05:14 test2 pengine: [5944]: ERROR: native_create_actions: Resource 
> p_mysql (lsb::mysql) is active on 2 nodes attempting recovery
> Jul  3 12:05:14 test2 pengine: [5944]: notice: LogActions: Restart 
> p_mysql#011(Started test2-node1)
> Jul  3 12:05:15 test2 crmd: [5945]: info: te_rsc_command: Initiating action 
> 54: stop p_mysql_stop_0 on test2 (local)
>
> From my understanding this shouldn't happen as test1 was set to standby
> before:
>
> Jul  3 12:04:48 test2 cib: [5940]: info: cib:diff: +    id="nodes-test2-standby" name="standby" value="on" />
>
> How could we solve this?
>

At the beginning you wrote test1-node1, test1-node2, test1...
but in the log it is test2, test2-node1 - is this right?
Maybe the hostnames are wrong?

> Thanks in advance.
>
> Best regards
> Denis Witt
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] setup advice

2013-07-03 Thread Andrey Groshev
03.07.2013, 13:14, "Stefano Sasso" <stesa...@gmail.com>:
> 2013/7/3 Andrey Groshev <gre...@yandex.ru>
>> 03.07.2013, 12:30, "Stefano Sasso" <stesa...@gmail.com>:
>>> 2013/7/3 Andrey Groshev <gre...@yandex.ru>
>>>> 03.07.2013, 11:42, "Stefano Sasso" <stesa...@gmail.com>:
>>>>>> Does "standby" mean that PostgreSQL is stopped ?
>>>>>> If Master doesn't have WAL files which new slave needs,
>>>>>> new slave can't connect master.
>>>>>>
>>>>>> How do you solve it ?
>>>>>> copy data or wal-archive on start automatically ?
>>>>>> It may cause timed-out if PostgreSQL has large database.
>>>>> I'm using streaming replication.
>>>> This is understandable, it is not clear how your "slave" is different from "standby".
>>> slave is syncronized with streaming replication,
>>> standby do a sync on daily basis (so if both master and slave fails, I lose only one day of data - not a problem).
>> I would call it a "backup" :)
> yeah, but a standby can become a master or a slave ;)

What is the profit?
The third node is connected to the Master every day?
But first, it will be synchronized through the "WAL" files.
Instead of spreading the load during the day you'll get a peak load.
Or am I missing something in your architecture?

> bests,
>    stefano
>
> --
> Stefano Sasso
> http://stefano.dscnet.org/
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] setup advice

2013-07-03 Thread Andrey Groshev
03.07.2013, 12:30, "Stefano Sasso" <stesa...@gmail.com>:
> 2013/7/3 Andrey Groshev <gre...@yandex.ru>
>>> 03.07.2013, 11:42, "Stefano Sasso" <stesa...@gmail.com>:
>>>>> Does "standby" mean that PostgreSQL is stopped ?
>>>>> If Master doesn't have WAL files which new slave needs,
>>>>> new slave can't connect master.
>>>>>
>>>>> How do you solve it ?
>>>>> copy data or wal-archive on start automatically ?
>>>>> It may cause timed-out if PostgreSQL has large database.
>>>> I'm using streaming replication.
>>> This is understandable, it is not clear what your "slave" is different from "standby".
> slave is syncronized with streaming replication,
> standby do a sync on daily basis (so if both master and slave fails, I lose only one day of data - not a problem).

I would call it a "backup" :)

> we can assume that only one host fails at a time, so I have no problem if the new slave takes log time to resync with the master.
> they have a dedicated GB link for clustering, and the database average size will be 10 Gb
>
> --
> Stefano Sasso
> http://stefano.dscnet.org/
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] setup advice

2013-07-03 Thread Andrey Groshev
03.07.2013, 11:42, "Stefano Sasso" <stesa...@gmail.com>:
>> Does "standby" mean that PostgreSQL is stopped ?
>> If Master doesn't have WAL files which new slave needs,
>> new slave can't connect master.
>>
>> How do you solve it ?
>> copy data or wal-archive on start automatically ?
>> It may cause timed-out if PostgreSQL has large database.
> I'm using streaming replication.

This is understandable, it is not clear what your "slave" is different from "standby".

>> 1. First boot
>> RA compares data and promotes PostgreSQL which has latest data.
>> The number of comparison can be changed using xlog_check_count parameter.
>> If monitor interval is 10 sec and xlog_check_count is 360, RA can wait
>> 1 hour to promote :)
> I can't wait an hour :)
> my maximum is 10 minutes :)
>
> thanks
>    stefano
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] setup advice

2013-07-02 Thread Andrey Groshev


03.07.2013, 06:43, "Takatoshi MATSUO" :
> Hi Stefano
>
> 2013/7/2 Stefano Sasso :
>
>>  Hello folks,
>>    I have the following setup in mind, but I need some advice and one hint on
>>  how to realize a particular function.
>>
>>  I have a N (>= 2) nodes cluster, with data storage on postgresql.
>>  I would like to manage postgres master-slave replication in this way: one
>>  node is the "master", one is the "slave", and the others are "standby"
>>  nodes.
>>  If the master fails, the slave becomes the master, and one of the standby
>>  becomes the slave.
>>  If the slave fails, one of the standby becomes the new slave.
>
> Does "standby" mean that PostgreSQL is stopped ?
> If Master doesn't have WAL files which new slave needs,
> new slave can't connect master.
>
> How do you solve it ?
> copy data or wal-archive on start automatically ?
> It may cause timed-out if PostgreSQL has large database.
>
>>  If one of the "standby" fails, no problem :)
>>  I can correctly manage this configuration with ms and a custom script (using
>>  ocf:pacemaker:Stateful as example). If the cluster is already operational,
>>  the failover works fine.
>>
>>  My problem is about cluster start-up: in fact, only the previous running
>>  master and slave own the most updated data; so I would like that the new
>>  master should be the "old master" (or, even, the old slave), and the new
>>  slave should be the "old slave" (but this one is not mandatory). The
>>  important thing is that the new master should have up-to-date data.
>>  This should happen even if the servers are booted up with some minutes of
>>  delay between them. (users are very stupid sometimes).
>
> Latest pgsql RA embraces these ideas to manage replication.
>
>  1. First boot
> RA compares data and promotes PostgreSQL which has latest data.
> The number of comparison can be changed  using xlog_check_count parameter.
> If monitor interval is 10 sec and xlog_check_count is 360, RA can wait
> 1 hour to promote :)
>

But in this case, when the master dies, electing a new master will also take one
hour.
Is that right?

> 2. Second boot
> Master manages slave's data using attribute with "-l forever" option.
> So RA can't start PostgreSQL, if the node has no latest data.
>
>>  My idea is the following:
>>  the MS resource is not started when the cluster comes up, but on startup
>>  there will only be one "arbitrator" resource (started on only one node).
>>  This resource reads from somewhere which was the previous master and the
>>  previous slave, and it wait up to 5 minutes to see if one of them comes up.
>>  In positive case, it forces the MS master resource to be run on that node
>>  (and start it); in negative case, if the wait timer expired, it start the
>>  master resource on a random node.
>>
>>  Is that possible? How can avoid a single resource to start on cluster boot?
>>  Or, could you advise another way to do this setup?
>>
>>  I hope I was clear, my english is not so good :)
>>  thank you so much,
>> stefano
>>
>>  --
>>  Stefano Sasso
>>  http://stefano.dscnet.org/
>
> Regards,
> Takatoshi MATSUO
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [OT] MySQL Replication

2013-06-25 Thread Andrey Groshev


26.06.2013, 06:41, "Andrew Beekhof" :
> On 26/06/2013, at 3:01 AM, Denis Witt  
> wrote:
>
>>  On Tue, 25 Jun 2013 17:12:15 +0200
>>  Denis Witt  wrote:
>>>  ./configure runs fine, but make didn't. I don't remember the exact
>>>  error message and before I could run it again I have to solve my
>>>  OCFS2-Problem. But I'll try again and post it here.
>>  Hi Andrew,
>>
>>  last time I didn't had rpm installed and started ./configure and make
>>  by hand. (I didn't saw the rpm error message last time, it was very
>>  late.)
>>
>>  Now ./autogen.sh runs fine, but my libqb is too old:
>>
>>  configure: error: Version of libqb is too old: v0.13 or greater requried
>>
>>  System is Debian Wheezy which means version 0.11.1-2 for libqb-dev.
>
> rpm errors on debian?
> I'm confused.
>

Now the libqb version is properly detected :)

>>  Best regards
>>  Denis Witt
>>
>>  ___
>>  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Two resource nodes + one quorum node

2013-06-25 Thread Andrey Groshev


25.06.2013, 09:32, "Andrey Groshev" :
> 22.06.2013, 21:32, "Lars Marowsky-Bree" :
>
>>  On 2013-06-21T14:30:42, Andrey Groshev  wrote:
>>>   I was wrong - the resource starts in 15 minutes.
>>>   I found a matching entry in the log at the same time:
>>>    grep '11:59.*900' /var/log/cluster/corosync.log
>>>   Jun 21 11:59:50 [23616] dev-cluster2-node4 crmd: info: 
>>> crm_timer_popped:   PEngine Recheck Timer (I_PE_CALC) just popped 
>>> (900000ms)
>>>   Jun 21 11:59:54 [23616] dev-cluster2-node4 crmd:    debug: 
>>> crm_timer_start:    Started PEngine Recheck Timer (I_PE_CALC:900000ms), 
>>> src=220
>>>
>>>   But anyway, now I'm more interested in the question "why such behavior."
>>>   Please tell me which part of the documentation I have not read?
>>  Looks like a bug. Normally, a cluster event ought to trigger the PE
>>  immediately.
>
> Maybe, even without pacemaker in log exist a errors.
> For begin, now try to understand them.
>
> # grep -i 'error\|bad' /var/log/cluster/corosync.log
> Jun 25 09:07:16 [11992] dev-cluster2-node2 corosync debug   [QB    ] 
> epoll_ctl(del): Bad file descriptor (9)
> Jun 25 09:12:57 [11992] dev-cluster2-node2 corosync debug   [VOTEQ ] getinfo 
> response error: 1
> Jun 25 09:12:57 [11992] dev-cluster2-node2 corosync debug   [VOTEQ ] getinfo 
> response error: 1
> Jun 25 09:12:57 [11992] dev-cluster2-node2 corosync debug   [QB    ] 
> epoll_ctl(del): Bad file descriptor (9)
> Jun 25 09:12:57 [11992] dev-cluster2-node2 corosync debug   [QB    ] 
> epoll_ctl(del): Bad file descriptor (9)
> Jun 25 09:12:57 [11992] dev-cluster2-node2 corosync debug   [QB    ] 
> epoll_ctl(del): Bad file descriptor (9)
> Jun 25 09:12:57 [11992] dev-cluster2-node2 corosync debug   [QB    ] 
> epoll_ctl(del): Bad file descriptor (9)
> Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [VOTEQ ] getinfo 
> response error: 1
> Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [VOTEQ ] getinfo 
> response error: 1
> Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [VOTEQ ] getinfo 
> response error: 1
> Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [QB    ] 
> epoll_ctl(del): Bad file descriptor (9)
> Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [QB    ] 
> epoll_ctl(del): Bad file descriptor (9)
> Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [QB    ] 
> epoll_ctl(del): Bad file descriptor (9)
> Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [QB    ] 
> epoll_ctl(del): Bad file descriptor (9)
>

Damn! I do not know what to call these people... for whom "OK" has the value 1!!!

>>  Regards,
>>  Lars
>>
>>  --
>>  Architect Storage/HA
>>  SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, 
>> HRB 21284 (AG Nürnberg)
>>  "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>>
>>  ___
>>  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Reminder: Pacemaker-1.1.10-rc5 is out there

2013-06-24 Thread Andrey Groshev


25.06.2013, 09:49, "Andrew Beekhof" :
> On 25/06/2013, at 2:33 PM, Andrey Groshev  wrote:
>
>>  25.06.2013, 04:46, "Andrew Beekhof" :
>>>  On 24/06/2013, at 3:44 PM, Vladislav Bogdanov  wrote:
>>>>   24.06.2013 04:17, Andrew Beekhof wrote:
>>>>>   Either people have given up on testing, or rc5[1] is looking good for 
>>>>> the final release.
>>>>   Is it going to be 1.1.10 or 1.2.0 (2.0.0)?
>>>  First its going to be 1.1.10 and, if there is still no-one screaming, 
>>> after a couple of weeks it will become 2.0[.0]
>>  What is really new in this version to change the major version?
>
> Compared to what? To 1.1.10 hopefully nothing, but that was always the point.
>
>>  So far there are changes in the interface and API.
>>  IMHO, a stable product should not be such.
>
> 1.1 is not a stable branch, thats the entire point of re-releasing 1.1.10 as 
> 2.0:
>
>    http://blog.clusterlabs.org/blog/2010/new-pacemaker-release-series/
>
> Only instead of stopping development in 2010 and releasing 1.2.0, we kept 
> going and the subsequent 4511 changesets and "3490 files changed, 410422 
> insertions(+), 144311 deletions(-)" justify the 2.0 moniker.
>

OK, I only recently started working with PCMK, so this is a surprise to me.
All the more so since all the major Linux distributions ship version 1.1.x.

>>>>>   So just a reminder, we're particularly looking for feedback in the 
>>>>> following areas:
>>>>>
>>>>>   | plugin-based clusters, ACLs, the new –ban and –clear commands, and 
>>>>> admin actions
>>>>>   | (such as  moving and stopping resources, calls to stonith_admin) 
>>>>> which are hard
>>>>>   | to test in an automated manner.
>>>>>   |
>>>>>   | Also any light that can be shed on possible memory leaks would be 
>>>>> much appreciated.
>>>>>
>>>>>   I would very much like to hear the observations (good or bad) of people 
>>>>> that have taken it for a spin.
>>>>>
>>>>>   -- Andrew
>>>>>
>>>>>   [1] 
>>>>> http://blog.clusterlabs.org/blog/2013/release-candidate-1-dot-1-10-rc5/
>>>>>   ___
>>>>>   Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>   http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>
>>>>>   Project Home: http://www.clusterlabs.org
>>>>>   Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>   Bugs: http://bugs.clusterlabs.org
>>>>   ___
>>>>   Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>   http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>>   Project Home: http://www.clusterlabs.org
>>>>   Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>   Bugs: http://bugs.clusterlabs.org
>>>  ___
>>>  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>>  Project Home: http://www.clusterlabs.org
>>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>  Bugs: http://bugs.clusterlabs.org
>>  ___
>>  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Two resource nodes + one quorum node

2013-06-24 Thread Andrey Groshev


22.06.2013, 21:32, "Lars Marowsky-Bree" :
> On 2013-06-21T14:30:42, Andrey Groshev  wrote:
>
>>  I was wrong - the resource starts in 15 minutes.
>>  I found a matching entry in the log at the same time:
>>   grep '11:59.*900' /var/log/cluster/corosync.log
>>  Jun 21 11:59:50 [23616] dev-cluster2-node4 crmd: info: 
>> crm_timer_popped:   PEngine Recheck Timer (I_PE_CALC) just popped 
>> (900000ms)
>>  Jun 21 11:59:54 [23616] dev-cluster2-node4 crmd:    debug: crm_timer_start: 
>>    Started PEngine Recheck Timer (I_PE_CALC:900000ms), src=220
>>
>>  But anyway, now I'm more interested in the question "why such behavior."
>>  Please tell me which part of the documentation I have not read?
>
> Looks like a bug. Normally, a cluster event ought to trigger the PE
> immediately.

Maybe - even without pacemaker there are errors in the log.
To begin with, I will now try to understand them.

# grep -i 'error\|bad' /var/log/cluster/corosync.log
Jun 25 09:07:16 [11992] dev-cluster2-node2 corosync debug   [QB] 
epoll_ctl(del): Bad file descriptor (9)
Jun 25 09:12:57 [11992] dev-cluster2-node2 corosync debug   [VOTEQ ] getinfo 
response error: 1
Jun 25 09:12:57 [11992] dev-cluster2-node2 corosync debug   [VOTEQ ] getinfo 
response error: 1
Jun 25 09:12:57 [11992] dev-cluster2-node2 corosync debug   [QB] 
epoll_ctl(del): Bad file descriptor (9)
Jun 25 09:12:57 [11992] dev-cluster2-node2 corosync debug   [QB] 
epoll_ctl(del): Bad file descriptor (9)
Jun 25 09:12:57 [11992] dev-cluster2-node2 corosync debug   [QB] 
epoll_ctl(del): Bad file descriptor (9)
Jun 25 09:12:57 [11992] dev-cluster2-node2 corosync debug   [QB] 
epoll_ctl(del): Bad file descriptor (9)
Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [VOTEQ ] getinfo 
response error: 1
Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [VOTEQ ] getinfo 
response error: 1
Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [VOTEQ ] getinfo 
response error: 1
Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [QB] 
epoll_ctl(del): Bad file descriptor (9)
Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [QB] 
epoll_ctl(del): Bad file descriptor (9)
Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [QB] 
epoll_ctl(del): Bad file descriptor (9)
Jun 25 09:13:01 [11992] dev-cluster2-node2 corosync debug   [QB] 
epoll_ctl(del): Bad file descriptor (9)


> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, 
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Reminder: Pacemaker-1.1.10-rc5 is out there

2013-06-24 Thread Andrey Groshev


25.06.2013, 04:46, "Andrew Beekhof" :
> On 24/06/2013, at 3:44 PM, Vladislav Bogdanov  wrote:
>
>>  24.06.2013 04:17, Andrew Beekhof wrote:
>>>  Either people have given up on testing, or rc5[1] is looking good for the 
>>> final release.
>>  Is it going to be 1.1.10 or 1.2.0 (2.0.0)?
>
> First its going to be 1.1.10 and, if there is still no-one screaming, after a 
> couple of weeks it will become 2.0[.0]

What is really new in this version that justifies changing the major version?
So far there are changes in the interface and API.
IMHO, a stable product should not change like that.

>
>>>  So just a reminder, we're particularly looking for feedback in the 
>>> following areas:
>>>
>>>  | plugin-based clusters, ACLs, the new –ban and –clear commands, and admin 
>>> actions
>>>  | (such as  moving and stopping resources, calls to stonith_admin) which 
>>> are hard
>>>  | to test in an automated manner.
>>>  |
>>>  | Also any light that can be shed on possible memory leaks would be much 
>>> appreciated.
>>>
>>>  I would very much like to hear the observations (good or bad) of people 
>>> that have taken it for a spin.
>>>
>>>  -- Andrew
>>>
>>>  [1] http://blog.clusterlabs.org/blog/2013/release-candidate-1-dot-1-10-rc5/
>>>  ___
>>>  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>>  Project Home: http://www.clusterlabs.org
>>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>  Bugs: http://bugs.clusterlabs.org
>>  ___
>>  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Two resource nodes + one quorum node

2013-06-21 Thread Andrey Groshev


19.06.2013, 10:19, "Andrey Groshev" :
> I started experimenting.
> I ran into the first puzzling situation:
> There are three nodes. One exists only for quorum, i.e. without pacemaker
> installed.
>
> 1. Run all the nodes - the cluster is running. All right.
> 2. Disconnect the quorum node - the cluster is running. (crm_mon: partition with
> quorum)
> 3. Disconnect the node with pacemaker - quorum is lost, the resources stop
> (crm_mon: partition WITHOUT quorum).
> 4. Connect the quorum node (crm_mon: partition with quorum) but the resources
> do not start. Why?

I was wrong - the resource starts in 15 minutes.
I found a matching entry in the log at the same time:
 grep '11:59.*900' /var/log/cluster/corosync.log
Jun 21 11:59:50 [23616] dev-cluster2-node4 crmd: info: crm_timer_popped:
   PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
Jun 21 11:59:54 [23616] dev-cluster2-node4 crmd:    debug: crm_timer_start: 
   Started PEngine Recheck Timer (I_PE_CALC:900000ms), src=220

But anyway, now I'm more interested in the question "why such behavior."
Please tell me which part of the documentation I have not read?
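
(For context: 900000 ms is 15 minutes - the default value of the cluster property cluster-recheck-interval, which is what the PEngine Recheck Timer fires on. A sketch of shortening it, with the one-minute value chosen purely as an example:)

    # shorten the PE recheck interval from the default 15min
    crm_attribute --type crm_config --name cluster-recheck-interval --update 1min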

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

