Re: [Pacemaker] hangs pending

Andrew Beekhof Tue, 11 Mar 2014 15:53:28 -0700

Sorry for the delay; sometimes it takes a while to rebuild the necessary context.


On 5 Mar 2014, at 4:42 pm, Andrey Groshev <gre...@yandex.ru> wrote:

> 
> 
> 05.03.2014, 04:04, "Andrew Beekhof" <and...@beekhof.net>:
>> On 25 Feb 2014, at 8:30 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>> 
>>>  21.02.2014, 12:04, "Andrey Groshev" <gre...@yandex.ru>:
>>>>  21.02.2014, 05:53, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>   On 19 Feb 2014, at 7:53 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>    19.02.2014, 09:49, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>    On 19 Feb 2014, at 4:18 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>     19.02.2014, 09:08, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>     On 19 Feb 2014, at 4:00 pm, Andrey Groshev <gre...@yandex.ru> 
>>>>>>>>> wrote:
>>>>>>>>>>      19.02.2014, 06:48, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>      On 18 Feb 2014, at 11:05 pm, Andrey Groshev <gre...@yandex.ru> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>>       Hi, ALL and Andrew!
>>>>>>>>>>>> 
>>>>>>>>>>>>       Today is a good day - I killed a lot, and a lot of shooting 
>>>>>>>>>>>> at me.
>>>>>>>>>>>>       In general - I am happy (almost like an elephant)   :)
>>>>>>>>>>>>       Apart from the resources, eight processes on the node matter
>>>>>>>>>>>> to me:
>>>>>>>>>>>> corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd.
>>>>>>>>>>>>       I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>>>>>>       The behavior does not depend on the signal number, which is good.
>>>>>>>>>>>>       If STONITH sends a reboot to the node, it reboots and
>>>>>>>>>>>> rejoins the cluster, which is also good.
>>>>>>>>>>>>       But the behavior differs depending on which daemon is killed.
>>>>>>>>>>>> 
>>>>>>>>>>>>       They fall into four groups:
>>>>>>>>>>>>       1. corosync, cib - STONITH works 100% of the time.
>>>>>>>>>>>>       Killing with any signal triggers STONITH and a reboot.
>>>>>>>>>>>>
>>>>>>>>>>>>       2. lrmd, crmd - STONITH behaves strangely.
>>>>>>>>>>>>       Sometimes STONITH is called, with the corresponding reaction.
>>>>>>>>>>>>       Sometimes the daemon restarts and the resources restart, with
>>>>>>>>>>>> a long delay for MS:pgsql.
>>>>>>>>>>>>       Once, after crmd restarted, pgsql did not restart.
>>>>>>>>>>>>
>>>>>>>>>>>>       3. stonithd, attrd, pengine - STONITH is not needed.
>>>>>>>>>>>>       These daemons simply restart; the resources stay running.
>>>>>>>>>>>>
>>>>>>>>>>>>       4. pacemakerd - nothing happens.
>>>>>>>>>>>>       And then I can kill any process of the third group. They do
>>>>>>>>>>>> not restart.
>>>>>>>>>>>>       Generally don't touch corosync, cib and maybe lrmd, crmd.
>>>>>>>>>>>> 
>>>>>>>>>>>>       What do you think about this?
>>>>>>>>>>>>       The main question of this topic is settled.
>>>>>>>>>>>>       But this varied behavior is another big problem.
>>>>>>>>>>>> 
>>>>>>>>>>>>       Forgot logs http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>>>>>>>>>      Which of the various conditions above do the logs cover?
>>>>>>>>>>      All of them, over the course of the day.
>>>>>>>>>     Are you trying to torture me?
>>>>>>>>>     Can you give me a rough idea what happened when?
>>>>>>>>     No, it's 8 processes times 4 signals, plus repeated
>>>>>>>> experiments with unknown outcomes :)
>>>>>>>>     It would be easier to run new experiments and collect separate logs.
>>>>>>>>     Which variant is more interesting?
>>>>>>>    The long delay in restarting pgsql.
>>>>>>>    Everything else seems correct.
>>>>>>    It didn't even try to start pgsql.
>>>>>>    The logs contain three tests.
>>>>>>    kill -s4 lrmd pid.
>>>>>>    1. STONITH
>>>>>>    2. STONITH
>>>>>>    3. hangs
>>>>>   It's waiting on a value for default_ping_set
>>>>> 
>>>>>   It seems we're calling monitor for pingCheck but for some reason it's
>>>>> not performing an update:
>>>>> 
>>>>>   # grep 2632.*lrmd.*pingCheck /Users/beekhof/Downloads/pcmk-Wed-19-Feb-2014/dev-cluster2-node2.unix.tensor.ru/corosync.log
>>>>>   Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:     info: process_lrmd_get_rsc_info: Resource 'pingCheck' not found (3 active resources)
>>>>>   Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:     info: process_lrmd_get_rsc_info: Resource 'pingCheck:3' not found (3 active resources)
>>>>>   Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:     info: process_lrmd_rsc_register: Added 'pingCheck' to the rsc list (4 active resources)
>>>>>   Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: log_execute: executing - rsc:pingCheck action:monitor call_id:19
>>>>>   Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_0:2658 - exited with rc=0
>>>>>   Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_0:2658:stderr [ -- empty -- ]
>>>>>   Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_0:2658:stdout [ -- empty -- ]
>>>>>   Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: log_finished: finished - rsc:pingCheck action:monitor call_id:19 pid:2658 exit-code:0 exec-time:2039ms queue-time:0ms
>>>>>   Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: log_execute: executing - rsc:pingCheck action:monitor call_id:20
>>>>>   Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_10000:2816 - exited with rc=0
>>>>>   Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_10000:2816:stderr [ -- empty -- ]
>>>>>   Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_10000:2816:stdout [ -- empty -- ]
>>>>> 
>>>>>   Could you add:
>>>>> 
>>>>>     export OCF_TRACE_RA=1
>>>>> 
>>>>>   to the top of the ping agent and retest?
>>>>  Today it worked on the fourth attempt.
>>>>  I even wondered whether how it is killed makes a difference (kill -s 4 pid
>>>> or pkill -4 lrmd)
>>>>  Logs http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2
>>>  Hi,
>>>  Have you not looked at it yet?
>> 
>> Not yet. I've been hitting ACLs with a large hammer.
>> Where are we up to with this?  Do I disregard this one and look at the most 
>> recent email?
>> 
> 
> Hi.
> No. These are two different cases.
> * When resources don't start after lrmd is killed.
> Logs: http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2

Grumble, the logs are still useless...

./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished:    pingCheck_start_0:26456 - exited with rc=0
./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished:    pingCheck_start_0:26456:stderr [ -- empty -- ]
./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished:    pingCheck_start_0:26456:stdout [ -- empty -- ]
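
Lines like the three above can be pulled out of corosync.log with a search modelled on the grep quoted earlier in this thread. This is only a sketch: the function name lrmd_ops is made up for illustration, and the pattern assumes the log layout shown in the excerpt.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pacemaker): print what lrmd logged when
# operations for one resource finished, including the captured stderr/stdout.
#   $1 = resource name, e.g. pingCheck
#   $2 = path to corosync.log
lrmd_ops() {
    # Assumes lines of the form
    #   "... lrmd:    debug: operation_finished:    <rsc>_<op>_<interval>:<pid> ..."
    grep "lrmd:.*operation_finished:.*${1}_" "$2"
}

# Example call (path is illustrative):
#   lrmd_ops pingCheck ./dev-cluster2-node2.unix.tensor.ru/corosync.log
```

If the stderr line still shows "[ -- empty -- ]" after tracing has been switched on in the agent, the trace output is not reaching lrmd.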

Can you just add the following to the top of the resource agent?

set -x
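
A minimal sketch of what that buys us, using a hypothetical stand-in function rather than the real ping agent: with set -x the shell writes each command to stderr before running it, so the stderr that lrmd records for the operation should no longer be empty. The host value below is illustrative only.

```shell
#!/bin/sh
# Hypothetical stand-in for a resource agent; NOT the real ping agent.
demo_agent() {
    (
        set -x                      # from here on, trace every command to stderr
        host_list="192.0.2.1"       # illustrative value, not from the cluster
        echo "checking $host_list"
    )
}

# Capture only stderr (the trace) and discard stdout, roughly mirroring the
# separate ":stderr [ ... ]" and ":stdout [ ... ]" lines in the lrmd log.
trace=$(demo_agent 2>&1 >/dev/null)
printf '%s\n' "$trace"
```

The captured trace contains the assignment and the echo command as the shell executed them, which is the level of detail missing from the logs so far.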

> 
> * When the entire cluster is put into standby (all nodes standby).
> The second node hangs pending.
> But with the latest rebuilt rpm the problem was not confirmed.

Ok, so potentially this is fixed with the latest git?

> Therefore, for now this can be considered not a problem.
> http://send2me.ru/pcmk-04-Mar-2014-2.tar.bz2
> 
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
