Sorry for the delay, sometimes it takes a while to rebuild the necessary context
On 5 Mar 2014, at 4:42 pm, Andrey Groshev <gre...@yandex.ru> wrote:

>
>
> 05.03.2014, 04:04, "Andrew Beekhof" <and...@beekhof.net>:
>> On 25 Feb 2014, at 8:30 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>
>>> 21.02.2014, 12:04, "Andrey Groshev" <gre...@yandex.ru>:
>>>> 21.02.2014, 05:53, "Andrew Beekhof" <and...@beekhof.net>:
>>>>> On 19 Feb 2014, at 7:53 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>> 19.02.2014, 09:49, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>> On 19 Feb 2014, at 4:18 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>> 19.02.2014, 09:08, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>> On 19 Feb 2014, at 4:00 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>> 19.02.2014, 06:48, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>> On 18 Feb 2014, at 11:05 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>> Hi, ALL and Andrew!
>>>>>>>>>>>>
>>>>>>>>>>>> Today is a good day - I killed a lot, and a lot of shooting at me.
>>>>>>>>>>>> In general - I am happy (almost like an elephant) :)
>>>>>>>>>>>> Except resources on the node are important to me eight processes:
>>>>>>>>>>>> corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>>>>>> I killed them with different signals (4,6,11 and even 9).
>>>>>>>>>>>> Behavior does not depend of number signal - it's good.
>>>>>>>>>>>> If STONITH send reboot to the node - it rebooted and rejoined the cluster - too it's good.
>>>>>>>>>>>> But the behavior is different from killing various demons.
>>>>>>>>>>>>
>>>>>>>>>>>> Turned four groups:
>>>>>>>>>>>> 1. corosync,cib - STONITH work 100%.
>>>>>>>>>>>> Kill via any signals - call STONITH and reboot.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. lrmd,crmd - strange behavior STONITH.
>>>>>>>>>>>> Sometimes called STONITH - and the corresponding reaction.
>>>>>>>>>>>> Sometimes restart daemon and restart resources with large delay MS:pgsql.
>>>>>>>>>>>> One time after restart crmd - pgsql don't restart.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. stonithd,attrd,pengine - not need STONITH
>>>>>>>>>>>> This daemons simple restart, resources - stay running.
>>>>>>>>>>>>
>>>>>>>>>>>> 4. pacemakerd - nothing happens.
>>>>>>>>>>>> And then I can kill any process of the third group. They do not restart.
>>>>>>>>>>>> Generaly don't touch corosync,cib and maybe lrmd,crmd.
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think about this?
>>>>>>>>>>>> The main question of this topic - we decided.
>>>>>>>>>>>> But this varied behavior - another big problem.
>>>>>>>>>>>>
>>>>>>>>>>>> Forgot logs http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>>>>>>>>> Which of the various conditions above do the logs cover?
>>>>>>>>>> All various in day.
>>>>>>>>> Are you trying to torture me?
>>>>>>>>> Can you give me a rough idea what happened when?
>>>>>>>> No, there is 8 processes on the 4th signal and repeats the experiments with unknown outcome :)
>>>>>>>> Easier to conduct new experiments and individual new logs .
>>>>>>>> Which variant is more interesting?
>>>>>>> The long delay in restarting pgsql.
>>>>>>> Everything else seems correct.
>>>>>> He even don't tried start pgsql.
>>>>>> In Logs tree the tests.
>>>>>> kill -s4 lrmd pid.
>>>>>> 1. STONITH
>>>>>> 2. STONITH
>>>>>> 3. hangs
>>>>> Its waiting on a value for default_ping_set
>>>>>
>>>>> It seems we're calling monitor for pingCheck but for some reason its not performing an update:
>>>>>
>>>>> # grep 2632.*lrmd.*pingCheck /Users/beekhof/Downloads/pcmk-Wed-19-Feb-2014/dev-cluster2-node2.unix.tensor.ru/corosync.log
>>>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_get_rsc_info: Resource 'pingCheck' not found (3 active resources)
>>>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_get_rsc_info: Resource 'pingCheck:3' not found (3 active resources)
>>>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_rsc_register: Added 'pingCheck' to the rsc list (4 active resources)
>>>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_execute: executing - rsc:pingCheck action:monitor call_id:19
>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658 - exited with rc=0
>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658:stderr [ -- empty -- ]
>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658:stdout [ -- empty -- ]
>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_finished: finished - rsc:pingCheck action:monitor call_id:19 pid:2658 exit-code:0 exec-time:2039ms queue-time:0ms
>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_execute: executing - rsc:pingCheck action:monitor call_id:20
>>>>> Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_10000:2816 - exited with rc=0
>>>>> Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_10000:2816:stderr [ -- empty -- ]
>>>>> Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_10000:2816:stdout [ -- empty -- ]
>>>>>
>>>>> Could you add:
>>>>>
>>>>> export OCF_TRACE_RA=1
>>>>>
>>>>> to the top of the ping agent and retest?
>>>> Today the fourth time worked.
>>>> I even doubted if the difference is how to kill (kill -s 4 pid or pkill -4 lrmd)
>>>> Logs http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2
>>> Hi,
>>> You haven't watched it?
>>
>> Not yet. I've been hitting ACLs with a large hammer.
>> Where are we up to with this? Do I disregard this one and look at the most recent email?
>>
>
> Hi.
> No. These are two different cases.
> * When after kill lrmd resources don't start.
> This http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2

Grumble, the logs are still useless...

./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_start_0:26456 - exited with rc=0
./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_start_0:26456:stderr [ -- empty -- ]
./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_start_0:26456:stdout [ -- empty -- ]

Can you just add the following to the top of the resource agent?

    set -x

>
> * When standby a entrie cluster (all nodes standby).
> Second node - hangs pending.
> But last rebuild rpm - not confirmed the problem.

Ok, so potentially this is fixed with the latest git?

> Therefore, this problem can be considered as long as not a problem.
> http://send2me.ru/pcmk-04-Mar-2014-2.tar.bz2
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
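
For what it's worth, here is roughly what the `set -x` request above buys us. This is only a sketch: `fake_agent` is a hypothetical stand-in, not the real ping resource agent, and its return codes are purely illustrative. The point is that with tracing enabled the shell writes each command it runs to stderr, which is the stream that shows up as `[ -- empty -- ]` in the lrmd lines quoted above.

```shell
#!/bin/sh
# Hypothetical stand-in for a resource agent (NOT the real ping agent).
# The function body runs in a subshell so tracing stays local to it.
fake_agent() (
    set -x                      # from here on, each command is echoed to stderr
    action="$1"
    case "$action" in
        monitor) rc=0 ;;        # illustrative "success" code
        *)       rc=3 ;;        # illustrative "not implemented" code
    esac
    exit "$rc"
)

# Capture what the agent writes to stderr (the xtrace output) and its exit code.
trace=$(fake_agent monitor 2>&1)
agent_rc=$?

printf 'exit code: %s\n' "$agent_rc"
printf 'trace:\n%s\n' "$trace"
```

With something like this in place, the stderr field of the `operation_finished` log lines should no longer be empty, which is what makes the logs useful for seeing where the agent stops.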