19.03.2014, 03:29, "Andrew Beekhof" <and...@beekhof.net>:
> On 19 Mar 2014, at 6:19 am, Andrey Groshev <gre...@yandex.ru> wrote:
>
>> 12.03.2014, 02:53, "Andrew Beekhof" <and...@beekhof.net>:
>>> Sorry for the delay, sometimes it takes a while to rebuild the necessary
>>> context
>> I'm sorry for the delayed answer as well.
>> I have switched to using "upstart" to initialize corosync and pacemaker
>> (with respawn).
>> The system's behavior has changed, and it suits me now. (so far :) )
>> I have to kill crmd/lrmd in an infinite loop before STONITH shoots;
>> otherwise they respawn very quickly and nothing else happens.
>>
>> Of course, I still found another way to hang the system.
>> It only requires one idiot:
>> 1. He decides to update pacemaker (and/or erase a service he does not
>> understand).
>> 2. Then he kills the corosync process, or simply reboots the server.
>> That's it! The node will stay stuck in "pending".
>
> While trying to shut down?
> Our spec files shut pacemaker down prior to upgrades FWIW.
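[Editor's note: the shutdown-before-upgrade that Andrew refers to can be sketched as an RPM scriptlet. This is a hedged illustration of the usual pattern; the exact contents of the pacemaker spec file are an assumption, not quoted from the packages.]

```
%preun
# Hypothetical sketch: stop the cluster stack before the old files are
# removed, so the node leaves the cluster cleanly instead of hanging in
# "pending". $1 == 0 means full erase rather than upgrade.
if [ "$1" -eq 0 ]; then
    /sbin/service pacemaker stop >/dev/null 2>&1 || :
fi
```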
Not so simple... We have a national tradition: we care for and cherish our
idiots. Therefore they are clever, quirky, and unpredictable. ;)
He can simply delete the package's files without uninstalling it.
(In reality it may just be a crash of the file system.)
>
>> And the worst thing: if even one node hangs in "pending", promote/demote
>> and other resource management operations stop working.
>>
>> Yes, there is one oddity with the slow startup of some resources, but IMHO
>> it is not very critical.
>> I will try to fix it later; right now I am writing documentation for the
>> project.
>>> On 5 Mar 2014, at 4:42 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>> 05.03.2014, 04:04, "Andrew Beekhof" <and...@beekhof.net>:
>>>>> On 25 Feb 2014, at 8:30 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>> 21.02.2014, 12:04, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>> 21.02.2014, 05:53, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>> On 19 Feb 2014, at 7:53 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>> 19.02.2014, 09:49, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>> On 19 Feb 2014, at 4:18 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>> 19.02.2014, 09:08, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>> On 19 Feb 2014, at 4:00 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>> 19.02.2014, 06:48, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>> On 18 Feb 2014, at 11:05 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>> Hi, ALL and Andrew!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Today was a good day - I killed a lot, and a lot was shot at me.
>>>>>>>>>>>>>>> In general, I am happy (almost like an elephant) :)
>>>>>>>>>>>>>>> Besides the resources, eight processes on the node matter to me:
>>>>>>>>>>>>>>> corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd.
>>>>>>>>>>>>>>> I killed them with different signals (4, 6, 11, and even 9).
>>>>>>>>>>>>>>> The behavior does not depend on the signal number - that's good.
>>>>>>>>>>>>>>> If STONITH sends a reboot to the node, it reboots and rejoins the
>>>>>>>>>>>>>>> cluster - that's also good.
>>>>>>>>>>>>>>> But the behavior differs depending on which daemon is killed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> They fall into four groups:
>>>>>>>>>>>>>>> 1. corosync, cib - STONITH works 100%.
>>>>>>>>>>>>>>> Killed with any signal, STONITH fires and the node reboots.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. lrmd, crmd - strange STONITH behavior.
>>>>>>>>>>>>>>> Sometimes STONITH is called, with the corresponding reaction.
>>>>>>>>>>>>>>> Sometimes the daemon restarts and the resources restart, with a
>>>>>>>>>>>>>>> large delay for MS:pgsql.
>>>>>>>>>>>>>>> Once, after crmd restarted, pgsql did not restart at all.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3. stonithd, attrd, pengine - no STONITH needed.
>>>>>>>>>>>>>>> These daemons simply restart; the resources keep running.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 4. pacemakerd - nothing happens.
>>>>>>>>>>>>>>> And afterwards I can kill any process of the third group:
>>>>>>>>>>>>>>> they are not restarted.
>>>>>>>>>>>>>>> In general, just don't touch corosync, cib, and maybe lrmd, crmd.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What do you think about this?
>>>>>>>>>>>>>>> The main question of this topic we have settled.
>>>>>>>>>>>>>>> But this varied behavior is another big problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Forgot the logs:
>>>>>>>>>>>>>>> http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>>>>>>>>>>>> Which of the various conditions above do the logs cover?
>>>>>>>>>>>>> All of the variants, over the whole day.
>>>>>>>>>>>> Are you trying to torture me?
>>>>>>>>>>>> Can you give me a rough idea what happened when?
>>>>>>>>>>> No - that is 8 processes times 4 signals, plus repeats of the
>>>>>>>>>>> experiments with unknown outcomes :)
>>>>>>>>>>> It is easier to run new experiments and collect separate new logs.
>>>>>>>>>>> Which variant is more interesting?
>>>>>>>>>> The long delay in restarting pgsql.
>>>>>>>>>> Everything else seems correct.
>>>>>>>>> It did not even try to start pgsql.
>>>>>>>>> The logs contain three tests of "kill -s4 <lrmd pid>":
>>>>>>>>> 1. STONITH
>>>>>>>>> 2. STONITH
>>>>>>>>> 3. hangs
>>>>>>>> It's waiting on a value for default_ping_set
>>>>>>>>
>>>>>>>> It seems we're calling monitor for pingCheck but for some reason it's
>>>>>>>> not performing an update:
>>>>>>>>
>>>>>>>> # grep 2632.*lrmd.*pingCheck /Users/beekhof/Downloads/pcmk-Wed-19-Feb-2014/dev-cluster2-node2.unix.tensor.ru/corosync.log
>>>>>>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_get_rsc_info: Resource 'pingCheck' not found (3 active resources)
>>>>>>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_get_rsc_info: Resource 'pingCheck:3' not found (3 active resources)
>>>>>>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_rsc_register: Added 'pingCheck' to the rsc list (4 active resources)
>>>>>>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_execute: executing - rsc:pingCheck action:monitor call_id:19
>>>>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658 - exited with rc=0
>>>>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658:stderr [ -- empty -- ]
>>>>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658:stdout [ -- empty -- ]
>>>>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_finished: finished - rsc:pingCheck action:monitor call_id:19 pid:2658 exit-code:0 exec-time:2039ms queue-time:0ms
>>>>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_execute: executing - rsc:pingCheck action:monitor call_id:20
>>>>>>>> Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_10000:2816 - exited with rc=0
>>>>>>>> Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_10000:2816:stderr [ -- empty -- ]
>>>>>>>> Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_10000:2816:stdout [ -- empty -- ]
>>>>>>>>
>>>>>>>> Could you add:
>>>>>>>>
>>>>>>>> export OCF_TRACE_RA=1
>>>>>>>>
>>>>>>>> to the top of the ping agent and retest?
>>>>>>> Today it worked on the fourth try.
>>>>>>> I even began to doubt whether it depends on how you kill it
>>>>>>> (kill -s 4 <pid> vs pkill -4 lrmd).
>>>>>>> Logs: http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2
>>>>>> Hi,
>>>>>> Have you had a chance to look at it?
>>>>> Not yet. I've been hitting ACLs with a large hammer.
>>>>> Where are we up to with this? Do I disregard this one and look at the
>>>>> most recent email?
>>>> Hi.
>>>> No. These are two different cases.
>>>> * The case where resources do not start after lrmd is killed.
>>>> That is this one: http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2
>>> Grumble, the logs are still useless...
>>>
>>> ./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_start_0:26456 - exited with rc=0
>>> ./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_start_0:26456:stderr [ -- empty -- ]
>>> ./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_start_0:26456:stdout [ -- empty -- ]
>>>
>>> Can you just add the following to the top of the resource agent?
>>>
>>> set -x
>>>> * The case where the entire cluster is put into standby (all nodes in
>>>> standby): the second node hangs in "pending".
>>>> But with the last rebuilt RPM the problem was not reproduced.
>>> Ok, so potentially this is fixed with the latest git?
>>>> So for now this can be considered a non-problem.
>>>> http://send2me.ru/pcmk-04-Mar-2014-2.tar.bz2
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
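[Editor's note: both tracing hooks suggested in the thread - `set -x` at the top of the agent, and the resource-agents `OCF_TRACE_RA=1` switch - work by making the shell echo each command to stderr, which lrmd captures into corosync.log. A minimal sketch of the mechanism; `ping_check_monitor` is a placeholder, not the real ocf:pacemaker:ping code.]

```shell
#!/bin/sh
# "set -x" near the top of a shell resource agent traces every
# subsequent command to stderr; lrmd logs that stderr, so the agent's
# steps become visible (e.g. why a monitor never updated
# default_ping_set).
set -x

# Placeholder for the agent's monitor logic - the real ping agent runs
# ping/fping against its host_list and updates a node attribute.
ping_check_monitor() {
    true
}

ping_check_monitor
echo "monitor rc=$?"   # prints "monitor rc=0"
```

With tracing enabled, each executed line appears in the log prefixed with "+", so you can see exactly where the agent stops making progress.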