20.02.2014, 13:57, "Andrew Beekhof" <and...@beekhof.net>:
> On 20 Feb 2014, at 5:33 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>
>> 20.02.2014, 01:22, "Andrew Beekhof" <and...@beekhof.net>:
>>> On 20 Feb 2014, at 4:18 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>> 19.02.2014, 06:47, "Andrew Beekhof" <and...@beekhof.net>:
>>>>> On 18 Feb 2014, at 9:29 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>> Hi, ALL and Andrew!
>>>>>>
>>>>>> Today is a good day - I killed a lot, and a lot of shooting came back at me.
>>>>>> In general - I am happy (almost like an elephant) :)
>>>>>> Apart from the resources, eight processes on the node matter to me:
>>>>>> corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd.
>>>>>> I killed them with different signals (4, 6, 11 and even 9).
>>>>>> The behavior does not depend on the signal number - that is good.
>>>>>> If STONITH sends a reboot to the node, it reboots and rejoins the
>>>>>> cluster - that is also good.
>>>>>> But the behavior differs depending on which daemon is killed.
>>>>>>
>>>>>> They fall into four groups:
>>>>>> 1. corosync, cib - STONITH works 100%.
>>>>>>    Killed with any signal - STONITH is called and the node reboots.
>>>>> excellent
>>>>>> 3. stonithd, attrd, pengine - STONITH is not needed.
>>>>>>    These daemons simply restart; resources stay running.
>>>>> right
>>>>>> 2. lrmd, crmd - strange STONITH behavior.
>>>>>>    Sometimes STONITH is called - with the corresponding reaction.
>>>>>>    Sometimes the daemon just restarts
>>>>> The daemon will always try to restart, the only variable is how long it
>>>>> takes the peer to notice and initiate fencing.
>>>>> If the failure happens just before they're due to receive a totem
>>>>> token, the failure will be very quickly detected and the node fenced.
>>>>> If the failure happens just after, then detection will take longer -
>>>>> giving the node longer to recover and not be fenced.
>>>>>
>>>>> So fence/not fence is normal and to be expected.
>>>>>> and the MS:pgsql resource restarts with a large delay.
>>>>>> One time, after the crmd restarted, pgsql did not restart.
>>>>> I would not expect pgsql to ever restart - if the RA does its job
>>>>> properly anyway.
>>>>> In the case the node is not fenced, the crmd will respawn and the
>>>>> PE will request that it re-detect the state of all resources.
>>>>>
>>>>> If the agent reports "all good", then there is nothing more to do.
>>>>> If the agent is not reporting "all good", you should really be asking why.
>>>>>> 4. pacemakerd - nothing happens.
>>>>> On non-systemd based machines, correct.
>>>>>
>>>>> On a systemd based machine pacemakerd is respawned and reattaches to
>>>>> the existing daemons.
>>>>> Any subsequent daemon failure will be detected and the daemon respawned.
>>>> And! I almost forgot about IT!
>>>> Are there other (NORMAL) variants, methods, ideas?
>>>> Without this ... @$%#$%&$%^&$%^&##@#$$^$%& !!!!!
>>>> Otherwise - it's a full epic fail ;)
>>> -ENOPARSE
>> OK, I put aside my personal attitude to "systemd".
>> Let me explain.
>>
>> Somewhere in the beginning of this topic, I wrote:
>> A.G.: Who knows who runs lrmd?
>> A.B.: Pacemakerd.
>> That's one!
>>
>> Let's see the list of processes:
>> # ps -axf
>> .....
>>  6067 ?  Ssl    7:24 corosync
>>  6092 ?  S      0:25 pacemakerd
>>  6094 ?  Ss   116:13  \_ /usr/libexec/pacemaker/cib
>>  6095 ?  Ss     0:25  \_ /usr/libexec/pacemaker/stonithd
>>  6096 ?  Ss     1:27  \_ /usr/libexec/pacemaker/lrmd
>>  6097 ?  Ss     0:49  \_ /usr/libexec/pacemaker/attrd
>>  6098 ?  Ss     0:25  \_ /usr/libexec/pacemaker/pengine
>>  6099 ?  Ss     0:29  \_ /usr/libexec/pacemaker/crmd
>> .....
>> That's two!
>
> Whats two? I don't follow.

In the sense that pacemakerd creates the other processes. But it does not matter.
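As context for the four behavior groups listed earlier, here is a minimal sketch of the kill test being described. It assumes it is run as root on one cluster node and that the daemon name matches one of the /usr/libexec/pacemaker/* children in the ps output above; the script is only an illustration, not part of Pacemaker:

#!/bin/sh
# Kill one pacemaker daemon and watch whether pacemakerd respawns it locally,
# or whether nothing comes back (in which case the peers should fence this node).
DAEMON=${1:-crmd}        # e.g. cib, stonithd, lrmd, attrd, pengine, crmd
SIGNAL=${2:-11}          # 4, 6, 11 or 9, as in the tests above

OLD_PID=$(pgrep -x "$DAEMON") || { echo "$DAEMON is not running"; exit 1; }
echo "Sending signal $SIGNAL to $DAEMON (pid $OLD_PID)"
kill -"$SIGNAL" "$OLD_PID"

i=0
while [ $i -lt 30 ]; do
    sleep 2
    NEW_PID=$(pgrep -x "$DAEMON")
    if [ -n "$NEW_PID" ] && [ "$NEW_PID" != "$OLD_PID" ]; then
        echo "$DAEMON respawned as pid $NEW_PID"
        exit 0
    fi
    i=$((i + 1))
done
echo "$DAEMON did not come back locally; expect this node to be fenced"

For the corosync and cib cases described above the expected outcome is fencing rather than a local respawn, so the loop would simply time out.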
>> And more, more...
>> Now you must understand why I want this process to always keep working.
>> I think no one here needs that explained!
>>
>> And now you say "pacemakerd works nicely, but only on systemd distros"!!!
>
> No, I'm saying it works _better_ on systemd distros.
> On non-systemd distros you still need quite a few unlikely-to-happen failures
> to trigger a situation in which the node still gets fenced and recovered
> (assuming no-one saw any of the error messages and didn't run "service
> pacemaker restart" prior to the additional failures).

Can you show me where this happens: "On a systemd based machine pacemakerd is
respawned and reattaches to the existing daemons"?
If I respawn the pacemakerd process via upstart, does it "reattach to the
existing daemons"?

>> What should I do now?
>> * Integrate systemd into CentOS?
>> * Migrate to Fedora?
>> * Buy RHEL7!?
>
> Option 3 is particularly good :)

It's too easy. Normal heroes always take the long way round :)

>> Each of these variants is great, but none of them fits me.
>>
>> P.S. And I'm not talking about the distros which have not migrated to
>> systemd (and will not).
>
> Are there any? Even debian and ubuntu have raised the white flag.

Of course that is a lyrical digression, but potentially it can be any Unix-like system.

>> Do not be offended! We do the same.
>> We build a secret military factory,
>> a large concrete fence around it,
>> walls of barbed wire - but forget to install the gates. :)
>>>>>> And then I can kill any process of the third group. They do not
>>>>>> restart.
>>>>> Until they become needed.
>>>>> Eg. if the DC goes to invoke the policy engine, that will fail causing
>>>>> the crmd to fail and the node to be fenced.
>>>>>> Generally, don't touch corosync, cib and maybe lrmd, crmd.
>>>>>>
>>>>>> What do you think about this?
>>>>>> The main question of this topic we have solved.
>>>>>> But this varied behavior is another big problem.
>>>>>>
>>>>>> 17.02.2014, 08:52, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>> 17.02.2014, 02:27, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>> With no quick follow-up, dare one hope that means the patch
>>>>>>>> worked? :-)
>>>>>>> Hi,
>>>>>>> No, unfortunately the chief changed my plans on Friday and all day I
>>>>>>> was engaged in a parallel project.
>>>>>>> I hope that today I have time to carry out the necessary tests.
>>>>>>>> On 14 Feb 2014, at 3:37 pm, Andrey Groshev <gre...@yandex.ru>
>>>>>>>> wrote:
>>>>>>>>> Yes, of course. Now starting to build the world and test )
>>>>>>>>>
>>>>>>>>> 14.02.2014, 04:41, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>> The previous patch wasn't quite right.
>>>>>>>>>> Could you try this new one?
>>>>>>>>>>
>>>>>>>>>> http://paste.fedoraproject.org/77123/13923376/
>>>>>>>>>>
>>>>>>>>>> [11:23 AM] beekhof@f19 ~/Development/sources/pacemaker/devel ☺
>>>>>>>>>> # git diff
>>>>>>>>>> diff --git a/crmd/callbacks.c b/crmd/callbacks.c
>>>>>>>>>> index ac4b905..d49525b 100644
>>>>>>>>>> --- a/crmd/callbacks.c
>>>>>>>>>> +++ b/crmd/callbacks.c
>>>>>>>>>> @@ -199,8 +199,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
>>>>>>>>>>              stop_te_timer(down->timer);
>>>>>>>>>>
>>>>>>>>>>              flags |= node_update_join | node_update_expected;
>>>>>>>>>> -            crm_update_peer_join(__FUNCTION__, node, crm_join_none);
>>>>>>>>>> -            crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
>>>>>>>>>> +            crmd_peer_down(node, FALSE);
>>>>>>>>>>              check_join_state(fsa_state, __FUNCTION__);
>>>>>>>>>>
>>>>>>>>>>              update_graph(transition_graph, down);
>>>>>>>>>> diff --git a/crmd/crmd_utils.h b/crmd/crmd_utils.h
>>>>>>>>>> index bc472c2..1a2577a 100644
>>>>>>>>>> --- a/crmd/crmd_utils.h
>>>>>>>>>> +++ b/crmd/crmd_utils.h
>>>>>>>>>> @@ -100,6 +100,7 @@ void crmd_join_phase_log(int level);
>>>>>>>>>>  const char *get_timer_desc(fsa_timer_t * timer);
>>>>>>>>>>  gboolean too_many_st_failures(void);
>>>>>>>>>>  void st_fail_count_reset(const char * target);
>>>>>>>>>> +void crmd_peer_down(crm_node_t *peer, bool full);
>>>>>>>>>>
>>>>>>>>>>  #  define fsa_register_cib_callback(id, flag, data, fn) do {      \
>>>>>>>>>>          fsa_cib_conn->cmds->register_callback(                    \
>>>>>>>>>> diff --git a/crmd/te_actions.c b/crmd/te_actions.c
>>>>>>>>>> index f31d4ec..3bfce59 100644
>>>>>>>>>> --- a/crmd/te_actions.c
>>>>>>>>>> +++ b/crmd/te_actions.c
>>>>>>>>>> @@ -80,11 +80,8 @@ send_stonith_update(crm_action_t * action, const char *target, const char *uuid)
>>>>>>>>>>          crm_info("Recording uuid '%s' for node '%s'", uuid, target);
>>>>>>>>>>          peer->uuid = strdup(uuid);
>>>>>>>>>>      }
>>>>>>>>>> -    crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>> -    crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>> -    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>> -    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>>
>>>>>>>>>> +    crmd_peer_down(peer, TRUE);
>>>>>>>>>>      node_state =
>>>>>>>>>>          do_update_node_cib(peer,
>>>>>>>>>>                             node_update_cluster | node_update_peer | node_update_join |
>>>>>>>>>> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
>>>>>>>>>> index ad7e573..0c92e95 100644
>>>>>>>>>> --- a/crmd/te_utils.c
>>>>>>>>>> +++ b/crmd/te_utils.c
>>>>>>>>>> @@ -247,10 +247,7 @@ tengine_stonith_notify(stonith_t * st, stonith_event_t * st_event)
>>>>>>>>>>
>>>>>>>>>>          }
>>>>>>>>>>
>>>>>>>>>> -        crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>> -        crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>> -        crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>> -        crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>> +        crmd_peer_down(peer, TRUE);
>>>>>>>>>>      }
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> diff --git a/crmd/utils.c b/crmd/utils.c
>>>>>>>>>> index 3988cfe..2df53ab 100644
>>>>>>>>>> --- a/crmd/utils.c
>>>>>>>>>> +++ b/crmd/utils.c
>>>>>>>>>> @@ -1077,3 +1077,13 @@ update_attrd_remote_node_removed(const char *host, const char *user_name)
>>>>>>>>>>      crm_trace("telling attrd to clear attributes for remote host %s", host);
>>>>>>>>>>      update_attrd_helper(host, NULL, NULL, user_name, TRUE, 'C');
>>>>>>>>>>  }
>>>>>>>>>> +
>>>>>>>>>> +void crmd_peer_down(crm_node_t *peer, bool full)
>>>>>>>>>> +{
>>>>>>>>>> +    if(full && peer->state == NULL) {
>>>>>>>>>> +        crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>>> +        crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>>> +    }
>>>>>>>>>> +    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>> +    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> On 16 Jan 2014, at 7:24 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>> 16.01.2014, 01:30, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>> On 16 Jan 2014, at 12:41 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>> 15.01.2014, 02:53, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>> On 15 Jan 2014, at 12:15 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>> 14.01.2014, 10:00, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>>>>>>>>>>> 14.01.2014, 07:47, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>>> Ok, here's what happens:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. node2 is lost
>>>>>>>>>>>>>>>>> 2. fencing of node2 starts
>>>>>>>>>>>>>>>>> 3. node2 reboots (and cluster starts)
>>>>>>>>>>>>>>>>> 4. node2 returns to the membership
>>>>>>>>>>>>>>>>> 5. node2 is marked as a cluster member
>>>>>>>>>>>>>>>>> 6. DC tries to bring it into the cluster, but needs to cancel the active transition first.
>>>>>>>>>>>>>>>>>    Which is a problem since the node2 fencing operation is part of that
>>>>>>>>>>>>>>>>> 7. node2 is in a transition (pending) state until fencing passes or fails
>>>>>>>>>>>>>>>>> 8a. fencing fails: transition completes and the node joins the cluster
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     Thats in theory, except we automatically try again. Which isn't appropriate.
>>>>>>>>>>>>>>>>>     This should be relatively easy to fix.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     This I have no idea how to fix yet.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On another note, it doesn't look like this agent works at all.
>>>>>>>>>>>>>>>>> The node has been back online for a long time and the agent is still timing out after 10 minutes.
>>>>>>>>>>>>>>>>> So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.
>>>>>>>>>>>>>>>> Damn. Looks like you're right. At some time I broke my agent and had not noticed it. Who will understand.
>>>>>>>>>>>>>>> I repaired my agent - after send reboot he is wait STDIN.
>>>>>>>>>>>>>>> Returned "normally" a behavior - hangs "pending", until manually send reboot. :)
>>>>>>>>>>>>>> Right. Now you're in case 8b.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you try this patch:
>>>>>>>>>>>>>> http://paste.fedoraproject.org/68450/38973966
>>>>>>>>>>>>> Killed all day experiences.
>>>>>>>>>>>>> It turns out here that:
>>>>>>>>>>>>> 1. Did cluster.
>>>>>>>>>>>>> 2. On the node-2 send signal (-4) - killed corosink
>>>>>>>>>>>>> 3.
From node-1 (there DC) - stonith sent reboot >>>>>>>>>>>>> 4. Noda rebooted and resources start. >>>>>>>>>>>>> 5. Again. On the node-2 send signal (-4) - killed corosink >>>>>>>>>>>>> 6. Again. From node-1 (there DC) - stonith sent reboot >>>>>>>>>>>>> 7. Noda-2 rebooted and hangs in "pending" >>>>>>>>>>>>> 8. Waiting, waiting..... manually reboot. >>>>>>>>>>>>> 9. Noda-2 reboot and raised resources start. >>>>>>>>>>>>> 10. GOTO p.2 >>>>>>>>>>>> Logs? >>>>>>>>>>> Yesterday I wrote an additional letter why not put the logs. >>>>>>>>>>> Read it please, it contains a few more questions. >>>>>>>>>>> Today again began to hang and continue along the same cycle. >>>>>>>>>>> Logs here http://send2me.ru/crmrep2.tar.bz2 >>>>>>>>>>>>>>> New logs: http://send2me.ru/crmrep1.tar.bz2 >>>>>>>>>>>>>>>>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof >>>>>>>>>>>>>>>>> <and...@beekhof.net> wrote: >>>>>>>>>>>>>>>>>> Apart from anything else, your timeout needs to be >>>>>>>>>>>>>>>>>> bigger: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Jan 13 12:21:36 [17223] >>>>>>>>>>>>>>>>>> dev-cluster2-node1.unix.tensor.ru stonith-ng: ( >>>>>>>>>>>>>>>>>> commands.c:1321 ) error: log_operation: Operation >>>>>>>>>>>>>>>>>> 'reboot' [11331] (call 2 from crmd.17227) for host >>>>>>>>>>>>>>>>>> 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' >>>>>>>>>>>>>>>>>> returned: -62 (Timer expired) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof >>>>>>>>>>>>>>>>>> <and...@beekhof.net> wrote: >>>>>>>>>>>>>>>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev >>>>>>>>>>>>>>>>>>> <gre...@yandex.ru> wrote: >>>>>>>>>>>>>>>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" >>>>>>>>>>>>>>>>>>>> <and...@beekhof.net>: >>>>>>>>>>>>>>>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev >>>>>>>>>>>>>>>>>>>>> <gre...@yandex.ru> wrote: >>>>>>>>>>>>>>>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" >>>>>>>>>>>>>>>>>>>>>> <gre...@yandex.ru>: >>>>>>>>>>>>>>>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" >>>>>>>>>>>>>>>>>>>>>>> <and...@beekhof.net>: >>>>>>>>>>>>>>>>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev >>>>>>>>>>>>>>>>>>>>>>>> <gre...@yandex.ru> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" >>>>>>>>>>>>>>>>>>>>>>>>> <and...@beekhof.net>: >>>>>>>>>>>>>>>>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey >>>>>>>>>>>>>>>>>>>>>>>>>> Groshev <gre...@yandex.ru> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" >>>>>>>>>>>>>>>>>>>>>>>>>>> <and...@beekhof.net>: >>>>>>>>>>>>>>>>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey >>>>>>>>>>>>>>>>>>>>>>>>>>>> Groshev <gre...@yandex.ru> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi, ALL. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm still trying to cope with the >>>>>>>>>>>>>>>>>>>>>>>>>>>>> fact that after the fence - node hangs in >>>>>>>>>>>>>>>>>>>>>>>>>>>>> "pending". >>>>>>>>>>>>>>>>>>>>>>>>>>>> Please define "pending". Where did >>>>>>>>>>>>>>>>>>>>>>>>>>>> you see this? >>>>>>>>>>>>>>>>>>>>>>>>>>> In crm_mon: >>>>>>>>>>>>>>>>>>>>>>>>>>> ...... >>>>>>>>>>>>>>>>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): >>>>>>>>>>>>>>>>>>>>>>>>>>> pending >>>>>>>>>>>>>>>>>>>>>>>>>>> ...... >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> The experiment was like this: >>>>>>>>>>>>>>>>>>>>>>>>>>> Four nodes in cluster. >>>>>>>>>>>>>>>>>>>>>>>>>>> On one of them kill corosync or >>>>>>>>>>>>>>>>>>>>>>>>>>> pacemakerd (signal 4 or 6 oк 11). 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thereafter, the remaining start it >>>>>>>>>>>>>>>>>>>>>>>>>>> constantly reboot, under various pretexts, "softly >>>>>>>>>>>>>>>>>>>>>>>>>>> whistling", "fly low", "not a cluster member!" ... >>>>>>>>>>>>>>>>>>>>>>>>>>> Then in the log fell out "Too many >>>>>>>>>>>>>>>>>>>>>>>>>>> failures ...." >>>>>>>>>>>>>>>>>>>>>>>>>>> All this time in the status in crm_mon >>>>>>>>>>>>>>>>>>>>>>>>>>> is "pending". >>>>>>>>>>>>>>>>>>>>>>>>>>> Depending on the wind direction >>>>>>>>>>>>>>>>>>>>>>>>>>> changed to "UNCLEAN" >>>>>>>>>>>>>>>>>>>>>>>>>>> Much time has passed and I can not >>>>>>>>>>>>>>>>>>>>>>>>>>> accurately describe the behavior... >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Now I am in the following state: >>>>>>>>>>>>>>>>>>>>>>>>>>> I tried locate the problem. Came here >>>>>>>>>>>>>>>>>>>>>>>>>>> with this. >>>>>>>>>>>>>>>>>>>>>>>>>>> I set big value in property >>>>>>>>>>>>>>>>>>>>>>>>>>> stonith-timeout="600s". >>>>>>>>>>>>>>>>>>>>>>>>>>> And got the following behavior: >>>>>>>>>>>>>>>>>>>>>>>>>>> 1. pkill -4 corosync >>>>>>>>>>>>>>>>>>>>>>>>>>> 2. from node with DC call my fence >>>>>>>>>>>>>>>>>>>>>>>>>>> agent "sshbykey" >>>>>>>>>>>>>>>>>>>>>>>>>>> 3. It sends reboot victim and waits >>>>>>>>>>>>>>>>>>>>>>>>>>> until she comes to life again. >>>>>>>>>>>>>>>>>>>>>>>>>> Hmmm.... what version of pacemaker? >>>>>>>>>>>>>>>>>>>>>>>>>> This sounds like a timing issue that we >>>>>>>>>>>>>>>>>>>>>>>>>> fixed a while back >>>>>>>>>>>>>>>>>>>>>>>>> Was a version 1.1.11 from December 3. >>>>>>>>>>>>>>>>>>>>>>>>> Now try full update and retest. >>>>>>>>>>>>>>>>>>>>>>>> That should be recent enough. Can you >>>>>>>>>>>>>>>>>>>>>>>> create a crm_report the next time you reproduce? >>>>>>>>>>>>>>>>>>>>>>> Of course yes. Little delay.... :) >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> ...... >>>>>>>>>>>>>>>>>>>>>>> cc1: warnings being treated as errors >>>>>>>>>>>>>>>>>>>>>>> upstart.c: In function ‘upstart_job_property’: >>>>>>>>>>>>>>>>>>>>>>> upstart.c:264: error: implicit declaration of >>>>>>>>>>>>>>>>>>>>>>> function ‘g_variant_lookup_value’ >>>>>>>>>>>>>>>>>>>>>>> upstart.c:264: error: nested extern >>>>>>>>>>>>>>>>>>>>>>> declaration of ‘g_variant_lookup_value’ >>>>>>>>>>>>>>>>>>>>>>> upstart.c:264: error: assignment makes >>>>>>>>>>>>>>>>>>>>>>> pointer from integer without a cast >>>>>>>>>>>>>>>>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] >>>>>>>>>>>>>>>>>>>>>>> Error 1 >>>>>>>>>>>>>>>>>>>>>>> gmake[2]: Leaving directory >>>>>>>>>>>>>>>>>>>>>>> `/root/ha/pacemaker/lib/services' >>>>>>>>>>>>>>>>>>>>>>> make[1]: *** [all-recursive] Error 1 >>>>>>>>>>>>>>>>>>>>>>> make[1]: Leaving directory >>>>>>>>>>>>>>>>>>>>>>> `/root/ha/pacemaker/lib' >>>>>>>>>>>>>>>>>>>>>>> make: *** [core] Error 1 >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> I'm trying to solve this a problem. >>>>>>>>>>>>>>>>>>>>>> Do not get solved quickly... >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value >>>>>>>>>>>>>>>>>>>>>> g_variant_lookup_value () Since 2.28 >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> # yum list installed glib2 >>>>>>>>>>>>>>>>>>>>>> Loaded plugins: fastestmirror, rhnplugin, >>>>>>>>>>>>>>>>>>>>>> security >>>>>>>>>>>>>>>>>>>>>> This system is receiving updates from RHN >>>>>>>>>>>>>>>>>>>>>> Classic or Red Hat Satellite. 
>>>>>>>>>>>>>>>>>>>>>> Loading mirror speeds from cached hostfile >>>>>>>>>>>>>>>>>>>>>> Installed Packages >>>>>>>>>>>>>>>>>>>>>> glib2.x86_64 >>>>>>>>>>>>>>>>>>>>>> 2.26.1-3.el6 >>>>>>>>>>>>>>>>>>>>>> installed >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> # cat /etc/issue >>>>>>>>>>>>>>>>>>>>>> CentOS release 6.5 (Final) >>>>>>>>>>>>>>>>>>>>>> Kernel \r on an \m >>>>>>>>>>>>>>>>>>>>> Can you try this patch? >>>>>>>>>>>>>>>>>>>>> Upstart jobs wont work, but the code will >>>>>>>>>>>>>>>>>>>>> compile >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> diff --git a/lib/services/upstart.c >>>>>>>>>>>>>>>>>>>>> b/lib/services/upstart.c >>>>>>>>>>>>>>>>>>>>> index 831e7cf..195c3a4 100644 >>>>>>>>>>>>>>>>>>>>> --- a/lib/services/upstart.c >>>>>>>>>>>>>>>>>>>>> +++ b/lib/services/upstart.c >>>>>>>>>>>>>>>>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const >>>>>>>>>>>>>>>>>>>>> char *name) >>>>>>>>>>>>>>>>>>>>> static char * >>>>>>>>>>>>>>>>>>>>> upstart_job_property(const char *obj, const >>>>>>>>>>>>>>>>>>>>> gchar * iface, const char *name) >>>>>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>>>>> + char *output = NULL; >>>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0) >>>>>>>>>>>>>>>>>>>>> + static bool err = TRUE; >>>>>>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>>>>>> + if(err) { >>>>>>>>>>>>>>>>>>>>> + crm_err("This version of glib is too >>>>>>>>>>>>>>>>>>>>> old to support upstart jobs"); >>>>>>>>>>>>>>>>>>>>> + err = FALSE; >>>>>>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>>>>>> +#else >>>>>>>>>>>>>>>>>>>>> GError *error = NULL; >>>>>>>>>>>>>>>>>>>>> GDBusProxy *proxy; >>>>>>>>>>>>>>>>>>>>> GVariant *asv = NULL; >>>>>>>>>>>>>>>>>>>>> GVariant *value = NULL; >>>>>>>>>>>>>>>>>>>>> GVariant *_ret = NULL; >>>>>>>>>>>>>>>>>>>>> - char *output = NULL; >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> crm_info("Calling GetAll on %s", obj); >>>>>>>>>>>>>>>>>>>>> proxy = get_proxy(obj, BUS_PROPERTY_IFACE); >>>>>>>>>>>>>>>>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const >>>>>>>>>>>>>>>>>>>>> char *obj, const gchar * iface, const char *name) >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> g_object_unref(proxy); >>>>>>>>>>>>>>>>>>>>> g_variant_unref(_ret); >>>>>>>>>>>>>>>>>>>>> +#endif >>>>>>>>>>>>>>>>>>>>> return output; >>>>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>>> Ok :) I patch source. >>>>>>>>>>>>>>>>>>>> Type "make rc" - the same error. >>>>>>>>>>>>>>>>>>> Because its not building your local changes >>>>>>>>>>>>>>>>>>>> Make new copy via "fetch" - the same error. >>>>>>>>>>>>>>>>>>>> It seems that if not exist >>>>>>>>>>>>>>>>>>>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz, then >>>>>>>>>>>>>>>>>>>> download it. >>>>>>>>>>>>>>>>>>>> Otherwise use exist archive. >>>>>>>>>>>>>>>>>>>> Cutted log ....... >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> # make rc >>>>>>>>>>>>>>>>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm >>>>>>>>>>>>>>>>>>>> make[1]: Entering directory `/root/ha/pacemaker' >>>>>>>>>>>>>>>>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* >>>>>>>>>>>>>>>>>>>> pacemaker-HEAD.tar.* >>>>>>>>>>>>>>>>>>>> if [ ! 
-f >>>>>>>>>>>>>>>>>>>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> rm -f pacemaker.tar.*; >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> if [ Pacemaker-1.1.11-rc3 = dirty ]; >>>>>>>>>>>>>>>>>>>> then \ >>>>>>>>>>>>>>>>>>>> git commit -m "DO-NOT-PUSH" -a; >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> git archive >>>>>>>>>>>>>>>>>>>> --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD >>>>>>>>>>>>>>>>>>>> | gzip > >>>>>>>>>>>>>>>>>>>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \ >>>>>>>>>>>>>>>>>>>> git reset --mixed HEAD^; >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> else >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> git archive >>>>>>>>>>>>>>>>>>>> --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ >>>>>>>>>>>>>>>>>>>> Pacemaker-1.1.11-rc3 | gzip > >>>>>>>>>>>>>>>>>>>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \ >>>>>>>>>>>>>>>>>>>> fi; >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> echo `date`: Rebuilt >>>>>>>>>>>>>>>>>>>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> else >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> echo `date`: Using existing tarball: >>>>>>>>>>>>>>>>>>>> ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; >>>>>>>>>>>>>>>>>>>> \ >>>>>>>>>>>>>>>>>>>> fi >>>>>>>>>>>>>>>>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing >>>>>>>>>>>>>>>>>>>> tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz >>>>>>>>>>>>>>>>>>>> ....... >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Well, "make rpm" - build rpms and I create >>>>>>>>>>>>>>>>>>>> cluster. >>>>>>>>>>>>>>>>>>>> I spent the same tests and confirmed the >>>>>>>>>>>>>>>>>>>> behavior. >>>>>>>>>>>>>>>>>>>> crm_reoprt log here - >>>>>>>>>>>>>>>>>>>> http://send2me.ru/crmrep.tar.bz2 >>>>>>>>>>>>>>>>>>> Thanks! 
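A practical note on the "Using existing tarball" message in the build log above: the quoted Makefile rule only re-runs "git archive" when ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz is absent, and "git archive" exports a commit (the tag, or HEAD via the rule's "dirty" branch), never uncommitted working-tree edits. Both points together explain why a locally applied patch did not end up in the rebuilt RPMs. A minimal sketch of a rebuild that avoids the stale archive, assuming the same checkout and make targets shown in the log (whether TAG=dirty is supported exactly as the quoted rule suggests is an assumption):

# Sketch only, based on the Makefile recipe quoted above.
cd /root/ha/pacemaker
# 1. Drop the cached archive, otherwise the "Using existing tarball" branch
#    is taken and nothing is re-exported.
rm -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
# 2. Confirm the patch is committed; "git archive" never picks up
#    uncommitted working-tree changes.
git status --short
# 3. Rebuild the tarball and the RPMs.
make rc

If the patch must stay uncommitted, the "dirty" branch visible in the quoted rule (temporary commit, archive HEAD, reset) is the path the Makefile itself appears to provide for that case.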
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org